Timing Models -- A Parallel Program Performance Analysis and Experimentation Method

Yuan Shi

Shi@cis.temple.edu

Wes Powers

Wes@thira.cis.temple.edu

Room 1036, CIS Department

Temple University

Philadelphia, PA 19122
(215)204-6437 (Voice) (215)204-5082 (Fax)

May 13, 1998

ABSTRACT

Program complexity analysis has traditionally helped us identify the intrinsic structures of the originating problems and their solution algorithms. This paper reports a methodology, which we call Timing Models, that extends complexity analysis to predict an algorithm's running time on a single processor or on multiple processors.

A Timing Model is an algebraic equation that estimates the total elapsed time of a running program based on the program's complexity model. It captures the inherent dependencies among the performance-critical factors of a running program and permits multi-dimensional scalability analysis, revealing a program's inherent performance characteristics in any processing environment. Our studies show that it is possible to build reasonably accurate models for both deterministic and non-deterministic algorithms in single- and multiple-processor environments.

This paper reports our recent study on parallel processing of NP-complete algorithms. Parallel performance of these algorithms is typically input dependent: for some inputs it is possible to achieve superlinear speedup, while for others only sublinear or negative speedup is attainable. This paper presents a programming method, a formulation, and an analysis showing that an average superlinear speedup for an NP-complete algorithm is possible. Our computational results on a cluster of workstations closely confirm the theoretical analysis.

EXTENDED ABSTRACT

Since the inception of electronic computing machines, systematic program analysis has provided much insight into the fundamental structures of originating problems and their solution algorithms. This has not only helped us invent more efficient algorithms but also motivated the design of more efficient processing hardware. With the advent of fast electronic processing and communication hardware, the need for systematic parallel program analysis has become apparent. Unfortunately, complexity models cannot be used directly to predict the processing times of a program. Studies in this area have been overshadowed by the complexities of proprietary software/hardware systems; it seemed, at least for a time, that only special-purpose systems could deliver the promised parallel processing power.

This problem is further compounded by a branch of complexity analysis methods that emphasizes the transient behavior of an algorithm. These methods require complex mathematical tools, yet in modern electronic processing environments they often deliver little value to practical applications.

Fast advances in processing and communication hardware have made it possible to achieve high levels of parallel performance without specialized systems. Many concerns in understanding transient algorithm behavior have become irrelevant in the context of fast hardware. On the other hand, understanding the steady-state behavior of a given algorithm across processing environments has value in many critical decisions, ranging from the purchase of parallel machines to the deployment of a parallel programming paradigm.

The proposed paper reports a program analysis method called Timing Models for understanding the steady-state behavior of a computer algorithm. It uses simple algebraic equations to capture the intrinsic dependencies among the performance-critical factors of a modern computing system, and it allows multi-dimensional scalability analysis of an algorithm in serial and parallel processing environments.

The software and hardware complexities in a modern system are tamed using a technique called "complexity-hiding" that bundles several elusive factors into a single practically measurable term. An example is the algorithm-specific processing speed W. This term is measured in the number of algorithmic steps per second executed by a processor, not in MFLOPS (Million FLoating-point Operations Per Second) or MIPS (Million Instructions Per Second).

For example, a straightforward matrix multiplication algorithm takes O(N^3) steps. The estimated processing time for a matrix multiplication of a given dimension N is then:

    Tseq = N^3 / W,    or equivalently    W = N^3 / Tseq.

We can obtain a W curve by recording a series of elapsed times (Tseq) for different matrix sizes. This curve reveals the true deliverable performance of the tested environment running the coded algorithm. Its significance is that it gives us confidence as to what value of W should be used in future performance calculations, for example under an enhanced CPU speed or an increased memory size. The timing model records the basic dependency between W and N.
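
To make the measurement concrete, the following is a minimal sketch of how such a W curve might be collected; the matrix sizes, timer, and loop structure are our illustrative choices, not those of the reported experiments.

/* Sketch: collecting a W curve for the O(N^3) matrix
   multiplication example.  Sizes and timer are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Multiply two N x N matrices and return the elapsed time Tseq. */
static double multiply(int n)
{
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    clock_t start, stop;
    int i, j, k;

    for (i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    start = clock();
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    stop = clock();

    free(a); free(b); free(c);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    int sizes[] = { 200, 400, 800, 1200 };
    int s;

    for (s = 0; s < 4; s++) {
        int n = sizes[s];
        double tseq = multiply(n);
        if (tseq > 0.0)   /* W = N^3 / Tseq: algorithmic steps/second */
            printf("N=%5d  Tseq=%8.3f s  W=%.3e steps/s\n",
                   n, tseq, (double)n * n * n / tseq);
    }
    return 0;
}

Plotting the printed W values against N yields the W curve discussed above.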

A closer look shows that

    W = W' / c.

That is, W bundles two factors related to the operating system, the compiler, the programming style, and the hardware:

W’: the actual machine speed measured in number of instructions per second; and

c: a constant that records the average number of instructions generated by a compiler and interpreted by an operating system for each high level algorithmic step.

In practice, both W’ and c are hard to capture while W is easily obtainable.
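
For illustration only: on a hypothetical machine with W' = 200 million instructions per second and a compiler that emits on average c = 8 instructions per algorithmic step, W = W'/c = 25 million algorithmic steps per second. Measuring W directly from the timing model sidesteps estimating either W' or c.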

Extensive computational experiments showed that memory-intensive algorithms running on machines with small memories tend to show a slowly degrading W curve (such as matrix multiplication), while for others W is almost a straight line (such as quicksort). For new generations of computing systems, we are confident that W can be treated as a constant for most practical applications. Program scalability analysis can then be conducted based on the steady values of the W curve.

Other processing factors can be treated similarly. This allows us to include disk I/O, memory access, and network transfer speeds in the timing models. The result is a set of algebraic equations defining the estimated times for both sequential and parallel programs, from which estimated speedups are easily calculated. In the multi-dimensional analysis, one interesting study was examining the effect of network speed while varying the problem size and the number of processors. We quickly learned the benefits of program flexibility in terms of computational efficiency.
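
As a simple illustration of such a calculation, the sketch below evaluates a hypothetical parallel timing model, Tpar = N^3/(P*W) + 3N^2/mu, where mu is an assumed network transfer speed; both the model form and the factor values are our assumptions, not the paper's measured results.

/* Sketch: multi-dimensional analysis with a hypothetical timing
   model.  W and mu below are assumed values for illustration. */
#include <stdio.h>

int main(void)
{
    double W  = 2.5e7;  /* algorithm-specific speed, steps/s (assumed) */
    double mu = 1.0e6;  /* network transfer speed, elements/s (assumed) */
    int sizes[] = { 400, 800, 1600 };
    int procs[] = { 1, 2, 4, 8, 16 };
    int i, j;

    for (i = 0; i < 3; i++) {
        double n = sizes[i];
        double tseq = n * n * n / W;   /* sequential timing model */
        for (j = 0; j < 5; j++) {
            double p = procs[j];
            /* computation shrinks with P; communication grows with data moved */
            double tpar = n * n * n / (p * W) + 3.0 * n * n / mu;
            printf("N=%5.0f  P=%3.0f  speedup=%6.2f\n", n, p, tseq / tpar);
        }
    }
    return 0;
}

Scanning the output over N and P exposes, for instance, the problem size below which the network term dominates and speedup collapses.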

The precision of a timing model depends on the skill of the person who devises it. The accuracy of the prediction, however, depends on the values of the performance factors, which can be empirical or obtained via program instrumentation. With some training, a first-year graduate student can devise adequate models for most practical algorithms.

Timing models have been successfully used for deterministic algorithms [1,2,3,4]. This paper shows that they are also applicable to non-deterministic algorithms, such as NP-complete or search algorithms, for identifying average performance behavior.

This paper discusses the basic modeling techniques of the timing model method for both deterministic and non-deterministic algorithms. Comparisons with existing methods, such as Amdahl's and Gustafson's laws, will be conducted. These laws are overly simplified and tend to "misdiagnose" a large portion of practical algorithms, namely those containing non-linear network/memory/disk factors.
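
For reference, Amdahl's law estimates speedup as S(P) = 1 / ((1 - a) + a/P), where a is the parallelizable fraction of the work; the formula carries no terms for network, memory, or disk behavior, which is precisely where timing models differ.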

A significant part of the paper is devoted to the analysis of parallel NP-complete algorithms, illustrated by a sum-of-subset problem. We consider this an important algorithm class since it is widely used in Artificial Intelligence and Optimization Research applications. We shall present our recent results that show an average superlinear speedup for NP-complete algorithms. Computational experiments using a cluster of DEC/Alpha workstations and a parallel processing system named Synergy confirm the theoretical analysis.
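
To sketch the kind of partitioning involved (the data set, the prefix depth K, and the sequential task loop below are illustrative assumptions; the actual serial and parallel programs will be given in the final draft), a sum-of-subset search can be split into independent tasks by fixing the membership of the first K elements, so that each of the 2^K tasks can be handed to a different worker:

/* Illustrative sketch: partitioning a sum-of-subset search into
   2^K independent tasks by fixing the first K elements.  In a
   parallel run each task would go to a separate worker; here the
   tasks are simply executed in a loop.  Data and K are made up. */
#include <stdio.h>

#define N 20
#define K 4                      /* 2^K = 16 independent tasks */

static const int v[N] = { 3, 34, 4, 12, 5, 2, 7, 8, 15, 21,
                          9, 11, 6, 13, 17, 19, 23, 29, 31, 1 };

/* Depth-first search over elements i..N-1 for a subset summing
   to 'remaining'.  Returns 1 if one exists. */
static int search(int i, int remaining)
{
    if (remaining == 0) return 1;
    if (i == N || remaining < 0) return 0;
    return search(i + 1, remaining - v[i])   /* include v[i] */
        || search(i + 1, remaining);         /* exclude v[i] */
}

int main(void)
{
    int target = 97, task, found = 0;

    for (task = 0; task < (1 << K); task++) { /* one task per prefix */
        int i, sum = 0;
        for (i = 0; i < K; i++)
            if (task & (1 << i)) sum += v[i]; /* fixed prefix choices */
        if (search(K, target - sum)) {
            printf("task %2d: found a subset summing to %d\n", task, target);
            found = 1;
        }
    }
    if (!found) printf("no subset sums to %d\n", target);
    return 0;
}

Intuitively, a partition may place a solution near the front of one worker's subtree, letting the parallel search terminate after far less total work than the sequential scan; this input-dependent effect is what the paper's analysis quantifies.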

The computational results come from the teamwork of a group of 15 CIS graduate students at Temple University in the course "CIS669: Distributed and Parallel Systems" in Spring 1998. The second author, a graduate student in that class, constructed the proof of the average superlinear speedup and the performance profile of the sum-of-subset algorithm. The significance of this work lies in its corollary that all NP-complete algorithms have the potential to gain better-than-linear speedup with little extra programming effort. Our computational experiments showed that it is indeed possible to achieve high parallel performance for this algorithm class using a cluster of workstations on a shared-medium Ethernet.

Details of the serial and parallel programs will be given in the final draft. Computational results will be illustrated in light of the timing models, and a comparative study will be conducted against relevant work in the field.

Formal acknowledgements will also be included in the final draft; other students' names are omitted from this proposal for brevity. The primary author would like to thank the Provost's Office of Temple University for the Grant-in-Aid funding that made it easier to complete this project.

References:

  1. Kostas Blathras, Daniel Szyld and Yuan Shi, "Parallel Processing of Linear Systems Using Asynchronous Iterative Algorithms," submitted to Journal of Distributed and Parallel Computing, October 1997. (http://www.cis.temple.edu/~shi)
  2. John Dougherty, "Structured Performability Analysis of Fault Tolerant Parallel and Distributed Applications", Ph.D. Dissertation, CIS Department, Temple University, October 1997.
  3. David Mutschler, "An Examination of Fault Resilient Real-time Computing Using the Stateless Parallel Processing Concept." Ph.D. Dissertation, CIS Department, Temple University, May 1998.
  4. Yuan Shi, "Parallel Program Scalability Analysis," IASTED International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.