Synergy Project Home Page


POFP Concepts
Coarse Grain Parallelism

The existence of fast processors and relatively slower interconnection networks mandates coarse-grain parallel computing. Building a perfectly balanced parallel architecture for all applications is a dream that will never come true: while a super-fast interconnection design can accommodate many network-demanding tasks, given any parallel architecture one can always find an application that floods its interconnection network. If the processors must run off-the-shelf operating systems, such as Unix or NT, the situation is worsened further by inefficiently coordinated kernel-to-network relaying processes.

This observation implies that only heavy computing tasks with low communication requirements (i.e., large locality) are worth distributing across parallel processors. The choice depends on the speed of the interconnection network(s), the speed of the processors, and the application's requirements. Timing models are essential tools for making this choice.
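A minimal timing model can make this trade-off concrete. The sketch below is illustrative only (it is not Synergy's actual model): it assumes compute time divides evenly across processors and charges a fixed communication cost per processor.

```python
# A toy timing model (illustrative; not Synergy's actual cost model).
# W = total compute time on one processor, P = processor count,
# c = communication cost contributed per processor.
def parallel_time(W, P, c):
    return W / P + c * P          # compute shrinks, communication grows

def speedup(W, P, c):
    return W / parallel_time(W, P, c)

def worthwhile(W, P, c):
    """Distributed processing pays off only when speedup exceeds 1."""
    return speedup(W, P, c) > 1.0

# A compute-heavy task with low communication parallelizes well:
print(round(speedup(W=100.0, P=10, c=0.1), 2))   # 9.09
# A communication-bound task does not:
print(worthwhile(W=1.0, P=10, c=0.5))            # False
```

The model captures the section's point: the same machine configuration can be a win or a loss depending on the application's compute-to-communication ratio.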

Building timing models requires an understanding of the parallel structures embedded in the application. We need a mechanism to systematically discover application-dependent parallelism. The discovered structures should identify the most efficient way to process the application in parallel.

Let's look at the basics of parallel processing first.

There are three entity-to-entity dependency topologies that facilitate simultaneous processing of entities (or programs): SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction Multiple Data), and pipeline (Ref. Flynn). Figure 1 illustrates their spatial arrangements and respective speedup factors.


Figure 1. Parallel Topologies and Speedups

In the pipeline case, the speedup approaches 3 (the number of pipe stages) when the number of input data items becomes large.
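This asymptotic behavior follows from the standard pipeline model with equal stage times: n items take n*k steps sequentially but only n + k - 1 steps pipelined, so the speedup tends to k as n grows.

```python
def pipeline_speedup(n, k):
    """Speedup of a k-stage pipeline on n data items,
    assuming equal stage times (a standard textbook model)."""
    sequential = n * k          # each item passes through all k stages
    pipelined = n + k - 1       # first item fills the pipe, rest stream out
    return sequential / pipelined

for n in (3, 30, 3000):
    print(n, round(pipeline_speedup(n, k=3), 2))
# Speedup approaches 3 (the stage count) as n grows:
# 3 1.8
# 30 2.81
# 3000 3.0
```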

An application can be processed effectively in parallel only if its embedded parallel structures are explicitly uncovered. This embedded structure represents a natural unfolding of the program's data dependencies. Thus the most efficient processing must not violate this structure, or it will suffer a performance setback.

An incremental coarse-to-fine grain decomposition method (CTF) is devised to systematically uncover coarse grain SIMD, MIMD and pipeline potentials within an application. The method is adaptable to parallel processors employing interconnection networks of varying speeds.

Coarse-to-fine (CTF) Parallel Application Design

The basic assumption is that the user has marked the (potentially) compute-intensive components in the input, which is either a sequential program or an algorithm in pseudocode. The output is a set of interconnected POFP programs (or algorithm steps) exhibiting SIMD, MIMD, and pipeline potentials.

CTF has the following steps:

a. Linear cut. This step cuts a program into segments according to user-marked loops and recursive function calls. It produces segments of three types: Sequential (S), Loop (L, iterative loops), and Recursive (R).

b. Merge. A pipeline with a short data supply is not an effective parallel processing device, so it should be eliminated to save communication overhead. This step merges linearly dependent segments into a single segment if one does not expect long input data streams.

c. Inter-segment data dependency analysis. This pass makes explicit the implicit data dependencies between the decomposed (S, L, and R) segments.

d. L-segment analysis. The objective of this analysis is to identify one independent, computation-intensive loop among all nested loops, thereby maximizing computation locality.

In this step, one may have to perform program transformations to remove "resolvable dependencies". When there is more than one independent nesting, the nesting depth of the selected loop depends on the number of parallel processors and the interconnection network speed. A deeper independent loop should be selected when a large number of processors are connected via higher-speed networks.

The result of the analysis is a three-piece partition of each L-segment: a loop header (containing pre-processing statements and the parallel loop header), a loop body (containing the compute-intensive statements), and a loop tail (containing the post-processing statements). Typically we compose a loop master from the loop header and the loop tail; the loop body is all that is needed for a loop worker.
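The master/worker split described above can be sketched with threads and queues standing in for Synergy's tuple space (the names `work`, `results`, and the squaring loop body are illustrative, not part of POFP):

```python
# Sketch of the L-segment master/worker partition. Queues emulate the
# tuple space; the loop body here (squaring) is a placeholder computation.
from queue import Queue
from threading import Thread

def loop_master(data, n_workers, work, results):
    # Loop header: pre-processing + parallel loop header.
    for i, x in enumerate(data):          # scatter one tuple per iteration
        work.put((i, x))
    for _ in range(n_workers):
        work.put(None)                    # sentinel: no more work
    # Loop tail: post-processing -- gather and reassemble results in order.
    out = [None] * len(data)
    for _ in range(len(data)):
        i, y = results.get()
        out[i] = y
    return out

def loop_worker(work, results):
    # Loop body: the compute-intensive statement(s).
    while (t := work.get()) is not None:
        i, x = t
        results.put((i, x * x))

work, results = Queue(), Queue()
workers = [Thread(target=loop_worker, args=(work, results)) for _ in range(3)]
for w in workers:
    w.start()
print(loop_master([1, 2, 3, 4], 3, work, results))   # [1, 4, 9, 16]
for w in workers:
    w.join()
```

Because results carry their iteration index, the master reassembles them correctly regardless of which worker finishes first.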

e. R-segment analysis. The objective is to determine whether the recursive function is data dependent on the previous activation of the same function. If not, we partition the segment into two pieces: a master that recursively generates K*P independent partial states, where K is an adjustable constant and P is the number of parallel processors, and then waits for results; and a worker that fetches any work batch, unwraps it, recursively investigates all potential paths, and returns the results.
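A toy version of this split, assuming a binary recursion tree (the prefix-expansion scheme and path counting are illustrative, not Synergy's API): the master expands the recursion only to a shallow depth, producing roughly K*P independent partial states, and each worker finishes one state recursively.

```python
# Sketch of the R-segment master/worker split on a binary recursion tree.
def expand(state, depth):
    """Master side: recursively generate independent partial states
    (here: binary string prefixes of length `depth`)."""
    if depth == 0:
        return [state]
    return expand(state + "0", depth - 1) + expand(state + "1", depth - 1)

def worker(state, total_depth):
    """Worker side: recursively investigate all paths below one state
    (here: count complete paths, a placeholder computation)."""
    if len(state) == total_depth:
        return 1
    return worker(state + "0", total_depth) + worker(state + "1", total_depth)

P, K = 4, 2                          # P processors, K batches per processor
split_depth = 3                      # chosen so 2**split_depth == K*P
states = expand("", split_depth)     # 8 independent partial states
print(len(states))                               # 8
print(sum(worker(s, 10) for s in states))        # 1024, i.e. all 2**10 paths
```

With K > 1 batches per processor, fast workers can fetch extra batches, which gives the load balancing the master/worker scheme is known for.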

For optimization problems using branch-and-bound methods, there should be global index tuples recording the partial optimal results. These index tuples should be updated at most as frequently as the initial number of independent partial states. This treatment prevents data bleeding that would prematurely saturate the interconnection network.

f. Coding. This step uses the results of the previous analyses to generate multiple independent program modules communicating through passive data objects.

In selecting IPC objects, one can use tuple space objects to implement parallelizable L and R segments (coarse-grain SIMDs) and pipes to interconnect all other segments. File objects are used only when multiple files must be accessed simultaneously.
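To make the tuple-space idea concrete, here is a minimal in-process sketch (Synergy's real tuple-space objects are distributed and have a different API): `put` deposits a named tuple, and `get` blocks until a tuple with a matching name appears, then removes and returns it.

```python
# A minimal tuple-space object (illustrative; not Synergy's actual API).
from threading import Condition

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = Condition()

    def put(self, name, value):
        """Deposit a named tuple and wake any blocked readers."""
        with self._cond:
            self._tuples.append((name, value))
            self._cond.notify_all()

    def get(self, name):
        """Block until a tuple named `name` exists, then remove and return it."""
        with self._cond:
            while True:
                for t in self._tuples:
                    if t[0] == name:
                        self._tuples.remove(t)
                        return t[1]
                self._cond.wait()

ts = TupleSpace()
ts.put("work", 42)
print(ts.get("work"))    # 42
```

The blocking `get` is what gives tuple spaces their dataflow character: a worker simply waits until its input data exists, with no knowledge of which process produced it.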

Adhering to the CTF method produces a network of modular POFP programs exhibiting coarse-grain SIMD, MIMD, and pipeline potentials.

Since POFP programs use only exchanged data tuples for inter-process communication and synchronization, they are totally location independent. A dataflow runtime environment can process these programs effectively at the highest possible speed. Performance tuning is accomplished by adjusting the size(s) of the exchanged data elements (tuples).


Copyright © 1995-2005 Temple University.
For problems or questions regarding this web site, contact shi@temple.edu.
Last updated: October 01, 2001.