Asynchronous iterative algorithms can remove many data dependencies by using older data received earlier in time. This, however, can increase the total number of iterations the algorithm needs to converge. Considering that the total network capacity is typically far less than the combined computing capacity of multiple processors, the motivation of this study is to identify the possible time savings from reducing inter-processor communication at the expense of increased calculation.
In particular, we report a performance study of parallel implementations of a non-uniform magnetic field simulation program. We use a time complexity-based analysis method to understand the inherent interdependencies between computing and communication overheads for the parallel asynchronous algorithm. The results show not only that the computational experiments closely match the analytical results, but also that asynchronous iterative algorithms are beneficial for a vast number of parallel processing environments.
Keywords: Parallel Processing, Asynchronous Algorithms, Parallel Iterative Methods.
1. Introduction
In this paper we present several aspects of the parallel asynchronous solution of linear algebraic systems of the form

Ax = b, (1)

where A is a large N \times N matrix, and x and b are N-vectors. This problem lies at the core of many scientific and engineering applications. The computational test problem we use in this paper is the determination of the magnetic field in a region outside of a permanent magnet [10]; see further Sect. 3. Since full direct methods for the solution of (1) require O(N^3) operations and sparse direct methods usually lack parallelism, we concentrate in this paper on iterative methods, which are used in practice due to their reduced time complexity.
When developing parallel iterative algorithms for the solution of (1), the most difficult problem is the inherent data dependencies. For example, consider the classical (point) Jacobi iteration method

x_l(t+1) = \frac{1}{a_{ll}} \Big( b_l - \sum_{j \ne l} a_{lj} x_j(t) \Big), l = 1, 2, \ldots, N; t = 0, 1, \ldots; (2)

where t is the iteration index, starting from an initial vector x(0). If each of p processors computes one of the components of x in (2), i.e., p = N, this implies a very large amount of data "shuffling" between processors at each iteration. Furthermore, all processors must synchronize at that point, creating a time barrier.
In an asynchronous version of (2), each processor would compute x_l using the most recent information available from the other processors, i.e., values of x_j which may be older than x_j(t), say possibly x_j(t-k), where k depends on l and j; see further Sect. 2, where we use an equivalent notation. In other words, asynchronous iterative algorithms do not require exchange of the most recent values. The convergence of (2) would naturally be delayed by the use of less recent values of x_j, but since no idle time for synchronization is necessary, and less overall communication takes place, asynchronous algorithms have the potential of outperforming standard synchronous iterative methods. We should mention here that, under certain hypotheses, asynchronous iterative methods are guaranteed to converge to the solution of the linear system (1); see, e.g., [9,14].
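A minimal sketch of one synchronous Jacobi sweep (2) in C may make the data dependency concrete. The dense-matrix storage and the function name are illustrative only, not taken from the paper's program:

```c
/* One synchronous (point) Jacobi sweep for an N x N system Ax = b:
   x_new[l] = (b[l] - sum_{j != l} a[l][j] * x_old[j]) / a[l][l].
   Every entry of x_new depends on the whole previous iterate x_old,
   which is why a parallel version must exchange the full vector and
   synchronize at each step. The matrix is stored row-major in a. */
static void jacobi_sweep(int n, const double *a, const double *b,
                         const double *x_old, double *x_new)
{
    for (int l = 0; l < n; l++) {
        double s = b[l];
        for (int j = 0; j < n; j++)
            if (j != l)
                s -= a[l * n + j] * x_old[j];
        x_new[l] = s / a[l * n + l];
    }
}
```

In use, one iterates the sweep, swapping the roles of x_old and x_new, until the change between successive iterates is small.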
One of the goals of this paper is to illustrate the applicability of asynchronous iterative methods. In particular, we want to point out how these methods can be implemented on inexpensive clusters of workstations, even personal computers, connected with 10 Mbps shared medium Ethernet. We use a passive object programming system named Synergy [8], which provides tuple spaces as communication and synchronization mechanisms. The choice of this programming tool is discussed in Sect. 4.1.
In practical block two-stage asynchronous iterative methods,
each processor solves a small linear system, not just one equation, as
in (2). Each processor uses an (inner) iterative method, e.g., Gauss-Seidel,
for the approximate solution of the linear system. One of the critical
questions about two-stage methods is what criteria to use to stop the inner
iterations in each processor; see, e.g., [13, 18]. In this paper, we compare
two widely used stopping criteria: one based on the size of the (inner)
residuals, and the other on a fixed number of inner iterations. We report
that a fixed number of inner iterations is a better choice for the architecture
considered in this paper; see Sects. 4.3 and 6.
Another question which we address is load balancing.
Should the partition of the domain be fixed by the number of available
processors (static block allocation) or do we gain by dividing the computational
domain dynamically to account for the different computational complexity
of the tasks in each region? To answer this question we develop a timing model. Using the timing model, we conclude that when the communication network is slow, as is the case for the hardware considered in this paper, static block allocation is preferred. It also follows that dynamic block allocation would be preferable for faster networks; see Sects. 4.4 and 6.
We report on a systematic procedure for identifying the optimal data exchange for a given asynchronous iterative algorithm and parallel processing environment. Our experience showed that it is possible to gain speed using asynchronous algorithms and clusters of workstations on a slow Ethernet. Our scalability models also show that asynchronous algorithms are beneficial for all parallel environments in which the sum of multiple processors' capacity (in instructions per second) far exceeds the sum of networking capacities (in bytes per second).
The overall organization of this paper is as follows. In Sect. 2, we describe the applicable class of asynchronous iterative algorithms. In Sect. 3, we describe the computational test problem used in this study. In Sect. 4, we detail the asynchronous parallel program design and implementation. In Sect. 5, we present timing analysis for evaluation of various implementation alternatives. In Sect. 6, we present computational results. Conclusions are found in Sect. 7.
2. Block Asynchronous Iterative Algorithms
The asynchronous methods studied here for the solution of (1) are based on the Block Jacobi algorithm; see, e.g., Varga [22]. Let the matrix A be partitioned into blocks

A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1p} \\ A_{21} & A_{22} & \cdots & A_{2p} \\ \vdots & & & \vdots \\ A_{p1} & A_{p2} & \cdots & A_{pp} \end{pmatrix}, (3)

with the diagonal blocks A_{ll} being square and nonsingular of order n_l, \sum_{l=1}^{p} n_l = N, and the vectors x and b partitioned conformally.
ALGORITHM 1 (Block Jacobi). Given an initial approximation x(0) of the vector x:

For t = 1, 2, \ldots, until convergence
For l = 1 to p, solve
A_{ll} x_l(t) = b_l - \sum_{j \ne l} A_{lj} x_j(t-1). (4)

The entire linear system can be solved in parallel by p processors, and the iteration vector at each step is x(t) = (x_1(t)^T, x_2(t)^T, \ldots, x_p(t)^T)^T.
The solution of the system (4) is the block counterpart to the iterations in (2). Thus, as the discussion of Sect. 1 indicates, this algorithm exchanges data for each value of t, and has at that point a synchronization barrier. For the solution of each sub-system (4) one can choose to use a different iterative method, such as Gauss-Seidel, and this class of methods is called two-stage iterative methods; see, e.g., [13,14], and the extensive references given therein.
For the asynchronous block Jacobi method, unlike in Algorithm 1, the processors are allowed to start the computation of the next iteration of their block without waiting for the simultaneous completion of the same iteration of the other components. In other words, components of x(t) are updated using a vector made of block components of previous, not necessarily the latest, iterations. As in the standard references for asynchronous algorithms, such as [6], the iteration subscript is increased every time any (block) component of the iteration vector is computed. Thus, one defines the sets J_l \subseteq \{1, 2, \ldots\} by t \in J_l if the l-th block component of the iteration vector is computed at the t-th step. Thus the Asynchronous Block Jacobi method can be described as follows.

ALGORITHM 2 (Asynchronous Block Jacobi). Given an initial approximation x(0):

For t = 1, 2, \ldots,
x_l(t) = solution of A_{ll} x_l = b_l - \sum_{j \ne l} A_{lj} x_j(r(j,t)) if t \in J_l,
x_l(t) = x_l(t-1) otherwise.

The term r(j,t) is used to denote the iteration number of the j-th block component being used in the computation of any block component in the t-th iteration, i.e., the iteration number of the j-th block component available at the beginning of the computation of x_l(t), if t \in J_l. We always assume that the terms r(j,t) of our asynchronous iterative algorithms satisfy the minimal criteria described in [14] and other references therein: r(j,t) < t for all j and t; r(j,t) \to \infty as t \to \infty for all j; and each set J_l is infinite, i.e., no block component ceases to be updated.
3. The Computational Test Problem
We use a magnetic field simulation problem as the test case because it represents typical physical simulation applications and it has a non-uniform geometry that requires load balancing when processed in parallel. The processing environment is a cluster of DEC Alpha workstations connected by shared 10 Mbps Ethernet. Our simulation is to determine the magnetic field in a region outside a permanent magnet [10]; i.e., to solve for the potential \varphi satisfying

\nabla^2 \varphi = \rho, (5)

where \rho is the volume magnetic charge density. In the two-dimensional rectangular domain this reads

\frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2} = \rho(x, y). (6)
The rectangular domain in Fig. 1 is discretized using a uniform grid with horizontal and vertical spacing h. Thus, the grid points are labeled (i, j), i, j = 1, 2, 3, \ldots, n, and we denote by \varphi_{i,j} the value of the function \varphi at the point (i, j). The discretization of (6) using centered differences is then

\varphi_{i,j} = \frac{1}{4} \left( \varphi_{i-1,j} + \varphi_{i+1,j} + \varphi_{i,j-1} + \varphi_{i,j+1} - h^2 \rho_{i,j} \right). (7)

Therefore, the value of the magnetic field at each grid point is a function of its four nearest neighbors; see Fig. 2.

The Gauss-Seidel iteration scheme on equation (7) is

\varphi_{i,j}(t+1) = \frac{1}{4} \left( \varphi_{i-1,j}(t+1) + \varphi_{i,j-1}(t+1) + \varphi_{i+1,j}(t) + \varphi_{i,j+1}(t) - h^2 \rho_{i,j} \right). (8)
This means that for the calculation of \varphi_{i,j} in each iteration cycle t, we use the most recently computed values of that point's four nearest neighbors. The boundaries of the rectangular magnet have the following variations of (8) in order to incorporate the discontinuities of \rho.
(Vertical edges)
(Horizontal edges)
(Corners)
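As a sketch of the interior update, one Gauss-Seidel sweep of (8) over a flattened n x n array can be written as below; the boundary variations above are omitted and the outermost ring of points is simply held fixed:

```c
#include <math.h>

/* One Gauss-Seidel sweep of the 5-point stencil (7)-(8) on an n x n grid.
   phi is updated in place, so each point already sees the new values of
   the neighbors swept before it. The sign convention follows the
   discretized equation (7); the boundary points are held fixed here.
   Returns the largest change at any point, useful as a stopping test. */
static double gauss_seidel_sweep(int n, double *phi, const double *rho, double h)
{
    double max_change = 0.0;
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            double old = phi[i * n + j];
            phi[i * n + j] = 0.25 * (phi[(i - 1) * n + j] + phi[(i + 1) * n + j]
                                   + phi[i * n + j - 1] + phi[i * n + j + 1]
                                   - h * h * rho[i * n + j]);
            double d = fabs(phi[i * n + j] - old);
            if (d > max_change) max_change = d;
        }
    }
    return max_change;
}
```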
4. Parallel Algorithm Development
A parallel program is typically more complex than its sequential counterpart. In this section, we first discuss our programming tool choice. We then discuss our partitioning choices and introduce a number of parallel implementation alternatives. Finally, we present an analysis method that can be used to evaluate the parallel algorithms by capturing the inherent interdependencies between the computing and communication overheads.
4.1 Parallel Programming Tool
The choice of programming tool can have a large impact on parallel implementation complexity and processing efficiency. An ideal parallel processing environment for parallel asynchronous iterative algorithms should be totally asynchronous with information over-write capabilities. This is because for asynchronous parallel algorithms, the sender and receiver are running independent of each other, and the receiver must obtain the latest information or otherwise introduce artificial synchronization barriers; see [20].
All message passing systems, such as MPI [19] and PVM [5], use bounded buffers. This implies synchronous semantics, namely that the sender should only send when the receiver is there to receive. Moreover, the messages do not over-write each other. If we were to use any message passing system, we would have to implement the new messaging semantics on top of the provided messaging channels. This would introduce much programming complexity and processing overhead. Thus, we consider that these systems are not appropriate for programming parallel asynchronous algorithms.
The Synergy system uses passive objects for parallel program communication and synchronization. A passive object is a coarse-grain data structure with a set of pre-defined operators. An example passive object is a tuple space with three operators: Read, Put and Get. The semantics of these operators are similar to the rd, get and put operations in the Linda system [1], provided that tuples are uniquely named and First-In-First-Out (FIFO) ordered. Writing to the same named tuple means over-writing the existing tuple's content. This characteristic meets the asynchronous iterative algorithm's requirement exactly. The object passiveness restricts the operators from dynamically creating new objects at runtime. They can only create instances within an object, thus leaving the outset communication topology fixed for each application. This feature was designed to facilitate automatic generation of efficient client/server programs from a fixed application configuration topology. For a more detailed description of Synergy, we refer the reader to [8].
Passive objects embed multithreaded controls under a simple asynchronous programming interface. Thus, there are no explicit process manipulation or synchronization statements in either the sender or the receiver programs. Unless specifically coded, each individual program has a single thread of control within its programming space.
4.2 Data Partition Choice
A typical parallel implementation of a block iterative algorithm, such as Block Gauss-Seidel [6,16], assigns several mesh points to each processor such that each processor only communicates with its four nearest neighbors; see Fig. 3, where there are p=9 processors and n=36.
This intuitive parallel implementation, called tiles, has a drawback in our setting: each processor must exchange border data with up to four neighbors, which multiplies the number of messages per iteration on the shared network.
Each grid point on the simulation mesh requires the values of its first neighbors during iteration t, as seen in (8). We therefore divide the rectangular grid of points into blocks of rows, with each block considered one work assignment; each block of rows corresponds to a diagonal block of the matrix A in (1), as shown in (3). Therefore, in Fig. 4, only the top and bottom rows of each block need to be exchanged between neighboring processors.
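A hypothetical helper for this row-block partition (not taken from the paper's program) might compute each worker's row range as follows:

```c
/* Row-block partition of an n-row grid among p workers: worker q
   (0 <= q < p) owns rows [first, last). Only its first and last owned
   rows must be exchanged with the neighboring workers q-1 and q+1.
   Any remainder rows are spread over the first n % p workers. */
static void block_rows(int n, int p, int q, int *first, int *last)
{
    int base = n / p, rem = n % p;
    *first = q * base + (q < rem ? q : rem);
    *last  = *first + base + (q < rem ? 1 : 0);
}
```

For the configuration of Fig. 3 (n = 36, p = 9) this gives each worker exactly four rows.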
4.3 Stopping Criteria
One way to look at the block asynchronous algorithm is to think of it as a relaxation of the point algorithm (2). In other words, we reduce the inter-process communication by transmitting the most recent approximations only after each processor has performed a specified amount of calculation (inner iterations). We define this amount of calculation as the dataflow reduction criterion (DRC), i.e., the stopping criterion for the approximation process. An inter-processor information exchange is called an outer iteration.
There are two distinct ways of defining the DRC: one by measuring the size of the inner residuals, and another by setting a fixed inner iteration limit. For the first case, let us define the residual computed in processor q as

R_q = \left\| b_q - \sum_{j=1}^{p} A_{qj} x_j \right\|.

This quantity is kept in the processor's memory between successive iterations, and it is checked whether the residual has been reduced by a specific amount. We use a threshold \gamma, and we say that the residual threshold criterion is met when the residual is reduced by a factor of \gamma, i.e., when R_q^{new} \le \gamma R_q^{old}.
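The residual threshold test can be sketched as below; the exact form of the comparison is our assumption, since the criterion is only specified as "reduced by a factor of gamma":

```c
/* Dataflow reduction criterion (DRC) via residual threshold: the worker
   keeps the residual from the last satisfied check and signals an
   exchange once the newly computed residual has dropped by the factor
   gamma. The reset of last_residual on success restarts the criterion
   for the next outer iteration. */
struct drc_state { double last_residual; };

static int residual_drc_met(struct drc_state *s, double residual, double gamma)
{
    if (residual <= gamma * s->last_residual) {
        s->last_residual = residual;   /* restart the criterion */
        return 1;                      /* exchange border rows now */
    }
    return 0;                          /* keep iterating locally */
}
```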
4.4 Load Balancing and Block Allocation Methods
The overall processing of the parallel program is as follows. A master process partitions the matrix into work assignments and distributes them to parallel workers. A worker process starts calculations after it receives a work assignment and some global information. It starts information exchange with its neighbors when the DRC is satisfied. Since multiple copies of the same worker code run simultaneously, it is easy to imagine that workers will exchange data with their neighbors at different times.
After a worker receives values of the border elements from its neighbors, it resumes the same calculation process until the next DRC is met. A block is considered solved locally when the residual of block q falls below a prescribed \varepsilon, i.e., when R_q < \varepsilon. The master process then collects all solved blocks and performs a Gauss-Seidel iteration across the reconstructed linear system to check whether the local solutions are indeed globally convergent. The system terminates if global convergence is reached. Otherwise, the master re-partitions the system and re-transmits the blocks. The re-calculation cycles can generate a very large communication volume. A system becomes unstable when the number of re-calculations is too large. An unstable system indicates that the processors are diverging into local solutions.
To further reduce the communication, we can restrict the number of workers and put an exact number of rows on each (static partitioning). This strategy can cause workload imbalance and thus negatively impact performance, since our computational test problem is a magnet with non-uniform geometry. To ease the load imbalance, we can instead put many smaller blocks (dynamic partitioning) in a FIFO queue and have the processors fetch block assignments as they become idle. This way, computation-intensive blocks will automatically be processed more often. The drawback is that this requires more network traffic.
4.4.1 Static Allocation Algorithm Details
A static parallel block iterative algorithm has a master and many workers. The master program is responsible for constructing the n \times n grid geometry, partitioning the grid into blocks of rows, assigning these blocks to the p worker modules residing on different processors, receiving results, and composing the solution matrix. Each row block that is assigned to a worker module is composed of n/p rows.
Once all blocks are solved locally, the master checks for global convergence. If it is not reached, it reassigns the blocks to the p workers. This procedure is repeated until the global convergence criterion is met.
4.4.2 Dynamic Allocation Algorithm Details
In this variation of the block-asynchronous parallel algorithm, only the worker module is modified, so that it uses the tuple space as the working assignment queue. As in the static version, a working assignment (a row-block partition of the grid) is a tuple. The tuple space is used as a FIFO queue containing all non-convergent tuples that have met the given DRC.
A dynamic worker module first reads global data (problem geometry) from the tuple space. It then extracts a work assignment tuple to compute. After the DRC is met, it will insert its border rows to be used by neighboring partitions. If local convergence is reached, it inserts the result for the master to retrieve. Otherwise, it reinserts the work assignment into the working tuple space. Such reinsertion will place the tuple at the end of the FIFO queue (see Fig. 6). If the number of working tuples is greater than the number of processors, a slow converging region will be processed by multiple processors. This can reduce load imbalance.
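The queue behavior can be illustrated with a minimal stand-in for the tuple-space FIFO; the Synergy operators themselves are not shown, and this only models the reinsertion logic described above:

```c
/* Minimal FIFO working queue mimicking the dynamic allocation scheme:
   an idle processor pops the head block, performs its inner iterations,
   and reinserts the block at the tail unless it has converged locally.
   Slow-converging blocks therefore pass through the queue more often. */
#define QCAP 64
struct fifo { int buf[QCAP]; int head, tail, len; };

static void fifo_push(struct fifo *q, int v)
{ q->buf[q->tail] = v; q->tail = (q->tail + 1) % QCAP; q->len++; }

static int fifo_pop(struct fifo *q)
{ int v = q->buf[q->head]; q->head = (q->head + 1) % QCAP; q->len--; return v; }

/* Drain the queue: each pop is one processing pass over a block;
   passes_needed[b] is how many passes block b takes to converge.
   Returns the total number of passes performed. */
static int run_queue(struct fifo *q, int *passes_needed)
{
    int total = 0;
    while (q->len > 0) {
        int b = fifo_pop(q);
        total++;
        if (--passes_needed[b] > 0)
            fifo_push(q, b);           /* not converged: back of the queue */
    }
    return total;
}
```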
5. Parallel Program Evaluation
In this section, we build timing models for the static and dynamic partitioning algorithms. The objective is to identify the relative merits of both algorithms judging from their inherent interdependencies between the parallel computation and communication times. Note that the performance difference between inner iteration limit control and residual threshold control cannot be modeled analytically but can be observed via computational results.
Timing models [21] are program time complexity-based models. Timing analysis requires timing models for the sequential algorithm and the corresponding parallel algorithm. Our scalability analysis requires calibrating the processing parameters by running once the sequential and parallel programs on a target environment.
We use the following symbols in the timing model analysis:

N   Number of grid points (= n \times n)
Qc  Number of floating point operations required per grid point
QN  Number of bytes required per grid point
I   Number of inner iterations
E   Number of outer iterations
W   Processor power in number of algorithmic steps per second
u   Network capacity in number of bytes per second

Note that W = W'/c represents the delivered processing power in number of algorithmic steps (as related to their time complexity models) per second, where W' is the actual processor power in number of machine instructions per second and c is a constant reflecting the average number of machine instructions generated from each algorithmic step. For our computational test problem, Qc = 11.
5.1 Sequential Gauss-Seidel Model
The sequential processing time can be modeled as

T_{seq} = \frac{Q_c N I E}{W}. (9)

To calibrate W, we run the sequential algorithm on two different processors: an Intel 486/100 MHz and a DEC Alpha/120 MHz. A summary of the statistics is shown in Fig. 7.
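The calibration step can be sketched as follows, where reading the time model (9) as T_seq = Qc * N * I * E / W is our interpretation of the symbol table above, not a formula quoted from the paper's code:

```c
/* Calibration of the delivered processor power W from one sequential
   run: with T_seq = Qc * N * I * E / W, a measured elapsed time T gives
   W = Qc * N * I * E / T in algorithmic steps per second. */
static double calibrate_W(double Qc, double N, double I, double E, double T)
{
    return Qc * N * I * E / T;
}

/* Predicted sequential time for a given W, i.e., model (9). */
static double t_seq(double Qc, double N, double I, double E, double W)
{
    return Qc * N * I * E / W;
}
```

Running the sequential program once on a target processor thus fixes W for all subsequent predictions.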
From Fig. 7 we can readily observe that there is little swapping effect: all programs fit into memory. We can also notice the presence of a large constant c for this algorithm in comparison to other well-known algorithms such as matrix multiplication and Linpack (see the table below).
|                        | Linpack  | Matrix Multiply (200 \times 200) |
| Intel486/100MHz/8MB    | 2.4 Msps | 4.3 Msps  |
| Alpha/120MHz/32MB      | 6.4 Msps | 29.4 Msps |
5.2 Static Block Allocation Algorithm
Parallel iterative algorithms can be modeled as a compute-aggregate-broadcast system, as shown in Fig. 8.
To simplify the analysis, we express the total parallel processing time as

T_{par} = T_{comp}^{m} + T_{comp}^{w} + T_{comm} + T_{sync}, (10)

where T_{comp}^{m} defines the master's total pure computation time, T_{comp}^{w} defines the maximum worker pure computation time (among the p parallel workers), and

T_{comm} = T_{comm}^{m} + T_{comm}^{w}. (11)

In (11), T_{comm}^{m} defines the master's total communication time, including broadcast of the global geometry, distribution of working tuples, and extraction of result tuples; T_{comm}^{w} defines the total worker communication time, including extraction of global information, extraction of working tuples, return of result tuples, and intermediate exchange of data with neighboring processors. Lastly, T_{sync} defines the worst-case load imbalancing overhead, assuming an \alpha-times difference between the fastest and slowest processors and a \beta-times difference between the fastest converging block and the slowest converging block.

Finally, we can define the static block algorithm timing to be

T_{spar} = T_{comp}^{m} + T_{comp}^{w} + T_{comm}^{m} + T_{comm}^{w} + T_{sync}. (12)
5.3 Dynamic Block Allocation Algorithm
The dynamic block allocation algorithm has a similar model, except that working tuples are fetched from the FIFO queue and re-inserted until convergence. Here, assuming the best load balancing (T_{sync} = 0), the only difference in the worker communication time T_{comm}^{w} is the tuple re-insertion overhead. Therefore, the total dynamic time model is

T_{dpar} = T_{comp}^{m} + T_{comp}^{w} + T_{comm}^{m} + \tilde{T}_{comm}^{w}. (13)
An important question is to decide quantitatively when to use the static and when to use the dynamic block allocation algorithm, in terms of parallel processing environments and problem sizes. Figs. 10 and 11 are obtained by numerical calculations using (12), (13) and the following assumptions:

N = 256 \times 256 grid points
W = 2.5 Msps (million algorithmic steps per second)
u = 100 Kbytes per second
c = 100 Kbytes (global broadcast data size)
Qc = 11 operations per grid point
QN = 8 bytes per grid point
p = 5
\alpha = \beta = 2 (two-times difference in load imbalance).
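These assumptions can be plugged into a simplified per-outer-iteration cost comparison; this is a sketch of the compute/communication tradeoff, not the paper's full models (12) and (13):

```c
/* Simplified per-outer-iteration costs: each of the p workers performs
   I inner sweeps over its N/p grid points, and the workers together
   push 2*p border rows of n points each (QN bytes per point) over the
   shared network. Both formulas are our simplification. */
static double worker_compute_time(double Qc, double N, double p,
                                  double I, double W)
{
    return Qc * (N / p) * I / W;      /* seconds of pure computation */
}

static double border_exchange_time(double p, double n, double QN, double u)
{
    return 2.0 * p * n * QN / u;      /* seconds on the shared medium */
}
```

With the values listed above and, say, I = 10 inner iterations per exchange, the compute term comes to roughly 0.58 s against about 0.20 s of communication per outer iteration.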
Fig. 10 illustrates that, for the given parallel processing environment, only the static algorithm will outperform the sequential one. However, the static algorithm's convergence time increases more rapidly than the dynamic algorithm's as we increase local iterations to reduce communication. Fig. 11 indicates that when the network speed rises above a certain point (approximately 600 KBps sustained), the dynamic algorithm will outperform the static algorithm.
6. Computational Results
Our computational results validate the predictions of the timing models; e.g., the shape of the curve Tdpar in Fig. 10 is confirmed in Fig. 13(a), and that of Tspar in Fig. 10 is confirmed in Fig. 15(a). The experiments reveal the quantitative performance consequences of data exchange frequencies and compare the residual and inner iteration control methods. We have produced four sets of experiments: dynamic block allocation with residual threshold control, dynamic block allocation with inner iteration control, static block allocation with residual threshold control, and static block allocation with inner iteration control. The parallel processing environment is a cluster of five (5) DEC Alpha workstations running the OSF/1 operating system. The workstations use a shared 10 Mbps Ethernet. The parallel programming environment is C with the Synergy V3.0 system.
For each set of experiments, we report elapsed times, total outer iterations and the number of total iterations. Elapsed time measures the actual wall-clock time of the application. Total outer iterations records the changes in outer iterations under the respective iteration control methods. Total iterations records the changes in the maximal number of inner iterations performed by a parallel worker process. The problem size is 256 \times 256 grid points (or N = 65,536 equations). Parallel processing will yield no advantage for problems of smaller sizes.
From the performance charts (Figs. 12-15), we observe that the measured behavior follows the predictions of the timing models in Sect. 5.
7. Conclusions
In this paper we report our analysis and computational results of an investigation that trades the abundance of computing power for scarce network bandwidth using asynchronous iterative algorithms. We consider message-passing parallel systems inappropriate for asynchronous parallel programs.
Our results show that it is possible to gain overall processing speed by sacrificing some local convergence speed. We have also found that the convergence time increases linearly (see Sect. 6) as we reduce the data exchange frequency. Since the overall cumulative speed of multiple processors is typically many times the interconnection network speed, for most practical applications asynchronous iterative algorithms are more advantageous than synchronous algorithms when processed in parallel.
Our experience in applying the timing model method to parallel algorithm analysis shows that it is possible to analytically prototype a complex parallel system with little program instrumentation. The results can be used to predict the scalability of the system as well as its optimal control points.
References