Stateless Parallel Processing – An Industrial Strength Architecture for Higher Performance Computing Machines

Yuan Shi

Shi@cis.temple.edu

Room 1036, MS:38-24

CIS Department

Temple University

Philadelphia, PA 19122

(215) 204-6437 (Voice), (215) 204-5082 (Fax)

ABSTRACT

In the highly competitive computing industry, users always get their wishes; it is only a matter of time. In multi-processor systems, users' demands have gone well beyond designers' capabilities. Despite heroic MPP and SMP designs, discontent is easy to find.

This paper reports a Stateless Parallel Processing concept and a corresponding system architecture (U.S. Patent #5,517,656). It was designed to be of industrial strength: high-performance, programmable, fault-tolerant, reconfigurable and scalable.

The design of a Stateless Parallel Processor (SPP) starts with a programming paradigm. We draw strength from existing parallel systems for this task, in particular the Linda and Synergy systems, which use the Tuple Space mechanism for inter-process communication and synchronization. The SPP hardware design draws strength from industry-standard high-speed interconnection networks, namely SCI (Scalable Coherent Interconnect), FCS (Fibre Channel Standard), ATM (Asynchronous Transfer Mode) and the more recent Gigabit Ethernet. The focus is to avoid the notorious tuple space implementation overhead and to achieve true scalability and usability.

The philosophy of the SPP design emerged from quantitative analysis of many practical parallel runs using a method we call Timing Models [5]. From the analytical and computational results for each application, we quickly learned where the true performance bottlenecks lie. The result is the SPP concept and architecture.
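
As an illustrative sketch of what such a model looks like (the notation below is chosen here and is not necessarily that of [5]), the elapsed time of a parallel run can be written as a function of the processor count P:

    T(P) = \frac{W_{par}}{P} + W_{seq} + \alpha m + \beta V

where W_par is the parallelizable work, W_seq the serial residue, m the number of messages, V the communicated data volume, and \alpha and \beta the per-message latency and per-byte transfer time of the network. Comparing the computation term W_par/P against the communication terms \alpha m + \beta V for a given application is what exposes its true bottleneck.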

The proposed paper will illustrate the details of the SPP hardware and software designs. Example parallel programs will also be included, and performance data from the prototype machines will be presented and discussed.

EXTENDED ABSTRACT

In the computing industry, end-users always win. Their expectations are raised by every new product announcement. For multi-processor systems, expectations run higher than system designers' abilities. While vendors struggle to meet performance criteria, users also want programmability, load balancing, fault tolerance, reconfigurability and scalability.

In hardware, once a computer is built, growing or changing anything inside it is not easy. Unlike a personal computer, a multi-processor system cannot afford an entire-system upgrade every year. Reconfigurability and scalability are therefore not optional. In fact, lacking any of these desired qualities can be detrimental in this highly competitive industry.

The success of SMP (Symmetric Multi-Processor) systems may be a surprise to MPP (Massively Parallel Processor) designers. It is remarkable that the "least brainy design" has come such a long way. To researchers, however, the SMP success has refined the meaning of "high performance computing".

This paper reports a stateless parallel processing concept and a corresponding computing system architecture [2]. It was designed to embody this refined meaning of high performance computing, and it promises to meet all of the user desires above.

A stateless process is a program that can run on any available processor without causing global state inconsistencies. A stateless parallel processor (SPP) executes only stateless processes. SPP does not support processes that directly send messages to, or receive messages from, specific other processes. This requirement is the key to meeting the end-user desires above. It facilitates programmability, ease of maintenance, automatic load balancing, processor fault tolerance, automatic processor reconfiguration and system scalability. However, it makes direct use of the most popular communication protocols, such as shared memory, message passing and remote procedure calls, nearly impossible.

The programming paradigm design of SPP draws strength from existing parallel systems such as Linda [3] and Synergy [4]. It uses a Tuple Space as the only means of parallel process synchronization and communication.

In SPP, the Tuple Space is an intermediate data repository responsible for the temporary storage and transfer of transient data. Processes send and receive tuples as their only means of communication and synchronization. There are three primitives: put, which inserts a named tuple into the space; get, which blocks until a tuple matching the requested name pattern appears, then removes and returns it; and read, which returns a copy of a matching tuple without removing it. A minimal sketch of these semantics follows.
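
The following sketch (in Python, purely illustrative; the class and method names are chosen here and are not the actual Linda or Synergy API) shows the intended semantics of the three primitives:

import re
import threading

class TupleSpace:
    """Illustrative in-memory tuple space keyed by tuple name.
    Not the Linda or Synergy implementation."""
    def __init__(self):
        self._tuples = []                  # list of (name, value) pairs
        self._cond = threading.Condition()

    def put(self, name, value):
        # Insert a tuple and wake any blocked get/read callers.
        with self._cond:
            self._tuples.append((name, value))
            self._cond.notify_all()

    def _match(self, pattern):
        # Associative matching: find a tuple whose name fits the pattern.
        for i, (name, _) in enumerate(self._tuples):
            if re.fullmatch(pattern, name):
                return i
        return None

    def get(self, pattern):
        # Block until a matching tuple exists, then remove and return it.
        with self._cond:
            while (i := self._match(pattern)) is None:
                self._cond.wait()
            return self._tuples.pop(i)

    def read(self, pattern):
        # Same as get, but leaves the tuple in the space.
        with self._cond:
            while (i := self._match(pattern)) is None:
                self._cond.wait()
            return self._tuples[i]

ts = TupleSpace()
threading.Thread(target=lambda: ts.put("result-1", 42)).start()
print(ts.read("result-.*"))   # ('result-1', 42); the tuple stays
print(ts.get("result-.*"))    # ('result-1', 42); the tuple is removed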

In comparison to shared-memory and message-passing protocols, the tuple space combines the advantages of an "associative memory" (tuple-name pattern matching) with those of traditional storage-retrieval protocols (read and write).

One typical concern in using a tuple space is the implementation overhead. On the surface it can seem extremely inefficient: if the tuple space is implemented as a real independent server (or daemon), every process-to-process communication costs double, one transfer from the sender to the space and another from the space to the receiver. Implementing a virtual tuple space based on syntax analysis has proven detrimental to scalability [3].
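
Under the same linear cost model as above (again an illustrative assumption, not a measurement), the two cases for an n-byte tuple compare as

    T_{direct} = \alpha + \beta n, \qquad T_{server} = 2(\alpha + \beta n),

that is, a store-and-forward pass through an independent tuple space server pays the latency and bandwidth price twice for every transfer.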

The SPP hardware design solves this problem with a two-tiered network: a unidirectional slotted ring and a switched network, with all processors connected to both. The ring transmits tuple headers (name plus bookkeeping information), while the switched network performs the direct transfer of the tuples themselves. In other words, name matching in SPP is indirect, via the slotted ring, while real data transfer is direct, via the switched network. Since the slotted ring permits simultaneous transmission by multiple stations, it offsets the overhead of tuple-header communication. The switched network likewise permits multiple simultaneous exchanges, which enables true network scalability. Both networks have industry-standard implementations with proven scalability and well-understood performance limits [1].
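
The following sketch (Python, purely illustrative; Node, put, get and serve_one are names invented here, not the SPP implementation) mocks the two-tiered protocol: only the small header circulates on the shared ring directory, while the payload moves point-to-point once a match is found.

import queue
import threading

class Node:
    """Mock SPP station. The 'ring' is a shared header directory; the
    'switch' is a set of per-node mailboxes for direct transfers."""
    def __init__(self, name, ring, switch):
        self.name = name
        self.ring = ring
        self.switch = switch
        self.local = {}               # payloads held locally
        switch[name] = queue.Queue()

    def put(self, tuple_name, payload):
        # Keep the (possibly large) payload local; circulate only the
        # small header (tuple name + owner) on the ring.
        self.local[tuple_name] = payload
        with self.ring["cond"]:
            self.ring["headers"][tuple_name] = self.name
            self.ring["cond"].notify_all()

    def get(self, tuple_name):
        # Step 1: indirect name matching via the ring.
        with self.ring["cond"]:
            while tuple_name not in self.ring["headers"]:
                self.ring["cond"].wait()
            owner = self.ring["headers"].pop(tuple_name)
        # Step 2: direct payload transfer via the switched network.
        self.switch[owner].put((tuple_name, self.name))   # request
        return self.switch[self.name].get()               # payload arrives

    def serve_one(self):
        # Owner side: answer a single direct-transfer request.
        tuple_name, requester = self.switch[self.name].get()
        self.switch[requester].put(self.local.pop(tuple_name))

ring = {"headers": {}, "cond": threading.Condition()}
switch = {}
a, b = Node("A", ring, switch), Node("B", ring, switch)
a.put("block-0", list(range(1000)))      # header on ring; data stays at A
threading.Thread(target=a.serve_one).start()
print(len(b.get("block-0")))             # 1000: payload moved A to B directly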

Stateless processes allow full exploitation of dataflow computing principles. They allow dynamic processor reconfiguration based on the self-scheduling principle of dataflow machines, which permits taking full advantage of application-specific parallelism, including SIMD, MIMD and pipelines, without user programming. Processor fault tolerance can be achieved without "active replicas" that waste precious processor cycles (a sketch follows below). The SMP load-balancing problem disappears thanks to the finer processing grains, while every SMP program remains runnable on an SPP. Furthermore, the SPP hardware design takes advantage of data locality while avoiding the notorious cache-coherence problem (a problem that has plagued multi-processor designers for decades).
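
The sketch below (Python, with all names invented for illustration) shows why statelessness gives replica-free fault tolerance: a work tuple whose result fails to appear is simply re-posted by a watchdog, and because workers carry no state, recomputing a tuple is idempotent and duplicates are harmless.

import queue
import threading
import time

work = queue.Queue()   # stands in for the tuple space's work tuples
results = {}           # result tuples, keyed by work id

def worker(crash=False):
    while True:
        try:
            wid, x = work.get(timeout=3.0)
        except queue.Empty:
            return                    # no more work: retire quietly
        if crash:
            return                    # simulate a processor dying mid-task
        results[wid] = x * x          # the "computation"

def master(n, watchdog_period=0.5):
    for wid in range(n):
        work.put((wid, wid))
    while len(results) < n:
        time.sleep(watchdog_period)
        for wid in range(n):          # re-post any work with no result yet
            if wid not in results:
                work.put((wid, wid))

threading.Thread(target=worker, kwargs={"crash": True}).start()
for _ in range(2):
    threading.Thread(target=worker).start()
master(4)
print(sorted(results.items()))        # all four results despite the crash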

If we consider SMP coarse-grain parallel and MPP fine-grain parallel, then SPP is medium-grain parallel. It has neither the programmability problems found in MPP nor SMP's limits on running truly parallel programs. Paraphrasing Pfister [1], "the ugly fly in the ointment has been removed."

This paper will detail the SPP hardware and operating system designs. Preliminary test data will be presented as the basis for predicting production-system performance. The final draft will also include three example programs running on the prototype systems at Temple University and at the author's home.

The author would like to thank the Dean's Office of the College of Arts and Sciences and the Provost's Office of Temple University for the Grant-in-Aid funding that facilitated construction of the SPP prototype.

References:

  1. Gregory F. Pfister, "In Search of Clusters: The Coming Battle in Lowly Parallel Computing," Prentice Hall PTR, ISBN 0-13-437625-0, 1995.
  2. Y. Shi, "Multi-computer System and Method," United States Patent #5,517,656, May 1996.
  3. S. Ahuja, N. Carriero and D. Gelernter, "Linda and Friends," IEEE Computer, pp. 26-32, August 1986.
  4. Y. Shi, "System for High-Level Virtual Computer with Heterogeneous Operating Systems," United States Patent #5,381,534, March 1995.
  5. Y. Shi, "Parallel Program Scalability Analysis," IASTED International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.