Justin Y. Shi, Ph.D.

a.k.a. Yuan Shi


SPPM Team Members

Yasin Celik (PhD in progress)
Joseph Jupin (PhD in progress)
Aakash Pradeep (MS graduated in 2013)
Moussa Taifi (PhD graduated in 2013)
Ian F. Davis (MS graduated in 2012)
Fanfan Xiong (PhD graduated in 5/2008)
Yijian Yang (PhD graduated 8/2007)
Feijian Sun (PhD graduated in 5/2005)
David Mutchler (PhD graduated in ~1998)
John Dougherty (PhD graduated in 1997)
Kostas Blathras (PhD graduated in 1996)
Chris Prabu (Chris Gali) (MS graduated in 1994)
Issac Newton (MS graduated in 1994)
Avi Freedman (Undergrad 1993)

 


Associate Professor
shi@temple.edu | +1(215)204-6437  

Ph.D. University of Pennsylvania 1984 (High-level Concurrent Programming)
M.S. University of Pennsylvania 1983 (Software Engineering)
B.S. Shanghai Jiaotong University 1979 (Computer Engineering)

Research Focus:

  • Highly reliable high performance computing systems: architecture, programming, and automatic serial-to-parallel program generation methods
  • Highly reliable high performance transactional systems: architecture and programming

A 1992 Supercomputing research exhibit video (wmv|20 min|100MB) illustrates the goals that we have been working toward for more than 20 years, starring Drs. John Dougherty, Kostas Blathras, and me.

Research Results:

  • Fault Tolerant High Performance Stateless Parallel Processing Machines (SPPM)
  • Synergy Parallel Processing System
  • Parallel Markup Language (PML)
  • Low-overhead Fault Tolerance Protocol
  • Lossless Fault Tolerant High Performance Database Clustering Middleware

Patents:

  • U.S. Patent: #5,517,656, 5/1996 ("Multi-computer System and Method")
  • U.S. Patent: #5,381,534, 3/1995 ("System for High-Level Virtual Computer With Heterogeneous Operating Systems")
  • Pending (2007): "Highly Available High Performance Multiprocessor System"
  • Pending (2007): "Apparatus and Method of Optimizing Data Clustering with Zero Transaction Loss"

Research Exhibits: Supercomputing 1991, 1992, 2004, 2005, 2006, 2007, 2008-2015. HPDC 2008.

Short Biography:

I studied computer engineering at Shanghai Jiaotong University in the late 1970s. I participated in the Chinese DJS-130 microprocessor design and implementation. In 1979, I received COBOL and CODASYL database training from the Miasakitai Research Center, NEC Corporation. Subsequently, I led a development team for the first automated foreign trade processing system in Shanghai, China.

In late 1980, I became a graduate student at the University of Pennsylvania. My Ph.D. dissertation focused on the automatic generation of linear system solvers in PL/I and the automatic generation of communication protocols for distributed processing in the LINK econometric simulation system developed by Dr. Lawrence R. Klein (Nobel Laureate). My job was to make total model opacity possible while allowing global data exchanges (without knowledge of how the data were generated) and to ensure overall system convergence. I developed a simple communication protocol to automatically discover the global network diameter and to ensure global system synchrony while allowing distributed parallel computing of the econometric models of individual countries. For this work, I earned my Ph.D. degree in December 1984. My thesis advisor was Dr. Noah Prywes. Drs. Amir Pnueli, Bolek Szymanski, and Eva Ma were committee members.

In 1985, I began studying fault-tolerant multiprocessor systems after building a failed message-based real-time multiprocessor system named Zodiac at the then Digital Equipment Corporation.

Knowing the fundamental deficiencies of message-passing systems, I have become a long-time critic of the Message Passing Interface (MPI) as the primary means of parallel processing, and a proponent of dataflow parallel computing and the tuple space mechanism for inter-process communication.

From 1991 to 1992, my Ph.D. student Kostas Blathras and I were responsible for the successful replacement of the IBM 3090 used by the Computational Physics Group at the T.J. Watson Research Center with much cheaper clusters of mixed IBM RS/6000 and SGI workstations running Synergy v1.0.

In 1995, I was granted my first patent, on an automatic distributed program generation method.

In 1996, I was granted a second patent, on an automatic load-balancing multiprocessor architecture. In the same year, I developed the Timing Model method for quantifying application-specific performance gains on single or multiple processors. Subsequently, the Stateless Parallel Processing (SPP) model and its initial architecture were conceived, with the potential to address high performance, high availability, and automatic parallel program generation all at the same time.

During the same period, while on a lecture trip to the Changsha Institute of Technology, Hunan, China, I accidentally discovered the equivalence between Amdahl's and Gustafson's laws. Synergy v3.0 was also completed.

In 1997, I founded Parallel Computers Technology Inc., an independent research and development company in Philadelphia, PA.

Since 1998, I have been working on three projects simultaneously: a fault-tolerant high performance processor design (SPPM), a sequential-to-parallel program generator (PML), and fault-tolerant high performance database cluster middleware.

By the end of 2006, Ian F. Davis had joined our group. Ian completed the first SPPM prototype using .NET in three months. All three projects yielded significant results, and recognizing the importance of this work, we filed two new patents.

From July 1, 2007 to 2009, I was the Chairperson of the CIS Department at Temple University. The current CIS Chairperson is Dr. Jie Wu.


SPPM (Stateless Parallel Processing Machines)

Motivation

SPPM research was inspired by the closing ceremony of the Zodiac Project at the then Digital Equipment Corporation (DEC). Zodiac was a 1985 real-time multiprocessor system leveraging multiple BI buses and the ELN real-time operating system. Although the project was completed, no one was able to answer a simple question from one of the DEC engineers: what if one of the processors dies?

This can be translated into a research question:

What kind of multiprocessor architecture can scale performance and availability at the same time?

The term "availability" here really means "application availability", not "system availability".

"System availability" is not hard to achieve, since one can automatically isolate faulty components by failing the running application. Building a scalable multiprocessor architecture that can keep applications running while tolerating computing or communication component failures is necessary but much harder.

Ironically, almost three decades later, the same question still troubles supercomputer designers and parallel programmers.

Methodology

SPPM was designed to deliver high performance and high application availability at the SAME time, leveraging commodity processors and communication components. Its built-in load balancing and fault tolerance measures deliver application high availability at very low cost.

The SPPM operating principles include "dataflow computing", identical to the early dataflow machines, and "statistic multiplexed computing" (SMC), identical in spirit to packet-switching networks but applied at the higher level of distributed computing. There are only two kinds of application programs: stateless workers and stateful masters.

A stateless worker repeats a simple cycle: get a work assignment and compute it. The name "stateless" refers to the use of Statistic Multiplexed Computing principles, which provide runtime mechanisms to automatically discard partial results and failures and to ensure reliable once-only delivery of arbitrary services.

A stateful master is responsible for distributing work assignments and collecting the results. Compute-intensive (iterative or recursive) applications can be accelerated using multiple workers managed by appropriate masters.
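
To make this structure concrete, here is a minimal Python sketch of the master/worker pattern over a tuple space. The TupleSpace class, its put/get calls, and the granularity parameter g are illustrative assumptions made for this sketch, not the actual SPPM or Synergy interfaces.

```python
# Minimal sketch of the SPPM/Synergy master-worker pattern over a tuple space.
# The TupleSpace class and its put/get calls are hypothetical stand-ins for
# the real runtime; only the structure (stateless workers, stateful master)
# is the point of the example.
import queue
import threading

class TupleSpace:
    """Toy in-process tuple space: one FIFO queue per tag."""
    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _q(self, tag):
        with self._lock:
            return self._queues.setdefault(tag, queue.Queue())

    def put(self, tag, value):          # publish a tuple
        self._q(tag).put(value)

    def get(self, tag):                 # blocking retrieval (tuple is consumed)
        return self._q(tag).get()

def worker(space):
    """Stateless worker: get a work tuple, compute it, put back a result tuple."""
    while True:
        item = space.get("work")
        if item is None:                # poison pill: shut down
            return
        start, rows, b = item
        partial = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                   for row in rows]
        space.put("result", (start, partial))

def master(space, a, b, g, n_workers):
    """Stateful master: distribute row blocks of size g, collect partial results."""
    tasks = 0
    for start in range(0, len(a), g):   # coarse-grain work units
        space.put("work", (start, a[start:start + g], b))
        tasks += 1
    c = [None] * len(a)
    for _ in range(tasks):              # results may arrive in any order
        start, partial = space.get("result")
        c[start:start + len(partial)] = partial
    for _ in range(n_workers):
        space.put("work", None)         # tell workers to stop
    return c

if __name__ == "__main__":
    ts = TupleSpace()
    a, b = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
    threads = [threading.Thread(target=worker, args=(ts,)) for _ in range(2)]
    for t in threads:
        t.start()
    print(master(ts, a, b, g=1, n_workers=2))   # [[19, 22], [43, 50]]
    for t in threads:
        t.join()
```

Because a worker keeps no state between cycles, a crashed worker can simply be replaced and its lost assignment re-issued; this is the property that the shadow-tuple and checkpointing mechanisms described below build on.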

The SPPM architecture also supports a low-overhead fault tolerance protocol for protecting against master failures.

Putting all of this together, the SPPM architecture reveals a simple design principle based on the generalized use of SPP concepts: it is possible to build a highly reliable, high performance multiprocessor by minimizing unnecessarily exposed states in both computing and communication components.

According to the impossibility studies, only decoupled program-program coordination APIs can ensure that the resulting application delivers reliable services using unreliable components.

Tightly coupled APIs and explicit parallelism are only suitable for small scale systems.

The move from explicit parallelism to implicit parallelism is analogous to the migration from circuit switching to packet switching in communication systems; today's packet-switching systems deliver high performance and high availability at the same time, even though an individual transfer can be slower than on the older circuit-switched systems.

We believe that this type of architecture is a blueprint for future cloud computing systems and multiprocessors in general.



Synergy

Motivation

Synergy was constructed to experiment with dataflow parallel processing concepts using networked distributed memory computers. 

Methodology

We took a drastically different approach from the Linda proposal by building a floating tuple space daemon that can be designated to run on any host in the network. The idea was to gain load balancing and fault tolerance benefits at the cost of additional communication overhead.

Implementation and Usage

Over the years, Synergy has become an early prototype implementation of SPPM. Synergy 3.0 is a Unix implementation supporting Fedora 5 and Solaris 10. The current Solaris release only supports "worker" fault tolerance; the Fedora 5 release needs a kernel patch to support full fault tolerance.

Since 1995, Synergy has been used in CIS graduate and undergraduate courses on Scripting for Sciences and Business, Programming Techniques, Algorithm Design, Operating Systems, and Distributed and Parallel Processing.

Downloads

Here is the direct link to the GitHub release of Synergy 3.0+.



PML (Parallel Markup Language)

Motivation

At the time of this writing, almost the entire HPC (High Performance Computing) community has given up on the feasibility of "parallel compilers."

This conclusion is no surprise if we continue to use the message-passing interface as the only communication and synchronization mechanism for parallel processing.

The fundamental problem with tightly coupled program-program coordination APIs, such as MPI and OpenMP, is that they force the runtime system to generate fixed program-processor bindings. Each such binding creates an impossibility case, as proved by Alan Fekete and others in 1993. With the traditional parallel compiler approach, generating parallel code using MPI requires the compiler to produce precise interface calls derived from dependency analysis, which in general is not practically possible.

Tightly coupled APIs also make dynamic load balancing impossible. These difficulties impede effective processing on extreme-scale heterogeneous processors.

The current HPC programming model requires expertise in

  • domain knowledge
  • programming skills
  • architecture-specific parallel processing skills

This model is clearly not economically viable.

A more reasonable approach seems to be to develop markup languages that facilitate automatic parallel program generation from sequential sources, once we master the fundamental requirements for sustainable high performance computing. This is a very different approach from the traditional "either performance or reliability" efforts.

Methodology

The use of the dataflow parallel model (SPPM) shifts the difficulties. In the early 1970s, the MIT Id compiler clearly demonstrated the feasibility of automatically generating data parallel programs. The problem was performance. SPPM addresses the performance issue with a coarse-to-fine granularity exploitation approach. Even with the added overheads for fault tolerance, it can still compete comfortably with MPICH2 without fault tolerance (see Computational Results).

Specifically, we leverage the power of XML by developing a simple Parallel Markup Language (PML) with 8 tags(*) that mark up explicit DATA DEPENDENCIES, thus avoiding the difficulties of dependency analysis. In comparison, all other markup language efforts were designed to "expose parallelism"; their tag processors (compilers) are expected to translate the exposed parallelism into direct communication calls, a rather difficult task indeed.

Using our approach, we can show that it is indeed possible to generate efficient parallel code directly from serial code, using the 7 tags for distributed-memory machines and one tag for shared-memory multi-core processors (Feijian Sun's thesis and Ian F. Davis, private communication).

(*) There are 7 tags for distributed-memory clusters, as studied in Feijian Sun's dissertation. An 8th tag is needed for shared-memory multi-core processors (not yet fully developed).
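
As a purely illustrative example, the sketch below annotates a serial matrix-multiply loop with dependency markers written as comments; the tag names are hypothetical stand-ins, not the actual PML tags defined in Feijian Sun's dissertation. The programmer declares only what each block reads and writes and how the output may be partitioned; a generator can then emit master/worker code of the kind sketched in the SPPM section.

```python
# Hypothetical illustration of dependency markup on serial code.
# The "tags" below are NOT the real PML tags; they only convey the idea of
# declaring data dependencies instead of exposing parallelism.

def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # <parallel-region>
    #   <read  shared="a, b"/>           a and b are read-only in the loop
    #   <write partitioned="c" by="i"/>  each i writes a disjoint slice of c
    for i in range(n):                   # with the declared dependencies, the
        for j in range(p):               # i-loop can be split into coarse-grain
            s = 0.0                      # work tuples by a code generator
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    # </parallel-region>
    return c
```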

Computational Results

The following table lists the elapsed times for multiplying two matrices of different sizes using PML-generated code with worker fault tolerance, hand-crafted Synergy code with worker fault tolerance, MPICH2 code without fault tolerance, and sequential code. The tests were conducted using 2-4 Blade 500 workstations (500 MHz with 512 MB memory) running Solaris 10. The network was half-duplex 10 Mbps Ethernet.

Nodes   Size   PML          Synergy       MPICH2   Seq.
2       600    6(G=20)      5(G=50)       5.12     8.9
2       800    15(G=40)     12.2(G=200)   11.87    21.6
2       1000   29(G=63)     23.4(G=63)    22.86    42.4
2       1600   119(G=20)    95(G=100)     95       181.7
2       2000   231(G=40)    187.6(G=75)   186      358.7
4       600    4.5(G=16)    3.6(G=16)     3.3      8.9
4       800    10(G=12)     7.8(G=12)     7.2      21.6
4       1000   17.3(G=13)   14.1(G=13)    13.4     42.4
4       1600   66.7(G=23)   53(G=23)      53       181.7
4       2000   127(G=20)    101.2(G=21)   100      358.7

(G = processing granularity.)

The PML-generated fault-tolerant code performed competitively against the Synergy code (with fault tolerance) and the MPI code (without fault tolerance).



Light-Weight Fault Tolerant Protocol (LWFTP)

Motivation

Support for "worker" fault tolerance in SPPM is easily accomplished by implementing "shadow tuples". For an application that requires more than 3 days to run, it is easy to see that master fault tolerance is also unavoidable.
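
The sketch below illustrates the shadow-tuple idea; the class, its names, and the timeout policy are assumptions made for illustration, not the actual SPPM implementation. When a worker takes a work tuple, the space keeps a shadow copy; if the matching result does not arrive in time, the shadow is re-queued so another worker can redo the lost assignment, and late duplicate results are discarded.

```python
import time

class ShadowedWorkPool:
    """Illustrative shadow-tuple bookkeeping (not the actual SPPM code).
    Work handed to a worker is remembered as a 'shadow' until its result
    arrives; shadows that time out are re-queued for other workers."""

    def __init__(self, timeout_s=30.0):
        self.pending = []        # work tuples not yet taken
        self.shadows = {}        # work_id -> (work, time_taken)
        self.timeout_s = timeout_s

    def add_work(self, work_id, work):
        self.pending.append((work_id, work))

    def take_work(self):
        """Hand out a work tuple and keep a shadow copy of it."""
        self.requeue_lost_work()
        if not self.pending:
            return None
        work_id, work = self.pending.pop(0)
        self.shadows[work_id] = (work, time.monotonic())
        return work_id, work

    def complete(self, work_id):
        """A result arrived: drop the shadow. Late duplicates are discarded."""
        return self.shadows.pop(work_id, None) is not None

    def requeue_lost_work(self):
        """Re-queue shadows whose worker is presumed dead (timed out)."""
        now = time.monotonic()
        for work_id, (work, taken) in list(self.shadows.items()):
            if now - taken > self.timeout_s:
                del self.shadows[work_id]
                self.pending.append((work_id, work))
```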

This becomes especially important for large-scale HPC systems, where the MTBF (Mean Time Between Failures) has come down to less than 3 days for a cluster of 1,024+ nodes.

Methodology

LWFTP is a checkpoint-restart (CPR) coordination protocol implemented for the SPPM architecture that tolerates multiple master failures (Yijian Yang's thesis). It leverages the worker fault tolerance feature of the SPPM infrastructure to drastically reduce both spatial and temporal overheads. A significant result is the solution for seamless process migration with multiple live sockets.
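
The sketch below shows only the checkpoint-restart skeleton on the master side, assuming a simple pickled snapshot of the master's progress; the actual LWFTP protocol coordinates checkpoints with the tuple space and additionally migrates processes with live sockets, which is not modeled here.

```python
import os
import pickle
import time

CHECKPOINT = "master.ckpt"   # hypothetical checkpoint file name

def save_checkpoint(state):
    """Atomically persist the master's progress. Workers are stateless,
    so only this small record needs to be saved."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)            # atomic rename

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_block": 0, "results": {}}

def run_master(blocks, compute_block, interval_s=60.0):
    """Resumable master loop: a restarted master continues from the last
    checkpoint, losing at most one checkpoint interval of master work."""
    state = load_checkpoint()
    last = time.monotonic()
    while state["next_block"] < len(blocks):
        i = state["next_block"]
        state["results"][i] = compute_block(blocks[i])   # stands in for
        state["next_block"] = i + 1                      # dispatch/collect
        if time.monotonic() - last >= interval_s:
            save_checkpoint(state)
            last = time.monotonic()
    save_checkpoint(state)
    return [state["results"][i] for i in range(len(blocks))]
```

Because the workers carry no state, only the master's progress record needs to be checkpointed, which is consistent with the reported reduction of the checkpoint size from P x 300 MB to 1 x 300 MB.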

Computational Results

Performance studies show that the proposed method dramatically reduces the checkpoint size of a parallel application from an average of P x 300 MB to an average of 1 x 300 MB. The overall checkpoint-restart overhead without processor failures is less than 5%. The overall checkpoint-recovery overhead with a single master processor failure and recovery is on average 25% of the total elapsed application running time (one checkpoint interval). The overhead with a single worker processor failure and recovery is less than 5% (one processing granule).

These tests were conducted on a cluster of Intel processors (2 GHz, 1 GB memory) connected by a 100 Mbps network.

Limitations of the current implementation include:

  • The recovered process ID will differ from the original. This does not impact system integrity under SPPM, but it would have a serious negative impact on direct-communication systems such as MPI/PVM and OpenMP.
  • Local files are not supported. All files must be accessible via the network.
  • UDP sockets are not supported.



High Performance Fault Tolerant Database Cluster 

Motivation

This research started with a very simple request from a graduate student in 1998: how can one replicate concurrent transactions to multiple database servers?

The intent at the time was to provide a failover capability leveraging multiple "hot-sync'd" databases. It quickly escalated into the pursuit of a highly reliable high performance database cluster architecture.

Methodology (U.S. Patent: #6,421,688 and other pending patents)

Leveraging the Stateless Parallel Processing principles, our research (in collaboration with Suntain Song) has identified three critical measures for resolving transaction replication and data clustering difficulties:

  • Dynamic query serialization
  • Dynamic and session-based query load balancing
  • Non-stop resynchronization

Using these methods, we have effectively restructured the traditional two-phase-commit protocol by separating it into a commit phase and a resynchronization phase. Synchronous (zero-loss) transaction replication can now proceed flawlessly at high speed without fear of data inconsistencies or transaction losses.
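
The sketch below shows how a gateway along these lines could behave; it illustrates the three measures above under simplifying assumptions (replicas are objects exposing an execute(sql) call) and is not the patented DBx implementation. Updates are serialized and applied to every live replica, reads are load-balanced across replicas, and a replica that falls behind is taken out of rotation and queued for non-stop resynchronization.

```python
import itertools
import threading

class ReplicatedGateway:
    """Illustrative DBx-style gateway sketch (not the actual middleware).
    Each replica is assumed to expose execute(sql) -> rows."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.live = list(replicas)                       # replicas in rotation
        self.resync_log = {id(r): [] for r in replicas}  # updates to replay
        self.write_lock = threading.Lock()               # dynamic serialization
        self._rr = itertools.cycle(range(len(replicas)))

    def read(self, sql):
        """Route a read-only query to one live replica (round-robin)."""
        while self.live:
            r = self.live[next(self._rr) % len(self.live)]
            try:
                return r.execute(sql)
            except Exception:
                self._mark_stale(r)
        raise RuntimeError("no live replica")

    def write(self, sql):
        """Serialize updates, apply them to every live replica, and queue
        them for stale replicas so they can catch up without downtime."""
        with self.write_lock:
            ok = 0
            for r in list(self.live):
                try:
                    r.execute(sql)
                    ok += 1
                except Exception:
                    self._mark_stale(r)
            for r in self.replicas:
                if r not in self.live:
                    self.resync_log[id(r)].append(sql)
            if ok == 0:
                raise RuntimeError("update failed on all replicas")

    def _mark_stale(self, replica):
        """Take a failed replica out of rotation; a background resynchronizer
        would replay its resync_log and then return it to self.live."""
        if replica in self.live:
            self.live.remove(replica)
```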

A more significant result is the overall performance scalability for heavy update queries on very large databases using a technique called "k-order shift mirroring".

These solutions are implemented in a single server program positioned between the database clients and a set of independent database servers. It is named the DBx database cluster middleware.

Computational Results

The following charts show the computational results with and without data partitioning.

Without horizontal partitioning, the basic DBX gateway performance characteristics are illustrated in Fig. 1, where CD (Computing Density) is defined as the number of calculations per query.

Fig.1. DBX Performance without Horizontal Partition

Here are 42 example calculations of the expected performance for different scenarios without data partitioning:

                    Speedup (90% Read)            Speedup (50% Read)
Cluster Size    CD=1    CD=20   CD=200        CD=1     CD=20   CD=200
2               0.60    1.30    1.90          (0.20)   0.60    1.10
3               0.90    1.95    2.85          (0.30)   0.90    1.65
4               1.20    2.60    3.80          (0.40)   1.20    2.20
5               1.50    3.25    4.75          (0.50)   1.50    2.75
6               1.80    3.90    5.70          (0.60)   1.80    3.30
7               2.10    4.55    6.65          (0.70)   2.10    3.85
8               2.40    5.20    7.60          (0.80)   2.40    4.40

(Values in parentheses are negative, i.e., slowdowns.)

For a DBX cluster of 4 SQL Servers, an application with 90% reads can expect a speedup of 4 x 0.30 = 1.2 times for CD=1, 4 x 0.65 = 2.6 times for CD=20, and 4 x 0.95 = 3.8 times for CD=200. In comparison, systems using Log Shipping (the native SQL Server replication tool) would only yield negative performance regardless of read ratio. Note that for applications with heavy updates, DBX performance can also be negative if CD is low.
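
The tabulated values are consistent with a simple linear model, speedup = cluster size x per-server efficiency, where the efficiency depends on the read ratio and CD (for 90% reads: roughly 0.30, 0.65, and 0.95 for CD=1, 20, and 200). The snippet below reproduces both columns of the table under that assumption; the model itself is an inference from the table, not a formula taken from the DBx documentation.

```python
# Reproduce the speedup table above under the assumption that
# speedup = cluster_size * per-server efficiency (efficiencies inferred
# from the table / Fig. 1; they are not official DBx parameters).
EFF_90_READ = {1: 0.30, 20: 0.65, 200: 0.95}    # CD -> efficiency
EFF_50_READ = {1: -0.10, 20: 0.30, 200: 0.55}

def speedup(cluster_size, cd, eff):
    return round(cluster_size * eff[cd], 2)

for n in range(2, 9):
    row90 = [speedup(n, cd, EFF_90_READ) for cd in (1, 20, 200)]
    row50 = [speedup(n, cd, EFF_50_READ) for cd in (1, 20, 200)]
    print(n, row90, row50)   # e.g. 4 [1.2, 2.6, 3.8] [-0.4, 1.2, 2.2]
```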

Fig. 2. DBX Performance with Horizontal Partition

Fig. 2 illustrates the effects of DBX horizontal partitioning. For a cluster of 4 SQL Servers, even an application with only 5% reads can expect a speedup of 4 x 0.6 = 2.4 times. Different CD values have little impact on the overall performance. To reap these performance benefits, the application must be re-engineered to accommodate the (changing) partitioned datasets.

The following 42 scenarios are for applications with manually partitioned data tables.

                    Speedup (90% Read)            Speedup (50% Read)
Cluster Size    CD=1    CD=20   CD=200        CD=1    CD=20   CD=200
2               1.96    1.96    1.98          1.70    1.74    1.78
3               2.94    2.94    2.97          2.55    2.61    2.67
4               3.92    3.92    3.96          3.40    3.48    3.56
5               4.90    4.90    4.95          4.25    4.35    4.45
6               5.88    5.88    5.94          5.10    5.22    5.34
7               6.86    6.86    6.93          5.95    6.09    6.23
8               7.84    7.84    7.92          6.80    6.96    7.12

A more recent study by F. Xiong showed that it is possible to achieve unlimited scalability by separating the number of partitions (P) from the degree of redundancy (R). With the DBx technology, we may need to rethink the design of industry-standard benchmarks, such as those proposed by www.tpc.org.


 

 

 

Copyright (c) 2007-15. Justin Y. Shi. All Rights Reserved.