Justin Y. Shi, Ph.D.

a.k.a. Yuan Shi


SPPM Team Members

Yasin Celik (PhD in progress)
Joseph Jupin (PhD in progress)
Aakash Pradeep (MS graduated in 2013)
Moussa Taifi (PhD graduated in 2013)
Ian F. Davis (MS graduated in 2012)
Fanfan Xiong (PhD graduated in 5/2008)
Yijian Yang (PhD graduated 8/2007)
Feijian Sun (PhD graduated in 5/2005)
David Mutchler (PhD graduated in ~1998)
John Dougherty (PhD graduated in 1997)
Kostas Blathras (PhD graduated in 1996)
Chris Prabu (Chris Gali) (MS graduated in 1994)
Issac Newton (MS graduated in 1994)
Avi Freedman (Undergrad 1993)

 


Associate Professor
shi@temple.edu | +1(215)204-6437  

Ph.D. University of Pennsylvania 1984 (High-level Concurrent Programming)
M.S. University of Pennsylvania 1983 (Software Engineering)
B.S. Shanghai Jiaotong University 1979 (Computer Engineering)

Research Focus:

  • Highly reliable high performance computing systems: architecture, programming, and automatic serial-to-parallel program generation methods
  • Highly reliable high performance transactional systems: architecture and programming

A 1992 Supercomputing research exhibit video (wmv|20 min|100MB) illustrates the goals that we have been working toward for more than 20 years, starring Drs. John Dougherty, Kostas Blathras, and me.

Research Results:

  • Fault Tolerant High Performance Stateless Parallel Processing Machines (SPPM)
  • Synergy Parallel Processing System
  • Parallel Markup Language (PML)
  • Low-overhead Fault Tolerance Protocol
  • Lossless Fault Tolerant High Performance Database Clustering Middleware

Patents:

  • U.S. Patent: #5,517,656, 5/1996 ("Multi-computer System and Method")
  • U.S. Patent: #5,381,534, 3/1995 ("System for High-Level Virtual Computer With Heterogeneous Operating Systems")
  • Pending (2007): "Highly Available High Performance Multiprocessor System"
  • Pending (2007): "Apparatus and Method of Optimizing Data Clustering with Zero Transaction Loss"

Research Exhibits: Supercomputing 1991, 1992, 2004, 2005, 2006, 2007, 2008-2015. HPDC 2008.

Short Biography:

I studied computer engineering at Shanghai Jiaotong University in the late 1970s. I participated in the Chinese DJS-130 microprocessor design and implementation. In 1979, I received COBOL and CODASYL database training from the Miasakitai Research Center, NEC Corporation. Subsequently, I led a development team for the first automated foreign trade processing system in Shanghai, China.

In late 1980, I became a graduate student at the University of Pennsylvania. My Ph.D. dissertation focused on the automatic generation of linear system solvers in PL/I and the automatic generation of communication protocols for distributed processing in the LINK econometric simulation system developed by Dr. Lawrence R. Klein (Nobel Laureate). My job was to make total model opacity possible while allowing global data exchanges (without knowledge of how the data were generated) and to ensure overall system convergence. I developed a simple communication protocol to automatically discover the global network diameter and to ensure global system synchrony while allowing distributed parallel computing of the econometric models of individual countries. For this work, I earned my Ph.D. degree in December 1984. My thesis advisor was Dr. Noah Prywes. Drs. Amir Pnueli, Bolek Szymanski, and Eva Ma were committee members.

In 1985, I began studying fault-tolerant multiprocessor systems after building a failed message-based real-time multiprocessor system named Zodiac at the then Digital Equipment Corporation.

Knowing the fundamental deficiencies of message-passing systems, I have become a long-time critic of the Message Passing Interface (MPI) as the primary means of parallel processing, and a proponent of dataflow parallel computing and the tuple space mechanism for inter-process communication.

From 1991 to 1992, my Ph.D. student Kostas Blathras and I were responsible for the successful replacement of the IBM 3090 used by the Computational Physics Group at the T.J. Watson Research Center with much cheaper clusters of mixed IBM RS/6000 and SGI workstations running Synergy v1.0.

In 1995, I was granted my first patent, on an automatic distributed program generation method.

In 1996, I was granted a second patent, on an automatic load-balancing multiprocessor architecture. In the same year, I developed the Timing Model method for quantifying application-specific performance gains on single or multiple processors. Subsequently, the Stateless Parallel Processing (SPP) model and its initial architecture were conceived, with the potential to address high performance, high availability, and automatic parallel program generation all at the same time.

During the same period, while on a lecture trip to the Changsha Institute of Technology, Hunan, China, I accidentally discovered the equivalence between Amdahl's and Gustafson's laws. Synergy v3.0 was also completed.

In 1997, I founded Parallel Computers Technology Inc., an independent research and development company in Philadelphia, PA.

Since 1998, I have been working on three projects simultaneously: a fault-tolerant high performance processor design (SPPM), a sequential-to-parallel program generator (PML), and fault-tolerant high performance database cluster middleware.

By the end of 2006, Ian F. Davis had joined our group. Ian completed the first SPPM prototype using .NET in three months. All three projects yielded significant results, and recognizing the importance of this work, we filed two new patents.

From July 1, 2007 to 2009, I was the Chairperson of the CIS Department at Temple University. The current CIS Chairperson is Dr. Jie Wu.


SPPM (Stateless Parallel Processing Machines)

Motivation

SPPM research was inspired by the closing ceremony of the Zodiac Project at the then Digital Equipment Corporation (DEC). Zodiac was a 1985 real-time multiprocessor system leveraging multiple BI buses and the ELN real-time operating system. Although the project was completed, no one was able to answer a simple question from one of the DEC engineers: what if one of the processors dies?

This can be translated into a research question:

What kind of multiprocessor architecture can scale performance and availability at the same time?

The term "availability" here really means "application availability", not "system availability".

"System availability" is not hard to achieve, since one can automatically isolate faulty components by failing the running application. Building a scalable multiprocessor architecture that can keep applications running while tolerating computing or communication component failures is necessary but much harder.

Ironically, almost three decades later, the same question still troubles supercomputer designers and parallel programmers.

Methodology

SPPM was designed to deliver high performance and high application availability at the SAME time, leveraging commodity processors and communication components. Its built-in load balancing and fault tolerance measures deliver application high availability at very low cost.

The SPPM operating principles include "dataflow computing", identical to the early dataflow machines, and "statistic multiplexed computing" (SMC), identical in spirit to packet-switching networks but applied at the higher level of distributed computing. There are only two kinds of application programs: stateless workers and stateful masters.

A stateless worker repeats a simple cycle: get a work assignment and compute it. The name "stateless" refers to the use of Statistic Multiplexed Computing principles, which provide runtime mechanisms to automatically discard partial results and failures and to ensure reliable once-only delivery of arbitrary services.

A stateful master is responsible for distributing work assignments and collecting the results. Compute-intensive (iterative or recursive) applications can be accelerated using multiple workers managed by appropriate masters.
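
To make this structure concrete, here is a minimal Python sketch of the master/worker pattern over a tuple space. The TupleSpace class, its put/get calls, and the granularity parameter g are illustrative assumptions made for this sketch, not the actual SPPM or Synergy interfaces.

```python
# Minimal sketch of the SPPM/Synergy master-worker pattern over a tuple space.
# The TupleSpace class and its put/get calls are hypothetical stand-ins for
# the real runtime; only the structure (stateless workers, stateful master)
# is the point of the example.
import queue
import threading

class TupleSpace:
    """Toy in-process tuple space: one FIFO queue per tag."""
    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _q(self, tag):
        with self._lock:
            return self._queues.setdefault(tag, queue.Queue())

    def put(self, tag, value):          # publish a tuple
        self._q(tag).put(value)

    def get(self, tag):                 # blocking retrieval (tuple is consumed)
        return self._q(tag).get()

def worker(space):
    """Stateless worker: get a work tuple, compute it, put back a result tuple."""
    while True:
        item = space.get("work")
        if item is None:                # poison pill: shut down
            return
        start, rows, b = item
        partial = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                   for row in rows]
        space.put("result", (start, partial))

def master(space, a, b, g, n_workers):
    """Stateful master: distribute row blocks of size g, collect partial results."""
    tasks = 0
    for start in range(0, len(a), g):   # coarse-grain work units
        space.put("work", (start, a[start:start + g], b))
        tasks += 1
    c = [None] * len(a)
    for _ in range(tasks):              # results may arrive in any order
        start, partial = space.get("result")
        c[start:start + len(partial)] = partial
    for _ in range(n_workers):
        space.put("work", None)         # tell workers to stop
    return c

if __name__ == "__main__":
    ts = TupleSpace()
    a, b = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
    threads = [threading.Thread(target=worker, args=(ts,)) for _ in range(2)]
    for t in threads:
        t.start()
    print(master(ts, a, b, g=1, n_workers=2))   # [[19, 22], [43, 50]]
    for t in threads:
        t.join()
```

Because a worker keeps no state between cycles, a crashed worker can simply be replaced and its lost assignment re-issued; this is the property that the shadow-tuple and checkpointing mechanisms described below build on.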

The SPPM architecture also supports a low-overhead fault tolerance protocol for protecting against master failures.

Putting all of this together, the SPPM architecture reveals a simple design principle based on the generalized use of SPP concepts: it is possible to build a highly reliable, high performance multiprocessor by minimizing unnecessarily exposed states in both computing and communication components.

According to the impossibility studies, only decoupled program-program coordination APIs can ensure that the resulting application delivers reliable services using unreliable components.

Tightly coupled APIs and explicit parallelism are only suitable for small scale systems.

The move from explicit parallelism to implicit parallelism is analogous to the migration from circuit switching to packet switching in communication systems; today's packet-switching systems deliver high performance and high availability at the same time, even though an individual transfer can be slower than on the older circuit-switched systems.

We believe that this type of architecture is a blueprint for future cloud computing systems and multiprocessors in general.



Synergy

Motivation

Synergy was constructed to experiment with dataflow parallel processing concepts using networked distributed memory computers. 

Methodology

We took a drastically different approach from the Linda proposal by building a floating tuple space daemon that can be designated to run on any host in the network. The idea was to gain load balancing and fault tolerance benefits at the cost of additional communication overhead.

Implementation and Usage

Over the years, Synergy has become an early prototype implementation of SPPM. Synergy 3.0 is a Unix implementation supporting Fedora 5 and Solaris 10. The current Solaris release only supports "worker" fault tolerance; the Fedora 5 release needs a kernel patch to support full fault tolerance.

Since 1995, Synergy has been used in CIS graduate and undergraduate courses on Scripting for Sciences and Business, Programming Techniques, Algorithm Design, Operating Systems, and Distributed and Parallel Processing.

Downloads

Here is the direct link to the GitHub release of Synergy 3.0+.



PML (Parallel Markup Language)

Motivation

At the time of this writing, almost the entire HPC (High Performance Computing) community has given up on the feasibility of "parallel compilers."

This conclusion is no surprise if we continue to use the message-passing interface as the only communication and synchronization mechanism for parallel processing.

The fundamental problem with tightly coupled program-program coordination APIs, such as MPI and OpenMP, is that they force the runtime system to generate fixed program-processor bindings. Each such binding creates an impossibility case, as proved by Alan Fekete and others in 1993. With the traditional parallel compiler approach, generating parallel code using MPI requires the compiler to produce precise interface calls derived from dependency analysis, which in general is not practically possible.

Tightly coupled APIs also make dynamic load balancing impossible. These difficulties impede effective processing on extreme-scale heterogeneous processors.

The current HPC programming model requires expertise in

  • domain knowledge
  • programming skills
  • architecture-specific parallel processing skills

This model is clearly not economically viable.

A more reasonable approach seems to be to develop markup languages that facilitate automatic parallel program generation from sequential sources, once we master the fundamental requirements for sustainable high performance computing. This is a very different approach from the traditional "either performance or reliability" efforts.

Methodology

The use of the dataflow parallel model (SPPM) shifts the difficulties. In the early 1970s, the MIT Id compiler clearly demonstrated the feasibility of automatically generating data parallel programs. The problem was performance. SPPM addresses the performance issue with a coarse-to-fine granularity exploitation approach. Even with the added overheads for fault tolerance, it can still compete comfortably with MPICH2 without fault tolerance (see Computational Results).

Specifically, we leverage the power of XML by developing a simple Parallel Markup Language (PML) with 8 tags(*) that mark up explicit DATA DEPENDENCIES, thus avoiding the difficulties of dependency analysis. In comparison, all other markup language efforts were designed to "expose parallelism"; their tag processors (compilers) are expected to translate the exposed parallelism into direct communication calls, a rather difficult task indeed.

Using our approach, we can show that it is indeed possible to generate efficient parallel code directly from serial code, using the 7 tags for distributed-memory machines and one tag for shared-memory multi-core processors (Feijian Sun's thesis and Ian F. Davis, private communication).

(*) There are 7 tags for distributed-memory clusters, as studied in Feijian Sun's dissertation. An 8th tag is needed for shared-memory multi-core processors (not yet fully developed).
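
As a purely illustrative example, the sketch below annotates a serial matrix-multiply loop with dependency markers written as comments; the tag names are hypothetical stand-ins, not the actual PML tags defined in Feijian Sun's dissertation. The programmer declares only what each block reads and writes and how the output may be partitioned; a generator can then emit master/worker code of the kind sketched in the SPPM section.

```python
# Hypothetical illustration of dependency markup on serial code.
# The "tags" below are NOT the real PML tags; they only convey the idea of
# declaring data dependencies instead of exposing parallelism.

def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # <parallel-region>
    #   <read  shared="a, b"/>           a and b are read-only in the loop
    #   <write partitioned="c" by="i"/>  each i writes a disjoint slice of c
    for i in range(n):                   # with the declared dependencies, the
        for j in range(p):               # i-loop can be split into coarse-grain
            s = 0.0                      # work tuples by a code generator
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    # </parallel-region>
    return c
```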

Computational Results

The following table lists the elapsed times for multiplying two matrices of different sizes using PML-generated code with worker fault tolerance, hand-crafted Synergy code with worker fault tolerance, MPICH2 code without fault tolerance, and sequential code. The tests were conducted using 2-4 Blade 500 workstations (500 MHz with 512 MB memory) running Solaris 10. The network was half-duplex 10 Mbps Ethernet.

Nodes   Size   PML          Synergy       MPICH2   Seq.
2       600    6(G=20)      5(G=50)       5.12     8.9
2       800    15(G=40)     12.2(G=200)   11.87    21.6
2       1000   29(G=63)     23.4(G=63)    22.86    42.4
2       1600   119(G=20)    95(G=100)     95       181.7
2       2000   231(G=40)    187.6(G=75)   186      358.7
4       600    4.5(G=16)    3.6(G=16)     3.3      8.9
4       800    10(G=12)     7.8(G=12)     7.2      21.6
4       1000   17.3(G=13)   14.1(G=13)    13.4     42.4
4       1600   66.7(G=23)   53(G=23)      53       181.7
4       2000   127(G=20)    101.2(G=21)   100      358.7

(G = processing granularity.)

The PML-generated fault-tolerant code performed competitively against the Synergy code (with fault tolerance) and the MPI code (without fault tolerance).



Light-Weight Fault Tolerant Protocol (LWFTP)

Motivation

Support for "worker" fault tolerance in SPPM is easily accomplished by implementing "shadow tuples". For an application that requires more than 3 days to run, it is easy to see that master fault tolerance is also unavoidable.
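
The sketch below illustrates the shadow-tuple idea; the class, its names, and the timeout policy are assumptions made for illustration, not the actual SPPM implementation. When a worker takes a work tuple, the space keeps a shadow copy; if the matching result does not arrive in time, the shadow is re-queued so another worker can redo the lost assignment, and late duplicate results are discarded.

```python
import time

class ShadowedWorkPool:
    """Illustrative shadow-tuple bookkeeping (not the actual SPPM code).
    Work handed to a worker is remembered as a 'shadow' until its result
    arrives; shadows that time out are re-queued for other workers."""

    def __init__(self, timeout_s=30.0):
        self.pending = []        # work tuples not yet taken
        self.shadows = {}        # work_id -> (work, time_taken)
        self.timeout_s = timeout_s

    def add_work(self, work_id, work):
        self.pending.append((work_id, work))

    def take_work(self):
        """Hand out a work tuple and keep a shadow copy of it."""
        self.requeue_lost_work()
        if not self.pending:
            return None
        work_id, work = self.pending.pop(0)
        self.shadows[work_id] = (work, time.monotonic())
        return work_id, work

    def complete(self, work_id):
        """A result arrived: drop the shadow. Late duplicates are discarded."""
        return self.shadows.pop(work_id, None) is not None

    def requeue_lost_work(self):
        """Re-queue shadows whose worker is presumed dead (timed out)."""
        now = time.monotonic()
        for work_id, (work, taken) in list(self.shadows.items()):
            if now - taken > self.timeout_s:
                del self.shadows[work_id]
                self.pending.append((work_id, work))
```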

This becomes especially important for large-scale HPC systems, where the MTBF (Mean Time Between Failures) has come down to less than 3 days for a cluster of 1,024+ nodes.

Methodology

LWFTP is a checkpoint-restart (CPR) coordination protocol implemented for the SPPM architecture that tolerates multiple master failures (Yijian Yang's thesis). It leverages the worker fault tolerance feature of the SPPM infrastructure to drastically reduce both spatial and temporal overheads. A significant result is the solution for seamless process migration with multiple live sockets.
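
The sketch below shows only the checkpoint-restart skeleton on the master side, assuming a simple pickled snapshot of the master's progress; the actual LWFTP protocol coordinates checkpoints with the tuple space and additionally migrates processes with live sockets, which is not modeled here.

```python
import os
import pickle
import time

CHECKPOINT = "master.ckpt"   # hypothetical checkpoint file name

def save_checkpoint(state):
    """Atomically persist the master's progress. Workers are stateless,
    so only this small record needs to be saved."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)            # atomic rename

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_block": 0, "results": {}}

def run_master(blocks, compute_block, interval_s=60.0):
    """Resumable master loop: a restarted master continues from the last
    checkpoint, losing at most one checkpoint interval of master work."""
    state = load_checkpoint()
    last = time.monotonic()
    while state["next_block"] < len(blocks):
        i = state["next_block"]
        state["results"][i] = compute_block(blocks[i])   # stands in for
        state["next_block"] = i + 1                      # dispatch/collect
        if time.monotonic() - last >= interval_s:
            save_checkpoint(state)
            last = time.monotonic()
    save_checkpoint(state)
    return [state["results"][i] for i in range(len(blocks))]
```

Because the workers carry no state, only the master's progress record needs to be checkpointed, which is consistent with the reported reduction of the checkpoint size from P x 300 MB to 1 x 300 MB.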

Computational Results

Performance studies show that the proposed method dramatically reduces the checkpoint size of a parallel application from an average of P x 300 MB to an average of 1 x 300 MB. The overall checkpoint-restart overhead without processor failures is less than 5%. The overall checkpoint-recovery overhead with a single master processor failure and recovery is on average 25% of the total elapsed application running time (one checkpoint interval). The overhead with a single worker processor failure and recovery is less than 5% (one processing granule).

These tests were conducted on a cluster of Intel processors (2 GHz, 1 GB memory) connected by a 100 Mbps network.

Limitations of the current implementation include:

  • The recovered process ID will differ from the original. This does not impact system integrity under SPPM, but it would have a serious negative impact on direct-communication systems such as MPI/PVM and OpenMP.
  • Local files are not supported. All files must be accessible via the network.
  • UDP sockets are not supported.



High Performance Fault Tolerant Database Cluster 

Motivation

This research started with a very simple request from a graduate student in 1998: how can one replicate concurrent transactions to multiple database servers?

The intent at the time was to provide a failover capability leveraging multiple "hot-sync'd" databases. It quickly escalated into the pursuit of a highly reliable high performance database cluster architecture.

Methodology (U.S. Patent: #6,421,688 and other pending patents)

Leveraging the Stateless Parallel Processing principles, our research (in collaboration with Suntain Song) has identified three critical measures for resolving transaction replication and data clustering difficulties:

  • Dynamic query serialization
  • Dynamic and session-based query load balancing
  • Non-stop resynchronization

Using these methods, we have effectively restructured the traditional two-phase-commit protocol by separating it into a commit phase and a resynchronization phase. Synchronous (zero-loss) transaction replication can now proceed flawlessly at high speed without fear of data inconsistencies or transaction losses.
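
The sketch below shows how a gateway along these lines could behave; it illustrates the three measures above under simplifying assumptions (replicas are objects exposing an execute(sql) call) and is not the patented DBx implementation. Updates are serialized and applied to every live replica, reads are load-balanced across replicas, and a replica that falls behind is taken out of rotation and queued for non-stop resynchronization.

```python
import itertools
import threading

class ReplicatedGateway:
    """Illustrative DBx-style gateway sketch (not the actual middleware).
    Each replica is assumed to expose execute(sql) -> rows."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.live = list(replicas)                       # replicas in rotation
        self.resync_log = {id(r): [] for r in replicas}  # updates to replay
        self.write_lock = threading.Lock()               # dynamic serialization
        self._rr = itertools.cycle(range(len(replicas)))

    def read(self, sql):
        """Route a read-only query to one live replica (round-robin)."""
        while self.live:
            r = self.live[next(self._rr) % len(self.live)]
            try:
                return r.execute(sql)
            except Exception:
                self._mark_stale(r)
        raise RuntimeError("no live replica")

    def write(self, sql):
        """Serialize updates, apply them to every live replica, and queue
        them for stale replicas so they can catch up without downtime."""
        with self.write_lock:
            ok = 0
            for r in list(self.live):
                try:
                    r.execute(sql)
                    ok += 1
                except Exception:
                    self._mark_stale(r)
            for r in self.replicas:
                if r not in self.live:
                    self.resync_log[id(r)].append(sql)
            if ok == 0:
                raise RuntimeError("update failed on all replicas")

    def _mark_stale(self, replica):
        """Take a failed replica out of rotation; a background resynchronizer
        would replay its resync_log and then return it to self.live."""
        if replica in self.live:
            self.live.remove(replica)
```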

A more significant result is the overall performance scalability for heavy update queries on very large databases using a technique called "k-order shift mirroring".

These solutions are implemented in a single server program positioned between the database clients and a set of independent database servers. It is named the DBx database cluster middleware.

Computational Results

The following charts show the computational results with and without data partitioning.

Without horizontal partitioning, the basic DBX gateway performance characteristics are illustrated in Fig. 1, where CD (Computing Density) is defined as the number of calculations per query.

Fig.1. DBX Performance without Horizontal Partition

Here are 42 example calculations of the expected performance for different scenarios without data partitioning:

                    Speedup (90% Read)            Speedup (50% Read)
Cluster Size    CD=1    CD=20   CD=200        CD=1     CD=20   CD=200
2               0.60    1.30    1.90          (0.20)   0.60    1.10
3               0.90    1.95    2.85          (0.30)   0.90    1.65
4               1.20    2.60    3.80          (0.40)   1.20    2.20
5               1.50    3.25    4.75          (0.50)   1.50    2.75
6               1.80    3.90    5.70          (0.60)   1.80    3.30
7               2.10    4.55    6.65          (0.70)   2.10    3.85
8               2.40    5.20    7.60          (0.80)   2.40    4.40

(Values in parentheses are negative, i.e., slowdowns.)

For a DBX cluster of 4 SQL Servers, an application with 90% reads can expect a speedup of 4 x 0.30 = 1.2 times for CD=1, 4 x 0.65 = 2.6 times for CD=20, and 4 x 0.95 = 3.8 times for CD=200. In comparison, systems using Log Shipping (the native SQL Server replication tool) would only yield negative performance regardless of read ratio. Note that for applications with heavy updates, DBX performance can also be negative if CD is low.
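
The tabulated values are consistent with a simple linear model, speedup = cluster size x per-server efficiency, where the efficiency depends on the read ratio and CD (for 90% reads: roughly 0.30, 0.65, and 0.95 for CD=1, 20, and 200). The snippet below reproduces both columns of the table under that assumption; the model itself is an inference from the table, not a formula taken from the DBx documentation.

```python
# Reproduce the speedup table above under the assumption that
# speedup = cluster_size * per-server efficiency (efficiencies inferred
# from the table / Fig. 1; they are not official DBx parameters).
EFF_90_READ = {1: 0.30, 20: 0.65, 200: 0.95}    # CD -> efficiency
EFF_50_READ = {1: -0.10, 20: 0.30, 200: 0.55}

def speedup(cluster_size, cd, eff):
    return round(cluster_size * eff[cd], 2)

for n in range(2, 9):
    row90 = [speedup(n, cd, EFF_90_READ) for cd in (1, 20, 200)]
    row50 = [speedup(n, cd, EFF_50_READ) for cd in (1, 20, 200)]
    print(n, row90, row50)   # e.g. 4 [1.2, 2.6, 3.8] [-0.4, 1.2, 2.2]
```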

Fig. 2. DBX Performance with Horizontal Partition

Fig. 2 illustrates the effects of DBX horizontal partitioning. For a cluster of 4 SQL Servers, even an application with only 5% reads can expect a speedup of 4 x 0.6 = 2.4 times. Different CD values have little impact on the overall performance. To reap these performance benefits, the application must be re-engineered to accommodate the (changing) partitioned datasets.

The following 42 scenarios are for applications with manually partitioned data tables.

                    Speedup (90% Read)            Speedup (50% Read)
Cluster Size    CD=1    CD=20   CD=200        CD=1    CD=20   CD=200
2               1.96    1.96    1.98          1.70    1.74    1.78
3               2.94    2.94    2.97          2.55    2.61    2.67
4               3.92    3.92    3.96          3.40    3.48    3.56
5               4.90    4.90    4.95          4.25    4.35    4.45
6               5.88    5.88    5.94          5.10    5.22    5.34
7               6.86    6.86    6.93          5.95    6.09    6.23
8               7.84    7.84    7.92          6.80    6.96    7.12

A more recent study by F. Xiong showed that it is possible to achieve unlimited scalability by separating the number of partitions (P) from the degree of redundancy (R). With the DBx technology, we may need to rethink the design of industry-standard benchmarks, such as those proposed by www.tpc.org.


 

 

 

Copyright (c) 2007-15. Justin Y. Shi. All Rights Reserved.