The Achilles Heel in Service Infrastructure Programming
Modern service infrastructures provide essential services to users, employing multiple networked physical or virtual servers. To program client-server interactions, a number of client-server programming paradigms were created, including RPC (remote procedure call), MPI (message passing interface), RMI (remote method invocation), and shared memory (OpenMP). Many other models have also been proposed, such as the Actor Model, Bulk Synchronous Parallel, Communicating Sequential Processes, Circuits, Dataflow, Functional, the LogP Machine, Parallel Random Access Machine, SPMD PGAS, Go, Rust, etc. The problem is that using these models for service programming injects an unavoidable vulnerability into the infrastructure: single-point failures. A crash failure of any physical or virtual server can bring down the entire service. These are the Achilles heels of modern service infrastructures. They are also the root cause of scalability challenges.
Since all physical and virtual devices have finite service lives and face unpredictable natural and potentially hostile interruptions, the minimal requirement for software and data safety is the complete decoupling of programs and data from physical and virtual devices. For program decoupling, the service infrastructure must be able to execute the same functions from multiple servers without ambiguity. For data decoupling, synchronous data (and transaction) replication must be supported. This simple requirement places all existing programming models and paradigms into the "legacy" category. A paradigm shift seems unavoidable.
Once programs and data are decoupled from physical and virtual devices, the sufficient condition for delivering best-effort performance and fault tolerance is real-time, network-scale resource multiplexing. We call this programming paradigm shift and runtime optimization Statistic Multiplexed Computing and Communication, or SMC2. It is a natural extension of the statistic multiplexed communication protocols that enabled the infinitely scalable Internet.
The Myth of Infinite Scalability
Amdahl's Law (1967) has fueled a long-standing myth about parallel program scalability predictions. The myth deepened after Gustafson's 1988 paper, now called Gustafson's Law. In Amdahl's Law, a single serial percentage S represents the percentage of instructions in a program that cannot be run in parallel, computed with respect to a single-processor run. The speedup is Speedup = 1/(S + (1-S)/P), where P is the number of processors. Taking P -> infinity yields Speedup = 1/S. For decades, Amdahl's Law seemed to imply the economic law of diminishing returns, since even a small S can bring the speedup down dramatically.
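To make the diminishing-return effect concrete, here is a small illustrative sketch (the function name is mine, not from any published code) that evaluates Amdahl's formula for a fixed serial fraction:

```python
def amdahl_speedup(s, p):
    """Amdahl's Law: speedup with serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# Even a 5% serial fraction caps the speedup at 1/0.05 = 20,
# no matter how many processors are added.
for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.05, p), 2))
```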
In 1988, John Gustafson and colleagues published a paper reevaluating Amdahl's Law. They proposed a different serial fraction, S': the percentage of work that cannot be run in parallel, measured under P processors (at Sandia National Laboratories they had an nCUBE machine with 1024 processors). S' is computed from the measured sequential and parallel elapsed times of a run on P processors. To estimate the delivered speedup, Gustafson proposed Speedup = S' + (1-S')*P, meaning that with the parallel elapsed time normalized to 1, the estimated sequential time is S' + (1-S')*P. It looks as if taking P -> infinity would imply infinite speedup. Gustafson's argument was that engineers typically achieve this by solving bigger problems that leverage the already parallelized parts. However, this scaling step cannot be justified mathematically, since scaling P changes S'.
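The scaled-speedup formula can be evaluated just as directly; this sketch (again, my own naming) shows how the estimate grows linearly with P when the measured S' is held fixed:

```python
def gustafson_speedup(s_prime, p):
    """Gustafson's Law: scaled speedup with measured serial fraction
    s_prime (relative to the parallel elapsed time) on p processors."""
    return s_prime + (1.0 - s_prime) * p

# With the parallel elapsed time normalized to 1, the estimated
# sequential time -- and hence the speedup -- grows linearly in P.
for p in (10, 100, 1024):
    print(p, gustafson_speedup(0.05, p))
```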
In 1996, I accidentally discovered that S and S' are mathematically related: S = S'/(S' + (1-S')*P) Link. It was still not clear, however, why the limits of the two laws are so different.
The Amdahl's Law myth goes away if we recalculate S from S' using this formula as we scale P Link. This revelation implies that every parallel program can indeed scale indefinitely if the problem size is not fixed; otherwise, the law of diminishing returns applies. Incidentally, the TOP500 supercomputer competition follows exactly this discipline, allowing manufacturers to compete on solving the biggest Linpack problem Link.
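A quick numerical check (a sketch using the same notation as above) confirms both points: converting a measured S' into Amdahl's S reproduces Gustafson's scaled speedup exactly, and holding S' fixed while P grows keeps shrinking the equivalent S, so the speedup is unbounded.

```python
def s_from_s_prime(s_prime, p):
    """Convert Gustafson's measured serial fraction S' into Amdahl's S."""
    return s_prime / (s_prime + (1.0 - s_prime) * p)

for p in (10, 100, 1024, 10**6):
    s = s_from_s_prime(0.05, p)
    amdahl = 1.0 / (s + (1.0 - s) / p)       # Amdahl's Law at the converted S
    gustafson = 0.05 + (1.0 - 0.05) * p      # Gustafson's scaled speedup
    # The two agree, and the converted S shrinks toward 0 as P grows,
    # so the speedup keeps growing with P when S' stays fixed.
    print(p, round(s, 8), round(amdahl, 2), round(gustafson, 2))
```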
Backup/Restore: 1+1 < 1
Insanity: doing the same thing over and over again and expecting different results. --Albert Einstein.
For centralized data stores and databases, the current industry standard is backup/restore, performed either asynchronously or synchronously. Since synchronous replication involves the two-phase commit protocol, which is slow and error-prone, most production systems today rely on asynchronous backup/restore to defend against a single-point failure of the primary server. Unlike synchronous replication, asynchronous replication cannot eliminate arbitrary data losses. In reality, a backup/restore system delivers less than a single server's performance, since it must carry the replication overhead, and less than a single server's availability, since a failure of either the primary or the backup server demands a service shutdown. The probabilities are stacked against us.
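The "1+1 < 1" arithmetic can be illustrated with a back-of-the-envelope calculation (a sketch under the assumption stated above: the pair must be shut down whenever either server is down, and servers fail independently):

```python
# Assume each server is independently up with probability A (its availability).
A = 0.999

single_server = A
# If a primary/backup pair must be taken down when either machine fails,
# the pair is fully operational only when both servers are up:
primary_plus_backup = A * A

print(single_server)        # 0.999
print(primary_plus_backup)  # 0.998001 -- less than one server alone
```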
For compute-intensive applications, checkpoint/restart carries similar overheads, with substantial energy waste given the large number of processors involved. These checkpoint overheads can be eliminated entirely once we clearly understand how to conduct client retransmission protocols.
Load Balancing
For parallel computing, load balancing is a difficult task. Even when the processors are homogeneous, the delivered performances vary widely. The culprit is the shared communication channel(s): the communication load is not shared homogeneously. Therefore, delivering optimal parallel performance requires ensuring that all parallel tasks terminate at the same time with minimal communication overhead. This, in turn, requires tuning the task sizes without recompiling the programs. A sketch of such a runtime-tunable granularity follows.
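One way to make task sizes tunable without recompiling is to read the granularity from the runtime environment. The sketch below is my own illustration, not part of any SMC2 implementation; the environment variable name and the per-task computation are hypothetical placeholders:

```python
import os
from multiprocessing import Pool

def process(chunk):
    # Placeholder for the real per-task computation.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    # Task granularity is a runtime knob, not a compile-time constant,
    # so it can be tuned until all workers finish at about the same time.
    grain = int(os.environ.get("GRAIN_SIZE", "1000"))
    work = list(range(1_000_000))
    chunks = [work[i:i + grain] for i in range(0, len(work), grain)]
    with Pool() as pool:
        print(sum(pool.map(process, chunks)))
```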
The parallel program load balancing problem is analogous to the Brachistochrone problem in mathematics and physics: finding the optimal descent from a high point to the ground by balancing the normal force and gravity Link. Computational experiments demonstrated identical behavior: repeated (cycloid) optimal points in task granularity.
Impossibility of Reliable Failure Detection
Reliable failure detection is theoretically impossible. The only next possible method is timeout discipline. As all existing programming paradigms and models assume all components and the network are reliable; the timeout retransmission discipline is absent. The TCP/IP packet retransmission discipline cannot protect the applications if either the sender or the receiver crashes arbitrarily. This was proved formally in 1993 Link. Therefore, each application must provide end-to-end acknowledgements to ensure the correct communication patterns. This simple requirement seems very hard to implement in existing programming paradigms.
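A minimal sketch of the end-to-end discipline might look like the following (the names and message format are hypothetical; it assumes the request carries a request ID so the receiver can deduplicate retries):

```python
import socket

def call_with_retransmission(server, request, request_id,
                             timeout=2.0, max_retries=5):
    """Resend the request until an application-level acknowledgement
    arrives; TCP alone cannot confirm the receiver actually processed it."""
    for attempt in range(max_retries):
        try:
            with socket.create_connection(server, timeout=timeout) as s:
                s.sendall(f"{request_id}:{request}".encode())
                reply = s.recv(4096)        # end-to-end acknowledgement
                if reply:
                    return reply
        except (socket.timeout, OSError):
            pass                            # sender/receiver may have crashed: retry
    raise RuntimeError("no acknowledgement after retries")
```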
Arbitrarily Reliable Systems
Building arbitrarily reliable systems using unreliable components was first proposed by von Neumann in 1952 Link. The TCP/IP protocols deployed the statistic multiplexed communication principle to enable today's infinitely scalable Internet. SMC2 is a simple extension of the TCP/IP protocols for network-scale computing services.
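For intuition on von Neumann's argument, here is a sketch under the assumption of independent failures, where any one of n interchangeable replicas can serve a request: statistic multiplexing over replicas drives availability toward 1, in contrast to the primary/backup pair above.

```python
# Assume each replica is independently up with probability A, and the
# service survives as long as at least one replica is up.
A = 0.9
for n in (1, 2, 3, 5):
    available = 1.0 - (1.0 - A) ** n
    print(n, available)   # 0.9, 0.99, 0.999, 0.99999
```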