Shadow Computing: A Scalable and Energy-aware Computational Model for Resiliency at Scale

Taieb Znati
Professor and Chair
Computer Science Department, University of Pittsburgh
Location: SERC 306
Date: Wednesday, April 19, 2017 - 11:00
As our reliance on IT continues to increase, future applications will involve processing massive amounts of data and will require an exascale computing infrastructure to support increases in parallelism of several orders of magnitude. As technology continues to improve, two emerging trends will shape the next generation of exascale computing infrastructure: (1) the number of computing, communication, and storage elements will continue to grow dramatically; and (2) the widening gap between the speeds of microprocessors and those of the memory and storage hierarchy will mandate the incorporation of new classes of high-density, low-latency, low-power non-volatile memory, such as Phase Change Memory (PCM), into that hierarchy.

A direct implication of these trends is that the rate of failures in future cloud computing systems will increase dramatically, making resiliency a major concern for exascale infrastructures supporting compute- and data-intensive applications. Unfortunately, current approaches to resilience, which rely on automatic or application-level checkpoint-restart, are not feasible in failure-prone computing environments, as the time for checkpointing and rollback recovery is likely to exceed the mean time to failure. Addressing this shortcoming goes beyond adapting or optimizing well-known and proven techniques; it calls for radically new approaches to fault tolerance in exascale computing infrastructures.

The objective of this presentation is to explore innovative and scalable fault-tolerance mechanisms that, when integrated, lead to efficient solutions for a “tunable” resiliency that takes into account the nature of the data and the requirements of the application. The focus will be on the design of an integrated framework that achieves high resiliency through a new energy- and computation-aware approach to checkpointing, together with scalable mechanisms that ensure high levels of data availability in a failure-prone environment.
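The infeasibility argument can be made concrete with a back-of-the-envelope calculation. The sketch below uses the first-order Young/Daly approximation of the optimal checkpoint interval to show how machine utilization collapses as the system mean time between failures (MTBF) approaches the checkpoint cost; the 30-minute checkpoint cost and the MTBF values are illustrative assumptions, not figures from the talk.

```python
import math

def optimal_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def utilization(checkpoint_cost: float, mtbf: float) -> float:
    """Fraction of machine time spent on useful work (restart cost ignored)."""
    tau = optimal_interval(checkpoint_cost, mtbf)
    # Expected overhead per interval: one checkpoint, plus an average of
    # tau/2 of lost rework for intervals cut short by a failure, which
    # happens with probability roughly tau / mtbf.
    wasted = checkpoint_cost + (tau / mtbf) * (tau / 2.0)
    return max(0.0, 1.0 - wasted / tau)

# Assumed checkpoint cost for a large memory footprint: 30 minutes.
C = 30.0
for mtbf in (24 * 60.0, 6 * 60.0, 60.0, 30.0):  # system MTBF, in minutes
    print(f"MTBF {mtbf:6.0f} min -> utilization {utilization(C, mtbf):.0%}")
```

Under this model, utilization falls from roughly 80% at a one-day MTBF to 0% once the MTBF shrinks to the checkpoint cost itself, which is precisely the regime the abstract anticipates for exascale systems.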