Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Surendar Chandra
Member of Technical Staff
Datrium, Inc
SERC 306
Friday, February 20, 2015 - 11:00
Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Redundancy mechanisms such as RAID are built on the assumption that failures are independent and instantaneous and that the time to failure is exponentially distributed. Recent work has shown these assumptions to be false. Unfortunately, improvements such as RAID-6 still rest on assumptions that reflect a limited understanding of disk fault modes. Combinations of multiple disk faults and latent sector errors can overcome RAID protection.
To address this knowledge gap, we collected and analyzed disk error logs from EMC backup storage systems, covering about 1 million SATA disks over 60 months. We show that many disks fail at a similar age and that the frequency of sector errors keeps increasing on working disks. Ensuring data reliability in the worst case requires adding considerable extra redundancy, making the traditional passive RAID approach impractical. By studying numerous types of disk errors, we show that a large number of reallocated sectors indicates a high probability of imminent whole-disk failure or, at a minimum, a burst of sector errors.
With these findings, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. We also designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures. Our system can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
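
As a rough illustration of the joint failure probability idea, the sketch below computes the chance that at least three disks in a RAID group fail, given a per-disk failure probability. The abstract does not describe the actual method, so everything here is an assumption made for illustration: the function name, the example probabilities, and in particular the simplification that disks fail independently once each disk's own health has been accounted for.

```python
def prob_at_least_k_failures(probs, k):
    """Probability that at least k disks fail, given per-disk failure
    probabilities. Assumes (hypothetically) that failures are independent
    once each disk's individual health is reflected in its probability."""
    # dist[j] holds the probability that exactly j of the disks
    # processed so far have failed.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


# Hypothetical 14-disk RAID-6 groups; RAID-6 tolerates two failures,
# so a third failure means data loss.
healthy = [0.01] * 14
degraded = [0.01] * 11 + [0.30, 0.40, 0.50]  # three disks flagged as unhealthy

print(prob_at_least_k_failures(healthy, 3))
print(prob_at_least_k_failures(degraded, 3))
```

Ranking RAID groups by this value is one way to decide which groups to service first. A group with several moderately unhealthy disks can rank above a group with a single disk that is very likely to fail, which per-disk monitoring alone would not reveal.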