CIS 4307: File Structure - Stable Storage

Data and Index Blocks

Suppose that a file is 1MB and a block is 8KB. Then the file will have 128 blocks. The question is: how do we keep track of these 128 data blocks of the file. We can do this by using index blocks (also called indirect blocks) that contain pointers to other index and data blocks. In Unix the inode is 64 or 128 bytes and contains 13 pointers (only 12 in older systems). The first 10 pointers are to data blocks. The 11th pointer points to a single indirection index block, i.e. an index block with pointers to data blocks. The 12th pointer points to a double indirection index block, i.e. an index block that points to index blocks that point to data blocks. The 13th pointer points to a triple indirection index block. If we assume that a block is 8KB and a pointer is 8B, then an index block will contain 1K pointers. The maximum size of a file will be 8KB*(10 + 2**10 + 2**20 + 2**30), that is more than 8TB. [Different Unix implementation may have different values.]

Some observations:

Stable Storage

We would like the following to hold when we write a value B to a disk sector s currently containing the value A: after the write operation the sector contains either A or B, independently of failures. Persistent storage where this property holds is called Stable Storage. It is highly desirable because it means that the write operation is atomic. Unfortunately disks satisfy instead a weaker property, the Weakly-Atomic Property: After the write operation the sector will contain a value that is either A, B, or such that when it is read it gives a read error. We want to show how, given disks that satisfy the weakly-atomic property we can build stable storage.

We consider two disks D1 and D2 that are mirror images (RAID-1) of each other, i.e. they have the same number of equally numbered sectors. Corresponding sectors are intended to have the same content. Here is how we write a value B to a sector s that has currently value A:

 1
    REPEAT
 2      write B to sector s on D1;
 3      read sector s from D1;
 4  UNTIL value read is without error;
 5
    REPEAT
 6      write B to sector s on D2;
 7      read sector s from D2;
 8  UNTIL value read is without error;
 9

We have numbered significant locations in the code. At 1 disks D1 and D2 have the same value A at sector s. At 2, 3, 4, and 5 D2 has value A at s. At 5 D1 has B at s and D2 has A at s. At 6, 7, and 8 D1 at s has value B. Reasoning similarly we see that at 9 we will have value B at s in both D1 and D2. Thus if all goes well we have succeeded in making D1 and D2 mirrors of each other. The problem is when there is a failure at an intermediate point after 1 and before 9. Here is the procedure to follow after a failure.

 10 read sector s from D1;
    if read error then {
      REPEAT
        read value of sector s from D2 and write to sector s on D1;
        read sector s from D1;
      UNTIL value read is without error;
      return;}
 11 read sector s from D2;
    if read error then {
      REPEAT
        read value of sector s from D1 and write to sector s on D2;
        read sector s from D2;
      UNTIL value read is without error;
      return;}
    Compare values read from D1 and D2;
    if different then 
      REPEAT 
        write value read from s on D1 to s on D2;
        read sector s from D2;
      UNTIL value read is without error;
If there is a failure during this procedure, then this procedure is called again. Note that if the read operation in statement 10 fails it must be because we had a failure writing to D1, thus D2 contains the "true" value (the old value). If instead the read operation in statement 11 fails it must be because we had a failure writing D2, thus D1 contains the "true" value (the new value).

Note that a write operation to stable storage takes (at least) two times the time required for a write to normal storage. This is so since the writes to the two copies must be done serially.

Directly Attached Storage (DAS), Network Attached Storage (NAS), Storage Area Network (SAN)

Disk storage may be made available to computer systems in a variety of ways. The traditional one is Directly Attached Storage (DAS): the disk is placed on a bus of a computer system and it is accessible exclusively by it. Access to the disk is done using some protocol such as various forms of SCSI or ATA-IDE. These protocols allow access to storage at the block level. They suffer of a number of limitations: scalability (limit on number of disks that can be attached directly), limits on distance disk-system (tens of meters), no hot swap (addition/removal while system is running), backup (it shares bus) and reliability (distance limitation) constraints.

Storage Area Networks (SAN) and Network Attached Storage (NAS) are intended to remedy the limitations of DAS. A SAN is a separate network of storage nodes dedicated to the storage function, which can be accessed and shared from a variety of computer systems thru special protocols like fiber channel. In a SAN, as in DAS, access to storage is made at the block level. Network Attached Storage is a configuration of storage nodes, seen as a file system and accessed using a high level protocol like NFS. It is implemented using the IP protocol.

ingargio@joda.cis.temple.edu