CIS 4307: File Structure - Stable Storage
Suppose that a file is 1MB and a block is 8KB. Then the file will have
128 blocks.
The question is: how do we keep track of these 128 data blocks of
the file. We can do this by using index blocks (also called
indirect blocks) that contain
pointers to other index and data blocks. In Unix the inode
is 64 or 128 bytes and contains
13 pointers (only 12 in older systems).
The first 10 pointers are to data blocks. The 11th pointer
points to a single indirection index block, i.e. an index block with
pointers to data blocks. The 12th pointer points to a double indirection
index block, i.e. an index block that points to index blocks that point to
data blocks. The 13th pointer points to a triple indirection index block.
If we assume that a block is 8KB and a pointer is 8B, then an index block will
contain 1K pointers. The maximum size of a file will be
8KB*(10 + 2**10 + 2**20 + 2**30), that is more than 8TB. [Different
Unix implementation may have different values.]
Some observations:
- Access time is different at different locations in a file.
The first 10 blocks are accessed with a single read (the pointers are in
main memory where the inode is brought when the file is opened). The next
1K blocks require up to two reads, one for the index block, one for the
data block.
The next 1M blocks require up to three reads, and the next 1G blocks require
up to four reads. Clearly access time is good at the beginning of the file
and gets worse and worse as we move towards its end.
- One needs to have a clear idea of how addressing is done, i.e.
how to determine, given an address, which blocks to
read and how to compute displacements.
For example, if we seek for byte 100,000,000 in a file, we need to go
to the 12,207 block, and then look for the 256th byte. We skip the first
10 pointers (each to a block) thus remaining with 12197 to go. We skip the
11th pointer (which points to a single indirection index block with 1024
pointers to data blocks) thus skipping 1024 blocks
remaining with 11173 blocks.
We are now at the
12th inode pointer. From the 12th inode pointer we go to a double indirection
index block.
Here each pointer counts as 1K blocks. Thus we have to skip 10 pointers
(i.e. 10*1K=10,240 blocks, thus remaining with 933 blocks to go) and follow
the 11th pointer to a single indirection index block.
Here we go to the 933 pointer
and follow it to a data block where the 256th byte is what we are aiming for.
Clearly we can write a procedure to do this calculation and to allocate
index and data blocks if it is necessary for a write operation.
- The following is a legal sequence of operations: write one byte,
seek 1MB(from beginning of file), write another byte,
seek 1GB (from beginning of file) and write another byte.
The size of the resulting file is 1GB+1 but most of it is with
undefined content. One needs to understand the resulting structure of the file
in terms of its data and index blocks. In the example
we will probably have:
- The first inode pointer points to a block where the first byte is written.
- The next 9 inode pointers are null.
- The next inode pointer points to an index block with 1014 null pointers,
followed by a pointer to a data block where only the first byte is defined.
The remaining 9 pointers are null.
- The next inode pointer points to an index block where the first 1023
blocks are null, etc.
- The next inode pointer is undefined.
- File creation is a time consuming operation. We have to create an inode
in the inode list, we have to create an entry in the directory where the file
is located and, if the file is non empty, say it has the first byte, we need
to allocate and initialize a data block. In general when a file is modified
we may have to update directory, inode, index blocks, and data blocks. If these
data structure are far apart we may spend a lot of time in seeking operations.
- Directories in Unix are like regular files. The only difference is that
they contain file entries. Each file entry has two parts, the name of the file,
and the inode number of the file. The latter is used to index into the in disk
inode list and to retrieve the inode that describes the file. This inode will
be in the in memory inode map while the file is open.
Stable Storage
We would like the following to hold when we
write a value B to a disk sector s currently containing the value A:
after the write operation the sector contains either A or B,
independently of failures.
Persistent storage where this property holds is called
Stable Storage. It is highly desirable because it means that the write
operation is atomic. Unfortunately disks
satisfy instead a weaker property, the Weakly-Atomic Property:
After the write operation the sector will contain a value that is either
A, B, or such that when it is read it
gives a read error. We want to show how, given
disks that satisfy the weakly-atomic property we can build stable storage.
We consider two disks D1 and D2 that are mirror images (RAID-1) of each
other, i.e.
they have the same number of equally numbered sectors. Corresponding
sectors are intended to have the same content. Here is how we write
a value B to a sector s that has currently value A:
1
REPEAT
2 write B to sector s on D1;
3 read sector s from D1;
4 UNTIL value read is without error;
5
REPEAT
6 write B to sector s on D2;
7 read sector s from D2;
8 UNTIL value read is without error;
9
We have numbered significant locations in the code. At 1 disks D1 and D2
have the same value A at sector s. At 2, 3, 4, and 5 D2 has value A at s.
At 5 D1 has B at s and D2 has A at s. At 6, 7, and 8 D1 at s has value B.
Reasoning similarly we see that at
9 we will have value B at s in both D1 and D2. Thus if all goes well
we have succeeded in making D1 and D2 mirrors of each other.
The problem is when there is a failure at an intermediate point after 1 and
before 9. Here is the procedure to follow after a failure.
10 read sector s from D1;
if read error then {
REPEAT
read value of sector s from D2 and write to sector s on D1;
read sector s from D1;
UNTIL value read is without error;
return;}
11 read sector s from D2;
if read error then {
REPEAT
read value of sector s from D1 and write to sector s on D2;
read sector s from D2;
UNTIL value read is without error;
return;}
Compare values read from D1 and D2;
if different then
REPEAT
write value read from s on D1 to s on D2;
read sector s from D2;
UNTIL value read is without error;
If there is a failure during this procedure, then this procedure is
called again. Note that if the read operation in statement 10 fails it must be
because we had a failure writing to D1, thus D2 contains the "true" value
(the old value). If instead the read operation in statement 11 fails
it must be because we had a failure writing D2, thus D1 contains the "true"
value (the new value).
Note that a write operation to stable storage takes (at least) two times
the time required for a write to normal storage. This is so since the writes
to the two copies must be done serially.
Directly Attached Storage (DAS), Network Attached Storage (NAS), Storage
Area Network (SAN)
Disk storage may be made available to computer systems in a variety of ways.
The traditional one is Directly Attached Storage (DAS): the disk is placed on
a bus of a computer system and it is accessible exclusively by it. Access to
the disk is done using some protocol such as various forms of SCSI or ATA-IDE.
These protocols allow access to storage at the block level. They suffer of a
number of limitations: scalability (limit on number of disks that can be
attached directly), limits on distance disk-system (tens of meters),
no hot swap (addition/removal while system is running), backup (it shares bus)
and reliability (distance limitation) constraints.
Storage Area Networks (SAN) and Network Attached Storage (NAS) are intended to
remedy the limitations of DAS. A SAN is a separate network of storage nodes
dedicated to the storage function, which can be accessed and shared from a
variety of computer systems thru special protocols like fiber channel. In a
SAN, as in DAS, access to storage is made at the block level. Network Attached
Storage is a configuration of storage nodes, seen as a file system and
accessed using a high level protocol like NFS. It is implemented using the
IP protocol.
ingargio@joda.cis.temple.edu