CIS 307: Files and all that

[Accessing a block], [Stable Storage], [Versions of the Unix File System], [Data and Index Blocks], [Vnodes], [File Descriptors, File Objects, Vnodes, Inodes], [Buffer Cache], [Logical Volumes], [Blocks, Fragments, Sectors], [No overwrites: Write-Once], [Read vs Write Efficiency], [Log structured file systems], [Conclusion]

These notes address material covered in Chapters 4 and 5 of Tanenbaum's textbook (and a little bit from chapters 7 and 11). It is assumed that you will study chapter 4 (read lightly the section on security) and chapter 5 (read lightly the section on terminals), and finish studying chapter 7. Of course, you have already studied file structures and basic IO operations in CIS207.

Accessing a block

Let's examine the time it takes to access (to read it or write it) a block on a disk, given its location as cylinder, surface, sector. We then have the following times:
Command Overhead
This is the time taken by the disk controller processor to respond to a request. With today's technology it is small: in the range of 0.1ms (in the case of a hit in the controller's cache) to 0.5ms (in the case of a miss).
Seek Time
This is the time for moving the read/write head to the correct cylinder. The head accelerates at the beginning, moves at constant speed for a while, then decelerates. We can think of this time as a + b*d, where a and b are disk-specific constants and d is the distance in cylinders from the starting cylinder to the target cylinder. The seek time is on average many milliseconds, even many tens of milliseconds; let's say it is 10ms on average. We may try to improve the seek time by carefully scheduling disk operations, or by locating the records of a file so that the distance between them is not great.
Rotational Delay (Rotational Latency)
This is the time, once we are at the right cylinder, to wait for the sector to come under the read/write head. If we are lucky, it is immediately there. If unlucky, we have to wait for almost a whole revolution. On average we wait 1/2 a revolution. Since a disk spins fast, but not too fast [say, 5,400rpm or 7,200rpm], a revolution takes about 10ms and the rotational delay is about 5ms. We can do little to reduce the rotational delay [we may use disk scheduling, or use multiple heads, or improve sequential accesses, or ...].
Transfer Time
This is the time required to transfer the information. If we are transferring k bytes and the transfer rate is r bytes per second, this time is k/r. The effective transfer rate is the smaller of two rates: the transfer rate of the bus to which the disk is connected [say, 10MBps and up], called the Interface Data Rate, and the transfer rate of the disk itself [i.e. the rate at which data passes under the head as the disk rotates], called the Media Data Rate. [Say that a track holds 64KB and the rotational speed is 6000rpm; then this data rate is 6.4MBps.] Usually the transfer rate of the disk is the smaller of the two transfer rates. Note that as the data density on the disk increases, the disk data transfer rate increases in proportion. If we are transferring an 8KB block from a 64KB track that rotates in 10ms, the transfer time is 1.25ms (in general, if we transfer k bytes, a track contains n bytes, and a rotation takes r milliseconds, then the transfer time is k*r/n). Unless we use very big blocks, the transfer time is much smaller than the seek and rotational delays.
An additional delay, which in the case of disks is absolutely negligible, is the time it takes for a signal to propagate. Since signals propagate at 200,000Km/second, i.e. 200m/microsecond, i.e. 20cm/nanosecond, we don't have to worry about it for the foreseeable future! If we add all these times we get for the access time:
   overhead + seek + latency + transfer = (0.5 + 10 + 5 + 1.25)ms = 16.75ms
Since accesses to disk are not random and show locality of access, the real seek time tends to be about 1/3 of the random seek time, and the access time becomes (0.5 + 10/3 + 5 + 1.25)ms, about 10.1ms. When large disk caches are used, and the disk request can be satisfied from the cache, the access time gets much better:
   overhead + transfer = 0.1ms + 1.25ms = 1.35ms
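
As a sanity check on these figures, here is a small C sketch that recomputes the three access-time estimates above; all parameter values are simply the ones assumed in the text.

  #include <stdio.h>

  int main(void) {
      double overhead_ms = 0.5;             /* command overhead (controller-cache miss) */
      double avg_seek_ms = 10.0;            /* average seek time                        */
      double rotation_ms = 10.0;            /* one revolution at 6000 rpm               */
      double track_kb    = 64.0;            /* KB per track                             */
      double block_kb    = 8.0;             /* KB per block transferred                 */

      double latency_ms  = rotation_ms / 2.0;                  /* average rotational delay */
      double transfer_ms = block_kb * rotation_ms / track_kb;  /* k*r/n from the text      */

      printf("random access : %.2f ms\n", overhead_ms + avg_seek_ms + latency_ms + transfer_ms);
      printf("with locality : %.2f ms\n", overhead_ms + avg_seek_ms / 3.0 + latency_ms + transfer_ms);
      printf("cache hit     : %.2f ms\n", 0.1 + transfer_ms);  /* only overhead + transfer */
      return 0;
  }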

Stable Storage

[See Tanenbaum, pages 487-488] We would like the following property to hold when we write a value B to a disk sector s currently containing the value A: after the write operation the sector contains either A or B, regardless of failures. Persistent storage with this property is called Stable Storage. It is highly desirable because it means that the write operation is atomic. Unfortunately disks satisfy a weaker property, the Weakly-Atomic Property: after the write operation the sector contains a value that is either A, or B, or one that gives a read error when read. We now show how, given disks that satisfy the weakly-atomic property, we can build stable storage.

We consider two disks D1 and D2 that are mirror images of each other, i.e. they have the same number of equally numbered sectors. Corresponding sectors are intended to have the same content. Here is how we write a value B to a sector s that has currently value A:

 1
    REPEAT
 2      write B to sector s on D1;
 3      read sector s from D1;
 4  UNTIL value read is without error;
 5
    REPEAT
 6      write B to sector s on D2;
 7      read sector s from D2;
 8  UNTIL value read is without error;
 9

We have numbered significant locations in the code. At 1 disks D1 and D2 have the same value A at sector s. At 2, 3, 4, and 5 D2 has value A at s. At 5 D1 has B at s and D2 has A at s. At 6, 7, and 8 D1 at s has value B. Reasoning similarly we see that at 9 we will have value B at s in both D1 and D2. Thus if all goes well we have succeeded in making D1 and D2 mirrors of each other. The problem is when there is a failure at an intermediate point after 1 and before 9. Here is the procedure to follow after a failure.

 10 read sector s from D1;
    if read error then {
      REPEAT
        read value of sector s from D2 and write to sector s on D1;
        read sector s from D1;
      UNTIL value read is without error;
      return;}
 11 read sector s from D2;
    if read error then {
      REPEAT
        read value of sector s from D1 and write to sector s on D2;
        read sector s from D2;
      UNTIL value read is without error;
      return;}
    Compare values read from D1 and D2;
    if different then 
      REPEAT 
        write value read from s on D1 to s on D2;
        read sector s from D2;
      UNTIL value read is without error;
If there is a failure during this procedure, then this procedure is called again. Note that if the read operation in statement 10 fails it must be because we had a failure writing to D1, thus D2 contains the "true" value (the old value). If instead the read operation in statement 11 fails it must be because we had a failure writing D2, thus D1 contains the "true" value (the new value).

Note that a write operation to stable storage takes (at least) twice the time required for a write to normal storage. This is because the writes to the two copies must be done serially.
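
Here is a sketch in C of the write and recovery procedures above. read_sector() and write_sector() are hypothetical primitives (declared but not defined here) with the weakly-atomic behaviour described in the text; read_sector() is assumed to return 0 on success and -1 on a read error.

  #include <string.h>

  #define SECTOR 512
  enum { D1, D2 };                                        /* the two mirrored disks */

  int read_sector (int disk, long s, char buf[SECTOR]);   /* assumed primitive: 0 = ok, -1 = read error */
  int write_sector(int disk, long s, const char buf[SECTOR]);

  /* keep writing to one disk until the sector reads back without error */
  static void force_write(int disk, long s, const char val[SECTOR]) {
      char check[SECTOR];
      do {
          write_sector(disk, s, val);
      } while (read_sector(disk, s, check) != 0);
  }

  /* stable write: first D1, then D2 (locations 1..9 above) */
  void stable_write(long s, const char val[SECTOR]) {
      force_write(D1, s, val);
      force_write(D2, s, val);
  }

  /* recovery after a failure (locations 10 and 11 above) */
  void stable_recover(long s) {
      char b1[SECTOR], b2[SECTOR];
      if (read_sector(D1, s, b1) != 0) {       /* failure while writing D1: D2 holds the old value */
          read_sector(D2, s, b2);
          force_write(D1, s, b2);
          return;
      }
      if (read_sector(D2, s, b2) != 0) {       /* failure while writing D2: D1 holds the new value */
          force_write(D2, s, b1);
          return;
      }
      if (memcmp(b1, b2, SECTOR) != 0)         /* both readable but different: D1 has the newer value */
          force_write(D2, s, b1);
  }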

Versions of the Unix File System

The Unix File System has changed over time, so as to improve functionality, performance, and reliability. Three successive stages are represented by s5fs, the System V file system, FFS, the Berkeley Fast File System, and BSD-LFS, the Berkeley (and others) Log-structured File System; before them all was the file system of the original Unix system.

Data and Index Blocks

Suppose that a file is 1MB and a block is 8KB. Then the file has 128 blocks. The question is: how do we keep track of these 128 data blocks of the file? We can do this by using index blocks (also called indirect blocks) that contain pointers to other index and data blocks. In Unix the inode (see Tanenbaum, page 165 and pages 308-310; notice their internal structure and how they are kept on disk within an inode table that is at a fixed location on disk) is 64 bytes and contains 13 pointers (only 12 in older systems). The first 10 pointers are to data blocks. The 11th pointer points to a single indirection index block, i.e. an index block with pointers to data blocks. The 12th pointer points to a double indirection index block, i.e. an index block that points to index blocks that point to data blocks. The 13th pointer points to a triple indirection index block. If we assume that a block is 8KB and a pointer is 8B, then an index block will contain 1K pointers. The maximum size of a file is then 8KB*(10 + 2**10 + 2**20 + 2**30), that is about 8TB. [Different Unix implementations may have different values.]
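
The arithmetic can be checked with a few lines of C (a sketch using the block and pointer sizes assumed above):

  #include <stdio.h>

  int main(void) {
      unsigned long long block = 8ULL * 1024;            /* bytes per block                          */
      unsigned long long ptrs  = block / 8;              /* 8B pointers: 1K pointers per index block */
      unsigned long long max_blocks =
          10 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;  /* direct + single + double + triple        */

      printf("max file size = %llu bytes (about %llu TB)\n",
             max_blocks * block, (max_blocks * block) >> 40);
      return 0;
  }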

Some observations:

Vnodes

Vnodes were introduced by Sun [this is after s5fs and FFS]. A Vnode is a data structure that contains file-system-independent information about a file, together with pointers to the operations that implement it. Thus a Vnode is just like a class in C++: it gives us a file as an object, hiding the details of its implementation. Vnodes are also used to access files on volumes mounted across a network. In the normal case a Vnode points to an inode.

File Descriptors, File Objects, Vnodes, Inodes

Here we review information we have encountered before on accessing a file, namely the basic control blocks encountered as we proceed from the user program to the actual blocks on, say, disk.

File Descriptors
A file descriptor is an entry in a data structure kept in user memory. It indexes an open file. It may contain flags, such as what to do with the file in case of an exec command: whether to pass it to the new program or to close it. It contains a pointer to an open file object. Even a single process may have more than one file descriptor for the same file; dup and dup2 can be used to create copies of existing file descriptors.
Open File Objects
Open file objects are kept in system space. They contain the information about a particular use of a file, things like protection flags, mode of use [blocking/non-blocking], and the cursor (the current offset in the file). As we have seen, more than one file descriptor, from one or more processes, may refer to the same open file object. The open file object contains a pointer to a vnode. For a file we may have more than one open file object at the same time.
Vnodes
Vnodes, as just seen, are data structures kept in system space that provide a standard interface to a file, whether it is local or remote, and whatever file system it belongs to. There is at most one vnode per file; it remains in existence as long as there is an open access to the file. It contains things like a count of opens, locks, and a pointer to file-system-specific data, usually an inode.
Inodes
Inodes have two incarnations, one on disk and one in main memory. Here we are concerned with the latter. An inode contains information about the owner of the file, creation dates, pointers to the data and index blocks of the file, its inode number, the device id of the device where the file is stored, etc. In-memory inodes are kept in a hash table (with buckets) where the hash key is the pair (inode number, device id). There is at most one inode in main memory per file: it is created with the first open and deleted with the last close. The on-disk inodes are kept in tables.
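
The chain of structures just described can be summarized with a simplified C sketch (these are not the actual kernel declarations, just an illustration of which structure points to which):

  #include <sys/types.h>

  struct inode {                     /* in-memory copy of the on-disk inode              */
      dev_t  dev;                    /* device where the file is stored                  */
      ino_t  number;                 /* inode number; (number, dev) is the hash key      */
      uid_t  owner;
      long   blocks[13];             /* 10 direct + single/double/triple indirect        */
      /* ... sizes, times, link count, ...                                               */
  };

  struct vnode {                     /* file-system independent layer                    */
      int    open_count;             /* number of open file objects using this vnode     */
      int    locks;
      void  *fs_data;                /* usually points to an inode                       */
      /* ... pointer to the operations of the underlying file system ...                 */
  };

  struct open_file {                 /* one per use of the file, kept in system space    */
      int           flags;           /* protection flags, blocking/non-blocking mode     */
      off_t         cursor;          /* current position in the file                     */
      struct vnode *vp;
  };

  struct fd_entry {                  /* one slot of a process' file descriptor table     */
      int               close_on_exec;
      struct open_file *file;        /* several descriptors may share one open file obj. */
  };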

Buffer Cache

The buffer cache was used in System V as a cache for disk blocks. It is a hash table where blocks are identified by the pair (device#, block#). When an I/O request for a block is made, the buffer cache is searched first; the search is fast since it is a hash table. Blocks from the buffer cache are written back to disk within a system dependent time limit. The system needs to do storage management for the free blocks of the buffer cache and to keep track of the buffers currently undergoing an I/O operation. Buffers are freed as necessary following an LRU policy. Note that not all blocks are written back: directory blocks are written through (i.e. immediately), not written back (i.e. later).
More recent Unix implementations use the virtual memory mechanism to map files into virtual memory and do without a separate buffer cache.
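
As an illustration, here is a minimal sketch of a buffer-cache lookup keyed by (device#, block#); the real System V code also handles locking, buffers with I/O in progress, delayed write-back, and the free list, all omitted here.

  #include <stddef.h>

  #define NBUCKET 256

  struct buf {
      int         dev, blkno;        /* the hash key                         */
      char       *data;              /* the cached block                     */
      int         dirty;             /* must eventually be written back?     */
      struct buf *next;              /* chain within a hash bucket           */
  };

  static struct buf *bucket[NBUCKET];

  static unsigned hash(int dev, int blkno) {
      return ((unsigned)dev * 31u + (unsigned)blkno) % NBUCKET;
  }

  /* return the cached buffer for (dev, blkno), or NULL on a miss */
  struct buf *lookup_block(int dev, int blkno) {
      for (struct buf *b = bucket[hash(dev, blkno)]; b != NULL; b = b->next)
          if (b->dev == dev && b->blkno == blkno)
              return b;
      return NULL;   /* miss: the caller takes an LRU victim and reads the block from disk */
  }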

Logical Volumes

It is convenient to de-couple the notion of the volume on which a file system resides from the actual physical volumes available in the system. For example, we may be given a 9GB disk and prefer to see it split into three 3GB volumes so that each smaller volume can be managed and backed up independently. Alternatively, we may need very large files, larger than any single disk. We can achieve both goals by defining the concept of a logical volume and creating maps from logical volumes to disk partitions and physical volumes.

Blocks, Fragments, Sectors

The internal structure of disks is fairly complex. Here is some information on this structure for SunOS. The physical unit of storage is the sector (on the Sun it is 512B). The unit of transfer is the block (on the Sun it is 8KB). A block thus includes many sectors (16 in this case). Since blocks are large and files are often small, there is an intermediate unit between "sector" and "block", the fragment. A fragment is at least as large as a sector and smaller than a block. Essentially, a block is broken up into fragments so that a small file gets just a fragment and a larger file does not have to allocate a full block unless it is necessary. Distinct fragments within a block may belong to different files (the number of fragments per block is implementation dependent: 2, or 4, or ...). As a file grows, it may require rewriting a fragment into a new fragment or into a block. For example, if a file is 2*8KB+2KB, then it has two blocks plus a 2KB fragment. If it now grows to 3*8KB, a new block is allocated and written while the fragment is freed. A file can have a fragment only in its last block, and a file with a fragment must not have index blocks. [In other words, only small files are affected by fragments, which let us contain the waste of disk space due to small files.] By using fragments the utilization of storage is improved at the cost of additional programming complexity.
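
A small sketch of the block/fragment arithmetic (assuming, as an example, 8KB blocks broken into 4 fragments of 2KB each):

  #include <stdio.h>

  int main(void) {
      long block = 8 * 1024;
      long frag  = block / 4;                      /* 4 fragments per block, 2KB each (example) */
      long size  = 2 * block + 2 * 1024;           /* the 2*8KB + 2KB file from the text        */

      long full_blocks = size / block;
      long tail        = size % block;             /* bytes left over in the last block         */
      long fragments   = (tail + frag - 1) / frag; /* fragments needed for the tail             */

      printf("%ld full blocks + %ld fragment(s) of %ldB\n", full_blocks, fragments, frag);
      return 0;
  }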

No overwrites: Write-Once

The following is a description of a simple low level file system intended to be reliable and fairly efficient [the idea is by J.Abrial]. [This idea is employed to an extent in some versions of Unix with the notion of a shadow inode: changes are made to a copy of the inode until the operation is completed, and then the copy (the shadow inode) replaces the old inode.] For reliability it never overwrites a block [there is one exception to this rule]; instead it always allocates a new block. While the computer is running there may be a difference between how the file system appears on disk and how it appears from main memory. The two views are reconciled when there is a checkpoint operation. We simplify things by assuming that each file has exactly one index block, from which one reaches all the data blocks of that file. The following figure describes an empty volume.

And the following figure describes a volume in which we have two files, one with two blocks and the other with three.

When a volume is mounted we will have in (main) memory structures corresponding to the root block (the root block is also called the superblock), called c-root, to the catalog block, called c-catalog, and to the index block of each open file. The file system is seen in memory through these structures. The following figure describes the state of the volume and of memory when the volume is mounted and we have opened the file f2.

We can describe the basic operations of this file system:

Mount()

  1. Allocate a block in (main) memory, call it c-root, and copy root block to it
  2. Allocate a block in memory, call it c-catalog, and copy catalog block to it
  3. Set c-root[master] to the address of c-catalog.
  4. Determine list of free blocks on disk (you could trace from root)

create(f): f is a file that does not exist (after, f is not open)

  1. Allocate and initialize a free block on disk
  2. Set c-catalog[f] to its address
[This operation could be made more efficient, to not require I/O]

open(f): f is a file that is not open

  1. Allocate a block in memory, call it c-f, and copy index block f to it
  2. Set c-root[f] to the address of c-f

read(f,p): f is an open file, p is one of its pages

  1. Allocate a block in memory, call it c-p, and copy block p to it

write(f,p,v): f is an open file, p is a page of f, v is a page-content

  1. Allocate a free block p' on disk and write v to it
  2. Set c-f[p] to the address of p'

close(f): f is an open file

  1. Allocate a free block f' on disk and copy c-f to it
  2. Set c-catalog[f] to f'
  3. Set c-root[f] to undefined

delete(f): f is an existing closed file

  1. Take f out of c-catalog

dismount()

  1. Close each open file
  2. Allocate a free block catalog' on disk
  3. Set c-root[master] and c-catalog[master] to catalog'
  4. Write c-catalog on catalog'
  5. Overwrite root with c-root. [Remember that root is in stable storage]
Only after the volume is dismounted does the file system on disk coincide with the file system as it was seen from main memory. Until the dismount operation, all changes to the file system since the last mount are lost in case of failure. It is easy to think of a checkpoint operation that guarantees more frequent safe state updates. In any case, after any failure we will always restart with a consistent file system on disk.

This technique is also efficient. We have seen that delete involves no I/O. The copy operation done at close is also very fast because we only need to copy the index block; new data blocks are written only for write operations (see the sketch below).
[You may compare this approach to the way Unix maintains file consistency (Tanenbaum pp175-177).]
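
Here is a sketch in C of the write(f,p,v) and close(f) operations of this write-once scheme. alloc_disk_block(), write_disk_block() and the in-memory structures c_f, c_catalog_f, c_root_f are hypothetical helpers standing in for the structures described above; they are not real APIs.

  #define PAGES  1024                        /* pages reachable from one index block             */
  #define PAGESZ 8192

  typedef struct { long page_addr[PAGES]; } index_block;

  extern index_block c_f;                    /* in-memory copy of f's index block (c-f)          */
  extern long        c_catalog_f;            /* c-catalog[f]: disk address of f's index block    */
  extern long        c_root_f;               /* c-root[f]; -1 means "undefined"                  */

  long alloc_disk_block(void);                         /* assumed: returns a free block's address */
  void write_disk_block(long addr, const void *data);  /* assumed: raw block write                */

  /* write(f, p, v): page p is never overwritten in place */
  void wo_write(int p, const char v[PAGESZ]) {
      long p_new = alloc_disk_block();       /* 1. allocate a fresh block and write v to it        */
      write_disk_block(p_new, v);
      c_f.page_addr[p] = p_new;              /* 2. only the in-memory index block is updated       */
  }

  /* close(f): publish the new index block through the in-memory catalog */
  void wo_close(void) {
      long f_new = alloc_disk_block();       /* 1. copy the in-memory index block to a fresh block */
      write_disk_block(f_new, &c_f);
      c_catalog_f = f_new;                   /* 2. c-catalog[f] now points to the new index block  */
      c_root_f    = -1;                      /* 3. the file is no longer open                      */
  }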

Read vs Write Efficiency

John Ousterhout has stated that the two principal bottlenecks impeding progress towards the development of high performance systems are:
  1. the low data rate on the main memory channel, and
  2. the slowness of disk [synchronous] write operations.
Ousterhout stresses the difference between read and write operations. Read operations from disk are sped up substantially by the use of sizable semiconductor dynamic memory caches. Caches do not help in the case of write operations since, for reliability reasons, information is immediately written to disk. [In Berkeley Unix the inodes and directories are written immediately (synchronously); data blocks can be written out asynchronously.] Ousterhout suggests that we must improve the average time for write operations by reducing the number of synchronous writes, by using write buffers in (non-volatile) RAM and, as we will see in log structured file systems, by reducing the seek times required for access to inode and directory blocks by placing them close to each other. [Note that an initial asymmetry between Read and Write operations comes from the fact that if we want to write 10 bytes to a block, we first have to read the block into a buffer, then insert the 10 bytes at the right location in the buffer, and finally write out the block. If we just wanted to read 10 bytes from a block, we could do so directly.]
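
The bracketed remark about the read/write asymmetry corresponds to the familiar read-modify-write pattern, sketched below (read_block() and write_block() are assumed primitives that transfer whole blocks):

  #include <string.h>

  #define BLOCK 8192

  void read_block (long blkno, char buf[BLOCK]);        /* assumed primitive */
  void write_block(long blkno, const char buf[BLOCK]);  /* assumed primitive */

  /* updating a few bytes in the middle of a block requires reading the whole block first */
  void update_bytes(long blkno, int offset, const char *data, int len) {
      char buf[BLOCK];
      read_block(blkno, buf);            /* 1. read the whole block into a buffer   */
      memcpy(buf + offset, data, len);   /* 2. patch the bytes in the buffer        */
      write_block(blkno, buf);           /* 3. write the whole block back           */
  }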

Log structured file systems

[For additional information see Rosenblum, M. and Ousterhout, J. K. (1992), "The Design and Implementation of a Log-Structured File System", ACM Transactions on Computer Systems, pp. 26-52, and also Saltzer.]
Ousterhout has determined experimentally that traditional file systems achieve a write bandwidth of only about 5% [i.e. if the disk is capable of writing data sequentially at 10MBps, in reality we can write to files at a rate of only about 0.5MBps. This is easy to see: if a track holds 64KB and we write only 4KB during a rotation, then we use only 1/16th of what is possible.]. The log-structured file system proposed by Rosenblum and Ousterhout achieves a factor of 10 improvement in write bandwidth [i.e. in the example above, we can write to files at a rate of over 5MBps]. This result is achieved in the case of small files (for big files performance does not improve significantly) and assuming that there are no crashes, or that there are write buffers in non-volatile RAM.
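
The bandwidth argument can be made concrete with a small sketch (the 10MBps media rate, 64KB track, and 4KB write are the example figures used above):

  #include <stdio.h>

  int main(void) {
      double media_rate_mbps = 10.0;     /* what the disk can stream sequentially   */
      double track_kb        = 64.0;     /* KB per track                            */
      double write_kb        = 4.0;      /* KB actually written per rotation        */

      double fraction = write_kb / track_kb;   /* only 1/16th of each rotation is useful */
      printf("small synchronous writes: %.2f MB/s (%.1f%% of the media rate)\n",
             media_rate_mbps * fraction, 100.0 * fraction);
      printf("large segment writes    : close to %.1f MB/s\n", media_rate_mbps);
      return 0;
  }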

The key idea of the log-structured file system is to delay writes so that they can be made in big chunks, called segments, that include data blocks, index blocks, directory blocks, inodes, and inode list blocks (all inode list blocks are reachable from a single fixed location on disk). A segment is 1MB (or 0.5MB). A problem is how to manage the space for segments and how to deal with deletions of their content. The solution chosen is to run a segment cleaner that periodically copies the still-live blocks out of old segments, compacts them into a smaller number of clean segments, and frees the old segments for reuse.

Conclusion

File systems remain an essential component of operating systems. Efforts are being made to improve their reliability and to have their performance keep pace with improvements in CPU performance. Compression is being used to improve both performance and storage utilization. Real-time issues are a concern: as files are used to store movies, it is important to be able to store very large files and to guarantee bounded delays when reading files sequentially. And then there is the use of files in distributed systems....

ingargiola.cis.temple.edu