CIS 307: TOPIC: Virtual Memory

[Introduction], [Performance], [Hardware support for Virtual Memory], [From Patterson and Hennessy Book on Computer Architecture], [PDP11], [Hard versus Soft page Faults], [Working Set Principle], [The Five Minutes Rule], [Paging on Windows NT]

Virtual Memory is treated fully in Chapter 3 of Tanenbaum. Here are a few additional notes.

Introduction

A Virtual Memory system is a system where the addresses used in executing images, Virtual Addresses, are different from the addresses, Physical Addresses, used at the hardware level, on the bus between processor and main memory. There is a map (called the Memory Map from the Virtual Address Space to the Physical Address Space. This map is usually different in different processes. The map is set and evaluated repeatedly during program execution. For a given virtual address the map at different times will give different values. It will be undefined if the page or segment containing the virtual address is not in main memory, otherwise it will be the current location in main memory. [There are two other important maps: the Naming Map that binds the names you use in your program to virtual addresses; and the Content Map that maps physical addresses to the content at that location. You might think who/what implements these maps, and if they are forms of early/late binding.]

Virtual Memory was invented in the early Sixties. It is now almost universally used for the following reasons:

Size Independence:: It allows us to have programs whose size is limited by the size of the virtual memory, not by the size of the physical memory. Since virtual memory is usually (NOT always) larger than physical memory, large programs can run even on machines with small main memory.
Protection:: When my array index computation goes out of range, it is bad because it aborts my program. But it would be very bad if it would abort your program. Protection controls the access rights of a process to various portions of its virtual address space. With appropriate operating systems support it becomes possible to achieve any desired protection policy where sharing and separation between processes is as desired. This property is so important that many want virtual memory even in diskless machines, or in machines where the physical address space is larger than the virtual address space.
Relocatability:: If in location A we store the address of location B then, if the program (or the portion containing B), is relocated then we need to change the content of A. Alternatively, as seen in machines with Base and Bound registers, and in Virtual Memory, we can use addresses relative to the base, or virtual addresses, and avoid all recomputations.
Performance:: By calibrating carefully the amount of main memory made available to a process (and the size of cache memory available on the processor) it is possible to achieve virtual memories that have access times that are better, on average, than physical memory and size comparable to that of secondary storage. The best of both worlds [Thank you Locality].

The hardware required for Virtual Memories, on average is complex but these days it is cheap [in fact now the Memory Management hardware is just a portion of the processor chip]. Overall, a great idea. [When considering machines with Virtual Memory always look at what they do to reduce the main memory required to accomodate page tables.] Notice that the concept of Virtual Memory applies also to the use of Caches, that are used in conjunction with main memory and in conjunction with disks [in the modern disk systems that are used with mainframes and servers the disk system controller may use a big cache (even many gigabyte big) to improve read performance).

Performance

We will now try to determine, given a map from a Paged Virtual Memory to Physical Memory, and the access time to Physical memory, what is the access time (expected) to Virtual Memory. We will assume that the translation cost is zero (we can always include translation cost as a percentage of the memory access). We define some terms.

Hit Cost = H :: Time it takes to access virtual memory in the case that there is no page fault. Under the assumption that translation cost is 0, then H is equal to the access time of physical memory. [Note that the translation from virtual to physical address, since it involves some form of page table lookup, may itself cause page faults. These page faults may occur independently of the presence or absence in main memory of the physical page being referenced by the virtual address.]
Hit Ratio = R :: Ratio between the number of times virtual memory is accessed without giving origin to a page fault and the number of time it is accessed giving origin to a page fault. A concept related to the Hit Ratio is the concept of Mean Time Between Faults, which is equal to H*R, if we do not consider the time required to handle the fault.
Miss Penalty = M :: Time it takes to access virtual memory in the case that there is a page fault.
Speed Ratio = S;: Ratio between Miss Penalty and Hit Cost. For example if we have a Hit Cost of 100 nanoseconds, and a Miss Penalty of 10 milliseconds, then the Speed Ratio is 100,000.

It becomes easy to see that the expected access time to virtual memory becomes:

	                R*H + M    R*H + M       M         S
  E(access time) =	------- ~= ------- = H + - = H(1 + -)
                 	  R+1         R          R         R

For example, if H is 0.1microseconds, M is 15milliseconds, and R is 100,000 then the expected access time to virtual memory is 0.25microseconds, that is, it is 2.5 times the access time to physical memory.
Notice that in the acess time formula the first term represents the successful access to memory (i.e. time cpu is busy and paging IO is idle) and the second term represents access to secondary storage (i.e time cpu is idle but paging IO is busy)

The beauty of virtual memory is that it is like a game that can be played more than once, not only between main memory and secondary storage, but also between cache memory and main memory. In all case things work well for us because programs show "Locality" in their behavior. [Notice that when we talk of locality we mean either temporal locality or spacial locality. Temporal locality says that if I have used something recently, I am likely to use it again in the future. Spacial Locality says that if I use something now, I am likely to use something nearby in the future. Temporal locality guides us in what we keep in memory according to the replacement policy. Spacial locality may be used in pre-fetching.] Let's consider the combination of cache and main memory virtual address translation. Let Hc, Rc, and Mc and Hm, Rm, and Mm be respectively the hit cost, Hit ratio, and Miss penalty in the case of cache to main memory, and main memory to disk virtual addresses.
We can consider two cases: the case where the cache virtual address space operates on top of the main memory virual address space, and the converse case.

In the case of cache operating on virtual addresses (i.e. it keeps pairs of the form [virtual address, value]) the expected access time in the resulting memory system from cache to disk is:
```
		         Mc        1       Mm         Hm      Mm
  E(access time) ~= Hc + -- = Hc + --(Hm + --) = Hc + -- + -------
			 Rc        Rc      Rm         Rc   Rc * Rm
```
For example, if we use the previous values for Hm, Rm, Mm and we assume that Hc is 0.02microseconds and Rc is 18, then we get an expected access time that is about 0.03microseconds. WOW. We have disk size memory with cache memory performance!
In the case of main memory's virtual memory operating on top of the cache (i.e. it keeps pairs of the form [physical address, value]) the expected access time in the resulting memory system from cache to disk is:
```
                                                 Mm        Hm   Mm
  E(access time) ~= E(access time to cache VM) + -- = Hc + -- + --
                                                 Rm        Rc   Rm
```
For the same values used above, now the access time is about 0.026 + 0.15 micro seconds.

Clearly case 2 is less efficient than case 1. But it is the way that is usually implemented in hardware. One reason is that if the cache works using virtual addresses then it becomes tied to a specific process and when there is a context switch the cache needs to be flushed and cold started. In order to take full advantage of the cache when it uses real addresses it is necessary to increase the hit ratio for main memory, for example, from 100,000 to 600,000 (in this case the total access time becomes 0.51microseconds). You may have heard that if in a car you improve the power of the engine by a lot, you have also to improve brakes, suspensions, steering, body rigidity, etc. The same is true in our case, if we add cache, we should also increase the size of main memory so as to decrease main memory fault rate and bring the performance of the whole memory system close to the one of the cache.

In a computer system things are more complex than we have described. There may be two or more levels of caches between the cpu and the main memory, and the caches may differentiate between instructions and data. Also, the presence of TLBs (Translation Look Aside Buffers) complicates performance analysis. [A TLB is an associative memory that maps a few virtual page addresses (one per page) to physical page addresses.]

Hardware support for Virtual Memory

Of course the memory mamangement hardware unit will support translation from virtual to physical address. But it does much more. Here is a partial list of other services it may provide:

Vectored Interrupt in case of page fault with information for identifying what caused the fault (for example, access violation versus true page fault on which address)
When the execution of an instruction is interrupted because of a page fault, after the fault has been handled we need to re-execute the interrupted instruction. This can be supported by the hardware.
Support for replacement policy (use bit and dirty bit)
Support for choosing when to have the VM active
Support for debugging when using VM (selecting which addresses to translate and which not to translate)
Translation of adresses across distinct virtual address spaces

As virtual memories became larger, page tables also became larger. Thus a fundamental problem is how to manage the amount of page table that is kept in physical memory. Different architecture deal with this problem differently. The DIGITAL VAX put the paged tables in virtual memory (but not all of them of course). IBM's System 38 and its successors did without it by inverting the page table, and using a frame table (tied to size of physical memory, not virtual memory). The INTEL386 used hierarchical tables. Same problem, different ways to deal with it.

From Patterson and Hennessy Book on Computer Architecture

I found very useful a summary found in Patterson and Hennessy book on computer architecture of similarities among uses of the virtual memory concept. I paraphrase that summary below [it is a bit dated by now]:

Speeds: Caches are realized with static RAM with access time ~10nsec; main memory is done with dynamic RAM: 80nsec; Disk: 15ms.

                           |  CACHE     |   PAGED MEMORY |      TLB       |
                           |==============================================|
   Total size in blocks    | 250-10,000 | 2000-250,000   | 16-1K          |
   Total size in bytes     | 4KB-4MB    | 4MB-1GB        | 64-8K          |
   Block size in bytes     | 4-256      | 4KB-16KB       | 4-8            |
   Miss penalty in clocks  | 10-100     | 100,000-1M     | 10-50          |
   Miss rate               | 0.1%-20%   | 10**-7 - 1**-6 Ś 10**-5 - 10**-2|
                           |==============================================|

Question : Where can a block be placed?

Fully Associative : Anywhere
Direct Mapping : Fixed Location
Set-Associative : Direct mapping with a number of alternatives

Question: How is a block found?

Search in case of Set-Associative
Page Table in case of Fully Associative
Direct, for Direct Mapping

Question: Which block should be replaced on a miss?

Random
LRU, or one of its approximations

Question: What happens on a Write?

Write Through: Write goes both to the cache and to the backup
Write Back: Write is done on cache. Later, as necessary or by deadline, to backing store.

Reasons for Faults:

Compulsory = Cold Start Misses : The first time the block is brought into memory
Capacity = Blocks don't all fit, so need replacements
Conflict = Collision Misses : Forced out by a collision in direct mapping or in set associative.

The PDP-11 Segmented Virtual Memory

The PDP-11 is a very old mini-computer very popular at the beginning of the 70's. It came in many versions. It uses segmented virtual memory. It is interesting, among other things, because in many models it had bigger physical address spaces than virtual spaces. We will describe a model where a virtual address is 16 bits(64KB) and the physical address is 18 bits(256KB). The PDP-11 uses in simple form many of the features that are common in current virtual memory systems.

Since the PDP-11 has 8 segments (the machine is so old that at the time these segments were called pages), the virtual address will consist of a 3 bit segment-id field and a 13 bit displacement-in-segment field. The 8 Page Descriptor Registers (PDRs) and 8 Page Address Registers (PARs) form a kind of Page Table for this machine. It is much smaller than in the "real" virtual address systems that have come later.

Hard versus Soft Page Faults

Suppose you have a page fault. If there is a free frame we can immediately load the needed page into that frame. But if no frame is available (and no replaceable page is available which has not been written to) then we have first to write to disk the replaced page, then read in the needed page. This means that for one page fault we have more than one IO operation: double jeopardy. I know that may be other processes can use the processor while I wait for my page, but still it is bad for my program. A possibility is to keep available a pool of "free" frames. The content of these frames is copied to buffers and written to disk before replacement is required and the frames are marked available for replacement. Then in the case of a page fault: if it is for one of the pages that were occupying the "free" frames, then we have a Soft Page Fault since we can still find in the frame the original page; if it is for another page, then we have a Hard Page Fault since the page has to be brought in from disk. Now for a Hard page fault we will have to wait for a single IO operation. Of course soft page faults will be handled much more quickly than hard page faults.

Working Set Principle

If you go over the various replacement policies you find that they assume that a program, say with n pages, executes with m frames. Nothing is said about how m was chosen. The replacement policy says only how to use those frames. Here we discuss how to choose m.

Suppose that we are in a system where pages are backed to a single disk and this disk has an access time that is 10ms. Then we know that the maximum number of page faults that our system can handle is 100 per second. If we run my program on this system and give to it a lot of page frames, we may get, say, 20 page faults per second. If the number of frames available to my program is reduced we may get 40 page faults per second. By reducing further the number of frames, we may get to 100 faults per second. But beyond that point, no matter how few frames my program is given, we will not get more than 100 faults per second. If I want my program to keep the cpu busy, if f is the number of faults it generates per second, M is the page fault penalty, and c is the compute time per second, we should have c >= f*M [In our case c >= 10*f] so that the program runs at least as much as it waits for faults and the cpu does not have to wait for the io channels.

If instead of thinking only of my program, I think of all the programs currently executing, if cumulatively they create 60 page faults, then the disk is not fully utilized, if the faults are close to 100, then probably programs are waiting on page faults more than they should. If a program in a second has more time waiting for page faults than cpu time, then that program has more page faults than it should. Clearly, if programs have a mean time between faults that is no less that the fault penalty, then, on average the cpu will not be idle while the disk is handling faults. If we make them equal, then, both cpu and IO are equally busy. Going back to our analysis of the performance of virtual memory, we see that in the "ideal" case, the access time to virtual memory is 2 times the access time to physical memory.
We call Working Set of a program the set consisting of the distinct pages referenced by that program in a specified time interval. At times we use the working set concept to mean the number of pages in such a set. We (the operating system) try to allocate to a program enough frames to accomodate a desired working set, both as number of frames needed and as identity of the pages involved.. A desirable working set is one, as discussed above, where for the chosen process the cpu time is not less than the fault handling delay time. Since the operating system may be unable to give to each process the desired number of frames, the OS has two choices, to let programs run even if they have too many page faults (they are trashing), or to swap out the programs whose desired working set cannot be satisfied. The OS follows the second policy. This is the Working Set Principle: Let a program run only if we can accomodate its desired working set.
The operating system will identify the pages in the working set by their age [assuming LRU or some of its approximations]. If a process cannot be run with its working set [insufficient space is available], the process is swapped out. When swapped in we will pre-page (i.e. fetch) the pages in its working set. Pre-paging is the opposite of the usual policy of bringing in a page only when it is needed [Demand Paging]. The choice of using pre-paging or demand paging is called a Fetch Policy choice. Another kind of policy, an Allocation Policy, will determine if we are allocating pages globally, i.e. out of a pool shared by many processes, or locally, i.e. out of a pool used only by this process, and statically, i.e. using a fixed number of pages, or dynamically, i.e. using a variable number of pages.

The Five Minutes Rule

A concept related to what discussed in the Working Set Principle goes by the name The Five Minutes Rule. It addresses the question of how large a memory should be relative to the size of the next lower memory in the memory hierarchy. The argument goes as follows [We consider only economic reasons, no real-time concerns]

Suppose that data blocks are 4KB and memory is $10/MB. thus the cost of the block is $0.04. The cost of disk is, say, $500 for a 4GB disk, thus for the 4KB block the disk cost is negligible ~$0.003. But if we think in terms of use of the disk, in terms of accesses, we have a different result. Say we can make 50 accesses per second to the disk. Thus one access per second costs $10. Thus ($0.04-$0.003) buys us about an access every 270 seconds, i.e. about every five minutes. Thus if a data block is accessed more often than every five minutes it should be kept in main memory. Otherwise in secondary memory. This reasoning applies to any level in the memory hierarchy. The general formula for the threshold interval is

              CostOfAccessPerSecond
           ---------------------------- = ThresholdIntervalInSeconds
           (CostMemory - CostSecondary)

Paging on Windows NT

Here are, with comments, some figures from a book on Windows NT written by Helen Custer.
The first picture shows the states through which a page frame will pass through its life.

In this figure "PTE" means "Page Table Entry". A page frame starts by being "free". Before being used a frame may be "zeroed" so as to enforce a security policy [it is required by the Defense Department in certain circumstances]. After a frame has been "In use" it is removed from the working set. There are two possibilities: the content of the frame has been modified or it has not. In the first case the frame moves to the "Modified" state; otherwise it moves to the "Standby" state. In the modified state there are 3 possibilities: 1) the contained page is used again, with transition back to "In use"; 2) the contained page virtual addresses cease to be used, with transition to the "Free" state; or 3) the modified contained page is written to disk and the frame moves to the "Standby" state. The Transitions from the "Standby" state are similar, but now there is an extra transition: the frame may be freed while the contained page remains part of a virtual address space. In that case the frame is freed, but the corresponding PTE is updated to contain the disk address of where the page content is stored on disk. Note that the transitions from the states Standby and Modified to InUse takes place in correspondence to soft page faults since they correspond to pages that are still in main memory while their PTE is not pointing to a memory frame.

The next figure just indicates that the OS for ease of use keeps all the page frames on lists, one per state. Information about all the page frames is kept in a table, called the Page Frame Database. When the system needs a frame, it will look first to the Zeroed list, then the Free list (zero it if required by security policy), then Standby list, then the Modified list.

The following figure interrelates the information kept in the Page Frame Database with the information kept in the various page tables. Notice that it is possible, given a page table entry, to determine what is the corresponding page frame, and viceversa [you need this back pointer when, for example, you have to free a frame during replacement policy]. Notice further that when a page frame is shared by more than one address space, then there is an extra shared page table entry. In this manner changes in the frame can be recognised in all the affected address spaces [this means that the hardware that interprets the virtual memory page table entry must be able to recognize the indirection and follow the pointer].

ingargiola.cis.temple.edu