Computer Architecture: CPUs -- Virtual Memory Technologies and Virtual Addressing




1. Introduction

The previous sections discuss physical memory and caching. The section on physical memory considers the hardware technologies used to create memory systems, the organization of physical memory into words, and the physical addressing scheme used to access memory. The section on caching describes how a memory cache is organized, and explains why a memory cache improves performance dramatically.

This section considers the important concept of virtual memory. It examines the motivation, the technologies used to create virtual address spaces, and the mapping between virtual and physical memory. Although our focus is primarily on the hardware systems, we will learn how an operating system uses virtual memory facilities.

2. Definition of Virtual Memory

We use the term Virtual Memory (VM) to refer to a mechanism that hides the details of the underlying physical memory to provide a more convenient memory environment. In essence, a virtual memory system creates an illusion - an address space and a memory access scheme that overcome limitations of the physical memory and physical addressing scheme. The definition may seem vague, but it needs to encompass a wide variety of technologies and uses. The next sections define the concept more precisely by giving examples of virtual memory systems that have been created and the technologies used to implement each. We will learn that the variety in virtual memory schemes arises because no single scheme is optimal in all cases.

We have already seen an example of a memory system that fits our definition of virtual memory in Section 11: an intelligent memory controller that provides byte addressing with an underlying physical memory that uses word addressing. The implementation consists of a controller that allows a processor to specify requests using byte addressing. We further saw that choosing sizes to be powers of two avoids arithmetic computation and makes the translation of byte addresses to word addresses trivial.

3. Memory Management Unit and Address Space

Architects use the term Memory Management Unit (MMU) to describe an intelligent memory controller. An MMU creates a virtual address space for the processor.

The addresses a processor uses are virtual addresses because the MMU translates each one into an address in the underlying physical memory. We classify the entire mechanism as a virtual memory system because it is not part of the underlying physical memory.

Informally, to help distinguish virtual memory from physical memory, engineers use the adjective real to refer to a physical memory. For example, they might use the term real address to refer to a physical address, or the term real address space to refer to the set of addresses recognized by the physical memory.

4. An Interface to Multiple Physical Memory Systems

An MMU that can map from byte addresses to underlying word addresses can be extended to create more complex memory organizations. For example, Intel designed a network processor that used two types of physical memory: SRAM and DRAM. Recall that SRAM is faster than DRAM, but costs more, so the system had a smaller amount of SRAM (intended for items that were accessed frequently) and a large amount of DRAM (intended for items that were not accessed frequently). Furthermore, the SRAM physical memory was organized with four bytes per word and the DRAM physical memory was organized with eight bytes per word. Intel's network processor used an embedded RISC processor that could access both memories. More important, the RISC processor used byte addressing. However, rather than using separate instructions or operand types to access the two memories, the Intel design followed a standard approach: it integrated both physical memories into a single virtual address space.

To implement a uniform virtual address space out of two dissimilar physical memory systems, the memory controller must perform the necessary translations. In essence, the MMU must supply an abstraction that hides details of the underlying memory systems. FIG. 1 illustrates the overall architecture.


FIG. 1 Illustration of an architecture in which two dissimilar memories connect to a processor. The processor can use either memory.

In the figure, the processor connects to a Memory Management Unit. The MMU receives memory requests from the processor, translates each request, and forwards the request to the controller for physical memory 1 or to the controller for physical memory 2. The controllers for the two physical memories operate as described in Section 11 -- a controller accepts a request that specifies byte addressing, and translates the request into operations that use word addressing.

How can the hardware in FIG. 1 provide a virtual address space? The answer is related to the memory banks described in Section 11. Conceptually, the MMU divides the address space into two parts, which the MMU associates with physical memory 1 and physical memory 2. For example, if each physical memory contains a gigabyte (0x40000000 bytes) of RAM, the MMU can create a virtual address space that maps addresses 0 through 0x3fffffff to the first memory and addresses 0x40000000 through 0x7fffffff to the second memory. FIG. 2 illustrates the resulting virtual memory system.

5. Address Translation or Address Mapping

Each of the underlying memory systems in FIG. 2 operates like an independent physical memory -- the hardware expects requests to reference addresses beginning at zero. Thus, each of the two memories recognizes the same set of addresses.

For memory 1, the virtual addresses associated with the memory cover the same range as the hardware expects. For memory 2, however, the processor generates virtual addresses starting at 0x40000000, so the MMU must map an address to the lower range (i.e., real addresses 0 through 0x3fffffff) before passing a request to memory 2. We say that the MMU translates the address.


FIG. 2 Illustration of a virtual memory system that divides an address space among two physical memories†. The MMU uses an address to decide which memory to access.

Mathematically, the address mapping for memory 2 is straightforward: the MMU merely subtracts 0x40000000 from an address. FIG. 3 explains the concept.


FIG. 3 The sequence of steps used by a Memory Management Unit to create the virtual memory depicted in FIG. 2. The MMU maps the virtual address space onto two physical memories.
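The mapping is simple enough to express in a few lines of code. The following C sketch mimics the steps of FIG. 3 for two one-gigabyte memories; the function and variable names are ours, chosen for illustration, and a real MMU performs this selection in hardware rather than software.

    #include <stdint.h>
    #include <stdio.h>

    #define MEM1_SIZE 0x40000000u        /* one gigabyte per memory */

    /* Decide which memory receives a request and compute the
       corresponding real address, as in FIG. 3. */
    static void translate(uint32_t vaddr, int *memory, uint32_t *paddr)
    {
        if (vaddr < MEM1_SIZE) {
            *memory = 1;                 /* low half: memory 1 */
            *paddr  = vaddr;             /* address unchanged */
        } else {
            *memory = 2;                 /* high half: memory 2 */
            *paddr  = vaddr - MEM1_SIZE; /* map 0x40000000... down to 0 */
        }
    }

    int main(void)
    {
        int m;
        uint32_t p;
        translate(0x40000002u, &m, &p);
        printf("memory %d, real address 0x%x\n", m, (unsigned)p);
        return 0;
    }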

The point is:

An MMU can combine multiple underlying physical memory systems to create a virtual address space that provides a processor with the illusion of a single, uniform memory system. Because each underlying memory uses addresses that start at zero, the MMU must translate between the addresses generated by the processor and the addresses used by each memory.

†We have chosen to label an address space with address zero at the bottom; some documentation uses the convention of placing zero at the top of the address space.

6. Avoiding Arithmetic Calculation

In practice, an MMU does not use subtraction to implement address translation because subtraction requires substantial hardware (e.g., an ALU) and takes too much time to perform for each memory reference. The solution consists of using powers of two to simplify the hardware. For example, consider the mapping in FIG. 2. Addresses 0 through 0x3fffffff map to memory 1, and addresses 0x40000000 through 0x7fffffff map to memory 2. FIG. 4 shows that when expressed in binary, the addresses occupy thirty-one bits, and the ranges differ only in the high-order bit.


FIG. 4 The binary values for addresses in the range 0 through 2 gigabytes. Except for the high-order bit, values above 1 gigabyte are the same as those below.

As the example shows, choosing a power of two can eliminate the need for subtraction because low-order bits can be used as a physical address. In the example, when mapping an address to one of the two underlying physical memories, an MMU can use the high-order bit of an address to determine which physical memory should receive the request. To form a physical address, the MMU merely extracts the remaining bits of the virtual address.

To summarize:

Dividing a virtual address space on a boundary that corresponds to a power of two allows the MMU to choose a physical memory and perform the necessary address translation without requiring arithmetic operations.
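In C, the selection and extraction can be written as a shift and a mask; the sketch below assumes the two-memory layout of FIG. 4, where bit 30 selects the memory and the remaining thirty bits form the physical address.

    #include <stdint.h>

    /* Bit 30 selects the memory; no subtraction is needed. */
    static inline int which_memory(uint32_t vaddr)
    {
        return (vaddr >> 30) & 1;        /* 0 = memory 1, 1 = memory 2 */
    }

    static inline uint32_t real_address(uint32_t vaddr)
    {
        return vaddr & 0x3fffffffu;      /* drop the selector bit */
    }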

7. Discontiguous Address Spaces

FIG. 2 shows an example of a contiguous virtual address space, an address space in which all addresses are mapped onto an underlying physical memory. That is, the processor can reference any address from zero to the highest address because each address corresponds to a location in one of the physical memories. Interestingly, most computers are designed to be flexible - the owner determines how much physical memory to install. The computer contains physical slots for memory, and the owner can choose to populate all the slots with memory chips or leave some of the slots empty.

Consider the consequence of allowing an owner to install an arbitrary amount of memory. Because it is defined when the computer is created, the virtual address space includes an address for each possible physical memory location (i.e., addresses for the maximum amount of memory that can be installed in the computer). If an owner decides to omit some of the memory, part of the virtual address space becomes unusable -- if the processor references an address that does not correspond to physical memory, an error results. The virtual address space is not contiguous because regions of valid addresses are separated by invalid addresses. For example, FIG. 5 shows how a virtual address space might appear if the virtual address space is mapped onto two physical memories, and part of each physical memory is omitted.


FIG. 5 Example of a noncontiguous virtual address space of N bytes that is mapped onto two physical memories. Some addresses do not correspond to physical memory.

When part of a virtual address space does not map onto physical memory, we say that the address space contains a hole. In FIG. 5, for example, the virtual address space contains two holes†.

We can summarize:

A virtual address space can be contiguous, in which case every address maps to a location of an underlying physical memory, or noncontiguous, in which case the address space contains one or more holes. If a processor attempts to read or write any address that does not correspond to physical memory, an error results.

†We will see further examples of address spaces that contain holes when we discuss I/O.

Many other possibilities exist for mapping a virtual address space onto physical memories. For example, recall from Section 11 that the two low-order bits of an address can be used to interleave memory among four separate physical memory modules (i.e., banks), and the remaining bits of the address can be used to select a byte within a module. One of the chief advantages of interleaving bytes among a set of modules arises from the ability of underlying hardware to access separate physical memories simultaneously. Using low-order bits to select a module means that successive bytes of memory come from different modules. In particular, if a processor accesses a data item composed of thirty-two bits, the underlying memory system can fetch all four bytes simultaneously.

8. Motivations for Virtual Memory

The trivial examples above show that a memory system can present a processor with a virtual address space that differs from the underlying physical memory. The rest of the section explores more complex virtual memory schemes. In most cases, the schemes incorporate and extend the concepts discussed above. We will learn that there are four main motivations for the use of complex virtual memory:

-- Homogeneous integration of hardware

-- Programming convenience

-- Support for multiprogramming

-- Protection of programs and data

Homogeneous Integration Of Hardware. Our examples explain how a virtual memory system can provide a homogeneous interface to a set of physical memories. More important, the scheme allows heterogeneity in the underlying memories. For example, some of the underlying physical memories can use a word size of thirty-two bits, while others use a word size of sixty-four bits. Some of the memories can have a much faster cycle time than others, or some of the memories can consist of RAM while others consist of ROM. The MMU hides the differences by allowing the processor to access all memories from a single address space.

Programming Convenience. One of the chief advantages of a virtual memory system arises from the ease of programming. If separate physical memories are not integrated into a uniform address space, a processor needs special instructions (or special operand formats) for each memory. Programming memory accesses becomes painful. More important, if a programmer decides to move an item from one memory to another, the program must be rewritten, which means that the decision cannot be made at run time.

Support For Multiprogramming. Modern computer systems allow multiple applications to run at the same time. For example, a user who is editing a document can leave a word processor open, temporarily launch a Web browser to check a reference, and listen to music at the same time. The terms multiprogramming and multiprocessing each characterize a computer system that allows multiple programs to run at the same time. We will see that a virtual memory system is needed to support multiprogramming.

Protection Of Programs and Data. We said that a CPU uses modes of execution to determine which instructions are allowed at any time. We will see that virtual memory is inherently linked to a computer's protection scheme.

9. Multiple Virtual Spaces and Multiprogramming

Early computer designers thought that multiprogramming was impractical. To understand why, consider how an instruction set works. The operands that specify indirection each reference a memory address. If two programs are loaded into a single memory and run at the same time, a conflict can occur if the programs attempt to use the same memory location for two different purposes. Thus, programs can only run together if they are written to avoid using the same memory addresses.

The most common technology for multiprogramming uses virtual memory to establish a separate virtual address space for each program. To understand how a virtual memory system can be used, consider an example. FIG. 6 illustrates a straightforward mapping.


FIG. 6 Illustration of four partitions mapped onto a single physical memory. Each virtual address space starts at address zero.

The mechanism in the figure divides the physical memory into equal-size areas that are known as partitions. Partitioned memory systems were used on early mainframe computers in the 1960s and 1970s, but have since been replaced. One of the main drawbacks of partitioned memory is that the memory available to a given program is a fraction of total physical memory on the computer. As FIG. 6 illustrates, systems that used partitioned memory typically divided memory into four partitions, which meant that one-fourth of the total memory was dedicated to each program.

The diagram in FIG. 6 implies that an MMU translates multiple virtual address spaces onto a single physical memory. In practice, however, MMU hardware can perform additional mappings. For example, an MMU can translate from virtual address space 1 to an intermediate virtual address space, and then translate the intermediate virtual address space onto one or more underlying physical memories (which may implement further translation from byte addresses to word addresses).

10. Creating Virtual Spaces Dynamically

How should a virtual memory system be created? In the simplistic examples above, we implied that the mapping from virtual address space(s) to physical memories is chosen when the hardware is built. Although some small, special-purpose systems have the mappings designed into hardware, general-purpose computer systems usually do not. Instead, the MMU in a general-purpose system can be changed dynamically at run time. That is, when the system boots, the processor tells the MMU exactly how to map the virtual address space onto the physical memory.

How can a program running on a processor change the address space and continue to run? In general, the address space to be used is part of the processor mode. The processor begins running in real mode, which means that the processor passes all memory references directly to the physical memory without using the MMU. While operating in real mode, the processor can interact with the MMU to establish a mapping. Once a mapping has been specified, the processor can execute an instruction that changes the mode, enables the MMU, and branches to a specified location. The MMU then translates each memory reference according to the mapping that was configured.

The next sections examine technologies that have been used to create dynamic virtual memory systems. We will consider three examples:

-- Base-Bound Registers

-- Segmentation

-- Demand Paging

11. Base-Bound Registers

A mechanism known by the name base-bound is among the oldest and easiest dynamic virtual memory schemes to understand. In essence, the base-bound scheme creates a single virtual address space and maps the space onto a region of physical memory. The name refers to a pair of registers that are part of the MMU; both must be loaded before the MMU is enabled. The base register holds an address in physical memory that specifies where to map the virtual address space, and the bound register holds an integer that specifies the size of the address space. FIG. 7 illustrates the mapping.


FIG. 7 Illustration of a virtual memory that uses a base-bound mechanism. The base register specifies the location of the virtual address space, and the bound register specifies the size.

12. Changing The Virtual Space

It may seem that a base-bound mechanism is uninteresting because it only provides a single virtual address space. We must remember, however, that a base-bound mechanism is dynamic (i.e., easy to change). The idea is that an operating system can use the base-bound mechanism to move among multiple virtual address spaces. For example, suppose the operating system has loaded two application programs at different locations in memory. The operating system, which runs in real mode, controls the MMU. When an application, A, is ready to run, the operating system configures the MMU to point to A's section of the memory, enables the MMU mapping, and branches to the application.

Later, when control returns to the operating system, the operating system selects another application to run, B, configures the MMU to point to B's memory, enables the MMU, and branches to the code for B. Each application's virtual address space starts at zero; the application remains unaware of its location in physical memory.

The point is that an operating system can use a base-bound mechanism to provide as much functionality as the static virtual memory mechanisms considered earlier. We can summarize:

A base-bound mechanism uses two values in the MMU to specify how a virtual address space maps onto the physical address space. The base-bound mechanism is powerful because an operating system can change the mapping dynamically.

13. Virtual Memory and Protection

Why is a bound register used in the base-bound approach? The answer is protection: although a base register is sufficient to establish the mapping from virtual address to physical address, the mapping does not prevent a program from accidentally or maliciously referencing arbitrary memory locations. In FIG. 7, for example, addresses higher than M lie beyond the region allocated to the program (i.e., the addresses may be allocated to another application).

The base-bound scheme uses the bound register to guarantee that a program will not exceed its allocated space. Of course, to implement protection, the MMU must check each memory reference and raise an error if the program attempts to reference an address greater than M. The protection offered by a base-bound mechanism provides an example of an important concept:

A virtual memory system that supports multiprogramming must also provide protection that prevents one application from reading or altering memory that has been allocated to another application.
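The check and the relocation can be summarized in a short sketch. The register values below are made-up examples, and a real MMU performs the comparison and addition in hardware on every reference; the sketch only shows the logic.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint32_t base_reg  = 0x00100000u; /* where the space is mapped */
    static uint32_t bound_reg = 0x00010000u; /* size of the space (64 Kbytes) */

    static uint32_t translate(uint32_t vaddr)
    {
        if (vaddr >= bound_reg) {            /* protection check */
            fprintf(stderr, "memory fault at 0x%x\n", (unsigned)vaddr);
            exit(1);                         /* a real MMU raises an error */
        }
        return base_reg + vaddr;             /* relocation */
    }

    int main(void)
    {
        printf("0x%x\n", (unsigned)translate(0x100u)); /* prints 0x100100 */
        return 0;
    }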

14. Segmentation

The memory mappings described above are intended to map a complete address space (i.e., all memory that is needed for an application to run, including the compiled program and the data the program uses). We say that a virtual memory technology that maps an entire address space is a coarse granularity mapping. The alternative, which consists of mapping parts of an address space, is known as a fine granularity mapping.

To understand the motivation for a fine granularity mapping, consider a typical application program. The program consists of a set of functions, and flow passes from one function to another through a procedure call. Early computer architects observed that although memory was a scarce resource, coarse granularity virtual systems required an entire application to occupy memory. Most of the memory was unused because only one function was actively executing at any time.

To reduce the amount of memory needed, the architects proposed that each program be divided into variable-size blocks, and only the blocks of the program that are needed at any time be loaded in memory. That is, pieces of the program are kept on an external storage device, typically a disk, until one of them is needed. At that time, the operating system finds an unused region of memory that is large enough, and loads the piece into memory. The operating system then configures the MMU to establish the mapping between the virtual addresses that the piece uses and the physical addresses used to hold the piece. When a program piece is no longer used, the operating system copies the piece back to disk, thereby making the memory available for another piece.

The variable-size piece scheme is known as segmentation, and the pieces of programs are known as segments. Once proposed, segmentation generated many questions. What hardware support would be needed to make segmentation efficient? Should the hardware dictate an upper bound on the size of a segment? After much research and a few hardware experiments, segmentation faded. The central problem with segmentation arises after an operating system begins to move blocks in and out of memory. Because segments are variable size, the memory tends toward a situation in which the unused memory is divided into many small blocks.

Computer scientists use the term fragmentation to describe the situation, and say that the memory becomes fragmented†. We can summarize:

Segmentation refers to a virtual memory scheme in which programs are divided into variable-size blocks, and only the blocks currently needed are kept in memory. Because it leads to a problem known as memory fragmentation, segmentation is seldom used.

15. Demand Paging

An alternative to segmentation was invented that has become extremely successful. Known as demand paging, the technique follows the same general scheme as segmentation: divide a program into pieces, keep the pieces on external storage until they are needed, and load an individual piece when the piece is referenced.

The most significant difference between demand paging and segmentation lies in how the program is divided. Instead of variable-size segments that are large enough to hold a complete function, demand paging uses fixed-size blocks called pages. Initially, when memories and application programs were much smaller, architects chose a page size of 512 bytes or 1 Kbyte; current architectures use larger page sizes (e.g., Intel processors use 4 Kbyte pages).

16. Hardware and Software for Demand Paging

A combination of two technologies is needed for an effective virtual memory system that supports demand paging:

-- Hardware to handle address mapping efficiently, record when each page is used, and detect missing pages

-- Software to configure the hardware, monitor page use, and move pages between external store and physical memory

†To avoid memory fragmentation, some architects experimented with larger, fixed-size segments (e.g., 64 Kbytes per segment).

Demand Paging Hardware. Technically, the hardware architecture provides an address mapping mechanism and allows software to handle the demand aspect. That is, software (usually an operating system) configures the MMU to specify which pages from a virtual address space are present in memory and where each page is located. Then, the operating system runs an application that uses the virtual address space that has been configured. The MMU translates each memory address until the application references an address that is not available (i.e., an address on one of the pages that is not present in memory).

A reference to a missing page is called a page fault, and is treated as an error condition (e.g., like a division by zero). That is, instead of fetching the missing page from external storage, the hardware merely informs the operating system that a fault has occurred and allows the operating system to handle the problem. Typically the hardware is arranged to raise an exception. The hardware saves the current state of the computation (including the address of the instruction that caused the fault), and then uses the exception vector. Thus, from the operating system's point of view, a page fault acts like an interrupt. Once the fault has been handled, the operating system can instruct the processor to restart execution at the instruction that caused the fault.

Demand Paging Software. Software in the operating system is responsible for management of the memory: software must decide which pages to keep in memory and which to keep on external storage. More important, the software fetches pages on demand. That is, once the hardware reports a page fault, paging software takes over. The software identifies the page that is needed, locates the page on secondary storage, locates a slot in memory, reads the page into memory, and reconfigures the MMU. Once the page has been loaded, the software resumes executing the application, and the fetch-execute cycle continues until another page fault occurs.

Of course, the paging hardware and software must work together. For example, when a page fault occurs, the hardware must save the state of the computation in such a way that the values can be reloaded later when execution resumes. Similarly, the software must understand exactly how to configure the MMU.
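To make the division of labor concrete, the toy simulation below sketches the software side of a page fault. Every name in it (find_frame, read_page_from_disk, and so on) is a hypothetical stand-in for operating system internals, and the sizes are deliberately tiny.

    #include <stdio.h>

    #define PAGE_SHIFT 12                /* 4 Kbyte pages */
    #define NPAGES     8                 /* toy address space: 8 pages */
    #define NFRAMES    4                 /* toy physical memory: 4 frames */

    struct pte { int present; int frame; };
    static struct pte page_table[NPAGES];
    static int next_free = 0;

    static int find_frame(void)
    {
        if (next_free < NFRAMES)
            return next_free++;
        return 0;                        /* memory full: a real OS would run
                                            page replacement here */
    }

    static void read_page_from_disk(int page, int frame)
    {
        printf("loading page %d into frame %d\n", page, frame);
    }

    /* Invoked after the hardware reports a fault at fault_addr. */
    static void handle_page_fault(unsigned long fault_addr)
    {
        int page  = (int)(fault_addr >> PAGE_SHIFT);
        int frame = find_frame();
        read_page_from_disk(page, frame);
        page_table[page].frame   = frame; /* reconfigure the MMU */
        page_table[page].present = 1;
        /* the hardware then restarts the faulting instruction */
    }

    int main(void)
    {
        handle_page_fault(0x3123);       /* a reference to page 3 */
        return 0;
    }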

17. Page Replacement

To understand paging, we must consider what happens after a set of applications has been running a long time. As applications reference pages, the virtual memory system moves the pages into memory. Eventually, the memory becomes full. The operating system knows when a page is needed because the application references the page.

A difficult decision involves selecting one of the existing pages to evict to make space for an incoming page. Moving a page between external storage and memory takes time, so performance is optimized by choosing to move a page that will not be needed in the near future. The process is known as page replacement.

Because page replacement is handled by software, the discussion of algorithms and heuristics is beyond the scope of this text. We will see, however, that the hardware provides mechanisms that assist the operating system in making a decision.

18. Paging Terminology and Data Structures

The term page refers to a block of a program's address space, and the term frame refers to a slot in physical memory that can hold a page. Thus, we say that software loads a page into a frame of memory. When a page is in memory, we say that the page is resident, and the set of all pages from an address space that are currently in memory is called the resident set.

The primary data structure used for demand paging is known as a page table. The easiest way to envision a page table is to imagine a one-dimensional array that is indexed by a page number. That is, entries in the table have index zero, one, and so on. Each entry in the page table either contains a null value (if the page is not resident) or the address of the frame in physical memory that currently holds the page. FIG. 8 illustrates a page table.


FIG. 8 Illustration of an active page table with some entries pointing to frames in memory. A null pointer in a page table entry indicates that the page is not currently resident in memory.

19. Address Translation in a Paging System

Items in FIG. 8 correspond to frames, not individual words. To understand paging hardware, imagine an address space divided into fixed-size pages as FIG. 9 illustrates.


FIG. 9 Illustration of a virtual address space divided into pages of K bytes per page.

We will see that the addresses associated with each page are important. As the figure shows, if each page contains K bytes, bytes on page zero have addresses zero through K-1, bytes on page 1 have addresses K through 2K-1, and so on.

Conceptually, translation of a virtual address, V, to a corresponding physical address, P, requires three steps:

1. Determine the number of the page on which address V lies.

2. Use the page number as an index into the page table to find the location of the frame in memory that holds the page.

3. Determine how far into the page V lies, and move that far into the frame in memory.

FIG. 9 illustrates how addresses are associated with pages. Mathematically, the page number on which an address lies, N, can be computed by dividing the address by the number of bytes per page, K:

page_number = N = ⌊ V / K ⌋ (eqn. 1)

Similarly, the offset of the address within the page, O, can be computed as the remainder of the division†.

offset = O = V modulo K (eqn. 2)

Thus, a virtual address, V, is translated to a corresponding physical address, P, by using the page number and offset, N and O, as follows:

physical_address = P = pagetable[N] + O (eqn. 3)

†Note that the computation of a byte address within a page is similar to the computation of a byte address within a word.
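Equations 1 through 3 transcribe directly into code. In the sketch below, the frame addresses stored in pagetable[] are invented for illustration.

    #include <stdio.h>

    #define K 4096u                          /* bytes per page */

    static unsigned long pagetable[] = { 0x8000, 0x3000, 0x6000 };

    int main(void)
    {
        unsigned long V = 5000;              /* a virtual address */
        unsigned long N = V / K;             /* eqn. 1: page number */
        unsigned long O = V % K;             /* eqn. 2: offset within page */
        unsigned long P = pagetable[N] + O;  /* eqn. 3: physical address */
        printf("page %lu, offset %lu, physical 0x%lx\n", N, O, P);
        return 0;
    }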

20. Using Powers of Two

As Section 11 discusses, an arithmetic operation, such as division, is too expensive to perform on each memory reference. Therefore, like other parts of a memory system, a paging system is designed to avoid arithmetic computation. The number of bytes per page is chosen to be a power of two, 2^q, which means that the address of the first byte in each frame has q low-order bits equal to zero. Interestingly, because the low-order bits of a frame address are always zero, a page table does not need to store a full address. The consequence of using a power of two is that the division and modulo operations specified in the mathematical equations can be replaced by extracting bits. Furthermore, the addition operation can be replaced by a logical or. As a result, instead of using Equations 1 through 3, the MMU performs the following computation to translate a virtual address, V, into a physical address, P:

P = pagetable[ high_order_bits(V) ] or low_order_bits(V) (eqn. 4)

FIG. 10 illustrates how MMU hardware performs a virtual address mapping.

When considering the figure, remember that hardware can move bits in parallel. Thus, the arrow that points from the low-order bits in the virtual address to the low-order bits in the physical address represents a parallel data path - the hardware sends all the bits at the same time. Also, the arrow from the page table entry to the high-order bits in the physical address means that all bits from the page table entry can be transferred in parallel.


FIG. 10 Illustration of how an MMU performs address translation on a paging system. Making the page size a power of two eliminates the need for division and remainder computation.
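The same translation, written with bit operations instead of division, looks like this; the sketch assumes 4 Kbyte pages (q = 12) and reuses the invented page table from the previous example. Because each frame address ends in twelve zero bits, the logical or produces the same result as the addition in eqn. 3.

    #include <stdio.h>

    #define PAGE_SHIFT  12                   /* pages of 2^12 = 4096 bytes */
    #define OFFSET_MASK 0xfffu               /* low-order twelve bits */

    static unsigned long pagetable[] = { 0x8000, 0x3000, 0x6000 };

    int main(void)
    {
        unsigned long V = 5000;
        unsigned long P = pagetable[V >> PAGE_SHIFT] | (V & OFFSET_MASK);
        printf("physical 0x%lx\n", P);       /* same answer as eqns. 1-3 */
        return 0;
    }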

21. Presence, Use, and Modified Bits

Our description of paging hardware omits several details. For example, in addition to a value that specifies the frame in which a page is located, each page table entry contains control bits that the hardware and software use to coordinate. FIG. 11 lists three control bits found in most paging hardware.

Control Bit     Meaning
------------    --------------------------------------------------------
Presence bit    Tested by hardware to determine whether the page is
                currently resident in memory
Use bit         Set by hardware whenever the page is referenced
Modified bit    Set by hardware whenever the page is changed


FIG. 11 Examples of control bits found in each page table entry and the actions hardware takes with each. The bits are intended to assist the page replacement software in the operating system.

Presence Bit. The most straightforward control bit is called a presence bit, which specifies whether the page is currently in memory. The bit is set by software and tested by the hardware. Once it has loaded a page and filled in other values in the page table entry, the operating system sets the presence bit to one; when it removes a page from memory, the operating system sets the presence bit to zero. When it translates an address, the MMU examines the presence bit in the page table entry - if the presence bit is one, translation proceeds, and if the presence bit is zero, the hardware declares that a page fault has occurred.

Use Bit. The use bit, which provides information needed for page replacement, is initialized to zero and later tested by software. The bit is set by hardware. The mechanism is straightforward: whenever it accesses a page table entry, the MMU sets the use bit to one. The operating system periodically sweeps through the page table, testing the use bit to determine whether the page has been referenced since the last sweep. A page that has not been referenced becomes a candidate for eviction; otherwise, the operating system clears the use bit and leaves the page for the next sweep.

Modified Bit. The modified bit is initialized and later tested by software. The bit is set by hardware. The paging software sets the bit to zero when a page is loaded. The MMU sets the bit to one whenever a write operation occurs to the page. Thus, the modified bit is one if any byte on the page has been written since the page was loaded. The value is used during page replacement - if a page is selected for eviction, the value of the modified bit tells the operating system whether the page must be written back to external storage or can be discarded (i.e., whether the page is identical to the copy on external storage).
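One plausible way to picture a page table entry is a packed structure in which the control bits sit beside the frame number. The layout below is illustrative only; the exact widths and positions vary from one processor to another.

    /* A hypothetical 32-bit page table entry. */
    struct pte {
        unsigned int frame    : 20;  /* high-order bits of frame address */
        unsigned int presence : 1;   /* set by software when page is loaded */
        unsigned int use      : 1;   /* set by hardware on each reference */
        unsigned int modified : 1;   /* set by hardware on each write */
        unsigned int unused   : 9;
    };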

22. Page Table Storage

Where do page tables reside? Some systems store page tables in a special MMU chip that is external to the processor. Of course, because memory references play an essential role in processing, the MMU must be designed to work efficiently. In particular, to ensure that memory references do not become a bottleneck, some processors use a special-purpose, high-speed hardware interface to access an MMU. The interface contains parallel wires that allow the processor and MMU to send many bits at the same time.

Surprisingly, many processors are designed to store page tables in memory! That is, the processor (or the MMU) contains a special-purpose register that the operating system uses to specify the location of the current page table. The location of a page table must be specified by giving a physical address. Typically, such systems are designed to divide memory into three regions as FIG. 12 illustrates.


FIG. 12 Illustration of how physical memory might be divided in an architecture that stores page tables in memory. A large area of physical memory is reserved for frames.

The design in the figure illustrates one of the motivations for a memory system composed of heterogeneous technologies: because page tables are used frequently, the memory used to store page tables needs high performance (e.g., SRAM). However, because high performance memory is expensive, overall cost can be reduced by using a lower-cost memory (e.g., DRAM) to store frames. Thus, an architect can design a system that uses SRAM to hold page tables and DRAM for frame storage.

23. Paging Efficiency and a Translation Look-aside Buffer

A central question underlies all virtual memory architectures: how efficient is the resulting system? To understand the question, it is important to realize that address translation must be performed on every memory reference: each instruction fetch, each operand that references memory, and each store of a result. Because memory is so heavily used, the mechanisms that implement address translation must be extremely efficient or translation will become a bottleneck. Architects are primarily concerned with the amount of time the MMU uses to translate a virtual address to a physical address; they are less concerned with the amount of time it takes for the operating system to configure page tables.

One technique used to optimize the performance of a demand paging system stands out as especially important. The technique uses special, high-speed hardware known as a Translation Lookaside Buffer (TLB) to achieve faster page table lookups. A TLB is a form of Content Addressable Memory that stores recently used values from a page table. When it first translates an address, the MMU places a copy of the page table entry in the TLB. On successive lookups, the hardware performs two operations in parallel: the standard address translation steps depicted in FIG. 10 and a high-speed search of the TLB. If the requested information is found in the TLB, the MMU aborts the standard translation and uses the information from the TLB. If the entry is not in the TLB, the standard translation proceeds.

To understand why a TLB improves performance, consider the fetch-execute cycle. A processor tends to fetch instructions from successive locations in memory. Furthermore, if the program contains a branch, the probability is extremely high that the destination will be nearby, probably on the same page. Thus, rather than randomly accessing pages, a processor tends to fetch successive instructions from the same page. A TLB improves performance because it optimizes successive lookups by avoiding indexing into a page table. The difference in performance is especially dramatic for architectures that store page tables in memory; without a TLB, such systems are too slow to be useful. We can summarize:

A special high-speed hardware device, called a Translation Lookaside Buffer (TLB), is used to optimize performance of a paging system. A virtual memory that does not have a TLB can be unacceptably slow.
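The following toy model illustrates the idea of consulting a TLB before walking the page table. For simplicity it is direct-mapped and written in software; a real TLB is a fully associative CAM searched in parallel by hardware, so the code is a sketch of the logic, not of the circuitry.

    #include <stdint.h>

    #define TLB_SLOTS  16
    #define PAGE_SHIFT 12                /* 4 Kbyte pages assumed */

    struct tlb_entry { int valid; uint32_t page; uint32_t frame; };
    static struct tlb_entry tlb[TLB_SLOTS];

    /* Returns 1 and fills in *frame on a hit; 0 means the MMU must
       index into the page table instead. */
    static int tlb_lookup(uint32_t vaddr, uint32_t *frame)
    {
        uint32_t page = vaddr >> PAGE_SHIFT;
        struct tlb_entry *e = &tlb[page % TLB_SLOTS];
        if (e->valid && e->page == page) {
            *frame = e->frame;
            return 1;                    /* hit: skip the page table */
        }
        return 0;                        /* miss: standard translation */
    }

    /* Called after a page table lookup to cache the mapping. */
    static void tlb_insert(uint32_t page, uint32_t frame)
    {
        struct tlb_entry *e = &tlb[page % TLB_SLOTS];
        e->valid = 1;
        e->page  = page;
        e->frame = frame;
    }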

24. Consequences for Programmers

Experience has shown that demand paging works well for most computer programs. The code that programmers produce tends to be organized into functions that each fit onto a page. Similarly, data objects, such as character strings, are designed so the data occupies consecutive memory locations, which means that once a page has been loaded, the page tends to remain resident for multiple references. Finally, some compilers understand paging, and optimize performance by placing data items onto pages.

One way that programmers can affect virtual memory performance arises from array access. Consider a two-dimensional array in memory. Most programming systems allocate an array in row-major order, which means that rows of an array are placed in contiguous memory as FIG. 13 illustrates.


FIG. 13 An illustration of a two-dimensional array stored in row-major order. A row is contiguous in memory.

As the figure shows, rows of the matrix occupy successive locations in memory. Thus, if A is a two-dimensional array of bytes, the location of A[i, j] is given by:

location(A) + i × Q + j

where Q is the number of bytes per row.

The chief alternative to row-major order is known as column-major order. When an array is stored in column-major order, the elements of a column occupy contiguous memory locations. The choice between row-major or column-major order is usually determined by the programming language and compiler, not by a programmer.

A programmer can control how a program iterates through an array, and a good choice can optimize virtual memory performance. For example, if a large array of characters, A[N,M], is stored in row-major order, the nested loops shown here:

    for i = 1 to N {
        for j = 1 to M {
            A[i,j] = 0;
        }
    }

will require less time to execute than a loop that varies the indices in the opposite order:

    for j = 1 to M {
        for i = 1 to N {
            A[i,j] = 0;
        }
    }

The difference in time arises because varying the row index will force the virtual memory system to move from one page of memory to another for each reference, but varying the column index means M successive references stay on the same page.
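The effect is easy to measure. The C program below, a variant of the loops above with sizes we chose arbitrarily, touches every element of a large array in both orders; on most systems the row-major traversal runs noticeably faster.

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    #define M 4096
    static char A[N][M];                 /* C stores arrays in row-major order */

    int main(void)
    {
        clock_t t0;

        t0 = clock();
        for (int i = 0; i < N; i++)      /* vary the column index innermost: */
            for (int j = 0; j < M; j++)  /* successive bytes, same page */
                A[i][j] = 0;
        printf("row order:    %.3f s\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (int j = 0; j < M; j++)      /* vary the row index innermost: */
            for (int i = 0; i < N; i++)  /* each reference jumps M bytes */
                A[i][j] = 0;
        printf("column order: %.3f s\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }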

25. The Relationship Between Virtual Memory and Caching

Two of the key technologies in virtual memory systems are related to caching: a TLB and demand page replacement. Recall that a TLB consists of a small, high-speed hardware mechanism that improves the performance of a demand paging system dramatically. In fact, a TLB is nothing more than a cache of address mappings: whenever it looks up a page table entry, the MMU stores the entry in the TLB. A successive lookup for the same page will receive an answer from the TLB.

Like many cache systems, a TLB usually uses the Least Recently Used replacement strategy. Conceptually, when an entry is referenced, the TLB moves the entry to the front of the list; when a new reference occurs and the cache is full, the TLB discards the page table entry at the back of the list to make space for the new entry. Of course, the TLB cannot afford to keep a linked list in memory. Instead, the TLB contains digital circuits that move values into a special-purpose Content Addressable Memory (CAM) at high speed.

Demand paging can be viewed as a form of caching. The cache corresponds to main memory, and the data store corresponds to the external storage where pages are kept until needed. Furthermore, the page replacement policy serves as a cache replacement policy. In fact, paging borrows the phrase replacement policy from caching.

Interestingly, thinking of demand paging as a cache can help us understand an important concept: how a virtual address space can be much larger than physical memory.

Like a cache, physical memory only holds a fraction of the total pages. From our analysis of caching, we know that the performance of a demand-paged virtual memory can approach the performance of physical memory. In other words:

The analysis of caching in the previous section shows that a computer with a small physical memory that uses demand paging can perform almost as well as if the computer had a physical memory large enough for the entire virtual address space.

26. Virtual Memory Caching and Cache Flush

If caching is used with virtual memory, should the cache be placed between the processor and the MMU or between the MMU and physical memory? That is, should the memory cache store pairs of virtual address and contents or physical address and contents? The answer is complex. On the one hand, using virtual addresses increases memory access speed because the cache can respond before the MMU translates the virtual address into a physical address. On the other hand, a cache that uses virtual addresses needs extra hardware that allows the cache to interact with the virtual memory system. To understand why, observe that a virtual memory system usually supplies the same address range to each running application (i.e., each process has addresses that start at zero). Now consider what happens when the operating system performs a context switch that stops running one process and runs another process. Suppose the memory cache contains an entry for address 2000 before the switch occurs. If the cache is unchanged during the context switch and the new process accesses location 2000, the cache will return the value from location 2000 in the old process. Therefore, when it changes from one process to another, the operating system must also change items in the cache.

How can a cache be engineered to avoid the problem of ambiguity that occurs when multiple processes use the same range of addresses? Architects use two solutions:

-- A cache flush operation

-- A disambiguating identifier

Cache Flush. One way to ensure that a cache does not report incorrect values consists of removing all existing entries from the cache. We say that the cache is flushed. In architectures that use flushing, the operating system must flush the cache whenever it performs a context switch to move from one application to another.

Disambiguation. The alternative to cache flushing involves the use of extra bits that identify the running process (or more precisely, the address space). The processor includes an extra hardware register that holds an address space ID. Many operating systems create an address space for each process, and use the process ID (an integer) to identify an address space. Whenever it switches to an application, the operating system loads the application's process ID into the address space ID register.

As FIG. 14 shows, the cache prepends the contents of the ID register onto the virtual address when it stores an item in the cache, which means that even if process 1 and process 2 both reference address 0, the two entries in the cache will differ.


FIG. 14 Illustration of an ID register used to disambiguate among a set of virtual address spaces. Each address space is assigned a unique number, which the operating system loads into the ID register.

As the figure illustrates, the cache is designed to use a longer address than the memory system. Before passing a request to the cache, the processor creates an artificially long address by concatenating a virtual address onto the process ID. The processor then passes the longer address to the cache. From the cache's point of view, there is no ambiguity: even if two applications reference the same virtual address, the ID bits distinguish between the two addresses.
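Forming the artificially long address amounts to concatenating the ID register with the virtual address. The bit widths in this sketch are assumptions chosen for illustration.

    #include <stdint.h>

    #define VA_BITS 32                   /* assumed virtual address width */

    /* Prepend the address space ID to the virtual address, as in
       FIG. 14, to form the key presented to the cache. */
    static uint64_t cache_key(uint32_t asid, uint32_t vaddr)
    {
        return ((uint64_t)asid << VA_BITS) | vaddr;
    }

    /* Even if process 1 and process 2 both reference virtual address 0,
       cache_key(1, 0) != cache_key(2, 0), so the cache entries differ. */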

27. Summary

Virtual memory systems present an abstract address space to a processor and to each application program running on the processor. A virtual memory system hides details of the underlying physical memory.

Several virtual memory architectures are possible. The virtual memory system can hide details of word addressing or can present a uniform address space that incorporates multiple, possibly heterogeneous, memory technologies.

Virtual memory offers convenience for programmers, support for multiprogramming, and protection. When multiple programs run concurrently, virtual memory can be used to provide each program with an address space that begins at zero.

Virtual memory technologies include base-bound, segmentation, and demand paging; demand paging is the most popular. A demand paging system uses page tables to map from a virtual address to a physical address; a high-speed search mechanism known as a TLB makes page table lookup efficient.

To avoid arithmetic computation, virtual memory systems make physical memory sizes and page sizes powers of two. Doing so allows the hardware to translate addresses using only bit extraction and logical operations, without division or other arithmetic.

Either physical or virtual addresses can be cached. If a cache uses virtual addresses, an ambiguity problem can arise when multiple applications (processes) use the same range of virtual addresses. Two techniques can be used to solve the ambiguity problem: the operating system can flush the cache whenever it switches from one application to another, or the cache hardware can be designed to use artificially long addresses where the high-order bits consist of an address space ID (typically, a process ID).

EXERCISES

1. Consider a computer using the virtual address space illustrated in FIG. 2. If a C programmer writes:

    char c;
    char *p;
    p = (char *)1073741826;
    c = *p;

Which memory module will be referenced, and where within the module will the referenced byte be located?

2. A traditional Intel PC has a hole in its memory address space between 640 kilobytes and 1 megabyte. Use FIG. 5 as an example, and draw a figure to scale showing the hole in a PC address space if the PC has 2 megabytes of memory.

3. Which of the four motivations for virtual memory help programmers? Explain.

4. Does demand paging require special hardware or special software? Explain.

5. Conceptually, a page table is an array. What is found in each element of the page table array, and how is it interpreted?

6. Consider the presence, use, and modified bits. For each bit, tell when the bit changes and whether the hardware or software makes the change.

7. Assuming a page size of 4K bytes, compute the page number and the offset for addresses 100, 1500, 8800, and 10000.

8. Write a computer program that takes two input values, a page size and an address, and computes the page number and offset for the address.

9. Extend the program in the previous exercise. If the page size is a power of two, do not use division or modulus operations when computing the answer.

10. Calculate the amount of memory needed to hold an example page table. Assume that each page table entry occupies thirty-two bits, the page size is 4 Kbytes, and a memory address occupies thirty-two bits.

11. Write a computer program that takes as input a page size and an address space size, and performs the calculation in the previous exercise. (You may restrict sizes to powers of two.)

12. What is page replacement, and is it performed by hardware or software?

13. Consider a two-level paging scheme. Assume the high-order ten bits of an address are used as an index into a directory table to select a page table. Assume each page table contains 1024 entries and the next ten bits of the address are used to select a page table entry. Also assume the final twelve bits of the address are used to select a byte on the page. How much memory is required for the directory table and page tables?

14. What is a TLB, and why is it necessary?

15. Write a program that references all locations in a large two-dimensional array stored in row-major order. Compare the execution times when the program iterates through rows and touches each column within a row to the time required when the program iterates through all columns and touches each row within a column. Explain the results.

16. If a memory system caches virtual addresses and each process has a virtual address space that starts at zero, what must an operating system do when changing from one process to another? Why?
