Computer Architecture: CPUs -- Caches and Caching





1. Introduction

The previous section discusses physical memory systems, focusing on the underlying technologies used to build memory systems and the organization of address spaces.

The section also discusses the organization of a physical memory into words.

This section takes a different view of the problem: instead of concentrating on technologies used to construct memory systems, the section focuses on a technology used to improve memory system performance. The section presents the fundamental concept of caching, shows how caching is used in memory systems, explains why caching is essential, and describes why caching achieves high performance with low cost.

2. Information Propagation in a Storage Hierarchy

Recall from Section 10 that storage mechanisms are organized into a hierarchy that includes general-purpose registers, main memory, and secondary storage. Data items migrate up and down the hierarchy, usually under control of software. In general, items move up the hierarchy when they are read, and down the hierarchy when they are written. For example, when generating code for an arithmetic computation, a compiler arranges to move items from memory into registers. After the computation finishes, the result may be moved back to memory. If an item must be kept after the program finishes, the programmer will arrange to copy the item from memory to secondary storage.

We will see how caching fits into the storage hierarchy, and will learn that a memory cache uses hardware rather than software to move items up and down in its part of the hierarchy.

3. Definition of Caching

The term caching refers to an important optimization technique used to improve the performance of any hardware or software system that retrieves information. In memory systems, caching can reduce the Von Neumann bottleneck. A cache acts as an intermediary. That is, a cache is placed on the path between a mechanism that makes requests and a mechanism that answers requests, and the cache is configured to intercept and handle all requests.

The central idea in caching is high-speed, temporary storage: the cache keeps a local copy of selected data, and answers requests from the local copy whenever possible.

Performance improvement arises because the cache is designed to return answers faster than the mechanism that normally fulfills requests. FIG. 1 illustrates how a cache is positioned between a mechanism that makes requests and a mechanism that answers requests.


FIG. 1 Conceptual organization of a cache, which is positioned on the path between a mechanism that makes requests and a storage mechanism that answers requests.

4. Characteristics of a Cache

The above description is purposefully vague because caching is a broad concept that appears in many forms in computer and communication systems. This section clarifies the definition by explaining the concept in more detail; later sections give examples of how caching can be used.

Although a variety of caching mechanisms exist, they share the following general characteristics:

-- Small

-- Active

-- Transparent

-- Automatic

Small. To keep economic cost low, the amount of storage associated with a cache is much smaller than the amount of storage needed to hold the entire set of data items.

Most cache sizes are less than ten percent of the main storage size; in many cases, a cache holds less than one percent as much as the data store. Thus, one of the central design issues revolves around the selection of data items to keep in the cache.

Active. A cache contains an active mechanism that examines each request and decides how to respond. Activities include: checking to see if a requested item is available in the cache, retrieving a copy of an item from the data store if the item is not available locally, and deciding which items to keep in the cache.

Transparent. We say that a cache is transparent, which means that a cache can be inserted without making changes to the requester or data store. That is, the interface the cache presents to the requester is exactly the same as the interface a data store presents, and the interface the cache presents to the data store is exactly the same as the interface a requester presents.

Automatic. In most cases, a cache mechanism does not receive instructions on how to act or which data items to store in the cache storage. Instead, a cache implements an algorithm that examines the sequence of requests, and uses the requests to determine how to manage the cache.

5. Cache Terminology

Although caching is used in a variety of contexts, some of the terminology related to caching has universal acceptance across all types of caching systems. A cache hit (abbreviated hit) is defined as a request that can be satisfied by the cache without access to the underlying data store. Conversely, a cache miss (abbreviated miss) is defined as a request that cannot be satisfied by the cache.

Another term characterizes a sequence of references presented to a cache. We say that a sequence of references exhibits high locality of reference if the sequence contains repetitions of the same requests; otherwise, we say that the sequence has low locality of reference. We will see that high locality of reference leads to higher performance. Locality refers to items in the cache. Therefore, if a cache stores large data items (e.g., pages of memory), repeated requests do not need to be identical as long as they reference the same item in the cache (e.g., memory references to items on the same page).

6. Best and Worst Case Cache Performance

We said that if a data item is stored in the cache, the cache mechanism can return the item faster than the data store. As FIG. 2 shows, we represent the costs of retrieval from the requester's view.


FIG. 2 Illustration of access costs when using a cache. Costs are measured with respect to the requester.

In the figure, Ch is the cost if an item is found in the cache (i.e., a hit), and Cm is the cost if an item is not found in the cache (i.e., a miss). Interestingly, individual costs are not informative. Observe that because a cache uses the contents of requests to determine which items to keep, the performance depends on the sequence of requests.

Thus, to understand caching, we must examine the performance on a sequence of requests. For example, we can easily analyze the best and worst possible behavior for a sequence of N requests. At one extreme, if each request references a new item, caching does not improve performance at all - the cache must forward each request to the data store. Thus, in the worst case, the cost is:

Cworst = N Cm (1)


It should be noted that our analysis ignores the administrative overhead required to maintain the cache. If we divide by N to compute the average cost per request, the result is Cm.

At the other extreme, if all requests in the sequence specify the same data item (i.e., the highest locality of reference), the cache can indeed improve performance.

When it receives the first request, the cache fetches the item from the data store and saves a copy; subsequent requests can be satisfied by using the copy in the cache.

Thus, in the best case, the cost is:

Cbest = Cm + (N - 1) Ch (2)

Dividing by N produces the cost per request:

Cper_request = Cbest / N = Cm / N - Ch / N + Ch (3)

As N → ∞, the first two terms approach zero, which means that the cost per request in the best case becomes Ch. We can understand why caching is such a powerful tool:

If one ignores overhead, the worst case performance of caching is no worse than if the cache were not present. In the best case, the cost per request is approximately equal to the cost of accessing the cache, which is lower than the cost of accessing the data store.

7. Cache Performance on a Typical Sequence

To estimate performance of a cache on a typical sequence of requests, we need to examine how the cache handles a sequence that contains both hits and misses. Cache designers use the term hit ratio to refer to the percentage of requests in the sequence that are satisfied from the cache. Specifically, the hit ratio is defined to be:

hit ratio = (number of requests that are hits) / (total number of requests) (4)

The hit ratio is a value between zero and one. We define a miss ratio to be one minus the hit ratio.

Of course, the actual hit ratio depends on the specific sequence of requests. Experience has shown that for many caches, the hit ratio tends to be nearly the same across the requests encountered in practice. In such cases, we can derive an equation for the cost of access in terms of the cost of a miss and the cost of a hit:

Cost = r Ch + (1-r) Cm (5)

where r is the hit ratio defined in Equation 4 above.

The cost of accessing the data store, given by Cm in the equation, is fixed. Thus, there are two ways a cache designer can improve the performance of a cache: increase the hit ratio or decrease the cost of a hit.
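To make Equation 5 concrete, the following C sketch computes the effective access cost for a range of hit ratios. The function name and the costs (2 ns per hit, 100 ns per miss) are invented for illustration; they are not measurements of any particular memory system.

#include <stdio.h>

/* Effective cost per access according to Equation 5: r Ch + (1-r) Cm.
 * r  - hit ratio (between 0 and 1)
 * ch - cost of a cache hit (nanoseconds)
 * cm - cost of a cache miss (nanoseconds)                              */
double effective_cost(double r, double ch, double cm)
{
    return r * ch + (1.0 - r) * cm;
}

int main(void)
{
    double ch = 2.0, cm = 100.0;      /* hypothetical costs */

    for (double r = 0.0; r <= 1.0; r += 0.25)
        printf("hit ratio %.2f -> %5.1f ns per access\n",
               r, effective_cost(r, ch, cm));
    return 0;
}

Running the sketch shows the cost falling from 100 ns when every request misses to 2 ns when every request hits, which illustrates why increasing the hit ratio is so effective.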

8. Cache Replacement Policy

How can a cache designer increase the hit ratio? There are two ways:

-- Increase the cache size

-- Improve the replacement policy

Increase the cache size. Recall that a cache is usually much smaller than the data store it serves. When operation begins, a cache keeps a copy of each response. Once the cache storage is full, an item must be removed from the cache before a new item can be added. A larger cache can store more items, which reduces the number of removals and increases the chance that a requested item will still be present.

Improve the replacement policy. A cache uses a replacement policy to decide which item to remove when a new item is encountered and the cache is full. The replacement policy specifies whether to ignore the new item or how to choose an item to evict to make space for the new item. A replacement policy that chooses to keep those items that will be referenced again can increase the hit ratio.

9. LRU Replacement

What replacement policy should be used? There are two issues. First, to increase the hit ratio, the replacement policy should retain those items that will be referenced most frequently. Second, the replacement policy should be inexpensive to implement, especially for a memory cache. One replacement policy that satisfies both criteria has become extremely popular. Known as Least Recently Used (LRU), the policy specifies replacing the item that was referenced the longest time in the past†.

LRU is easy to implement. The cache mechanism keeps a list of data items that are currently in the cache. When an item is referenced, the item moves to the front of the list; when replacement is needed, the item at the back of the list is removed.

LRU works well in many situations. In cases where the set of requests has a high locality of reference (i.e., where a cache can improve performance), a few items will be referenced again and again. LRU tends to keep those items in the cache, which means the cost of access is kept low.

We can summarize:

When its storage is full and a new item arrives, a cache must choose whether to retain the current set of items or replace one of the current items with the new item. The Least Recently Used (LRU) policy is a popular choice for replacement because it is trivial to implement and tends to keep items that will be requested again.
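To make the list manipulation concrete, here is a minimal software sketch of LRU bookkeeping in C. It assumes a tiny cache of four integer items kept in an array ordered from most to least recently used; a hardware memory cache implements the equivalent policy with circuits rather than code, and the names and sizes here are invented.

#include <string.h>

#define CACHE_SLOTS 4

/* items[0] is the most recently used entry;
 * items[count-1] is the least recently used. */
static int items[CACHE_SLOTS];
static int count = 0;

/* Reference item x. On a hit, move x to the front of the list;
 * on a miss, insert x at the front, evicting the least recently
 * used item if the cache is full. Returns 1 for a hit, 0 for a miss. */
int lru_reference(int x)
{
    for (int i = 0; i < count; i++) {
        if (items[i] == x) {
            memmove(&items[1], &items[0], i * sizeof(int));
            items[0] = x;              /* hit: move to front */
            return 1;
        }
    }
    if (count < CACHE_SLOTS)
        count++;                       /* room remains: grow the list */
    memmove(&items[1], &items[0], (count - 1) * sizeof(int));
    items[0] = x;                      /* miss: newest item in front  */
    return 0;
}

For example, the reference sequence 1, 2, 3, 4, 1, 5 leaves items 5, 1, 4, 3 in the cache; item 2 is evicted because it is the least recently used.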

10. Multilevel Cache Hierarchy

One of the most unexpected and astonishing aspects of caching arises from the use of caching to improve the performance of a cache! To understand how such an optimization is possible, recall that the insertion of a cache lowers the cost of retrieving items by placing some of the items closer to the requester. Now imagine an additional cache placed between the requester and the existing cache as FIG. 3 illustrates.


FIG. 3 The organization of a system with an additional cache inserted.

†Note that "least recently" always refers to how long ago the item was last referenced, not to the number of accesses.

Can a second cache improve performance? Yes, provided the cost to access the new cache is lower than the cost to access the original cache (e.g., the new cache is closer to the requester). In essence, the cost equation becomes:

Cost = r1 Ch1 + r2 Ch2 + (1 - r1 - r2) Cm (6)

where r1 denotes the fraction of hits for the new cache, r2 denotes the fraction of hits for the original cache, Ch1 denotes the cost of accessing the new cache, and Ch2 denotes the cost of accessing the original cache.
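As a brief illustration of Equation 6, the following C fragment evaluates the two-level cost for invented values; the numbers have no significance beyond showing how the terms combine.

/* Effective cost for a two-level hierarchy (Equation 6). */
double two_level_cost(double r1, double r2,
                      double ch1, double ch2, double cm)
{
    return r1 * ch1 + r2 * ch2 + (1.0 - r1 - r2) * cm;
}

/* Example with hypothetical values: r1 = 0.80, r2 = 0.15,
 * ch1 = 1, ch2 = 10, cm = 100 gives
 * 0.80*1 + 0.15*10 + 0.05*100 = 7.3 time units per request. */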

When more than one cache is used along the path from requester to data store, we say that the system implements a multilevel cache hierarchy. A set of Web caches provides an example of a multilevel hierarchy: the path from a browser running on a user's computer to a web server can pass through the local cache mechanism used by the browser as well as a cache at the user's ISP.

The point is:

An additional cache can be used to improve the performance of a system that already uses caching. Conceptually, the caches are arranged in a multilevel hierarchy.

11. Preloading Caches

How can cache performance be improved further? Cache designers observe that although many cache systems perform well in the steady state (i.e., after the system has run for a while), the system exhibits higher cost during startup. That is, the initial hit ratio is extremely low because the cache must fetch items from the data store. In some cases, the startup costs can be lowered by preloading the cache. That is, values are loaded into the cache before execution begins.

Of course, preloading only works in cases where the cache can anticipate requests.

For example, an ISP's Web cache can be preloaded with hot pages (i.e., pages that have been accessed frequently in the past day or pages for which the owner expects frequent access). As an alternative, some caches use an automated method of preloading. In one form, the cache periodically places a copy of its contents on nonvolatile storage, allowing recent values to be preloaded at startup. In another form, the cache uses a reference to pre-fetch related data. For example, if a processor accesses a byte of memory, the cache can fetch 128 bytes. Thus, if the processor accesses the next byte, which is likely, the value will come from the cache.

Prefetching is especially important for Web pages. A typical Web page contains references to multiple images, and before the page can be displayed, a browser must download a copy of each image and cache the copy on the user's computer. As a page is being downloaded, a browser can scan for references to images, and can begin to prefetch each of the images without waiting for the entire page to download.

12. Caches Used With Memory

Now that we understand the basic idea of caching, we can consider some of the ways caches are used in memory systems. In fact, the concept of caching originated with computer memory systems†. The original motivation was higher speed at low cost. Because memory was both expensive and slow, architects looked for ways to improve performance without incurring the cost of higher-speed memory. The architects discovered that a small amount of high-speed cache improved performance dramatically.

The result was so impressive that by the 1980s, most computer systems had a single cache located between the processor and memory. Physically, memory was on one circuit board and the cache occupied a separate circuit board, which allowed computer owners to upgrade the memory or the cache independently. As described above, a caching hierarchy can increase performance more than a single cache. Therefore, we will see that modern computers employ a hierarchy of memory caches and use caching in a variety of ways. The next sections present a few examples.

13. Physical Memory Cache

Caching has become popular as a way to achieve higher memory performance without significantly higher cost. Early computers used a physical memory system.

That is, when it generated a request, the processor specified a physical address, and the memory system responded to the physical address. Thus, to be inserted on the path between a processor and the memory, a cache had to understand and use physical addresses.

It may seem that a physical memory cache is trivial. We can imagine the memory cache receiving a fetch request, checking to see if the request can be answered from the cache, and then, if the item is not present, passing the request to the underlying memory. Furthermore, we can imagine that once an item has been retrieved from the underlying memory, the cache saves a copy locally, and then returns the value to the processor.

In fact, our imagined scenario is misleading - a physical memory cache is much more complex than the above description. To understand why, we must remember that hardware achieves high speed through parallelism. For example, when it encounters a fetch request, a memory cache does not check the cache and then access the physical memory. Instead, the cache hardware performs two tasks in parallel: the cache simultaneously passes the request to the physical memory and searches for an answer locally.

If it finds an answer locally, the cache must cancel the memory operation. If it does not find an answer locally, the cache must wait for the underlying memory operation to complete. Furthermore, when an answer does arrive from memory, the cache uses parallelism again by simultaneously saving a local copy of the answer and transferring the answer back to the processor. Parallel activities make the hardware complex. The point is:

†In addition to introducing the use of microcode, Maurice Wilkes is credited with inventing the concept of a memory cache in 1965.

To achieve high performance, a physical memory cache is designed to search the local cache and access the underlying memory simultaneously. Parallelism complicates the hardware.

14. Write Through and Write Back

In addition to parallelism, memory caches are also complicated by write (i.e., store) operations. There are two issues: performance and coherence. Performance is easiest to understand: caching improves the performance for retrieval requests, but not for storage requests. That is, a write operation takes longer because a write operation must change the value in the underlying memory. More important, in addition to forwarding the request to the memory, a cache must also check to see whether the item is in the cache. If so, the cache must update its copy as well. In fact, experience has shown that a memory cache should always keep a local copy of each value that is written because programs tend to access a value a short time after it has been stored.

Initial implementations of memory caches handled write operations as described above: the cache kept a copy and forwarded the write operation to the underlying memory. We use the term write-through cache to describe the approach.

The alternative, known as write-back cache, keeps a copy of a data item that is written, and waits until later to update the underlying physical memory. To know whether the underlying physical memory must be updated, a write-back cache keeps an extra bit with each item that is known as the dirty bit. In a physical memory cache, a dirty bit is associated with each block in the cache. When an item is fetched and a copy is placed in the cache, the dirty bit is initialized to zero. When the processor modifies the item (i.e., performs a write), the dirty bit is set to one. When it needs to eject a block from the cache, the hardware first examines the dirty bit associated with the block. If the dirty bit is one, a copy of the block is written to memory. If the dirty bit is zero, however, the block can simply be overwritten because data in the block is exactly the same as the copy in memory. The point is:

A write-back cache associates a dirty bit with each block to record whether the block has been modified since it was fetched. When ejecting a block from the cache, the hardware writes a copy of a dirty block to memory, but simply overwrites the contents if the block is not dirty.
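The following C sketch shows the bookkeeping a write-back cache performs for a single cache line. The structure, the helper logic, and the block size are invented for illustration; a real cache makes the same decisions in hardware.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8                    /* bytes per line (arbitrary) */

struct cache_line {
    int      valid;                     /* 1 if the line holds a block        */
    int      dirty;                     /* 1 if the block has been modified   */
    uint32_t block;                     /* which memory block is cached       */
    uint8_t  data[BLOCK_SIZE];
};

static uint8_t memory[1 << 16];         /* simulated underlying memory */

/* Write a dirty block back to memory; clean blocks are simply dropped. */
static void evict(struct cache_line *line)
{
    if (line->valid && line->dirty)
        memcpy(&memory[line->block * BLOCK_SIZE], line->data, BLOCK_SIZE);
    line->valid = 0;
    line->dirty = 0;
}

/* Store one byte: load the needed block if necessary, update only the
 * cached copy, and mark the line dirty so it is written back later.
 * Addresses are assumed to fit the simulated 64 KB memory.            */
void write_byte(struct cache_line *line, uint32_t addr, uint8_t value)
{
    uint32_t block  = addr / BLOCK_SIZE;
    uint32_t offset = addr % BLOCK_SIZE;

    if (!line->valid || line->block != block) {
        evict(line);                                    /* write back if dirty */
        memcpy(line->data, &memory[block * BLOCK_SIZE], BLOCK_SIZE);
        line->block = block;
        line->valid = 1;
    }
    line->data[offset] = value;
    line->dirty = 1;                                    /* memory is now stale */
}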

To understand why write-back improves performance, imagine a for loop in a program that increments a variable in memory on each iteration of the loop. A write-back cache places the variable in the cache the first time the variable is referenced. On each successive iteration, changes to the variable only affect the cached copy. Assume that once the loop ends, the program stops referencing the variable. Eventually, the program generates enough other references so that the variable is the least recently used item in the cache, and will be selected for replacement. When a new item is referenced and a cache slot is needed, the cache writes the value of the variable to the underlying physical memory. Thus, although the variable can be referenced or changed many times, the memory system only makes one access to the underlying physical memory†.

15. Cache Coherence

Memory caches are especially complex in a system with multiple processors (e.g., a multicore CPU). We said that a write-back cache achieves higher performance than a write-through cache. In a multiprocessor environment, performance is also optimized by giving each core its own cache. Unfortunately, the two optimizations conflict. To understand why, look at the architecture in FIG. 4, which shows two processors that each have a private cache.

†An optimizing compiler can further improve performance by using a general-purpose register to hold the variable until the loop finishes (another form of caching).


FIG. 4 Illustration of two processors sharing an underlying memory.

Because each processor has a separate cache, conflicts can occur if both processors reference the same memory address.

Now consider what happens if the two caches use a write-back approach. When processor 1 writes to a memory location X, cache 1 holds the value for X. Eventually, when it needs space, cache 1 writes the value to the underlying physical memory. Similarly, whenever processor 2 writes to a memory location, the value will be placed in cache 2 until space is needed. The problem should be obvious: without an additional mechanism, incorrect results will occur if both processors simultaneously issue read and write operations for a given address.

To avoid conflicts, all caches that access a given memory must follow a cache coherence protocol that coordinates the values. For example, when processor 2 reads from an address, A, the coherence protocol requires cache 2 to inform cache 1. If it currently holds address A, cache 1 writes A to the physical memory so cache 2 can obtain the most recent value. That is, a read operation on any processor triggers a write-back in any cache that currently holds a cached copy of the address. Similarly, if any processor issues a write operation for an address, A, all other caches must be informed to discard cached values of A. Thus, in addition to requiring additional hardware and a mechanism that allows the caches to communicate, cache coherency introduces additional delay.
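The sketch below illustrates the two coordination rules just described (write back before another processor's read, invalidate before another processor's write) for a pair of single-line caches. It is a simplified model, not an actual hardware protocol such as MESI, and every name in it is invented.

#define NCACHES 2

struct copy {
    int      valid;
    int      dirty;
    unsigned addr;                       /* address held by this cache */
    int      value;
};

static struct copy cache[NCACHES];       /* one private cache per core */
static int memory[1024];                 /* simulated shared memory    */

/* Before core p reads addr: any other cache holding a dirty copy must
 * write it back so that memory (and therefore p) sees the latest value. */
void coherence_before_read(int p, unsigned addr)
{
    for (int i = 0; i < NCACHES; i++)
        if (i != p && cache[i].valid && cache[i].addr == addr && cache[i].dirty) {
            memory[addr] = cache[i].value;          /* forced write-back */
            cache[i].dirty = 0;
        }
}

/* Before core p writes addr: every other cached copy is discarded
 * (written back first if dirty) so no core can read a stale value. */
void coherence_before_write(int p, unsigned addr)
{
    for (int i = 0; i < NCACHES; i++)
        if (i != p && cache[i].valid && cache[i].addr == addr) {
            if (cache[i].dirty)
                memory[addr] = cache[i].value;
            cache[i].valid = 0;                     /* invalidate the copy */
        }
}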

16. L1, L2, and L3 Caches

We said that arranging multiple caches into a hierarchy can improve overall performance. Indeed, most computer memory systems have at least two levels of cache hierarchy. To understand why computer architects added a second level of cache to the memory hierarchy, we must consider four facts:

-- A traditional memory cache was separate from both the memory and the processor.

-- To access a traditional memory cache, a processor used pins that connect the processor chip to the rest of the computer.

-- Using pins to access external hardware takes much longer than accessing functional units that are internal to the processor chip.

-- Advances in technology have made it possible to increase the number of transistors per chip, which means a processor chip can contain more hardware.

The conclusion should be clear. We know that adding a second cache can improve memory system performance, we further know that placing the second cache on the processor chip will make the cache access times much lower, and we know that technology now allows chip vendors to add more hardware to their chips. So, it makes sense to embed a second memory cache in the processor chip itself. If the hit ratio is high, most data references will never leave the processor chip - the effective cost of accessing memory will be approximately the same as the cost of accessing a register.

To describe the idea of multiple caches, computer manufacturers originally adopted the terms Level 1 cache (L1 cache) to refer to the cache onboard the processor chip, Level 2 cache (L2 cache) to refer to an external cache, and Level 3 cache (L3 cache) to refer to a cache built into the physical memory. That is, an L1 cache was originally on-chip and an L2 or L3 cache was off-chip.

In fact, chip sizes have become so large that a single chip can contain multiple cores and multiple caches. In such cases, manufacturers use the term L1 cache to describe a cache that is associated with one particular core, the term L2 cache to describe an on-chip cache that may be shared, and the term L3 cache to describe an on-chip cache that is shared by multiple cores. Typically, all cores share an L3 cache.

Thus, the distinction between on-chip and off-chip has faded.

When using traditional terminology for a multilevel cache hierarchy, an L1 cache is embedded on the processor chip, an L2 cache is external to the processor, and an L3 cache is built into the physical memory. More recent terminology defines an L1 cache to be associated with a single core, whereas L2 and L3 refer to on-chip caches that all cores share.

17. Sizes of L1, L2, and L3 Caches

Most computers employ a cache hierarchy. Of course, the cache at the top of the hierarchy is the fastest, but also the smallest. FIG. 5 lists example cache memory sizes. The L1 cache may be divided into separate instruction and data caches, as described in the next section.


FIG. 5 Example cache sizes in 2016. Although absolute sizes continue to change, readers should focus on the amount of cache relative to a RAM size of 4 GB to 32 GB.

18. Instruction and Data Caches

Should all memory references pass through a single cache? To understand the question, imagine instructions being executed and data being accessed. Instruction fetch tends to behave with high locality - in many cases, the next instruction to be executed is found at an adjacent memory address. Furthermore, the most time-consuming loops in a program are usually small, which means the entire loop can fit into a cache.

Although the data access in some programs exhibits high locality, the data access in others does not. For example, when a program accesses a hash table, the locations referenced appear to be random (i.e., the location referenced in one instant is not necessarily close to the location referenced in the next).

Differences between instruction and data behavior raise the question of how intermixing the two types of references affects a cache. In essence, the more random the sequence of requests becomes, the worse a cache performs (because the cache will save each value, even though the value will not be needed again). We can state a general principle:

Inserting random references in the series of requests tends to worsen cache performance; reducing the number of random references that occurs tends to improve cache performance.

19. Modified Harvard Architecture

Is performance optimized by having a separate cache for instructions and data? The simplistic answer is obvious. When both data and instructions are placed in the same cache, data references tend to push instructions out of the cache, lowering performance. Adding a separate instruction cache will improve performance.

The simplistic answer above is insufficient, however, because the question is not whether additional hardware will help, but how to choose among tradeoffs. Because additional hardware will generate more heat, consume more power, and in portable devices, deplete the battery faster, an architect must weigh all the costs of an additional cache. If an architect does decide to add more cache hardware, the question is how best to use the hardware. We know, for example, that increasing the size of a single cache will increase performance by avoiding collisions. If a cache becomes sufficiently large, intermixing instructions and data references will work fine. Would it be better to add a separate instruction cache or to retain a single cache and increase the size? Many architects have decided that the optimal way to use a modest amount of additional hardware lies in introducing a new I-cache (instruction cache) and using the existing cache as a D-cache (data cache). Separating instruction and data caches is trivial in a Harvard Architecture because an I-cache is associated with the instruction memory and a D-cache is associated with the data memory. Should architects abandon the Von Neumann Architecture?

Many architects have adopted a compromise in which a computer has separate instruction and data caches, but the caches lead to a single memory. We use the term Modified Harvard Architecture to characterize the compromise. FIG. 6 illustrates the modified architecture.


FIG. 6 A Modified Harvard Architecture with separate instruction and data caches leading to the same underlying memory.

20. Implementation of Memory Caching

Conceptually, each entry in a memory cache contains two values: a memory address and the value of the byte found at that address. In practice, storing a complete address with each entry is inefficient. Therefore, memory caches use clever techniques to reduce the amount of space needed. The two most important cache optimization techniques are known as:

-- Direct mapped memory cache

-- Set associative memory cache

We will see that, like virtual memory schemes, both cache implementations use powers of two to avoid arithmetic computation.

21. Direct Mapped Memory Cache

A direct mapped memory cache uses a mapping technique to avoid overhead.

Although memory caches are used with byte-addressable memories, a cache does not record individual bytes. Instead, a cache divides both the memory and the cache into a set of fixed-size blocks, where the block size, B (measured in bytes), is chosen to be a power of two. The hardware places an entire block in the cache whenever a byte in the block is referenced. Using cache terminology, we refer to a block in the cache as a cache line; the size of a direct mapped memory cache is often specified by giving the number of cache lines times the size of a cache line. For example, the size might be specified as 4K lines with 8 bytes per line. To envision such a cache, think of bytes in memory being divided into 8-byte segments and assigned to lines of the cache. FIG. 7 illustrates how bytes of memory would be assigned for a block size of eight in a cache that has four lines. (Note: a memory cache usually holds many more than four lines; a small cache size has been chosen merely as a simplified example for the figure.)

Observe that the blocks in memory are numbered modulo C, where C is the number of slots in the cache. That is, blocks are numbered from zero through C - 1 (C is 4 in the figure). Interestingly, using powers of two means that no arithmetic is required to map a byte address to a block number. Instead, the block number can be found by extracting a set of bits. In the figure, the block number can be computed by extracting the fourth and fifth bits of an address. For example, consider the byte with address 57 (111001 in binary). The fourth and fifth bits are 11, which is 3 in decimal and agrees with the block number in the figure. In address 44 (101100 in binary), the fourth and fifth bits are 01, and the block number is 1. We can express the mapping in programming language terms as:

b = (byte_address >> 3) & 0x03;

In terms of a memory cache, no computation is needed - the hardware places the value in an internal register and extracts the appropriate bits to form a block number.


FIG. 7 An example assignment of block numbers to memory locations for a cache of four blocks with eight bytes per block.

The key to understanding a direct mapped memory cache arises from the following rule: only a memory block numbered i can be placed in cache slot i. For example, the block with addresses 16 through 23 can be placed in slot 2, as can the block with addresses 48 through 55.

If multiple memory blocks can be placed in a given slot, how does the cache know which block is currently in a slot? The cache attaches a unique tag to each group of C blocks. For example, FIG. 8 illustrates how tag values are assigned to memory blocks in our example cache that has four slots.


FIG. 8 An example memory cache with space for four blocks and a memory divided into conceptual blocks of 8 bytes. Each group of four blocks in memory is assigned a unique tag.

To identify the block currently in a slot of the cache, each cache entry contains a tag value. Thus, if slot zero in the cache contains tag K, the value in slot zero corresponds to block zero from the area of memory that has tag K.

Why use tags? A cache must uniquely identify the entry in a slot. Because a tag identifies a large group of blocks rather than a single byte of memory, using a tag requires fewer bits to identify a section of memory than using a full memory address.

Furthermore, as the next section explains, choosing the block size and the size of memory identified by a tag to be powers of two makes cache lookup extremely efficient.

22. Using Powers of Two for Efficiency

Although the direct mapping described above may seem complex, using powers of two simplifies the hardware implementation. In fact, the hardware is elegant and extremely efficient. Instead of modulo arithmetic, both the tag and block number can be computed by extracting groups of bits from a memory address. The high-order bits of the address are used as the tag, the next set of bits forms a block number, and the final set of bits gives a byte offset within the block. FIG. 9 illustrates the division.

tag | block | offset


FIG. 9 Illustration of how using powers of two allows a cache to divide a memory address into three separate fields that correspond to a tag, a block number, and a byte offset within the block.
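The field extraction shown in FIG. 9 can be expressed in C. The sketch below assumes the small example cache of FIG. 7 (four slots, eight bytes per block), so the offset occupies three bits and the block number two bits; everything above those five bits forms the tag.

#include <stdint.h>

#define OFFSET_BITS 3                 /* 8 bytes per block    */
#define BLOCK_BITS  2                 /* 4 slots in the cache */

uint32_t offset_of(uint32_t addr)
{
    return addr & ((1u << OFFSET_BITS) - 1);            /* low 3 bits     */
}

uint32_t block_of(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << BLOCK_BITS) - 1);
}

uint32_t tag_of(uint32_t addr)
{
    return addr >> (OFFSET_BITS + BLOCK_BITS);           /* remaining bits */
}

/* Example: address 57 (111001 in binary) yields offset 1, block 3,
 * and tag 1, matching the assignment illustrated in FIG. 7.        */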

Once we know that all values can be obtained via bit extraction, the algorithm for lookup in a direct-mapped memory cache is straightforward. Think of the cache as an array. The idea is to extract the block number from the address, and then use the block number as an index into the array. Each entry in the array contains a tag and a value.

If the tag in the address matches the tag in the cache slot, the cache returns the value. If the tag does not match, the cache hardware must fetch the block from memory, place a copy in the cache, and then return the value. Algorithm 1 summarizes the steps.

The algorithm omits an important detail. Each slot in the cache has a valid bit that specifies whether the slot has been used. Initially (i.e., when the computer boots), all valid bits are set to 0 (to indicate that none of the slots contain blocks from memory).

When it stores a block in a slot, the cache hardware sets the valid bit to 1. When it examines the tag for a slot, the hardware reports a mismatch if the valid bit is zero, which forces a copy of the block to be loaded from memory.

=========

Algorithm 1

Given:

    A memory address

Find:

    The data byte at that address

Method:

    Extract the tag number, t, block number, b, and offset, o, from
    the address by selecting the appropriate bit fields

    Examine the tag in slot b of the cache

    If the tag in slot b of the cache matches t {
        Use o to select the appropriate byte from the block in
        slot b, and return the byte
    } else {    /* Update the cache */
        Fetch block b from the underlying memory
        Place a copy in slot b
        Set the tag on slot b to t
        Use o to select the appropriate byte from the block in
        slot b, and return the byte
    }

------------------------

Algorithm 1 Cache Lookup In A Direct Mapped Memory Cache

=========
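The following C sketch follows Algorithm 1, including the valid bit described above. It treats the cache as an array and simulates the underlying memory; the constants and names are invented, and, as the next section explains, real hardware performs these steps with parallel circuits rather than sequential code.

#include <stdint.h>
#include <string.h>

#define SLOTS       4                  /* cache lines (power of two)    */
#define BLOCK_SIZE  8                  /* bytes per line (power of two) */
#define OFFSET_BITS 3
#define BLOCK_BITS  2

struct slot {
    int      valid;                    /* 0 until the slot holds a block */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct slot cache[SLOTS];
static uint8_t memory[1 << 16];        /* simulated underlying memory */

/* Return the byte at addr, loading the containing block on a miss.
 * Addresses are assumed to fit the simulated 64 KB memory.          */
uint8_t cache_lookup(uint32_t addr)
{
    uint32_t o = addr & (BLOCK_SIZE - 1);
    uint32_t b = (addr >> OFFSET_BITS) & (SLOTS - 1);
    uint32_t t = addr >> (OFFSET_BITS + BLOCK_BITS);

    if (!cache[b].valid || cache[b].tag != t) {
        /* miss: fetch the entire block that contains addr */
        memcpy(cache[b].data,
               &memory[addr & ~(uint32_t)(BLOCK_SIZE - 1)], BLOCK_SIZE);
        cache[b].tag = t;
        cache[b].valid = 1;
    }
    return cache[b].data[o];           /* satisfied from the cache */
}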

23. Hardware Implementation of a Direct Mapped Cache

Algorithm 1 describes cache lookup as if the cache is an array and separate steps are taken to extract items and index the array. In fact, slots of a cache are not stored in an array in memory. Instead, they are implemented with hardware circuits, and the circuits work in parallel. For example, the first step that extracts items from the address can be implemented by placing the address in an internal register (a hardware circuit that consists of a set of latches), and arranging each bit of the address to be represented by the output of one latch. That is, once the address has been placed in the register, each bit of the address will be represented by a separate wire. The items t, b, and o in the address can be obtained merely by dividing the output wires into groups.

The second step in Algorithm 1 requires the cache hardware to examine one of the slots. The hardware uses a decoder to select exactly one of the slots. All slots are connected to common output wires; the hardware is arranged so that only a selected slot puts its output on the wires. A comparator circuit is used to compare the tag in the address with the tag in the selected slot. FIG. 10 gives a simplified block diagram of the hardware to perform cache lookup.


FIG. 10 Block diagram of the hardware used to implement lookup in a memory cache.

The circuit takes a memory address as an input, and produces two outputs. The valid output is 1 if and only if the specified address is found in the cache (i.e., the cache returns a value). The value output is the contents of memory at the specified address.

In the figure, each slot has been divided into a valid bit, a tag, and a value to indicate that separate hardware circuits can be used for each field. Horizontal lines from the decoder to each slot indicate a connection that can be used to activate the circuits for the slot. At any time, however, the decoder only selects one slot (in the figure, the selected slot is shown shaded).

Vertical lines through the slots indicate parallel connections. Hardware in each slot connects to the wires, but only a selected slot places a value on the vertical wires.

Thus, in the example, the input to the and gate only comes from the V circuit of the selected slot, the input to the comparator only comes from the Tag circuit of the selected slot, and the value output only comes from the Value circuit of the selected slot. The key point is that cache lookup can be performed quickly by combinatorial circuits.

24. Set Associative Memory Cache

The chief alternative to a direct mapped memory cache is known as a set associative memory cache. In essence, a set associative memory cache uses hardware parallelism to provide more flexibility. Instead of maintaining a single cache, the set associative approach maintains multiple underlying caches, and provides hardware that can search all of them simultaneously. More important, because it provides multiple underlying caches, a set associative memory cache can store more than one block that has the same number.

As a trivial example, consider a set associative cache in which there are two copies of the underlying hardware. FIG. 11 illustrates the architecture.


FIG. 11 Illustration of a set associative memory cache with two copies of the underlying hardware. The cache includes hardware to search both copies in parallel.

To understand the advantages of the set associative approach, consider a reference string in which a program alternately references two addresses, A1 and A2, that have different tags, but both have block number zero. In a direct mapped memory cache, the two addresses contend for a single slot in the cache. A reference to A1 loads the value of A1 into slot 0 of the cache, and a reference to A2 overwrites the contents of slot 0 with the value of A2. Thus, in an alternating sequence of references, every reference results in a cache miss. In a set associative memory cache, however, A1 can be placed in one of the two underlying caches, and A2 can be placed in the other. Thus, every reference results in a cache hit.

As the amount of parallelism increases, performance of a set associative memory cache increases. In the extreme case, a cache is classified as fully associative if each of the underlying caches contains only one slot, but the slot can hold an arbitrary value.

Note that the amount of parallelism determines a point on a continuum: with no parallelism, we have a direct mapped memory cache, and with full parallelism, we have the equivalent of a Content Addressable Memory (CAM).
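A software sketch of a two-way set associative lookup appears below. Each set holds two blocks, so the alternating references to A1 and A2 described above can both stay resident. Real hardware searches the ways in parallel; the sequential loop, the LRU counter, and all names here are invented for illustration.

#include <stdint.h>

#define SETS 4                        /* selected by the block-number bits */
#define WAYS 2                        /* two-way: two blocks per set       */

struct way {
    int      valid;
    uint32_t tag;
    uint32_t last_used;               /* used to pick a victim on a miss   */
};

static struct way cache[SETS][WAYS];
static uint32_t now;                  /* simple reference counter          */

/* Access the block identified by (set, tag).
 * Returns 1 on a hit; on a miss, replaces the least recently used way. */
int sa_access(uint32_t set, uint32_t tag)
{
    int victim = 0;

    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = ++now;        /* hit */
            return 1;
        }
        if (cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;                             /* oldest way so far */
    }
    cache[set][victim].valid = 1;                   /* miss: replace victim */
    cache[set][victim].tag = tag;
    cache[set][victim].last_used = ++now;
    return 0;
}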

25. Consequences for Programmers

Experience has shown that caching works well for most computer programs. The code that programmers produce tends to contain loops, which means a processor will repeatedly execute a small set of instructions before moving on to another set.

Similarly, programs tend to reference data items multiple times before moving on to a new data item. Furthermore, some compilers are aware of caching, and help optimize the generated code to take advantage of the cache.

Despite the overwhelming success of caching, programmers who understand how a cache works can write code that exploits a cache. For example, consider a program that must perform many operations on each element of a large array. It is possible to perform one operation at a time (in which case the program iterates through the array many times) or to perform all the operations on a single element of the array before moving to the next element (in which case the program iterates through the array once). From the point of view of caching, the latter is preferable because the element will remain in the cache.
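The difference can be seen in a small C example. Both functions below perform the same two operations on every element of a large array; the array size and the operations are arbitrary, chosen only to contrast the two loop structures.

#define N 10000000
static double a[N];

/* One operation per pass: by the time the second pass reaches a[i],
 * the element has usually been evicted, so it is fetched from memory twice. */
void many_passes(void)
{
    for (long i = 0; i < N; i++) a[i] *= 2.0;     /* first operation  */
    for (long i = 0; i < N; i++) a[i] += 1.0;     /* second operation */
}

/* All operations per element: a[i] is still in the cache for the second
 * operation, so each element is brought in from memory only once. */
void one_pass(void)
{
    for (long i = 0; i < N; i++) {
        a[i] *= 2.0;
        a[i] += 1.0;
    }
}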

26. Summary

Caching is a fundamental optimization technique that can be used in many contexts. A cache intercepts requests, automatically stores values, and answers requests quickly, whenever possible. Variations include a multilevel cache hierarchy and preloaded caches.

Caches provide an essential performance optimization for memories. Most computer systems employ a multilevel memory cache. Originally, an L1 cache resided on an integrated circuit along with the processor, an L2 cache was located external to the processor, and an L3 cache was associated with the memory. As integrated circuits became larger, manufacturers moved L2 and L3 caches onto the processor chip, using the distinction that an L1 cache is associated with a single core and L2/L3 caches are shared among multiple cores.

A technology known as a direct mapped memory cache handles lookup without keeping a list of cached items. Although we think of the lookup algorithm as performing multiple steps, a hardware implementation of a direct mapped memory cache can use combinatorial logic circuits to perform the lookup without needing a processor. A set associative memory cache extends the concept of direct mapping to permit parallel access.

EXERCISES

1. What does the term transparent mean when applied to a memory cache?

2. If the hit ratio for a given piece of code is 0.2, the time required to access the cache is 20 nanoseconds, and the time to access the underlying physical memory is 1 microsecond, what is the effective memory access time for the piece of code?

3. A physicist writes C code to iterate through a large 2-dimensional array:

float a[32768][32768], sum;

...

for (i = 0; i < 32768; i++) {
    for (j = 0; j < 32768; j++) {
        sum += a[j][i];
    }
}

The physicist complains that the code is running very slowly. What simple change can you make that will increase execution speed?

4. Consider a computer where each memory address is thirty-two bits long and the memory system has a cache that holds up to 4K entries. If a naive cache is used in which each entry of the cache stores an address and a byte of data, how much total storage is needed for the cache? If a direct mapped memory cache is used in which each entry stores a tag and a block of data that consists of four bytes, how much total storage is needed?

5. Extend the previous exercise. Assume the size of the cache is fixed, and find an alternative to the naive solution that allows the storage of more data items. Hint: what values are placed in the cache if the processor always accesses four-byte integers in memory?

6. Consult vendors' specifications and find the cost of memory access and the cost of a cache hit for a modern memory system (Ch and Cm in Section 6).

7. Use the values obtained in the previous exercise to plot the effective memory access cost as the hit ratio varies from zero to one.

8. Using the values of Ch and Cm obtained in Exercise 6, what value of the hit ratio, r, is needed to achieve an improvement of 30% in the mean access time of a memory system (as compared to the same memory system without a cache)?

9. State two ways to improve the hit ratio of a cache.

10. What is cache coherence, and what type of system needs it?

11. Write a computer program to simulate a direct mapped memory cache using a cache of 64 blocks and a block size of 128 bytes. To test the program, create a 1000 x 1000 array of integers. Simulate the address references if a program walks the array in row-major order and column-major order. What is the hit ratio of your cache in each case?

12. The hardware diagram in FIG. 10 only shows the circuits needed for lookup. Extend the diagram to include circuits that load a value into the cache from memory.
