Computer Architecture: CPUs -- Physical Memory and Physical Addressing [part 2]




17. Memory Size and Address Space

How large can a memory be? It may seem that memory size is only an economic issue - more memory simply costs more money. However, size turns out to be an essential aspect of memory architecture because overall memory size is inherently linked to other design choices. In particular, the addressing scheme determines a maximum memory size.

Recall that data paths in a processor consist of parallel hardware. When the processor is designed, the designer must choose a size for each data path, register, and other hardware units. The choice places a fixed bound on the size of an address that can be generated or passed from one unit to another. Typically, the address size is the same as the integer size. For example, a processor that uses thirty-two-bit integers uses thirty-two-bit addresses, and a processor that uses sixty-four-bit integers uses sixty-four-bit addresses. As Section 3 points out, a string of k bits can represent 2^k values. Thus, a thirty-two-bit value can represent:

2^32 = 4,294,967,296

unique addresses (i.e., addresses 0 through 4,294,967,295). We use the term address space to denote the set of possible addresses.

The tradeoff between byte addressing and word addressing is now clear: given a fixed address size, the amount of memory that can be addressed depends on whether the processor uses byte addressing or word addressing. Furthermore, if word addressing is used, the amount of memory that can be addressed depends on the word size. For example, on a computer that uses word addressing with four bytes per word, a thirty-two-bit value can hold enough addresses for 17,179,869,184 bytes (i.e., four times as much as when byte addressing is used).
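The arithmetic can be sketched in C. The function below is a hypothetical helper (not from the text) that computes how many bytes an address of a given width can reach under word addressing; a word size of one byte corresponds to ordinary byte addressing. It assumes the address width is less than 64 bits so the result fits in a 64-bit integer.

```c
#include <stdint.h>

/* Bytes reachable with addr_bits-bit addresses when each address
   names a word of bytes_per_word bytes. With bytes_per_word == 1,
   this is ordinary byte addressing. Assumes addr_bits < 64. */
uint64_t addressable_bytes(unsigned addr_bits, unsigned bytes_per_word)
{
    return (uint64_t)bytes_per_word << addr_bits;
}
```

With thirty-two-bit addresses, addressable_bytes(32, 1) yields 4,294,967,296 and addressable_bytes(32, 4) yields 17,179,869,184, matching the figures above.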

18. Programming With Word Addressing

Many processors use byte addressing because byte addressing provides the most convenient interface for programmers. However, byte addressing does not maximize memory size. Therefore, specialized systems, such as processors designed for numeric processing, use word addressing to provide access to the maximum amount of memory for a given address size.

On a processor that uses word addressing, software must handle the details of byte manipulation. In essence, software performs the same function as a memory controller in a byte-addressed architecture. For example, to extract a single byte, software must read the appropriate word from memory, and then extract the byte. Similarly, to write a byte, software must read the word containing the byte, update the correct byte, and write the modified word back to memory. To optimize software performance, logical shifts and bit masking are used to manipulate an address rather than division or remainder computation. Similarly, shifts and logical operations are used to extract bytes from a word. For example, to extract the leftmost byte from a thirty-two-bit word, w, a programmer can code a C statement:

(w >> 24) & 0xff

The code performs a logical and with constant 0xff to ensure that only the low-order eight bits are kept after the shift is performed. To understand why the logical and is needed, recall from Section 3 that a right shift propagates the sign bit. Thus, if w contains 0xa1b2c3d2, the expression w >> 24 will produce 0xffffffa1. After the logical and, the result is 0xa1.
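The same shift-and-mask technique extends to a complete software byte interface for a word-addressed memory. The sketch below assumes four-byte words with the leftmost (most significant) byte first, as in the example above; the function names are illustrative. Using an unsigned word type sidesteps the sign-propagation issue that the logical and handles in the signed case.

```c
#include <stdint.h>

/* Fetch the byte with byte address i from word-addressed memory
   mem[]. The word index is i / 4 (computed with a shift) and the
   byte offset within the word is i % 4 (computed with a mask). */
uint8_t fetch_byte(const uint32_t *mem, uint32_t i)
{
    uint32_t word   = mem[i >> 2];   /* i / 4 via shift */
    uint32_t offset = i & 0x3;       /* i % 4 via mask  */
    return (word >> (24 - 8 * offset)) & 0xff;
}

/* Store byte value b at byte address i: read the containing word,
   replace one byte, and write the word back. */
void store_byte(uint32_t *mem, uint32_t i, uint8_t b)
{
    uint32_t shift = 24 - 8 * (i & 0x3);
    uint32_t word  = mem[i >> 2];
    word = (word & ~((uint32_t)0xff << shift)) | ((uint32_t)b << shift);
    mem[i >> 2] = word;
}
```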

19. Memory Size and Powers of Two

We said that a physical memory architecture can be characterized as follows:

Physical memory is organized into a set of M words that each contain N bytes; to make controller hardware efficient, M and N are each chosen to be powers of two.

The use of powers of two for word and address space size has an interesting consequence: the maximum amount of memory is always a power of two rather than a power of ten. As a result, memory size is measured in powers of two. For example, a Kilobyte (Kbyte) is defined to consist of 2^10 bytes, a Megabyte (MB) is defined to consist of 2^20 bytes, and a Gigabyte (GB) is defined to consist of 2^30 bytes. The terminology is confusing because memory is an exception. In computer networking, for example, a measure of Megabits per second refers to base ten. Thus, one must be careful when mixing memory size with other measures (e.g., although there are eight bits per byte, one Kilobyte of data in memory is not eight times larger than one Kilobit of data sent across a network). We can summarize:

When used to refer to memory, the prefixes kilo, mega, and giga are defined as powers of 2; when used with other aspects of computing, such as computer networking, the prefixes are defined as powers of 10.

20. Pointers and Data Structures

Memory addresses are important because they form the basis for commonly used data abstractions, such as linked lists, queues, and trees. Consequently, programming languages often provide facilities that allow a programmer to declare a pointer variable that holds a memory address, assign a value to a pointer, or dereference a pointer to obtain an item. In the C programming language, for example, the declaration:

char *cptr;

declares variable cptr to be a pointer to a character (i.e., to a byte in memory). The compiler allocates storage for variable cptr equal to the size of a memory address, and allows the variable to be assigned the address of an arbitrary byte in memory.

The autoincrement statement:

cptr++;

increases the value of cptr by one (i.e., moves to the next byte in memory).

Interestingly, the C programming language has a heritage of both byte and word addressing. When performing arithmetic on pointers, C accommodates the size of the underlying item. As an example, the declaration:

int *iptr;

declares variable iptr to be a pointer to an integer (i.e., a pointer to a word). The compiler allocates storage for variable iptr equal to the size of a memory address (i.e., the same size as allocated for cptr above). However, if the program is compiled and run on a processor that defines an integer to be four bytes, the autoincrement statement:

iptr++;

increases the value of iptr by four. That is, if iptr is declared to be the byte address of the beginning of a word in memory, autoincrement moves to the byte address of the next word in memory.

In fact, all the examples above assume a byte-addressable computer. The compiler generates code to increment a character pointer by one and an integer pointer by four.

Although C has facilities that allow a pointer to move from one word to the next, the language is intended for use with a byte-addressable memory.
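The scaling that the compiler applies can be observed directly. The short functions below (illustrative names, not from the text) measure how far one increment moves a char pointer and an int pointer; on a processor with four-byte integers, int_step returns 4.

```c
#include <stddef.h>

/* Distance in bytes covered by one increment of a char pointer,
   measured on a real array so the compiler's scaling is visible. */
size_t char_step(void)
{
    char a[2];
    char *p = &a[0];
    p++;                                   /* moves one byte */
    return (size_t)(p - &a[0]);
}

/* Distance in bytes covered by one increment of an int pointer. */
size_t int_step(void)
{
    int a[2];
    int *p = &a[0];
    p++;                                   /* moves one whole int */
    return (size_t)((char *)p - (char *)&a[0]);
}
```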

21. A Memory Dump

A trivial example will help us understand the relationship between pointers and memory addresses. Consider a linked list as FIG. 9 illustrates.


FIG. 9 Example of a linked list. Each pointer in the list corresponds to a memory address.

To create such a list, a programmer must write a declaration that specifies the contents of a node, and then must allocate memory to hold the list. In our trivial example, each node in the list will contain two items: an integer count and a pointer to the next node on the list. In C, a struct declaration is used to define the contents of a node:

struct node {
    int count;
    struct node *next;
};

Similarly, a variable named head that serves as the head of the list is defined as:

struct node *head;

To understand how the list appears in memory, consider a memory dump as FIG. 10 illustrates†.

[ †As the figure shows, a programmer can initialize memory to a hexadecimal value that makes it easy to identify items in a memory dump. In the example, a programmer has used the value deadbeef.]


FIG. 10 Illustration of a small portion of a memory dump that shows the contents of memory. The address column gives the memory address of the leftmost byte on the line, and all values are shown in hexadecimal.

The example in the figure is taken from a processor that uses byte addressing.

Each line of the figure corresponds to sixteen contiguous bytes of memory that are divided into four groups of four bytes. Each group contains eight hexadecimal digits to represent the values of four bytes. The address at the beginning of a line specifies the memory address of the first byte on that line. Therefore, the address on each line is sixteen greater than the address on the previous line.

Assume the head of a linked list is found at address 0x0001bde4, which is located on the first line of the dump. The first node of the list starts at address 0x0001bdf8, which is located on the second line of the dump, and contains the integer 192 (hexadecimal constant 000000c0).

The processor uses byte addressing, and bytes of memory are contiguous. In the figure, spacing has been inserted to divide the output into groups of bytes to improve readability. Specifically, the example shows groups of four-byte units, which implies that the underlying word size is four bytes (i.e., thirty-two bits).
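Given the layout of the dump, locating an arbitrary byte address is a matter of masking low-order bits. The helpers below (hypothetical names) assume the format of FIG. 10: sixteen bytes per line, divided into four four-byte groups.

```c
#include <stdint.h>

/* Address printed at the start of the dump line containing addr
   (each line covers sixteen bytes, so clear the low four bits). */
uint32_t dump_line_addr(uint32_t addr) { return addr & ~0xfu; }

/* Which of the four 4-byte groups on the line holds addr. */
unsigned dump_group(uint32_t addr) { return (addr >> 2) & 0x3; }

/* Which byte within that group holds addr. */
unsigned dump_byte(uint32_t addr) { return addr & 0x3; }
```

For example, the node address 0x0001bdf8 lies on the line labeled 0x0001bdf0, in the third four-byte group.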

22. Indirection and Indirect Operands

When we discussed operands and addressing modes in Section 7, the topic of indirection arose. Now that we understand memory organization, we can understand how a processor evaluates an indirect operand. As an example, suppose a processor executes an instruction in which an operand specifies an immediate value of 0x1be1f, and specifies indirection. Further suppose that the processor has been designed to use thirty-two-bit values. Because the operand specifies an immediate value, the processor first loads the immediate value (hexadecimal 1be1f). Because the operand specifies indirection, the processor treats the resulting value as an address in memory, and fetches the word from the address. If the values in memory correspond to the values shown in FIG. 10, the processor will load the value from the rightmost word in the last line of the figure, and the final operand value will be 6.
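The evaluation steps can be modeled in a few lines of C. In the sketch below, mem is a word-addressed array that stands in for physical memory, and the byte address taken from the immediate is scaled by the four-byte word size; the function name is illustrative.

```c
#include <stdint.h>

/* Model of indirect operand evaluation: treat the immediate value
   as a byte address in memory and fetch the word stored there.
   mem[] holds four-byte words, so the byte address is divided by
   four (via a shift) to obtain the word index. */
uint32_t indirect_operand(const uint32_t *mem, uint32_t immediate)
{
    return mem[immediate >> 2];
}
```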

23. Multiple Memories With Separate Controllers

Our discussion of physical memory has assumed a single memory and a single memory controller. In practice, however, some architectures contain multiple physical memories. When multiple memories are used, hardware parallelism can be employed to provide higher memory performance. Instead of a single memory and a single controller, the memory system can have multiple controllers that operate in parallel, as FIG. 11 illustrates.

In the figure, interface hardware receives requests from the processor. The interface uses the address in the request to decide which memory should be used, and passes the request to the appropriate memory controller†.

[ † Section 13 explains that the interface acts as a Memory Management Unit (MMU), and explains the functionality in more detail.]

Why does it help to have multiple memories, each with its own controller? Remember that after memory is accessed, the hardware must be reset before the next access can occur. If two memories are available, a programmer can arrange to access one while the other resets, increasing the overall performance. That is, because memory controllers can operate in parallel, using two memory controllers allows more memory accesses to occur per unit time. In a Harvard Architecture, for example, higher performance results because instruction fetch does not interfere with data access and vice versa.


FIG. 11 Illustration of connections for two memory modules with separate controllers.

24. Memory Banks

Multiple physical memories can also be used with a Von Neumann Architecture as a convenient way to form a large memory by replicating small memory modules. The idea, known as memory banks, uses the interface hardware to map addresses onto two physical memories. For example, suppose two identical memory modules are each designed to have physical addresses 0 through M-1. The interface can arrange to treat them as two banks that form a contiguous large memory with twice the addresses as FIG. 12 illustrates.


FIG. 12 The logical arrangement of two identical memory banks to form a single memory that is twice the size.

In FIG. 12, addresses 0 through M-1 are assigned to one bank and addresses from M to 2M - 1 are assigned to the second bank. In Section 13, we will see that mapping an address can be extremely efficient.

Although the banks are arranged to give the illusion of one large memory, the underlying hardware is configured as FIG. 11 shows. As a result, controllers for the two memory banks can operate in parallel. Thus, if instructions are placed in one bank and data in another, higher performance can result because instruction fetch will not interfere with data access and vice versa.
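The mapping the interface performs can be sketched as follows. The code assumes a hypothetical bank size M of 2^20 words; because M is a power of two, a compiler reduces the division and remainder to a shift and a mask.

```c
#include <stdint.h>

#define BANK_SIZE 0x100000u   /* M: words per bank (assumed 2^20) */

/* Split an address in the combined space into a bank number and
   an offset within that bank. */
unsigned bank_of(uint32_t addr)   { return addr / BANK_SIZE; }
uint32_t offset_of(uint32_t addr) { return addr % BANK_SIZE; }
```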

How do memory banks appear to a programmer? In most architectures, memory banks are transparent - memory hardware automatically finds and exploits parallelism.

In embedded systems and other special-purpose architectures, a programmer may be responsible for placing items into separate memory banks to increase performance. For example, a programmer may need to place code at low memory addresses and data items at high memory addresses.

25. Interleaving

A related optimization used with physical memory systems is known as interleaving. To understand the optimization, observe that many programs access data from sequential memory locations. For example, if a long text string is copied from one place in memory to another or a program searches a list of items, sequential memory locations will be referenced. In a banked memory, sequential locations lie in the same memory bank, which means that successive accesses must wait for the controller to reset.

Interleaving uses the idea of separate controllers, but instead of organizing memories into banks, interleaving places consecutive words of memory in separate physical memory modules. Interleaving achieves high performance during sequential memory accesses because a word can be fetched while the memory for the previous word resets. Interleaving is usually hidden from programmers -- a programmer can write code without knowing that the underlying memory system has mapped successive words into separate memory modules. The memory hardware handles all the details automatically.

We use the terminology N-way interleaving to describe the number of underlying memory modules (to make the scheme efficient, N is chosen to be a power of two). For example, FIG. 13 illustrates how words of memory are assigned to memory modules in a four-way interleaving scheme.

How can interleaving be achieved efficiently? The answer lies in thinking about the binary representation. In FIG. 13, for example, words 0, 4, 8, and so on all lie in memory module 0. What do the addresses have in common? When represented in binary, the values all have two low-order bits equal to 00. Similarly, the words assigned to module 1 have low-order bits equal to 01, the words assigned to module 2 have low-order bits equal to 10, and the words assigned to module 3 have low-order bits equal to 11. Thus, when given a memory address, the interface hardware extracts the low-order two bits, and uses them to select a module.


FIG. 13 Illustration of 4-way interleaving that illustrates how successive words of memory are placed into memory modules to optimize performance.

Interestingly, accessing the correct word within a module is equally efficient. The modules themselves are standard memory modules that provide an array of words addressed 0 through K-1, for some value of K. The interface ignores the two low-order bits of an address and uses the rest of the bits as an index into the memory module. To see why it works, write 0, 4, 8, ... in binary, and remove the two low-order bits. The result is the sequence 0, 1, 2,... Similarly, removing the two low-order bits from 1, 5, 9... also results in the sequence 0, 1, 2,...

We can summarize:

If the number of modules is a power of two, the hardware for N-way interleaving is extremely efficient because low-order bits of an address are used to select a module and the remaining bits are used as an address within the module.
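In C, the selection for four-way interleaving reduces to one mask and one shift (function names are illustrative):

```c
#include <stdint.h>

/* 4-way interleaving: the two low-order bits of a word address
   select a module; the remaining bits index within the module. */
unsigned module_of(uint32_t word_addr)       { return word_addr & 0x3; }
uint32_t index_in_module(uint32_t word_addr) { return word_addr >> 2; }
```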

26. Content Addressable Memory

An unusual form of memory exists that blends the two key aspects we discussed: technology and memory organization. The form is known as a Content Addressable Memory (CAM). As we will see, a CAM does much more than merely store data items -- it includes hardware for high-speed searching.

The easiest way to think about a CAM is to view it as memory that has been organized as a two-dimensional array. Each row, which is used to store an item, is called a slot. In addition to allowing a processor to place a value in each slot, a CAM allows a processor to specify a search key that is exactly as long as one slot. Once a search key has been specified, the hardware can perform a search of the table to determine whether any slot matches the search key. FIG. 14 illustrates the organization of a CAM.


FIG. 14 Illustration of a Content Addressable Memory (CAM). CAM provides both a memory technology and a memory organization.

For the most basic form of a CAM, the search mechanism performs an exact match. That is, the CAM hardware compares the key against each slot, and reports whether a match was found. Unlike an iterative search performed by a conventional processor, a CAM reports results instantly. In essence, each slot in a CAM contains hardware that performs the comparison. We can imagine wires leading from the bits of the key down through all the slots. Each slot contains gates that compare bits of the key to bits of the value in the slot. Because the hardware for all slots operates in parallel, the time required to perform the search does not depend on the number of slots.

Of course, parallel search hardware makes CAM extremely expensive because the search mechanism must be replicated for each slot. It also means CAM consumes much more power (and produces much more heat) than a conventional memory. Thus, an architect only uses a CAM when lookup speed is more important than cost and power consumption. For example, in a high-speed Internet router, the system must check each incoming packet to determine whether other packets have arrived previously from the same source. To handle high-speed connections, some designs use a CAM to store a list of source identifiers. The CAM allows a search to be performed fast enough to accommodate packets arriving at a high rate (i.e., over a high-speed network).
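Although real CAM hardware compares the key against every slot simultaneously, a sequential software model makes the exact-match behavior concrete. The sketch below uses illustrative names and an assumed fixed number of slots; the loop stands in for comparisons that the hardware performs in parallel.

```c
#include <stdint.h>
#include <stddef.h>

#define SLOTS 8   /* number of CAM slots (assumed) */

/* Exact-match search: return the index of the first slot whose
   contents equal the key, or -1 if no slot matches. */
int cam_search(const uint32_t slot[SLOTS], uint32_t key)
{
    for (size_t i = 0; i < SLOTS; i++)
        if (slot[i] == key)
            return (int)i;
    return -1;
}
```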

27. Ternary CAM

An alternative form of CAM, known as Ternary CAM (TCAM), extends the idea of CAM to provide partial match searches. In essence, each bit in a slot can have three values: zero, one, or "don't care." Like a standard CAM, a TCAM performs the search operation in parallel by examining all slots simultaneously. Unlike a standard CAM, a TCAM only performs the match on bits that have the value zero or one. Partial matching allows a TCAM to be used in cases where two or more entries in the CAM overlap - a TCAM can find the best match (e.g., the longest prefix match).
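A software model again helps. Representing each TCAM slot as a value plus a "care" mask (a zero bit in the mask means "don't care") reduces a partial match to an exclusive-or and a mask; the type and function names below are illustrative.

```c
#include <stdint.h>

/* A TCAM slot as value + care mask: a bit of 'care' equal to 0
   means "don't care" for that position. */
typedef struct {
    uint32_t value;
    uint32_t care;
} tcam_slot;

/* A key matches when every cared-about bit agrees with the stored
   value; bits where care == 0 are ignored. */
int tcam_match(tcam_slot s, uint32_t key)
{
    return ((key ^ s.value) & s.care) == 0;
}
```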

28. Summary

We examined two aspects of physical memory: the underlying technology and the memory organization. Many memory technologies exist. Differences among them include permanence (RAM or ROM), clock synchronization, and the read and write cycle times.

Physical memory is organized into words and accessed through a controller.

Although programmers find byte addressing convenient, most underlying memory systems use word addressing. An intelligent memory controller can translate from byte addressing to word addressing. To avoid arithmetic computation in a controller, memory is organized so the address space and bytes per word are powers of two.

Programming languages, such as C, provide pointer variables and pointer arithmetic that allow a programmer to obtain and manipulate memory addresses. A memory dump, which shows the contents of memory along with the memory address of each location, can be used to relate data structures in a program to values in memory at run time.

Memory banks and interleaving both employ multiple memory modules. Banks are used to organize a large memory out of smaller modules. Interleaving places successive words of memory in separate modules to speed sequential access.

Content Addressable Memory (CAM) combines memory technology and memory organization. A CAM organizes memory as an array of slots, and provides a high-speed search mechanism.

EXERCISES

1. Smart phones and other portable devices typically use DRAM rather than SRAM. Explain why.

2. Explain the purpose of a DRAM refresh mechanism.

3. Assume a computer has a physical memory organized into 64-bit words. Give the word address and offset within the word for each of the following byte addresses: 0, 9, 27, 31, 120, and 256.

4. Extend the above exercise by writing a computer program that computes the answer.

The program should take a series of inputs that each consist of two values: a word size specified in bits and a byte address. For each input, the program should generate a word address and offset within the word. Note: although it is specified in bits, the word size must be a power of two bytes.

5. On an ARM processor, attempting to load an integer from memory will result in an error if the address is not a multiple of 4 bytes. What term do we use to refer to such an error?

6. If a computer has 64-bit addresses, and each address corresponds to one byte, how many gigabytes of memory can the computer address?

7. Compute the number of memory operations required for a two-address instruction if the instruction and both operands are unaligned.

8. Write a C function that declares a static integer array, M, and implements fetch and store operations that use shift and Boolean operations to access individual bytes.

9. Find the memory in a PC, identify the type of chips used, and look up the vendor's specification of the chips to determine the memory type and speed.

10. Redraw FIG. 13 for an 8-way interleaved memory.

11. Emulate a physical memory. Write a C program that declares an array, M, to be an array of 10,000 integers (i.e., an array of words). Implement two functions, fetch and store, that use array M to emulate a byte-addressable memory. fetch(i) returns the ith byte of the memory, and store(i, ch) stores the 8-bit character ch into the ith byte of the memory. Do not use byte pointers. Instead, use the ideas in this section to write code that computes the correct word that contains the specified byte and the offset within the word.

12. Simulate a TCAM. Write a program that matches an input string with a set of patterns.

For the simulation, use characters instead of bits. Allow each pattern to contain a string of characters, and interpret an asterisk as a "wild card" that matches any character. Can you find a way to make the match proceed faster than iterating through all patterns?

Updated: Wednesday, April 26, 2017 7:04 PST