Computer Architecture: Processor Types and Instruction Sets


1. Introduction

The previous section introduces a variety of processors and explains the fetch-execute cycle that programmable processors use. This section continues the discussion by focusing on the set of operations that a processor can perform. The section explains various approaches computer architects have chosen, and discusses the advantages and disadvantages of each. The next sections extend the discussion by describing the various ways processors access operands.

2. Mathematical Power, Convenience, And Cost

What operations should a processor offer? From a mathematical point of view, a wide variety of computational models provide equivalent computing power. In theory, as long as a processor offers a few basic operations, the processor has sufficient power to compute any computable function†.

Programmers understand that although only a minimum set of operations is necessary, a minimum is neither convenient nor practical. That is, the set of operations is designed for convenience rather than for mere functionality. For example, it is possible to compute a quotient by repeated subtraction. However, a program that uses repeated subtraction to compute a quotient runs slowly. Thus, most processors include hardware for each basic arithmetic operation: addition, subtraction, multiplication, and division.

†In a mathematical sense, only three operations are needed to compute any computable function: add one, subtract one, and branch if a value is nonzero.

To a computer architect, choosing a set of operations that the processor will perform represents a tradeoff. On the one hand, adding an additional arithmetic operation, such as multiplication or division, provides convenience for the programmer. On the other hand, each additional operation adds more hardware and makes the processor design more difficult. Adding hardware also increases engineering considerations such as chip size, power consumption, and heat dissipation. Thus, because a smart phone is designed to conserve battery power, the processors used in smart phones typically have fewer built-in operations than the processors used in powerful mainframe computers.

The point is that when considering the set of operations a given processor provides, we need to remember that the choice represents a complex tradeoff:

The set of operations a processor provides represents a tradeoff among the cost of the hardware, the convenience for a programmer, and engineering considerations such as power consumption.

3. Instruction Set Architecture

When an architect designs a programmable processor, the architect must make two key decisions:

-- Instruction set: the set of operations the processor provides

-- Instruction representation: the format for each operation

We use the term instruction set to refer to the set of operations the hardware recognizes, and refer to each operation as an instruction. We assume that on each iteration of its fetch-execute cycle, a processor executes one instruction.

The definition of an instruction set specifies all details about instructions, including an exact specification of actions the processor takes when it executes the instruction.

Thus, the instruction set defines values on which each instruction operates and the results the instruction produces. The definition specifies allowable values (e.g., the division instruction requires the divisor to be nonzero) and error conditions (e.g., what happens if an addition results in an overflow).

The term instruction representation (instruction format) refers to the binary representation that the hardware uses for instructions. The instruction representation is important because it defines a key interface: the interface between software that generates instructions and places them in memory and the hardware that executes the instructions. The software (e.g., the compiler, linker, and loader) must create an image in memory that uses exactly the same instruction format that the processor hardware expects.

We say that the definition of an instruction set and the corresponding representation define an Instruction Set Architecture (ISA). That is, an ISA defines both syntactic and semantic aspects of the instruction set. IBM Corporation pioneered the approach in the 1960s when it developed an ISA for its System/360 line of computers - with minor exceptions, all computers in the line shared the same basic instruction set, but individual models differed widely (approximately a 1:30 ratio) in the size of memory, processor speed, and cost.

4. Opcodes, Operands, And Results

Conceptually, each instruction contains three parts that specify: the exact operation to be performed, the value(s) to use, and where to place the result(s). The following paragraphs define the idea more precisely.

Opcode. The term opcode (short for operation code) refers to the exact operation to be performed. An opcode is a number; when the instruction set is designed, each operation must be assigned a unique opcode. For example, integer addition might be assigned opcode five, and integer subtraction might be assigned opcode twelve.

Operands. The term operand refers to a value that is needed to perform an operation. The definition of an instruction set specifies the exact number of operands for each instruction, and the possible values (e.g., the addition operation takes two signed integers).

Results. In some architectures, one or more of the operands specify where the processor should place results of an instruction (e.g., the result of an arithmetic operation); in others, the location of the result is determined automatically.

5. Typical Instruction Format

Each instruction is represented as a binary string. On most processors, an instruction begins with a field that contains the opcode, followed by fields that contain the operands. FIG. 1 illustrates the general format.


FIG. 1 The general instruction format that many processors use. The opcode at the beginning of an instruction determines exactly which operands follow.
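As a concrete illustration of the format in FIG. 1, the following sketch packs an opcode and three register numbers into a single thirty-two-bit word. The field widths and positions shown are hypothetical, chosen only for illustration; each instruction set defines its own layout.

```python
# Hypothetical 32-bit layout: an 8-bit opcode in the high-order byte,
# followed by three 4-bit register fields (remaining bits unused).
OPCODE_SHIFT = 24
R1_SHIFT, R2_SHIFT, R3_SHIFT = 20, 16, 12

def encode(opcode, r1, r2, r3):
    """Pack an opcode and three register numbers into one word."""
    return ((opcode << OPCODE_SHIFT) | (r1 << R1_SHIFT) |
            (r2 << R2_SHIFT) | (r3 << R3_SHIFT))

def decode(word):
    """Extract the opcode and register fields from a word."""
    opcode = (word >> OPCODE_SHIFT) & 0xFF
    r1 = (word >> R1_SHIFT) & 0xF
    r2 = (word >> R2_SHIFT) & 0xF
    r3 = (word >> R3_SHIFT) & 0xF
    return opcode, r1, r2, r3

# e.g., opcode 5 (integer addition) applied to registers 3, 6, and 7
word = encode(5, 3, 6, 7)
assert decode(word) == (5, 3, 6, 7)
```

Because the opcode occupies a fixed position at the beginning of the word, the hardware can examine the opcode first and then know how to interpret the fields that follow.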

6. Variable-Length vs. Fixed-Length Instructions

The question arises: should each instruction be the same size (i.e., occupy the same number of bytes) or should the length depend on the quantity and type of the operands? For example, consider integer arithmetic operations. Addition or subtraction operates on two values, but negation operates on a single value. Furthermore, a processor can handle multiple sizes of operands (e.g., a processor can have an instruction that adds a pair of sixteen-bit integers as well as an instruction that adds a pair of thirty-two bit integers). Should one instruction be shorter than another?

We use the term variable-length to characterize an instruction set that includes multiple instruction sizes, and the term fixed-length to characterize an instruction set in which every instruction is the same size. Programmers expect variable-length instructions because software usually allocates space according to the size of each object (e.g., if the strings "Hello" and "bye" appear in a program, a compiler will allocate 5 and 3 bytes, respectively).

From a hardware point of view, however, variable-length instructions require complex hardware to fetch and decode. By comparison, fixed-length instructions require less complex hardware. Fixed-length instructions allow processor hardware to operate at higher speed because the hardware can compute the location of the next instruction easily. Thus, many processors force all instructions to be the same size, even if some instructions can be represented in fewer bits than others. The point is:

Although it may seem inefficient to a programmer, using fixed-length instructions can make processor hardware less complex and faster.

How does a processor that uses fixed-length instructions handle cases where an instruction does not need all operands? For example, how does a fixed-length instruction set accommodate both addition and negation? Interestingly, the hardware is designed to ignore fields that are not needed for a given operation. Thus, an instruction set may specify that in some instructions, specific bits are unused†. To summarize:

When a fixed-length instruction set is employed, some instructions contain extra fields that the hardware ignores. The unused fields should be viewed as part of a hardware optimization, not as an indication of a poor design.

†Some hardware requires unused bits to be zero.

7. General-Purpose Registers

As we have seen, a register is a small, high-speed hardware storage device found in a processor. A register has a fixed size (e.g., 32 or 64 bits) and supports two basic operations: fetch and store. We will see later that registers can operate in a variety of roles, including as an instruction pointer (also called a program counter) that gives the address of the next instruction to execute. For now, we will restrict our attention to a simple case that is well known to programmers: general-purpose registers that are used as a temporary storage mechanism. A processor usually has a small number of general-purpose registers (e.g., thirty-two), and each register is usually the size of an integer. For example, on a processor that provides thirty-two bit arithmetic, each general-purpose register holds thirty-two bits. As a result, a general-purpose register can hold an operand needed for an arithmetic instruction or the result of such an instruction.

In many architectures, general-purpose registers are numbered from 0 through N-1. The processor provides instructions that can store a value into (or fetch a value from) a specified register. General-purpose registers have the same semantics as memory: a fetch operation returns the value specified in the previous store operation.

Similarly, a store operation replaces the contents of the register with a new value.

8. Floating Point Registers and Register Identification

Processors that support floating point arithmetic often use a separate set of registers to hold floating point values. Confusion can arise because both general-purpose registers and floating point registers are usually numbered starting at zero - the instruction determines which registers are used. For example, if registers 3 and 6 are specified as operands for an integer instruction, the processor will extract the operands from the general-purpose registers. However, if registers 3 and 6 are specified as operands for a floating point instruction, the floating point registers will be used.

9. Programming With Registers

Many processors require operands to be placed in general-purpose registers before an instruction is executed. Some processors also place the results of an instruction in a general-purpose register. Thus, to add two integers in variables X and Y and place the result in variable Z, a programmer must create a series of instructions that move values to the corresponding registers. For example, if general-purpose registers 3, 6, and 7 are available, the program might contain four instructions that perform the following steps:

-- Load a copy of variable X from memory into register 3

-- Load a copy of variable Y from memory into register 6

-- Add the value in register 3 to the value in register 6, and place the result in register 7

-- Store a copy of the value in register 7 to variable Z in memory
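The four steps above can be sketched as a small simulation, using a dictionary for memory and a list for the general-purpose registers (the variable values are hypothetical):

```python
memory = {'X': 20, 'Y': 22, 'Z': 0}   # variables in memory (hypothetical values)
regs = [0] * 8                        # general-purpose registers 0 through 7

regs[3] = memory['X']                 # load a copy of X into register 3
regs[6] = memory['Y']                 # load a copy of Y into register 6
regs[7] = regs[3] + regs[6]           # add registers 3 and 6; result in register 7
memory['Z'] = regs[7]                 # store a copy of register 7 to Z

assert memory['Z'] == 42
```

Note that the two loads and the store are exactly the expensive memory operations the next paragraph discusses; only the addition itself operates entirely within the processor.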

We will see that moving a value between memory and a register is relatively expensive, so performance is optimized by leaving values in registers if the value will be used again. Because a processor only contains a small number of registers, a programmer (or compiler) must decide which values to keep in the registers at any time; other values are kept in memory†. The process of choosing which values the registers contain is known as register allocation.

Many details complicate register allocation. One of the most common arises if an instruction generates a large result, called an extended value. For example, integer multiplication can produce a result that contains twice as many bits as either operand.

†The term register spilling refers to moving a value from a register back into memory to make the register available for a new value.

Some processors offer facilities for double precision arithmetic (e.g., if a standard integer is thirty-two bits wide, a double precision integer occupies sixty-four bits).

To handle extended values, the hardware treats registers as consecutive. On such processors, for example, an instruction that loads a double precision integer into register 4 will place half the integer in register 4 and the other half in register 5 (i.e., the value of register 5 will change even though the instruction contains no explicit reference).

When choosing registers to use, a programmer must plan for instructions that place extended data values in consecutive registers.
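The effect can be sketched as follows. The split shown (high-order half in register r, low-order half in register r+1) is an assumption for illustration; hardware designs differ on which half goes where.

```python
REG_BITS = 32                  # width of one general-purpose register
MASK = (1 << REG_BITS) - 1
regs = [0] * 8

def load_double(r, value):
    """Place a 64-bit value in registers r and r+1 (hypothetical split:
    high-order half in r, low-order half in r+1)."""
    regs[r] = (value >> REG_BITS) & MASK
    regs[r + 1] = value & MASK

load_double(4, 0x1234567890ABCDEF)
assert regs[4] == 0x12345678   # register 4 holds the high-order half
assert regs[5] == 0x90ABCDEF   # register 5 changed with no explicit reference
```

A programmer who had placed an unrelated value in register 5 would find it silently overwritten, which is exactly why extended values must be planned for during register allocation.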

10. Register Banks

An additional hardware detail complicates register allocation: some architectures divide registers into multiple banks, and require the operands for an instruction to come from separate banks. For example, on a processor that uses two register banks, an integer add instruction may require the two operands to be from separate banks.

To understand register banks, we must examine the underlying hardware. In essence, register banks allow the hardware to operate faster because each bank has a separate physical access mechanism and the mechanisms operate simultaneously. Thus, when the processor executes an instruction that accesses two operands in registers, both operands can be obtained at the same time. FIG. 2 illustrates the concept.


FIG. 2 Illustration of eight registers divided into two banks. Hardware allows the processor to access both banks at the same time.

Register banks have an interesting consequence for programmers: it may not be possible to keep data values in registers permanently. To understand why, consider the following assignment statements that are typical of those used in a conventional programming language, and assume we want to implement the statements on a processor that has two register banks as FIG. 2 illustrates.

R ← X + Y

S ← Z - X

T ← Y + Z

To perform the first addition, X and Y must be in separate register banks. Let's assume X is in a register in bank A, and Y is in a register in bank B. For the subtraction, Z must be in the opposite register bank than X (i.e., Z must be in a register in bank B). For the third assignment, Y and Z must be in different banks. Unfortunately, the first two assignments mean that Y and Z are located in the same bank. Thus, there is no possible assignment of X, Y, and Z to registers that works with all three instructions.

We say that a register conflict occurs.

What happens when a register conflict arises?

The programmer must either re-assign registers or insert an instruction to copy values. For example, we could insert an extra instruction that copies the value of Z into a register in bank A before the final addition is performed.
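The conflict can be checked mechanically. The sketch below assumes the arrangement of FIG. 2: registers 0 through 3 form bank A, and registers 4 through 7 form bank B; an instruction's two operands conflict if both lie in the same bank.

```python
def bank(reg):
    """Registers 0-3 form bank A; registers 4-7 form bank B (as in FIG. 2)."""
    return 'A' if reg < 4 else 'B'

def conflicts(op1, op2):
    """Two operands conflict if they must come from the same bank."""
    return bank(op1) == bank(op2)

# Suppose X is in bank A (reg 0), and Y and Z are in bank B (regs 4 and 5):
X, Y, Z = 0, 4, 5
assert not conflicts(X, Y)   # R <- X + Y works
assert not conflicts(Z, X)   # S <- Z - X works
assert conflicts(Y, Z)       # T <- Y + Z conflicts: both operands in bank B

# Fix: copy Z into a bank-A register (say reg 1) before the final addition.
Z_copy = 1
assert not conflicts(Y, Z_copy)
```

The extra copy instruction costs a cycle, which is why compilers for banked architectures work hard to find register assignments that avoid conflicts in the first place.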

11. Complex And Reduced Instruction Sets

Computer architects divide instruction sets into two broad categories that are used to classify processors†:

-- Complex Instruction Set Computer (CISC)

-- Reduced Instruction Set Computer (RISC)

A CISC processor usually includes many instructions (typically hundreds), and each instruction can perform an arbitrarily complex computation. Intel's x86 instruction set is classified as CISC because the processors provide hundreds of instructions, including complex instructions that require a long time to execute (e.g., one instruction manipulates graphics in memory and others compute the sine and cosine functions).

In contrast to CISC, a RISC processor is constrained. Instead of arbitrary instructions, a RISC design strives for a minimum set that is sufficient for all computation (e.g., thirty-two instructions). Instead of allowing a single instruction to compute an arbitrary function, each instruction performs a basic computation. To achieve the highest possible speed, RISC designs constrain instructions to be a fixed size. Finally, as the next section explains, a RISC processor is designed to execute an instruction in one clock cycle‡. Arm Limited and MIPS Corporation have each created RISC architectures with limited instructions that can be executed in one clock cycle. The ARM designs are especially popular in smart phones and other low-power devices.

We can summarize:

†Instead of using the full name, most engineers use the acronyms, which are pronounced sisk and risk.

‡Recall from Section 2 that a clock, which pulses at regular intervals, is used to control digital logic.

A processor is classified as CISC if the instruction set contains instructions that perform complex computations that can require long times; a processor is classified as RISC if it contains a small number of instructions that can each execute in one clock cycle.

12. RISC Design And The Execution Pipeline

We said that a RISC processor executes one instruction per clock cycle. In fact, a more accurate version of the statement is: a RISC processor is designed so the processor can complete one instruction on each clock cycle. To understand the subtle difference, it is important to know how the hardware works. We said that a processor performs a fetch-execute cycle by first fetching an instruction and then executing the instruction.

In fact, the processor divides the fetch-execute cycle into several steps, typically:

-- Fetch the next instruction

-- Decode the instruction and fetch operands from registers

-- Perform the arithmetic operation specified by the opcode

-- Perform memory read or write, if needed

-- Store the result back to the registers

To enable high speed, RISC processors contain parallel hardware units that each perform one step listed above. The hardware is arranged in a multistage pipeline†, which means the results from one hardware unit are passed to the next hardware unit.

FIG. 3 illustrates a pipeline.


FIG. 3 An example pipeline of the five hardware stages that are used to perform the fetch-execute cycle.

In the figure, an instruction moves left to right through the pipeline. The first stage fetches the instruction, the next stage examines the opcode, and so on. Whenever the clock ticks, all stages simultaneously pass the instruction to the right. Thus, instructions move through the pipeline like an assembly line: at any time, the pipeline contains five instructions.

†The terms instruction pipeline and execution pipeline are used interchangeably to refer to the multistage pipeline used in the fetch-execute cycle.

The speed of a pipeline arises because all stages can operate in parallel - while the fourth stage executes an instruction, the third stage fetches the operands for the next instruction. Thus, a stage never needs to delay because an instruction is always ready on each clock cycle. FIG. 4 illustrates how a set of instructions pass through a five-stage pipeline.


FIG. 4 Instructions passing through a five-stage pipeline. Once the pipeline is filled, each stage is busy on each clock cycle.

The figure clearly illustrates that although a RISC processor cannot perform all the steps needed to fetch and execute an instruction in one clock cycle, parallel hardware allows the processor to finish one instruction per clock cycle. We can summarize:

Although a RISC processor cannot perform all steps of the fetch-execute cycle in a single clock cycle, an instruction pipeline with parallel hardware provides approximately the same performance: once the pipeline is full, one instruction completes on every clock cycle.
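The arithmetic behind the summary is simple: a k-stage pipeline needs k cycles to produce its first result, after which one instruction completes per cycle, so n instructions take k + n - 1 cycles in the absence of stalls.

```python
def cycles(n_instructions, stages=5):
    """Clock cycles for n instructions through a k-stage pipeline with
    no stalls: k cycles to fill the pipeline, then one completion per
    cycle for the remaining n - 1 instructions."""
    return stages + (n_instructions - 1)

assert cycles(1) == 5                # one instruction still takes all 5 stages
assert cycles(100) == 104            # nearly 1 cycle per instruction
assert cycles(1000) / 1000 < 1.01    # throughput approaches 1 per clock cycle
```

The per-instruction cost thus approaches one cycle as the instruction stream grows, even though no single instruction ever finishes in one cycle.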

13. Pipelines And Instruction Stalls

We say that the instruction pipeline is transparent to programmers because the instruction set does not contain any explicit references to the pipeline. That is, the hardware is constructed so the results of a program are the same whether or not a pipeline is present. Although transparency can be an advantage, it can also be a disadvantage: a programmer who does not understand the pipeline can inadvertently introduce inefficiencies.

To understand the effect of programming choices on a pipeline, consider a program that contains two successive instructions that perform an addition and subtraction on operands and results located in registers that we will label A, B, C, D, and E:

Instruction K: C ← add A B

Instruction K+1: D ← subtract E C

Although instruction K can proceed through the pipeline from beginning to end, instruction K+1 encounters a problem because operand C is not available in time. That is, the hardware must wait for instruction K to finish before fetching the operands for instruction K+1. We say that a stage of the pipeline stalls to wait for the operand to become available. FIG. 5 illustrates what happens during a pipeline stall.


FIG. 5 Illustration of a pipeline stall. Instruction K+1 cannot proceed until an operand from instruction K becomes available.

The figure shows a normal pipeline running until clock cycle 3, when Instruction K+1 has reached stage 2. Recall that stage 2 fetches operands from registers. In our example, one of the operands for instruction K+1 is not available, and will not be available until instruction K writes its results into a register. The pipeline must stall until instruction K completes. In the code above, because the value of C has not been computed, stage 2 cannot fetch the value for C. Thus, stages 1 and 2 remain stalled during clock cycles 4 and 5. During clock cycle 6, stage 2 can fetch the operand, and pipeline processing continues.

The rightmost column in FIG. 5 shows the effect of a stall on performance: the final stage of the pipeline does not produce any results during clock cycles 6, 7, and 8.

If no stalls had occurred, instruction K+1 would have completed during clock cycle 6, but the stall means the instruction does not complete until clock cycle 9.

To describe the delay between the cause of a stall and the time at which output stops, we say that a bubble passes through the pipeline. Of course, the bubble is only apparent to someone observing the pipeline's performance because correctness is not affected. That is, an instruction always passes directly to the next stage as soon as one stage completes, which means all instructions are executed in the order specified.

14. Other Causes Of Pipeline Stalls

In addition to waiting for operands, a pipeline can stall when the processor executes any instruction that delays processing or disrupts the normal flow. For example, a stall can occur when a processor:

-- Accesses external storage

-- Invokes a coprocessor

-- Branches to a new location

-- Calls a subroutine

The most sophisticated processors contain additional hardware to avoid stalls. For example, some processors contain two copies of a pipeline, which allows the processor to start decoding the instruction that will be executed if a branch is taken as well as the instruction that will be executed if a branch is not taken. The two copies operate until a branch instruction can be executed. At that time, the hardware knows which copy of the pipeline to follow; the other copy is ignored. Other processors contain special shortcut hardware that passes a copy of a result back to a previous pipeline stage.

15. Consequences For Programmers

To achieve maximum speed, a program for a RISC architecture must be written to accommodate an instruction pipeline. For example, a programmer should avoid introducing unnecessary branch instructions. Similarly, instead of referencing a result register immediately in the following instruction, the reference can be delayed. As an example, FIG. 6 shows how code can be rearranged to run faster.

(a)                      (b)

C ← add A B              C ← add A B
D ← subtract E C         F ← add G H
F ← add G H              M ← add K L
J ← subtract I F         D ← subtract E C
M ← add K L              J ← subtract I F
P ← subtract M N         P ← subtract M N


FIG. 6 (a) A list of instructions, and (b) the instructions reordered to run faster in a pipeline. Reducing pipeline stalls increases speed.

In the figure, the optimized program separates references from computation. For example, in the original program, the second instruction references value C, which is produced by the previous instruction. Thus, a stall occurs between the first and second instructions. Moving the subtraction to a later point in the program allows the processor to continue to operate without a stall.

Of course, a programmer can choose to view a pipeline as an automatic optimization instead of a programming burden. Fortunately, most programmers do not need to perform pipeline optimizations manually. Instead, compilers for high-level languages perform the optimizations automatically.

Rearranging code sequences can increase the speed of processing when the hardware uses an instruction pipeline; programmers view the reordering as an optimization that can increase speed without affecting correctness.
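The benefit of the reordering in FIG. 6 can be estimated with a deliberately simplified model that counts only adjacent read-after-write dependencies (a real pipeline's stall count depends on the number of stages, but the comparison holds):

```python
# Each instruction is (destination register, list of source registers).
def adjacent_stalls(program):
    """Count places where an instruction reads a register written by the
    immediately preceding instruction (a simplified stall model)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        dest, _ = prev
        _, sources = cur
        if dest in sources:
            stalls += 1
    return stalls

original = [('C', ['A', 'B']), ('D', ['E', 'C']), ('F', ['G', 'H']),
            ('J', ['I', 'F']), ('M', ['K', 'L']), ('P', ['M', 'N'])]

reordered = [('C', ['A', 'B']), ('F', ['G', 'H']), ('M', ['K', 'L']),
             ('D', ['E', 'C']), ('J', ['I', 'F']), ('P', ['M', 'N'])]

assert adjacent_stalls(original) == 3   # each subtraction follows its producer
assert adjacent_stalls(reordered) == 0  # every dependency is separated
```

Both programs compute the same values; only the order, and therefore the stall count, differs.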

16. Programming, Stalls, and No-Op Instructions

In some cases, the instructions in a program cannot be rearranged to prevent a stall.

In such cases, programmers can document stalls so anyone reading the code will understand that a stall occurs. Such documentation is especially helpful if a program is modified because the programmer who performs the modification can reconsider the situation and attempt to reorder instructions to prevent a stall.

How should programmers document a stall? One technique is obvious: insert a comment that explains the reason for a stall. However, most code is generated by compilers and is only read by humans when problems occur or special optimization is required. In such cases, another technique can be used: in places where stalls occur, insert extra instructions in the code. The extra instructions show where items can be inserted without affecting the pipeline. Of course, the extra instructions must be innocuous - they must not change the values in registers or otherwise affect the program. In most cases, the hardware provides the solution: a no-op, an instruction that does absolutely nothing except occupy time. The point is:

Most processors include a no-op instruction that does not reference data values, compute a result, or otherwise affect the state of the computer. No-op instructions can be inserted to document locations where an instruction stall occurs.

17. Forwarding

As mentioned above, some hardware has special facilities to improve instruction pipeline performance. For example, an ALU can use a technique known as forwarding to solve the problem of successive arithmetic instructions passing results.

To understand how forwarding works, consider the example of two instructions where operands A, B, C, D, and E are in registers:

Instruction K: C ← add A B

Instruction K+1: D ← subtract E C

We said that such a sequence causes a stall on a pipelined processor. However, a processor that implements forwarding can avoid the stall by arranging for the hardware to detect the dependency and automatically pass the value for C from instruction K directly to instruction K+1. That is, a copy of the output from the ALU in instruction K is forwarded directly to the input of the ALU in instruction K+1. As a result, instructions continue to fill the pipeline, and no stall occurs.
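A simplified model makes the savings concrete. Assume the five-stage pipeline of FIG. 3: stage 2 fetches operands, stage 3 performs the ALU operation, and stage 5 writes the result back. The stage numbers below are taken from that figure; the model itself is an illustration, not a description of any particular processor.

```python
OPERAND_STAGE = 2    # stage that fetches operands from registers
WRITEBACK_STAGE = 5  # stage that stores the result into a register

def stall_cycles(forwarding):
    """Cycles a dependent instruction waits for its predecessor's result,
    under a simplified model of the five-stage pipeline in FIG. 3."""
    if forwarding:
        # The ALU output is passed straight back to the next ALU input,
        # so the dependent instruction never waits.
        return 0
    # Otherwise the operand-fetch stage must wait until writeback finishes.
    return WRITEBACK_STAGE - OPERAND_STAGE

assert stall_cycles(forwarding=False) == 3   # the three-cycle delay of FIG. 5
assert stall_cycles(forwarding=True) == 0
```

The three-cycle figure matches the stall in FIG. 5, where instruction K+1 completes in clock cycle 9 instead of clock cycle 6.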

18. Types of Operations

When computer architects discuss instruction sets, they divide the instructions into a few basic categories. FIG. 7 lists one possible division.

-- Integer arithmetic instructions

-- Floating point arithmetic instructions

-- Logical instructions (also called Boolean operations)

-- Data access and transfer instructions

-- Conditional and unconditional branch instructions

-- Processor control instructions

-- Graphics instructions


FIG. 7 An example of categories used to classify instructions. A general-purpose processor includes instructions in all the categories.

19. Program Counter, Fetch-Execute, And Branching

Recall from Section 4 that every processor implements a basic fetch-execute cycle.

During the cycle, control hardware in the processor automatically moves through instructions: once it finishes executing one instruction, the processor moves past the current instruction in memory before fetching the next instruction. To implement the fetch-execute cycle and the move to the next instruction, the processor uses a special-purpose internal register known as an instruction pointer or program counter†.

When a fetch-execute cycle begins, the program counter contains the address of the instruction to be executed. After an instruction has been fetched, the program counter is updated to the address of the next instruction. The update of the program counter during each fetch-execute cycle means the processor will automatically move through successive instructions in memory. Algorithm 1 specifies how the fetch-execute cycle moves through successive instructions.

Algorithm 1

    Assign the program counter an initial program address.

    Repeat forever {

        Fetch: access the next step of the program from the
            location given by the program counter.

        Set an internal address register, A, to the address
            beyond the instruction that was just fetched.

        Execute: Perform the step of the program.

        Copy the contents of address register A to the
            program counter.
    }

Algorithm 1 The Fetch-Execute Cycle
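Algorithm 1 can be sketched as a tiny interpreter. The instruction names (load, add, jump) and the tuple representation are hypothetical, chosen only to make the fetch / set-A / execute / copy-A sequence visible; note how a jump works simply by overwriting A during the execute step.

```python
def run(program, pc=0, steps=20):
    """Sketch of Algorithm 1: fetch, set address register A, execute,
    then copy A into the program counter. Each instruction is a tuple
    (opcode, operands...); 'jump' loads A with an absolute address."""
    regs = {}
    for _ in range(steps):
        if pc >= len(program):
            break
        op, *args = program[pc]                      # Fetch
        A = pc + 1                                   # address beyond instruction
        if op == 'load':                             # Execute
            regs[args[0]] = args[1]
        elif op == 'add':
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == 'jump':
            A = args[0]                              # absolute branch: set A
        pc = A                                       # copy A into program counter
    return regs

prog = [('load', 'r1', 40),
        ('jump', 3),
        ('load', 'r1', 0),   # skipped: the jump bypasses this instruction
        ('load', 'r2', 2),
        ('add', 'r3', 'r1', 'r2')]
assert run(prog)['r3'] == 42
```

Because every instruction, branch or not, ends by copying A into the program counter, sequential execution and branching are handled by one uniform mechanism.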

The algorithm allows us to understand how branch instructions work. There are two cases: absolute and relative. An absolute branch computes a memory address, and the address specifies the location of the next instruction to execute. Typically, an absolute branch instruction is known as a jump. During the execute step, a jump instruction computes an address and loads the value into the internal register A that Algorithm 1 specifies. At the end of the fetch-execute cycle, the hardware copies the value into the program counter, which means the address will be used to fetch the next instruction.

For example, the absolute branch instruction:

†The two terms are equivalent.

jump 0x05DE

causes the processor to load 0x05DE into the internal address register, which is copied into the program counter before the next instruction is fetched. In other words, the next instruction fetch will occur at memory location 0x05DE.

Unlike an absolute branch instruction, a relative branch instruction does not specify an exact memory address. Instead, a relative branch computes a positive or negative increment for the program counter. For example, the instruction:

br +8

specifies branching to a location that is eight bytes beyond the current location (i.e., beyond the current value of the program counter).

To implement relative branching, a processor adds the operand in the branch instruction to the program counter, and places the result in internal address register A.

For example, if the relative branch computes -12, the next instruction to be executed will be found at an address twelve bytes before the current instruction. A compiler might use a relative branch at the end of a short while-loop.

Most processors also provide an instruction to invoke a subroutine, typically jsr (jump subroutine). In terms of the fetch-execute cycle, a jsr instruction operates like a branch instruction with a key difference: before the branch occurs, the jsr instruction saves the value of the address register, A. When it finishes executing, a subroutine returns to the caller. To do so, the subroutine executes an absolute branch to the saved address. Thus, when the subroutine finishes, the fetch-execute cycle resumes at the instruction immediately following the jsr.
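The jsr mechanism can be sketched by extending the fetch-execute loop with one saved-address slot (a deliberate simplification: real processors save the address on a stack or in a link register so calls can nest). The instruction names are hypothetical.

```python
def run(program, pc=0, limit=30):
    """Fetch-execute loop with subroutine call: 'jsr' saves address
    register A before branching, and 'ret' branches back to the saved
    address. A single saved slot, so calls cannot nest in this sketch."""
    saved = None
    trace = []
    while pc < len(program) and limit > 0:
        limit -= 1
        op, *args = program[pc]
        A = pc + 1                 # address beyond the current instruction
        if op == 'jsr':
            saved = A              # remember where to resume after return
            A = args[0]            # branch to the subroutine
        elif op == 'ret':
            A = saved              # absolute branch to the saved address
        elif op == 'stop':
            break
        else:
            trace.append(op)       # any other opcode: just record it
        pc = A
    return trace

prog = [('work',), ('jsr', 4), ('after',), ('stop',), ('sub',), ('ret',)]
assert run(prog) == ['work', 'sub', 'after']
```

Execution resumes at 'after', the instruction immediately following the jsr, exactly as the text describes.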

20. Subroutine Calls, Arguments, And Register Windows

High-level languages use a subroutine call instruction, such as jsr, to implement a procedure or function call. The calling program supplies a set of arguments that the subroutine uses in its computation. For example, the function call cos( 3.14159 ) has the floating point constant 3.14159 as an argument.

One of the principal differences among processors arises from the way the underlying hardware passes arguments to a subroutine. Some architectures use memory - the arguments are stored on the stack in memory before the call, and the subroutine extracts the values from the stack when they are referenced. In other architectures, the processor uses either general-purpose or special-purpose registers to pass arguments.

Using either special-purpose or general-purpose registers to pass arguments is much faster than using a stack in memory because registers are part of the local storage in the processor itself. Because few processors provide special-purpose registers for argument passing, general-purpose registers are typically used. Unfortunately, general-purpose registers cannot be devoted exclusively to arguments because they are also needed for other computation (e.g., to hold operands for arithmetic operations). Thus, a programmer faces a tradeoff: using a general-purpose register to pass an argument can increase the speed of a subroutine call, but using the register to hold a data value can increase the speed of general computation. Thus, a programmer must choose which arguments to keep in registers and which to store in memory.

Some processors include an optimization for argument passing known as a register window. Although a processor has a large set of general-purpose registers, the register hardware only exposes a subset of the registers at any time. The subset is known as a window. The window moves automatically each time a subroutine is invoked, and moves back when the subroutine returns. More important, the windows that are available to a program and subroutine overlap - some of the registers visible to the caller are also visible to the subroutine. A caller places arguments in the registers that will overlap before calling a subroutine, and the subroutine extracts values from the registers. FIG. 8 illustrates the concept of a register window.


FIG. 8 Illustration of a register window (a) before a subroutine call, and (b) during the call. Values A, B, C, and D correspond to arguments that are passed.

In the figure, the hardware has 16 registers, but only 8 registers are visible at any time; others are unavailable. A program always references visible registers as numbers 0 through the window size minus one (0 through 7 in the example). When a subroutine is called, the hardware changes the set of registers that are visible by sliding the window. In the example, registers that were numbered 4 through 7 before the call become 0 through 3 after the call. Thus, the calling program places arguments A through D in registers 4 through 7, and the subroutine finds the arguments in registers 0 through 3.

Registers with values xi are only available to the calling program. The advantage of a register window approach is that registers that are not in the current window retain the values they had. So, when the called subroutine returns, the window will slide back, and registers with values xi will be exactly the same as before the call.
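The sliding-window mechanism in FIG. 8 can be sketched as a simplified Python model (sixteen physical registers, an eight-register window, and a slide of four, matching the figure; the class and method names are hypothetical):

```python
class RegisterWindow:
    # Simplified model of FIG. 8: 16 physical registers, an
    # 8-register visible window that slides by 4 on each call,
    # so the caller's registers 4..7 become the callee's 0..3.
    def __init__(self, physical=16, window=8, slide=4):
        self.regs = [0] * physical
        self.base = 0
        self.window = window
        self.slide = slide

    def get(self, n):
        # Programs always name visible registers 0..window-1.
        return self.regs[self.base + n]

    def put(self, n, value):
        self.regs[self.base + n] = value

    def call(self):
        self.base += self.slide    # window slides on subroutine entry

    def ret(self):
        self.base -= self.slide    # and slides back on return

w = RegisterWindow()
w.put(4, 'A')                      # caller places arguments in
w.put(5, 'B')                      # registers 4 and 5
w.call()
args = [w.get(0), w.get(1)]        # callee sees them as registers 0 and 1
w.ret()
```

After the return, the caller's registers hold exactly the values they had before the call, which is the property the text emphasizes.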

† Section 3 describes the calling sequence used with an x86 architecture, and Section 4 explains how an ARM architecture passes some arguments in registers and some in memory.

The illustration in FIG. 8 uses a small window size (eight registers) to simplify the diagram. In practice, processors that use a register window typically have larger windows. For example, the Sparc architecture has one hundred twenty-eight or one hundred forty-four physical registers and a window size of thirty-two registers; however, only eight of the registers in the window overlap (i.e., only eight registers can be used to pass arguments).

21. An Example Instruction Set

An example instruction set will help clarify the concepts described above. We have selected the MIPS processor as an example for two reasons. First, the MIPS processor is popular for use in embedded systems. Second, the MIPS instruction set is a classic example of the instruction set offered by a RISC processor. FIG. 9 lists the instructions in the MIPS instruction set.

A MIPS processor contains thirty-two general-purpose registers, and most instructions require the operands and results to be in registers. For example, the add instruction takes three operands that are registers: the instruction adds the contents of the first two registers and places the result in the third.

In addition to the integer instructions that are listed in FIG. 9, the MIPS architecture defines a set of floating point instructions for both single precision (i.e., thirty-two-bit) and double precision (i.e., sixty-four-bit) floating point values. The hardware provides a set of thirty-two floating point registers. Although they are numbered from zero to thirty-one, the floating point registers are completely independent of the general-purpose registers.

To handle double precision values, the floating point registers operate as pairs. That is, only an even-numbered floating point register can be specified as an operand or target in a floating point instruction - the hardware uses the specified register plus the next odd-numbered register as a combined storage unit to hold a double precision value.
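The even-register pairing rule can be sketched as follows (a hypothetical helper for illustration, not actual MIPS encoding logic):

```python
def double_register_pair(n):
    # A double precision value occupies floating point registers
    # n and n+1; only an even-numbered register may be specified.
    if n % 2 != 0:
        raise ValueError("double-precision operand must name an even register")
    return (n, n + 1)

pair = double_register_pair(6)   # the value occupies registers 6 and 7
```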

FIG. 10 summarizes the MIPS floating point instruction set.

===================

Instruction                          Meaning
----------
Arithmetic
  add                                integer addition
  subtract                           integer subtraction
  add immediate                      integer addition (register + constant)
  add unsigned                       unsigned integer addition
  subtract unsigned                  unsigned integer subtraction
  add immediate unsigned             unsigned addition with a constant
  move from coprocessor              access coprocessor register
  multiply                           integer multiplication
  multiply unsigned                  unsigned integer multiplication
  divide                             integer division
  divide unsigned                    unsigned integer division
  move from Hi                       access high-order register
  move from Lo                       access low-order register
Logical (Boolean)
  and                                logical and (two registers)
  or                                 logical or (two registers)
  and immediate                      and of register and constant
  or immediate                       or of register and constant
  shift left logical                 shift register left N bits
  shift right logical                shift register right N bits
Data Transfer
  load word                          load register from memory
  store word                         store register into memory
  load upper immediate               place constant in upper sixteen bits of register
  move from coproc. register         obtain a value from a coprocessor
Conditional Branch
  branch equal                       branch if two registers equal
  branch not equal                   branch if two registers unequal
  set on less than                   compare two registers
  set less than immediate            compare register and constant
  set less than unsigned             compare unsigned registers
  set less than immediate unsigned   compare unsigned register and constant
Unconditional Branch
  jump                               go to target address
  jump register                      go to address in register
  jump and link                      procedure call

===================


FIG. 9 An example instruction set. The table lists the instructions offered by the MIPS processor.

=======

Instruction                  Meaning
----------
Arithmetic
  FP add                     floating point addition
  FP subtract                floating point subtraction
  FP multiply                floating point multiplication
  FP divide                  floating point division
  FP add double              double-precision addition
  FP subtract double         double-precision subtraction
  FP multiply double         double-precision multiplication
  FP divide double           double-precision division
Data Transfer
  load word coprocessor      load value into FP register
  store word coprocessor     store FP register to memory
Conditional Branch
  branch FP true             branch if FP condition is true
  branch FP false            branch if FP condition is false
  FP compare single          compare two FP registers
  FP compare double          compare two double precision values

=======

FIG. 10 Floating point (FP) instructions defined by the MIPS architecture. Double precision values occupy two consecutive floating point registers.


22. Minimalistic Instruction Set

It may seem that the instructions listed in FIG. 9 are insufficient and that additional instructions are needed. For example, the MIPS architecture does not include an instruction that copies the contents of a register to another register, nor does the architecture include instructions that can add a value in memory to the contents of a register.

To understand the choices, it is important to know that the MIPS instruction set supports two principles: speed and minimalism. First, the basic instruction set has been designed carefully to ensure high speed (i.e., the architecture has the property that when a pipeline is used, one instruction can complete on every clock cycle). Second, the instruction set is minimalistic - it contains the fewest possible instructions that handle standard computation. Limiting the number of instructions forms a key piece of the design. Choosing thirty-two instructions means that an opcode only needs five bits and no combination of the bits is wasted.
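The relationship between the instruction count and the opcode width can be checked directly:

```python
import math

def opcode_bits(num_instructions):
    # Minimum number of bits needed to encode the given
    # number of distinct opcodes.
    return math.ceil(math.log2(num_instructions))

# Thirty-two instructions fit exactly in a five-bit opcode
# with no wasted bit combinations, because 2**5 == 32.
bits = opcode_bits(32)
```

Adding even one more instruction would force a sixth opcode bit, which illustrates why the count of thirty-two is a deliberate design choice.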

One feature of the MIPS architecture, which is also used in other RISC processors, helps achieve minimalism: fast access to a zero value. In the case of MIPS, register 0 provides the mechanism - the register is reserved and always contains the value zero. Thus, to test whether a register is zero, the value can be compared to register zero.

Similarly, register zero can be used in any instruction. For example, to copy a value from one register to another, an add instruction can be used in which one of the two operands is register zero.
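The register-zero trick can be modeled as follows: with register 0 hardwired to zero, a three-operand add doubles as a register-to-register copy (a sketch of the idea, not actual MIPS encoding):

```python
regs = [0] * 32   # register 0 is hardwired to zero

def add(dest, src1, src2):
    # Three-operand add; writes to register 0 are discarded,
    # so it always reads as zero.
    if dest != 0:
        regs[dest] = regs[src1] + regs[src2]

regs[4] = 99
add(7, 4, 0)      # copy register 4 to register 7: r7 = r4 + 0
```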

23. The Principle of Orthogonality

In addition to the technical aspects of instruction sets discussed above, an architect must consider the aesthetic aspects of a design. In particular, an architect strives for elegance. Elegance relates to human perception: how does the instruction set appear to a programmer? How do instructions combine to handle common programming tasks? Are the instructions balanced (if the set includes right-shift, does it also include left-shift)? Elegance calls for subjective judgment. However, experience with a few instruction sets often helps engineers and programmers recognize and appreciate elegance.

One particular aspect of elegance, known as orthogonality, concentrates on eliminating unnecessary duplication and overlap among instructions. We say that an instruction set is orthogonal if each instruction performs a unique task. An orthogonal instruction set has important advantages for programmers: orthogonal instructions can be understood more easily, and a programmer does not need to choose among multiple instructions that perform the same task. Orthogonality is so important that it has become a general principle of processor design. We can summarize:

The principle of orthogonality specifies that each instruction should perform a unique task without duplicating or overlapping the functionality of other instructions.

24. Condition Codes And Conditional Branching

On many processors, executing an instruction results in a status, which the processor stores in an internal hardware mechanism. A later instruction can use the status to decide how to proceed. For example, when it executes an arithmetic instruction, the ALU sets an internal register known as a condition code that contains bits to record whether the result is positive, negative, or zero, or whether an arithmetic overflow occurred. A conditional branch instruction that follows the arithmetic operation can test one or more of the condition code bits, and use the result to determine whether to branch.

An example will clarify how a condition code mechanism is used†. To understand the paradigm, consider a program that tests for equality between two values. As a simple example, suppose the goal is to set register 3 to zero if the contents of register 4 are not equal to the contents of register 5. FIG. 11 contains example code.

† Section 9 explains programming with condition codes and shows further examples.


FIG. 11 An example of using a condition code. An ALU operation sets the condition code, and a later conditional branch instruction tests the condition code.
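The paradigm illustrated in FIG. 11 can be modeled in a few lines (a sketch; real condition codes are hardware bits that the ALU sets as a side effect of each arithmetic instruction):

```python
cc = {'zero': False, 'negative': False}

def subtract(a, b):
    # ALU operation: compute a - b and set the condition code
    # as a side effect.
    result = a - b
    cc['zero'] = (result == 0)
    cc['negative'] = (result < 0)
    return result

def branch_if_equal():
    # A conditional branch tests the zero bit that the
    # preceding ALU operation set.
    return cc['zero']

# Set register 3 to zero if register 4 != register 5.
r3, r4, r5 = 1, 10, 20
subtract(r4, r5)                 # sets the condition code
if not branch_if_equal():
    r3 = 0
```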

25. Summary

Each processor defines an instruction set that consists of operations the processor supports; the set is chosen as a compromise between programmer convenience and hardware efficiency. In some processors, each instruction is the same size, and in other processors size varies among instructions.

Most processors include a small set of general-purpose registers that are high-speed storage mechanisms. To program using registers, one loads values from memory into registers, performs a computation, and stores the result from a register into memory. To optimize performance, a programmer leaves values that will be used again in registers.

On some architectures, registers are divided into banks, and a programmer must ensure that the operands for each instruction come from separate banks.

Processors can be classified into two broad categories, CISC and RISC, depending on whether they include many complex instructions or a minimal set of instructions.

RISC architectures use an instruction pipeline to ensure that one instruction can complete on each clock cycle. Programmers can optimize performance by rearranging code to avoid pipeline stalls.

To implement conditional execution (e.g., an if-then-else), many processors rely on a condition code mechanism - an ALU instruction sets the condition code, and a later instruction (a conditional branch) tests the condition code.

EXERCISES

1. When debugging a program, a programmer uses a tool that allows them to show the contents of memory. When the programmer points the tool to a memory location that contains an instruction, the tool prints three hex values with labels:

OC=0x43 OP1=0xff00 OP2=0x0324

What do the labels abbreviate?

2. If the arithmetic hardware on a computer requires operands to be in separate banks, what instruction sequence will be needed to compute the following?

A ← B - C

Q ← A * C

W ← Q + A

Z ← W - Q

3. Assume you are designing an instruction set for a computer that will perform the Boolean operations and, or, not, and exclusive or. Assign opcodes and indicate the number of operands for each instruction. When your instructions are stored in memory, how many bits will be needed to hold the opcode?

4. If a computer can add, subtract, multiply, and divide 16-bit integers, 32-bit integers, 32-bit floating point values, and 64-bit floating point values, how many unique opcodes will be needed? (Hint: assume one op code for each operation and each data size.)

5. A computer architect boasted that they were able to design a computer in which every instruction occupied exactly thirty-two bits. What is the advantage of such a design?

6. Classify the ARM architecture owned by ARM Limited, the SPARC architecture owned by Oracle Corporation, and the Intel Architecture owned by Intel Corporation as CISC or RISC.

7. Consider a pipeline of N stages in which stage i takes time ti. Assuming no delay between stages, what is the total time (start to finish) that the pipeline will spend handling a single instruction?

8. Insert nop instructions in the following code to eliminate pipeline stalls (assume the pipeline illustrated in FIG. 5).

loadi r7, 10 # put 10 in register 7

loadi r8, 15 # put 15 in register 8

loadi r9, 20 # put 20 in register 9

addrr r10, r7, r8 # add registers 7 and 8; put the result in register 10

movr r12, r9 # copy register 9 to register 12

movr r11, r7 # copy register 7 to register 11

addri r14, r11, 27 # add 27 plus register 11; put the result in register 14

addrr r13, r12, r11 # add registers 11 and 12; put the results in register 13


Updated: Monday, April 24, 2017 18:14 PST