Computer Architecture: CPUs -- Microcode, Protection, and Processor Modes





1. Introduction

Previous sections consider two key aspects of processors: instruction sets and operands. The sections explain possible approaches, and discuss the advantages and disadvantages of each approach. This section considers a broad class of general purpose processors, and shows how many of the concepts from previous sections are applied. The next section considers low-level programming languages used with processors.

2. A Central Processor

Early in the history of computers, centralization emerged as an important architectural approach - as much functionality as possible was collected into a single processor. The processor, which became known as a Central Processing Unit (CPU), controlled the entire computer, including both calculations and I/O.

In contrast to early designs, a modern computer system follows a decentralized approach. The system contains multiple processors, many of which are dedicated to a specific function or a hardware subsystem. For example, we will see that an I/O device, such as a disk, can include a processor that handles disk transfers.

Despite the shift in paradigm, the term CPU has survived because one chip contains the hardware used to perform most computations and coordinate and control other processors. In essence, the CPU manages the entire computer system by telling other processors when to start, when to stop, and what to do. When we discuss I/O, we will see how the CPU controls the operation of peripheral devices and processors.

3. CPU Complexity

Because it must handle a wide variety of control and processing tasks, a modern CPU is extremely complex. For example, Intel makes a CPU chip that contains 2.5 billion transistors. Why is a CPU so complex? Why are so many transistors needed?

Multiple Cores. In fact, modern CPU chips do not contain just one processor. Instead, they contain multiple processors called cores. The cores all function in parallel, permitting multiple computations to proceed at the same time. Multicore designs are required for high performance because a single core cannot be clocked at arbitrarily high speeds.

Multiple Roles. One aspect of CPU complexity arises because a CPU must fill several major roles: running application programs, running an operating system, handling external I/O devices, starting or stopping the computer, and managing memory.

No single instruction set is optimal for all roles, so a CPU often includes many instructions.

Protection and Privilege. Most computer systems incorporate a system of protection that gives some subsystems higher privilege than others. For example, the hardware prevents an application program from directly interacting with I/O devices, and the operating system code is protected from inadvertent or deliberate change.

Hardware Priorities. A CPU uses a priority scheme in which some actions are assigned higher priority than others. For example, we will see that I/O devices operate at higher priority than application programs - if the CPU is running an application program when an I/O device needs service, the CPU must stop running the application and handle the device.

Generality. A CPU is designed to support a wide variety of applications. Consequently, the CPU instruction set often contains instructions that are used for each type of application (i.e., a CISC design).

Data Size. To speed processing, a CPU is designed to handle large data values.

Recall from Section 2 that digital logic gates each operate on a single bit of data and that gates must be replicated to handle integers. Thus, to operate on values composed of sixty-four bits, each digital circuit in the CPU must have sixty-four copies of each gate.
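The replication described above can be sketched in software: the following is a minimal illustration (not from the text) in which a one-bit full adder circuit is copied once per bit position to build a sixty-four-bit ripple-carry adder.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder expressed with the usual gate-level operations."""
    total = a ^ b ^ carry_in                     # XOR gates
    carry_out = (a & b) | (carry_in & (a ^ b))   # AND and OR gates
    return total, carry_out

def add64(x, y):
    """Add two 64-bit values using 64 copies of the one-bit circuit."""
    result, carry = 0, 0
    for i in range(64):                          # one adder copy per bit
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result & 0xFFFFFFFFFFFFFFFF           # wrap around like hardware
```

The loop makes the replication explicit: handling sixty-four-bit operands requires sixty-four copies of the one-bit circuit, which is why wider data paths cost so many transistors.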

High Speed. The final, and perhaps most significant, source of CPU complexity arises from the desire for speed. Recall the important concept discussed earlier:

Parallelism is a fundamental technique used to create high-speed hardware.

That is, to achieve highest performance, the functional units in a CPU must be replicated, and the design must permit the replicated units to operate simultaneously. The large amount of parallel hardware needed to make a modern CPU operate at the highest rate also means that the CPU requires many transistors. We will see further explanations later in the section.

4. Modes of Execution

The features listed above can be combined or implemented separately. For example, a given core can be granted access to other parts of memory with or without higher priority. How can a CPU accommodate all the features in a way that allows programmers to understand and use them without becoming confused? In most CPUs, the hardware uses a set of parameters to handle the complexity and control operation. We say that the hardware has multiple modes of execution. At any given time, the current execution mode determines how the CPU operates. FIG. 1 lists items usually associated with a CPU mode of execution.

-- The subset of instructions that are valid

-- The size of data items

-- The region of memory that can be accessed

-- The functional units that are available

-- The amount of privilege


FIG. 1 Items typically controlled by a CPU mode of execution. The characteristics of a CPU can change dramatically when the mode changes.

5. Backward Compatibility

How much variation can execution modes introduce? In principle, the modes available on a CPU do not need to share much in common. As one extreme case, some CPUs have a mode that provides backward compatibility with a previous model. Backward compatibility allows a vendor to sell a CPU with new features, but also permits customers to use the CPU to run old software.

Intel's line of processors (i.e., 8086, 186, 286,...) exemplifies how backward compatibility can be used. When Intel first introduced a CPU that operated on thirty-two bit integers, the CPU included a compatibility mode that implemented the sixteen-bit instruction set from Intel's previous CPU. In addition to using different sizes of integers, the two architectures have different numbers of registers and different instructions. The two architectures differ so significantly that it is easiest to think of the design as two separate pieces of hardware with the execution mode determining which of the two is used at any time.

A CPU uses an execution mode to determine the current operational characteristics. In some CPUs, the characteristics of modes differ so widely that we think of the CPU as having separate hardware subsystems and the mode as determining which piece of hardware is used at the current time.

6. Changing Modes

How does a CPU change execution modes? There are two ways:

-- Automatic (initiated by hardware)

-- Manual (under program control)

Automatic Mode Change. External hardware can change the mode of a CPU. For example, when an I/O device requests service, the hardware informs the CPU.

Hardware in the CPU changes mode (and jumps to the operating system code) automatically before servicing the device. We will learn more when we consider how I/O works.

Manual Mode Change. In essence, manual changes occur under control of a running program. Most often, the program is the operating system, which changes mode before it executes an application. However, some CPUs also provide multiple modes that applications can use, and allow an application to switch among the modes.

What mechanism is used to change mode? Three approaches have been used. In the simplest case, the CPU includes an instruction to set the current mode. In other cases, the CPU contains a special-purpose mode register to control the mode. To change modes, a program stores a value into the mode register. Note that a mode register is not a storage unit in the normal sense. Instead, it consists of a hardware circuit that responds to the store command by changing the operating mode. Finally, a mode change can occur as the side effect of another instruction. In most CPUs, for example, the instruction set includes an instruction that an application uses to make an operating system call. A mode change occurs automatically whenever the instruction is executed.
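The three mechanisms can be sketched as follows. This is a hypothetical illustration - the class, method names, and mode values are invented, not taken from any particular CPU - showing a set-mode instruction, a store to a mode register, and a mode change as the side effect of a system call.

```python
class CPU:
    """Toy model of a CPU with two execution modes (names are invented)."""
    USER, KERNEL = 0, 1

    def __init__(self):
        self.mode = CPU.KERNEL           # power on in the privileged mode

    def set_mode(self, mode):
        # Mechanism 1: an explicit instruction that sets the current mode.
        self.mode = mode

    def store_mode_register(self, value):
        # Mechanism 2: a store to the mode register.  The register is not
        # a storage cell in the normal sense; the store triggers a circuit
        # that switches the operating mode.
        self.mode = value

    def syscall(self):
        # Mechanism 3: a mode change as the side effect of an instruction
        # an application uses to make an operating system call.
        self.mode = CPU.KERNEL
```

The point of the sketch is that all three mechanisms end in the same state change; they differ only in how software requests it.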

To accommodate major changes in mode, additional facilities may be needed to prepare for the new mode. For example, consider a case in which two modes of execution do not share general-purpose registers (e.g., in one mode the registers have sixteen bits and in another mode the registers contain thirty-two bits). It may be necessary to place values in alternate registers before changing mode and using the registers. In such cases, a CPU provides special instructions that allow software to create or modify values before changing the mode.

7. Privilege And Protection

The mode of execution is linked to CPU facilities for privilege and protection.

That is, part of the current mode specifies the level of privilege for the CPU. For example, when it services an I/O device, a CPU must allow device driver software in the operating system to interact with the device and perform control functions. However, an arbitrary application program must be prevented from accidentally or maliciously issuing commands to the hardware or performing control functions. Thus, before it executes an application program, an operating system changes the mode to reduce privilege. When running in a less privileged mode, the CPU does not permit direct control of I/O devices (i.e., the CPU treats a privileged operation like an invalid instruction).
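The enforcement rule can be sketched in a few lines. The opcode names below are invented for illustration; the point is that in a less privileged mode, a privileged opcode produces exactly the same fault as an invalid instruction.

```python
# Hypothetical set of opcodes that require the privileged (kernel) mode.
PRIVILEGED = {"io_start", "io_stop", "set_mode"}

def execute(opcode, mode):
    """Execute one instruction; mode is 'kernel' or 'user'."""
    if opcode in PRIVILEGED and mode != "kernel":
        # A privileged operation in user mode is treated like an
        # invalid instruction: the hardware raises a fault.
        raise RuntimeError("invalid instruction")
    return f"executed {opcode}"
```

A driver running in kernel mode can issue `io_start`, while the same opcode executed by an application faults before it reaches the device.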

8. Multiple Levels of Protection

How many levels of privilege are needed, and what operations should be allowed at each level? The subject has been discussed by hardware architects and operating system designers for many years. CPUs have been invented that offer no protection, and CPUs have been invented that offer eight levels, each with more privilege than the previous level. The idea of protection is to help prevent problems by using the minimum amount of privilege necessary at any time. We can summarize:

By using a protection scheme to limit the operations that are allowed, a CPU can detect attempts to perform unauthorized operations.

FIG. 2 illustrates the concept of two privilege levels.


FIG. 2 Illustration of a CPU that offers two levels of protection. The operating system executes with highest privilege, and application programs execute with less privilege.

Although no protection scheme suffices for all CPUs, designers generally agree on a minimum of two levels for a CPU that runs application programs:

A CPU that runs applications needs at least two levels of protection:

the operating system must run with absolute privilege, but application programs can run with limited privilege.

When we discuss memory, we will see that the issues of protection and memory access are intertwined. More important, we will see how memory access mechanisms, which are part of the CPU mode, provide additional forms of protection.

9. Microcoded Instructions

How should a complex CPU be implemented? Interestingly, one of the key abstractions used to build a complex instruction set comes from software: complex instructions are programmed! That is, instead of implementing the instruction set directly with digital circuits, a CPU is built in two pieces. First, a hardware architect builds a fast, but small processor known as a microcontroller†. Second, to implement the CPU instruction set (called a macro instruction set), the architect writes software for the microcontroller. The software that runs on the microcontroller is known as microcode.

FIG. 3 illustrates the two-level organization, and shows how each level is implemented.


FIG. 3 Illustration of a CPU implemented with a microcontroller. The macro instruction set that the CPU provides is implemented with microcode.

†The small processor is also called a microprocessor, but the term is somewhat misleading.

The easiest way to think about microcode is to imagine a set of functions that each implement one of the CPU macro instructions. The CPU invokes the microcode during the instruction execution. That is, once it has obtained and decoded a macro instruction, the CPU invokes the microcode procedure that corresponds to the instruction.

The macro- and micro architectures can differ. As an example, suppose that the CPU is designed to operate on data items that are thirty-two bits and that the macro instruction set includes an add32 instruction for integer addition. Further suppose that the microcontroller only offers sixteen-bit arithmetic. To implement a thirty-two-bit addition, the microcode must add sixteen bits at a time, and must add the carry from the low-order bits into the high-order bits. FIG. 4 lists the microcode steps that are required:

/* The steps below assume that two 32-bit operands are located in registers labeled R5 and R6, and that the microcode must use 16-bit registers labeled r0 through r3 to compute the results. */

add32:

    move low-order 16 bits from R5 into r2
    move low-order 16 bits from R6 into r3
    add r2 and r3, placing result in r1
    save value of the carry indicator
    move high-order 16 bits from R5 into r2
    move high-order 16 bits from R6 into r3
    add r2 and r3, placing result in r0
    copy the value in r0 to r2
    add r2 and the carry bit, placing the result in r0
    check for overflow and set the condition code
    move the thirty-two-bit result from r0 and r1 to the desired destination


FIG. 4 An example of the steps required to implement a thirty-two-bit macro addition with a microcontroller that only has sixteen-bit arithmetic. The macro- and micro architectures can differ.

The exact details are unimportant; the figure illustrates how the architecture of the microcontroller and the macro instruction set can differ dramatically. Also note that because each macro instruction is implemented by a microcode program, a macro instruction can perform arbitrary processing. For example, it is possible for a single macro instruction to implement a trigonometric function, such as sine or cosine, or to move large blocks of data in memory. Of course, to achieve higher performance, an architect can choose to limit the amount of microcode that corresponds to a given instruction.
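The steps in FIG. 4 can be mirrored in software. The sketch below forms a thirty-two-bit sum using only sixteen-bit additions, propagating the carry from the low-order halves into the high-order halves, just as the microcode must.

```python
MASK16 = 0xFFFF  # a 16-bit register can hold only the low 16 bits

def add32(R5, R6):
    """32-bit addition using only 16-bit arithmetic, as in FIG. 4."""
    r2, r3 = R5 & MASK16, R6 & MASK16            # low-order 16 bits
    low = r2 + r3
    r1, carry = low & MASK16, low >> 16          # save the carry indicator
    r2 = (R5 >> 16) & MASK16                     # high-order 16 bits
    r3 = (R6 >> 16) & MASK16
    r0 = (r2 + r3 + carry) & MASK16              # add high halves plus carry
    return (r0 << 16) | r1                       # assemble the 32-bit result
```

For example, adding 0x0001FFFF and 0x00000001 produces a carry out of the low half that must be folded into the high half to yield the correct result, 0x00020000.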

10. Microcode Variations

Computer designers have invented many variations to the basic form of microcode.

For example, we said that the CPU hardware implements the fetch-execute cycle and invokes a microcode procedure for each instruction. On some CPUs, microcode implements the entire fetch-execute cycle - the microcode interprets the opcode, fetches operands, and performs the specified operation. The advantage is greater flexibility: microcode defines all aspects of the macro system, including the format of macro instructions and the form and encoding of each operand. The chief disadvantage is lower performance: the CPU cannot have an instruction pipeline implemented in hardware.

As another variation, a CPU can be designed that only uses microcode for extensions. That is, the CPU has a complete macro instruction set implemented directly with digital circuits. In addition, the CPU has a small set of additional opcodes that are implemented with microcode. Thus, a vendor can manufacture minor variations of the basic CPU (e.g., a version with a special encryption instruction intended for customers who implement security software or a version with a special pattern matching instruction intended for customers who implement text processing software). If some or all of the extra instructions are not used in a particular version of the CPU, the vendor can insert microcode that makes them undefined (i.e., the microcode raises an error if an undefined instruction is executed).

11. The Advantage of Microcode

Why is microcode used? There are three motivations. First, because microcode offers a higher level of abstraction, building microcode is less prone to errors than building hardware circuits. Second, building microcode takes less time than building circuits. Third, because changing microcode is easier than changing hardware circuits, new versions of a CPU can be created faster.

We can summarize:

A design that uses microcode is less prone to errors and can be updated faster than a design that does not use microcode.

Of course, microcode does have some disadvantages that balance the advantages:

-- Microcode has more overhead than a hardware implementation.

-- Because it executes multiple micro instructions for each macro instruction, the microcontroller must run at much higher speed than the CPU.

-- The cost of a macro instruction depends on the micro instruction set.

12. FPGAs and Changes to the Instruction Set

Because a microcontroller is an internal mechanism intended to help designers, the micro instruction set is usually hidden in the final design. The microcontroller and microcode typically reside on the integrated circuit along with the rest of the CPU, and are only used internally. Only the macro instruction set is available to programmers.

Interestingly, some CPUs have been designed that make the microcode dynamic and accessible to customers who purchase the CPU. That is, the CPU contains facilities that allow the underlying hardware to be changed after the chip has been manufactured.

Why would customers want to change a CPU? The motivations are flexibility and performance: allowing a customer to make some changes to CPU instructions defers the decision about a macro instruction set, and allows a CPU's owner to tailor instructions to a specific use. For example, a company that sells video games might add macro instructions to manipulate graphics images, and a company that makes networking equipment might create macro instructions to process packet headers. Using the underlying hardware directly (e.g., with microcode) can result in higher performance.

One technology that allows modification has become especially popular. Known as Field Programmable Gate Array (FPGA), the technology permits gates to be altered after a chip has been manufactured. Reconfiguring an FPGA is a time-consuming process. Thus, the general idea is to reconfigure the FPGA once, and then use the resulting chip. An FPGA can be used to hold an entire CPU, or an FPGA can be used as a supplement that holds a few extra instructions.

We can summarize:

Technologies like dynamic microcode and FPGAs allow a CPU instruction set to be modified or extended after the CPU has been purchased. The motivations are flexibility and higher performance.

13. Vertical Microcode

The question arises: what architecture should be used for a microcontroller? From the point of view of someone who writes microcode, the question becomes: what instructions should the microcontroller provide? We discussed the notion of microcode as if a microcontroller consists of a conventional processor (i.e., a processor that follows a conventional architecture). We will see shortly that other designs are possible.

In fact, a microcontroller cannot be exactly the same as a standard processor. Because it must interact with hardware units in the CPU, a microcontroller needs a few special hardware facilities. For example, a microcontroller must be able to access the ALU and store results in the general-purpose registers that the macro instruction set uses. Similarly, a microcontroller must be able to decode operand references and fetch values. Finally, the microcontroller must coordinate with the rest of the hardware, including memory.
14. Horizontal Microcode

An alternative to a conventional (vertical) design is known as horizontal microcode, in which each micro instruction directly controls the functional units of the CPU. To understand the approach, consider the internal structure of an example CPU, which FIG. 5 illustrates.


FIG. 5 An illustration of the internal structure within a CPU. Solid arrows indicate a hardware path along which data can move.

The major item shown in the figure is an Arithmetic Logic Unit (ALU) that performs operations such as addition, subtraction, and bit shifting. The remaining functional units provide mechanisms that interface the ALU to the rest of the system. For example, the hardware units labeled operand 1 and operand 2 denote operand storage units (i.e., internal hardware registers). The ALU expects operands to be placed in the storage units before an operation is performed, and places the result of an operation in the two hardware units labeled result 1 and result 2†. Finally, the register access unit provides a hardware interface to the general-purpose registers.

In the figure, arrows indicate paths along which data can pass as it moves from one functional unit to another; each arrow is a data path that handles multiple bits in parallel (e.g., 32 bits). Most of the arrows connect to the data transfer mechanism, which serves as a conduit between functional units (a later section explains that the data transfer mechanism depicted here is called a bus).

15. Example Horizontal Microcode

Each functional unit is controlled by a set of wires that carry commands (i.e., binary values that the hardware interprets as a command). Although FIG. 5 does not show command wires, we can imagine that the number of command wires connected to a functional unit depends on the type of unit. For example, the unit labeled result 1 only needs a single command wire because the unit can be controlled by a single binary value: zero causes the unit to stop interacting with other units, and one causes the unit to send the current contents of the result unit to the data transfer mechanism.

FIG. 6 summarizes the binary control values that can be passed to each functional unit in our example, and gives the meaning of each.

†Recall that an arithmetic operation, such as multiplication, can produce a result that is twice as large as an operand.


FIG. 6 Possible command values and the meaning of each for the example functional units in FIG. 5. Commands are carried on parallel wires.

As FIG. 6 shows, the register access unit is a special case because each command has two parts: the first two bits specify an operation, and the last four bits specify a register to be used in the operation. Thus, the command 010011 means that the value in register three should be moved to the data transfer mechanism.
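The two-part command can be decoded with two shift-and-mask steps. The sketch below follows the encoding described in the text (the meaning of operation code 01 comes from the 010011 example; the meaning assigned to 10 is an assumption for illustration).

```python
# Operation codes for the register access unit.  01 is taken from the
# text's example; 10 is an assumed complementary operation.
OPS = {
    0b01: "move register to data transfer mechanism",
    0b10: "move data transfer mechanism to register",
}

def decode(command):
    """Split a 6-bit register access command into operation and register."""
    op = (command >> 4) & 0b11       # first two bits: the operation
    reg = command & 0b1111           # last four bits: the register number
    return OPS.get(op, "no operation"), reg
```

Decoding 010011 yields the operation "move register to data transfer mechanism" applied to register three, matching the example in the text.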

Now that we understand how the hardware is organized, we can see how horizontal microcode works. Imagine that each microcode instruction consists of commands to functional units - when it executes an instruction, the hardware sends bits from the instruction to functional units. FIG. 7 illustrates how bits of a microcode instruction correspond to commands in our example.


FIG. 7 Illustration of thirteen bits in a horizontal microcode instruction that correspond to commands for the six functional units.

16. A Horizontal Microcode Example

How can horizontal microcode be used to perform a sequence of operations? In essence, a programmer chooses which functional units should be active at any time, and encodes the information in bits of the microcode. For example, suppose a programmer needs to write horizontal microcode that adds the value in general-purpose register 4 to the value in general-purpose register 13 and places the result in general-purpose register 4. FIG. 8 lists the operations that must be performed.

-- Move the value from register 4 to the hardware unit for operand 1

-- Move the value from register 13 to the hardware unit for operand 2

-- Arrange for the ALU to perform addition

-- Move the value from the hardware unit for result 2 (the low-order bits of the result) to register 4


FIG. 8 An example sequence of steps that the functional units must execute to add values from general-purpose registers 4 and 13, and place the result in general-purpose register 4.

Each of the steps can be expressed as a single micro instruction in our example system. The instruction has bits set to specify which functional unit(s) operate when the instruction is executed. For example, FIG. 9 shows a microcode program that corresponds to the four steps.

In the figure, each row corresponds to one instruction, which is divided into fields that each correspond to a functional unit. A field contains a command to be sent to the functional unit when the instruction is executed. Thus, commands determine which functional units operate at each step.


FIG. 9 An example horizontal microcode program that consists of four instructions with thirteen bits per instruction. Each instruction corresponds to a step listed in FIG. 8.

Consider the code in the figure carefully. The first instruction specifies that only two hardware units will operate: the unit for operand 1 and the register interface unit. The fields that correspond to the other four units contain zero, which means that those units will not operate when the first instruction is executed. The first instruction also uses the data transfer mechanism - data is sent across the transfer mechanism from the register interface unit to the unit for operand 1†. That is, fields in the instruction cause the register interface to send a value across the transfer mechanism, and cause the operand 1 unit to receive the value.
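The dispatch step can be sketched as follows: a thirteen-bit instruction is split into one field per functional unit, and each field is sent to its unit in parallel. The unit names and field widths below are assumptions chosen to total thirteen bits (the text fixes only the six-bit register access field and the one-bit result fields), so treat them as illustrative.

```python
# Assumed layout of a 13-bit horizontal micro instruction:
# (unit name, field width in bits), most significant field first.
FIELDS = [
    ("alu",       3),
    ("operand1",  1),
    ("operand2",  1),
    ("result1",   1),
    ("result2",   1),
    ("registers", 6),
]

def split(instruction):
    """Return the command each functional unit receives this cycle."""
    commands, shift = {}, 13
    for name, width in FIELDS:
        shift -= width
        commands[name] = (instruction >> shift) & ((1 << width) - 1)
    return commands
```

Because every unit receives its field on every cycle, a zero field simply means "do nothing" for that unit, which is how an instruction activates only two units while the rest stay idle.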

17. Operations That Require Multiple Cycles

Timing is among the most important aspects of horizontal microcode. Some hardware units take longer to operate than others. For example, multiplication can take longer than addition. That is, when a functional unit is given a command, the results do not appear immediately. Instead, the program must delay before accessing the output from the functional unit.

A programmer who writes horizontal microcode must ensure that each hardware unit is given the correct amount of time to complete its task. The code in FIG. 9 assumes that each step can be accomplished in one micro instruction cycle. However, a micro instruction cycle may be too short for some hardware units to complete a task. For example, an ALU may require two micro instruction cycles to complete an addition. To accommodate longer computation, an extra instruction can be inserted following the third instruction. The extra instruction merely specifies that the ALU should continue the previous operation; no other units are affected. FIG. 10 illustrates an extra microcode instruction that can be inserted to create the necessary delay.

†For purposes of this simplified example, we assume the data transfer mechanism always operates and does not require any control.


FIG. 10 An instruction that can be inserted to add a delay while waiting for the ALU to complete an operation. Timing and delay are crucial aspects of horizontal microcode.

18. Horizontal Microcode and Parallel Execution

Now that we have a basic understanding of how hardware operates and a general idea about horizontal microcode, we can appreciate an important property: the use of parallelism. Parallelism is possible because the underlying hardware units operate independently. A programmer can specify parallel operations because an instruction contains separate fields that each control one of the hardware units.

As an example, consider an architecture that has an ALU plus separate hardware units to hold operands. Assume the ALU requires multiple instruction cycles to complete an operation. Because the ALU accesses the operands during the first cycle, the hardware units used to hold operands remain unused during successive cycles. Thus, a programmer can insert an instruction that simultaneously moves a new value into an operand unit while an ALU operation continues. FIG. 11 illustrates such an instruction.


FIG. 11 An example instruction that simultaneously continues an ALU operation and loads the value from register seven into operand hardware unit one. Horizontal microcode makes parallelism easy to specify.

The point is:

Because horizontal microcode instructions contain separate fields that each control one hardware unit, horizontal microcode makes it easy to specify simultaneous, parallel operation of the hardware units.

19. Look-Ahead and High Performance Execution

In practice, the microcode used in CPUs is much more complex than the simplistic examples in this section. One of the most important sources of complexity arises from the desire to achieve high performance. Because silicon technology allows manufacturers to place billions of transistors on a single chip, it is possible for a CPU to include many functional units that all operate simultaneously.

A later section considers architectures that make parallel hardware visible to a programmer. For now, we will consider an architectural question: can multiple functional units be used to improve performance without changing the macro instruction set? In particular, can the internal organization of a CPU be arranged to detect and exploit situations in which parallel execution will produce higher performance?

We have already seen a trivial example of an optimization: FIG. 11 shows that horizontal microcode can allow an ALU operation to continue at the same time a data value is transferred to a hardware unit that holds an operand. However, our example requires a programmer to explicitly code the parallel behavior when creating the microcode.

To understand how a CPU exploits parallelism automatically, imagine a system that includes an intelligent microcontroller and multiple functional units. Instead of working on one macro instruction at a time, the intelligent controller is given access to many macro instructions. The controller looks ahead at the instructions, finds values that will be needed, and directs functional units to start fetching or computing the values. For example, suppose the intelligent controller finds the following four instructions on a 3-address architecture:

add R1, R3, R7

sub R4, R4, R6

add R9, R5, R2

shift R8, 5

We say that an intelligent controller schedules the instructions by assigning the necessary work to functional units. For example, the controller can assign each operand to a functional unit that fetches and prepares operand values. Once the operand values are available for an instruction, the controller assigns the instruction to a functional unit that performs the operation. The instructions listed above can each be assigned to an ALU. Finally, when the operation completes, the controller can assign a functional unit the task of moving the result to the appropriate destination register. The point is: if the CPU contains enough functional units, an intelligent controller can schedule all four macro instructions to be executed at the same time.

20. Parallelism and Execution Order

Our above description of an intelligent microcontroller overlooks an important detail: the semantics of the macro instruction set. In essence, the controller must ensure that computing values in parallel does not change the meaning of the program. For example, consider the following sequence of instructions:

add R1, R3, R7

sub R4, R4, R6

add R9, R1, R2

shift R8, 5

Unlike the previous example, the operands overlap. In particular, the first instruction specifies register one as a destination, and the third instruction specifies register one as an operand. The macro instruction set semantics dictate sequential processing of instructions, which means that the first instruction will place a value in register one before the third instruction references the value. To preserve sequential semantics, an intelligent controller must understand and accommodate such overlap. In essence, the controller must balance between two goals: maximize the amount of parallel execution, while preserving the original (i.e., sequential) semantics.
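The overlap check the controller must make can be sketched directly: an instruction may not be issued in parallel with an earlier instruction whose destination register it reads (a read-after-write dependency). The tuple representation below is an illustrative encoding, not a real hardware format.

```python
def depends_on(earlier, later):
    """True if `later` reads the register that `earlier` writes.
    Each instruction is encoded as (op, dest, sources)."""
    _, dest, _ = earlier
    _, _, sources = later
    return dest in sources

# The example sequence from the text, with sources listed explicitly.
prog = [
    ("add",   "R1", ("R3", "R7")),
    ("sub",   "R4", ("R4", "R6")),
    ("add",   "R9", ("R1", "R2")),   # reads R1: must wait for the first add
    ("shift", "R8", ("R8",)),
]
```

Running the check over the sequence shows that the third instruction depends on the first, so the controller may run the other three in parallel but must delay the third until register one has been written.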

21. Out-Of-Order Instruction Execution

How can a controller that schedules parallel activities handle the case where an operand in one instruction depends on the results of a previous instruction? The controller uses a mechanism known as a scoreboard that tracks the status of each instruction being executed. In particular, a scoreboard maintains information about dependencies among instructions and about the original macro instruction sequence. Thus, the controller can use the scoreboard to decide when to fetch operands, when execution can proceed, and when an instruction is finished. In short, the scoreboard approach allows the controller to execute instructions out of order, but then reorders the results to reflect the order specified by the code.

To achieve the highest speed, a modern CPU contains multiple copies of functional units that permit multiple instructions to be executed simultaneously. An intelligent controller uses a scoreboard mechanism to schedule execution in an order that preserves the appearance of sequential processing.

22. Conditional Branches and Branch Prediction

Conditional branches pose another problem for parallel execution. For example, consider the following computation:

Y ← f(X)

if (Y > Z) { Q } else { R }

When translated into machine instructions, the computation contains a conditional branch that directs execution either to the code for Q or the code for R. The condition depends on the value of Y, which is computed in the first step. Now consider running the code on a CPU that uses parallel execution of instructions. In theory, once it reaches the conditional branch, the CPU must wait for the results of the comparison -- the CPU cannot start to schedule code for R or Q until it knows which one will be selected.

In practice, two approaches are used to handle conditional branches. The first, known as branch prediction, is based on measurements showing that in most code, a branch is taken approximately sixty percent of the time. Thus, hardware that schedules instructions along the taken path yields better performance than hardware that schedules instructions along the not-taken path. Of course, the assumption that the branch will be taken may be incorrect: if the CPU eventually determines that the branch should not be taken, the results from the predicted path must be discarded, and the CPU must follow the other path.

The second approach simply follows both paths in parallel. That is, the CPU schedules instructions for both outcomes of the conditional branch, continues to execute them, and holds the results internally. Once the value of the condition is known, the CPU discards the results from the path that is not valid and moves the correct results into the appropriate destinations. Of course, a second conditional branch can occur in either Q or R; the scoreboard mechanism handles all the details.

The point is:

A CPU that offers parallel instruction execution can handle conditional branches by proceeding to pre-compute values on one or both branches, and choosing which values to use at a later time when the computation of the branch condition completes.

It may seem wasteful for a CPU to compute values that will be discarded later. However, the goal is higher performance, not elegance. We can also observe that if a CPU is designed to wait until a conditional branch value is known, the hardware will merely sit idle. Therefore, high-speed CPUs, such as those manufactured by Intel and AMD, are designed with parallel functional units and sophisticated scoreboard mechanisms.

23. Consequences for Programmers

Can understanding how a CPU is structured help programmers write faster code? In some cases, yes. Suppose a CPU uses branch prediction and assumes that a branch is taken. A programmer can optimize performance by arranging code so that the most common case follows the predicted path. For example, if a programmer knows that Y will usually be less than Z, instead of testing Y > Z, the programmer can rewrite the code to test whether Y ≤ Z and interchange the code for the two outcomes.

24. Summary

A modern CPU is a complex processor that uses multiple modes of execution to handle some of the complexity. An execution mode determines operational parameters such as the operations that are allowed and the current privilege level. Most CPUs offer at least two levels of privilege and protection: one for the operating system and one for application programs.

To reduce the internal complexity, a CPU is often built with two levels of abstraction: a microcontroller is implemented with digital circuits, and a macro instruction set is created by adding microcode.

There are two broad classes of microcode. A microcontroller that uses vertical microcode resembles a conventional RISC processor. Typically, vertical microcode consists of a set of procedures that each correspond to one macro instruction; the CPU runs the appropriate microcode during the fetch-execute cycle. Horizontal microcode, which allows a programmer to schedule functional units to operate on each cycle, consists of instructions in which each bit field corresponds to a functional unit. A third alternative uses Field Programmable Gate Array (FPGA) technology to create the underlying system.

Advanced CPUs extend parallel execution by scheduling a set of instructions across multiple functional units. The CPU uses a scoreboard mechanism to handle cases where the results of one instruction are used by a successive instruction. The idea can be extended to conditional branches by allowing parallel evaluation of each path to proceed, and then, once the condition is known, discarding the values along the path that is not taken.

EXERCISES

1. If a quad-core CPU chip contains 2 billion transistors, approximately how many transistors are needed for a single core?

2. List seven reasons a modern CPU is complex.

3. The text says that some CPU chips include a backward compatibility mode. Does such a mode offer any advantage to a user?

4. Suppose that in addition to other hardware, the CPU used in a smart phone contains additional hardware for three previous versions of the chip (i.e., three backward compatibility modes). What is the disadvantage from a user's point of view?

5. Virtualized software systems used in cloud data centers often include a hypervisor that runs and controls multiple operating systems, and applications that each run on one of the operating systems. How do the levels of protection used with such systems differ from conventional levels of protection?

6. Some manufacturers offer a chip that contains a processor with a basic set of instructions plus an attached FPGA. An owner can configure the FPGA with additional instructions.

What does such a chip provide that conventional software cannot?

7. Read about FPGAs, and find out how they are "programmed." What languages are used to program an FPGA?

8. Create a microcode algorithm that performs 32-bit multiplication on a microcontroller that only offers 16-bit arithmetic, and implement your algorithm in C using short variables.

9. You are offered two jobs for the same salary, one programming vertical microcode and the other programming horizontal microcode. Which do you choose? Why?

10. Find an example of a commercial processor that uses horizontal microcode, and document the meaning of bits for an instruction similar to the diagram in FIG. 7.

11. What is the motivation for a scoreboard mechanism in a CPU chip, and what functionality does it provide?

12. If Las Vegas casinos computed the odds on program execution, what odds would they give that a branch is taken? Explain.


Updated: Monday, April 24, 2017 23:29 PST