Computer Architecture: CPUs -- Parallelism



1. Introduction

Previous sections cover the three key components of computer architecture: processors, memory systems, and I/O. This section begins a discussion of fundamental concepts that cross the boundaries among architectural components.

The section focuses on the use of parallel hardware, and shows that parallelism can be used throughout computer systems to increase speed. The section introduces terminology and concepts, presents a taxonomy of parallel architectures, and examines computer systems in which parallelism is the fundamental paradigm around which the entire system is designed. Finally, the section discusses limitations and problems with parallel architectures.

The next section extends the discussion by examining a second fundamental technique: pipelining. We will see that both parallelism and pipelining are important in high-speed designs.

2. Parallel and Pipelined Architectures

Some computer architects assert that there are only two fundamental techniques used to increase hardware speed: parallelism and pipelining. We have already encountered examples of each technique, and have seen how they can be used.

Other architects take a broader view of parallelism and pipelining, using the techniques as the fundamental basis around which a system is designed. In many cases, the architecture is so completely dominated by one of the two techniques that the resulting system is informally called a parallel computer or a pipelined computer.

3. Characterizations of Parallelism

Rather than classify an architecture as parallel or nonparallel, computer architects use a variety of terms to characterize the type and amount of parallelism that is present in a given design. In many cases, the terminology describes the possible extremes for a type of parallelism. We can classify an architecture by stating where the architecture lies between the two extremes. FIG. 1 lists the key characterizations using nomenclature proposed by Michael J. Flynn in a classic paper†. Later sections explain each of the terms and give examples.

-- Microscopic vs. macroscopic

-- Symmetric vs. asymmetric

-- Fine-grain vs. coarse-grain

-- Explicit vs. implicit


FIG. 1 Terminology used to characterize the amount and type of parallelism present in a computer architecture.

4. Microscopic vs. Macroscopic

Parallelism is fundamental; an architect cannot design a computer without thinking about parallel hardware. Interestingly, the pervasiveness of parallelism means that unless a computer uses an unusual amount of parallel hardware, we typically do not discuss the parallel aspects. To capture the idea that much of the parallelism in a computer remains hidden inside subcomponents, we use the term microscopic parallelism. Like microbes in the world around us, microscopic parallelism is present, but does not stand out without closer inspection.

The point is:

Parallelism is so fundamental that virtually all computer systems contain some form of parallel hardware. We use the term microscopic parallelism to characterize parallel facilities that are present, but not especially visible.

To be more precise, we say that microscopic parallelism refers to the use of parallel hardware within a specific component (e.g., inside a processor or inside an ALU), whereas macroscopic parallelism refers to the use of parallelism as a basic premise around which a system is designed.

†M. J. Flynn, "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, C-21(9):948--960, September 1972.

5. Examples of Microscopic Parallelism

In earlier sections, we have seen examples of using microscopic parallelism within processors, memory systems, and I/O subsystems. The following paragraphs highlight a few examples.

ALU. An Arithmetic Logic Unit handles logical and arithmetic operations. Most ALUs perform integer arithmetic by processing multiple bits in parallel. Thus, an ALU that is designed to operate on integers contains parallel hardware that allows the ALU to compute a Boolean function on a pair of thirty-two bit values in a single operation. The alternative consists of an ALU that processes one bit at a time, an approach that is known as bit serial processing. It should be easy to see that bit serial processing takes much longer than computing bits in parallel. Therefore, bit serial arithmetic is reserved for special cases.
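
To make the contrast concrete, the following C sketch (illustrative only, not a model of any particular ALU) contrasts bit serial addition, which processes one bit per step, with the single word-parallel addition an ALU performs in hardware.

#include <stdint.h>
#include <stdio.h>

/* Bit serial addition: one bit per iteration with an explicit carry.
   A parallel ALU produces the same result in a single operation,
   shown here by the native "+" operator. */
uint32_t add_bit_serial(uint32_t a, uint32_t b)
{
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1;
        uint32_t bi = (b >> i) & 1;
        sum |= (ai ^ bi ^ carry) << i;                    /* sum bit */
        carry = (ai & bi) | (ai & carry) | (bi & carry);  /* carry out */
    }
    return sum;
}

int main(void)
{
    uint32_t a = 1234567, b = 7654321;
    printf("bit serial: %u\n", add_bit_serial(a, b));   /* thirty-two steps */
    printf("parallel:   %u\n", a + b);                  /* one ALU operation */
    return 0;
}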

Registers. The general-purpose registers in a CPU make heavy use of microscopic parallelism. Each bit in a register is implemented by a separate digital circuit (specifically, a latch). Furthermore, to guarantee the highest-speed computation, parallel data paths are used to move data between general-purpose registers and the ALU.

Physical Memory. As another example of microscopic parallelism, recall that a physical memory system uses parallel hardware to implement fetch and store operations -- the hardware is designed to transfer an entire word on each operation. As in an ALU, microscopic parallelism increases memory speed dramatically. For example, a memory system that implements sixty-four bit words can access or store approximately sixty-four times as much data in the same time as a memory system that accesses a single bit at a time.

Parallel Bus Architecture. As we have seen, the central bus in a computer usually uses parallel hardware to achieve high-speed transfers among the processor, memory, and I/O devices. A typical modern computer has a bus that is either thirty-two or sixty-four bits wide, which means that either thirty-two or sixty-four bits of data can be transferred across the bus in a single step.

6. Examples of Macroscopic Parallelism

As the examples in the previous section demonstrate, microscopic parallelism is essential for high-speed performance - without parallel hardware, various components of a computer system cannot operate at high speed. Computer architects are aware that the global architecture often has a greater impact on overall system performance than the performance of any single subsystem. That is, adding more parallelism to a single subsystem may not improve the overall system performance†.

To achieve the greatest impact, parallelism must span multiple components of a system -- instead of merely using parallelism to improve the performance of a single component, the system must allow multiple components to work together. We use the term macroscopic parallelism to characterize the use of parallelism across multiple, large-scale components of a computer system. A few examples will clarify the concept.

Multiple, Identical Processors. Systems that employ macroscopic parallelism usually employ multiple processors in one form or another. For example, some PCs are advertised as dual core or quad core computers, meaning that the PC contains two or four copies of the processor on a single chip. The chip is arranged to allow all of the cores to operate at the same time. The hardware does not control exactly how the cores are used. Instead, the operating system assigns code to each core. For example, the operating system can assign one core the task of handling I/O (i.e., running device drivers), and assign application programs to the other cores.

Multiple, Dissimilar Processors. Another example of macroscopic parallelism arises in systems that make extensive use of special-purpose coprocessors. For example, a computer optimized for high-speed graphics might have four displays attached, with a special graphics processor running each display. A graphics processor, typically found on an interface card, does not use the same architecture as a CPU because the graphics processor needs instructions optimized for graphics operations.

7. Symmetric vs. Asymmetric

We use the term symmetric parallelism to characterize a design that uses replications of identical elements, usually processors or cores, that can operate simultaneously. For example, the multicore processors mentioned above are said to be symmetric because all cores are identical.

The alternative to a symmetric parallel design is a parallel design that is asymmetric. As the name implies, an asymmetric design contains multiple elements that function at the same time, but differ from one another. For example, a PC with a CPU, a graphics coprocessor, a math coprocessor, and an I/O coprocessor is classified as using asymmetric parallelism because the four processors can operate simultaneously, but differ from one another internally†.

8. Fine-grain vs. Coarse-grain Parallelism

We use the term fine-grain parallelism to refer to computers that provide parallelism on the level of individual instructions or individual data elements, and the term coarse-grain parallelism to refer to computers that provide parallelism on the level of programs or large blocks of data. For example, a graphics processor that uses sixteen parallel hardware units to update sixteen bytes of an image at the same time is said to use fine-grain parallelism. In contrast, a dual core PC that uses one core to print a document while another core composes an email message is described as using coarse-grain parallelism.

†Some architects also apply the term asymmetric to a multicore design if the cores do not have the same access to memory and I/O devices.

† Section 21 discusses performance in more detail.

9. Explicit vs. Implicit Parallelism

An architecture in which the hardware handles parallelism automatically without requiring a programmer to initiate or control parallel execution is said to offer implicit parallelism, and an architecture in which a programmer must control each parallel unit is said to offer explicit parallelism. We will consider the advantages and disadvantages of explicit and implicit parallelism later.

10. Types of Parallel Architectures (Flynn Classification)

Although many systems contain multiple processors of one type or another, the term parallel architecture is usually reserved for designs that permit arbitrary scaling. That is, when they refer to a parallel architecture, architects usually mean a design in which the number of processors can be arbitrarily large (or at least reasonably large).

As an example, consider a computer that can have either one or two processors. Although adding a second processor increases parallelism, such an architecture is usually classified as a dual-processor computer rather than a parallel architecture. Similarly, a PC with four cores is classified as a quad-core PC. However, a cluster of thirty-two interconnected PCs that can scale to one thousand twenty-four PCs is classified as a parallel architecture.

The easiest way to understand parallel architectures is to divide the architectures into broad groups, where each group represents a type of parallelism. Of course, no division is absolute - most practical computer systems are hybrids that contain facilities from more than one group. Nevertheless, we use the classification to define basic concepts and nomenclature that allow us to discuss and characterize the systems.

A popular way to describe parallelism that is attributed to Flynn considers whether processing or data is replicated. Known as the Flynn classification, the system focuses on whether the computer has multiple, independent processors each running a separate program or a single program being applied to multiple data items. FIG. 2 lists terms used by the Flynn classification to define types of parallelism; the next sections explain the terminology and give examples.

Name | Meaning
SISD | Single Instruction stream, Single Data stream
SIMD | Single Instruction stream, Multiple Data streams
MISD | Multiple Instruction streams, Single Data stream
MIMD | Multiple Instruction streams, Multiple Data streams


FIG. 2 Terminology used by the Flynn classification to characterize parallel computers†.

†MISD is a specialized category that is reserved for unusual hardware, such as the pipeline architecture shown in Figure 19.5 on page 387 that executes multiple instructions on a single piece of data or a redundant processor used to increase reliability.

11. Single Instruction Single Data (SISD)

The phrase Single Instruction Single Data stream (SISD) is used to describe an architecture that does not support macroscopic parallelism. The term sequential architecture or uniprocessor architecture is often used in place of SISD to emphasize that the architecture is not parallel. In essence, SISD refers to a conventional (i.e., Von Neumann) architecture - the processor runs a standard fetch-execute cycle and performs one operation at a time. The term refers to the idea that a single, conventional processor is executing instructions that each operate on a single data item. That is, unlike a parallel architecture, a conventional processor can only execute one instruction at any time, and each instruction refers to a single computation.

Of course, we have seen that an SISD computer can use parallelism internally. For example, the ALU may be able to perform operations on multiple bits in parallel, the CPU may invoke a coprocessor, or the CPU may have mechanisms that allow it to fetch operands from two banks of memory at the same time. However, the overall effect of an SISD architecture is sequential execution of instructions that each operate on one data item.

12. Single Instruction Multiple Data (SIMD)

The phrase Single Instruction Multiple Data streams (SIMD) is used to describe a parallel architecture in which each instruction specifies a single operation (e.g., integer addition), but the instruction is applied to many data items at the same time. Typically, an SIMD computer has sufficient hardware to handle sixty-four simultaneous operations (e.g., sixty-four simultaneous additions).

Vector Processors. An SIMD architecture is not useful for applications such as word processing or email. Instead, SIMD is used with applications that apply the same operation to a set of values. For example, graphics applications and some scientific applications work well on an SIMD architecture that can apply an operation to a large set of values. The architecture is sometimes called a vector processor or an array processor after the mathematical concept of vectors and the computing concept of arrays.

As an example of how an SIMD machine works, consider normalizing the values in a vector, V, that contains N elements. Normalization requires that each item in the vector be multiplied by a floating point number, Q. On a sequential architecture (i.e., an SISD architecture), the algorithm required to normalize the vector consists of a loop as FIG. 3 shows.

for i from 1 to N { V[i] ← V[i] × Q; }


FIG. 3 A sequential algorithm for vector normalization.

On an SIMD architecture, the underlying hardware can apply an arithmetic operation to all the values in an array simultaneously (assuming the size of the array does not exceed the parallelism in the hardware). For example, in a single step, hardware that has sixty-four parallel units can multiply each value in an array of sixty-four elements by a constant. Thus, the algorithm to perform normalization of an array on an SIMD computer takes one step:

V ← V × Q;

Of course, if vector V is larger than the hardware capacity, multiple steps will be required. The important point is that a vector instruction on an SIMD architecture is not merely a shorthand for a loop. Instead, the underlying system contains multiple hardware units that operate in parallel to provide substantial speedup; the performance improvement can be significant, especially for computations that use large matrices.
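
As a rough illustration of how larger vectors are handled, the following C sketch (which assumes a hypothetical hardware vector width of sixty-four elements) processes the vector in hardware-sized chunks; on a real SIMD machine, the inner loop would be a single vector multiply instruction.

#include <stddef.h>

#define HW_VECTOR_WIDTH 64   /* hypothetical SIMD width, in elements */

/* Normalize vector v of length n by constant q, one hardware-sized
   chunk at a time.  Each pass of the outer loop corresponds to one
   vector instruction; the inner loop is written out so the sketch
   compiles with any C compiler. */
void normalize(double v[], size_t n, double q)
{
    for (size_t start = 0; start < n; start += HW_VECTOR_WIDTH) {
        size_t end = start + HW_VECTOR_WIDTH;
        if (end > n)
            end = n;                 /* final, partial chunk */
        for (size_t i = start; i < end; i++)
            v[i] = v[i] * q;         /* one chunk = one vector step */
    }
}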

Of course, not all instructions in an SIMD architecture can be applied to an array of values. Instead, an architect identifies a subset of operations to be used with vectors, and defines a special vector instruction for each. For example, normalization of an entire array is only possible if the architect chooses to include a vector multiplication instruction that multiplies each value in the vector by a constant.

In addition to operations that use a constant and a vector, SIMD computers usually provide instructions that use two vectors. That is, a vector instruction takes one or more operands that each specify a vector. For example, SIMD architectures are used for problems involving matrix multiplication. On most SIMD machines, an operand that specifies a vector gives two pieces of information: the location of the vector in memory and an integer that specifies the size of the vector (i.e., number of items in the vector). On some machines, vector instructions are controlled by special-purpose registers - the address and size of each vector are loaded into registers before a vector instruction is invoked. In any case, software determines the number of items in a vector up to the maximum size supported by the hardware†.

Graphics Processors. SIMD architectures are also popular for use with graphics. To understand why, it is important to know that typical graphics hardware uses sequential bytes in memory to store values for pixels on a screen. For example, consider a video game in which foreground figures move while a background scene stays in place. Game software must copy the bytes that correspond to the foreground figure from one location in memory to another. A sequential architecture requires a programmer to specify a loop that copies one byte at a time. On an SIMD architecture, however, a programmer can specify a vector size, and then issue a single copy command. The underlying SIMD hardware then copies multiple bytes simultaneously.

†An exercise considers speedup in cases where vectors exceed the capacity of the hardware; a definition of speedup can be found in Section 15.

13. Multiple Instructions Multiple Data (MIMD)

The phrase Multiple Instructions Multiple Data streams (MIMD) is used to describe a parallel architecture in which each of the processors performs independent computations at the same time. Although many computers contain multiple internal processing units, the MIMD designation is reserved for computers in which the processors are visible to a programmer. That is, an MIMD computer can run multiple, independent programs at the same time.

Symmetric Multiprocessor (SMP). The most well-known example of an MIMD architecture consists of a computer known as a Symmetric Multiprocessor (SMP). An SMP contains a set of N processors (or N cores) that can each be used to run programs. In a typical SMP design, the processors are identical: they each have the same instruction set, operate at the same clock rate, have access to the same memory modules, and have access to the same external devices. Thus, any processor can perform exactly the same computation as any other processor. FIG. 4 illustrates the concept.


FIG. 4 The conceptual organization of a symmetric multiprocessor with N identical processors that each have access to memory and I/O devices.

While some researchers explored ways to increase the speed and power of silicon chips, other researchers investigated the symmetric multiprocessor form of MIMD as an alternate way to provide more powerful computers. One of the most well-known projects, which was conducted at Carnegie Mellon University, produced a prototype known as the Carnegie multi-mini-processor (C.mmp). During the 1980s, vendors first created commercial products, informally called multiprocessors, that used the SMP approach. Sequent Corporation (currently owned by IBM) created a symmetric multiprocessor that runs the Unix operating system, and Encore Corporation created a symmetric multiprocessor named Multimax.

Asymmetric Multiprocessor (AMP). Although SMPs are popular, other forms of MIMD architectures are possible. The chief alternative to an SMP design is an Asymmetric Multiprocessor (AMP). An AMP contains a set of N programmable processors that can operate at the same time, but does not require all processors to have identical capabilities. For example, an AMP design can choose a processor that is appropriate to a given task (i.e., one processor can be optimized for management of high-speed disk storage devices and another processor can be optimized for graphics display).

In most cases, AMP architectures follow a master-slave approach in which one processor (or in some cases a set of processors) controls the overall execution and invokes other processors as needed. The processor that controls execution is known as the master, and other processors are known as slaves.

In theory, an AMP architecture that has N processors can have many distinct types of processors. In practice, however, most AMP designs have between two and four types of processors. Typically, a general-purpose AMP architecture includes at least one processor optimized for overall control (the master), and others optimized for subsidiary functions such as arithmetic computation or I/O.

Math and Graphics Co-processors. Commercial computer systems have been created that use an asymmetric architecture. One of the most widely known AMP designs became popular in the late 1980s and early 1990s when PC manufacturers began selling math coprocessors. The idea of a math coprocessor is straightforward: the coprocessor is a special-purpose chip that the CPU can invoke to perform floating point computation. Because it is optimized for one task, a coprocessor can perform the task faster than the CPU.

CDC Peripheral Processors. Control Data Corporation helped pioneer the idea of using an AMP architecture in mainframes when they created the 6000 series of mainframe computers. The CDC architecture used ten peripheral processors to handle I/O. FIG. 5 illustrates the conceptual organization with peripheral processors between the CPU and I/O devices. Interestingly, CDC's peripheral processors were not limited to I/O - a peripheral processor resembled a minicomputer with a general-purpose instruction set that could be used however a programmer chose. The peripheral processors had access to memory, which meant a peripheral processor could read or store values in any location. Although they were much slower than the CPU, all ten peripheral processors on the CDC could execute simultaneously. Thus, it was possible to optimize program performance by dividing tasks among the peripheral processors as well as the CPU.

Although CDC computers are no longer manufactured, the basic idea of programmable I/O processors continues to be used. Surprisingly, multicore chips have made the general approach feasible again because many cores make it possible to dedicate one or more cores to I/O.

I/O Processors. Most mainframe computers use an AMP architecture to handle I/O at high speed without slowing down the CPU. Each external I/O connection is equipped with a dedicated, programmable processor. Instead of manipulating a bus or handling interrupts, the CPU merely downloads a program into the programmable processor. The processor then handles all the details of I/O. For example, the mainframe computers sold by IBM Corporation use programmable I/O processors called channels.


FIG. 5 Illustration of the asymmetric architecture used in the CDC 6000 mainframe computers.

14. Communication, Coordination, and Contention

It may seem obvious that a multiprocessor architecture will always have better performance than a uniprocessor architecture. Consider, for example, a symmetric multiprocessor, M. Intuitively, computer M can outperform a uniprocessor because M can perform N times as many operations at any time. Moreover, if a chip vendor finds a way to make a single processor run faster, the vendor who sells M merely replaces each of the processors in M with the new chip to have a faster multiprocessor. Indeed, many companies that created multiprocessors made these statements to attract customers.

Unfortunately, our intuition about computer performance can be misleading. Architects have found three main challenges in designing a high-performance parallel architecture:

-- Communication

-- Coordination

-- Contention

Communication. Although it may seem trivial to envision a computer that has dozens of independent processors, the computer must also provide a mechanism that allows the processors to communicate with each other, with memory, and with I/O devices. More important, the communication mechanism must be able to scale to handle a large number of processors. An architect must spend a significant amount of effort to create a parallel computer system that does not have severe communication bottlenecks.

Coordination. In a parallel architecture, processors must work together to perform computation. Therefore, a coordination mechanism is needed that allows processing to be controlled. We said that asymmetric designs usually designate one of the processors to act as a master that controls and coordinates all processing; some symmetric designs also use the master-slave approach. Other architectures use a distributed coordination mechanism in which the processors must be programmed to coordinate among themselves without a master.

Contention. When two or more processors attempt to access a resource at the same time, we say that the processors contend for the resource. Resource contention creates one of the greatest challenges in designing a parallel architecture because contention increases as the number of processors increases.

To understand why contention is a problem, consider memory. If a set of N processors all have access to a given memory, a mechanism is needed that only permits one processor to access the memory at any time. When multiple processors attempt to use the memory simultaneously, the hardware contention mechanism blocks all except one of them. That is, N - 1 of the processors are idle during the memory access. In the next round, N - 2 processors remain idle. It should be obvious that:

In a parallel architecture, contention for shared resources lowers performance dramatically because only one processor can use a given resource at any time; the hardware contention mechanism forces other processors to remain idle while they wait for access.

15. Performance of Multiprocessors

Multiprocessor architectures have not fulfilled the promise of scalable, high performance computing. There are several reasons: operating system bottlenecks, contention for memory, and I/O. In a modern computer system, the operating system controls all processing, including allocating tasks to processors and handling I/O. Only one copy of an operating system can run because a device cannot take orders from multiple processors simultaneously. Thus, in a multiprocessor, at most one processor can run operating system software at any time, which means the operating system is a shared resource for which processors must contend. As a consequence, the operating system quickly becomes a bottleneck that processors access serially -- if K processors need access, K-1 of them must wait.

Contention for memory has proven to be an especially difficult problem. First, hardware for a multiported memory is extremely expensive. Second, one of the more important optimizations used in memory systems, caching, causes problems when used with a multiprocessor. If the cache is shared, processors contend for access. If each processor has a private cache, all caches must be coordinated so that any update is propagated to all caches. Unfortunately, such coordination introduces overhead.

Many multiprocessor architectures suffer from another weakness: the architecture only outperforms a uniprocessor when performing intensive computation. Surprisingly, most applications are not limited by the amount of computation they perform. Instead, most applications are I/O bound, which means the application spends more time waiting for I/O than performing computation. For example, most of the delay in common applications, such as word processing, spreadsheets, video games, and Web browsing, arises when the application waits for I/O from a file or the network. Therefore, adding computational power to the underlying computer does not lower the time required to perform the computation - the extra processors sit idle waiting for I/O.

To assess the performance of an N-processor system, we define the notion of speedup to be the ratio of the performance of a single processor to the performance of a multiprocessor. Specifically, we define speedup as:

Speedup = t1 / tN

where t1 denotes the execution time taken on a single processor, and tN denotes the execution time taken on a multiprocessor†. In each case, we assume performance is measured using the best algorithm available (i.e., we allow the program to be rewritten to take advantage of parallel hardware).
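
For example, suppose a computation takes 120 seconds on a single processor and 40 seconds on a four-processor system (the numbers are chosen only for illustration). The speedup is 120 / 40 = 3, which falls short of the ideal value of 4.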

When multiprocessors are measured performing general-purpose computing tasks, an interesting result emerges. In an ideal situation, we would expect performance to increase linearly as more processors are added to a multiprocessor system. Experience has shown, however, that problems like memory contention, inter-processor communication, and operating system bottlenecks mean that multiprocessors do not achieve linear speedup. Instead, performance often reaches a limit as FIG. 6 illustrates.

Surprisingly, the performance illustrated in the figure may not be achievable in practice. In some multiprocessor designs, communication overhead and memory contention dominate the running time: as more and more processors are added, the performance starts to decrease. For example, a particular symmetric multiprocessor design exhibited a small speedup with a few processors. However, when sixty-four processors were used, communication overhead made the performance worse than a single processor system. We can summarize:

When used for general-purpose computing, a multiprocessor may not perform well. In some cases, added overhead means performance decreases as more processors are added.

†Because we expect the processing time on a single processor to be greater than the processing time on a multiprocessor, we expect the speedup to be greater than one.

FIG. 6 Illustration of the ideal and typical performance of a multiprocessor as the number of processors is increased. Values on the y-axis list the relative speedup compared to a single processor.

16. Consequences for Programmers

Parallelism usually makes programming more complex. A programmer must be aware of parallel execution, and must prevent one parallel activity from interfering with another. The following sections describe some of the mechanisms and facilities that programmers use.

16.1 Locks and Mutual Exclusion

Writing code that uses multiple processors is inherently more complex than writing code for a single processor. To understand the complexity, consider using a shared variable. For example, suppose two processors use a variable x to store a count. A programmer writes a statement such as:

x=x+1;

A compiler translates the statement into a sequence of machine instructions, such as the sequence in FIG. 7.

load x, R5 # Load variable x into R5

incr R5 # Increment the value in R5

store R5, x # Store R5 back into x


FIG. 7 An example sequence of machine instructions used to increment a variable in memory. In most architectures, increment entails a load and a store operation.

Unfortunately, if two processors attempt to increment x at nearly the same time, the value of x might be incremented once instead of twice. The error arises because each of the two processors operates independently and competes for access to memory.

Thus, the operations might be performed in the order given in FIG. 8.

-- Processor 1 loads x into its register 5

-- Processor 1 increments its register 5

-- Processor 2 loads x into its register 5

-- Processor 1 stores its register 5 into x

-- Processor 2 increments its register 5

-- Processor 2 stores its register 5 into x


FIG. 8 A sequence of steps that can occur when two independent processors or cores access variable x in shared memory.

To prevent problems like the one illustrated in FIG. 8, multiprocessor hardware provides hardware locks. A programmer must associate a lock with each shared item, and use the lock to ensure that no other processors can change the item while an update is in progress. For example, if lock 17 is associated with variable x, a programmer must obtain lock 17 before updating x. The idea is called mutual exclusion, and we say that a processor must gain exclusive use of a variable before updating the value. FIG. 9 illustrates the sequence of instructions.

lock 17 # wait for lock 17

load x, R5 # Load variable x into R5

incr R5 # Increment the value in R5

store R5, x # Store R5 back into x

release 17 # release lock 17


FIG. 9 Illustration of the instructions used to guarantee exclusive access to a variable. A separate lock is assigned to each shared item.

The underlying hardware guarantees that only one processor will be granted a lock at any time. Thus, if two or more processors both attempt to obtain a given lock at the same time, one obtains access (i.e., continues to execute) and the other is blocked. In fact, an arbitrary number of processors can be blocked while one processor holds the lock. Once the processor that holds the lock releases it, the hardware selects a blocked processor, grants the processor the lock, and allows the processor to proceed. Thus, the hardware ensures that at most one processor can hold a given lock at any time.

Locking adds a nontrivial amount of complexity to programs for several reasons. First, because locking is unusual, a programmer not accustomed to programming multiprocessors can easily forget to lock a shared variable, and because unprotected access may not always result in an error, the problem can be difficult to detect. Second, locking can severely reduce performance - if K processors attempt to access a shared variable at the same time, the hardware will keep K-1 of them idle while they wait for access. Third, because separate instructions are used to obtain and release a lock, locking adds overhead. Thus, a programmer must decide whether to obtain a lock for each individual operation or whether to obtain a lock, hold the lock while performing a series of operations on the variable, and then release the lock.
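
The same acquire, update, release pattern appears in software. The following C sketch uses POSIX threads as an illustration; a mutex plays the role of the hardware lock in FIG. 9 and guards the shared variable x.

#include <pthread.h>

/* A mutex guards shared variable x so that only one thread performs
   the load, increment, and store at a time. */
static long x = 0;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void increment_shared(void)
{
    pthread_mutex_lock(&x_lock);     /* wait for exclusive access */
    x = x + 1;                       /* load, increment, store */
    pthread_mutex_unlock(&x_lock);   /* allow other threads to proceed */
}

As with the hardware locks described above, a programmer can instead hold the mutex across a series of operations to reduce the number of acquire and release steps, at the cost of keeping other threads waiting longer.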

16.2 Programming Explicit and Implicit Parallel Computers

The most important aspect of parallelism for a programmer concerns whether software or hardware is responsible for managing parallelism: a system that uses implicit parallelism is significantly easier to program than a system that uses explicit parallelism. For example, consider a processor designed to handle packets arriving from a computer network. In an implicit design, a programmer writes code to handle a single packet, and the hardware automatically applies the same program to N packets in parallel. In an explicit design, the programmer must plan to read N packets, send each to a different core, wait for the cores to complete processing, and extract the resulting packets. In many cases, the code required to control parallel cores and determine when they each finish is more complex than the code to perform the desired computation. More important, code to control parallel hardware units must allow hardware to operate in arbitrary order. For example, because the time required to process a packet depends on the packet's contents, a controller must be ready for the hardware units to complete processing in arbitrary order. The point is:

From a programmer's point of view, a system that uses explicit parallelism is significantly more complex to program than a system that uses implicit parallelism.
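
To illustrate the extra work an explicit design imposes, the following C sketch (using POSIX threads, with a placeholder process_packet function invented for illustration) creates one worker per packet and then waits for all of them; the workers may finish in any order.

#include <pthread.h>
#include <stdio.h>

#define N 4   /* number of packets handled in parallel (illustrative) */

/* In an implicit design, this per-packet routine is all the
   programmer would write; the hardware applies it to N packets. */
static void *process_packet(void *arg)
{
    int id = *(int *)arg;
    printf("packet %d processed\n", id);   /* placeholder for real work */
    return NULL;
}

int main(void)
{
    pthread_t workers[N];
    int ids[N];

    /* Explicit design: the programmer creates, tracks, and joins the
       parallel units, and must tolerate completion in arbitrary order. */
    for (int i = 0; i < N; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, process_packet, &ids[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(workers[i], NULL);
    return 0;
}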

16.3 Programming Symmetric and Asymmetric Multiprocessors

One of the most important advantages of symmetry arises from the positive consequences it has for programmers: a symmetric multiprocessor can be substantially easier to program than an asymmetric multiprocessor. First, if all processors are identical, a programmer only needs one compiler and one language. Second, symmetry means a programmer does not need to consider which tasks are best suited for which type of processor. Third, because identical processors usually operate at the same speed, a programmer does not need to worry about the time required to perform a task on a given processor. Fourth, because all processors use the same encoding for instructions and data, a binary program or a data value can be moved from one processor to another.

Of course, any form of multiprocessor introduces a complication: in addition to everything else, a programmer must consider how coding decisions will influence performance. For example, consider a computation that processes packets arriving over a network. A conventional program keeps a global counter in memory, and updates the counter when a packet arrives. On a shared memory architecture, however, updating a value in memory is more expensive because a processor must obtain a lock before updating a shared value in memory. Thus, a programmer needs to consider the effect of minor details, such as updating a shared counter in memory.

17. Redundant Parallel Architectures

Our discussion has focused on the use of parallel hardware to improve performance or increase functionality. However, it is also possible to use parallel hardware to improve reliability and prevent failure. That is, multiple copies of hardware can be used to verify each computation.

The term redundant hardware usually refers to multiple copies of a hardware unit that operate in parallel to perform an operation. The basic difference between redundant hardware and the parallel architectures described above arises from the data items being used: a parallel architecture arranges for each copy of the hardware to operate on a separate data item; a redundant architecture arranges for all copies to perform exactly the same operation.

The point of using redundant hardware is verification that a computation is correct. What happens when redundant copies of the hardware disagree? The answer depends on the details and purpose of the underlying system. One possibility uses votes: K copies of a hardware unit each perform the computation and produce a value. A special hardware unit then compares the output, and selects the value that appears most often. Another possibility uses redundant hardware to detect hardware failures: if two copies of the hardware disagree, the system displays an error message, and then halts until the defective unit can be repaired or replaced.
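
As a small sketch of the voting approach (assuming three redundant copies, a common configuration known as triple modular redundancy), a voter in C might look like the following.

#include <stdint.h>

/* Return the value that at least two of the three redundant copies
   agree on.  If all three disagree, a real system would raise an
   error; this sketch simply returns the first result. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b || a == c)
        return a;
    if (b == c)
        return b;
    return a;   /* no majority: error handling omitted */
}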

18. Distributed and Cluster Computers

The parallel architectures discussed in this section are called tightly coupled because the parallel hardware units are located inside the same computer system. The alternative, which is known as a loosely coupled architecture, uses multiple computer systems that are interconnected by a communication mechanism that spans longer distances. For example, we use the term distributed architecture to refer to a set of computers that are connected by a computer network or an internet. In a distributed architecture, each computer operates independently, but the computers can communicate by sending messages across a network.

A special form of distributed computing system is known as a network cluster or a cluster computer. In essence, a cluster computer consists of a set of independent computers, such as commodity PCs, connected by a high-speed computer network. Scientists use cluster computers to run computations on extremely large sets of data, Internet search companies use clusters to respond to users' search terms, and cloud providers use the cluster approach to build cloud data centers. The general idea is that for a cluster of N computers, computation can be divided many ways. The computers in a cluster are flexible - they can be dedicated to solving a single problem or separate problems. Computers in the cluster run independently. If they are working on a single problem, the results may be collected to produce the final output.

A special case of cluster computing is used to construct a high-capacity Web site that handles many small requests. Each computer in the cluster runs a copy of the same Web server. A special-purpose system known as a Web load balancer disperses incoming requests among computers in the cluster. Each time a request arrives, the load balancer chooses the least-loaded computer in the cluster and forwards the request. Thus, a Web site with N computers in a cluster can respond to approximately N times as many requests per second as a single computer.
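
A minimal sketch of the selection step, assuming the balancer keeps a per-server count of requests in progress (real load balancers use more elaborate metrics and health checks):

#include <stddef.h>

/* Return the index of the least-loaded server; the balancer forwards
   the incoming request to that server and increments its count. */
size_t least_loaded(const unsigned load[], size_t n_servers)
{
    size_t best = 0;
    for (size_t i = 1; i < n_servers; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}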

Another form of loosely coupled distributed computing is known as grid computing. Grid computing uses the global Internet as a communication mechanism among a large set of computers. The computers (typically personal computers owned by individuals) agree to provide spare CPU cycles for the grid. Each computer runs software that repeatedly accepts a request, performs the requested computation, and returns the result. To use the grid, a problem must be divided into many small pieces. Each piece of the problem is sent to a computer, and all computers can execute simultaneously.

19. A Modern Supercomputer

Informally, the term supercomputer is used to denote an advanced computing system that has significantly more processing power than mainframe computers. Because they are often used for scientific calculations, supercomputers are typically assessed by the number of floating point operations per second the computer can perform.

Parallelism has always played an important role in supercomputers. Early supercomputers had 16 or 64 processors. A modern supercomputer consists of a cluster of many PCs that are interconnected by a high-speed Local Area Network. Furthermore, the processor in each PC has multiple cores. Modern supercomputers carry parallelism to a surprising extreme. For example, the Tianhe-2 supercomputer in China consists of a cluster of 16,000 Intel nodes. Each node has its own memory and a set of processors, each of which has multiple cores. The resulting system has a total of 3,120,000 cores. The computational power of a computer with over 3 million cores is difficult to imagine.

20. Summary

Parallelism is a fundamental optimization technique used to increase hardware performance. Most components of a computer system contain parallel hardware; an architecture is only classified as parallel if the architecture includes parallel processors. Explicit parallelism gives a programmer control over the use of parallel facilities; implicit parallelism handles parallelism automatically.

A uniprocessor computer is classified as a Single Instruction Single Data (SISD) architecture because a single instruction operates on a single data item at any given time. A Single Instruction Multiple Data (SIMD) architecture allows an instruction to operate on an array of values. Typical SIMD machines include vector processors and graphics processors. A Multiple Instructions Multiple Data (MIMD) architecture employs multiple, independent processors that operate simultaneously and can each execute a separate program. Typical MIMD machines include symmetric and asymmetric multiprocessors. Alternatives to SIMD and MIMD architectures include redundant, distributed, cluster, and grid architectures.

In theory, a general-purpose multiprocessor with N processors should perform N times faster than a single processor. In practice, however, memory contention, communication overhead, and coordination mean that the performance of a multiprocessor does not increase linearly as the number of processors increases. In the extreme case, overhead means that performance can decrease as additional processors are added.

Programming a computer with multiple processors can be a challenge. In addition to other considerations, a programmer must use locks to guarantee exclusive access to shared items.

A modern supercomputer consists of a large cluster of processors. If a problem can be partitioned into subparts, the processors in a supercomputer cluster can work on subparts in parallel.

EXERCISES

1. Define macroscopic parallelism and give an example.

2. If a computer has four cores plus two GPU cores, does the system have symmetric parallelism, asymmetric parallelism, or some of both? Explain.

3. Use the Flynn classification scheme to classify a dual-core smart phone.

4. What is contention, and how does it affect performance?

5. A C programmer is writing code that will run on multiple cores, and must increment a shared variable x. Instead of writing:

x=x+1;

the C programmer writes:

x ++;

Does the second form guarantee that two cores can execute the increment without interfering with one another? Explain.

6. You receive two job offers for the same salary, one writing code for a system that uses explicit parallelism and another writing code for a system that uses implicit parallelism.

Which do you choose, and why?

7. Consider multiplying two 10 x 20 matrices on a computer that has vector capability but limits each vector to sixteen items. How is matrix multiplication handled on such a computer, and how many vector multiplications are required?

8. In the previous exercise, how many scalar multiplications are needed on a uniprocessor (i.e., an SISD architecture)? If we ignore addition and only measure multiplication, what is the speedup? Does the speedup change when multiplying 100 x 100 matrices?

9. If you have access to single-processor and dual-processor computers that use the same clock rate, write a program that consumes large amounts of CPU time, run multiple copies on both computers, and record the running times. What is the effective speedup?

10. In the previous question, change the program to reference large amounts of memory (e.g., repeatedly set a large array to a value x, then set the array to value y, and so on).

How do memory references affect the speedup?

11. Can a multiprocessor ever achieve speedup that is better than linear? To find out, consider an encryption breaking algorithm that must try twenty-four (four factorial) possible encryption keys and must perform up to 1024 operations to test each key (stopping early only if an answer is found). If we assume a multiprocessor requires K milliseconds to perform 1024 operations, on average how much time will the processor spend solving the entire problem? How much time will a 32-processor MIMD machine spend solving the problem? What is the resulting speedup?

12. Search the Web to find a list of the top 10 supercomputers. How many cores does each have?
