Computer Architecture: CPUs -- Data Pipelining





1. Introduction

Earlier sections present processors, memory systems, and I/O as the fundamental aspects of computer architecture. The previous section shows how parallelism can be used to increase performance, and explains a variety of parallel architectures.

This section focuses on the second major technique used to increase performance: data pipelining. The section discusses the motivation for pipelining, explains the variety of ways pipelining is used, and shows why pipelining can increase hardware performance.

2. The Concept of Pipelining

The term pipelining refers broadly to any architecture in which digital information flows through a series of stations (e.g., processing components) that each inspect, interpret, or modify the information as FIG. 1 illustrates.


FIG. 1 Illustration of the pipeline concept. The example pipeline has four stages, and information flows through each stage.

Although we are primarily interested in hardware architectures and the use of pipelining within a single computer system, the concept itself is not limited to hardware.

Pipelining is not restricted to a single computer, a particular type or size of digital information, or a specific length of pipeline (i.e., a particular number of stages). Instead, pipelining is a fundamental concept in computing that is used in a variety of situations.

To help us understand the concept, we will consider a set of characteristics. FIG. 2 lists ways to characterize pipelines, and succeeding paragraphs explain each of the characteristics.

-- Hardware or software implementation

-- Large or small scale

-- Synchronous or asynchronous flow

-- Buffered or unbuffered flow

-- Finite chunks or continuous bit streams

-- Automatic data feed or manual data feed

-- Serial or parallel path

-- Homogeneous or heterogeneous stages


FIG. 2 The variety of ways a pipeline can be used in digital systems.

Hardware or Software Implementation. Pipelining can be implemented in either software or hardware. For example, the Unix operating system provides a pipe mechanism that can be used to form a software pipeline - a set of processes creates pipes that connect the output of one process to the input of the next process. We will examine hardware pipelines in later sections. However, it should be noted that software and hardware pipelines are independent: a software pipeline can be created on a computer that does not use a pipeline hardware architecture, and pipeline hardware is not necessarily visible to programmers.
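To make the mechanism concrete, the sketch below approximates a two-stage Unix-style software pipeline from within a program. It is a minimal illustration, not the shell's actual implementation; both stages are hypothetical one-line programs run as separate processes, with an operating system pipe connecting the output of the first to the input of the second.

```python
import subprocess
import sys

# Stage 1 emits two lines of text; stage 2 uppercases whatever it reads.
# Each stage is a separate process, and an OS pipe connects them, just as
# the shell does when it sees the vertical bar character.
stage1 = subprocess.Popen(
    [sys.executable, "-c", "print('hello'); print('world')"],
    stdout=subprocess.PIPE)
stage2 = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    stdin=stage1.stdout, stdout=subprocess.PIPE)
stage1.stdout.close()   # let stage 2 hold the only read end of the pipe

output = stage2.communicate()[0].decode()
print(output)
```

The parent program never touches the data itself; once the pipe is set up, information flows directly from one process to the next.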

Large or Small Scale. Stages in a pipeline can range from simplistic to powerful, and a pipeline can range in length from short to long. At one extreme, a hardware pipeline can be contained entirely within a small functional unit on a chip. At the other extreme, a software pipeline can be created by passing data through a series of programs that each run on a separate computer and use the Internet to communicate. Similarly, a short pipeline can be formed of two stages, one that generates information and one that absorbs it, and a long pipeline can contain hundreds of stages.

Synchronous or Asynchronous Flow. A synchronous pipeline operates like an assembly line: at a given time, each stage is processing some amount of information (e.g., a byte). A global clock controls movement, which means that all stages simultaneously forward their data (i.e., the results of processing) to the next stage. The alternative, an asynchronous pipeline, allows a station to forward information at any time. Asynchronous communication is especially attractive for situations where the amount of time a given stage spends on processing depends on the data the stage receives. However, asynchronous communication can mean that if one stage delays for a long time, later stages must wait.

Buffered or Unbuffered Flow. Our conceptual diagram in FIG. 1 implies that one stage of a pipeline sends data directly to another stage. It is also possible to construct a pipeline in which a buffer is placed between each pair of stages. Buffering is useful with asynchronous pipelines in which information is processed in bursts (i.e., a pipeline in which a stage repeatedly emits steady output, then ceases emitting output, and then begins emitting steady output again).
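A buffered asynchronous pipeline can be sketched in software by placing a bounded queue between each pair of stages. The example below is a minimal illustration using threads; the stage functions, the queue size, and the sentinel convention are arbitrary choices made for the sketch, not features of any particular hardware design.

```python
import queue
import threading

def run_stage(fn, inq, outq):
    """Pull items from inq, apply fn, and push results to outq.

    A None item is a shutdown sentinel that is propagated downstream.
    """
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)
            return
        outq.put(fn(item))

# Bounded queues act as the buffers between stages, absorbing bursts:
# a fast producer can run ahead of a slow consumer until a queue fills.
q1, q2, q3 = (queue.Queue(maxsize=8) for _ in range(3))
threading.Thread(target=run_stage, args=(lambda x: x * 2, q1, q2)).start()
threading.Thread(target=run_stage, args=(lambda x: x + 1, q2, q3)).start()

for item in [1, 2, 3]:
    q1.put(item)
q1.put(None)            # signal end of input

results = []
while (out := q3.get()) is not None:
    results.append(out)
print(results)   # [3, 5, 7]
```

Because each stage runs in its own thread and blocks only on its own queues, the stages proceed asynchronously while the buffers smooth out timing differences.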

Finite Chunks or Continuous Bit Streams. The digital information that passes through a pipeline can consist of a sequence of small data items (e.g., packets from a computer network) or an arbitrarily long bit stream (e.g., a continuous video feed).

Furthermore, a pipeline that operates on individual data items can be designed such that all data items are the same size (e.g., disk blocks that are each four Kbytes) or the size of data items is not fixed (e.g., a series of Ethernet packets that vary in length).

Automatic Data Feed or Manual Data Feed. Some implementations of pipelines use a separate mechanism to move information, and other implementations require each stage to participate in moving information. For example, a synchronous hardware pipeline typically relies on an auxiliary mechanism to move information from one stage to another. However, a software pipeline usually requires each stage to write outgoing data and read incoming data explicitly.

Serial or Parallel Path. The large arrows in FIG. 1 imply that a parallel path is used to move information from one stage to another. Although some hardware pipelines do use a parallel path, many use serial communication. Furthermore, communication between stages need not use a conventional data path (e.g., stages can use a computer network or shared memory to communicate).

Homogeneous or Heterogeneous Stages. Although FIG. 1 uses the same size and shape for each stage of a pipeline, homogeneity is not required. Some implementations of pipelines choose a type of hardware that is appropriate for each stage.

3. Software Pipelining

From a programmer's point of view, a software pipeline is attractive for two reasons. First, a software pipeline provides a way to handle complexity. Second, a software pipeline allows programs to be re-used. In essence, both goals are achieved because a software pipeline allows a programmer to divide a large, complex task into smaller, more generic pieces.

As an example of software pipelining, consider the pipeline facilities provided by the Unix shell (i.e., the command interpreter). To create a software pipeline, a user enters a list of command names separated by the vertical bar character to specify that the programs should be run as a pipeline. The shell arranges the programs so the output from one program becomes the input of the next. Each program can have zero or more arguments that control processing. For example, the following input to the shell specifies that three programs, cat, sed, and more are to be connected in a pipeline:

cat x | sed 's/friend/partner/g' | more

In the example, the cat program writes a copy of file x (presumably a text file) to its output, which becomes the input of the sed program. The sed program, in the middle of the pipeline, receives input from cat and sends output to more. Sed has an argument that specifies translating every occurrence of the word friend to partner. The final program in the pipeline, more, receives input from sed and displays the input on the user's screen.
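The same three-stage decomposition can be mimicked with small functions, which makes the data flow explicit. The sketch below is a loose Python analogue of the shell pipeline, with an in-memory string standing in for the file and the final stage returning text rather than paging it to a screen.

```python
import io
import re

def cat(f):
    # Stage 1: emit lines from a file-like object, one at a time.
    yield from f

def translate(lines):
    # Stage 2: replace every occurrence of 'friend' with 'partner',
    # mirroring sed 's/friend/partner/g'.
    for line in lines:
        yield re.sub(r'friend', 'partner', line)

def display(lines):
    # Stage 3: collect the lines for "display" (here, just return them).
    return ''.join(lines)

text = io.StringIO("my friend\nyour friend\n")
result = display(translate(cat(text)))
print(result)
```

Each stage knows nothing about the others; as in the shell version, the stages connect only through the stream of lines passing between them.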

Although the example above is trivial, it illustrates how a software pipeline helps programmers. Decomposing a program into a series of smaller, less complex programs makes it easier to create and debug software. Furthermore, if the division is chosen carefully, some pieces can be re-used among programs. In particular, programmers often find that using a pipeline to separate input and output processing from computation allows the code that performs computation to be re-used with various forms of input and output.

4. Software Pipeline Performance and Overhead

It may seem that software pipelining results in lower performance than a single program. The operating system must run multiple application programs concurrently, and must pass data between pairs of programs. Inefficiency can be especially high if early stages of a pipeline pass large volumes of data that are later discarded. For example, consider the following software pipeline that contains one more stage than the example above: an additional invocation of sed that deletes any line containing the character W.

cat x | sed 's/friend/partner/g' | sed '/W/d' | more

If we expect ninety-nine percent of all lines to contain the character W, the first two stages of the pipeline will perform unnecessary work (i.e., processing lines of text that will be discarded in a later stage of the pipeline). In the example, the pipeline can be optimized by moving the deletion to an earlier stage. However, the overhead of using a software pipeline appears to remain: copying data from one program to another is less efficient than performing all computation in a single program.
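The effect of stage ordering can be checked with a quick experiment. In the hypothetical input below, 99 of 100 lines contain the character W, so moving the deletion stage earlier lets the translation stage touch only one line instead of all 100, while the final output is identical.

```python
import re

# A hypothetical input: 99 lines contain 'W', one does not.
lines = ["W line %d" % i for i in range(99)] + ["my friend"]

# Ordering 1 (as in the shell example): translate first, then delete.
# The translation stage does work on all 100 lines.
translated = [re.sub('friend', 'partner', ln) for ln in lines]
kept_1 = [ln for ln in translated if 'W' not in ln]

# Ordering 2 (optimized): delete first, so translation sees one line.
kept = [ln for ln in lines if 'W' not in ln]
kept_2 = [re.sub('friend', 'partner', ln) for ln in kept]

print(len(translated), len(kept))   # 100 1
print(kept_1 == kept_2)             # True: same output, far less work
```

The optimization changes only where work happens, not the result, which is why such reordering is safe whenever the stages are independent.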

Surprisingly, a software pipeline can perform better than a large, monolithic program, even if the underlying hardware does not use multiple cores. To understand why, consider the underlying architecture: processing, memory, and I/O are constructed from independent hardware. An operating system takes advantage of the independence by automatically switching the processor among application programs (i.e., processes): when one application is waiting for I/O, another application runs. Thus, if a pipeline is composed of many small applications, the operating system may be able to improve overall performance by running one of the applications in a pipeline while another application waits for I/O.

5. Hardware Pipelining

Like software pipelining, hardware pipelining can help a designer manage complexity - a complex task can be divided into smaller, more manageable pieces. However, the most important reason architects choose a hardware pipeline is increased performance. There are two distinct uses of hardware pipelines that each provide high performance:

-- Instruction pipeline

-- Data pipeline

Instruction Pipeline. Section 5 explains how the fetch-execute cycle in a processor can use a pipeline to decode and execute instructions. To be precise, we use the term instruction pipeline to describe a pipeline in which the information consists of machine instructions and the stages of the pipeline decode and execute the instructions. Because the instruction set and operand types vary among processors, there is no overall agreement on the number of stages in an instruction pipeline or the exact operations performed at a given stage†.

Data Pipeline. The alternative to an instruction pipeline is known as a data pipeline. That is, instead of passing instructions, a data pipeline is designed to pass data from stage to stage. For example, if a data pipeline is used to handle packets that arrive from a computer network, each packet passes sequentially through the stages of the pipeline. Data pipelining provides some of the most unusual and most interesting uses of pipelining. As we will see, data pipelining also has the potential for the greatest overall improvement in performance.

6. How Hardware Pipelining Increases Performance

To understand why pipelining is fundamental in hardware design, we need to examine a key point: pipelining can dramatically increase performance. To see how, compare a data pipeline to a monolithic design. For example, consider the design of an Internet router that is used by an Internet Service Provider (ISP) to forward packets between customers and Web sites. A router connects to multiple networks, some of which lead to customers and at least one of which leads to the Internet. Network packets can arrive over any network, and the router's job is to send each packet on toward its destination. For purposes of this example, we will assume the router performs six basic operations on each packet as listed in FIG. 3. It is not important to understand each of the operations, only to appreciate that the example is realistic.

†The definition of superpipeline, given later in this section, also relates to an instruction pipeline.

1. Receive a packet (i.e., read the packet from the network device and transfer the bytes into a buffer in memory)

2. Verify packet integrity (e.g., use a checksum to verify that no changes occurred between transmission and reception)

3. Check for forwarding loops (i.e., decrement a value in the header, and reform the header with the new value)

4. Select a path (i.e., use the destination address field in the packet to select one of the possible output networks and a destination on that network)

5. Prepare for transmission (i.e., compute information that will be sent with the packet and used by the receiver to verify integrity)

6. Transmit the packet (i.e., transfer the packet to the output device)


FIG. 3 An example series of steps that hardware in an Internet router performs to forward a packet.
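The forwarding loop in FIG. 3 can be sketched as ordinary code. Every helper below (the toy checksum, the packet fields, the route table, and the receive/transmit callbacks) is a hypothetical stand-in for hardware operations; the sketch exists only to show the sequence of the six steps.

```python
def checksum(data):
    # A toy integrity check (not a real Internet checksum or CRC).
    return sum(data) & 0xFFFF

def forward_loop(receive, transmit, routes):
    """Repeatedly apply the six steps of FIG. 3 to each packet."""
    while True:
        packet = receive()                              # step 1: receive
        if packet is None:
            return                                      # no more input
        if packet['checksum'] != checksum(packet['payload']):
            continue                                    # step 2: verify; drop if bad
        packet['ttl'] -= 1                              # step 3: check for loops
        if packet['ttl'] <= 0:
            continue                                    # loop suspected; drop
        out = routes.get(packet['dest'])                # step 4: select a path
        if out is None:
            continue                                    # no route; drop
        packet['checksum'] = checksum(packet['payload'])  # step 5: prepare
        transmit(out, packet)                           # step 6: transmit

# A tiny demonstration with two packets; the second is dropped at step 3.
packets = [
    {'dest': 'A', 'ttl': 4, 'payload': b'hi', 'checksum': checksum(b'hi')},
    {'dest': 'B', 'ttl': 1, 'payload': b'bye', 'checksum': checksum(b'bye')},
]
sent = []
forward_loop(lambda: packets.pop(0) if packets else None,
             lambda port, p: sent.append((port, p['dest'])),
             {'A': 0, 'B': 1})
print(sent)   # [(0, 'A')]
```

On a single processor, every packet pays the cost of all six steps in sequence; the sections that follow show how spreading the steps across stages changes the throughput.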

Consider the design of hardware that implements the steps in the figure. Because the steps involve complex computation, it may seem that a processor should be used to perform packet forwarding. However, a single processor is not fast enough for high-speed networks. Thus, most designs employ two optimizations described in earlier sections: smart I/O devices and parallelism. A smart I/O device can transfer a packet to or from memory without using a processor, and a parallel design uses a separate processor to handle each input.

A parallel router design with a smart I/O interface means that each processor implements a loop that repeatedly executes the six basic steps. FIG. 4 illustrates how a processor connects to an input, and shows the algorithm the processor runs.


FIG. 4 (a) Illustration of the connections on a processor used in a parallel implementation of an Internet router, and (b) the algorithm the processor executes. Each processor handles input from one network.

Suppose a parallel architecture, like the one in the figure, is still too slow. That is, suppose the processor cannot execute all the steps of the algorithm before the next packet arrives over the interface, and no faster processor is available. How can we achieve higher performance? One possibility for higher speed lies in a data pipeline: use a pipeline of several processors in place of a single processor, as FIG. 5 illustrates†.


FIG. 5 Illustration of a pipeline used in place of a single processor in an Internet router.

It may seem that the pipeline in the figure is no faster than the single processor in FIG. 4. After all, the pipeline architecture performs exactly the same operations on each packet as the single processor. Furthermore, if the processors in FIG. 5 are each the same speed as the processor in FIG. 4, the time to perform a given operation will be the same. For example, the step labeled verify integrity will take the same amount of time on both architectures, the step labeled check for loops will take the same amount of time on both architectures, and so on. Thus, if we ignore the delay introduced by passing packets among stages of the pipeline, the total time taken to process a packet is exactly the same as in a single processor architecture. That is:

A data pipeline passes data through a series of stages that each examine or modify the data. If it uses the same speed processors as a non-pipeline architecture, a data pipeline will not improve the overall time needed to process a given data item.

If the total processing time required for an item is the same in the pipelined and non-pipelined architectures, what is the advantage of a data pipeline? Surprisingly, even if the individual processors in FIG. 5 are each exactly the same speed as the processor in FIG. 4, the pipeline architecture can process more packets per second.

To see why, observe that an individual processor executes fewer instructions per packet.

Furthermore, after operating on one data item, a processor moves on to the next data item. Thus, a data pipeline architecture allows a given processor to move on to the next data item more quickly than a non-pipeline architecture. As a result, data can enter (and leave) a pipeline at a higher rate.

†A pipeline provides an example of the Flynn MISD type of parallel architecture mentioned in the previous section.

We can summarize:

Even if a data pipeline uses the same speed processors as a non-pipeline architecture, a data pipeline has higher overall throughput (i.e., number of data items processed per second).

7. When Pipelining Can Be Used

A pipeline will not yield higher performance in all cases. FIG. 6 lists conditions that must be met for a pipeline to perform faster than a single processor.

-- Partitionable problem

-- Equivalent processor speed

-- Low overhead data movement


FIG. 6 The three key conditions that must be met for a data pipeline to perform better than the same computation on a single processor.

Partitionable Problem. It must be possible to partition processing into stages that can be computed independently of one another. Computations that employ a sequence of steps work well in a pipeline, but computations that involve iteration often do not.

Equivalent Processor Speed. It should be obvious that if the processors used in a data pipeline are slow enough, the overall time required to perform a computation will be much higher than on a single processor. Processors in the pipeline do not need to be faster than the single processor. We merely require that each processor in the pipeline is approximately as fast as the single processor. That is, the time required to perform a given computation on a pipeline processor must not exceed the time required to perform the same computation on the single processor.

Low Overhead Data Movement. In addition to the time required to perform computation, a data pipeline has an additional overhead: the time required to move a data item from one stage of the pipeline to the next. If moving the data incurs extremely high latency, pipelining will not increase performance.

The requirements arise because of an important principle:

The throughput of a pipeline is limited by the stage that takes the most time.

As an example, consider the data pipeline in FIG. 5. Suppose that all processors in the pipeline are identical, and assume that a pipeline processor takes exactly the same time to execute an instruction as the single processor. To make the example concrete, assume that a processor can execute ten instructions each microsecond. Further suppose the four stages in the figure take fifty, one hundred, two hundred, and one hundred fifty instructions, respectively, to process a packet. The slowest stage requires two hundred instructions, which means the total time the slowest stage takes to process a packet is:

200 instructions / 10 instructions per µsec = 20 µsec per packet (eqn. 1)

Looking at this another way, we can see that the maximum number of packets that can be processed per second is the inverse of the time per packet of the slowest stage.

Thus, the overall throughput of the example pipeline, Tp, is given by:

Tp = 1 packet / 20 µsec = 50,000 packets per second (eqn. 2)

In contrast, a non-pipelined architecture must execute all 500 instructions for each packet, which means that the total time required for a packet is 50 µsec. Thus, the throughput of the non-pipelined architecture is limited to:

1 packet / 50 µsec = 20,000 packets per second (eqn. 3)
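The arithmetic above can be verified with a few lines of code, using the instruction counts and processor speed from the example:

```python
# Throughput arithmetic for the example pipeline of FIG. 5.
MIPS = 10                      # instructions executed per microsecond
stages = [50, 100, 200, 150]   # instructions per packet at each stage

# The slowest stage limits the pipeline (eqn. 1 and eqn. 2).
slowest_usec = max(stages) / MIPS     # 20 usec per packet
pipeline_pps = 1e6 / slowest_usec     # 50,000 packets per second

# A single processor must execute all 500 instructions (eqn. 3).
single_usec = sum(stages) / MIPS      # 50 usec per packet
single_pps = 1e6 / single_usec        # 20,000 packets per second

print(pipeline_pps, single_pps)   # 50000.0 20000.0
```

Note that the pipeline's advantage (a factor of 2.5 here) depends entirely on how evenly the 500 instructions are divided among stages; a perfectly balanced four-stage division of 125 instructions each would yield a factor of four.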

8. The Conceptual Division of Processing

The reason data pipelining improves performance arises because pipelining provides a special form of parallelism. By dividing a series of sequential operations into groups that are each handled by a separate stage of the pipeline, pipelining allows stages to operate in parallel. Of course, a pipeline architecture differs from a conventional parallel architecture in a significant way: although the stages operate in parallel, a given data item must pass through all stages. FIG. 7 illustrates the concept.


FIG. 7 (a) Processing on a conventional processor, and (b) equivalent processing in a data pipeline. The functions performed in sequence are divided among stages of the pipeline.

The point of the figure is that the three stages operate in parallel. Stage three performs function h on one data item at the same time that stage two performs function g on a second data item and stage one performs function f on a third data item. As long as the pipeline is full (i.e., there are no delays between items), the overall system benefits from N stages all running in parallel.
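A simple timing model makes the parallelism visible. Assuming every stage takes the same amount of time (the counts and stage time below are arbitrary illustrative values), an N-stage pipeline finishes one item per stage time once it is full, while a single processor finishes one item per N stage times.

```python
def pipeline_finish_times(num_items, num_stages, stage_time):
    # Item i enters stage 1 at time i*stage_time (a new item can enter
    # as soon as stage 1 is free) and leaves the final stage
    # num_stages stage-times after it enters.
    return [(i + num_stages) * stage_time for i in range(num_items)]

def monolithic_finish_times(num_items, num_stages, stage_time):
    # A single processor performs all the stages for one item before
    # starting the next item.
    total = num_stages * stage_time
    return [(i + 1) * total for i in range(num_items)]

pipe = pipeline_finish_times(6, 3, 10)
mono = monolithic_finish_times(6, 3, 10)
print(pipe)   # [30, 40, 50, 60, 70, 80]  -- one item every 10 time units
print(mono)   # [30, 60, 90, 120, 150, 180] -- one item every 30 time units
```

Both designs deliver the first item at time 30 (the latency is identical), but thereafter the pipeline delivers an item every stage time, which is the throughput advantage described above.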

9. Pipeline Architectures

Recall from the previous section that we distinguish between hardware architectures that merely use parallelism and architectures in which parallelism forms the central paradigm around which the entire architecture is designed. We make an analogous distinction between hardware architectures that use pipelining and architectures in which pipelining forms the central paradigm around which the entire system is designed. We reserve the name pipeline architectures for the latter. Thus, one might hear an architect say that the processor in a given system uses instruction pipelining, but the architect will not characterize the system as a pipeline architecture unless the overall design centers around a pipeline.

Most hardware systems that follow a pipeline architecture are dedicated to special purpose functions. For instance, the example above describes how pipelining can be used to improve performance of a packet processing system. Pipelining is especially important in network systems because the high data rates used when sending data over optical fibers exceed the capacity of conventional processors.

Pipeline architectures are less relevant to general-purpose computers for two reasons. First, few applications can be decomposed into a set of independent operations that can be applied sequentially. Instead, a typical application accesses items randomly and keeps large volumes of additional state information. Second, even in situations where the functions to be performed on data can be decomposed into a pipeline, the number of stages in the pipeline and the hardware needed to implement each stage is not usually known in advance. As a result, general-purpose computers usually restrict pipeline hardware to an instruction pipeline in the processor or a special-purpose pipeline in an I/O device.

10. Pipeline Setup, Stall, and Flush Times

Our description of pipelines overlooks many of the practical details. For example, many pipeline implementations have overhead associated with starting and stopping the pipeline. We use the term setup time to describe the amount of time required to start a pipeline after an idle period. Setup may involve synchronizing processing among stages or passing a special control token through the pipeline to restart each stage. For a software pipeline, setup can be especially expensive because connections among various stages are created dynamically.

Unlike other architectures, a pipeline can require significant time to terminate processing. We use the term flush time to refer to the amount of time that elapses between the input being unavailable and the pipeline finishing its current processing. We say that items currently in the pipeline must be flushed before the pipeline can be shut down.

The need to flush items through a pipeline can arise for two reasons. First, a pipeline becomes idle when no input is available for the first stage. Second, as we have seen, later stages of a pipeline become idle when one stage stalls (i.e., the stage delays because it cannot complete processing). In a high-speed hardware pipeline, mundane operations such as a memory reference or an I/O operation can cause a stage to stall.

Thus, high flush (or setup) times can reduce pipeline performance significantly.
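A rough model shows why setup and flush times matter. In the hypothetical figures below (four stages of 20 µsec each, 500 µsec of setup), a pipeline that stays busy amortizes the setup and flush costs over many items, while a pipeline restarted for every single item pays them each time.

```python
def items_per_second(batch, stages, stage_usec, setup_usec):
    # Time to push `batch` items through the pipeline: pay setup once,
    # flush the first item through all stages, then finish one more
    # item per stage time in steady state.
    total_usec = setup_usec + stages * stage_usec + (batch - 1) * stage_usec
    return batch / (total_usec / 1e6)

# A large batch approaches the steady-state rate of 1/stage_time
# (50,000 items/sec for a 20 usec bottleneck stage)...
print(items_per_second(100000, 4, 20, 500))

# ...but restarting the pipeline for every item is dominated by
# setup and flush overhead.
print(round(items_per_second(1, 4, 20, 500)))   # 1724
```

The model is deliberately crude (it assumes identical stage times and ignores stalls), but it captures the qualitative point: overhead that occurs once per idle period matters only when idle periods are frequent.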

11. Definition of Superpipeline Architecture

A final concept completes our description of pipelines. Architects use the term superpipeline to describe an extension of the pipeline approach in which a given stage of the pipeline is subdivided into a set of partial stages. Superpipelining is most often used with an instruction pipeline, but the concept applies to data pipelines as well. The general idea is: if dividing processing into N stages can increase overall throughput, adding more stages can increase throughput further.

A traditional instruction pipeline might have five stages that correspond to: instruction fetch, instruction decode, operand fetch, ALU operation, and memory write. A superpipeline architecture subdivides one or more stages into multiple pieces. For example, a superpipeline might subdivide the operand fetch stage into four steps: decode an operand, fetch immediate values or values from registers, fetch values from memory, and fetch indirect operand values. As with standard pipelining, the point of the subdivision is higher throughput -- because each substage takes less time, throughput of a superpipeline is higher than the throughput of a standard pipeline.
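The claimed benefit can be illustrated with throughput arithmetic. The stage sizes below are invented for illustration: subdividing the slowest stage moves the bottleneck to the next-slowest stage, which raises throughput.

```python
MIPS = 10   # instructions per microsecond, as in the earlier example

# A five-stage pipeline whose slowest stage takes 40 instructions.
stages = [20, 25, 40, 30, 15]
base_rate = 1e6 / (max(stages) / MIPS)        # items per second

# Superpipelining: subdivide the 40-instruction stage into four
# substages of 10 instructions each. The bottleneck is now the
# 30-instruction stage.
superstages = [20, 25, 10, 10, 10, 10, 30, 15]
super_rate = 1e6 / (max(superstages) / MIPS)

print(base_rate, super_rate)   # 250000.0 333333.3333333333
```

Note that the gain stops once some other stage becomes the bottleneck; subdividing further helps only if the new slowest stage can also be split.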

12. Summary

Pipelining is a broad, fundamental concept that is used with both hardware and software. A software pipeline, which arranges a set of programs in a series with data passing through them, can be used on hardware that does not provide pipelining.

A hardware pipeline is either classified as an instruction pipeline, which is used inside a processor to handle machine instructions, or a data pipeline, in which arbitrary data is transferred through the pipeline. The superpipeline technique, in which a stage of a pipeline is further subdivided into partial stages, is often used with an instruction pipeline.

A data pipeline does not decrease the overall time required to process a single data item. However, using a pipeline does increase the overall throughput (items processed per second). The stage of a pipeline that requires the most time to process an item limits the throughput of the pipeline.

EXERCISES

1. A scientist uses a cluster of PCs and arranges to have software on each processor perform one step of a computation. The processor reads up to 1 MB of data (whatever is available), processes the data, and then passes its output to the next processor over a 32-bit bus. What characteristics from FIG. 2 does the arrangement have?

2. Your team has been given the task of moving a video processing program from an old single-core processor to a new quad-core processor that has a high-speed interconnect among cores. Conventional parallel approaches will not work because the frames of video must be processed in order. What technique can you suggest that offers a possible way to use the new hardware to improve performance?

3. An engineer builds a data pipeline with eight processors. To measure performance, the engineer runs the software on one processor and measures the time taken to process a single data item. The engineer then divides the software into eight stages, and measures the time taken to process a single data item. What do the measurements show?

4. Most data pipeline hardware is devoted to specialized tasks (e.g., graphics processing). Would installing a data pipeline in all computers increase the performance of all programs? Why or why not?

5. A manager notices that the company has a few idle computers in each of ten data centers. The data centers are spread across the country, with low-speed Internet connections used to communicate among the data centers. The manager proposes that rather than using a computer in the local data center, a "giant data pipeline" be set up across all ten data centers to increase performance. What do you tell the manager about the idea?

6. You are given a program that runs on one core, and are asked to divide the program into pieces that will use up to eight cores in a data pipeline. You can divide the program two ways. In one, the cores each perform 680, 2000, 1300, 1400, 800, 1900, 1200, and 200 instructions. In the other, the cores perform 680, 1400, 1300, 1400, 1400, 1000, 1200, and 1100 instructions. Which division do you choose, and why?

7. What is the maximum throughput of a homogeneous pipeline in which four processors each handle one million instructions per second and processing a data item requires 50, 60, 40, and 30 instructions, respectively? Assume a constant execution time for all types of instructions.

8. In the previous exercise, what is the relative gain in throughput compared to an architecture without pipelining? What is the maximum speedup?

9. Extend the previous exercise by considering heterogeneous processors that have speeds of 1.0, 1.2, 0.9, and 1.0 million instructions per second, respectively.

10. If you are asked to apply superpipelining to subdivide one of the stages of an existing pipeline, which stage should you choose? Why?


Updated: Thursday, April 27, 2017 8:39 PST