Computer Architecture / Organization -- Processor organization [part 2]




4. Register window + instruction cache machine

4.1 Exploitation of locality

Both stack and accumulator machines suffer from the same handicap: the need to access memory on each variable reference. This is particularly severe when evaluating an expression. The inclusion of a file of processor registers, directly controlled by the control unit, may greatly reduce the problem by housing all variables local to a procedure while it runs. Registers may be used to contain any type of data for any purpose and hence are collectively referred to as a general purpose register file.

The efficiency derived from a register file is a result of structured programming. Research has shown that structured programming gives rise to temporal locality, i.e. the processor tends to access the same group of variables within a given time window. Spatial locality is the tendency to reference variables situated close together in memory. The exploitation of locality is discussed in Section 3.

When a procedure is entered, all local variables are initialized within the register file. If the register file is not large enough, only those most frequently referenced are housed there. So that the calling procedure does not have its local variables corrupted, the new procedure must, as its very first act, save the register file at a known location. Parameters may be passed and values returned in the register file, but great care must be taken: a procedure implementation must pay regard to register usage. As a result, conventions exist for some compilers and architectures, for instance the use of r0 for a "C" function return value.

Register files are usually combined with a stack architecture. The stack now provides two facilities…

• Temporary storage for special registers (e.g. PC, PSR)

• Temporary storage for general purpose registers

Considerable performance improvement may be gained with a register file and stack if the compiler makes effective use of the registers. This means always using them for expression evaluation, and as much as possible for local variables.

The need for a compiler to be portable across a wide variety of machines often results in less than optimal exploitation of the registers. Unfortunately, the memory references required on every procedure call also remain, and these are due to the architecture and not the problem. We now move on to discuss the removal of this inefficiency.

4.2 Software partitions

The advantages of a stack machine over an accumulator machine are attractive, but the penalty can be reduced performance. Each procedure invocation requires parameters, a return address and a frame pointer to be written to the stack. The latter two items must later also be read on exit and return. All this implies memory access over a bus which can only transmit one item at a time. There is a way to avoid many of these references and to greatly speed up those remaining.

A large register file is provided within the processor. A register window is notionally laid over the register file so that only those visible through the window are accessible. The window now takes the place of the stack in implementing local variables, parameters and return value as depicted in FIG. 8.

Accessing data in processor registers is very much faster than accessing memory locations because each one is accessed directly without the need for address decoding and bus protocol. The problem with registers is one of interconnection.

Bussed memory requires only one extra signal connection each time the number of locations doubles! Registers require another signal pair (at least) for each and every one added. For this reason a large register file is costly, particularly in a 2-d technology where overlapping connections are not allowed: a large area of silicon "real estate" is given over to interconnect that could just as easily be used for a large bussed memory or even another processor!


FIG. 8: Contents of a register window

A compromise is to break the register file up into equal-sized windows which overlap to allow parameter passing. Only one window is active at a time; its address is held in a dedicated register, the current window pointer (CWP). For example, we might have eight windows of twenty-two registers.

Each window may overlap its successor or predecessor by six registers. Hence the window advances across sixteen registers on each procedure call. In this example the register file thus consists of…

• 128 physical registers

• 8 windows each of 22 logical registers

…and requires a CWP of just three bits; the window mapping is sketched below. [ This example is based on the Berkeley RISC I processor. ]
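To make the window arithmetic concrete, here is a minimal sketch in C of the mapping from a window-relative register number to a physical register, assuming the figures above (eight windows, a sixteen-register advance, 128 physical registers arranged circularly). The function name is illustrative only.

    #include <assert.h>

    #define NUM_PHYSICAL   128   /* total physical registers               */
    #define WINDOW_SIZE    22    /* logical registers visible per window   */
    #define WINDOW_ADVANCE 16    /* distance the window moves on each call */

    /* Map a window-relative register number (0..21) to a physical
       register, given the current window pointer (0..7).  The register
       file is circular, hence the arithmetic modulo 128.                */
    static int physical_register(int cwp, int window_reg)
    {
        assert(cwp >= 0 && cwp < NUM_PHYSICAL / WINDOW_ADVANCE);
        assert(window_reg >= 0 && window_reg < WINDOW_SIZE);
        return (cwp * WINDOW_ADVANCE + window_reg) % NUM_PHYSICAL;
    }

Note that registers 16 to 21 of window n map onto the same physical registers as registers 0 to 5 of window n+1; this six-register overlap is what lets parameters pass from caller to callee without being copied.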

Instructions only ever refer to one of twenty-two registers, hence the control unit need only send control signals for these. No internal bus collision will occur because the current window pointer is decoded so as to allow only one group of registers to respond to the signals. Inspection of FIG. 9 should make clear that each register need only be connected to just two pairs of control signals.

Each physical register is visible within only two possible windows. The decoded CWP provides the select signal which chooses which pair of control signals each register must obey.


FIG. 9: Connectivity between register windows


FIG. 10: Logical register file supporting global, local and parameter variables

4.3 Data referencing

Global data, although undesirable since it allows side effects to occur, is still perceived by many as essential for an architecture to support. A register window machine deals with this in an elegant fashion; FIG. 10 shows how it is achieved. A partition at one extreme of the logical register file remains permanently bound to the same set of physical registers and holds global variables visible to all procedures. Given ten global registers, the total size of the logical register file in our example will be thirty-two.

There need only be a single principal addressing mode for a register window machine! Absolute, register relative and indexed register relative access to main memory may be synthesized quite simply. The principal mode uses two arguments to an instruction. The first specifies a register in the logical register file. (We assume that the PC may be referenced simply as an extension to the logical register file.) The second argument contains more bits than the first, including one which indicates its interpretation, which may be either a logical register number (like the first) or an immediate operand. The effective address of an operand in memory is thus given by…

    effective address = (contents of reg) + (contents of arg2, or arg2 itself if immediate)

There follows a summary of possible synthesized addressing modes (sketched in C after the list)...

• Absolute: reg is any register containing the value zero, arg2 is an immediate operand specifying the required absolute address

• Register relative: reg contains the address of the operand, arg2 is either any register containing the value zero or a zero immediate value

• Register relative+index: reg is once again a vector, arg2 is a register containing the required index

• PC relative: reg is the logical register number of the PC, arg2 is an immediate operand whose value is the required offset (used solely for conditional branching)
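As an illustration of how the synthesis works, here is a minimal C sketch of the principal mode. The flag bit in arg2 selecting between register and immediate interpretation, and all names used, are invented for the example and follow no particular real instruction encoding.

    #include <stdint.h>

    #define NUM_LOGICAL 32

    static uint32_t reg_file[NUM_LOGICAL];   /* logical register file */

    /* The second argument carries a flag indicating whether its value
       field names a register or is an immediate operand.             */
    struct arg2 {
        int      is_immediate;
        uint32_t value;          /* register number or immediate value */
    };

    /* Effective address = contents of reg, plus either the contents of
       a second register or an immediate value.                        */
    static uint32_t effective_address(int reg, struct arg2 a2)
    {
        uint32_t base  = reg_file[reg];
        uint32_t index = a2.is_immediate ? a2.value : reg_file[a2.value];
        return base + index;
    }

    /* Absolute:          reg holds zero, a2 = { 1, address }          */
    /* Register relative: reg holds the address, a2 = { 1, 0 }         */
    /* Indexed:           reg holds the base, a2 = { 0, index_reg }    */
    /* PC relative:       reg = PC, a2 = { 1, offset }                 */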


FIG. 11: An asynchronous pipeline

Arguments may be used to derive either one or two operands, depending on the effect of the operation encoded within the opcode. The single-operand effect is restricted to load/store instructions, which transfer any data to be processed within a procedure into local register variables. In this way memory references are kept down to the absolute minimum necessary. Logical and arithmetic operators require two arguments.

4.4 Parallel instruction execution

In Section 7 the process performed by the processor control unit was described in pseudocode by…

REPEAT
  fetch instruction
  execute instruction
UNTIL FALSE

There is an assumption here that the two components of the loop body must be executed sequentially, when in fact such is not the case. The sub-processors which perform each sub-process are in fact independent except for a single asynchronous channel of communication…the instruction register. In machines where the above is a true description of control unit behaviour, each sub-processor is idle while the other does its job. Borrowing some notation from the Occam 2 programming language [Inmos 88], we can describe the transformation we require thus…

SEQ
  fetch instruction
  execute instruction

…is transformed into…

PAR
  fetch instruction
  execute instruction

Things are not quite this simple, however. The processor responsible for the fetch process is the bus control unit (BCU), which is also required to perform other tasks subsidiary to the execute process. This leaves the possibility that execution might be delayed until a fetch terminates. Also, a fetch typically takes less time than an execute, which again leads to valuable machinery lying idle.

The answer is to have a queue of instructions, called an instruction cache, which is topped up whenever the bus is not in use for instruction execution. Provided the queue is large enough, a fresh instruction is always present, waiting at the front of the queue, whenever a new execute process is ready to start. The use of a queue data object in this fashion is called buffering and is vital for efficiency in asynchronous communication. The idea of pipelining is shown in FIG. 11 and is precisely that employed on a production line such as is found in a car factory. Parallel execution of fetch and execute is but a simple example.
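As a sketch of the buffering idea, the instruction queue can be modelled in C as a small ring buffer, topped up by the fetch side and drained by the execute side. The queue depth and the fetch_word callback are assumptions for illustration; a real BCU does all of this in hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_SIZE 4            /* assumed queue depth */

    static uint32_t queue[QUEUE_SIZE];
    static int head, tail, count;   /* front, back and occupancy */

    static bool queue_full(void)  { return count == QUEUE_SIZE; }
    static bool queue_empty(void) { return count == 0; }

    /* Fetch side: top the queue up while the bus is otherwise idle. */
    static void prefetch(uint32_t (*fetch_word)(void))
    {
        while (!queue_full()) {
            queue[tail] = fetch_word();
            tail = (tail + 1) % QUEUE_SIZE;
            count++;
        }
    }

    /* Execute side: a fresh instruction always waits at the front. */
    static uint32_t next_instruction(void)
    {
        uint32_t insn = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;
        return insn;
    }

    /* A taken conditional branch or an interrupt invalidates the lot. */
    static void flush_queue(void) { head = tail = count = 0; }

The flush_queue operation corresponds to the discard-and-replace events described next.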

One other problem remains. The result of the execution of a conditional branch instruction may change the identity of the following instruction, depending on processor state. The contents of the entire instruction cache must then be discarded and replaced. However, a good standard of structured programming should make this comparatively rare: of the order of only 1% of instructions are conditional branches. Furthermore, the condition will fail on average only 50% of occasions, leaving the contents of the instruction cache still valid.

An interrupt will also invalidate the contents of the instruction cache. Again interrupt events are comparatively rare.

Further pipeline parallelism may be obtained by breaking the execute process down into micro-processes. However, if this is pursued too far, the overhead incurred in buffer transactions and the growth in control unit complexity can defeat the performance gained.

Parallel operation of different functional units is obviously desirable, to avoid machinery lying idle unnecessarily. If a number of instructions present in the instruction cache each require to operate different functional units (e.g. multiplier, ALU) on independent operands, there is no reason why they should not. Ensuring that the operands concerned are indeed independent is not always easy, however.

Multiple identical functional units become useful once an instruction cache is included in the design. Typically this means multiple ALU architectures. A significant increase in instruction throughput may be gained in many applications at the cost of a reasonable increase in control unit complexity.


FIG. 12: Instruction cache+register window machine organization

4.5 Summary

The organization of the instruction cache+register window machine is shown in FIG. 12. Note that only two special registers are required, PC and CWP. The most significant performance advantage is that very much less external memory access is required upon…

• Subroutine invocation

• Expression evaluation

This fact gives rise to an observed performance increase with very little added complexity compared to an accumulator or stack machine.

The programmer's architecture is shown in FIG. 13 and is clearly very simple both to understand and to generate code for.

[ Remember that the compiler code generator author need not be concerned with how the windowing operates and need only consider the logical register file. ]

There is no technical reason for associating an instruction cache with register windowing in processor design. They are quite independent ideas and are brought together here merely to reduce the number of designs illustrated. However, they do complement each other very well since…

• Instruction pipelining seeks to optimize instruction throughput

• Register windowing seeks to optimize data throughput


FIG. 13: Programmer's architecture of cache+window machine

Pipelining is a concept of extreme importance to the field of parallel processing.

It is interesting to carry the idea onwards towards its logical conclusion where each instruction has its own processor. Much more promising is the data-led approach where each variable has its own processor! Further comments would be outside the scope of this text.

Another idea which suggests itself given this design is that of establishing memory as purely for code. The idea of separate memory for data and code predates that of a single memory shared by both. Machines with separate memories for each are referred to as Harvard architectures. The compiler would obviously have to attempt to ensure that sufficient procedure activation depth exists. Recursion would be risky, but then again it is anyway given a stack machine. The implementation technology would have to permit a sufficiently large register file. This requires reducing the cost of interconnect implicit with a 2-d technology.

The origin of many of these ideas is (historically at least) associated with the drive for the RISC [ Reduced Instruction Set Computer ] architecture. A good survey of RISCs is to be found in [Tabak 87].

5. Queue + channel machine

5.1 Process scheduling

Process networks

This machine is designed to fully support the software model of a process network which communicates, internally and externally, by passing messages over channels. Processes may be hierarchically reduced until the structure of each process is purely sequential and composed of the following primitives…

• Assignment

• Input

• Output

Only synchronous communication is possible between processes, which suspend while awaiting rendezvous. For any processor network to be efficient, every node must itself be multiprocessing; otherwise it would lie idle while its single running process is suspended.

Scheduling algorithms

The support required for multiprocessing is some kind of scheduling algorithm.

The two simplest are…

• First In First Out (FIFO)

• Round Robin

The FIFO algorithm simply stores ready processes in a ready queue. A process is served (or despatched) to the processor and then runs until it terminates. Round robin also employs a ready queue, but each process is timed-out after one timeslice if it has not already terminated or suspended. The magnitude of the timeslice depends upon the overhead of context switching between processes and may be either fixed or variable. If variable, it might be made proportional to the priority of the process, though that would not be easy to implement.

Scheduling support

The minimum support necessary for process scheduling is simply a front pointer and a back pointer for a ready queue. A process is represented in the queue simply by its workspace pointer (WS). When the process is suspended, its PC value is located at a fixed offset from the start of workspace. Context switching between processes is rendered extremely simple by applying the constraint that, on time-out, it is postponed until completion of any expression evaluation already under way. This means that no registers need be saved except the PC. The WS value is simply placed in the ready queue when timed-out, or at the rendezvous location when suspended (see below).
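A minimal sketch in C of this scheduling state, assuming a workspace is an array of words in which slot 0 holds the saved PC and slot 1 is reserved to chain the ready queue; both offsets are invented for the example.

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word;

    #define PC_SLOT   0   /* fixed offset of the saved PC in workspace */
    #define LINK_SLOT 1   /* slot used here to chain queued workspaces */

    static word *front, *back;   /* the two ready-queue pointer registers */

    /* Enqueue a process, represented purely by its workspace pointer. */
    static void schedule(word *ws)
    {
        ws[LINK_SLOT] = 0;
        if (front == NULL)
            front = ws;
        else
            back[LINK_SLOT] = (word)ws;
        back = ws;
    }

    /* Despatch the front process: only its PC need be restored, since a
       context switch never interrupts an expression evaluation.        */
    static word *despatch(word *pc_out)
    {
        word *ws = front;
        front = (word *)ws[LINK_SLOT];
        *pc_out = ws[PC_SLOT];
        return ws;
    }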

We shall assume processes to be static objects which must all be declared prior to compilation, so that the compiler can allocate workspace for each process. There is no mechanism to allocate workspace for dynamic processes (created at run-time).

5.2 Message passing

Channels

Channels which venture to processes running on external processors are referred to as hard channels. They clearly require special hardware mechanisms. The mechanism implementing the transmission and reception of data over a hard channel may be built around a bidirectional shift register (see Section 6). Transmission must be inhibited until the receiving processor is ready; a subsidiary channel carrying a signal protocol must be provided alongside the data channel to effect this. Input and output instructions are unaffected by whether the partner process is running on the same or another processor.

Soft channels are those which serve to connect processes sharing a processor. They are implemented in a fashion which, at the instruction level, appears to the processor identical with hard channels. Both employ a rendezvous, and they differ only in the rendezvous location and the mechanism by which data is transferred.

[ The communicating sequential process model of a system is described in Section 1. Detailed accounts may be found in [Hoare 78] and [Hoare 85]. ]

FIG. 14: Implementation of rendezvous using single memory location

Rendezvous

Rendezvous requires a rendezvous location agreed a priori by both processes partaking in the transaction. This may simply be a memory location which is initialized with a value denoting empty (FIG. 14). The first process to arrive at the rendezvous, finding it empty, leaves behind its identifier (WS value) and suspends itself by simply quitting the processor, causing the despatch of the next ready process from the queue. When the second process arrives it completes the transaction by direct access to the workspace of the first, which it subsequently reschedules (enqueues). Note that it is quite irrelevant whether the sender or the receiver arrives first: the second process to arrive will either read or write the workspace of the first, depending on which it is.

[The two taken together effect what is known as a Busy/Ready protocol. ]
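The rendezvous of FIG. 14 might be sketched in C as follows. The channel word holds either an empty marker or the workspace pointer of the first arrival; suspend and schedule are the scheduler operations of section 5.1, and the buffer/length workspace fields are invented for the example.

    #include <stddef.h>
    #include <string.h>

    #define CHAN_EMPTY NULL

    typedef struct workspace workspace;   /* per-process workspace */
    struct workspace {
        void  *buffer;    /* where this process's message data lives */
        size_t length;
    };

    /* Assumed scheduler hooks, as sketched in section 5.1. */
    extern void suspend(workspace *ws);    /* quit the processor */
    extern void schedule(workspace *ws);   /* re-enqueue         */

    /* Called by whichever of sender/receiver reaches the channel. */
    static void rendezvous(workspace **channel, workspace *self,
                           int i_am_sender)
    {
        if (*channel == CHAN_EMPTY) {
            *channel = self;          /* first to arrive: leave our  */
            suspend(self);            /* identifier and give up the  */
        } else {                      /* processor                   */
            workspace *partner = *channel;
            *channel = CHAN_EMPTY;
            if (i_am_sender)          /* second to arrive: copy data */
                memcpy(partner->buffer, self->buffer, self->length);
            else
                memcpy(self->buffer, partner->buffer, partner->length);
            schedule(partner);        /* reschedule the first arrival */
        }
    }

Note how the same routine serves both directions: the second arrival decides whether to read or write the first's workspace according to its own role.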

5.3 Alternation

Alternation is natural and simple to implement on the queue/channel machine. Concurrent processes may be programmed, each to look after communication with a single environment process. Any which are not currently communicating will be suspended. Any which are will be scheduled to run or be running.

For very low latency in real-time systems, a high priority process may preempt the one running in response to a signal on a special hard channel (which supports signals only).

FIG. 15: Process queue+message channel machine organization

5.4 Summary

The design of a processor to support both local multiprocessing and connection within a processor network need not be complicated. All that is needed to support static nonprioritized processes is a pair of queue pointer registers and the necessary hardware for transmitting and receiving data on each hard channel.

Processes are represented in a ready queue by their workspace pointer.

Workspace is private to each process and includes a record of the program counter value when the process is not running. Context switching is kept simple and fast by obviating the need to save any registers except the PC. This is achieved by denying the possibility of suspending or timing-out any process while an expression evaluation is under way. FIG. 15 shows the organization of the processor and FIG. 16 the programmer's architecture. Prioritized processes may be supported by employing a round robin algorithm with timeslice proportional to priority. An alternative, which is much simpler to implement, is to employ a separate ready queue for each distinct priority and to allow any process to pre-empt one of lower priority, as sketched below.
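A minimal sketch of that alternative in C, reusing the chained-workspace queue representation assumed in the section 5.1 sketch; the two-level priority count is itself an assumption for the example.

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word;

    #define LINK_SLOT 1        /* chain slot, as in the earlier sketch */
    #define NUM_PRIORITIES 2   /* index 0 = highest priority           */

    typedef struct { word *front, *back; } ready_queue;

    static ready_queue ready[NUM_PRIORITIES];   /* one queue per level */

    static word *take_front(ready_queue *q)
    {
        word *ws = q->front;
        q->front = (word *)ws[LINK_SLOT];
        return ws;
    }

    /* Despatch from the highest-priority non-empty queue.  A process
       becoming ready at a higher level than the one currently running
       is the trigger for pre-emption.                                 */
    static word *despatch_prioritized(void)
    {
        for (int p = 0; p < NUM_PRIORITIES; p++)
            if (ready[p].front != NULL)
                return take_front(&ready[p]);
        return NULL;                             /* processor idle */
    }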


FIG. 16: Programmer's architecture of queue+channel machine

Soft channels and hard channels are unified by employing an identical rendezvous mechanism. They differ purely in the memory locations used as rendezvous locations (a memory location is reserved for each hard channel) and in the data transfer mechanism. Processes awaiting rendezvous are suspended, with little processing overhead, and are rescheduled later by the other process partaking in the transaction. It is irrelevant whether sender or receiver is first to arrive at the rendezvous location.

This design is ideally suited to real-time systems because it enables easy implementation of alternation. However many such applications would require prioritization and preemption in process scheduling.

Processes provide a problem-level abstraction, software-level modularity and machine-level reusability.

Exercises

Question one

i. As described in the text, the implementation of a function invocation by a compiler on a pure stack machine implies the creation of a suitable stack structure containing…

• Local variables

• Parameters

• Return value

Explain how the compiler initializes…

• Value parameters

• Reference parameters

ii. Draw a diagram showing the stack structure created to implement an invocation of the following procedure and function, defined using the Modula-2 programming language…

PROCEDURE HexStringToCard(String: ARRAY OF CHAR;
                          VAR Value: CARDINAL;
                          VAR Error: BOOLEAN);
VAR Index: CARDINAL;
BEGIN … END HexStringToCard;

PROCEDURE Random(Seed: CARDINAL; VAR NextSeed: CARDINAL): REAL;
BEGIN … END Random;

Note: A function in Modula-2 is referred to as a function procedure and is still declared as a procedure, but with the addition of a return type. Assume that CARDINAL and REAL variables are of length four bytes and that BOOLEAN and CHAR variables are each represented within a single byte. ARRAY OF CHAR is called an open array parameter whose size is determined on invocation at run time. The stack pointer points to the topmost item and not to the first free byte.

Question two

Many architectures build their stacks downward in memory. Use your solution to question one to show in each case how each parameter and each variable would be referenced using only the frame pointer.

Note: Use the notation "<offset>(fp)". Also assume that any address is three bytes long and that the least significant byte of anything is always found at the lowest address.

Question three

Consider three processors…

• Zero-address stack machine

• One-address accumulator machine

• Two-address register file machine

…with simplified instruction sets, respectively…

• {push pop mul div add sub}

• {load store mul div add sub}

• {load store mul div add sub}

For all three machines assume a basic instruction length of one byte and an address length of four bytes. An immediate operand may be encoded within one, two or four bytes as required using twos-complement integer representation.

Stack machine

push and pop alone take a single address locating the operand to be moved. The arithmetic instructions all operate on the topmost two items on the stack, leaving the result on top. Where ordering of operands is important, the top item is equivalent to that on the left of a written expression. Assume all variables are located as byte offsets from either the frame pointer or the static base register; hence both move instructions will be just two bytes long (one byte for the instruction plus one byte for the offset).

Accumulator machine

One operand is always the accumulator, corresponding to that on the left of a written expression. The result is always placed in the accumulator.

Register file machine

The second address identifies both the right-hand operand and the destination of the result of an arithmetic operation. In the case of a move instruction, the first and second address identify source and destination respectively. Assume sixteen registers allowing up to two to be specified within a single byte, additional to the basic instruction and an address or immediate operand (if required). Note that there is no register windowing, so all variables are bound to memory and not to registers.

i The execution time of an instruction is almost completely determined by the number of bus cycles required. Why does it take much longer to perform a (read or write) bus cycle than any other processor operation? Need it always be true?

ii Assuming a unit of time equal to that taken to perform a bus cycle, enumerate the time taken to execute each instruction of each processor described above.

iii Summarize the advantages and disadvantages of the stack machine in the light of your answers to parts i and ii of this question. How might the worst disadvantage be alleviated?

Question four

i For each of the processors described in question three compose the code which a compiler might produce to implement the following assignment…

…given that the order of operations will obey the following rules…

• Parenthesized expressions evaluated first (from inner-most out)

• Expression evaluated from right to left

• Operators evaluated according to precedence order (*, /, +, -)

ii Discuss your solution to i and give a quantitative comparison between architectures with regard to…

• Number of instructions required

• Execution time

• Code length

Quantify the improvement in the execution time on the stack machine yielded by arranging that the topmost three stack items are always to be found in special dedicated processor registers.

Note: You may assume that a > 0 and that single precision arithmetic is sufficient (i.e. forget about carry).

Question five

i Enumerate all bus operations required for a stack machine to invoke, enter, exit and return from a function. Assume that three parameters are passed and that all parameters and return value each require just a single bus operation to read or write.

ii Compare your result with the operations required to achieve the same thing on a register windowing machine. What are the limitations of register windowing?

Question six

What is the difference between a procedure and a process and what (if any) are the similarities? Include a brief comparison between procedure oriented and process oriented software design.
