Computer Architecture / Organization -- Processor organization [part 1]





1. Requirements

1.1 General requirements

There are no hard and fast rules applying to processor organization. In this Section four distinct designs are considered. Commercially available designs are either hybrids, employing a mixture of the ideas presented here, or combine more than one philosophy, allowing the system designer freedom of choice over which to exploit. This is illustrated in Section 10 with a discussion of three commercially available designs.

Two requirements dominate long-term design aims: efficient execution and efficient development.

Efficient execution requires the minimization of the time taken to translate each element of a software source module into micro-operations. Virtually all software is now written in a high-level language. It is important to understand that the architecture must therefore be optimized for the compiler code generator.

The fact that an alternative architecture may permit more rapid execution of a process, given machine level (assembly language) programming, is completely immaterial.

Efficient development requires architectural support for software partitions such as procedures and modules. Software partitions should correspond to problem partitions in the most direct fashion possible. Separate development of each partition requires separate compilation which in turn requires support for code and data referencing across partition boundaries. With the procedural programming model this implies the ability for a procedure in one module to reference variables and procedures within another, ideally without the need to recompile or re-link more than just the one (referencing) module.

1.2 Throughput

Instructions

The purpose of a processor is to translate machine language instructions into micro-operations. The speed of execution depends on the average time taken to translate each instruction. The instruction throughput [1] is simply the mean number of instructions executed per unit time.


FIG. 1: Simplest model of processor function

A number of methods are known which improve instruction throughput and they are often the first thing of which a designer or manufacturer will boast.

Prominent among them currently are the instruction cache and pipelining, both of which are discussed below. Caution is required, however, since some instruction sets achieve more per instruction than others at the problem level!

Data

Like the Turing Machine, a contemporary computer employs separate processor and memory. The two are connected by some kind of communication channel.

Unlike the Turing Machine processor, which can only reference a single location and move to its successor or predecessor, a contemporary processor has random access to its memory [2]. It may reference any element in memory, whenever it chooses.

[1. …usually stated in MIPS, Millions of Instructions Per Second.

2. The memory is still linear, like that of the Turing Machine.]


FIG. 2: Processor to memory communication

The limit on the instruction throughput of a processor is largely determined by its data throughput, which is the number of memory references per unit time that the processor-to-memory communication channel is capable of supporting. This is referred to as its bandwidth [3]. The channel is typically a bus, so an important factor will be the bus width, equal to the number of bits transferred per bus cycle.
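To make the consequence concrete, here is a minimal sketch in C; the bus width, clock rate and instruction size are invented figures, not those of any particular machine.

#include <stdio.h>

int main(void) {
    /* Hypothetical bus: 32 bits wide, one transfer per bus cycle. */
    const double bus_width_bytes = 4.0;       /* 32-bit bus        */
    const double bus_cycles_per_sec = 10e6;   /* 10 MHz bus clock  */

    /* Peak data throughput of the processor-to-memory channel.    */
    double bytes_per_sec = bus_width_bytes * bus_cycles_per_sec;

    /* If every instruction occupies 4 bytes and makes no data
       reference, the channel alone caps instruction throughput:   */
    double max_mips = bytes_per_sec / 4.0 / 1e6;

    printf("peak channel throughput: %.0f MB/s\n", bytes_per_sec / 1e6);
    printf("upper bound: %.0f MIPS\n", max_mips);
    return 0;
}

Any data reference made by an instruction competes with instruction fetches for the same channel, so the real figure is lower still.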

Most contemporary computers employ a shared memory for both instructions and data. This philosophy is credited to J. von Neumann [4].

The bandwidth of the processor-to-memory channel is thus important because, more than any other single factor in contemporary computer design, it determines the limit of both instruction and data throughput. It is commonly referred to as the von Neumann bottleneck.

1.3 Real-time systems

In addition to these universal requirements comes the question of the design's applicability.

It is now becoming cost-effective to design processors which are optimized for a specific application. Arguably the most important growth is in real-time systems, which are capable of synchronous communication with their environment. Such systems require event-driven software: an event occurring external to the system immediately causes appropriate processing to generate the required response. The delay between the event and execution of the first relevant instruction is known as the latency and must be minimized. Such a processor must also possess the capability of alternation between distinct environment processes. As a result it must itself be capable of scheduling multiple processes [5].

[3. In its strict sense the term refers to the difference between the lowest and highest frequency of sinusoidal analogue signal which may be transmitted over the channel.

4. …who is also credited with the very idea of the stored program. Previously, computers had to be physically altered in order to change their program. Those machines which employ separate memories for instructions and data are said to possess a Harvard architecture.

5. See discussion in Section 1.]

Applications of real-time systems range from aircraft flight control to robotics.

The principles are also highly applicable to the human/computer interface process of a computer work-station.

2. Accumulator machine

2.1 Programming constructs

The most basic construct, the sequence, is implemented using a counter called the program counter (PC). It contains the address of the next instruction to be fetched from memory and then executed. It is incremented after each instruction fetch micro-operation.

Useful computations require selection and iteration constructs, both of which may be implemented by means of conditional branch instructions. Branching [6] occurs if a specified condition flag is set [7] in the processor state register (PSR).

Processor state depends on the results of previously executed instructions. For example, if the result of a preceding addition instruction was zero, a flag (usually denoted by Z) will be set. The arithmetic logic unit (ALU) will automatically have caused it to be so.
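A minimal sketch in C of this fetch-execute behaviour; the three-instruction set, its encoding and the countdown program are invented for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* Invented three-instruction set, for illustration only. */
enum { OP_DEC, OP_BNZ, OP_HALT };

typedef struct { int op; int arg; } Instr;

int main(void) {
    /* Count down from 3: decrement, branch back while Z is clear. */
    Instr mem[] = { {OP_DEC, 0}, {OP_BNZ, -2}, {OP_HALT, 0} };
    int pc = 0, acc = 3;
    bool z = false;                  /* Z flag in the PSR            */

    for (;;) {
        Instr i = mem[pc++];         /* fetch, then increment the PC */
        switch (i.op) {
        case OP_DEC:
            acc -= 1;
            z = (acc == 0);          /* the ALU sets Z automatically */
            break;
        case OP_BNZ:                 /* conditional branch: a signed */
            if (!z) pc += i.arg;     /* offset is added to the PC    */
            break;
        case OP_HALT:
            printf("halted, acc=%d\n", acc);
            return 0;
        }
    }
}

Sequence is the default increment of pc, iteration is the backward branch taken while Z is clear, and selection is the test of the flag itself.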

The organization of the accumulator machine is shown in FIG. 3 and its programmer's architecture in FIG. 4. Note that there is no obvious means to reuse software partitions of any kind.

2.2 Data referencing

The accumulator is so called because it is said to accumulate the value of the computed function. Its contents repeatedly form one argument of a function computed by the ALU. The result of each ALU operation is communicated back into the accumulator. For example, summation of an array of integers may be performed as a sequence of binary addition operations each of which takes the result of the last as one operand.

The X register gets its name from "eXtension" or "indeX" and may serve as either. For example, multiplication of two integers requires a result register twice as large as each operand and will use X as an extension of the accumulator. X also renders multiple-precision arithmetic more efficient by acting as a temporary store, reducing the number of memory references required. It would also be used as the array and loop index in the summation example mentioned above: an instruction to decrement X would be employed in the loop body, and the loop would terminate with a conditional branch instruction which interrogates the Z flag in the PSR.
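The summation example may be sketched in C, with ordinary variables standing in for ACC and X; the mapping is illustrative, not a real instruction listing.

#include <stdio.h>

int main(void) {
    int a[] = {5, 8, 2, 7};          /* array in memory              */
    int acc = 0;                     /* the accumulator              */
    int x = 4;                       /* X: array and loop index      */

    /* Loop body: indexed addressing (a[x-1]) supplies one operand;
       the accumulator supplies the other and receives the result.   */
    do {
        acc += a[x - 1];             /* add a,X  (indexed mode)      */
        x -= 1;                      /* decrement X, which sets Z    */
    } while (x != 0);                /* branch back while Z is clear */

    printf("sum = %d\n", acc);       /* 22 */
    return 0;
}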


FIG. 3: Accumulator machine organization


FIG. 4: Programmer's architecture of accumulator machine

Accumulator machine instructions need take only a single operand. Typical addressing modes are…

• Immediate: Operand embedded within code immediately following instruction

• Absolute (direct): Operand to be found in memory location whose absolute address is embedded within code immediately following instruction

• Indexed: Operand to be found at effective address given by the sum of an absolute address (array base) and the contents of the X register (array index)

• Indirect: Effective address of operand found in memory at specified absolute address (vector), i.e. the address of the address of the operand is specified following the instruction

• Indirect indexed: Effective address of operand is the address fetched from a vector, to which is added the contents of the X register (i.e. indexing takes place after indirection)

• Indexed indirect: Effective address of operand is fetched from the vector located at the sum of the specified address and the contents of the X register (i.e. indexing takes place before indirection)

Immediate mode is used for referencing constants and absolute mode for scalar variables. The others support the access of elements within sequence-associated and pointer-associated structured variables.
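A sketch in C of how the effective address (EA) is formed under each mode, using a toy memory array and X register; the addresses in the example are arbitrary.

#include <stdint.h>
#include <stdio.h>

/* Toy memory and X register; addresses are simply array indices. */
uint16_t mem[256];
uint16_t X;

/* Effective address of the operand under each mode, following the
   definitions above; 'a' is the address embedded in the code.     */
uint16_t ea_absolute(uint16_t a)         { return a; }
uint16_t ea_indexed(uint16_t a)          { return a + X; }
uint16_t ea_indirect(uint16_t a)         { return mem[a]; }
uint16_t ea_indirect_indexed(uint16_t a) { return mem[a] + X; }  /* index after  */
uint16_t ea_indexed_indirect(uint16_t a) { return mem[a + X]; }  /* index before */

int main(void) {
    mem[10] = 100;  mem[12] = 200;  X = 2;
    printf("indirect indexed: %u\n", (unsigned)ea_indirect_indexed(10)); /* 100+2 = 102 */
    printf("indexed indirect: %u\n", (unsigned)ea_indexed_indirect(10)); /* mem[12] = 200 */
    return 0;
}

The two printed lines make the ordering visible: the first indirects through the vector at 10 and then indexes, the second indexes the vector address first and indirects through the vector at 12.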

[6. The addition of a signed integer to the PC.

7. …or clear as required.]

[8. At a lower level, clearing the instruction register of a microcoded control unit is satisfactory if the fetch instruction microprocess lies at the base of microcode memory. This causes the first microprocess executed to be the fetch of the first instruction of the boot code.]

2.3 Booting

By booting or bootstrapping we mean the correct initiation of the process we wish to run. This first of all requires that all registers possess a suitable state. In particular the PC must point to the first instruction of the boot code. Traditionally this is achieved by clearing all registers, which implies that the first instruction to be executed must exist at the memory location whose address is zero [8]. Similarly, judicious design can ensure that a satisfactory boot state is achieved with all flags in the PSR cleared to zero.
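A sketch of the reset in C; the register file mirrors FIG. 4, and the reset function itself is hypothetical.

#include <stdio.h>
#include <string.h>

/* Register file of the accumulator machine (names illustrative). */
struct regs { unsigned pc, acc, x, psr; };

/* Reset: clearing every register leaves PC = 0, so the first
   instruction executed is the one at memory address zero; the
   boot code must therefore be placed there.                      */
void reset(struct regs *r) { memset(r, 0, sizeof *r); }

int main(void) {
    struct regs r;
    reset(&r);
    printf("boot fetch from address %u\n", r.pc);   /* 0 */
    return 0;
}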

2.4 Summary

The accumulator machine described here has many similarities with the von Neumann machine [von Neumann 46]. It supports a complete set of procedural programming constructs via a program counter, processor state register and conditional branch instructions.

Indirection and indexing support the referencing of both sequence-associated and pointer-associated data. The referencing of constant and scalar variable data is supported via immediate and absolute addressing modes.

One-address instructions are adequate, the accumulator being used for both one operand and the result of an operator. The other operand must always be fetched from memory. Thus the bandwidth of the processor-to-memory communication channel is paramount in determining performance in most applications.

No support is provided whatsoever for software modularity at the machine level.

3. Stack machine

3.1 Software partitions

The stack machine directly supports a stack data object (see Section 2) through the inclusion of a stack pointer (SP) register to contain the address of the top-of-stack (TOS). It may be automatically incremented after each push and decremented after each pop instruction. In addition it should also be possible to reference TOS without actually removing the top item, i.e. without affecting SP.

A subroutine is simply a section of code which may be executed repeatedly when a program runs, avoiding the need for code repetition. When combined with a mechanism for parameter passing it becomes a procedure. With a further mechanism to return a value it implements a function [9]. Two special instructions facilitate subroutine invocation and return…

• bsr <offset>…branch to subroutine

• rts…return from subroutine

bsr causes the following microprocess to run…

(1) PC ← (PC + length(bsr))
(2) EAR ← SP
(3) DOR ← PC, SP(increment)
(4) PC ← (PC + <offset>), BCU(write)

Note that the semantic level of the RTL in use has now been raised by extending it to allow the right-hand side to be specified as an arithmetic sum, whose evaluation requires a number of unlisted micro-operations and hence takes more than one clock cycle. Depending on the total length of the bsr instruction, it may actually be quicker to use the increment function of the PC than to cycle it through the ALU. Some machines are equipped with a second ALU dedicated to effective address computation.

Step 1 adjusts the PC to point to the next instruction of the main routine. Steps 2 and 3 push the PC contents onto the stack (TOS ← PC). Step 4 adjusts the PC to point to the subroutine start; an immediate operand forms the offset from the current PC value to the subroutine start. rts simply pops the address of the instruction following the bsr off the stack and back into the program counter (PC ← TOS)…

(1) EAR ← SP
(2) DIR ← Mem, BCU(read), SP(decrement)
(3) PC ← DIR
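Both microprocesses may be mimicked in C, with an array standing in for memory and the stack growing upwards; the instruction length and addresses are invented, and note that the sketch decrements SP before the read on return, mirroring the push exactly.

#include <stdio.h>

unsigned mem[64];            /* memory holding the stack            */
unsigned pc, sp;             /* program counter and stack pointer   */

/* bsr: push the return address, then add the offset to the PC.
   LEN is the (illustrative) length of the bsr instruction itself. */
enum { LEN = 2 };
void bsr(int offset) {
    pc += LEN;               /* PC now points past the bsr          */
    mem[sp++] = pc;          /* push return address, SP increments  */
    pc += offset;            /* branch to the subroutine            */
}

/* rts: pop the saved address back into the PC. */
void rts(void) {
    pc = mem[--sp];
}

int main(void) {
    pc = 100; sp = 0;
    bsr(40);                 /* call subroutine at 100+2+40 = 142   */
    printf("in subroutine, pc=%u\n", pc);   /* 142 */
    rts();
    printf("returned,      pc=%u\n", pc);   /* 102 */
    return 0;
}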

3.2 Data referencing

It should be clear from the above that a stack provides direct support for subroutines. In order to implement procedures and functions the correct reference scoping for the following classes of data must be guaranteed…

• Global data

• Local data

• Parameters

• Function value

Note that all data are dynamic except global data. The stack frame is the mechanism for referencing all dynamic data. The frame pointer (FP) register contains the address of the first free location at the bottom of the frame, which contains all local variables and whose size the compiler decides by inspecting their number and type. Directly beneath the stack frame are found the old frame pointer value and return address, followed by parameters and possibly space reserved for a function value return. This structure is created by the compiler-generated code in implementing a procedure or function call and is shown in FIG. 5.

[9. The computation of a function is by means of a process! Take great care to distinguish between the mathematical definition of a function and its implementation as a subroutine which receives parameters and returns a value.]


FIG. 5: Stack format within a procedure or function

To summarize the compiler generated code for a procedure or function call…

1. Push space for return value (function only)

2. Push parameters

3. Branch to subroutine

…and for procedure/function entry…

1. Push old frame pointer

2. Copy stack pointer to frame pointer

3. Increment stack pointer by frame size

…and for procedure/function exit…

1. Copy frame pointer to stack pointer

2. Pop old frame pointer value to frame pointer register

3. Decrement stack pointer by parameter block size
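These call, entry and exit sequences may be sketched in C as follows; it is a toy model, and the return address pushed by bsr and popped by rts is omitted for brevity.

#include <stdio.h>

/* Toy stack machine state: the stack grows upwards through mem[]. */
unsigned mem[128];
unsigned sp, fp;

void push(unsigned v)  { mem[sp++] = v; }
unsigned pop(void)     { return mem[--sp]; }

/* Procedure entry: save the caller's frame, claim space for locals. */
void enter(unsigned frame_size) {
    push(fp);                /* 1. push old frame pointer            */
    fp = sp;                 /* 2. copy SP to FP                     */
    sp += frame_size;        /* 3. reserve the locals                */
}

/* Procedure exit: discard locals, restore the caller's frame,
   then drop the parameter block.                                    */
void leave(unsigned param_words) {
    sp = fp;                 /* 1. copy FP to SP                     */
    fp = pop();              /* 2. pop old FP back into FP           */
    sp -= param_words;       /* 3. discard the parameters            */
}

int main(void) {
    sp = 0; fp = 0;
    push(0);                 /* call: space for a return value       */
    push(7); push(3);        /* call: two parameters                 */
    /* (bsr would push the return address here)                      */
    enter(2);                /* entry: frame with two locals         */
    mem[fp] = 42;            /* a local, addressed FP-relative       */
    leave(2);                /* exit                                 */
    printf("return value slot at TOS: mem[%u]\n", sp - 1);  /* mem[0] */
    return 0;
}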

The function value returned (if any) will now be easily accessible at TOS. There follows a summary of addressing modes available with a stack machine…

• Immediate: Operand embedded in code immediately following instruction

• Register relative: Operand is at a specified offset from the address held in a register

• Register indirect: Vector to the operand is located at the effective address formed by the sum of an offset and the contents of a register

The registers available for register relative addressing are detailed in Table 1.

The static base register (SB) contains a pointer to an area of memory reserved for global access by all procedures of a software module.


Table 1: Stack machine registers and their application

A vector may be maintained as a reference into a structured data object whose elements are thus recovered or inserted by indirection. A little thought about this leads to the realization that a mechanism must be provided to generate the run-time effective address of one of the above modes to serve as a pointer to an object. No indexing is required since the vector may be adjusted as necessary to point to an element of a static, sequence-associated data structure (array or record). The mechanism takes the form of a special instruction which generates the effective address specified and places it at TOS. There follows the procedure to reference an element in an array, given a variable representing the array index…

1. Generate the array base address at TOS

2. Add this to array index, result to TOS

3. Reference with SP indirect addressing
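A sketch of these three steps in C, again with an array standing in for memory; the array address and its contents are arbitrary.

#include <stdio.h>

unsigned mem[64];
unsigned sp;

int main(void) {
    /* An array of three elements at (arbitrary) address 20. */
    unsigned base = 20;
    mem[20] = 11; mem[21] = 22; mem[22] = 33;
    unsigned index = 2;          /* variable holding the array index */

    sp = 0;
    mem[sp++] = base;            /* 1. generate array base at TOS    */
    mem[sp - 1] += index;        /* 2. add the index, result to TOS  */
    /* 3. SP-indirect reference: TOS holds the element's address.    */
    unsigned value = mem[mem[sp - 1]];

    printf("a[%u] = %u\n", index, value);   /* 33 */
    return 0;
}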

One-address instructions are quite adequate. The stack machine uses TOS like the accumulator machine uses its accumulator. TOS always serves as one operand and the result of a binary operator so it remains only to specify the other operand.

All data referencing is therefore made relative to addresses stored in processor registers. Variables are bound to offsets, not absolute addresses! This has the very important consequence that code may run independently of its position in memory. No absolute addresses need appear in the code at all. It is thus known as position-independent code.

3.3 Expression evaluation

Expressions may be evaluated with very few instructions purely using the stack [10]. In a given assignment, the left-hand side is a variable which is referenced via a special register and an offset to which it is bound. Which register (FP or SB) depends on the scope of the variable, as discussed above. The right-hand side is the expression, which is computed at TOS. Each variable appearing in the expression is pushed onto the stack as it is required. When the computation is complete the value of the expression rests at TOS. The assignment is then completed by a pop to the location of the variable.

Instructions take the form of operators which operate on the stack. Typically they pop the top two items as operands and push a result back. This may then be used as an operand to the next instruction and so on until the value of the whole expression is to be found at TOS, e.g. to compute (9+3)×(7-4)…

push #7     7
push #4     4, 7
sub         3
push #9     9, 3
push #3     3, 9, 3
add         12, 3
mul         36

"#" denotes immediate addressing mode. The status of the stack is shown on the right hand side.

A great deal of code is (or may be) written with only short expressions to be evaluated. On some machines stacks have been implemented specifically and solely for the purpose of expression evaluation, taking the form of a small number of processor registers. By eliminating the need for memory references a significant performance increase is obtained.

It is quite possible to survive without an index register to reference array elements by using a variable to act as an indirection vector. The vector into the array (base + index) is maintained and use made of indirect mode to recover or set the element value. This does, however, have the unfortunate consequence of incurring two memory accesses per element reference: one to acquire the vector, a second to acquire the operand.

3.4 Alternation

Alternation between environment processes is provided by the mechanism known as the interrupt [11]. A hardware channel, which supports a signal protocol, is connected to the control unit. When a signal is received, the control unit completes the current instruction, increments the PC, then saves it on the stack. The address of the interrupt routine is copied from the INT register into the PC. The activity is very much like that of calling a subroutine. Return from the interrupt routine is achieved by a pop of the return address back into the PC.

In this way the processor is able to switch rapidly between two routines, although the means of switching in each direction are not symmetric. The interrupt allows rapid switching from a "main" routine into the interrupt routine but subsequent interrupts are disabled upon entry and execution continues until programmed return.

To communicate with multiple environment processes, polling must be used. On entry into the interrupt routine, all such processes are interrogated (polled) until the one which sent the interrupt signal is found.
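A sketch of the polling step in C; the device set and its status flags are invented, and on a real machine entry to the routine would be forced by the hardware signal rather than by a call.

#include <stdbool.h>
#include <stdio.h>

/* Invented device status registers: true = "I raised the signal". */
#define NDEV 3
bool pending[NDEV];

void service(int dev) { printf("servicing device %d\n", dev); }

/* Interrupt routine: interrogate each possible source in turn
   until the one which raised the shared interrupt line is found.
   Further interrupts stay disabled until the programmed return.   */
void interrupt_routine(void) {
    for (int dev = 0; dev < NDEV; dev++) {
        if (pending[dev]) {
            pending[dev] = false;
            service(dev);
            return;          /* hardware pops the saved PC here     */
        }
    }
}

int main(void) {
    pending[1] = true;       /* device 1 signals                    */
    interrupt_routine();     /* hardware would vector here via INT  */
    return 0;
}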

3.5 Summary

FIG. 7 shows the organization of the stack machine, FIG. 6 the programmer's architecture.

Stack machines permit software modularity by supporting procedures and functions at machine level. However, each invocation incurs a considerable time overhead, trading performance for reduced code size and easier development and maintenance of source code. With respect to code length they are very efficient at expression evaluation, and they require only one-address instructions.

Stack machines require only a small instruction set, composed solely of the operators required together with push and pop. An additional "generate address" instruction is required to support access within sequence-associated or pointer-associated structured data. They are compact and simple in architecture and machine code. The instructions themselves may be compactly encoded and simply implemented because there are few of them. Their single severe disadvantage is that every data reference is to main memory, which is both time and power consuming.

A form of alternation is possible, via the mechanism of the interrupt, which requires further support (e.g. software polling).

[10. If the required computation is expressed purely as a function it is possible to avoid the need for the notion of assignment altogether! Functional programming is outside the scope of this guide.

11. The interrupt microprocess of a control unit is discussed in Section 7.]


FIG. 6: Programmer's architecture of stack machine


FIG. 7: Stack machine organization
