Computer Architecture / Organization -- Machine Language (part 2)






2. Simple architectures

2.1 Sequential processing

Programmer's architecture refers to just that set of registers which a machine level programmer needs to know about. Architecture more generally refers to everything such a programmer needs to know, including the available instruction set, addressing modes etc.

Fig. 7 shows the programmer's architecture for purely sequential processing. As may be seen, very little is needed. The size of the general purpose register file shown is arbitrary although it is generally agreed that about eight registers are sufficient for expression evaluation. They may still be used for parameter passing if subroutines are not called while an expression is being evaluated.

The following addressing modes should suffice and permit position independent code…

• Immediate

• Register

• Workspace relative (with optional index modifier)

The following instruction set represents the minimum absolutely necessary to support structured programming with a single level of procedure invocation…

Assignment   Operators   Control flow   Linkage
load         and         cb             bsr
store        or          br             ret
             not
             shl
             shr

Assignment and operator instructions may be two-operand instructions where the second operand is always the destination. Those for control flow and linkage take a single offset operand, except ret which takes none. shl and shr are the shift left and shift right operators, where the first operand is the number of bits (see Section 4) to shift.

It is quite possible to make do with even fewer instructions without loss of generality. For instance {and or not} may be replaced with nand. A computer with just a single instruction has also been reported! Such machines, however, are most probably not a delight to generate code for.
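The claim that {and or not} may be replaced with nand can be verified with a short sketch (in Python rather than machine code; the 4-bit word width is an illustrative assumption, not from the text):

```python
# Sketch: deriving and, or and not from nand alone on 4-bit words.
MASK = 0xF  # hypothetical 4-bit word width

def nand(a, b):
    """Bitwise NAND, masked to the word width."""
    return ~(a & b) & MASK

def not_(a):
    # not a  ==  a nand a
    return nand(a, a)

def and_(a, b):
    # a and b  ==  not (a nand b)
    return nand(nand(a, b), nand(a, b))

def or_(a, b):
    # a or b  ==  (not a) nand (not b), by De Morgan's law
    return nand(nand(a, a), nand(b, b))
```

Each derived operator costs one to three nand instructions, which illustrates the trade: a smaller instruction set, paid for in longer instruction sequences.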

2.2 Parallel processing

Global processor scheduling

Imagine a manager for a construction company who is to build a house with a given number of workers. Each worker can take on any task so long as it is unambiguously described. Obviously there are a lot more tasks than there are workers. Procedures describing how to perform each task already exist. It remains for the manager to schedule processes between workers in such a way as to promote efficiency. Fig. 8 depicts the problem of scheduling processes onto processors.

Also previously defined is which tasks may proceed concurrently. Some tasks will be specified as necessarily proceeding in a given sequence. The manner in which tasks relate (communicate) with each other must be specified. An example of a procedure which cannot commence until another has finished is the erection of the roof. Clearly this should not begin until all four walls are built.

The manager of a computer is termed an operating system and is distinct from the running application. One of the operating system activities is process management. Let us assume that this is the only activity of the operating system and henceforward simply refer to it as the process manager. It is the job of the process manager to schedule processes to processors (workers) in such a way as to…

• Balance load between workers

• Minimize worker idleness

• Guarantee absence of deadlock

See Section 1 for a discussion of these and related problems.


Fig. 8: Global scheduling of processes to processors


Fig. 9: Local process scheduling

Local processor scheduling

Each worker will have a number of tasks to perform. Hence he now has the problem of scheduling himself. There are some facilities which a processor (or worker) needs to make this possible, and some which just make it easier. (A building site-worker might make do with a pencil and a scrap of paper but a processor needs a bit more.) Each processor must maintain a ready queue holding the identities of all processes which are ready. The ready queue might form the basis of a round robin schedule such that, when its turn comes, each process is executed until…

• It terminates

• It waits for communication

• It has been executing for more than a timeslice

A timeslice on a computer is typically about a millisecond. Round robin scheduling is fairer to shorter lived processes than most alternatives.

It is obviously inefficient to maintain a queue in a computer by moving all the remaining members up when one leaves. Instead a sequence-associated structure is kept with two pointers: queue front points to the head of the queue (the next member to leave) and queue back points to its tail (Fig. 9).
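The pointer scheme can be sketched as follows (a Python model, not machine code; all names are illustrative). Each member records the identity of its successor, so joining and leaving cost only pointer updates, never a wholesale move:

```python
# Sketch of a linked ready queue with front and back pointers.
class Queue:
    def __init__(self):
        self.front = None   # next member to leave
        self.back = None    # last member to join
        self.next = {}      # each member's link to its successor

    def enqueue(self, member):
        """Join at the back: one link update plus the back pointer."""
        self.next[member] = None
        if self.back is None:
            self.front = member          # queue was empty
        else:
            self.next[self.back] = member
        self.back = member

    def dequeue(self):
        """Leave from the front: advance the front pointer only."""
        member = self.front
        self.front = self.next.pop(member)
        if self.front is None:
            self.back = None             # queue now empty
        return member
```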

Two new instructions are required to support a process ready queue: endp terminates a process and causes the next in the queue to be despatched to the processor; startp places a newly ready process at the back of the queue. Both can be made very simple if each queue entry is just a pointer to the value required in the program counter for that process to run or continue.

The change of process on a processor requires a context switch. The context is simply the processor state and process state. When a process terminates there is obviously no need to save context. If it is just suspending, to resume later, context must be saved and later restored. Context switching is reduced almost to zero by two simple expedients. First, maintain all process state in workspace except when evaluating expressions. Second, forbid suspension while expression evaluation is in progress. As a result no process state is in registers when suspension occurs except program counter and workspace pointer. The program counter may be saved at a reserved offset within workspace and the workspace pointer used as process identifier. A context switch is thus very fast because it involves saving and loading just two registers, WS and PC.
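The startp/endp mechanism may be sketched as follows (a Python model with hypothetical names; the flat memory dictionary and the choice of workspace offset for the saved PC are illustrative assumptions):

```python
# Sketch: workspace pointer doubles as process identifier; the saved
# program counter lives at a reserved offset within the workspace.
from collections import deque

memory = {}        # flat memory model (assumption)
PC_OFFSET = 0      # reserved workspace slot for the saved PC (assumption)
ready = deque()    # ready queue of workspace pointers

def startp(ws, pc):
    """Place a newly ready process at the back of the ready queue."""
    memory[ws + PC_OFFSET] = pc   # save its program counter in workspace
    ready.append(ws)              # the queue entry is just the WS pointer

def endp():
    """Despatch the next ready process: restore only WS and PC."""
    ws = ready.popleft()
    pc = memory[ws + PC_OFFSET]
    return ws, pc                 # the entire restored context
```

The context switch here really is just the two values WS and PC, which is the whole point of keeping all other process state in workspace.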

The implementation of PAR and ALT constructs is beyond the scope of this section and must wait for Part III which also considers the architecture support for concurrency in considerably greater depth and detail.


Fig. 10: Process states and state transitions


Fig. 11: Programmer's architecture to support concurrent sequential processes

Communication

Where only one processor is involved there must be a mechanism establishing synchronization. The problem is exactly like meeting up with a friend during the day when only the location of the rendezvous has been established. Suppose that you arrive to find your friend is not yet there, i.e. you have arrived first.

Obviously you must suspend the process which involves the meeting. In order not to suspend all your processes you sensibly decide to leave a note giving your location for the rest of the day. Having arrived, your friend is able to find you and the suspended process is free to resume.

There is no need for the system to maintain a list of suspended processes since each may be rescheduled by the second process to arrive at the rendezvous, which knows its identity.

To summarize, a process may be…

• Running

• Ready

• Suspended

Fig. 10 shows the relationship between the process states.

The minimum additional instructions needed to support synchronous inter process communication may be given the mnemonics in and out. They operate in a highly symmetric manner. A previously declared memory location suffices for a channel rendezvous and is initialized with a value denoting "empty". If either instruction finds the rendezvous empty, it deposits its process identifier and suspends. The second to arrive detects that the rendezvous is not empty, completes the communication (in either direction) and reschedules the first process by simply enqueueing its identifier.
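A sketch of this symmetric rendezvous (a Python model; the names and the return values standing in for "suspend" and "continue" are illustrative assumptions):

```python
# Sketch: first process to reach the rendezvous deposits its identity and
# suspends; the second completes the transfer and reschedules the first.
EMPTY = None
ready = []                        # ready queue of process identifiers

class Channel:
    def __init__(self):
        self.rendezvous = EMPTY   # one memory word, initialized "empty"
        self.data = None

def out(chan, pid, value):
    """Send a value over the channel."""
    if chan.rendezvous is EMPTY:
        chan.rendezvous = pid     # arrived first: deposit identity...
        chan.data = value
        return "suspended"        # ...and suspend
    waiter = chan.rendezvous      # arrived second: complete the exchange
    chan.rendezvous = EMPTY
    chan.data = value
    ready.append(waiter)          # reschedule the suspended partner
    return "continue"

def in_(chan, pid):
    """Receive a value over the channel."""
    if chan.rendezvous is EMPTY:
        chan.rendezvous = pid
        return "suspended", None
    waiter = chan.rendezvous
    chan.rendezvous = EMPTY
    ready.append(waiter)
    return "continue", chan.data
```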

The rendezvous mechanism is considered in greater detail in Part III.

Summary of concurrency support

The minimum programmer's architecture to support concurrent sequential processes on a single processor is shown in Fig. 11.

The minimum additional instructions needed to support both process scheduling and synchronous inter-process communication are {startp endp in out}.

2.3 Modular software

Refer to Section 2 for a full account of the use of modules in software engineering.

There is a performance overhead with invocation of a procedure in another module due to the extra level of indirection required. Remember that modules serve the purposes of development, maintenance and cost reduction, not application performance. As such they may be thought of as the software equivalent of hardware chips.

A module descriptor is required to support linkage with other modules and is made up from the following pointers…

• Program base

• Static base

• Link table

Fig. 12: Programmer's architecture for module support

An extra processor register, shown in Fig. 12, is provided to house a pointer to (the address of) the descriptor of the current module.

Program base is the address of the base of module code, which consists of the concatenated code for all its constituent procedures. A given procedure is thus referenced via an offset from this pointer.

Static base is the address of the base of workspace reserved for the module as a whole. Variables therein have scope which is global to all module procedures.

Referencing global variables can cause highly undesirable side effects and should be kept to an absolute minimum or (preferably) altogether eliminated.

Link table is the address of the base of a table of absolute addresses and procedure descriptors allowing procedures in the current module to reference variables and procedures, respectively, in others. A procedure descriptor consists of its parent module descriptor together with its offset from program base.

A special instruction is required to invoke an external procedure and an extra addressing mode to reference external variables.
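The descriptors and the external invocation path might be modeled as follows (a Python sketch; the class names, field names and addresses are all hypothetical):

```python
# Sketch: resolving an external procedure call via the module descriptor
# and its link table.
class ModuleDescriptor:
    def __init__(self, program_base, static_base, link_table):
        self.program_base = program_base  # base of concatenated module code
        self.static_base = static_base    # base of module-global workspace
        self.link_table = link_table      # external references of this module

class ProcedureDescriptor:
    def __init__(self, module, offset):
        self.module = module              # parent module descriptor
        self.offset = offset              # offset from that module's program base

def call_external(current_module, link_index):
    """Resolve a link-table entry to (target module, entry address)."""
    proc = current_module.link_table[link_index]
    return proc.module, proc.module.program_base + proc.offset
```

The extra level of indirection mentioned above is visible here: the call goes through the link table and a second module descriptor before an entry address is known.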

3. Instruction set complexity

3.1 Reduced instruction set computer (RISC)

Description

RISC philosophy calls for little more than the minimum necessary set of instructions which enables all possible programs to be compiled. Correct selection of instructions is therefore critical and has been helped by inspection of the usage made of instructions by existing compiler code generators. Instructions themselves are kept simple and are of common, small size. To summarize the principal features of a RISC design…

• Small instruction set optimized for:

- Programming language

- Compiler code generation

• Common, small instruction size

• Fast instruction execution

• Single addressing mode

• Load/store register file expression evaluation

All of these features interact in such a way as to coexist harmoniously. The small instruction set promotes rapid decoding and execution. Decoding requires less hardware which instead may be devoted to a large register file. Hence the register file may be large enough to be used for local variables as well as expression evaluation.

Just because the instruction set is small does not mean that it cannot effectively reduce the semantic gap. Much progress may be made by simultaneous consideration of language, compiler and architecture. An effective match may be maintained between the architecture and compiler code generation strategies. The arrival of RISC in fact champions at least two causes…

• Improved compiler/architecture interface

• Reduction of processor complexity

A more complete description of the features and principles of RISC design must wait for Part III.

Reducing processor complexity has two motivations. It improves reliability and reduces the development time and so gets a new processor to market more quickly. The author has first hand experience of more traditional devices arriving years late and still with serious flaws. Indeed this had become the rule rather than the exception. How much trust can one put in a device when notice of a serious defect is announced long after it was brought to market? Lives now frequently depend on computer reliability. RISC design has already been demonstrated as an answer to this problem. A relatively small company in the UK has recently developed its own RISC. The Acorn ARM arrived on time and for four years (at time of writing) has proved free from "bugs". It also delivers a cost/performance ratio which is an embarrassment to its non-RISC competitors.

We now turn to the underlying reason why a smaller instruction set is able to deliver an enhanced performance…

Exploitation of locality

It is not the case that a RISC delivers improved performance because more, faster instructions simply reduce the total execution time from that required by fewer, slower ones. It is the choice of instructions to implement that counts.

Because RISC designers set out to have fewer instructions, they had to think more carefully about which are really needed and which are not essential.

Temporal locality is an extremely important concept for processing. It is the property of software to reference the same set of stored items in memory within a given time window ( Fig. 13).

The process model of execution encourages this by only allowing references to local variables and procedure parameters. Structured programming also strongly enhances the effect since a loop body will reference the same variables on each iteration.

The RISC philosophy is to reap the maximum possible benefit from temporal locality. The idea is that, on process start (or procedure entry), all local variables are loaded into registers where they may be accessed much more rapidly. Afterwards they are stored back in memory. This is called a load/store memory access scheme. If, in addition, parameters are passed in registers then spatial locality is introduced: references will all be to the same spatial window. The idea may be extended by keeping all the local variables of the currently executing group of processes in an enlarged register file.

The load/store scheme, plus the drive for simplicity, requires that only one or two addressing modes are provided. However these are defined in a flexible manner so that cunning use of registers can synthesize further addressing modes.


Fig. 13: Temporal and spatial locality

The most important consequence of load/store access, and its associated single addressing mode, is that memory references are minimized. This is responsible for a large proportion of the performance improvement seen. A secondary consequence is that the compiler code generation is simplified. Another mechanism, called register windowing, almost eliminates the (conventionally high) performance overhead of procedure invocation. Discussion of this is left for Part III.

History

The RISC story began with an IBM project around 1975 called the 801, [Radin 83]. This was never sold commercially. However the results were used in the IBM RT which is now available. The term RISC, together with many of the fundamental ideas, surfaced in the Berkeley designs called RISC I and RISC II, [Patterson & Ditzel 80].

An account of the Berkeley projects, and a fascinating comparison of several recent academic and commercial RISC systems, may be found in [Tabak 87].

Also very highly recommended is [Colwell et al. 85] which seeks to define a RISC, separates and assesses the various associated ideas and gives a thorough contrast with CISCs.

3.2 Complex instruction set computer (CISC)

Description

In CISC design no attempt is made to minimize the size of the instruction set. As well as there being a larger set, the instructions themselves are more complex.

Many different formats are possible and a plethora of addressing modes are provided. For example the DEC VAX architecture supplies addressing modes which auto-increment or auto-decrement an array index after, or before, carrying out an operation.

The result is a complex processor executing instructions rather more slowly than its RISC counterpart. However, some of the instructions and addressing modes achieve much more, when they are required. Given both application and compiler which make use of the powerful features, a superior performance may be demonstrated by a CISC.

The chief characteristics of a CISC are…

• Large instruction set to suit broad range of applications

• Variable size instructions (some very large)

• Many addressing modes (some highly complex)

Motivations for complexity

The motivations towards complexity are largely due to attempts to continue the upgrading of existing, successful products. They may be summarized…

• Speed up specified application operations via hardware implementation

• Reduce semantic gap via hardware implementation of programming language statements

• Maintain upwards compatibility in product line

• Reduce size of machine code software because of high memory cost

Here, the focus of attention for performance increase is the application. The high cost and low access speed of memory which prevailed for many years persuaded designers to turn their attention to reducing code size and the number of memory references by "migrating" subroutines to hardware implementation.

Programming language and the compiler/architecture interface were rarely considered. New instructions could not replace old ones because of the need for upwards compatibility. (Old code still had to run on new machines.) Some new instructions actually operated more slowly than the series of old ones they were supposed to replace, though this was not generally the case. Many of the new instructions were highly specialized, thus rarely used. As complexity increased, so design and development time rapidly increased, as did the incidence of design error.

Closure of the semantic gap

It is possible to implement some procedural constructs in machine language more directly on a CISC. One example is the case instruction of the NS32000 CISC (see Section 10). Another is loop termination: the termination of an iterative FOR loop usually consists of several distinct instructions, but the NS32000 provides for this combination in a single instruction, acb (add, compare & branch), which adds a signed value to the loop index, compares it to zero and then conditionally branches.
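The semantics described for acb might be modeled as follows (a Python sketch of the behaviour, not of the NS32000 encoding; the argument names are illustrative):

```python
# Sketch of acb (add, compare & branch) semantics: add a signed step to
# the loop index, compare with zero, branch back while it is non-zero.
def acb(index, step, branch_target, next_pc):
    index += step
    pc = branch_target if index != 0 else next_pc
    return index, pc
```

A countdown FOR loop then needs just one instruction at its foot: execute the body, then acb with a step of -1, branching back to the body until the index reaches zero.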

Referencing elements in structured data objects is also made much simpler in CISC machine language by the addition of extra addressing modes. Even elements within sequence-associated structures, themselves inside pointer associated ones, may be referenced directly.

Applications support

Applications specific machine language extensions or (performance) enhancements are usually readily available. For example, graphics applications frequently require moving, or applying a common logical operator to, a block of memory at a time. A special addressing mode may be supplied to cope with this.

Similar provision is often found nowadays for string processing and for mathematical (floating-point) operators.

3.3 Comparison of RISC and CISC

Closure of the semantic gap and applications support look very good in the catalogue and have proved popular. However, once the CISC machine language is implemented, an application may not run faster than it would on a RISC. The reasons are twofold. Firstly, the powerful instructions take time to translate into a sequence of primitive operations. Secondly, all that CISC complexity could have been exchanged for a larger register file, to gain greater benefit from locality, or even for another separate processor. The latter would reap benefit if parallelism exists at the problem level and hence at the algorithmic level.

To summarize, the advantages of RISC are improvements in…

• Performance for structured software via exploitation of temporal locality

• Reliability and freedom from design errors

• Design and development path

• Compiler/architecture interface

Those factors which a CISC maintains in its favor are…

• Code length

• Application specific performance

• Upwards compatibility with older machines

It must be emphasized that RISC machines demonstrate several innovations, each of which must be considered separately. Research is still under way to discover which truly offer the greatest rewards. The reader is strongly encouraged to read [Colwell et al. 85] for an up-to-date account of these matters.

Although there is great controversy over the RISC vs CISC issue, several new machines are now available visibly benefiting from RISC architecture. These include the Acorn Archimedes, SUN 4 and IBM RT.

The only arguable disadvantages to a RISC are that the size of the object module is increased and that programming at machine level is slightly more work. It is not harder work since the machine language is simpler. Neither of these is of great importance to someone whose primary concern is performance and who programs in a high level language (as most do).

Apart from an uncompetitive performance, CISC has a more serious disadvantage. The design and development paths are long and risky [2]. By the time some have reached the market they are almost out of date. The cost of development is high and rapidly rising. The greater complexity also impacts reliability.

By contrast the development of RISC designs is short and cheap. For example the Acorn ARM processor was developed on a very short time scale and the very first chip made was reported to function perfectly. It was designed on an Acorn home micro using their own software. The cost and reliability of the chip set make it very competitive.

Designing optimized code generators for a RISC is generally easier than for a CISC. Part of the motivation behind the RISC concept is that existing compilers were not making sufficient use of complex instructions.

Exercises

Question 1

i. When a guide is translated from one natural language to another, is it interpretation or compilation?

ii. Explain what is meant by semantic gap. In your own words, summarize the arguments for and against designing a computer architecture to reduce it.

Question 2

The instruction…

add r0, 0400

means…

Add the contents of register zero, in the register file, to the memory location whose address is 0400 and put the result back in register zero.

State the storage class, access class and addressing mode of each of the two operands.

Question 3

A rather basic processor has only the following instructions…

• nand Bitwise logic operator which operates on its one and only register and an operand in memory. Its result is placed automatically in the register.

• shift Shift operator which shifts the contents of the register left by one bit.

• load… direct addressed operand from memory to register.

• store… operand in register to direct addressed memory location.

• branch… on the condition that the processor state is set, the number of words forward or back specified by the immediate addressed operand.

It also has only a one bit processor state which is cleared if the result of a nand is zero and set otherwise. Both memory locations and the register are four bits wide.

Write a program which computes the AND of two variables stored in memory whose addresses may be referenced symbolically as x and y. Ensure that a zero result clears the processor state and sets it otherwise.


Updated: Wednesday, March 1, 2017 18:35 PST