Computer Architecture / Organization -- System organization [part 1]





1. Internal communication

1.1 System bus

Bus devices

A top-level functional decomposition of a computer yields a requirement for the following components…

• Processor

• Memory

• External communication (Input/Output or I/O)

A computer system may typically be broken down into a number of components called devices, each of which implements, or cooperates in implementing, one or other system function. A minimum of one device is required to implement each function [1]. The use of multiple devices cooperating to implement memory and external communication is commonplace and is discussed in this Section. Systems with multiple processor devices are rapidly becoming more common but introduce considerable complication and are outside the scope of this guide.

Devices must be able to communicate with each other. The form of communication channel employed is the bus [2]. In general a bus permits both one-to-one and one-to-many communication. On the system bus, however, communication is restricted to just one-to-one. FIG. 1 shows how devices are connected to the system bus.


FIG. 1: Devices communicating over the system bus

It is of the greatest importance to understand that the system bus constitutes a shared resource of the system. A device must wait its turn in order to use it. Whilst waiting it will be idle. The bandwidth of a bus is the number of symbols (binary words) which may be transmitted across it in unit time. The bandwidth of the single system bus used in contemporary computers determines the limit of their performance [3].
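As an illustration (the figures here are assumed, not taken from any particular machine): a bus carrying one 32-bit word per transaction, and completing one transaction every four ticks of a 50 MHz clock, has a bandwidth of 12.5 million words, or 50 Mbyte, per second.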

[1. This approach is reductionist and may not be the only way to approach constructing a computer. Artificial neural systems [Vemuri 88] [Arbib 87], inspired by models of brain function, offer an example of a holistic approach. 2. We have met the bus before as a means of communication inside a processor (see Section 7). 3. The purchaser of many a processor "upgrade" has been sadly disappointed to find only a marginal increase in performance because of this.]

Bus structure

The term system bus is used to collectively describe a number of separate channels. In fact it may be divided into the following subsidiary bus channels…

• Address

• Data

• Control

Address bus and data bus are each made up of one physical channel for each bit of their respective word lengths. The control bus is a collection of channels, usually of signal protocol (i.e. single bit), which collectively provide system control. The structure of the system bus is depicted in FIG. 2.


FIG. 2: Subdivision of system bus into address, data and control buses

Control signals may be broken down into groups implementing protocols for the following forms of communication…

• Arbitration

• Synchronous transaction

• Asynchronous transaction (Events)

These form the subject matter of the remainder of this Section.

1.2 Bus arbitration

Arbitration protocol

While a bus transaction occurs, an arbiter decides which of the devices requesting use of the bus will become master of the next transaction. The master controls the bus during the whole of a bus cycle, deciding the direction of data transfer and the address of the word which is to be read or written.

Arbitration must take into account the special demands of the processor fetching code from memory. As a result most commercial processors combine the tasks of arbitration with that of executing code by implementing both in a single device called a central processing unit (CPU).

The arbitration protocol operates cyclically, concurrent with the bus cycle, and is composed of two signals…

• Bus request

• Bus grant


FIG. 3: Bus arbitration with daisy chain prioritization

One physical channel for each signal is connected to each potential master device. A device which requires to become master asserts bus request and waits for a signal on bus grant, upon receipt of which it deasserts bus request and proceeds with a transaction at the start of the next bus cycle. Note that bus request is a signal which is transmitted continuously whereas bus grant is instantaneous. A useful analogy is the distinction between a red stop sign, which is displayed continuously while active, and a factory hooter, which simply sounds briefly once. Both may be considered to be of signal protocol but care must be taken to distinguish which is intended. In digital systems we talk of a level, meaning a continuous signal usually communicated by maintaining a potential, or an edge, meaning an instantaneous signal usually communicated by changing a potential.

Slave devices are typically memory devices, all of which must be located within unique areas of the memory map. Masters are usually I/O processors which drive devices for such purposes as mass storage, archival or communication with other remote computers.

Device prioritization

How does the arbiter decide which device is to be the next master if there are multiple requests pending? As is often the case, the answer to the question lies waiting in the problem as a whole. Here we find another question outstanding… How do we ensure that the more urgent tasks are dealt with sooner? Both questions are answered if we assign a priority to each device and provide a mechanism whereby bus requests are granted accordingly. The simplest such mechanism is that of the daisy chain shown in FIG. 3.
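As a minimal sketch of the logic involved (the four-device chain and its representation in C are illustrative assumptions, not a description of any real arbiter), the following models a grant rippling down a daisy chain and being absorbed by the first requesting device it reaches…

```c
/* Daisy-chain arbitration: the grant enters the device electrically
   closest to the arbiter and ripples down the chain.  A device with a
   request pending absorbs the grant; otherwise it passes it on. */
#include <stdio.h>
#include <stdbool.h>

#define NDEV 4                      /* illustrative chain length */

/* Returns the device which absorbs the grant, or -1 if none requested. */
int daisy_chain_grant(const bool request[NDEV])
{
    for (int dev = 0; dev < NDEV; dev++) {
        if (request[dev])
            return dev;             /* grant absorbed here */
        /* no request pending: grant passed on down the chain */
    }
    return -1;                      /* grant falls off the end */
}

int main(void)
{
    bool request[NDEV] = { false, true, false, true };
    /* devices 1 and 3 both request; device 1, nearer the arbiter, wins */
    printf("bus granted to device %d\n", daisy_chain_grant(request));
    return 0;
}
```

Note that prioritization falls out of the wiring order alone: a device is given higher priority simply by placing it nearer the arbiter.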

1.3 Synchronous bus transactions

Bus cycle

A transaction is a single communication event between two bus devices. Each transaction is synchronous, i.e. each participating device must complete a transaction before proceeding. Asynchronous communication may also be achieved as we shall see later in this Section. On each transaction a single device becomes master and communicates a message to or from just one slave. One transaction occurs on each bus cycle. A single transaction may be subdivided into two phases…

• Address

• Data transfer

The address phase involves transmitting the following messages…

• Address

• Read/Write

The protocol of the address channel is simply made up of timing and word length. That of the read/write channel is, once again, its timing and just a single bit which indicates the direction of data transfer.

Since bus transactions occur iteratively, the operation of the two phases together is referred to as a bus cycle.

Synchronous transaction protocol

As mentioned above the protocol of each of the three component bus channels relies heavily on timing. This is achieved using the system clock and is best explained by a diagram, FIG. 4. The duration of the first phase is the time taken to…

• Setup (render valid) address and R/W

• Send physical message

• Decode address

…and is measured in clock-cycles (ticks). In the diagram example each phase takes just two clock-cycles.

An address must be decoded in order that the required memory register may be connected to the bus. Phase two comprises…

• Connect memory register to data bus

• Setup (render valid) data

• Send physical message (either direction)

• Latch data

Both address and data must remain valid long enough to physically traverse their channel and be successfully latched. [The physical means of sending and receiving messages is discussed in Section 1 and Section 5. The timing restrictions of accessing any physical memory device are discussed in Section 5. In summary these are the setup time, hold time and propagation delay. ]


FIG. 4: Bus cycle showing address and data transfer phases

Synchronous transaction protocol for slow slaves

The time taken by a memory device to render valid data onto the data bus varies according to the device concerned. Because of this, a transaction protocol specifying a fixed interval between valid address and valid data would require that interval to be appropriate to the slowest slave device ever likely to be encountered. This would imply an unnecessarily slow system since, as pointed out earlier, bus bandwidth limits overall system performance.

Wait states are states introduced in the bus cycle, between address valid and data valid, by slow devices to gain the extra time they need (FIG. 5). Any number of wait states are permitted by most contemporary bus transaction protocols. Note that an extra control signal (Rdy) is necessary and that such a protocol implies a slight increase in processor complexity.


FIG. 5: Bus cycle with a single wait state inserted
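The effect of the Rdy signal may be sketched as follows; the three-tick slave latency and the tick-by-tick model are assumptions chosen purely for illustration…

```c
/* Bus cycle with wait states: during the data-transfer phase the
   master samples Rdy each tick and marks time until the slave asserts
   it.  The three-tick slave latency is an arbitrary assumption. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int latency; int elapsed; } Slave;

/* The slave asserts Rdy only once its access time has elapsed. */
static bool slave_rdy(Slave *s) { return ++s->elapsed >= s->latency; }

int main(void)
{
    Slave slow = { .latency = 3, .elapsed = 0 };
    int tick = 0;

    printf("tick %d: address and R/W set up and sent\n", tick++);
    printf("tick %d: address decoded by slave\n", tick++);

    while (!slave_rdy(&slow))               /* Rdy not yet asserted */
        printf("tick %d: wait state inserted\n", tick++);

    printf("tick %d: data valid and latched\n", tick);
    return 0;
}
```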

Address/data multiplexing

The fact that the address bus and data bus are active at distinct times may be used to reduce the cost of the physical system at the expense of a slight reduction in system bus bandwidth.

Multiplexing is the technique of unifying two (or more) virtual channels in a single physical one. It was introduced and discussed in Section 6 where it was shown how to construct a multiplexer and demultiplexer. Time-multiplexing cyclically switches the physical channel between the processes communicating address and data at both ends.

Required bus width is a significant factor in the expense of implementing a system. As has been pointed out previously, when discussing control units [5], interconnect can be particularly expensive given a two-dimensional technology.

Assuming equal data and address bus width, the required interconnect may be halved. A single address/data bus may be employed, passing the address in the first phase of the bus cycle and transferring data in the second.

The performance overhead incurred is simply the extra time required to effect the switching at either end. The address must be latched by the slave and then removed from the bus by the master. A delay must be allowed between phases to ensure that two devices do not attempt to transmit on the same physical channel simultaneously.
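A sketch of the idea follows, assuming equal 16-bit address and data widths and a toy slave memory (both assumptions made purely for illustration)…

```c
/* Time-multiplexed address/data bus: the same physical lines carry
   the address in phase one and the data in phase two.  The 16-bit
   width and toy slave memory are assumptions for illustration. */
#include <stdio.h>
#include <stdint.h>

static uint16_t bus;                 /* single shared address/data bus */
static uint16_t latched_addr;        /* address latch inside the slave */
static uint16_t memory[256];         /* toy slave memory               */

void write_cycle(uint16_t addr, uint16_t data)
{
    bus = addr;                      /* phase 1: master drives address */
    latched_addr = bus;              /* slave latches it               */
    /* a real bus allows a dead time here so that master and slave
       never drive the shared lines simultaneously */
    bus = data;                      /* phase 2: master drives data    */
    memory[latched_addr & 0xff] = bus;  /* slave stores the word       */
}

int main(void)
{
    write_cycle(0x0042, 0xBEEF);
    printf("memory[0x42] = 0x%04X\n", memory[0x42]);
    return 0;
}
```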


FIG. 6: Event signal (interrupt request) and daisy chained event acknowledge (interrupt grant)

1.4 Asynchronous bus transactions

System events

There is another form of communication which requires a physical channel. The behaviour of the running processes will typically be conditionally dependent upon events occurring within the system. Devices must communicate the occurrence of an event to the processor as shown in FIG. 6.

Note that events can occur within the processor as well. For example, the control unit should be capable of detecting an attempt to divide by zero. Such a processor event is typically dealt with by exactly the same mechanism as for system events. The set of system events and that of processor events are collectively known as exceptions.

System events are associated with communication, both internal and external. Completion or failure of asynchronous communication transactions must be signalled to the processor. A signal that an event has occurred is called an interrupt since it causes the processor to cease executing the "main" program [7] and transfer to an interrupt service routine.

Event protocol

Before a system event can be processed it must first be identified. There are two principal methods…

• Polling

• Vectoring

Event polling means testing each and every event source in some predetermined order (see discussion of event prioritization below). Clearly this will occupy the processor with a task which is not directly getting the system task done. Care must be taken to test the most active sources first to minimize the average time taken to identify an event.

Given a pure signal, there exists no choice but polling to identify an event.

Commonplace in contemporary system architectures is a more sophisticated protocol which includes transmission of the event identity by the source.

Whether a running process on the processor will be interrupted depends on the event which caused the attempted interrupt. The interrupt signal is thus more properly referred to as an interrupt request. In order to decide whether interruption will indeed occur, the event protocol of the system must include some form of arbitration. If no event is currently being serviced, the request will be successful and an interrupt grant signal returned.

Thus a complete picture of the required event protocol may now be presented. There are three phases…

1. Event signal (interrupt request)

2. Event acknowledge (interrupt grant)

3. Event identify (interrupt vector)

The symbol used to identify the event may be chosen so as to also serve as a pointer into a table of pointers to the appropriate interrupt service routines. This table is called the interrupt despatch table. Its location must be known to the processor and hence a base pointer is to be found at either of the following…

• Reserved memory location

• Reserved processor register

The event protocol and its efficient means of vectoring a processor to the required interrupt service routine are depicted in FIG. 7.
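The despatch mechanism is easily illustrated in software. The sketch below assumes four hypothetical event sources and vector numbers; it shows only how a vector indexes a table of pointers to interrupt service routines, not the mechanism of any real processor…

```c
/* Interrupt despatch table: the vector supplied by the interrupting
   device indexes a table of pointers to interrupt service routines.
   The sources and vector numbers are invented for illustration. */
#include <stdio.h>

#define NVECTORS 4

typedef void (*isr_t)(void);

void isr_timer(void)    { printf("timer event serviced\n");  }
void isr_disc(void)     { printf("disc event serviced\n");   }
void isr_serial(void)   { printf("serial event serviced\n"); }
void isr_spurious(void) { printf("spurious interrupt\n");    }

/* The table base is what the processor keeps in a reserved register
   or reserved memory location. */
isr_t despatch_table[NVECTORS] = {
    isr_timer, isr_disc, isr_serial, isr_spurious
};

/* Called with the vector transmitted in the event-identify phase. */
void despatch(unsigned vector)
{
    despatch_table[vector & (NVECTORS - 1)]();
}

int main(void)
{
    despatch(1);                     /* source 1 (disc) interrupted */
    return 0;
}
```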

[6. …as discussed in Section 7. 7. The "main" program may be thought of as a routine executed in response to an interrupt signalling that a reset or boot event has occurred.]

Event arbitration

Event protocols must include some means of deciding which event to service given more than one pending. There are three fundamental schemes…

• FIFO

• Round robin

• Prioritization


FIG. 7: Event identification by vector


FIG. 8: Event prioritization and control using an interrupt control unit (ICU)

Prioritized arbitration is the simplest to implement in hardware and is the one depicted in FIG. 6. Event acknowledge channels are arranged in a daisy chain. Each device passes on any signal received that it does not require itself.

Devices, regarded as event sources, must simply be connected in the daisy chain in such a way that higher priority sources are closer to the processor. Software prioritization is also extremely simple. The order of polling is simply arranged such that sources are inspected according to priority. Note that this may well conflict with the efficiency requirement that sources be inspected in order of event frequency. Daisy chain and prioritized polling require only a signal event protocol.

Prioritization of a vectored event protocol, as depicted in FIG. 7, requires a little more hardware but still uses standard components. A priority encoder is used to encode the interrupt vector/event identity and thus ensures that the one transmitted is the highest priority provided that event sources are connected appropriately.


FIG. 9: Event masking and prioritization using priority encoder

An interrupt control unit, FIG. 8, is an integrated device which will usually provide prioritized vectored event protocol as well as FIFO and round robin. In addition it may be expected to provide event masking as well, FIG. 9.
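A behavioural sketch of prioritized, masked event identification follows; the eight-source width and the convention that source 7 has highest priority are assumptions for illustration…

```c
/* Prioritized, maskable event identification: the highest-priority
   unmasked pending request wins and its index becomes the vector.
   Eight sources, with source 7 highest priority, are assumed. */
#include <stdio.h>
#include <stdint.h>

/* Bit i of 'pending' is event source i; bit i of 'mask' set means
   source i is masked off.  Returns the winning vector, or -1. */
int priority_encode(uint8_t pending, uint8_t mask)
{
    uint8_t active = pending & (uint8_t)~mask;
    for (int vec = 7; vec >= 0; vec--)      /* highest priority first */
        if (active & (1u << vec))
            return vec;
    return -1;                              /* nothing to service */
}

int main(void)
{
    /* sources 1, 3 and 6 pending; source 6 masked off */
    printf("vector = %d\n", priority_encode(0x4A, 0x40)); /* prints 3 */
    return 0;
}
```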

Note that, in any prioritized event protocol, if a higher priority event occurs, while a lower priority event is being serviced, interruption of the current interrupt service routine will take place. This is equivalent to preemption of a running process as in any other process scheduler. The event protocol employed effectively determines the scheduling algorithm for the lowest level processes of an operating system. This becomes programmable to a large extent given a contemporary ICU and thus must be decided by the operating system designer.

Direct memory access (DMA)

As pointed out earlier, the processor is unique among bus masters in requiring continuous, high priority access to the memory device containing the executable program code. Other bus masters require direct memory access and must request it from the bus arbiter, which is implemented, together with the processor, in a single device called the central processing unit. These bus masters must report asynchronous communication events to the processor.

Typically a bus master will require to read or write a block of data to or from memory. The size of the block and location in memory for the transfer will need to be under program control, although the actual transfer need not be. To facilitate this a direct memory access controller (DMAC) is used which is said to provide a number of DMA channels under program control. The DMAC conducts the transfers programmed independently of the processor. A schematic diagram is shown in FIG. 10.


FIG. 10: Direct memory access controller (DMAC) connected to CPU and system bus

The protocol employed for communication between the DMAC and the (would-be master) devices may be simple. For example a busy/ready protocol might be employed where the device transmits a ready signal, announcing that it is ready to transfer data, and the DMAC may assert a busy signal continuously until it is free to begin. In addition a R/W channel will be required to indicate the direction of transfer.

A simplified picture of the programmable registers is shown in FIG. 11.

One control bit is shown which would determine whether an event is generated upon completion of a transfer. Other control parameters which may be expected are channel priority and transfer mode. The transfer mode defines the way in which the DMAC shares the system bus with the CPU. The three fundamental modes are as follows (a programming sketch follows the list)…

• Block transfer mode…completes the whole transfer in one operation and thus deprives the CPU of any access to the bus while it does so

• Cycle stealing mode…transfers a number of bytes at a time, releasing the bus periodically to allow the CPU access

• Transparent mode…makes use of bus cycles that would otherwise go unused and so does not delay the CPU but does seriously slow up the speed of data transfer from the device concerned
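The register set of FIG. 11 might be modelled as follows; the register names, field widths and the interrupt-on-completion bit are illustrative assumptions rather than the layout of any real DMAC…

```c
/* Programming one channel of a hypothetical DMAC, modelled on the
   simplified register set of FIG. 11.  Register names, field widths
   and the interrupt-on-completion bit are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef enum { BLOCK, CYCLE_STEAL, TRANSPARENT } xfer_mode_t;

typedef struct {
    uint32_t    address;      /* start location in memory             */
    uint32_t    count;        /* number of words to transfer          */
    bool        write;        /* direction: true = device to memory   */
    bool        irq_on_done;  /* generate an event when complete      */
    xfer_mode_t mode;         /* how the bus is shared with the CPU   */
} dma_channel_t;

/* The processor programs the channel; the DMAC then conducts the
   transfer independently of it. */
void dma_program(dma_channel_t *ch, uint32_t addr, uint32_t count,
                 xfer_mode_t mode)
{
    ch->address     = addr;
    ch->count       = count;
    ch->write       = true;
    ch->irq_on_done = true;
    ch->mode        = mode;
}

int main(void)
{
    dma_channel_t ch0;
    dma_program(&ch0, 0x8000, 512, CYCLE_STEAL);
    printf("channel 0: %u words to 0x%X, mode %d\n",
           (unsigned)ch0.count, (unsigned)ch0.address, (int)ch0.mode);
    return 0;
}
```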

There are some devices which require block transfer mode because they generate data at a very high rate, once started, and are inefficient to stop and restart. Magnetic disc and tape drives usually require this mode.

Although a degree of efficiency is possible by careful design, the fact remains that the system bus is a shared resource and currently sets the limit to overall system performance.


FIG. 11: DMAC channel registers

2. Memory organization

2.1 Physical memory organization

Requirements

The memory sub-system of a computer must fulfill some or all of the following requirements, depending on application…

• Minimum mean access time

• Minimum mean cost

• Non-volatility

• Portability

• Archival

The first three items apply to all systems, regardless of application. The last two really only apply to work-stations (which represent just a small proportion of working systems).

Minimum mean access time (per access)…of memory partially determines the bandwidth of the system bus and thus the performance of the entire system (see preceding Section). It is not necessary to have all memory possessing the minimum attainable access time. That would certainly conflict with other requirements, particularly that of minimum mean cost. The mean access time should be considered over all accesses, not over locations. It is possible to minimize mean access time by ensuring that the fastest memory is that most frequently accessed. Memory management must operate effectively to ensure that the data most frequently referenced is placed in the memory most rapidly accessed. We shall see how to achieve this later on in this Section.
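As a worked illustration (the figures are assumed, not taken from any particular machine): if 95% of accesses are satisfied by memory with an access time of 10 ns and the remaining 5% fall through to memory with an access time of 100 ns, the mean access time over all accesses is 0.95 × 10 + 0.05 × 100 = 14.5 ns, far closer to that of the fast memory than to that of the slow.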

Minimum mean cost (per bit)…over all memory devices employed largely determines the cost of contemporary machines. This is because the majority (~90%) of the fundamental elements (e.g. switches) contained therein are employed in implementing system memory, rather than the processor or external communication [8]. This is simply true of the kind of computer we are building now. There is no fundamental reason it should be this way.

The requirements of minimum cost per bit and minimum access time per access can be made consistent by ensuring that as large a proportion of memory as possible is implemented in the lowest cost technology available, while the remainder is implemented using the fastest.

Non-volatility…of memory means that its contents do not "flow away" when the power supply is turned off. It is essential that at least some of system memory be non-volatile in order that boot code be available on power-up and that data may be saved for a later work session. Non-volatility is difficult to achieve with electronic switches. It is currently only understood how to make switches which "leak" electrons and thus require a constantly applied potential to retain their state. As we shall see later, it is possible to minimize the problem by continually "topping up a leaky bucket".

Magnetic technology is able to provide memory devices which are non-volatile on the timescale of interest, i.e. up to tens of years. More recently, optical and magneto-optical non-volatile technologies have matured to the point of commercial competitiveness.

Portability…of at least a section of system memory is essential for archival of data and to allow work to progress on different machines (e.g. should one break down). Magnetic or optical technology offers an advantage here in not requiring physical contact with media, thus greatly reducing wear and the possibility of damage due to repeated insertion and removal.

Archival…of data means the maintenance of a long-life copy of important data. This poses similar technological demands to those posed by minimum mean cost and portability. The only potential difference is longevity of memory media, which is sometimes limited by environmental constraints such as temperature variations. Periodic refreshing of archival media can be expensive due to sheer volume of data.

[8. In fact, the greatest cost element in contemporary computer fabrication is that of interconnect. Manufacturers of electronic connectors tend to make much bigger profits than those of integrated devices. Of these, those which sell memory do much better than those who sell processors. ]

Technological constraints

It is impossible to fulfill all memory requirements with a single memory device, and it probably always will be, because different physical limits are imposed on optimization against each requirement.

Access time is limited first by the switching necessary to connect the required memory register to the data bus and secondly by the time taken for that memory state to be read and latched. The underlying physical limitation, assuming an efficient physical memory organization, is the charging time required to close a switch. This matter is fully discussed in Section 5.

Cost is limited by the production cost, per element, of the device. This may be subdivided into materials and production process. There is no reason why cheap manufacture should coincide with device performance.

Non-volatility and portability require durability and the avoidance of physical contact with the host. The latter suggests that only a comparatively small physical potential will be available to change the state of a memory element. This is true of magnetic technology but, recently, optical technology has become available, in the form of the laser, which demonstrates that this need not be the case.

Archival requires immunity from environmental effects and long-term non-volatility.

Physical memory devices

A flip-flop is referred to as a static memory because, if undisturbed, its state will persist for as long as power is provided to hold one or other normally-open switch closed. Dynamic memory is an alternative which, typically, is easier to implement, cheaper to manufacture and consumes less power.


FIG. 12: Sense amplification to refresh a dynamic memory

The idea is simply to employ a reservoir (bucket) which represents 1 when charged and 0 otherwise. Implementation in electronic technology requires merely a capacitor. Capacitances are not just cheap to fabricate on an integrated circuit, they are hard to avoid! Roughly four times the memory may be rendered on the same area of silicon for the same cost.

Nothing comes for free. The problem is leakage. Any real physical capacitor is in fact equivalent to an ideal one in parallel with a resistance, which is large but not infinite. A dynamic memory may be thought of as a "leaky bucket". The charge will slowly leak away. The memory state is said to require periodic refreshing. Contemporary electronic dynamic memory elements require a refresh operation approximately every two milliseconds. This is called the refresh interval. Note that as bus bandwidth increases, refresh intervals remain constant and thus become less of a constraint on system performance.

Refreshing is achieved using a flip-flop as shown in FIG. 12. First the flip-flop is discharged by moving the switches to connect to zero potential. Secondly the switches are moved so as to connect one end of the flip-flop to the data storage capacitor and the other to a reference capacitor, whose potential is arranged to be exactly half way between that corresponding to each logic state. The flip-flop will adopt a state which depends solely on the charge in the data capacitor and thus recharge, or discharge, it to the appropriate potential. Flip-flop, reference capacitor and switches are collectively referred to as a sense amplifier. Note that memory refresh and read operations are identical.


FIG. 13: Writing and reading a 0 using opto-mechanical technology

A cost advantage over static memory is only apparent if few flip-flops are needed for sense amplification. By organizing memory in two dimensions, as discussed below, the number of sense amplifiers may be reduced to the square root of the number of memory elements. Thus as the size of the memory device grows so does the cost advantage of dynamic memory over static memory. For small memory devices, static memory may still remain cost effective. The principal disadvantage of dynamic memory is the extra complexity required to ensure refresh and the possible system bus wait states it may imply.

Note that all forms of electronic memory are volatile. The need for electrical contacts, which are easily damaged and quickly worn, generally prohibits portability. Cost per bit is also high since each element requires fabrication. Small access time is the dominant motivation for the use of electronic memory.

That which follows is intended only as an outline and not a full description of a real commercial optical memory device, which is rather more complicated. However the general operation of optical memory, which currently looks set to gain great importance, should become clear. What follows should also serve as an illustration of the exploitation of physical properties of materials for a memory device offering low cost per bit.


FIG. 13 shows how a laser is used as a write head to write a 0. A laser [light amplification by stimulated emission of radiation] is a source of intense light (visible electromagnetic radiation) whose beam may be very precisely located on a surface (e.g. of a rotating disc). Once positioned, a pulse of radiation causes the surface to melt and resolidify forming a small crater or pit. A second, lower power, laser is pointed at the location to be read. Since the surface there has been damaged, its radiation will be scattered and only a very small amount will fall on a light sensitive detector, to be interpreted as a 0.

Together, the lower power laser and detector form the read head. The damage is impossible to undo, hence the data cannot be erased.

Writing a 1 follows the same procedure except that the write head delivers no pulse and thus leaves the surface undamaged. Mirror-like specular reflection will now be detected by the read head, which will thus register a 1, FIG. 14.


FIG. 14: Writing and reading a 1 using opto-mechanical technology

The scheme outlined above should more properly be termed opto-mechanical memory since the reference mechanism is optical but the memory mechanism is mechanical, having two states…damaged and undamaged. An alternative approach, termed magneto-optical memory, uses a laser write head which alters the magnetic rather than the physical state of the surface. This in turn affects its optical properties which may be sensed by the read head. In contrast to opto-mechanical memory, magneto-optical memory may be reused again and again since data may be erased. It thus offers competition for purely magnetic memory with two distinct advantages…

• Significantly greater capacity per device

• Improved portability

…in addition to the non-volatility and extremely low cost per bit which both technologies offer. Access time and transfer rate are similar, although magnetic devices are currently quicker, largely because of technological maturity. The portability advantage of the optical device arises out of the absence of any physical contact between medium and read/write heads. Further reading on optical memory is currently scarce, except for rather demanding journal publications. An article in BYTE magazine may prove useful [Laub 86].

There is an enormous gulf separating an idea such as that outlined above and making it work. For example, a durable material with the necessary physical properties must be found. Also, some means must be found to physically transport the heads over the disc while maintaining the geometry. A myriad of such problems must be solved. Development is both extremely risky and extremely costly.

Physical memory access

Whatever the physical memory element employed, a very large number must be arranged in such a way that any one may be individually referenced as rapidly as possible. The traditional way to arrange memory is as a one-dimensional array of words. Each word is of a specified length which characterizes the system. A unique address forms an index which points to a current location. This arrangement presumes access of just one location at a time.

The essential physical requirement is for address decoding to enable the selected word to be connected, via a buffer, to the data bus. FIG. 15 depicts this arrangement.

If there is no restriction on choice of element referenced, the device may be referred to as random access. Linear devices, e.g. those which use a medium in the form of a tape, impose a severe overhead for random access but are extremely fast for sequential access. A magnetic tape memory can immediately present the storage location whose address is just one greater than that of the last one accessed. An address randomly chosen will require winding or rewinding the tape…a very slow operation. Tape winding is an example of physical transport (see below).

An alternative arrangement is a two-dimensional array of words. FIG. 16 shows this for a single bit layer. Other bits, making up the word, should be visualized lying on a vertical axis, perpendicular to the page. No further decoding is required for these, only a buffer connecting each bit to the data bus.

Each decode signal shown is connected to a vertical slice, along a row or column, through the memory "cube".

Externally the 2-d memory appears one-dimensional since words are accessed by a single address. Internally this address is broken into two, the row address and column address, which are decoded separately. The advantage of row/column addressing is that the use of dynamic memory can yield a cost advantage, using current technology, as a result of needing just one sense amplifier for each entire row (or column). This implies that only √n flip-flops are required for n memory elements.
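The internal address split is simple to express; the 16-bit address and square 256 × 256 array below are assumptions chosen for illustration…

```c
/* Row/column decomposition: one external address is split internally
   into a row address and a column address.  The 16-bit address and
   square 256 x 256 array are assumptions for illustration. */
#include <stdio.h>
#include <stdint.h>

#define COL_BITS 8                  /* 2^8 = 256 columns */

int main(void)
{
    uint16_t address = 0xA37C;      /* external, one-dimensional view */
    uint8_t  row = address >> COL_BITS;              /* upper half    */
    uint8_t  col = address & ((1u << COL_BITS) - 1); /* lower half    */

    /* with multiplexing, row and col share one 8-bit local bus:
       row is latched on RAS, then col on CAS */
    printf("address 0x%04X -> row 0x%02X, column 0x%02X\n",
           address, row, col);
    return 0;
}
```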

Extra external hardware is required to ensure that every element is refreshed within the specified refresh interval. A refresh counter is used to generate the refresh address on each refresh cycle. Assuming that a sense amplifier is provided for each column, it is possible to refresh an entire row simultaneously. Care must be taken to guarantee a refresh operation for each row within the given refresh interval.
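A sketch of the arithmetic involved, taking the two millisecond refresh interval from the text and assuming, for illustration, an array of 128 rows…

```c
/* Refresh control: a refresh counter steps through the rows so that
   every row is refreshed once per refresh interval.  The 2 ms figure
   is from the text; the 128-row array is assumed. */
#include <stdio.h>

#define ROWS       128
#define REFRESH_US 2000             /* ~2 ms refresh interval */

int main(void)
{
    /* one row must be refreshed at least every REFRESH_US / ROWS
       microseconds for the whole array to stay within the interval */
    printf("refresh one row every %d us\n", REFRESH_US / ROWS);

    unsigned refresh_counter = 0;   /* generates the refresh address */
    for (int cycle = 0; cycle < 4; cycle++) {
        printf("refresh cycle %d: row %u\n", cycle, refresh_counter);
        refresh_counter = (refresh_counter + 1) % ROWS;
    }
    return 0;
}
```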


FIG. 15: One-dimensional organization of memory

Note that row address and column address are multiplexed onto a local address bus which may thus be half the width of that of the system bus. Reduction of interconnect reduces cost at the expense of speed and complexity. The following signals must be provided in such a manner as not to interfere unduly in system bus cycles…

• Row address strobe (RAS)

• Column address strobe (CAS) [10]

These are used to implement a protocol for the communication of row and column addresses, i.e. to latch them.

Most contemporary memory devices which offer…

• Low cost per bit

• Non-volatility

• Portability

…require some kind of physical transport. For example, devices using magnetic technology require transport of both read and write heads which impose and sense, respectively, the magnetic field on the medium [11]. As discussed briefly above, optical devices typically require movement of one or more lasers, with respect to the medium, which act as read/write heads.

Two performance parameters of memory devices requiring physical transport are of importance…

• Access time (random access)

• Data transfer rate (sequential access)

Access time is the time taken to physically transport the read/write heads over the area of medium where the desired location is to be found. This operation is known as a seek. The data transfer rate is the rate at which data, arranged sequentially, may be transferred to or from the medium. This also requires physical transport, but in one direction only, and without searching.


FIG. 16: Row/column addressing and multiplexing


FIG. 17 shows the arrangement of the Winchester disc, which possesses a number of solid platters, each of which is read and written by an independent head.

Each sector is referenced as though it belonged to a 1-d memory via a single address which is decoded into three subsidiary values…

• Head

• Track

• Sector

Note that the term sector is used with two distinct meanings…a sector of the circular disc and its intersection with a track!
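Decoding such an address is straightforward; the geometry assumed below (4 heads, 512 tracks per surface, 32 sectors per track) is purely illustrative…

```c
/* Decoding a single linear address into head, track and sector, as
   for the Winchester disc of FIG. 17.  The geometry (4 heads, 512
   tracks per surface, 32 sectors per track) is assumed. */
#include <stdio.h>

#define SECTORS_PER_TRACK  32
#define TRACKS_PER_SURFACE 512

int main(void)
{
    unsigned long address = 40000;  /* linear, as seen by the system */

    unsigned sector = address % SECTORS_PER_TRACK;
    unsigned track  = (address / SECTORS_PER_TRACK) % TRACKS_PER_SURFACE;
    unsigned head   = address / (SECTORS_PER_TRACK * TRACKS_PER_SURFACE);

    printf("address %lu -> head %u, track %u, sector %u\n",
           address, head, track, sector);
    return 0;
}
```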


FIG. 17: Memory access involving physical transport

[ 10. "Strobe" is a common term meaning simply "edge signal".

11. Usually a plastic tape or disc coated in a magnetic material.]

Whatever the arrangement, as far as the system bus is concerned, each individual memory element may be visualized simply as a flip-flop, internally constructed from a pair of normally-open switches or, if one prefers, a pair of invertor gates. Only one connection is actually required for both data in and data out (level) signals. However these must be connected to the data bus in such a way as to…

• Render impossible simultaneous read and write

• Ensure access only when device is selected

FIG. 18 shows how this may be achieved. Note that it is the output buffer which supplies the power to drive the data bus and not the poor old flip-flop! This avoids any possibility of the flip-flop state being affected by that of the bus when first connected for a read operation.

Modular interleaved memory

Here we discuss a method of enhancing system bus bandwidth which also provides a measure of security against the failure of a single memory device.

Memory is divided up into n distinct modules each of which decodes a component of the address bus, the remainder of which (log2 n bits) is decoded to select the required module. The arrangement is shown in FIG. 19. Note that the data bus width is four times that for an individual module and hence requires that each bus master data buffer be correspondingly wide. This implies extra interconnect and hence extra cost. The number of modules required is decided by the ratio of processor cycle to bus cycle (usually the number of clock cycles per bus cycle). Typical for current electronic technology is n=4.

Address interleaving is the assignment of each member of n consecutive addresses to a separate module. The most obvious scheme is to assign the module number containing address x to be x modulo n. It requires that n be a power of two.
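A sketch of the assignment, with n=4 as in the text (since n is a power of two, the modulo may be computed by simply masking the least significant address bits)…

```c
/* Address interleaving over n modules: consecutive addresses fall in
   consecutive modules, so module = address mod n, and the location
   within the module is address / n.  n = 4 as in the text. */
#include <stdio.h>

#define N_MODULES 4                 /* must be a power of two */

int main(void)
{
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned module = addr % N_MODULES;  /* = addr & (N_MODULES-1) */
        unsigned offset = addr / N_MODULES;  /* = addr >> 2            */
        printf("address %u -> module %u, offset %u\n",
               addr, module, offset);
    }
    return 0;
}
```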


FIG. 18: Connection of memory device to bus

Arguably the greatest advantage of modular, interleaved memory is that an instruction cache may be filled with fewer memory references as overhead. Given n=4 and single byte, zero-address format, four instructions may be read using just one bus cycle. As discussed in Section 8, the efficiency of an instruction cache is dictated by the conditional branch probability since a branch, following the success of a condition, will render useless the remaining cache contents.

Simultaneous, independent references to distinct modules are possible. Thus a measure of support for shared memory multiprocessing is afforded. Such benefit extends to the model of CPU plus input/output processor (IOP) [incorporating a direct memory access controller], which effects all external communication, including that with mass storage memory devices.

Each processor may possess its own connection to the modular memory, sending an address and receiving data. An extra memory control unit (MCU), FIG. 20, decides whether two simultaneous requests may be serviced in parallel and, if so, effects both transactions. If two simultaneous requests require access to the same module then the MCU must take appropriate action, typically inserting wait states into the lower priority processor (IOP) bus cycle.


FIG. 19: Address interleaving over a modular memory

Associative cache

The associative cache (FIG. 21) is a means of reducing the average access time of memory references. It should be thought of as being interposed between processor and "main" memory. It operates by intercepting and inspecting each address to see if it possesses a local copy. If so a hit is declared internally and Rdy asserted, allowing the bus cycle to be completed with the cache, instead of main memory, placing data on the data bus. Otherwise, a miss is internally declared and the job of finding data and completing the bus cycle left to main memory. The ratio of hit to miss incidents is called the hit ratio and characterizes the efficiency of the cache. Provision is shown for main memory to assert Rdy if the cache has failed to do so. The device implementing main memory must be designed with this in mind. Also the processor must be capable of the same brevity of bus cycle as is the cache.

Crucial to the performance of the cache is the associativity between the internally recorded address tag and the address on the address bus. Full association implies recording the entire address as the address tag. An address is compared with each tag stored simultaneously (in parallel). Unfortunately this requires a lot of address tag storage, as well as one comparator for each, and is thus expensive. There is a better way.


FIG. 20: Shared access to a modular memory


FIG. 21: Two-way associative cache memory


FIG. 22: One-way (direct mapped) set associative cache

Set association reduces the number of comparators and tag memory required by restricting the number of locations where a match might be found. If that number is n the cache is said to be n-way set associative. Although many schemes are possible an obvious and common one is to divide the address into two components and use the least significant word to simultaneously hash -- Hashing (here) means using the value as an index into an array -- into each of n banks of tag memory. The most significant word is used as a tag. A hit is declared if any of the n tags found matches the most significant word of the address. Remember that all n comparators function simultaneously. Direct mapped association, the limiting case where n=1, is shown in FIG. 22. n-way association may be achieved simply by iterating the structure shown.
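A behavioural sketch of the direct mapped case follows; the 16-bit address, 256-line tag memory, and field names are assumptions chosen for illustration…

```c
/* Direct mapped (one-way set associative) lookup, as in FIG. 22: the
   least significant word of the address indexes the tag memory and
   the most significant word is compared with the stored tag.  Sizes
   (16-bit addresses, 256 lines) are assumed for illustration. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS 8
#define N_LINES    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint8_t  tag;                   /* most significant address word */
    uint16_t data;                  /* the cached word itself        */
} cache_line_t;

static cache_line_t cache[N_LINES];

/* Returns true on a hit (data copied to *out), false on a miss. */
bool cache_lookup(uint16_t address, uint16_t *out)
{
    uint8_t index = address & (N_LINES - 1);   /* hash into tag memory */
    uint8_t tag   = address >> INDEX_BITS;

    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data;   /* hit declared: assert Rdy */
        return true;
    }
    return false;                   /* miss: left to main memory */
}

/* On a miss the line is filled with data intercepted on the data bus. */
void cache_fill(uint16_t address, uint16_t data)
{
    uint8_t index = address & (N_LINES - 1);
    cache[index] = (cache_line_t){ true, (uint8_t)(address >> INDEX_BITS), data };
}

int main(void)
{
    uint16_t word;
    cache_fill(0x1234, 0xBEEF);
    printf("hit: %d\n", cache_lookup(0x1234, &word));  /* 1: tag matches */
    printf("hit: %d\n", cache_lookup(0x5634, &word));  /* 0: same index, different tag */
    return 0;
}
```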

Set associativity requires a replacement algorithm to determine which of the n possible slots is used to cache a data element, intercepted on the data bus from main memory, following a miss. Several possibilities include…

• Least recently used (LRU)

• Least frequently used (LFU)

• First In First Out (FIFO)

LRU is easily implemented in a two-way associative cache by maintaining a single bit for each entry which is set when referenced and cleared when another member of the set is referenced. LFU requires a counter for each member of each set. FIFO requires the maintenance of a modulo n counter for each entry, pointing to the next set member to be updated. Locality suggests that LRU should give the best performance of the three.
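The single-bit LRU scheme for a two-way set associative cache may be sketched as follows (the C representation is an illustrative assumption)…

```c
/* Single-bit LRU for one set of a two-way associative cache, as
   described in the text: a reference sets the bit on the member used
   and clears it on the other, so the member with the bit clear is
   the one replaced on a miss. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { bool used[2]; } set_lru_t;

void lru_touch(set_lru_t *s, int way)   /* called on every reference */
{
    s->used[way]     = true;
    s->used[1 - way] = false;
}

int lru_victim(const set_lru_t *s)      /* called on a miss */
{
    return s->used[0] ? 1 : 0;          /* replace the not-recently-used way */
}

int main(void)
{
    set_lru_t set = { { false, false } };
    lru_touch(&set, 0);                 /* way 0 referenced       */
    lru_touch(&set, 1);                 /* then way 1 referenced  */
    printf("victim = way %d\n", lru_victim(&set));   /* way 0 */
    return 0;
}
```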
