Appendix A:

RISC and CISC definitions:

The technical term RISC has been swamped by the marketing term RISC, to the point that it's lost almost all meaning as a technical term. Almost everything now is described as RISC, even when it isn't. A historical perspective on the development of the computer architectures which "RISC" was invented to describe can help illustrate the point.

Register usage evolved from accumulators and index registers along two paths: one leading to general purpose registers, and another, less popular one, leading to stacks.

Accumulator architecture:
The first computers were accumulator based, performing operations on data stored in a register combined with data loaded from memory locations. Addresses were coded in program memory, and could only be changed by modifying the program. Initially machines had only one accumulator (which all operations implicitly referred to), but later multiple accumulators were sometimes used.
Index registers:
Index registers were used to hold memory addresses, while accumulators were still used for computation. Operations could specify an accumulator for one operand, and an index register to fetch the second operand from memory.
Memory to memory architectures:
With a little modification, index and accumulator registers could be used interchangeably - they were now general purpose, meaning that a register could hold either data or an address. Some designs took this further, allowing a value in memory to be an indirect address itself - sometimes in an unlimited chain of references.
Stack architectures:
General registers can be made even more flexible by allowing them to hold multiple values (either data or addresses) in a useful order - in other words, stacks. At the same time, the design can be simplified back to a single stack (as in an accumulator architecture) with no loss of flexibility, provided that swap and duplicate operations are available.
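
As a rough sketch of why swap and duplicate preserve flexibility, here is a single-stack evaluator in C (all the names here are invented for illustration) computing y - x*x with no named registers at all:

    #include <stdio.h>

    /* Minimal sketch of a single-stack machine; the operation names
       are invented, not taken from any particular architecture. */
    static double stack[16];
    static int sp = 0;

    static void push(double v) { stack[sp++] = v; }
    static double pop(void)    { return stack[--sp]; }
    static void dup(void)  { double a = pop(); push(a); push(a); }
    static void swap(void) { double a = pop(), b = pop(); push(a); push(b); }
    static void mul(void)  { push(pop() * pop()); }
    static void sub(void)  { double b = pop(), a = pop(); push(a - b); }

    int main(void) {
        /* y - x*x for x=3, y=4:  push x; dup; mul; push y; swap; sub */
        push(3.0); dup(); mul();       /* stack: 9            */
        push(4.0); swap(); sub();      /* stack: 4 9 -> -5    */
        printf("%g\n", pop());         /* prints -5           */
        return 0;
    }
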
Register to register architectures:
Although conceptually powerful, memory to memory operations are limited in speed because memory access is relatively slow, while registers are fast. The obvious improvement is to increase the number of registers, and restrict memory access (to load and store operations only) to simplify the overall design. The first major system to follow this idea was the CDC 6600, and the idea was also explored in the IBM 801 project. However, it wasn't until the Berkeley RISC project that this type of design was named a "Reduced Instruction Set Computer". The term CISC was invented at the same time, solely to give a name to the prior generation (memory-memory designs).
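
The difference can be sketched in C, with each statement standing in for one instruction (the register names, memory array, and mnemonics in the comments are invented):

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t mem[1024];    /* hypothetical main memory       */
    static uint32_t r1, r2, r3;   /* hypothetical general registers */

    /* CISC style: one memory-to-memory instruction does all the
       work, e.g. ADD mem[2], mem[0], mem[1]. */
    static void memory_to_memory(void) {
        mem[2] = mem[0] + mem[1];
    }

    /* RISC style: memory is touched only by loads and stores, and
       arithmetic happens register-to-register. */
    static void load_store(void) {
        r1 = mem[0];              /* LOAD  r1, mem[0]  */
        r2 = mem[1];              /* LOAD  r2, mem[1]  */
        r3 = r1 + r2;             /* ADD   r3, r1, r2  */
        mem[2] = r3;              /* STORE r3, mem[2]  */
    }

    int main(void) {
        mem[0] = 2; mem[1] = 3;
        memory_to_memory();
        printf("%u\n", (unsigned)mem[2]);  /* 5                      */
        load_store();
        printf("%u\n", (unsigned)mem[2]);  /* 5, via 4 simple steps  */
        return 0;
    }
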
A diagram can show the approximate lineage of register usage:

      Accumulator
          |
      Accumulator
        +Index
       /     \
    Stack...Memory-   (CISC)
            Data
              |
            Load-     (RISC)
            Store
Program instruction formats changed from simple to complex as designers worked around limited memory capacity, then back to simple as capacity constraints eased.

Simple hardware-interpreted instructions:
Limited hardware meant operations and circuitry had to be simple, hence the use of accumulators. Instruction bits often directly triggered the operations.
Hardware decoded, then interpreted instructions:
More complex instructions were encoded in numeric codes which needed to be decoded to supply the bits to trigger hardware operations.
Multistep instructions, broken down by hardware (microcode):
Commonly used sequences (copy a block of memory, scan a string) could be encoded as a single instruction, which is decoded into several simpler internal operations, executed one at a time.
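
A sketch of the idea in C (the "move block" instruction, its operands, and the memory array are all invented): one macro-instruction expands into a loop of simple micro-steps:

    #include <stdio.h>
    #include <stdint.h>

    static uint8_t memory[256];   /* hypothetical memory */

    /* One "MOVE BLOCK" macro-instruction, expanded (as microcode
       would) into simple load/store/increment/decrement micro-steps. */
    static void move_block(uint8_t src, uint8_t dst, uint8_t count) {
        while (count != 0) {             /* micro-step: test counter      */
            uint8_t t = memory[src];     /* micro-step: load from source  */
            memory[dst] = t;             /* micro-step: store to dest     */
            src++; dst++;                /* micro-steps: bump addresses   */
            count--;                     /* micro-step: decrement counter */
        }
    }

    int main(void) {
        for (int i = 0; i < 4; i++) memory[i] = "ABCD"[i];
        move_block(0, 100, 4);           /* one instruction, many steps */
        printf("%c%c%c%c\n",
               memory[100], memory[101], memory[102], memory[103]);
        return 0;
    }
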
Instructions broken down before execution, executed as simple instructions:
The Intel P6 series and Motorola 68060 operate like this, to compensate for the limits of complex instructions. This approach was actually produced in response to single-step (RISC-style) instruction sets, but is a logical progression from microcode.
Single-step instructions, no need to break down:
This is basically the idea presented as RISC.
The execution of instructions evolved to perform more operations at a time.

One at a time:
One instruction was fetched, decoded if necessary, and executed before the next one. For microcoded processors, all operations finished before the next instruction was fetched.
Independent stages of multiple instructions at a time (pipeline):
After the decoder finished, it could be used for the next instruction while the first went to the execute stage. If an instruction took more than one cycle in a stage, the compiler could anticipate this and insert a NOP if the delay was predictable, or hardware could detect it and stall the pipeline until execution finished. This is the technique used in the initial RISC processors.
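
A toy trace of the overlap might look like this (a sketch only - the stage names, the one-cycle load delay, and the instruction mnemonics are simplifying assumptions):

    #include <stdio.h>

    /* Toy trace of a 3-stage pipeline (fetch/decode/execute).  The
       LOAD result isn't ready for the instruction right after it, so
       the compiler has filled the load delay slot with a NOP
       (hardware could instead detect the hazard and stall a cycle). */
    int main(void) {
        const char *prog[] = { "LOAD r1", "NOP", "ADD r2,r1", "STORE r2" };
        int n = sizeof prog / sizeof prog[0];

        printf("cycle  fetch      decode     execute\n");
        for (int cycle = 0; cycle < n + 2; cycle++) {
            const char *f = (cycle < n)                   ? prog[cycle]     : "-";
            const char *d = (cycle >= 1 && cycle - 1 < n) ? prog[cycle - 1] : "-";
            const char *e = (cycle >= 2 && cycle - 2 < n) ? prog[cycle - 2] : "-";
            printf("%5d  %-10s %-10s %-10s\n", cycle, f, d, e);
        }
        return 0;
    }
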
Multiple parallel instructions, grouped by compiler (VLIW):
The idea was that obviously independent instructions (especially simple ones) could be detected by the compiler, and executed safely without hardware checks.
Parallel instructions grouped partly by compiler and partly by available hardware:
What Intel/HP call EPIC, used in IA-64 and TMS 320C6x.
Parallel instructions grouped by hardware (superscalar):
Self explanatory.
Instructions executed out of order, grouped by hardware:
Generally with the use of automated resource renaming.
At the time of introduction, RISC and "load-store" were often synonymous, but RISC usually referred to a list of features (actually implementation techniques) along these lines:

    Load/store memory access
    Fixed length, simple instructions
    Single cycle execution, with pipelining
    A large register set, often with register windows
    Hardwired control rather than microcode

Register windows turned out not to be a useful enough idea to catch on, and have been mostly forgotten. RISC is now more commonly used to refer to a design philosophy than to that list of features, especially since most of the techniques have been applied to CISC designs (pipelines as far back as the Zilog Z8000, register windows in the Hitachi H16 and H32, and microcode eliminated in most modern designs). Basically, RISC asks whether hardware (for complex instructions or memory-to-memory operations) is necessary, or whether it can be replaced by software (simpler instructions and a load/store architecture). Higher instruction bandwidth is usually offset by a simpler chip that can run at a higher clock speed, and by more optimisations being available to the compiler.

By contrast, the CISC philosophy has been that if added hardware can result in an overall increase in speed, it's good - the ultimate goal being to map every high level language statement onto a single CPU instruction. The disadvantage is that it's harder to increase the clock speed of a complex chip. The PowerPC is a good example of this idea applied to a load-store architecture.


IBM System 360/370/390: The Mainframe (1964) . . . .

The IBM System/360 is a sort of geologic feature in the computer world, and isn't at all a microprocessor, but was certainly influential (and enough people asked for it to be included in this list). It was designed to be an "all around" (as in, 360 degrees) system usable for any computing task, and as a result standardized many features for the computing industry (usually implemented previously, but not all together), such as 8-bit bytes and byte addressable memory, 32-bit words, segmented and paged memory (see the Intel 80386), packed decimal, and the EBCDIC character set (the latter isn't really a standard, as most systems use ASCII, except for the fact that immense amounts of data are stored on IBM System/360s in EBCDIC format). It was also meant to be scalable, so the architecture was for the first time separated from the implementation - the initial four models ranged from a simple, microcoded low end to a fast hardwired high end, all matching the formal description (in APL) published in an IBM Systems Journal around 1964.

The S/360 has sixteen 32 bit general purpose registers (occasionally paired up as 64 bit registers), four 64 bit floating point registers (or two 128 bit registers), and a Program Status Word like that in the DEC VAX, except that in the S/360 the PSW includes the program counter (24 bits in the S/360, 31 bits in the S/370 XA (eXtended Architecture, 1983) and later versions). The S/370 (1970) also includes sixteen control registers used by the operating system, and minor instruction set additions (mainly to discourage customers from trusting the compatibility of Amdahl Corporation's faster S/360 compatible systems).

A two stage pipeline was first introduced in the IBM 3033 (1977). Instructions were fetched from the cache into three 32 bit buffers. The Instruction Pre-Processing Function (IPPF) then decoded them, generated operand addresses and stored them in operand address registers, and placed source operands in operand buffers. Decoded instructions were placed into a 4 entry queue until the execution unit was ready.

In some high end models (such as the 360/91, 1967), when a conditional branch occurred, the most likely next instruction was loaded into the IPPF buffer, but the fall-through instruction was not discarded, so either path could be executed without penalty. Two speculative branches could be buffered this way. The 360/91 also featured register renaming, instruction pipelining, instruction caching, out-of-order floating point execution (credited to Robert Tomasulo) and imprecise interrupts (rediscovered almost two decades later by microprocessor designers). Some models had a "loop mode" like that of the Motorola 68010.

Addressing was originally 24 bit, but was extended to 31 bits with the XA architecture (the high bit indicated whether to use 24 or 31 bit addresses). This caused problems with software which stored type information in the unused 8 bits of a 32 bit word; the same thing happened when the Motorola 68000 was expanded from 24 to 32 bit addressing. The S/360 used completely position independent (register+offset and register+index) addressing modes. Virtual memory was added in the S/370, using a segment and paging method - the first 8 bits of an address select an entry in a segment table, which is added to the next 4 or 8 bits to index a page table; the page table entry supplies the upper (12 or 20) bits of the physical memory address, and the rest of the address provides the lower 12 bits (the Intel 80386 uses a similar method, while the Motorola 68030 uses fixed length logical/physical pages instead of variable length segments).
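
A sketch of that lookup in C, using the 24 bit case (8 bit segment index, 4 bit page index, 12 bit offset); the table contents are invented, and the sketch indexes the tables directly rather than adding table addresses:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of S/370-style translation for a 24-bit address:
       8-bit segment index, 4-bit page index, 12-bit byte offset. */
    #define SEGS  256          /* 2^8 segment table entries */
    #define PAGES  16          /* 2^4 pages per segment     */

    static uint32_t *segment_table[SEGS];   /* each entry -> page table */

    static uint32_t translate(uint32_t vaddr) {
        uint32_t seg    = (vaddr >> 16) & 0xFF;     /* top 8 bits  */
        uint32_t page   = (vaddr >> 12) & 0xF;      /* next 4 bits */
        uint32_t offset =  vaddr        & 0xFFF;    /* low 12 bits */
        uint32_t frame  = segment_table[seg][page]; /* upper physical bits */
        return (frame << 12) | offset;
    }

    int main(void) {
        static uint32_t pt0[PAGES] = { [3] = 0x7AB };  /* invented mapping */
        segment_table[0] = pt0;
        printf("%06X\n", translate(0x003123));  /* seg 0, page 3 -> 7AB123 */
        return 0;
    }
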

Like the DEC VAX, the S/370 has been implemented as a microprocessor. The Micro/370 discarded all but 102 instructions (some supervisor instructions differed), with a coprocessor providing support for 60 others, while the rest were emulated (as in the MicroVAX). The Micro/370 had a 68000 compatible bus, but was otherwise completely unique (some legends claim it was a 68000 with modified microcode, plus a modified 8087 as the coprocessor; others say IBM started with the 68000 design and replaced most of the core, keeping the bus interface, ALU, and other reusable parts - the latter is more likely).

More recently, as microprocessor complexity increased, the line moved to microprocessor implementations. A complete S/390 superscalar microprocessor with 64K L1 cache was designed, running at up to 350MHz - a higher clock rate than Intel's 200MHz Pentium Pro available at the time. Addressing was expanded to 44 bits, and in October 2000 a 64-bit version (zSeries) was introduced (the remaining manufacturer of compatible systems then dropped its line, rather than spend the effort to match the new architecture). Two CPU cores run in parallel, with comparison circuitry ensuring that operations whose results don't match are retried, or that the entire CPU is disabled if correct results can't be obtained.


IBM:
http://www.ibm.com/
e-servers from IBM: zSeries mainframe servers:
http://www-1.ibm.com/servers/eserver/zseries/

VAX: The Penultimate CISC (1978) .

The VAX architecture wasn't designed as a microprocessor, though single chip versions were implemented (around 1984). However, it and its predecessor, the PDP-11, helped inspire the design of the Motorola 68000, Zilog Z8000, and particularly the National Semiconductor 32xxx series CPUs. It was considered the most advanced CISC design, and the closest so far to the ultimate CISC goal. This is one reason the VAX 11/780 is used as the speed benchmark for 1 MIPS (Million Instructions Per Second), though its actual execution speed was apparently closer to 0.5 MIPS.

The VAX was a 32 bit architecture, with a 32 bit address range (split into 1G sections for process space, process specific system space, system space, and unused/reserved for future use). Each process has its own 1G process and 1G process system address space, with memory allocated in pages.

It features sixteen user visible 32 bit registers. Registers 12 to 15 are special - AP (Argument Pointer), FP (Frame Pointer), SP and PC (user, supervisor, executive, and kernel modes have separate SPs in R14, like the 68000 user and supervisor modes). All these registers can be used for data, addressing and indexing. A 32 bit PSL (Program Status Longword) keeps track of interrupt levels, program status, condition codes, and access mode (kernel (hardware management), executive (files/records), supervisor (interpreters), user (programs/data)).

The VAX 11 featured an 8 byte instruction prefetch buffer, like the 8086, while the VAX 8600 has a full 6 stage pipeline. Instructions mimic high level language constructs, and provide dense code - for example, the CALL instruction not only handles the argument list itself, but enforces a standard procedure call interface for all compilers. However, the complex instructions aren't always the fastest way of doing things. For example, replacing the INDEX instruction with a sequence of simpler VAX instructions was 45% to 60% faster. This was one inspiration for the RISC philosophy.
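
For illustration, the effect of INDEX is roughly the following (a sketch only - the exact trap behaviour and operand details here are assumptions, not DEC's definition):

    #include <stdio.h>
    #include <stdlib.h>

    /* Rough C rendering of what the VAX INDEX instruction does in a
       single instruction: bounds-check a subscript and fold it into
       an array index computation. */
    static long vax_index(long subscript, long low, long high,
                          long size, long index_in) {
        if (subscript < low || subscript > high) {
            fprintf(stderr, "subscript range trap\n");
            exit(1);
        }
        return (index_in + subscript) * size;
    }

    int main(void) {
        /* One step of indexing a[2] in an array of 8-byte elements. */
        printf("%ld\n", vax_index(2, 0, 9, 8, 0));  /* prints 16 */
        return 0;
    }
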

Further inspiration came from the MicroVAX (VAX 78032) implementation, since in order to reduce the architecture to a single (integer) chip, only 175 of the 304 instructions (and 6 of 14 native data types) were implemented (through microcode), while the rest were emulated - this subset included 98% of instructions in a typical program. The optional FPU implemented 70 instructions and 3 VAX data types, which was another 1.7% of VAX instructions. All remaining VAX instructions were only used 0.2% of the time, and this allowed MicroVAX designs to eventually exceed the speed of full VAX implementations, before being replaced by the Alpha architecture (using binary translators to run VMS and VAX programs on the new CPU).

High end versions of the VAX from the 8700 onward eliminated the need for emulation, while retaining the simpler implementation, by decoding the VAX instruction set into a set of simple microinstructions which were executed by a fast core (a technique later used by National Semiconductor in the Swordfish, as well as by Intel and its competitors in Pentium Pro-type CPUs).


Compaq.com - VAX systems - home:
http://www.compaq.com/alphaserver/vax/
VAXarchive:
http://vax.sevensages.org/

RISC Roots: CDC 6600 (1965) . .

Most RISC concepts can be traced back to the Control Data Corporation CDC 6600 'Supercomputer' designed by Seymour Cray (1964?), which emphasized a small (74 op codes) load/store and register-register instruction set as a means to greater performance. The CDC 6600 itself has roots in the UNIVAC 1103, which many CDC 6600 engineers had worked on.

The CDC 6600 was a 60-bit machine ('bytes' were 6 bits each, but that was a software convention; there was no hardware support for values smaller than a 60-bit word until later versions added a Compare and Move Unit (CMU) for character, string and block operations - a story repeated with the initial DEC Alpha processor), with an 18-bit address range. It had eight 18 bit A (address) registers, eight 18 bit B (index) registers, and eight 60 bit X (data) registers, with useful side effects - loading an address into A1, A2, A3, A4 or A5 caused a load from memory at that address into register X1, X2, X3, X4 or X5, respectively. Similarly, loading an address into A6 or A7 caused X6 or X7 to be stored to memory - loading an address into A0 had no side effects. As an example, to add two arrays into a third, the starting addresses of the sources could be loaded into A2 and A3, causing data to load into X2 and X3; the values could be added into X6, and the destination address loaded into A6, causing the result to be stored to memory. Incrementing A2, A3, and A6 (after each add) would step through the arrays. Side effects such as this are decidedly anti-RISC, but very nifty. This vector-oriented philosophy is more directly expressed in later Cray computers.
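
The side effect can be sketched in C (the memory size and values are invented; the A/X behaviour follows the description above):

    #include <stdio.h>

    /* Sketch of the CDC 6600 A/X register side effect.  Writing
       A1-A5 loads the matching X register from memory; writing
       A6-A7 stores the matching X register; A0 does nothing extra. */
    static long mem[100];          /* invented memory */
    static long A[8], X[8];

    static void set_A(int i, long addr) {
        A[i] = addr;
        if (i >= 1 && i <= 5) X[i] = mem[addr];   /* implicit load  */
        if (i >= 6)           mem[addr] = X[i];   /* implicit store */
    }

    int main(void) {
        for (int i = 0; i < 3; i++) {             /* sources at 10 and 20 */
            mem[10 + i] = i + 1;
            mem[20 + i] = 10 * (i + 1);
        }
        /* a[i] = b[i] + c[i], CDC 6600 style: */
        for (int i = 0; i < 3; i++) {
            set_A(2, 10 + i);          /* A2 <- addr, X2 <- mem   */
            set_A(3, 20 + i);          /* A3 <- addr, X3 <- mem   */
            X[6] = X[2] + X[3];        /* add into X6             */
            set_A(6, 30 + i);          /* A6 <- addr, stores X6   */
        }
        for (int i = 0; i < 3; i++) printf("%ld ", mem[30 + i]);  /* 11 22 33 */
        printf("\n");
        return 0;
    }
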

Most instructions operated on the X registers, with only simple address add/subtract on the A and B registers. Like many RISC-era CPUs, register B0 was hardwired to 0 (because there was no increment instruction, B1 was often set to 1 at the start of a program and used instead, which has led some architects with CDC 6600 experience to conclude that hard-wired registers are a waste of effort anyway).

Integer and floating point values used the same registers. Integer multiply operations were initially to be omitted, but were added by modifying the floating point circuitry, though this limited multiplication to 48-bit integers (compare the integer multiplication in the Intel/HP IA-64). Double precision was supported with instructions which computed the least significant 48 bits of a floating point result, so a double precision number consisted of two single precision numbers - a truncated single precision value, and a smaller number which could be added to it for the full value (a bit clumsy, but it worked).
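
Essentially the same trick survives in software today as "double-double" arithmetic. A minimal sketch with a pair of C floats, using the standard "two-sum" step rather than the 6600's actual circuits:

    #include <stdio.h>

    /* An extended-precision value stored as two native-precision
       parts: value = hi + lo, with lo holding the bits that don't
       fit in hi - a truncated result plus a correction term. */
    typedef struct { float hi, lo; } twofloat;

    static twofloat two_sum(float a, float b) {
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);  /* rounding error of a+b */
        return (twofloat){ s, e };
    }

    int main(void) {
        /* 1.0f + 1e-8f rounds away the small term in single
           precision, but the lo part recovers it. */
        twofloat r = two_sum(1.0f, 1e-8f);
        printf("hi=%.9g lo=%.9g\n", (double)r.hi, (double)r.lo);
        return 0;
    }
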

Only one instruction could be issued per cycle, but multiple independent functional units (eight in the CDC 6600) meant instruction execution in different units could overlap (a scoreboard register prevented instructions from issuing to a unit if their operands weren't available). The units weren't pipelined until the CDC 7600 (1969 - nine mostly different units), at which point instructions could be issued without waiting for operands (they would wait for them in the functional unit if necessary). Compared to the variable instruction lengths of other machines, instructions were only 15 bits (or 30 bits - 12 bits with an 18-bit constant), packed in 60-bit "parcels" (30-bit instructions could not cross parcel boundaries), to simplify decoding (a RISC-like feature). The previous 16 to 32 instructions were stored in an 8-word buffer (like the Motorola 68020 loop buffer). Branches had to arrive at the beginning of a 60-bit parcel.
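
A minimal sketch of the scoreboard idea (the structures are invented; the real 6600 scoreboard also tracked functional units and more hazard types):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy scoreboard: an instruction may issue only if its source
       and destination registers aren't awaiting a pending result. */
    #define NREG 8
    static bool pending[NREG];     /* register awaiting a result */

    typedef struct { int dst, src1, src2; } insn;

    static bool can_issue(insn i) {
        return !pending[i.src1] && !pending[i.src2] && !pending[i.dst];
    }
    static void issue(insn i)    { pending[i.dst] = true;  }  /* in flight  */
    static void complete(insn i) { pending[i.dst] = false; }  /* written back */

    int main(void) {
        insn mul = {6, 2, 3};      /* X6 = X2 * X3                  */
        insn add = {7, 6, 1};      /* X7 = X6 + X1: needs the MUL   */

        issue(mul);
        printf("ADD can issue while MUL runs? %s\n",
               can_issue(add) ? "yes" : "no");   /* no: X6 pending */
        complete(mul);
        printf("ADD can issue after MUL done? %s\n",
               can_issue(add) ? "yes" : "no");   /* yes */
        return 0;
    }
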

The CDC-6600 CPU had no condition code register - all comparisons were part of branch instructions.

I/O was accomplished concurrently with a barrel processor - to cope with I/O latency, the processor had ten contexts, similar to a multithreaded processor. Execution would continue in a context to set up an I/O operation until the operation began, or until the context timed out, at which point it switched to the next context.
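
A sketch of the barrel in C (the context count aside, all the details here are invented): a blocked context simply loses its turn, so slow I/O never stalls the others:

    #include <stdio.h>

    /* Toy barrel processor: ten contexts serviced round-robin. */
    #define NCTX 10

    typedef struct { int pc; int blocked; } context;

    int main(void) {
        context ctx[NCTX] = {{0}};
        ctx[3].blocked = 2;                 /* context 3 waits 2 rotations */

        for (int slot = 0; slot < 3 * NCTX; slot++) {
            context *c = &ctx[slot % NCTX]; /* the barrel rotates          */
            if (c->blocked) { c->blocked--; continue; }  /* skip, no stall */
            c->pc++;                        /* execute one step            */
        }
        for (int i = 0; i < NCTX; i++)
            printf("ctx %d ran %d steps\n", i, ctx[i].pc);
        return 0;
    }
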


TNO-FEL Museum: Computer history:
http://www.tno.nl/instit/fel/museum/computer/en/cdcsystE.html
The First Commercial Computers:
http://physinfo.ulb.ac.be/divers_html/PowerPC_Programming_Info/intro_to_risc/irt2_history4.html
A Seymour Cray Perspective:
http://research.microsoft.com/users/gbell/craytalk/

RISC Formalised: IBM 801 . . .

The first system to formalise these principles was the IBM 801 project (1975), meant as a simple network switching controller and named after the building it was developed in. Like the VAX, it was not a microprocessor (it was implemented in ECL), but it strongly influenced microprocessor designs. The design goal was to speed up frequently used instructions while discarding complex instructions that slowed the overall implementation. Like the CDC 6600, memory access was limited to load/store operations (which were delayed, locking the register until the access completed, so most execution could continue). Branches were delayed, and instructions used the three operand format common to load-store processors. Execution was pipelined, allowing 1 instruction per cycle.

The 801 had thirty two 32 bit registers, but no floating point unit/registers, and no separate user/supervisor mode, since it was an experimental system - security was enforced by the compiler. It implemented a Harvard architecture with separate data and instruction caches, and had flexible addressing modes.

IBM tried to commercialise the 801 design starting in 1977 (before RISC workstations first became popular) with the ROMP CPU (Research OPD (Office Products Division) Mini Processor, released 1986, with first chips as early as 1981), used in the PC/RT workstation, but it wasn't successful. Originally designed for word processor systems, changes to reduce cost included eliminating the caches and the Harvard architecture (but adding 40 bit virtual memory), reducing the registers to sixteen, variable length (16/32 bit) instructions (to increase instruction density), and floating point support via an adaptor to an NS32081 FPU (later, a 68881 or 68882 was available). This allowed a small CPU, only 45,000 transistors, but an average instruction took around 3 cycles.

The 801 itself morphed into an I/O processor for the IBM 3090 mainframes.

This wasn't the only innovative IBM design which never saw daylight. Slightly earlier (around 1971), the Advanced Computer System pioneered superscalar (seven issue) design, speculative execution, delayed condition codes, multithreading, imprecise traps and instruction streamed interrupts, and load/store buffers, plus compiler optimisation to support these features. It was expensive and incompatible with the System/360, so it was not pursued, but many of its ideas did find their way into expensive high end mainframes such as the IBM 360/91 (ACS-360 chief architect Gene Amdahl later founded Amdahl Corporation to make System/360 compatible systems).


IBM ACS:
http://www.cs.clemson.edu/~mark/acs.html

RISC Refined: Berkeley RISC, Stanford MIPS . .

Some time after the 801, around 1981, projects at Berkeley (RISC I and II) and Stanford University (MIPS) further developed these concepts. The term RISC came from Berkeley's project, which was the basis for the fast Pyramid minicomputers and the SPARC processor. Because of this, features are similar, including a windowed register file (10 global and 22 windowed, vs 8 and 24 for SPARC) with R0 wired to 0. Branches are delayed, and like ARM, all instructions have a bit to specify whether condition codes should be set; instructions execute in a 3 stage pipeline. In addition, the next and current PC are visible to the user, and the last PC is visible in supervisor mode.
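
A sketch of how a windowed register file passes arguments (the window and overlap sizes here are invented, not Berkeley's or SPARC's actual numbers):

    #include <stdio.h>

    /* Toy register-window file: CALL slides the window pointer so
       the caller's top registers overlap the callee's bottom ones,
       passing arguments with no memory traffic. */
    #define RFILE 64
    #define WSIZE 16        /* registers visible at once          */
    #define SLIDE  8        /* overlap: top 8 = callee's low 8    */
    static int rfile[RFILE];
    static int cwp = 0;     /* current window pointer */

    static int *reg(int n) { return &rfile[(cwp + n) % RFILE]; }

    int main(void) {
        *reg(WSIZE - 1) = 42;      /* caller writes an "outgoing" register */
        cwp += SLIDE;              /* CALL: slide the window               */
        printf("callee sees %d in r%d\n",
               *reg(WSIZE - 1 - SLIDE),
               WSIZE - 1 - SLIDE); /* same physical register, new name     */
        cwp -= SLIDE;              /* RETURN                               */
        return 0;
    }
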

The Berkeley project also produced an instruction cache with some innovative features, such as instruction line prefetch that identified jump instructions, frequently used instructions compacted in memory and expanded upon cache load, support for multiple cache chips, and bits to map out defective cache lines.

The Stanford MIPS project was the basis for the MIPS R2000, and as with the Berkeley project, there are close similarities. MIPS stood for Microprocessor without Interlocked Pipeline Stages, using the compiler to eliminate register conflicts (and generally to hide any unsafe CPU behaviour from programmers). Like the R2000, the MIPS had no condition code register, and a special HI/LO multiply and divide register pair.

Unlike the R2000, the MIPS had only 16 registers, and two delay slots for LOAD/STORE and branch instructions. The PC and the last three PC values were tracked for exception handling. In addition, instructions were 'packed' (as in the Berkeley RISC), in that many instructions specified two operations that were dispatched in consecutive cycles (not decoded by the cache). In this way, it was a 2 operation VLIW, but executed sequentially. User assembly language was translated into the 'packed' format by the assembler.

Being experimental, there was no support for floating point operations.

SOAR (Smalltalk On A RISC) modified the RISC II design to support Smalltalk.


An Introduction to RISC Processors:
http://wheelie.tees.ac.uk/users/a.clements/RISC/RISC.htm
The Stanford torch Project:
http://www-flash.stanford.edu/torch/

Processor Classifications:

Arbitrarily assigned by me...

    Complex/                                                         Simple/
      CISC____________________________________________________________RISC
      |                                                         14500B*
4-bit |                                                    *Am2901
      |                                   *4004
      |                                *4040
8-bit |                                     6800,650x         *1802
      |                       8051*  *  *8008   *    SC/MP
      |                              Z8    *         *    *F8
      |                F100-L*   8080/5  2650          
      |                             *       *NOVA        *  *PIC16x
      |          MCP1600*   *Z-80         *6809    IMS6100
16-bit|          *Z-280           *PDP11             80C166*  *M17
      |                      *8086    *TMS9900
      |                 *Z8000          *65816
      |                *56002
      |            32016*   *68000 ACE HOBBIT  Clipper      R3000
32-bit|432      [3]  96002 *68020    *   *  *  *   *29000     *   *ARM
      | *         *VAX * 80486 68040 *PSC i960    *SPARC         *SH
      |          Z80000*    *  *    TRON48    PA-RISC
      |    PPro  Pent* [1]---*-------     *    *88100
      | *    * [2]--<860>-*--*-----            *     *88110
64-bit|Rekurs         POWER PowerPC   *        CDC6600     *R4000
      |            x86-64*   *620 U-SPARC *     *R8000         *Alpha
      |     --------------      R10000
[1] - About here, from left to right, the Swordfish and 68060.
[2] - In general, Pentium emulator 'clones' such as the 586, AMD K5, and Cyrix M1 fit about here.
[3] - TMS 320C30 and IBM S/360 go here, for different reasons. Boy, it's getting awfully crowded there!

Okay, an explanation. Since this is only a 2-dimensional graph, and I want to get a lot more across than that allows, design features 'pull' a CPU along the RISC/CISC axis, and the complexity of the design (given the number of bits and other considerations) also tugs it - thus much of the POWER's RISC-ness is offset by its inherently complex (though effective) design. And it also depends on my mood that day - hey, it's ultimately subjective anyway.

