Register usage evolved from accumulators and index registers down two paths, one leading to general purpose registers and another less popular path to stacks.
        Accumulator
             |
     Accumulator + Index
         /        \
    Stack...    Memory-Data (CISC)
                    |
                Load-Store (RISC)

Program instruction formats changed from simple to complex due to limited memory capacity, then back to simple as capacity eased.
By contrast, the CISC philosophy has been that if added hardware can result in an overall increase in speed, it's good, with the ultimate goal of mapping every high level language statement onto a single CPU instruction. The disadvantage is that it's harder to increase the clock speed of a complex chip. The PowerPC is a good example of this idea applied to a load-store architecture.
The S/360 has sixteen 32 bit general purpose registers (occasionally paired up as 64 bit registers), four 64 bit floating point registers (or two 128 bit registers), and a Program Status Word like that in the DEC VAX, except that in the S/360 the PSW includes the program counter (24 bits in the S/360, 31 bits in the S/370 XA (eXtended Architecture, pre 1983) and later versions). The S/370 (pre 1977) also includes sixteen control registers used by the operating system, and minor instruction set additions (mainly to discourage customers from trusting compatibility of Amdahl Corporation's faster S/360 compatible systems).
A two stage pipeline was first introduced in the IBM 3033 (1977). Instructions are fetched from the cache into three 32 bit buffers. The Instruction Pre-Processing Function (IPPF) then decodes them, generates operand addresses and stores them in operand address registers, and places source operands in operand buffers. Decoded instructions are placed into a 4 entry queue until the execution unit is ready.
In some high end models (such as the 360/91, 1967) when a conditional branch occurs, the most likely next instruction is loaded into the IPPF buffer, but the previous next instruction is not discarded, so either can be executed without penalty. Two speculative branches can be buffered this way. The 360/91 also featured register renaming, instruction pipelining, instruction caching, out-of-order floating point execution (credited to Robert Tomasulo) and imprecise interrupts (rediscovered almost two decades later by microprocessor designers). Some had a "loop mode" like the Motorola 68010.
Addressing was originally 24 bit, but was extended to 31 bits with the XA architecture (the high bit indicated whether to use 24 or 31 bit addressing). This caused problems with software which stored type information in the unused 8 bits of a 32 bit word (the same thing happened when the Motorola 68000 was expanded from 24 to 32 bit addressing). The S/360 used completely position independent (register+offset and register+index) addressing modes. Virtual memory was added in the S/370, and used a segment and paging method - the first 8 bits of an address select an entry in a segment table, which is added to the next 4 or 8 bits to get the index of a page table entry, which contains the upper (12 or 20) bits of the physical memory address, while the rest of the address provides the lower 12 bits (the Intel 80386 uses a similar method, while the Motorola 68030 uses fixed length logical/physical pages instead of variable length segments).
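The segment and page lookup described above can be sketched in Python. This is only an illustration, assuming a 24 bit virtual address split into an 8 bit segment index, a 4 bit page index and a 12 bit offset. The names `translate`, `segment_table` and `page_table_memory`, and the use of a dict and a list as tables, are hypothetical and do not reflect the real S/370 table formats.

```python
# Assumed split of a 24 bit virtual address: 8 + 4 + 12 bits.
SEG_BITS, PAGE_BITS, OFFSET_BITS = 8, 4, 12


def translate(vaddr, segment_table, page_table_memory):
    """Translate a virtual address using a segment table and page tables."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = (vaddr >> (OFFSET_BITS + PAGE_BITS)) & ((1 << SEG_BITS) - 1)

    # The segment table entry gives the origin of that segment's page table.
    page_table_origin = segment_table[segment]
    # Origin plus page index selects the page table entry (the frame number).
    frame = page_table_memory[page_table_origin + page]
    # The frame supplies the upper bits, the offset the lower 12 bits.
    return (frame << OFFSET_BITS) | offset


# Hypothetical tables: segment 0x12 has its page table at index 16,
# and page 3 of that segment maps to physical frame 0xABC.
segment_table = {0x12: 16}
page_table_memory = [0] * 32
page_table_memory[16 + 0x3] = 0xABC

physical = translate(0x123456, segment_table, page_table_memory)
print(hex(physical))
```

Here the address 0x123456 decomposes into segment 0x12, page 0x3 and offset 0x456, so only the upper bits change during translation.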
Like the DEC VAX, the S/370 has been implemented as a microprocessor. The Micro/370 discarded all but 102 instructions (some supervisor instructions differed), with a coprocessor providing support for 60 others, while the rest are emulated (as in the MicroVAX). The Micro/370 had a 68000 compatible bus, but was otherwise completely unique (some legends claim it was a 68000 with modified microcode plus a modified 8087 as the coprocessor, others say IBM started with the 68000 design and completely replaced most of the core, keeping the bus interface, ALU, and other reusable parts, which is more likely).
More recently, with increased microprocessor complexity, the line was moved to microprocessor versions. A complete S/390 superscalar microprocessor with 64K L1 cache (at up to 350MHz, a higher clock rate than Intel's 200MHz Pentium Pro available at the time) was designed. Addressing was expanded to 44 bits, and in October 2000, a 64-bit version (zSeries) was introduced (the remaining manufacturer of compatible systems then dropped the line, rather than spend the effort to match the new architecture).
The VAX was a 32 bit architecture, with a 32 bit address range (split into 1G sections for process space, process specific system space, system space, and unused/reserved for future use). Each process has its own 1G process and 1G process system address space, with memory allocated in pages.
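As a small sketch of that layout, the top two bits of a 32 bit address are enough to pick one of the four 1G sections. The function name `vax_region` is hypothetical, and the low-to-high ordering of the sections is assumed to follow the order in which they are listed above.

```python
# Section names in the order given in the text (assumed low to high).
REGIONS = [
    "process space",
    "process specific system space",
    "system space",
    "reserved",
]


def vax_region(address):
    """Return the name of the 1G section that a 32 bit address falls in."""
    if not 0 <= address < 2**32:
        raise ValueError("address must fit in 32 bits")
    # Each section is 2**30 bytes, so bits 31..30 select the section.
    return REGIONS[address >> 30]


print(vax_region(0x00001000))
print(vax_region(0x80000000))
```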
It features sixteen user visible 32 bit registers. Registers 12 to 15 are special - AP (Argument Pointer), FP (Frame Pointer), SP and PC (user, supervisor, executive, and kernel modes have separate SPs in R14, like the 68000 user and supervisor modes). All these registers can be used for data, addressing and indexing. A 32 bit PSL (Program Status Longword) keeps track of interrupt levels, program status, condition codes, and access mode (kernel (hardware management), executive (files/records), supervisor (interpreters), user (programs/data)).
The VAX 11 featured an 8 byte instruction prefetch buffer, like the 8086, while the VAX 8600 had a full 6 stage pipeline. Instructions mimic high level language constructs, and provide dense code. For example, the CALL instruction not only handles the argument list itself, but enforces a standard procedure call for all compilers. However, the complex instructions aren't always the fastest way of doing things. For example, the INDEX instruction was 45% to 60% faster when replaced by simpler VAX instructions. This was one inspiration for the RISC philosophy.
Further inspiration came from the MicroVAX (VAX 78032) implementation, since in order to reduce the architecture to a single (integer) chip, only 175 of the 304 instructions (and 6 of 14 native data types) were implemented (through microcode), while the rest were emulated - this subset included 98% of instructions in a typical program. The optional FPU implemented 70 instructions and 3 VAX data types, which was another 1.7% of VAX instructions. All remaining VAX instructions were only used 0.2% of the time, and this allowed MicroVAX designs to eventually exceed the speed of full VAX implementations, before being replaced by the Alpha architecture (using binary translators to run VMS and VAX programs on the new CPU).
High end versions of the VAX from 8700 onward eliminated the need for emulation while retaining the simpler implementation by decoding the VAX instruction set into a set of simple microinstructions, which were executed by a fast core (a technique later used by National Semiconductor in the Swordfish as well as Intel and competitors in Pentium Pro-type CPUs).
The CDC 6600 was a 60-bit machine with an 18-bit address range ('bytes' were 6 bits each, but that was a software convention - there was no hardware support for values smaller than a 60-bit word until later versions added a Compare and Move Unit (CMU) for character, string and block operations - a story repeated with the initial DEC Alpha processor). It had eight 18 bit A and eight 18 bit B (address) registers and eight 60 bit X (data) registers, with useful side effects - loading an address into A1, A2, A3, A4 or A5 caused a load from memory at that address into registers X1, X2, X3, X4 or X5. Similarly, the A6 and A7 registers had a store effect on the X6 and X7 registers, while loading an address into A0 had no side effects. As an example, to add two arrays into a third, the starting addresses of the sources could be loaded into A2 and A3, causing data to load into X2 and X3, the values could be added into X6, and the destination address loaded into A6, causing the result to be stored in memory. Incrementing A2, A3, and A6 (after adding) would step through the arrays. Side effects such as this are decidedly anti-RISC, but very nifty. This vector-oriented philosophy is more directly expressed in later Cray computers.
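The array example above can be modelled with a few lines of Python. This is a rough sketch rather than an emulator: the class `Cdc6600Registers`, the method `set_a` and the helper `add_arrays` are hypothetical names, memory is a plain list of integers, and ordinary Python addition stands in for the machine's 60 bit arithmetic.

```python
class Cdc6600Registers:
    """Toy model of the A and X registers and their load/store side effects."""

    def __init__(self, memory):
        self.memory = memory      # list of words, indexed by address
        self.A = [0] * 8          # address registers A0..A7
        self.X = [0] * 8          # data registers X0..X7

    def set_a(self, n, address):
        """Write an address register and apply its side effect, if any."""
        self.A[n] = address
        if 1 <= n <= 5:
            self.X[n] = self.memory[address]   # A1..A5: load into X1..X5
        elif n in (6, 7):
            self.memory[address] = self.X[n]   # A6, A7: store from X6, X7
        # A0 has no side effect.


def add_arrays(regs, src1, src2, dest, length):
    """Add two arrays into a third using only A register side effects."""
    for i in range(length):
        regs.set_a(2, src1 + i)                # loads X2
        regs.set_a(3, src2 + i)                # loads X3
        regs.X[6] = regs.X[2] + regs.X[3]      # simplified add into X6
        regs.set_a(6, dest + i)                # stores X6


memory = [0] * 30
memory[0:3] = [1, 2, 3]
memory[10:13] = [10, 20, 30]
regs = Cdc6600Registers(memory)
add_arrays(regs, src1=0, src2=10, dest=20, length=3)
print(memory[20:23])
```

Note that there is no explicit load or store anywhere in `add_arrays`; every memory access happens as a consequence of writing an A register.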
Most instructions operated on X registers, with only simple address add/subtract on the A and B address registers. Like many RISC-era CPUs, register B0 was hardwired to 0 (because there was no increment instruction, often B1 was set to 1 at the start of a program and used instead, which has made some architects with CDC-6600 experience decide that hard-wired registers are a waste of effort anyway).
Integer and floating point values used the same registers. Initially integer multiply operations were to be omitted, but they were added by modifying the floating point circuitry, which limited multiplication to 48-bit integers (check out the integer multiplication in the Intel/HP IA-64). Double precision was supported with instructions which computed the least significant 48 bits of a floating point result, so a double precision number consisted of two single precision numbers - a truncated single precision value, and a smaller number which could be added for the full value (a bit clumsy but it worked).
Only one instruction could be issued per cycle, but multiple independent functional units (eight in the CDC 6600) meant instruction execution in different units could overlap (a scoreboard register prevented instructions from issuing to a unit if the operands weren't available). The units weren't pipelined until the CDC 7600 (1969 - nine mostly different units), at which point instructions could be issued without waiting for operands (they would wait for them in the functional unit if necessary). Compared to the variable instruction lengths of other machines, instructions were only 15 bits (or 30 bits - 12 bits with an 18-bit constant) packed into 60-bit words (30-bit instructions could not cross word boundaries), to simplify decoding (a RISC-like feature). The previous 7 instructions were stored in a buffer (like the Motorola 68020 loop buffer). Branches had to arrive at the beginning of a 60-bit word.
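The rule that a 30-bit instruction may not straddle a 60-bit boundary can be illustrated with a small packing routine. This is a sketch under assumptions: the function name `pack_instructions` is hypothetical, instructions are represented only by their lengths, and filling the unused tail of a 60-bit unit with 15-bit padding (shown as "pad") is an assumed strategy for illustration.

```python
UNIT_BITS = 60  # size of the 60-bit unit that instructions are packed into


def pack_instructions(lengths):
    """Pack 15 and 30 bit instructions into 60-bit units.

    Returns a list of units; each unit is a list holding instruction
    lengths and, where needed, "pad" entries of 15 bits each.
    """
    units = [[]]
    used = 0
    for n in lengths:
        if n not in (15, 30):
            raise ValueError("instructions are 15 or 30 bits long")
        if used + n > UNIT_BITS:
            # The instruction would cross the boundary: pad and start anew.
            units[-1].extend(["pad"] * ((UNIT_BITS - used) // 15))
            units.append([])
            used = 0
        units[-1].append(n)
        used += n
    return units


print(pack_instructions([15, 15, 15, 30, 15]))
```

In this example the 30-bit instruction does not fit in the 15 bits left after three short instructions, so it is pushed to the start of the next 60-bit unit.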
The CDC-6600 CPU had no condition code register - all comparisons were part of branch instructions.
I/O was accomplished concurrently with a barrel processor - to cope with I/O latency, the processor had ten contexts, similar to a multithreaded processor. Execution would continue in a context to set up an I/O operation until it began, or until the context timed out and was switched to the next context.
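The context switching described above can be sketched as a round-robin loop. This follows only the description in the text and is not a model of the actual hardware: the function `run_barrel`, the step names "work" and "start_io", and the `time_slice` parameter are all hypothetical.

```python
from collections import deque


def run_barrel(contexts, time_slice):
    """Run contexts round-robin and return the executed (context, step) pairs.

    Each context is a list of steps, either "work" or "start_io". A context
    keeps running until it starts an I/O operation or uses up its time
    slice, then execution moves on to the next context.
    """
    queues = [deque(steps) for steps in contexts]
    trace = []
    while any(queues):                      # some context still has steps
        for cid, queue in enumerate(queues):
            ran = 0
            while queue and ran < time_slice:
                step = queue.popleft()
                trace.append((cid, step))
                ran += 1
                if step == "start_io":      # I/O has begun: switch context
                    break
    return trace


trace = run_barrel([["work", "work", "work"], ["start_io", "work"]], time_slice=2)
print(trace)
```

With two contexts and a time slice of two steps, the first context is switched out by the timeout, while the second gives up the processor as soon as its I/O operation starts.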
The 801 had thirty two 32 bit registers, but no floating point unit/registers, and no separate user/supervisor mode, since it was an experimental system - security was enforced by the compiler. It implemented a Harvard architecture with separate data and instruction caches, and had flexible addressing modes.
IBM tried to commercialise the 801 design starting in 1977 (before RISC workstations first became popular) with the ROMP CPU (Research OPD (Office Products Division) Mini Processor, 1986, first chips as early as 1981) used in the PC/RT workstation, but it wasn't successful. Originally designed for wordprocessor systems, changes to reduce cost included eliminating the caches and Harvard architecture (but adding 40 bit virtual memory), reducing registers to sixteen, variable length (16/32 bit) instructions (to increase instruction density), and floating point support via an adaptor to an NS32081 FPU (later, a 68881 or 68882 was available). This allowed a small CPU, only 45,000 transistors, but an average instruction took around 3 cycles.
The 801 itself morphed into an I/O processor for the IBM 3090 mainframes.
This wasn't the only innovative design developed by IBM which never saw daylight. Slightly earlier (around 1971) the Advanced Computer System pioneered superscalar (seven issue) design, speculative execution, delayed condition codes, multithreading, imprecise traps and instruction streamed interrupts, and load/store buffers, plus compiler optimisation to support these features. It was expensive and incompatible with the System/360, so was not pursued, but many ideas did find their way into the expensive high end mainframes such as the IBM 360/91 (ACS-360 chief architect Gene Amdahl later founded Amdahl Corporation to make System/360 compatible systems).
The Berkeley project also produced an instruction cache with some innovative features, such as instruction line prefetch that identified jump instructions, frequently used instructions compacted in memory and expanded upon cache load, multiple cache chips support, and bits to map out defective cache lines.
The Stanford MIPS project was the basis for the MIPS R2000, and as with the Berkeley project, there are close similarities. MIPS stood for Microprocessor without Interlocked Pipeline Stages, using the compiler to eliminate register conflicts (and generally hide any unsafe CPU behaviour from programmers). Like the R2000, the MIPS had no condition code register, and a special HI/LO multiply and divide register pair.
Unlike the R2000, the MIPS had only 16 registers, and two delay slots for LOAD/STORE and branch instructions. The PC and last three PC values were tracked for exception handling. In addition, instructions were 'packed' (like the Berkeley RISC), in that many instructions specified two operations that were dispatched in consecutive cycles (not decoded by the cache). In this way, it was a 2 operation VLIW, but executed sequentially. User assembly language was translated to 'packed' format by the assembler.
Being experimental, there was no support for floating point operations.
SOAR (Smalltalk On A RISC) modified the RISC II design to support Smalltalk.
[Chart: CPUs plotted along a horizontal axis running from Complex/CISC (left) to Simple/RISC (right), and grouped vertically by word size (4-bit, 8-bit, 16-bit, 32-bit, 64-bit). The positions of the individual CPUs were lost when the chart was flattened.]
[1] - About here, from left to right, the Swordfish and 68060.
Okay, an explanation. Since this is only a 2-dimensional graph, and I want to get a lot more across than that allows, design features 'pull' a CPU along the RISC/CISC axis, and the complexity of the design (given the number of bits and other considerations) also tugs it - thus much of the POWER's RISC-ness is offset by its inherently complex (though effective) design. And it also depends on my mood that day - hey, it's ultimately subjective anyway.