Section Five: Born Beyond Scalar

Part I: Intel 960, Intel quietly gets it right (1987 or 1988?) . . . .

Largely obscured by the marketing hype surrounding the Intel 80860, the 80960 was actually an overall better processor. It's sometimes considered a successor of the i432 (also called "RISC applied to the 432"), and does have similar hardware support for context switching. This path came about indirectly through the 960 MC designed for the BiiN machine, a joint Intel/Siemens venture, which was still very complex (it included many i432 object-oriented ideas, including a tagged memory system). The M-series ("Military") design predated the released i960 (which removed tag bits and complex instruction microcode), but was released later.

The i960 was retargeted at the high end embedded market (it included multiprocessor and debugging support, and strong interrupt/fault handling, but lacked MMU support), while the 860 was intended to be a general purpose processor (the name 80860 echoing the popular 8086). It replaced the AMD 29K series as "the world's most popular embedded RISC" until 1996.

Although the first implementation was not superscalar, the 960 was designed to allow dispatching of instructions to multiple (undefined, but generally including at least one integer) execution units, which could include internal registers (such as the four 80 bit registers in the floating point unit (32, 64, and 80 bit IEEE operations)) - the 960 CA version (1989) was superscalar. There are sixteen 32 bit global registers which can be shared by all excution units and sixteen register "caches" - similar to the SPARC register windows, but not overlapping (originally four banks). It's a load/store Harvard architecture (32-bit flat addressing), but has some complex microcoded instructions (such as CALL/RET). There are also thirty-two 32 bit special function registers.

It's a very clean embedded architecture, not designed for high level applications, but very effective and scalable - something that can't be said for all Intel's processor designs. However, as part of a patent dispute with DEC, Intel obtained DEC's StrongARM design, and generally dropped the i960 from further development (to the resentment of the developers).


Intel Corporation:
http://www.intel.com/
Intel i960 Processor Info:
http://developer.intel.com/design/i960/INDEX.HTM
Intel's Megaprocessors: i80860, i80960, and iWarp:
http://mes.loyola.edu/faculty/phs_eg769/iMegaproc.htm

Part II: Intel 860, "Cray on a Chip" (late 1988?) . . .

The Intel 80860 was an impressive chip, able at top speed to perform close to 66 MFLOPS at 33 MHz in real applications, compared to a more typical 5 or 10 MFLOPS for other CPUs of the time. Much of this was marketing hype, and it never become popular, lagging behind most newer CPUs and Digital Signal Processors in performance.

The 860 has several modes, from regular scaler mode to a superscalar mode that executes two instructions per cycle and a user visible pipeline mode (instructions using the result register of a multi-cycle op would take the current value instead of stalling and waiting for the result). It can use the 8K data cache in a limited way as a small vector register (like those in supercomputers). The unusual cache uses virtual addresses, instead of physical, so the cache has to be flushed any time the page tables changes, even if the data is unchanged. Instruction and data busses are separate, with 4 G of memory, using segments. It also includes a Memory Management Unit for virtual storage.

The 860 has thirty two 32 bit registers and thirty two 32 bit (or sixteen 64 bit) floating point registers. It was one of the first microprocessors to contains not only an FPU as well as an integer ALU, but also a 3-D graphics unit (attached to the FPU) that supports lines drawing, Gouraud shading, Z-buffering for hidden line removal, and other operations in conjunction with the FPU. It was also the first able to do an integer operation, and a (unique at the time) multiply and add floating point instruction, for the equivalent of three instructions, at the same time (a FPU instruction bit indicated the current and next integer/floating-point pairs can execute in parallel, similar to the Apollo DN10000 CPU/FPU (also 1988) which had an integer bit which affected only the current integer/floating-point pair).

However actually getting the chip at top speed usually required using assembly language - using standard compilers gave it a speed closer to other processors. Because of this, it was used as a coprocessor, either for graphics, or floating point acceleration, like add in parallel units for workstations. Another problem with using the Intel 860 as a general purpose CPU is the difficulty handling interrupts. It is extensively pipelined, having as many as four pipes operating at once, and when an interrupt occurs, the pipes can spill and lose data unless complex code is used to clean up. Delays range from 62 cycles (best case) to 50 microseconds (almost 2000 cycles).


Intel Corporation:
http://www.intel.com/
Assembly Language Programming of the iPSC/860
http://www.cacr.caltech.edu/resources/paragon/i860.tutorial.shtml
Intel's 80860 64- bit RISC CPU
http://www.totse.com/en/computers/computer_technology/80860.html
NeXTdimension
http://www.blackholeinc.com/product/dimension.html
Intel's Megaprocessors: i80860, i80960, and iWarp:
http://mes.loyola.edu/faculty/phs_eg769/iMegaproc.htm

Part III: IBM RS/6000 POWER chips (1990) . . . .

When IBM decided to become a real part of the workstation market (after its unsuccessful PC/RT based on the ROMP processor), it decided to produce a new innovative CPU, based partly on the 801 project that pioneered RISC theory. RISC initially stood for Reduced Instruction Set Computer, but IBM defined it as Reduced Instruction Set Cycles, and implemented a relatively complex processor (POWER - Performance Optimization With Enhanced RISC) with more high level instructions than even many memory-data processors.

The first POWER CPU (POWER1) was implemented using three ICs for the processor - branch, integer and floating point units - plus two or four cache chips, and defined the basic architecture. The branch unit was unusually complex, and contained the program counter, as well as a condition code (CC) register and a loop register. The CC register has eight field sets, the first two reserved for fixed and floating point operations, the seventh (later) for vector operations, and the rest which could be set separately, and combined or checked several instructions later. The loop register is a counter for 'decrement and branch on zero' loops with no branch penalty (similar to certain DSPs like the TMS320C30). POWER1 was also one of the first superscalar CPUs of its generation, the branch unit could dispatch multiple instructions to the two functional unit input queues while itself executing a program control operation (up to four operations at once, even out of order). Speculative branches were supported using a prediction bit in the branch instructions (results discarded before being saved if not taken, the alternate instruction was buffered and discarded if the branch was taken), and the branch unit manages subroutine calls without branch penalties, as well as hardware interrupts. Results are forwarded to instructions in the pipeline which use them before they are written to the registers.

Thirty two 32-bit registers were defined for the POWER1 integer unit, which also included certain string operations, as well as all load/store operations. In addition, it included a special MQ register for extended precision multiply/divides, similar to the MIPS HI/LO registers. Like many other load-store CPUs, register R0 is treated as constant 0 for some instructions, but it is used like a normal register most of the time. The POWER/PowerPC architecture supports memory/data-style 'update' operations, incrementing/decrementing the used address register before a load/store.

The floating point unit had thirty two 64 bit registers, performing only double precision operations, and including a DSP-like multiply-accumulate operation. Floating point exceptions are imprecise - in fact, don't produce exceptions at all, but set a condition bit on an error. The bit must be tested by software to determine if an error occurred.

IBM, Motorola, and Apple formed a coalition (around 1992) to produce a microprocessor version of the POWER design as a successor to both the Motorola 68000 and Intel 80x86, resulting in the PowerPC. The architectural differences began with the elimination of the MQ register, since it would add complexity to possible superscalar versions. This was replaced with separate instructions to calculate the upper and lower parts of a multiplication, which would (with two integer units) execute simultaneously anyway. Division was handled similarly using general registers. In addition, the more complex string operations and three-source instructions were removed, and finally, 32 bit floating point support was added. Dropped POWER instructions were to be emulated in the PowerPC CPUs.

The first PowerPC 601 (1993) was a bridge (considered first generation or G1), and included both POWER and PowerPC features, based strongly on the POWER1, except it had a single 32K cache rather than separate I/D caches. It defined the Motorola 88000 as the standard PowerPC bus. The 603 (1993?, first second generation G2) separated the main functional units further, removing load/store operations from the integer unit (four functional units total - integer, floating point, load/store (using integer registers), branch), and splitting the branch unit into a fetch/branch unit, a dispatch unit, and a completion/exception unit. The 603 also added a rename buffer in the dispatch unit for speculative execution using renamed integer and floating point registers, which are ordered properly by the completion/exception unit, or discarded for mispredicted branches and exceptions. Separate 8K and 16K I/D cache versions were available.

The PowerPC 604 (mid 1995) added dynamic branch prediction using a branch history table, and added two simplified integer units - three integer, two for single-cycle operations, one for multicycle operations such as multiply/divide, plus floating point, load/store and branch, total of six. Four instructions could be dispatched at once The CC register could also be renamed. The PowerPC 620 expanded the 604 design to 64 bits (but with a 'backside' L2 cache bus), and added new 64 bit instructions, but was delivered much later and slower than promised, and was further delayed when it was with drawn for a redesign. The 32 bit PowerPC 750 (G3, early 1998) refined the design and performance, adding a P620-style backside cache bus, but made no other significant changes (notably though, they used a 603-based 32-bit FPU, rather than the 64-bit 604 FPU).

Workstation versions continued with the POWER2 (1993), a high bandwidth design with two floating point load/store units, 256K of data cache, and added 128-bit floating point support and a square root instruction. Initially a multichip design, it was later combined into one chip (P2SC), and then into an eight CPU "SuperChip". It could issue up to six instructions and four simultaneous loads or stores. It was superceded by the POWER3 (Early 1998), with eight functional units (two FPU, three integer (two single cycle, one multicycle), two load/store, and branch unit), but capable of operating at much higher clock speeds. In addition, a 64 bit version, the PowerPC A35 (Apache), was designed for the AS/400 E series which added decimal arithmatic and string instructions, also used in the RS/6000 S70 workstation (called the PowerPC RS64-I).

The A50/RS64-II (Northstar (1998)/Pulsar (1999, faster clock version)) added support for parallel execution, including the idea of vertical multithreading (implemented earlier by the CDC-6600 peripheral processors, and more recently by Tera in their MTA supercomputers in 1998?, which took the idea to extremes, supporting 128 threads per CPU and giving up cache entirely). The CPU state registers (integer and floating point, program counter, condition codes, etc.) are duplicated, allowing execution to be switched to a second thread in three cycles when a load misses the primary cache and causes a delay - the second thread can continue while the load for the first thread completes. The CPU is also designed to minimize branch delays by using a short, simple pipeline (five stages, in-order four-way issue to five units - simple integer, complex (multiply/divide) integer, load/store, branch, and floating point unit), and uses branch pre-fetching (also in PowerPC 750 and newer, identifies branch using LR or CTR registers when fetched and loads target instruction into the processor cache, in addition to branch prediction and target caching).

The POWER4 (late 2001) used in the RS/6000 (renamed "something-Series", where "something" is p, x, n, i, r, z, or something else - I can never remember), increased the number of pipeline stages to fifteen for integer and branch operations, seventeen for load/store, and more for floating point operations, allowing a high clock rate (1.1 GHz at introduction), issuing 8 instructions per cycle to two FPUs, two load/store units, two integer units, a branch unit and a condition code register unit (used by the branch unit). Two processor cores share cache on each chip - for high-bandwidth computing, performance is improved by disabling one CPU.

In addition, IBM and Motorola have designed simplified embedded versions, such as the IBM 40x series, and Motorola's 8xx versions, though complexity limits how small the designs can be - for the lower end, Motorola designed the ARM-like MCore low cost/power RISC CPU, while IBM simply licensed the ARM itself.

In direct response to Intel's MMX instructions, AltiVec extensions were introduced with fourth generation (G4, September 1999) PowerPC CPUs from Motorola (IBM initially declined to support the extensions, until agreeing to become a second source of AltiVec CPUs for Apple Macintoshes). Unlike multimedia extensions which use integer (HP PA-RISC MAX) or floating point registers (Sun VIS, Intel MMX), AltiVec adds an entire new set of 128-bit registers (enough for a vector of four 32-bit floating point numbers) and a separate vector execution unit and instruction set (four operand - three source, one result), supported by the complex PowerPC branch unit. That means that operating system software needs to be modified to preserve additional CPU state information (like the MIPS MDMX which adds a 192-bit accumulator to hold intermediate results, but uses 64-bit floating point registers for data), but it allows multimedia instructions to be executed in parallel with both integer and floating point operations, and to reduce the number of registers to save, an additional register (VRSAVE) is added to track which vector registers are being used - unused registers don't need to be stored. In addition to subword vector operations, AltiVec also includes permutation operations along the same lines as PA-RISC MAX instructions, and subword floating point operations like MIPS MDMX which can also perform vector multiplication allowing 3-D graphics support (see Appendix D) like the Hitachi SH4.

AltiVec and the embedded versions was apparently part of the reason Nintendo decided to switch from MIPS processors to a custom designed IBM variant of the PowerPC as the CPU for its next generation game console, code named "Dolphin".

It's interesting to note that the AltiVec data formats are based primarily on Java standards (based on IEEE), then on IEEE, and lastly on ANSI C9X floating point standards. A "Java mode" provides strict adherence to these standards, a "Non-Java mode" relaxes adherence to allow faster operations (if implemented).

A very high clock rate (500MHz) BiCMOS version called the 704 (based on a simplified 604) was being developed in 1996 by Exponential Technologies, expanding on the type of technology which Intel found necessary to keep its Pentium and Pentium Pro CPUs competitive, but advances in CMOS and a slower initial product (410MHz) sharply reduced the clock speed advantages, cancelling the project (faster, lower power, fully CMOS Pentium and Pentium Pro CPUs have replaced earlier BiCMOS versions). IBM went so far as to produce a 1GHz integer-only demonstration version of a CMOS PowerPC, and used the PowerPC as the first product to replace aluminum conductors with lower resistance copper, boosting clock speeds by about 33%.

Overall, the POWER/PowerPC architecture is a very powerful, almost mainframe-like architecture which could easily have fit into the "Wierd and Innovative" section, violating the traditional RISC philosophy of simplicity and fewer instructions (with over a hundred, including many duplicate which implicitly set CC bits and other which don't), versus only about 34 for the ARM and 52 for the Motorola 88000 (including FPU instructions)). The complexity is very effective, but has somewhat limited the clock speed of the designs (but less so than the even more complex Intel Pentium and Pentium II designs). It's an interesting tradeoff, considering that a highly parallel 71.5 MHz POWER2 managed to be faster than a 200MHz DEC Alpha 21064 of the same generation (though Alpha remained the fastest CPU at any given time until the POWER4).


IBM:
http://www.ibm.com/
IBM PowerPC Microprocessor:
http://www.chips.ibm.com/products/ppc/
Motorola:
http://www.mot.com/
Motorola PowerPC ISA:
http://e-www.motorola.com/webapp/sps/site/homepage.jsp?nodeId=03M943030450467
Apple Computer:
http://www.apple.com/
The PowerPC FAQ
http://linuxppc.org/documentation/powerpcfaq//index.html

Part IV: DEC Alpha, Designed for the future (1992) . . .

The DEC Alpha architecture is designed, according to DEC, for a operational life of 25 years. Its main innovation is PALcalls (or writable instruction set extension), but it is an elegant blend of features, selected to ensure no obvious limits to future performance - no special registers, etc. The first Alpha chip is the 21064.

Alpha is a 64 bit architecture (32 bit instructions) that doesn't support 8- or 16-bit operations, but allows conversions, so no functionality is lost (Most processors of this generation are similar, but have instructions with implicit conversions). Alpha 32-bit operations differ from 64 bit only in overflow detection. Alpha does not provide a divide instruction due to difficulty in pipelining it. It's very much like the MIPS R2000, including use of general registers to hold condition codes. However, Alpha has an interlocked pipeline, so no special multiply/divide registers are needed, and Alpha is meant to avoid the significant growth in complexity which the R2000 family experienced as it evolved into the R8000 and R10000.

One of Alpha's roles is to replace DEC's two prior architectures - the MIPS-based workstations and VAX minicomputers (Alpha evolved from a VAX replacment project codenamed PRISM, not to be confused with the Apollo Prism acquired by Hewlett Packard). To do this, the chip provides both IEEE and VAX 32 and 64 bit floating point operations, and features Privileged Architecture Library (PAL) calls, a set of programmable (non-interruptable) macros written in the Alpha instruction set, similar to the programmable microcode of the Western Digital MCP-1600 or the AMD Am2910 CPUs, to simplify conversion from other instruction sets (VAX running VMS and 80x86 running Microsoft Windows NT) using a binary translator, as well as providing flexible support for a variety of operating systems.

Alpha was also designed for the future for a 1000-fold eventual increase in performance (10 X by clock rate, 10 X by superscalar execution, and 10 X by multiprocessing) Because of this, superscalar instructions may be reordered, and trap conditions are imprecise (like in the 88010). Special instructions (memory and trap barriers) are available to syncronise both occurrences when needed (different from the POWER use of a trap condition bit which is explicitly by software, but similar in effect. SPARC also has a specification for similar barrier instructions). And there are no branch delay slots like in the R2000, since they produce scheduling problems in superscalar execution, and compatibility problems with extended pipelines. Instead speculative execution (branch instructions include hint bits) and a branch cache are used.

The 21064 was introduced with one integer, one floating point, and one load/store unit. The 21164 (Early 1995) began expanding instruction parallelism by adding one integer/load/store unit with byte vector (multimedia-type) instructions (replacing the load/store unit) and one floating point unit, and increased clock speed from 200 MHz to 300 MHz (still roughly twice that of competing CPUs), and introduced the idea of a level 2 cache on chip (8K each inst/data level 1, 96K combined level 2).

The 21264 (mid 1998) expanded this to four integer units (two add/logic/shift/branch (one also with multiply, one with multimedia) and two add/logic/load/store), two different floating point units (one for add/div/square root and one for multiply), with the ability to load four, dispatch six, and retire eight instructions per cycle (and for the first time including 40 integer and 40 floating point rename registers and out of order execution), at up to 500MHz. Integer registers are duplicated, with two independent pipelines, to simplify register access, similarly to the Sun MAJC functional units, except that instructions are grouped by the dispatching logic into two streams. Values transferred between register banks (instruction streams) add a delay.

The 21464 (expected 2003) begins the multiprocessor strategy by adding five high speed interconnects (four CPU (10 GB/s), almost identical in concept to the Transputer CPUs, and one I/O (3 GB/s)) to an enhanced 21264 core. The 21464 is expected to add an eight- or ten-issue CPU core, and support for multithreading like the IBM Northstar POWER CPU.

Multimedia extensions introduced with the 21264 are simple, but include VIS-type motion estimation (MPEG).

DEC's Alpha is in many ways the antithesis of IBM's POWER design, which gains performance from complexity, and the expense of a large transistor count, while the Alpha concentrates on the original RISC idea of simplicity and a higher clock rate - though that also has its drawback, in terms of very high power consumption.

In 1998, DEC was purchased by Compaq. The transition was blamed by some with the 21264 being unable to exceed 833MHz clock speeds, while CPUs from Intel, AMD, and SiByte (MIPS) gained attention by exceeding 1GHz (Alpha performance remained near the top of the competition, but it had also previously also had the highest clock speeds). Apparently a design flaw was to blame, but rather than redesigning the existing chip, the resources were spent developing the 21364 and 21464 instead (though rumour has it Compaq parners Samsung and IBM were able to produce 1GHz Alphas in 1999, but were forbidden by Compaq because they did not have a support chip set able to handle that speed). In 2001, Compaq cancelled completion of the 21464, deciding to adopt the IA-64 instead, and sold all Alpha intellectual property (from circuits to compilers, and even the Alpha design team) to Intel, shortly before the announcement of a controversial merger with Hewlett-Packard (nasty accusations claimed it was pressure from HP, others claimed Intel was embarassed by both AMD's Athlon performance (due partly to DEC Alpha engineers who moved to AMD when Compaq bought DEC) and the Intel-designed Itanium's poor performance when compared to almost all competitors, and especially to the second IA-64 processor designed by HP).

Alpha Processor Inc. renamed itself API Networks, and switched to chip interconnect products based on an AMD-initiated HyperTransport standard.


Alpha Processor Inc. Networks:
http://www.alpha-processor.com/
Compaq Enterprise Solutions and Services:
http://www5.compaq.com/enterprise/
Samsung Semiconductor:Alpha CPU
http://samsungelectronics.com/semiconductors/Alpha_CPU/Alpha_CPU.htm

Previous Page
Table of Contents
Next Page