Fool.com: Inside Intel Again: RISCing the Pentium [Rule Maker] February 24, 2000

Inside Intel Again: RISCing the Pentium
Part 3

By Rob Landley (TMF Oak)
February 24, 2000

While Intel (Nasdaq: INTC) was fiddling around with the 386 and 486 designs, which the last two columns described (Part 1 and Part 2), the rest of the industry wasn't standing still. PCs had manufacturing economies of scale that dwarfed the mainframe and minicomputer markets, so PC chipmakers could afford to come out with a faster batch of chips every few months, leading to a terrifying rate of advance. Terrifying, that is, if you were trying to compete against them. If the high-end players were going to stay ahead of the game, they would have to do it with better designs.

The big new idea in the high end grew out of the supercomputers of a man named Seymour Cray, which are often cited as its forerunners, and out of later research at IBM, Berkeley, and Stanford. It was called a "RISC" design, which is an acronym for "Reduced Instruction Set Computing." A RISC instruction set is simplified, with each instruction in it ideally taking only one clock cycle to complete. The older designs were full of complicated instructions that took many clock cycles, and thus the acronym "CISC" (standing for "Complex Instruction Set Computing") was retroactively invented to describe them.

In a RISC design, the more complicated tasks that took many clock cycles to complete were broken up into smaller pieces, and each individual step the processor did per clock cycle was given its own instruction number. (Since these were basically the steps the chip had to do behind the scenes anyway, this wasn't necessarily much of a change. The new instruction set was simply more explicit.) In addition, each instruction was made the exact same length, which means that on a 32-bit chip each instruction would be exactly 4 bytes long.
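To see why same-size instructions matter, here is a minimal Python sketch. The byte values are made up, and the rule that the first byte of a variable-length instruction holds its length is purely an assumption for illustration; no real instruction set is encoded this way.

```python
def fixed_length_starts(program: bytes, width: int = 4) -> list[int]:
    """Every instruction is `width` bytes long, so the start of each one
    is simple arithmetic. Nothing has to be decoded first."""
    return list(range(0, len(program), width))


def variable_length_starts(program: bytes) -> list[int]:
    """Toy encoding (an assumption, not a real instruction set): the first
    byte of each instruction gives its total length in bytes. To find where
    instruction N starts, you must look at every instruction before it."""
    starts = []
    offset = 0
    while offset < len(program):
        starts.append(offset)
        offset += program[offset]  # read this instruction to find the next one
    return starts


# Three 4-byte instructions: the boundaries are known without reading the bytes.
risc_program = bytes(12)
print(fixed_length_starts(risc_program))     # [0, 4, 8]

# Toy variable-length program holding instructions of 1, 3, and 2 bytes.
cisc_program = bytes([1, 3, 0, 0, 2, 0])
print(variable_length_starts(cisc_program))  # [0, 1, 4]
```

The first function never touches the program's contents, which is the whole point: the chip can find the next instruction before it has finished with the current one.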

By themselves, these changes merely reduced the complexity of decoding variable-length instructions, and allowed the instruction execution circuitry to be streamlined a bit. Not major wins. But this also opened up a whole new avenue for optimization, which is loosely called "pipelining." (Strictly speaking, running two instructions side by side like this is "superscalar" execution.)

Basically, you build two complete sets of the instruction execution circuitry on the same chip. These two copies are called "Processor Cores," or "Pipelines." Since each instruction is the same length, the chip doesn't have to finish decoding the first instruction to see where the second one starts. Each clock cycle, the second pipeline can look at the instruction immediately following the one the first CPU core is executing, and if the second instruction doesn't depend on any results of the first, the second core can execute the second instruction then and there, in the same clock cycle. Under ideal circumstances, the processor can literally execute two instructions each clock cycle. When circumstances aren't ideal, the second pipeline pulls a "wait state," just like the clock-doubled 486 (described yesterday) did when it ran out of prefetched instructions in the cache.

Since the end of the previous paragraph was pure, unadulterated technobabble, I'd like to step back and demonstrate the idea using a portion of a recipe to represent a computer program:

Measure 1 cup flour
Add flour to mixing bowl
Count 2 eggs
Crack eggs into bowl
Mix

As you'll notice, a lot of instructions depend on the results of previous instructions. You can't put the flour in the bowl before you've measured it; you can't crack the eggs before you've counted them; and you can't mix the ingredients before they're both in the bowl. But other pairs of instructions are independent, such as adding the flour to the mixing bowl and counting the eggs. If a second set of instruction execution circuitry is looking over the first's shoulder, the flour can be put in the bowl and the proper number of eggs selected at the same time, basically counting as only one step.

It works even better if the program is re-ordered to pair off independent instructions, like so:

Measure 1 cup flour
Count 2 eggs
Add flour to mixing bowl
Crack eggs into bowl
Mix

Now the measuring and counting can be done in one step, and adding both ingredients to the bowl can be done in a second step, followed by mixing. A process that took five steps one at a time now takes only three. This is a tremendous speed improvement, achieved by rethinking and reorganizing the processor's instructions so that more transistors can be thrown at the problem to do more stuff at the same time, in parallel.
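For the curious, the recipe can be run through a toy simulator. This is only a sketch under simple assumptions: every step takes exactly one clock cycle, steps are issued strictly in program order, and a step can run only once everything it depends on finished in an earlier cycle. The names `DEPENDS_ON` and `count_cycles` are invented for this example.

```python
# What each recipe step needs to have finished first, read off the recipe.
DEPENDS_ON = {
    "measure flour": set(),
    "add flour": {"measure flour"},
    "count eggs": set(),
    "crack eggs": {"count eggs"},
    "mix": {"add flour", "crack eggs"},
}


def count_cycles(program: list[str], width: int = 2) -> int:
    """Count clock cycles for an in-order chip that can start up to
    `width` instructions per cycle."""
    done = set()
    cycles = 0
    i = 0
    while i < len(program):
        issued = []
        # Keep issuing in order until the pipelines are full or the next
        # step needs a result that is not finished yet.
        while (i < len(program) and len(issued) < width
               and DEPENDS_ON[program[i]] <= done):
            issued.append(program[i])
            i += 1
        if not issued:
            raise ValueError(f"{program[i]!r} appears before something it needs")
        done.update(issued)  # results become visible on the next cycle
        cycles += 1
    return cycles


original = ["measure flour", "add flour", "count eggs", "crack eggs", "mix"]
reordered = ["measure flour", "count eggs", "add flour", "crack eggs", "mix"]

print(count_cycles(original, width=1))   # 5: one pipeline, one step at a time
print(count_cycles(original, width=2))   # 4: only "add flour" and "count eggs" pair up
print(count_cycles(reordered, width=2))  # 3: two pairs, then mix
```

Note that the second pipeline helps even with the original ordering (four cycles instead of five), but it takes the re-ordered program to get all the way down to three.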

Of course, RISC had some downsides. Without the variable instruction length trick central to CISC instruction sets, programs took up much more memory. The chip also required twice as many transistors for two complete processor cores, which meant it took up twice as much space on the silicon wafer, which made it twice as expensive to manufacture. And RISC programs had to have their instructions in the right order to take full advantage of the second processor core, which required more advanced program development tools that weren't available when the first generation of RISC chips hit the market. But like all optimizations, the trade-off was considered worth it if you could execute two instructions per clock cycle, even part of the time.

Soon, RISC chips were everywhere. IBM (NYSE: IBM) and Motorola (NYSE: MOT) paired to produce the PowerPC; Digital Equipment came out with the Alpha; and a tiny start-up called Sun (Nasdaq: SUNW) came up with the SPARC design for Unix machines, which it wanted to hook up to something called the "Internet." All of these designs were expected to kill off Intel, which had reached the limits of what CISC could do with the 486.

Obviously Intel couldn't abandon the IA32 instruction set, or it would lose the installed base of software that kept its existing customers loyal. How could it make a RISC chip that executed CISC instructions? By brute force, of course. Remember how the inside of the 486 was separated from the rest of the system so it could run at a different speed? The Pentium took this concept a step further, by having the inside execute different instructions than the outside.

Take a 486 as described yesterday with even more on-chip cache fed by an even cleverer prefetch unit. Separately, design a 2-core RISC chip. Then, glue the RISC cores to the nouveau-486's cache with a layer of translation circuitry that converts each IA32 instruction into one or more RISC instructions and re-orders the resulting RISC instructions (whenever possible) to keep both processor cores humming along at full speed. Voilà, you have the Pentium.
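Here is a rough Python sketch of just the translation step (the re-ordering is left out). The instruction names on both sides of the table are invented for illustration; they are not real IA32 encodings, and they are not the actual internal operations of any Intel chip.

```python
# Hypothetical translation table: one complex instruction on the left,
# the simple steps it breaks down into on the right.
TRANSLATIONS = {
    "MOV reg, reg": ["move"],
    "ADD reg, [mem]": ["load tmp", "add"],
    "ADD [mem], reg": ["load tmp", "add tmp", "store tmp"],
}


def translate(cisc_program: list[str]) -> list[str]:
    """Expand each complex instruction into one or more simple ones,
    keeping the original program order."""
    risc_program = []
    for instruction in cisc_program:
        risc_program.extend(TRANSLATIONS[instruction])
    return risc_program


program = ["MOV reg, reg", "ADD [mem], reg"]
print(translate(program))  # ['move', 'load tmp', 'add tmp', 'store tmp']
```

Two instructions went in and four came out, which is how a small amount of fetched program can turn into plenty of work for the cores.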

It's the sort of painfully brilliant design that is best admired with several Tylenol handy. Yet the Pentium worked better than even the pure RISC designs did, because it still had the CISC advantage of small instruction size. Remember, the prefetch unit is the bottleneck, sucking data from the system's Random Access Memory (RAM) chips at the motherboard's maximum speed. The Pentium doubled the memory bus from 32 bits to 64 bits in expectation of executing two instructions per clock cycle on a regular basis, but you could still clock-double or clock-triple the thing to your heart's content. (After all, the processor is covered with heat sinks and the rest of the motherboard isn't.) So if the instructions the processor is fetching out of memory are small, instead of getting only two at a time it might wind up sucking in three or four. Combine this with each CISC instruction's potential to expand into multiple RISC instructions, and even a clock-tripled Pentium isn't likely to hit wait states very often.
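The arithmetic behind "two at a time" versus "three or four" is easy to check. In this sketch the bus widths and the 4-byte RISC instruction come from the discussion above, while the 2-byte average for CISC instructions is an assumed figure chosen only to show the effect.

```python
def instructions_per_fetch(bus_bits: int, avg_instruction_bytes: float) -> float:
    """How many instructions arrive, on average, in one trip across the memory bus."""
    bus_bytes = bus_bits / 8
    return bus_bytes / avg_instruction_bytes


print(instructions_per_fetch(32, 4))  # 1.0  (32-bit bus, 4-byte instructions)
print(instructions_per_fetch(64, 4))  # 2.0  (64-bit bus, 4-byte instructions)
print(instructions_per_fetch(64, 2))  # 4.0  (64-bit bus, assumed 2-byte average)
```

Smaller instructions mean more of them per fetch, so the same bus keeps the cores fed for longer.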

This is why the Pentium was so much faster than the 486. Now you know.

I could spend a whole day on the changes between the first-generation Pentium and the modern Pentium III or Athlon, but I only have four days this week instead of five, so something had to get dropped. If you're really curious, bug me and I might post about it on the discussion boards. In brief, the designers got a whole lot more clever about translating IA32 instructions into RISC microcode and threw a whole lot more transistors at the translation unit. Ways they got cleverer include "deeper pipelines," "branch prediction," "speculative execution," and "multi-level cache."

Tomorrow, RISC gives way to VLIW for the next generation of designs: Intel's Merced vs. Transmeta's Crusoe.

- Oak