While we have been talking about AMD's high-speed Athlon Thunderbirds and Palominos, Intel actually beat AMD to the gun by releasing Pentium IV Willamette in November of 2000. Pentium IV was exactly what Intel needed to again take the torch from AMD. Pentium IV is a truly new CPU architecture and serves as the beginning to new technologies we will see for the next several years. The new NetBurst architecture is designed with future speed increase in mind, meaning P4 is not going to fade away quickly like Pentium III near the 1 GHz mark.
According to Intel, NetBurst is made up of four new technologies: Hyper Pipelined Technology, Rapid Execution Engine, Execution Trace Cache and a 400MHz system bus. Let's look at the first three, since they require some explanation:
Hyper Pipelined Technology There are a couple of ways to increase the speed of a processor. One is to decrease the die size. Technology in this regard is developed quickly, but not quickly enough. The P5 core saw its limit quickly and so did the P6 core (which is why Pentium III was limited at around 1 GHz). The technology to move into a smaller die size was not yet ready at the time of the Willamette release, so Intel moved to plan B. Plan B is to change the design of the CPU pipeline so that it is wider, can accommodate more instructions. This is what Intel did. Hyper Pipelined Technology refers to Intel's expanding of the CPU pipeline from 10 stages (of the P6) to 20 stages. This effectively makes the data pipe (bad term, but descriptive) wider, and allows each stage to do actually less per clock cycle than the P6 core. The fact that each stage actually does less per clock cycle is what gives this design room for expandability. It is analogous to expanding a street highway - you add more lanes and for awhile each lane has less traffic, but eventually traffic increases and the road can handle much more traffic. The tradeoff in simply expanding this pipeline to a bunch of stages is that it takes the processor longer to recover from mistakes in the branch level prediction, being that it has to basically start over with 20 stages rather than a shorter 10-stage pipeline. The P4, though, has a newly advanced branch predictor to help with this problem.
Rapid Execution Engine The Pentium IV contains 2 arithmetic logic units and they operate at twice the speed of the processor. While this might sound like absolute heaven, it is good to keep in mind that they had to do it this way due to the pipeline design in order to even keep integer performance up to that of the Pentium III. So, this is really a necessary design change due to the increase pipeline size.
Execution Trace Cache Intel also did some re-working of the P4's internal cache in order to nullify the effects of a mistake in branch prediction that can be a real lag with a 20-stage pipeline. First, they increase the branch target buffer size to eight times that of the Pentium III. This cache is the area from which the branch predictor gets its data. Secondly, Intel reduced the size of the L1 data cache to only 8K in order to reduce the latency of the cache. This, no doubt, increases the need for the 256 KB on-die L2 cache, and the latency of that has been improved on the P4 as well. Lastly, Intel added a execution trace cache. This is a new cache that can hold instructions that are already decoded and ready for execution. This means that the processor does not have to again waste time decoding every instruction when a branch prediction error occurs. Instead, it can just go to this 12K cache and retrieve the operation and go.
The early Pentium 4's made use of the Socket 423 interface. One of the reasons for the new interface is the addition of heatsink retention mechanisms to either side of the socket. This is a move to help owners avoid the dreaded mistake of crushing the CPU core by tightening the heatsink down on it too tightly. The retention bases hold the heat sink onto the CPU. Socket 423 was short-lived, and Pentium IV quickly moved to Socket 478 with the release of the 1.9 GHz. Also, P4 was, at its launch, associated exclusively with Rambus RDRAM. Intel was stuck in this agreement with Rambus, and this was an obvious hurdle for promotion, being that most computer users to not have Rambus and don't wish to buy any. So, early retail P4's actually came packaged with two 64MB sticks of RDRAM. With chipset support later coming, DDR mating with the Pentium IV eventually came.
Pentium IV's, as you might expect, were and still are on the expensive end of things. The new core was quite big when compared to other processors and the cost to product it was innately higher. In early 2002, Intel announced a new edition of the Pentium IV based on the Northwood core. The big news with this is that Intel leaves the larger 0.18 Willamette core in favor of this new 0.13 micron Northwood. This shrunk the core and therefore allowed Intel to not only make Pentium IV's cheaper but also make more of them. The core is still bigger than that of the Athlon XP, but this is explained by the fact that Intel increased the L2 cache from 256 KB to 512 KB for Northwood. This raises the transistor count from 42 million for Willamette to 55 million for Northwood. Northwood was first released in 2 GHz and 2.2 GHz versions, but the new design gives P4 room to move up to 3 GHz quite easily. It was recently released at 2.53 GHz using a 533 MHz front side bus. Other than that, Northwood is architecturally the same as Willamette.