The 4004 had 46 instructions, using only 2,300 transistors in a 16-pin DIP. It ran at a clock rate of 740kHz (eight clock cycles per CPU cycle of 10.8 microseconds) - the original goal was 1MHz, to allow it to compute BCD arithmetic as fast (per digit) as a 1960's era IBM 1620.
The 4040 (1972) was an enhanced version of the 4004, adding 14 instructions, larger (8 level) stack, 8K program space, and interrupt abilities (including shadows of the first 8 registers). Should Pioneer 10 and Pioneer 11 ever be found by an extraterrestrial species, the 4004 will represent an example of Earth's technology.
[for additional information, see Appendix E]
It was bit serial to reduce connections between chips, with highly parallel design and high clock rate to compensate. Words were 20 bits (required by the precision of the sensor and control values) and ALU units could perform operations on input bits as they were read in, while bits of the previous result was read out. "Steering Logic" (SL) units switched input signals to output lines (and added or subtracted, if two inputs went to one output), which could be directed to multiplication, division, and special logic units (which acted a little like a Transfer Triggered Architecture). Bits read serially from the ROMS (eight banks with 128 20-bit words, each with its own program counter) directed the data movement and unit operations, but had to be synchronized with data movement making programming difficult (basically microcode). RAM (called Random Access Storage) consisted of units with sixteen 20-bit words. Programming consisted of using the SLs to direct instruction and data words to the function units, which could be hooked to other function units in a pipeline, along with other pipelines in parallel. A separate set of eight ROMs could be used for data.
It took until 1998 to declassify a paper on the 1970 design. Although impressively elegant, it probably didn't warrant that length of secrecy.
It included a 4-bit accumulator, 4-bit Y register and 2 or 3-bit X register, which combined to create a 6 or 7 bit index register for the 64 or 128 nybbles of on chip RAM. A 1-bit status register was used for various purposes in different contexts. The 6-bit PC combined with a 4 bit page register and an optional 1 bit bank ('chapter') register to produce 10 or 11 address bits to 1KB or 2KB of on-chip program ROM. There was also a 6-bit subroutine return register and 4-bit page buffer, used as the destination on a branch, or exchanged with the PC and page registers for a subroutine (amounting to a 1-element stack, branches could not be performed within a subroutine).
An interesting feature of the PC is it was incremented using a feedback shift register, not a counter, so instructions were not consecutive in memory, but since all memory was internal, this was not a problem. Instructions were 8 bits with twelve hardwired, and with a 31X16 element PLA allowing 31 custom microprogrammed instructions. All hardwired instructions were single cycle, and no interrupts were allowed.
It gained fame in the movie "ET: The Extraterrestrial" as the brains in the Texas Instruments "Speak and Spell" educational toy.
The 8080 was used in the Altair 8800, the first widely-known personal computer (though the definition of 'first PC' is fuzzy. Some claim that the 12-bit LINC (Laboratory INstruments Computer) was the first 'personal computer'. Developed at MIT (Lincoln Labs) in 1963 using DEC components, it inspired DEC to design its own PDP-8 in 1965, also considered an early 'personal computer'). 'Home computer' would probably be a better term here, though).
Intel updated the design with the 8085 (1976), which added two instructions to enable/disable three added interrupt pins (and the serial I/O pins), and simplified hardware by only using +5V power, and adding clock generator and bus controller circuits on-chip.
Clock speeds ranged from the original Z-80 2.5MHz to the Z80-H (later called Z80-C) at 8MHz, and later a CMOS version at 10MHz.
Like many processors (including the 8085), the Z-80 featured many undocumented instructions. In some cases, they were a by-product of early designs (which did not trap invalid op codes, but tried to interpret them as best they could), and in other cases chip area near the edge was used for added instructions, but fabrication made the failure rate high. Instructions that often failed were just not documented, increasing chip yield. Later fabrication made these more reliable.
But the thing that really made the Z-80 popular in designs was the memory interface - the CPU generated its own RAM refresh signals, which meant easier design and lower system cost, the deciding factor in its selection for the TRS-80 Model 1. That and its 8080 compatibility, and CP/M, the first standard microprocessor operating system, made it the first choice of many systems.
Embedded varients of the Z-80 were also produced. Hitachi produced the 64180 (1984) with added components (two 16 bit timers, two DMA controllers, three serial ports, and a segmented MMU mapping a 20 bit (1M) address space to any three variable sized segments in the 16 bit (64K) Z-80 memory map), a design Zilog and Hitachi later refined to produce the Z-180 and HD64180Z (1987?) which were compatible with Z-80 peripheral chips, plus variants (Z-181, Z-182). The Z-280 was a 16 bit version introduced about July, 1987 (loosely based on the ill-fated Z-800), with a paged (like Z-180) 24 bit (16M) MMU (8 or 16 bit bus resizing), user/supervisor modes and features for multitasking, a 256 byte (4-way) cache, 4 channel DMA, and a huge number of new op codes tacked on (total of almost 3,500, including previously undocumented Z-80 instructions), though the size made some very slow. Internal clock could be run at twice the external clock (ex. 16MHz CPU with a 8MHz bus), and additional on-chip components were available. A 16/32 bit Z-380 version also exists (1994) with added 32-bit linear addressing mode (16-bit mode is Z-80 and Z-180 binary compatible, but not Z-280 compatible).
Rabbit Semiconductor's Rabbit 2000 (1999/2000?) with a Z-80 derived instruction set which drops some instructions (mostly I/O, some less useful instructions), and adds others (16-bit data, computed address). It also drops dynamic RAM support, because embedded systems more often use stgatic RAM, and adds serial, parallel, and inter-processor communication units. Program space is extended to 20 bits using an 8-bit page register, rather than the Z-180's MMU.
The Z-8 (1979) was an embedded processor with on-chip RAM (actually a set of 124 general and 20 special purpose registers) and ROM (often a BASIC interpreter), and is available in a variety of custom configurations up to 20MHz. Not actually related to the Z-80.
Unlike the 8080 and its kind, the 6502 (and 6800) had very few registers. It was an 8 bit processor, with 16 bit address bus. Inside was one 8 bit data register, two 8 bit index registers, and an 8 bit stack pointer (stack was preset from address 256 ($100 hex) to 511 ($1FF)). It used these index and stack registers effectively, with more addressing modes, including a fast zero-page mode that accessed memory addresses from address 0 to 255 ($FF) with an 8-bit address that speeded operations (it didn't have to fetch a second byte for the address).
Back when the 6502 was introduced, RAM was actually faster than microprocessors, so it made sense to optimize for RAM access rather than increase the number of registers on a chip. It also had a lower gate count (and cost) than its competitors.
The 650x also had undocumented instructions.
The CMOS 65C02/65C02S fixed some original 6502 design flaws, and the 65816 (officially W65C816S, both designed by Bill Mensch of Western Design Center Inc.) extended the 650x to 16 bits internally, including index and stack registers, with a 16-bit direct page register (similar to the 6809), and 24-bit address bus (16 bit registers plus 8 bit data/program bank registers). It included an 8-bit emulation mode. Microcontroller versions of both exist, and a 32-bit version (the 65832) is planned. Various licensed versions are supplied by GTE (16 bit G65SC802 (pin compatible with 6502), and G65SC816 (support for VM, I/D cache, and multiprocessing)) and Rockwell (R65C40), and Mitsubishi has a redesigned compatible version. The 6502 remains surprisingly popular largely because of the variety of sources and support for it.
The 6502-based Apple II line (not backwards compatible with the Apple I) was among the first microcomputers introduced and became the longest running PC line, eventually including the 65816-based Apple IIgs The 6502 was also used in the Nintendo entertainment system (NES), and the 65816 is in the 16-bit successor, the Super NES, before Nintendo switched to MIPS embedded processors.
Other features were one of the first multiplication instructions of the time, 16 bit arithmetic, and a special fast interrupt. But it was also highly optimized, gaining up to five times the speed of the 6800 series CPU. Like the 6800, it included the undocumented HCF (Halt Catch Fire) instruction to incrementally strobe the address lines for bus testing ("jump to accumulator (A or B)" in the 6800, implemented and documented as $00 in the 68HC11 which is described below).
The 6800 and 6809, like the 6502 series, used a single clock cycle to generate the timing for four internal execution stages by using the rising and falling edges of the base cycle (not just rising edges), and another clock 90 degrees out of phase (giving two rising and two falling edges per cycle) - this allowed instructions to execute in one external 'cycle' rather than four for most CPUs, such as the 8080, which used the external clock directly, so an equivalent instruction would take four cycles, meaning a 2MHz 6809 would be roughly equivalent to a 8MHz 8080. This is different from clock-doubling, which uses a phase-locked-loop to generate a faster internal clock (for the CPU) which is synchronised with an external clock (for the bus). Motorola later produced CPUs in this line with a standard four-cycle clock. The 680x and 650x only accessed memory every other cycle, allowing a peripheral (such as video, or even a second cpu) to access the same memory without conflict.
The 6800 lived on as well, becoming the 6801/3, which included ROM, some RAM, a serial I/O port, and other goodies on the chip (as an embedded controller, minimizing part counts - but expensive at 35,000 transistors. The 6805 was a cheaper 6801/3, dropping seldom used instructions and features). Later the 68HC11 version (two 8 bit/one 16 bit data register, two 16 bit index, and one 16 bit stack register, and an expanded instruction set with 16 bit multiply operations) was extended to 16 bits as the 68HC16 (additional 16-bit accumulator E, three index registers IX, IY, IZ, plus extension registers to add 4 bits to addresses and accumulator E for a 1M address space, plus 16-bit multiply registers HR and IR and 36-bit AM accumulator), and a lower cost 16 bit 68HC12 (May 1996). It remains a popular embedded processor (with over 2 billion 6800 varients sold), and radiation hardened versions of the 68HC11 have been used in communications satellites. But the 6809 was a very fast and flexible chip for its time, particularly with the addition of the OS-9 operating system.
Of course, I'm a 6809 fan myself...
The Am2901, from Advanced Micro Devices, was a popular 4-bit-slice processor. It featured sixteen 4-bit registers and a 4-bit ALU, and operation signals to allow carry/borrow or shift operations and such to operate across any number of other 2901s. An address sequencer (such as the 2910) could provide control signals with the use of custom microcode in ROM.
The Am2903 featured hardware multiply.
Legend holds that some Soviet clones of the PDP-11 were assembled from Soviet clones of the Am2901.
AMD also produced what is probably the first floating point "coprocessor" for microprocessors, the AMD 9511 "arithmetic circuit" (1979), which performed 32 bit (23 + 7 bit floating point) RPN-style operations (4 element stack) under CPU control - the 64-bit 9512 (1980) lacked the transcendental functions. It was based on a 16-bit ALU, performed add, subtract, multiply, and divide (plus sine and cosine), and while faster than software on microprocessors of the time (about 4X speedup over a 4MHz Z-80), it was much slower (at 200+ cycles for 32*32->32 bit multiply) than more modern math coprocessors are.
It was used in some CP/M (Z-80)
systems (I heard it was used on an S-100 bus math card for NorthStar
systems, but that was in fact used a 74181 BCD (Binary Coded Decimal)
ALU, and ten PROM chips for microcode). Calculator circuits (such as the
National Semiconductor MM57109 (1980), actually a 4-bit NS COP400
processor with floating point routines in ROM) were also sometimes used,
with emulated keypresses sent to it and results read back, to simplify
programming rather than for speed.
While the 8048 used 1-byte instructions, the 8051 has a more flexible 2-byte instruction set. It has eight 8-bit registers, plus an accumulator A. Data space is 128 bytes accessed directly or indirectly by a register, plus another 128 above that in the 8052 which can only be accessed indirectly (usually for a stack). External memory occupies the same address space, and can be accessed directly (in a 256 byte page via I/O ports) or through the 16 bit DPTR address register much like in the RCA 1802. Direct data above location 32 is bit-addressable. Data and program memory share the address space (and address lines, when using external memory). Although complicated, these memory models allow flexibility in embeded designs, making the 8051 very popular (over 1 billion sold since 1988).
The Siemens 80C517 adds a math coprocessor to the CPU which provides 16 and 32 bit integer support plus basic floating point assistance (32 bit normalise and shift), reminiscent of the old AMD 9511. The Texas Instruments TMS370 is similar to the 8051, Adding a B accumulator and some 16 bit support.
The PIC has a large register set (from 25 to 192 8-bit registers, compared to the Z-8's 144). There are up to 31 direct registers, plus an accumulator W, though R1 to R8 also have special functions - R2 is the PC (with implicit stack (2 to 16 level)), and R5 to R8 control I/O ports. R0 is mapped to the register R4 (FSR) points to (similar to the ISAR in the F8, it's the only way to access R32 or above).
The 16x is very simple and RISC-like (but less so than the RCA 1802 or the more recent Atmel AVR microcontroller. It has only 33 fixed length 12-bit instructions, including several with a skip-on-condition flag to skip the next instruction (for loops and conditional branches), producing tight code important in embedded applications. It's marginally pipelined (2 stages - fetch and execute) - combined with single cycle execution (except for branches - 2 cycles), performance is very good for its processor catagory.
The 17x has more addressing modes (direct, indirect, and relative - indirect mode instructions take 2 execution cycles), more instructions (58 16-bit), more registers (232 to 454), plus up to 64K-word program space (2K to 8K on chip). The high end versions also have single cycle 8-bit unsigned multiply instructions.
The PIC 16x is an interesting look at an 8 bit design made with slightly newer design techniques than other 8 bit CPUs in this list - around 1978 by General Instruments (the 1650, a successor to the more general 1600). It lost out to more popular CPUs and was later sold to Microchip Technology, which still sells it for small embedded applications. An example of this microprocessor is a small PC board called the BASIC Stamp, consisting of 2 ICs - an 18-pin PIC 16C56 CPU (with a BASIC interpreter in 512 word ROM (yes, 512)) and 8-pin 256 byte serial EEPROM (also made by Microchip) on an I/O port where user programs (about 80 tokenized lines of BASIC) are stored.