by Tom Jennings
16 September, 1999
revised 9 January, 2001 (Source URLs updated, lessened errors in national characters in ASCII-1967)
revised 22 January, 2000
revised 6 October, 1999
Entire contents copyright
Tom Jennings 1999-2001.
All rights reserved.
ASCII is not art. It's a code, a way of hiding things within a smaller thing.
This document is about character codes, specifically a history of ASCII(1), the American Standard Code for Information Interchange, and its immediate ancestors: FIELDATA, ITA2, Murray's telegraphy code, Baudot's telegraphy code, and Morse's telegraphy code. It involves some forensic bitology.
ASCII, born at the dawn of the modern computer age (1958--1965), is perfectly representative of the period: clean, spare, optimistic, and ultimately ignorant of anything but its immediate past; its view of the future, our present, rosy because it was so naive.
The codes covered here are the beginning of a crude alphabet for our new machines' pidgin, a baby language, for better and worse, mindlessly mumbled sub-atomic particles of thoughts. There is a thread of research that believes that the internal dialog of human thought is formed by language, not the reverse, and I tend to agree with them. Our character codes certainly shape the things we express and think of electronically.
Character codes are a form of information compression, to accommodate the extreme lack of bandwidth available in paper, ink, or the tapping armature of a telegraph. The concept of characters and character-codes in ASCII is utterly inseparable from our Western, roman alphabet culture. You need the "one time pad"(2) of Western culture to understand it or make use of it at all.
If this history seems too conveniently linear -- it is. My approach was to start with the survivor -- ASCII-1967 -- and trace its direct lineage backwards, then write from the oldest forward. And I vastly simplified things, otherwise it would require a thousand pages and a large grant to pull off(3). This isn't a detailed history of the development of character codes per se, but of the codes themselves, the specific meaning of the individual character codes.
The history of electrical or electronic communications really means the history of serial communications. Serial means a symbol at a time, one after the other, in an agreed-upon sequence. The concept isn't arbitrary, it seems to be inherent in human language. Words are spoken one at a time, words have a beginning and an end. While vision is "parallel", broad-side, both alphabet-based and iconic languages look at one symbol or ideograph at a time.
Character-based communications is fundamentally different from things like telephony. Characters reduce communications to discrete symbols (incorrectly called "digital"), while things like telephones and facsimile ("fax") are continuously variable (reasonably called "analog", as the vibrations in an earphone are an analogy, a flawed copy, of the vibrations your voice makes in a microphone).
The fantastic advantage of discrete symbolic communications is that the meaning can be modified mechanically. A trivial and silly example: every time you write "PLEASE SEND ME 9 FRUITCAKES" a machine transporting your email could change it to "PLEASE SEND ME 900 FRUITCAKES". (It also helps that there are so many layers of mediation you can't tell if a person or a machine wrote the symbols "9", "0", "0", etc.) This is because "meaning" is accessible; it is agreed that "9" is a number, a quantity. The meaning in the spoken-sound "nine" is quite well hidden to machines, so far.
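The fruitcake mischief above can be sketched as one line of mail-transport meddling; it works precisely because the message is made of discrete, machine-accessible symbols. The function name and behavior are purely illustrative, of course:

```python
# A hypothetical, mischievous mail relay: because the message is discrete
# symbols, and "9" is agreed to be a quantity, a machine can rewrite the
# meaning mechanically. Illustrative only; no real relay does this (we hope).
def meddling_relay(message):
    return message.replace("SEND ME 9 ", "SEND ME 900 ")
```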
Character codes are human codes, for surrogate machine organs. The history of each code is as complex as any human endeavor, only a lot more boring. But if you've persisted so far you might as well continue.
And finally, I will utterly ignore the most obvious end-result of all this electrical communication; the printed word on paper, because at that point, as far as character codes go, it all stops and leaves the mechanical/electrical/electronic realm.
We need to be clear on a few things before we start. I've tried to limit jargon to an absolute minimum, but a few definitions are used throughout the text:
The standard story has it that Samuel Morse invented the electric recording telegraph in 1837. Since there's a code called Morse code you'd think he'd created that too, and at the same time, but that's not how it happened.
Morse's original signalling scheme didn't involve transmitting codes for characters at all; he went one step further, and transmitted what was essentially just a numeric code. At each end of the telegraph, the receiver and the sender both would have a giant dictionary of words, each word numbered. Intelligence would be transported by the sender looking up each word in the "dictionary", obtaining its numerical code, and transmitting the numeric code for each word, for every word in the message. (It may have been his intent to automate this process, with each transported number controlling a mechanism that located each word.) The receiver would obviously perform the opposite function, receiving the numerical codes and converting them to words using the big dictionary. A reasonable enough approach, considering there was no (zero) experience to fall back on, but it seems to have disappeared by 1844, when the famous "WHAT HATH GOD WROUGHT" message was sent. Morse's assistant, Alfred Vail, allegedly worked up the code we think of as "Morse's code", which, though still cryptic, had a small, finite(9) set of symbols that were already well understood -- the roman alphabet.
What may not be obvious is that Morse's system is also a recording system; it recorded its signals on a narrow strip of paper with a pen, making little wiggles when the voltage on the wire changed; the operator would decode these afterwards. The practice of decoding Morse's code by ear didn't happen for a few decades, and only once it became a character code. Later systems pricked holes with a needle. (This appears to be the direct and immediate predecessor to all "paper tape" storage systems nearly always associated with teleprinter equipment.) One scheme raised bumps on paper; Morse used it in 1844 to receive the famous "WHAT HATH GOD WROUGHT" message from the Supreme Court in Washington D.C. to Baltimore, shown below. (The original tape is in the Smithsonian Museum in Washington D.C.)
(There's an interesting story within this early use of "tele-communications", in the inscription Samuel Morse wrote along the top of this historic artifact:
This sentence was written from Washington by me at the Baltimore Terminus at 8h. 45 min. on Friday, May 24th. 1844, being the first ever transmitted from Washington to Baltimore by telegraph, and was indited (sic?) by my much loved friend Annie G. Ellsworth. Signed...
He appears just as confused about 'here-and-there' as any novice internet emailer, with the "from Washington by me at...Baltimore", repeated in the same brief sentence.)
| O | _ _ _ |
| J | . _ _ _ |
| 2 | . . _ _ _ |
| 8 | _ _ _ . . |
| 1 | . _ _ _ _ |
| 9 | _ _ _ _ . |
| 0 | _ _ _ _ _ |
SPECIAL NOTES ON THIS TABLE: Putting telegraphy code into a table is problematic; unlike the rest of the codes in this document, imposing modern base-two conventions on telegraphy codes isn't so easy. For comparison purposes, I've arranged the code into a table based solely upon the length of the two asserted states, short and long, aka dot and dash (or dit and dah, etc), to attempt a correlation with later codes to see if there are any length-based criteria for the seemingly arbitrary arrangement of characters in the table. You decide if it was worth it. Suggestions welcome. Also, shown here is the modern International Morse Code; rather than attempt to document the original numeric code, I thought it more useful to start with the first character-based code. From what I can see of the 1844 tape, shown further above, the alphabet looks the same; the International code varies mostly in a number of punctuation symbols, which are contentious anyways. Most of the punctuation symbols are six or more elements. Further, for comparison purposes, I've only included the punctuation symbols that were used in the later Baudot and early teleprinter codes, hopefully more representative of the period than the current rich set.
Telegraphy is about pushing symbols over a distance using a long wire. Everything else is secondary to this goal.
In 1837, precision things were made of wood and brass and steel, purely mechanical. Electricity had recently become regularized ("understood" is too strong a word), and the electro-magnet (copper wire wound 'round a steel core, a movable armature hung closely but not touching over it) was the arbiter between the electrical and the mechanical; a switch, metal moved by a human hand or machine, the arbiter between the mechanical and the electrical world. A long and vastly expensive wire between the two, with a battery to provide electrical power, provides instantaneous communication between places possibly miles apart.
Samuel Morse's system is today called "digital" but is more accurately called "discrete"; with a switch, controlled by a human hand or a machine, battery voltage is applied or removed from the wire, and detected instantaneously at the other end. The states are "discrete" in that they do not rely on finely measuring the voltage; its mere presence or absence suffices.
Skipping theory, information is sent from one end to the other by changing the state of the wire over time, in a manner agreed-upon at both ends.
Morse's first system, in 1837, wasn't very nice. The sender used a cumbersome code, involving a metal slug with coded notches, one for each possible word(16)(7). The sending machine "felt" the notches across the top of the slug, and impressed on/off voltages on the wire. The far-away receiver ticked up and down in time with the on/off voltages, inscribing marks on a strip of paper. The marks were looked up in a book, a "dictionary" of codes and matching words, the words written down, and the message deciphered.
By 1844 this awful scheme was abandoned in favor of a simple alphabetic code, still making scratches on paper strips, but no longer needing tedious, bulky and expensive dictionaries and confusing metal slugs. We need not think of it ever again.
The new alphabetic code, now called "Morse's Code" (even though Morse didn't create it, it was used on his hardware and he got the credit) recognizes four different states of the wire: voltage-on long ("dah" or dash), voltage-on brief ("dit" or dot), voltage-off long (space between characters and words), voltage-off brief (space between dits and dahs)(5). (When sent by hand, it's casually considered to have only two symbols, "dit" and "dah".)
Like all modern codes, Morse code is built up from smaller symbols; characters (letters, numbers, punctuation, etc) are encoded as a series of dits, dahs, and spaces between.
It is a variable-length code, designed so that the most common characters are short -- the letter "E" is a single symbol, while "1", occurring less often, is five symbols (considering that for human purposes the dit or dah and the brief space that follows it is a unit). The vowels are all brief and simple, and less-common letters use longer sequences, not that there was much science behind the letter-frequencies, apparently, as there isn't much correspondence between modern, machine-counted letter frequencies and the Morse code lengths beyond the first two or three characters.
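The length-vs-frequency claim above is easy to check mechanically. Below is the International Morse alphabet and one commonly cited modern frequency ranking ("ETAOIN..."); the ranking is used here only for illustration, not as the one Vail worked from:

```python
# International Morse Code for the letters, dot = '.', dash = '-'.
MORSE = {
    'A': '.-',   'B': '-...', 'C': '-.-.', 'D': '-..',  'E': '.',
    'F': '..-.', 'G': '--.',  'H': '....', 'I': '..',   'J': '.---',
    'K': '-.-',  'L': '.-..', 'M': '--',   'N': '-.',   'O': '---',
    'P': '.--.', 'Q': '--.-', 'R': '.-.',  'S': '...',  'T': '-',
    'U': '..-',  'V': '...-', 'W': '.--',  'X': '-..-', 'Y': '-.--',
    'Z': '--..',
}

FREQ_ORDER = "ETAOINSHRDLUCMFWYPGVBKQJXZ"  # one common modern ranking

# Element count of each letter's code, in frequency order; the first few
# are short, but the correlation falls apart quickly, as the text notes.
lengths = [len(MORSE[c]) for c in FREQ_ORDER]
```

The two most frequent letters, E and T, are indeed the two one-element codes; past that, the match is rough.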
Note that there are no "format effectors", that is, codes that control how the transmitted symbols are to be displayed; that's a little too modern yet. There is also no non-printing space character yet; like the Arabic invention of zero, it wasn't needed until a positional notation came into use with mechanical teleprinters, some 50 years in the future. In telegraphy, you simply pause briefly.
The beginnings of modern serial communications
In France, Emile Baudot designed his own "printing telegraph" system in 1874. The code itself was developed by two of his cohorts, Johann Gauss and Wilhelm Weber as part of the overall system(), though apparently the code was used for cryptography by Sir Francis Bacon as early as 1605(15). Unlike Morse's code, all of the symbols in "Baudot's" code are the same length -- five symbols, making mechanical encoding, and more importantly, decoding, vastly easier. The design is quite alien by today's standards, a multiple-wire synchronous multiplex system, where the human operators did the "time slicing". Codes were generated by a device with five piano-like keys, operated with two fingers on the left hand, and three from the right. Synchronization with the "network" was done by the human operator listening for a "cadence signal", and took quite some skill to operate. Printing was done automatically and mechanically. As crude as the hardware is, the fixed-length code was a breakthrough (if an obvious one in hindsight).
Another important side effect of Baudot's method is that it relies on only two states of the wire; the presence of a voltage, or not. This provides an added level of reliability by lowering the possibility of errors, and has a sound mathematical basis later expanded upon by Claude Shannon in the mid 20th century.
A schematic Baudot keyboard is shown to the right; note how the fingers are labelled. The fingers of the left hand, IV and V, denote rows in the table to the left; the three fingers of the right hand, I, II and III, form the column number; eg. finger I by itself is 1 ("A"), II by itself is 2 ("E"), both I and II are 3 ("É"), etc. V is the most significant digit (har har); I the least.
There are two sub-tables, marked FIGS and LTRS. The table from which the character indicated by the finger-code comes depends on the most-recently-pressed FIGS or LTRS key; these two keys specify which code table both the sender and receiver should use.
It's not much worse than having to remember to press the SHIFT key on your computer keyboard to get the % character above the 5 key. The alternative was using a sixth finger, which was probably deemed to be even more cumbersome.
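The finger arithmetic just described can be sketched directly. The weights follow the text (finger I is the least significant binary digit, V the most); the example characters (A at code 1, E at 2, É at 3) are from the table above, though the full table varied by implementation:

```python
# Each finger is a binary digit: I is least significant, V most, so a
# chord of pressed fingers yields a code number in the range 1..31.
WEIGHTS = {'I': 1, 'II': 2, 'III': 4, 'IV': 8, 'V': 16}

def chord_to_code(fingers):
    """Return the Baudot code number for a set of pressed fingers."""
    return sum(WEIGHTS[f] for f in fingers)
```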
(Note) Not falling into any particular category are two codes that indicate "error" or "erasure", by pressing the two left-hand keys only. These indicate that the last character should be ignored, and produce a character similar to an ASCII asterisk *. This function mutated into DEL over the next few decades.
Telegraphy codes are not "numerical". These codes are not random, scrambled, insensible, etc. There is subtlety here that recent computer weenies can't grasp easily. It's easy to complain that machine collation is "impossible" with these codes, but keep in mind that it simply wasn't on the agenda to do so; and that machines even capable of such manipulations of character streams were more than half a century in the future.
Where Morse's code was asymmetrical, with frequent letters brief, Baudot's code is arranged to minimize hand and finger motion and fatigue, and to "make sense" to the human hand. Forcing it into this table destroys that vision, but it is done for my modern comparison purposes, with the assumption that you are not using it to learn to send the Baudot code; it makes the code appear random and badly designed, which is not true. (This is exactly the sort of thing Thomas Kuhn wrote about in "The Structure of Scientific Revolutions".)
I cribbed the code from two unattributed scans of apparently old graphics occasionally seen on the net, found at (7) (10). In these tables the finger positions are labelled I II III ... V. Today this numbering implies that I is given a weight of 1, II a weight of 2, etc., and this is further supported by the order of characters in the table of International Telegraph Alphabet 1 (ITA1), where A, code "1" in my table and the forefinger of the right hand, has an impulse in the first (out of five) position only. (This further assumes that ITA1 characters were sent least-significant impulse first; with no previous experience before them, it's hard to imagine that the designers of ITA1 didn't issue first the impulse labelled "1st impulse", followed by the remaining four.) This pattern holds true for the rest of the characters in the ITA1 table.
Note that there is no explicit non-printing space character; FIGURE BLANK and LETTER BLANK were used for this purpose(10), so there were in fact two "space" characters, in addition to their case-select functions. There are no "format effectors", such as CR or LF, though ITA1 does contain both.
Many of the unusual characters varied from implementation to implementation, such as É and others. However a common subset of uncontested characters remains constant throughout all of the codes in this article.
An astute (or obsessive; your call) person by now has noticed that there are only 32 combinations possible using five fingers, not enough for all the codes in the tables above. There are actually 64 symbol positions in the Baudot code. This is handled by splitting the codes into two "cases", and stealing two codes to specify which case to use. Think of an old mechanical typewriter, where "SHIFT" actually moves the paper and platen up or down, to get at the different cases, or rows of characters; "SHIFT" actually doubles the number of available characters using the same number of print hammers. These special symbols are named FIGS ("figures") and LTRS ("letters"). To type COST 700 DOLLARS you would press the following keys:
C O S T (sp) FIGS-SHIFT 7 0 0 (sp) LTRS-SHIFT D O L L A R S
Where (sp) is the space bar, and FIGS-SHIFT and LTRS-SHIFT is the typewriter's "SHIFT" key. This is quite practical, because as this paragraph shows, written communication is mostly letters, so it isn't as awful as it sounds; you send a "LTRS" code, then all of the codes that follow are assumed to be in the "letters" code table; when a "FIGS" code is sent, all of the codes that follow are taken from the "figures" table. Therefore, in current ITA2 code, code number 6 means either I if LTRS was the last case-code sent, or 8 if FIGS was last sent.
It may seem odd that this wasn't used to generate "upper" and "lower" case letters, eg. a vs. A, b vs. B etc.; it wasn't, at least in the electrical communications world, for nearly a half-century. Instead it was used to cram the minimum number of symbols into five digits to handle basic communications: letters, numbers, punctuation. This technique was in common use until the advent of the ASCII code, covered later.
Between 1899 and 1901, Donald Murray, either a New Zealander sheep farmer(10), or a newspaper man(15), depending on who you ask (it could be both; it doesn't hurt to have two careers at the turn of a century), developed an automatic telegraphy system, using what he thought were the best features of the Baudot multiplex system. Rather than the difficult "piano" key encoding system, his scheme used a more reasonable (to modern souls) typewriter-like keyboard mechanism that automatically generated the bit-level codes, and presumably handled the synchronization.
Since people didn't have to impress bit patterns onto the wires with their fingers, he was free to arrange his code for the benefit of the machinery; all that the operators had to do was press the appropriately-labeled key top, the machinery did the dirty work.
Murray's criterion was to minimize the number of mechanical operations per character; the most common characters have codes that contain the fewest 0-to-1 transitions. The letter E, with only one of five bit positions having a 1, moves only one bit's worth of mechanism per character, and punches only one hole in a paper tape (one lever, one punch/die movement), reducing wear on the machinery; no small matter when you consider that a single-spaced, typed page of text is approximately 2000 characters, or more than 10,000 character "bits", each bit having at least one mechanical component that moves, needs oiling, adjustment, etc.
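Murray's cost criterion can be sketched with a popcount. The two single-hole codes shown below are the values E and T ended up with in the descendant ITA2 code; treat the specific bit patterns as illustrative:

```python
# Count the punched holes (1 bits) in a 5-level code; under Murray's
# criterion, fewer holes means less mechanical wear per character.
def holes(code):
    """Number of punched holes (set bits) in a 5-level code."""
    return bin(code & 0b11111).count('1')

# Single-hole codes for the two most frequent English letters (ITA2 values).
E, T = 0b00001, 0b10000
```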
Two items of note here: the codes shown above as LF and CR are shown in my meager data, here and here, as COL and LINE PAGE, respectively. I admit it is flimsy evidence, but it is probably not a coincidence that in the subsequent ITA2 code, these same codes have those functions. So I let my assertion stand(0).
Western Union Telegraph Company purchased the American rights to Murray's design(15), and after modifying the FIGURES case (dumping the peculiar fractionals and other obsolete characters in favor of their own) used the code through the 1950's. Murray's code is the last ad-hoc character code of historical note in this thread; at this point, telegraphy networks were large enough to not tolerate hacker meddling with "better" systems, instead favoring lumbering, international-committee codes with infrequent change. Murray's code, as modified by Western Union, and with the exception of a few "national use" characters, was adopted by the CCITT (International Telegraph and Telephone Consultative Committee; Comité Consultatif International Télégraphique et Téléphonique if you're French) as the ITA #2 code (International Telegraphy Alphabet), covered next in our story.
ITA2 (International Telegraph Alphabet #2) is the real name for the code often called "Baudot", though it retains only the gross characteristics of the Baudot and Murray codes before it, with its five-level code and "case" concept. No teleprinter was ever made that used Baudot's code. Even the ARRL's Radio Amateur Handbook(1) calls ITA2 "Baudot", casually. Baudot's code was replaced by Murray's code in 1901, and ITA2 replaced both by the early 1930's, so virtually all "teletype" equipment made in the U.S. uses ITA2 or the U.S.-national version of the code.
Though ITA2 is structurally similar, it departs from Baudot in a number of ways. The printing characters are again scrambled, in a so-far-mysterious way(6). "Format effectors" appear for the first time; codes that do not cause printing of a symbol, but control the physical arrangement of characters on a page, specifically CR and LF, and I suppose you might call BEL one, in that it might cause a human operator to do something useful. The European characters were dropped, and an explicit non-printing space code was added.
Since the five-bit scheme of telegraphy was retained, the table above is read the same way as before; when LTRS is received, codes that follow are taken from the first two rows of the table, marked "LTRS"; when a FIGS code is received, codes that follow are assumed to be in the last two rows, marked "FIGS". You can see that some symbols and functions appear in both cases, CR, LF, space, NUL, and of course the case control codes LTRS and FIGS.
In the real world of smelly, oily, metal machinery, the handling of LTRS and FIGS is more complicated; some machines revert to LTRS state after the end of a line, and some after every space. This is for various error-recovery reasons and isn't part of the code.
ITA2 and its relatives have very few control codes, reflecting their telegraphy and non-automatic roots; the only real transmission control is WRU, the "WHO ARE YOU?" function, and arguably BEL, which rings the bell. Teletypes and similar ITA2-coded machinery were pressed into service in early computing simply because they were the only symbolic machinery around; it wasn't as if anyone really liked them; they were horribly slow, among other things, and besides, in the primordial age of computing (1930 - 1960), only a few visionaries like Alan Turing or Vannevar Bush saw any need for computers to even have the ability to process alphabetic symbols (most assumed computers were for calculating with numbers, imagine!).
Codes designed to cover both traditional communication and new-fangled computers, such as FIELDATA, added many control functions, and untangled some of the by-now-annoying features such as jumbled alphabets.

Differences between ITA2 and U.S. TTY(1)
The two codes are nearly identical, differing only in the FIGURES case.
This was the electro-mechanical age, where it was far easier to change a teletype's print-head than it was to translate codes from one to another. (I have a type basket for my Model 28 Teletype, circa 1964, that has a very rich character set, but it is scrambled to be compatible with some long-dead IBM punched-card equipment, and is hence unusable.)
Buried in the ITA code is a remnant of what is likely a seventy-five year old compatibility war, probably between two large equipment manufacturers.
The alleged controversy is over the ordering of bits across a row of paper tape, the storage medium of the time. I unearthed this corpse whilst trying to convert some old amateur-radio 5-level tapes to modern disk files. To make a long story short, after reading the physical tapes into my computer, I found that all of the bit patterns on the 5-level tape were reversed, left-to-right. After rigorously double-checking hardware and coding conventions I began to suspect that certain manufacturers' equipment punched holes left-to-right, and some right-to-left. As long as you read tapes on the same brand of equipment they were punched on, the bits came out just fine; the problem is when you punch a tape on Brand X, then read it on Brand Y's reader -- everything is backwards!
| character | bits | reversed | note |
| BLANK | 0 0 0 0 0 | 0 0 0 0 0 | symmetrical |
| space | 0 0 1 0 0 | 0 0 1 0 0 | symmetrical |
| LTRS | 1 1 1 1 1 | 1 1 1 1 1 | symmetrical |
| FIGS | 1 1 0 1 1 | 1 1 0 1 1 | symmetrical |
| CR | 0 1 0 0 0 | 0 0 0 1 0 | equals LF |
| LF | 0 0 0 1 0 | 0 1 0 0 0 | equals CR |
Now this isn't necessarily fatal; if data on a Brand X tape is transmitted through a network built with Brand Y equipment, while the data was en route it would appear scrambled; but upon reaching its destination, it would be just fine when finally read on a Brand X tape reader.
This problem appears to have been solved with a compromise: the characters that are "transmission control" related, the ones that would most affect the movement of this data through the wrong-brand network -- the codes for FIGS, LTRS, space and BLANK -- are bit-wise symmetrical, the same reversed left to right! Further, the codes for CR and LF equal each other when reversed left to right!
The CR/LF reversibility is useful because CR followed by LF produces the same result as LF followed by CR on page printers.
Other symmetrical characters include C, R, Y and Z. I'd be curious to know if any of these characters were used in de facto protocols.
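The reversal argument above is easy to check mechanically; the five-bit values below are taken straight from the table:

```python
# Reverse a 5-bit code left-to-right, as a wrong-brand tape reader would.
def rev5(code):
    """Reverse a 5-bit code left-to-right."""
    return int(format(code & 0b11111, '05b')[::-1], 2)

# Values from the table: the transmission-control characters are
# palindromes, and CR and LF are mirror images of each other.
BLANK, SPACE, LTRS, FIGS = 0b00000, 0b00100, 0b11111, 0b11011
CR, LF = 0b01000, 0b00010
```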
(NOTE: The names of FIELDATA supervisory codes are not standardized.)
The FIELDATA character code is part of an Army communications system that existed from 1957 through the early/mid 1960's; while it saw no use in commercial communications equipment as far as I can tell, it had an enormous influence on the design of ASCII. ASCII's design was well under way when FIELDATA was deployed, and at least one person worked on both standards (Leubbert(9)).
FIELDATA isn't just a code; it's a design for "information interchange" that includes an electrical specification and adapters(8) to make peripheral equipment (teletypes, etc.) compatible with FIELDATA computers, such as the MOBIDIC (Sylvania) and BASICPAC and LOGICPAC (Philco).
All of the FIELDATA equipment is long obsolete, though the code itself lingers to this day, unfortunately, in legacy COBOL software (UNIVAC computers used a mushed-up version of FIELDATA as their internal character code); can you say "Y2K"(7)? For all intents and purposes "FIELDATA" today refers to the character code. It, or a minor variant, is sometimes called the "DoD standard 8-bit code".
FIELDATA explicitly incorporates for the first time the concept of "control codes" (called in FIELDATA Supervisory codes), for in-band signalling. For this purpose, a seventh bit (called the "tag") determines which code table to use; 1 for the alphabetic set, 0 for the Supervisory set. As a computer internal code, these are combined into one table, seven bits in width.
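As a sketch, the tag-bit mechanism is a one-bit table select. The helper names, and the layout with the tag as the most significant of the seven bits, are my reading of the description above, not FIELDATA terminology:

```python
# Seventh ("tag") bit selects the code table: 1 = Alphabetic, 0 = Supervisory.
TAG_BIT = 0b1000000

def is_alphabetic(code7):
    """True if the 7-bit code comes from the alphabetic set."""
    return bool(code7 & TAG_BIT)

def table_index(code7):
    """Strip the tag, leaving the six-bit position within the table."""
    return code7 & 0b0111111
```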
A lot of effort went into transmission error analysis, and appears to have affected the supervisory code choices. In-band signalling was addressed directly, and while the codes today appear quite haphazard, they introduced the concept in a workable way. Message-format functions were included (SCB, ECB, SBK, EBK, EBE), as were error correction/flow control functions (RTT, RTR, NRR, EOF, RPT).
Alphabetic and numerical characters are in collation order; simple arithmetic comparisons perform traditional sorting, with the non-printing-space character positioned before A. Characters in the table are arranged such that the alphabet, numbers, and "math" and graphic character sub-sets are isolatable with simple bit-masks. (This was later taken to an extreme in ASCII.)
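ASCII, mentioned here as the extreme case, makes the bit-mask idea concrete; both properties below are real features of the ASCII table:

```python
# In ASCII, upper- and lower-case letters differ by a single bit, and a
# decimal digit's value is simply its low four bits.
CASE_BIT = 0x20

def to_upper(ch):
    """Fold a lower-case ASCII letter to upper case by clearing one bit."""
    return chr(ord(ch) & ~CASE_BIT) if 'a' <= ch <= 'z' else ch

def digit_value(ch):
    """A decimal digit's numeric value is its low nibble."""
    return ord(ch) & 0x0F
```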
It is a reasonably well-designed code, considering that it broke so much new ground, and we can assume the Army's experience affected the then-current design of ASCII. It was a far-thinking solution to a perennial problem, and the Army had the large equipment base, the need, and most importantly the budget to pull it off.
Needless to say, it didn't solve all problems; FIELDATA is riddled with now-obvious bad ideas, redundancy, missing functions (present in codes before, and after, FIELDATA), etc. But don't be too harsh, it was long long ago in a galaxy far far away (there were probably no more than a few thousand commercial computers in existence at this time).
It is important to note that unlike today, where a character code is assumed to be a single, atomic unit, in FIELDATA the definition is different, and reflects the state-of-the-art of the time. While FIELDATA is essentially a 6-bit code, the definition(9) states that there is an underlying 4-bit "detail" (row of the table above), two "indicator" bits that select one of four rows within an alphabet (Supervisory or Alphabetic), and the "tag" bit, making it variably a six or seven bit code. (Computers of the time considered characters to be six bits in width, which is why many computers had register and memory widths of 18 and 36 bits.)
FIELDATA, and to a lesser extent the ASCII that follows, were designed for hardware decoding. Keep in mind that the free-for-all of symbol and character manipulation by computer didn't happen on a large scale until the mid-1970's; printing and "input/output" was something that was done "off line", computer time considered too precious for mere printing of tables and such. Separate machinery -- often partially mechanical -- was used to read tapes produced by computers, and render them into human-readable form. FIELDATA is designed for that environment, and at least one document(9) explicitly describes the decoding of character codes with a separate wire per character-bit. For example, the seventh bit, called the "tag" bit, which determines which alphabet to use (Supervisory or Alphabetic), could be a wire leading to machinery or circuitry; but when used as an "internal code" (eg. computer character code) it can all be contained in one storage word, as is assumed today.
The inadequacies of ITA2 were screamingly obvious to the communications industry by the late 1950's. A number of efforts were launched to solve the problem; ASCII, the American Standard Code for Information Interchange, the work of committee X3.4 of the ASA, the American Standards Association, was the historic survivor. X3.4 was composed of a reasonable slice of the computing and data communications industry, including IBM (which didn't make use of the standard until the 1980's), AT&T, and its subsidiary Teletype Corporation, maker of the most popular communications equipment of the time. The author of my FIELDATA reference article(9), W. F. Leubbert, was a member.
AT&T had immediate need for a new character code, and was certainly the most dependent on communications equipment. Coupled with its monopoly status (and standardized equipment in every corner of its global network), its monstrous cashflow, and the fact that the leading manufacturer of communications equipment (Teletype) was a captive, whatever it decided upon would become standardized, de facto. Lucky for us, ASCII-1963 was well designed. The original authors claimed that it wasn't designed to represent any alphabet in any computer, stressing "Information Interchange", though I have to wonder if it wasn't simply politic to say this, what with AT&T's peculiar power.
In spite of the "American Standard..." name it appears to have been at least partly coordinated with international efforts; AT&T needed results now and got off the bus with a partial design that Teletype could implement for it immediately. While the table above shows many "UNDEFINED"s, it was probably assumed that lower-case characters would take up many of these slots, which is what happened in the next version of the code.
What FIELDATA did well, ASCII did better. It threw away the last vestige of the 5-bit teleprinter code, the FIGS and LTRS case control functions, in favor of an unambiguous code (eg. the meaning of some particular code value doesn't depend on a previous FIGS or LTRS case code). It kept the COBOL graphic characters, pleasing the military and its contractors (as opposed to including a larger ALGOL set), included the cleaner parts of the data message delimiters and the transmit controls of FIELDATA, rounded out the format effector characters (such as adding the LF line feed function puzzlingly missing from FIELDATA), and improved on the collating issues. Overall it accomplished a lot (though it assuredly made many enemies). It also retained the 4-bit "detail" table structure of FIELDATA, one of many small nods to the Army, a big source of funding for a lot of the committee members' patrons.
It is however a flawed code. The lower-case alphabet got left out (the obvious gaping hole destined for it implies a political solution); confusion over some graphical characters took a decade to untangle: up-arrow vs. caret, left-arrow vs. underscore; and some to this day have never really been untangled (CR and/or LF as logical line-end delimiters). But on the whole, after an overhaul in 1967 this code has stood for a long time (some would argue too long, in fact).
ASCII-1963, formally known as ASA (later USASA, later still ANSI) standard X3.4-1963, is part of a series of standards; others specify how codes such as ASCII are stored on perforated tape, magnetic tape, and punched cards, collating sequences, and error handling.
The major items considered are summarized below. They are described in (15), and in (13) in excruciating detail.
It was a lot to juggle. It also meant that there were going to be more than 64 characters, and no one loved the shifted nature of previous codes (FIGS and LTRS of ITA, for example), so it meant a 7-bit, 128-code table was needed. Arguments for interleaved capital and small letters were made, and discarded. "Shift" characters analogous to FIGS and LTRS were kept, terrifyingly, in the form of SO and SI. BCD subsets were arranged. Transmission controls were conjured (and received major revision in 1967). Hamming distance, eg. the number of differing bits between transmission control codes, was used to determine placement of control codes in the table to minimize problems with bit-smashing data errors.
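The Hamming-distance criterion is easy to demonstrate; here is a small Python sketch (the control-code values are the ASCII-1967 assignments; the helper function is mine):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 7-bit character codes."""
    return bin(a ^ b).count("1")

# ASCII-1967 transmission controls (values from the code table):
SOH, STX, ETX, EOT = 0x01, 0x02, 0x03, 0x04

# Adjacent controls differ in more than one bit, so a single smashed
# bit cannot silently turn one into the other:
assert hamming(SOH, STX) == 2
assert hamming(ETX, EOT) == 3
```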
In fact it worked out quite well; it could have been worse, much worse. There seem to be only a few oddities, such as ACK and ESC, possibly political rather than technical compromises.
Standard X3.4-1963 was published on 17 June, 1963.
(ASCII is and always was a seven bit code. I am shocked at the number of people and sources that claim it to be an 8-bit code. There are only 128 character codes in ASCII. Many of the extensions to ASCII are 8 bits, but they are not ASCII.)
ASCII-1967 was the adoption and "nationalization" of ECMA-6(4), and is the basic set of graphic and control characters used in subsequent standards. In October 1963, four months after the release of X3.4-1963, the ISO international standards committee stated that the lower-case alphabet would be put in the "UNDEFINED" portion of the table. This bumped the (oddly placed) ACK and ESC characters from the graphic table to where they belonged, in the control section. ASCII-1967 is essentially the International Reference Version (IRV) of ECMA-6(4) (or possibly more accurately, the IRV is the U.S. version of ECMA-6(4)).
Summary of changes made in ASCII-1967: the message format control characters were cleaned up (SOM became SOH, EOA became STX), and the oddly-placed ACK and ESC characters were moved into the control section of the table.
ECMA-6(4), and therefore X3.4-1967, formalized the contentious "national use" provisions and code-extension mechanisms.
The other area, equally contentious, is national graphic characters. Vastly detailed arguments were made(13) regarding the "national use" characters, which I won't attempt to describe. Suffice to say that "with prior agreement" between parties the meanings of these characters could be changed to accommodate national use, generally letters like the French and Italian à, etc. There was already precedent for this compromise; the ITA2 code had defined national-use characters.
The ASCII-1967/ECMA-6 is an excellent code; it's lasted for over 30 years, as long as ITA2, and will certainly linger a while longer yet. A lot of care went into its design, and while some of it is little used today it is safe to say that the Internet probably wouldn't have spread so far so fast without it as a common underlying code and set of conventions.
Here is a brief summary of the design criteria for the ASCII code, especially the 1967 version. These changes are summarized in (14) and spelled out in excruciating detail in (13).
|Non-printing space||Idle, NUL, DEL|
|Upper-case alphabetic||Lower-case alphabetic|
|Special graphic or punctuation||Digit|
|Supervisory or control code||Transmission control|
|Format effector||Information separator|
Message formatting is a way to transmit data across a serial communications link. There are two general approaches; fixed-field records and variable-length records.
In fixed-field formatting, every "field" -- in our example below, date, name, etc -- is fixed at a certain number of characters; if the datum is less than that size, the data is "padded" with spaces or some other character. It is easy to store and transport fixed-field data; you simply copy a fixed number of characters around; however it is inflexible (data cannot be wider than a defined field) and wasteful of space (fields are always the maximum width), and prone to severe damage on communications links; a dropped character, shortening one record, wrecks all the subsequent records, too. But on media such as magnetic disks, with low error rates and high speeds, fixed-length records are often the first choice for speed and efficiency.
Variable-field formatting instead lets data in each field be as long, or as short, as required, and places a special character after each datum to mark its end. While it is harder to copy variable-field data records (you have to examine every character) they are more space- and time-efficient, and less error-prone on communications links (a dropped character damages only one or two records). It is(was) the method of choice for serial communication links.
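The trade-off can be sketched in a few lines of Python; the field contents and widths here are illustrative, and the ASCII US (unit separator, 1/15) character stands in for whatever delimiter a given system used:

```python
# Hypothetical record with three fields (contents and widths illustrative):
fields = ["01 JAN 50", "SECRET BUNKER", "MOSCOW"]

# Fixed-field: every field is padded to a declared width; slicing recovers it.
WIDTHS = [10, 20, 10]
fixed = "".join(f.ljust(w) for f, w in zip(fields, WIDTHS))
assert len(fixed) == sum(WIDTHS)      # every record is exactly the same length

pos, recovered = 0, []
for w in WIDTHS:
    recovered.append(fixed[pos:pos + w].rstrip())
    pos += w
assert recovered == fields

# Variable-field: each field ends with a delimiter character instead.
US = "\x1f"
variable = US.join(fields)
assert variable.split(US) == fields
assert len(variable) < len(fixed)     # shorter, at the cost of scanning
```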
What follows are arbitrary examples of message ("record") formatting using three different codes. They are arbitrary because there is no fixed standard for message formatting; the control codes are essentially tools in a toolkit, but these are as representative as any.
Example fixed-field data record, as stored on a magnetic disk:
|RECORD NO.:||0000023|
|DATE:||01 JAN 50|
|NAME:||SECRET BUNKER|
|CITY:||MOSCOW|
|COUNTRY:||USSR|
The sample data above, encapsulated in various message formats. The colors indicate the control type; white is the data in each field.
|SBK||0000023||EBE||01 JAN 50||EBE||SECRET BUNKER||EBE||MOSCOW||EBE||USSR||EBK||[more records]||EOF||[end of tape]|
|SOM||0000023||EOA||01 JAN 1950||S0||SECRET BUNKER||S1||MOSCOW||S2||USSR||EOM||[more records]||EOT||LEM||[end of tape]|
|SOH||0000023||STX||01 JAN 1950||US||SECRET BUNKER||US||MOSCOW||US||USSR||ETX||[more records]||EOT||EM||[end of tape]|
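The ASCII-1967 framing in the last row can be sketched in code; this is a minimal illustration of the idea, not any particular system's protocol, and the frame helper and its splitting logic are invented for the example:

```python
# ASCII-1967 control codes (values from the code table):
SOH, STX, ETX, EOT, US = "\x01", "\x02", "\x03", "\x04", "\x1f"

def frame(serial: str, fields: list[str]) -> str:
    """One record: SOH header STX field US field ... ETX."""
    return SOH + serial + STX + US.join(fields) + ETX

stream = frame("0000023", ["01 JAN 1950", "SECRET BUNKER", "MOSCOW", "USSR"]) + EOT

# Receiver side: strip the framing and recover header and fields.
body = stream[len(SOH):stream.index(ETX)]
header, text = body.split(STX)
assert header == "0000023"
assert text.split(US) == ["01 JAN 1950", "SECRET BUNKER", "MOSCOW", "USSR"]
```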
|Upper Case Alphabet||various||various||various||various|
Upper case printable character; the roman alphabet A through Z. These, the digits and some basic punctuation, are the symbols encoded in Morse and other telegraphy codes, and were deemed adequate for most human communication, as indeed they are (as long as you are American or British). But this was done in the U.S. and western Europe first, and the world was much smaller then, so it's hard to blame them for it.
Especially in early codes, with no track record to rely on, all sorts of "characters" were stuffed into character code tables. One of the most common is graphical "short cuts" for fractions; 1/2, 1/3, 1/4, etc. These would generally be encoded as "1/" where presumably, you would press the 1/ key, followed by a number such as 2 to make a complete fraction. Most of these were jettisoned rapidly when general practice settled down, and room for more critical functions such as line and page control was needed. Nearly all of them (fractions, abbreviations such as "No.", and the like) are easily built from more basic characters.
|Lower Case Alphabet||various||various||various||various|
Lower case printable character; the roman alphabet a through z. Not available in many, if any, communication codes, until FIELDATA, which applied them inconsistently and put them in a silly place, mixed in with the supervisory (control) codes.
|Digits||various||7/0(4) - 7/9(4)||3/0(4) - 3/9(4)||3/0(4) - 3/9(4)|
The arabic digits, 1 through 9 and 0, or post-ASCII, "the digits 0 through 9". Clearly the digits are ordered in modern character codes to allow mechanical collating.
The graphic ^ replaced the up-arrow ↑ of ASCII-1963, under pressure from international committees requiring it as an alphabetic diacritical mark, and the short-lived "up arrow" introduced in ASCII-1963 disappeared.
In many early computer programming languages, up-arrow meant "exponentiation"; for example, 2^3 (or 2↑3) is 2 to the 3rd power, or 8. See the off-topic rant for left arrow.
As a (not so) amusing historical inheritance; while there is a UNICODE definition for an up-arrow character, it may not be visible in the example above, depending on your browser.
|Left Arrow ←||undef||6/6(4)||5/15(4)||5/15(4)|
One of the graphical codes, left-arrow mutated to the underscore of ASCII-1967. It may have had earlier, or other, meanings, but for some early programming languages it was "assignment", eg.
c ← b + a
"C is assigned the sum of B and A".
An illuminating aside: it is common in computer languages for a variable to be reassigned; this is a radical departure from traditional mathematical convention, where every intermediate or new value receives a new variable name. (All of the original programmers were mathematically trained; mathematicians were already using precise, unambiguous symbols and procedures, and the once "useless" field of mathematical logic became "useful".) For example, something as simple as accumulating the sum of four variables we might write today as:
s = a
s = s + b
s = s + c
s = s + d
in standard mathematical notation would have been
s = a
s' = s + b
s'' = s' + c
s''' = s'' + d
Where s, s', s'', s''' are all separate, distinct variables. Reassignment is a relatively new idea; modern languages like the C language use = as in the first example above. Algol, the first 'modern' language that C is largely based upon, originally wanted a special 'assignment' character such as the left-arrow. They settled on
s := a
s := s + b
s := s + c
s := s + d
Leaving = to be a test for equality. But I digress.
Printable graphical characters; punctuation, collating symbols, mathematical symbols, etc. The position of these characters moved about greatly in different code tables, attempting to juggle various logical subsets (eg. grouping 1-9 0 + - / * = together) and trying to keep typewriter keyboard layouts close to their traditional layouts, a more complex issue in the bygone era of mechanical logic (eg. the extra mechanical parts necessary to have a key lever perform digital transformations in more than one column of the table).
Additionally, in the late 50's/early 60's there was an explosion of "automatic" computer programming (eg. what were to become compilers and interpreters), and the designers of these wanted to use the rich, traditional set of mathematical symbols, as well as "my idea is bigger than your idea", so there were many competing camps and clashing ideas on what characters should go into extremely cramped code sets (and small keyboards). ALGOL favored a large set, COBOL used common typewriter characters. Since the U.S. military vastly preferred COBOL (one of its major proponents was Grace Hopper, later Admiral Hopper), it won, since the military, the Army in particular, was heavily at the forefront of computer $tandardization.
Also a major issue was collating (sorting) order; there is a fair amount of subtlety in the order of characters in the table, to allow numerical sorting to work. Space and punctuation characters come before the alphabet, upper case before lower; traditionally, numbers sort after letters, but putting the digits in ASCII before the letters was a compromise to make the table fit, deemed practical because flipping table row order was a simple bit transformation. However, as is by now obvious, that flip is rarely done any more, and numbers generally sort before words.
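The collating behavior falls directly out of the code-point order, as a quick check shows:

```python
# Space (2/0) sorts before digits (3/x), digits before upper case,
# and upper case before lower case:
assert sorted(" Z4Aa") == [" ", "4", "A", "Z", "a"]

# Hence numbers sort before words, and upper case before lower:
assert sorted(["zebra", "Zebra", "42"]) == ["42", "Zebra", "zebra"]
```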
SPEC, 7/14, (FIELDATA). I am not certain that this is really a graphic; but it appears in the Alphabetic set, and has a graphical representation. I suspect(0) that it is substituted for other unprintable characters. Code 3/14, SPC, occupies the complementary spot in the Supervisory table, probably not a coincidence. There is very little hard data on FIELDATA's code set.
|NUL, IDL||0/0(4)||0/0, 4/0(4)||0/0(4)||0/0(4)|
IDL, Idle, 0/0 (FIELDATA); NUL, Null character, 0/0 (ASCII); BLANK, 0/0 (ITA2); MS, Master Space, 4/0 (FIELDATA).
Nothing, nada data, NUL tape, IDLE or NULL characters are the first, the 0th, position in most character code tables. Something we can all agree on. Humans are culturally drawn to zero as a delimiter. Perforated paper tape, in use for half a century for machine data storage, uses holes to represent presence of a data bit, absence no data; hence a NUL tape is all 0's, no holes, hence its ITA name, BLANK, reflecting its telegraphy roots of "blank tape".
MS, Master Space, code 4/0, is simply the IDLE code, 0/0, in the FIELDATA alphabetic table. It has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity. When used as the base 6-bit code, for example using hardware decode of the "tag" bit, MS and NUL are actually the same character. This is some of the historical, case-oriented baggage ASCII so sensibly did away with.
Note this is not the same thing as the non-printing space character.
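The tag-bit redundancy can be shown with a little arithmetic, assuming the usual column/row-to-bits mapping (code = column × 16 + row):

```python
TAG = 0b1000000   # FIELDATA's seventh, "tag" bit
IDL = 0x00        # Idle, 0/0, in the Supervisory table
MS  = 0x40        # Master Space, 4/0, in the Alphabetic table

# As 7-bit codes they differ only in the tag bit...
assert MS == IDL | TAG
# ...so hardware that routes the tag bit on a separate wire sees the
# identical 6-bit code for both characters:
assert MS & 0x3F == IDL & 0x3F
```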
It is hard to casually explain the function of NUL in the serial character scheme-of-things; here is the explanation from ECMA-48(2): "NUL is used for media fill or time fill. NUL characters may be inserted to, or deleted from, a data stream without affecting the information content of that stream, but such action may affect the information layout and/or the control of the equipment". A lovely bit of text that is, I mean that truly.
Space, 0/4 (ITA2); 4/5 (FIELDATA); 2/0 (ASCII). Control space, 0/5 (FIELDATA).
The other nothing in these character sets, this is known in the ASCII-1963 standard as 'word separator, (space, normally non-printing)'. Though it produces no ink on paper, it is of course graphical nonetheless. The other 'BLANK' is NUL or IDLE.
Non-printing space exists in both cases of ITA2. Many ITA machines "un-shift" on space, reverting to LTRS case.
Similarly, FIELDATA has two "space" codes; one each in the alphabetical and the supervisory code tables, an artifact brought on by the ambiguous nature of the "tag" bit (see the discussion under the FIELDATA table). The alphabetic space has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity. Interestingly, I recall Data General FORTRAN4 documentation, in the late 1970's, using this triangle in syntactical diagrams to indicate space.
|D0||undef||2/0 - 2/9(4)||undef||undef|
D0, Dial 0, 2/0, through D9, Dial 9, 2/9 (FIELDATA) Ten sequential codes.
Almost certainly for modem control (my source(9) calls them "Dial0" through "Dial9"), eg. dial a telephone number to reach another modem. There are ten dial codes, D0, D1, etc, D9. An odd choice; it seems "obvious" today to simply use standard digit characters preceded by a dial command, but, it only seems that way because I live in an age where there's a computer in every damned telephone.
SCB, Start of Control Block, 2/10 (FIELDATA). Definition unknown, but likely an early message-format construct; the name can probably be taken literally. See also ECB.
|SBK, SOM, SOH||undef||2/11(4)||0/1(4)||0/1(4)|
SBK, Start Of Block, 2/11 (FIELDATA); SOM, Start Of Message 0/1 (ASCII-1963); SOH, Start Of Header 0/1 (ASCII-1967). Message formatting characters; these indicate the start of a block of data (SBK, SOM) or the start of the header portion of the data (SOH).
RTT, Ready To Transmit, 3/0 (FIELDATA). Presumably data transmitter telling the data receiver it is ready to send. In ASCII these sorts of things are done with more abstraction, eg. a state machine uses ACK and NAK or DC1 ("Q" column) and DC3 ("S" column) characters to indicate ready.
RTR, Ready To Receive, 3/1 (FIELDATA). Presumably the data receiver telling the data transmitted that it is ready to receive data. In ASCII these sorts of things are done with more abstraction, eg. a state machine uses ACK and NAK or DC1 ("Q" column) and DC3 ("S" column) characters to indicate ready.
NRR, Not Ready to Receive, 3/2 (FIELDATA). In ASCII these sorts of things are done with more abstraction, eg. a state machine uses ACK and NAK or DC1 ("Q" column) and DC3 ("S" column) characters to indicate ready.
EBE, End of Blockette, 3/3 (FIELDATA). Exact function unknown, but along with other tantalizing named codes, certainly a delimiter for formatting data.
|EBK, EOM, ETX||undef||3/4(4)||0/3(4)||0/3(4)|
EBK, End of Block (FIELDATA); EOM, End of Message (ASCII-1963); ETX, End of Text (ASCII-1967). These specify the end of formatted data.
EOF, End Of File, 3/5 (FIELDATA). The data stream structure of FIELDATA is a little opaque. Probably equivalent to the ASCII EOT code. See message formatting examples.
ECB, End of Control Block, 3/6 (FIELDATA). One of the message formatting functions. See also SCB.
ACK, Acknowledge; 3/7 (FIELDATA); 7/12 (ASCII-1963); 0/6 (ASCII-1967). Implying positive or successful completion(2). Note the odd position of ACK; X3.4-1963(4) says "'Acknowledge' was placed where its code could be generated by simple means." --uh huh; most likely, where some major manufacturer's device was already generating this code. But realistically, this is how standards get made in the real world.
There are two oddly-placed control codes in ASCII-1963, both are historical compromises: ESC, and ACK, mentioned above.
RPT, Repeat Block, 3/8 (FIELDATA); NAK, Negative Acknowledgement, 1/5 (ASCII). Negative acknowledgement, failure, error. Another function common to all codes. FIELDATA RPT is Repeat Block; ASCII-1963 ERR is Error; ASCII-1967 NAK is Negative Acknowledgement(2). We all agree that pointing out other's faults is good sport.
INS, Interpret Sign, 3/10 (FIELDATA). Unknown function; no information available. Likely a single-purpose command included for historical reasons.
NIS, Non-Interpret Sign, 3/11 (FIELDATA). Unknown function; no information available. Likely a single-purpose command included for historical reasons.
CWF, Code Word Follows, 3/12 (FIELDATA). Who knows, your guess is as good as mine.
SAC, S.A.C. (really), 3/13 (FIELDATA). Mysteriously, my source(9) says only "S.A.C.". Win a prize, reveal to me the secrets within and I'll make you famous in this very paragraph.
SPC, Special Character 3/14 (FIELDATA); ESC, Escape, 7/14 (ASCII-1963); 1/11 (ASCII-1967). ASCII ESC is the direct inheritor of the FIELDATA "Special" function: The ASCII-1963 standard(4) states: "The 'Escape' was placed so as to conform with the 'special' function of the DOD standard 8-bit code and to facilitate the 6-bit contraction."; hence its otherwise odd placement in the table. (The '6-bit contraction' business is simply the FIELDATA character code with the "tag" bit stripped.) So it appears that ESC/SPC is an in-band signalling character; worse, it apparently re-introduces context-sensitive parsing of the stream (shades of FIGS and LTRS, but maybe I'm being too strict.) In ASCII-1967 they bit the bullet and sensibly put all control codes together, in the lower portion of the table, where they sit today.
The concept of "ESCape sequence" is defined explicitly in ECMA-6(3) as the ESC character followed by a sequence of characters as an extension of control functions. In other words, the characters ESC [ 2 J are the command to clear the screen on a DEC VT-100 display terminal; this use of a "special" character (ESC) allows manufacturers such as DEC to extend the functionality of ASCII without having to mess around with the basic definitions.
There are two oddly-placed control codes in ASCII-1963, both are historical compromises: ESC, discussed above, and ACK.
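An escape sequence is nothing but ordinary character codes on the wire; for instance the VT-100 "erase in display" sequence (with parameter 2, which erases the entire screen):

```python
ESC = "\x1b"

# ESC [ 2 J: the VT-100 "erase in display" sequence with parameter 2.
clear_screen = ESC + "[2J"

# On the wire these are just ordinary character codes; only the
# receiving terminal interprets them as a command:
assert [hex(ord(c)) for c in clear_screen] == ["0x1b", "0x5b", "0x32", "0x4a"]
```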
DEL, Delete, 3/15 (FIELDATA); 7/15 (ASCII). The exact function and especially the bit pattern of the DELETE character is ancient: on perforated paper tape, symbols are encoded as rows of holes punched across the width of the tape, the most common being 5-row and 8-row tape. Since hole-punching is unavoidably final, the only possibility for "editing" is to set aside "all holes punched" to mean "deleted", and that's what was done.
What would be "DEL" in ITA2 is assigned to LTRS, the "default" case for most teleprinters. It would be harmless to use LTRS to delete as long as another FIGS was issued, as necessary.
This is how you edit text on a teletype, sans computer. I have done this once or twice. I have personally used a two-pass paper tape text editor, on a Varian 622/I minicomputer, that used a variation of this for editing assembly language source tapes.
Note that DEL may be used as a "line fill" character, as is NUL. Many systems send a constant stream of characters, to indicate that all is well, but there is nothing to do; this is called "line fill", because it fills the line (circuit) but otherwise has no effect, because both DEL and NUL characters are generally ignored.
Note that ITA2's stand-in (LTRS) is all 1's (5 bits), FIELDATA's DEL is all 1's in the 6-bit code, and ASCII's DEL is all 1's (7 bits).
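Why all-ones was the only workable "delete" on punched tape can be shown directly: punching can only add holes, which in bit terms is an OR, and OR-ing anything with all-ones yields all-ones:

```python
DEL = 0b1111111   # ASCII DEL: all seven holes punched

# A punch can only add holes, never fill them in; in bit terms that is
# an OR. Overpunching any existing character therefore yields DEL:
for code in range(128):
    assert code | DEL == DEL
```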
UC, Upper Case, 4/1 (FIELDATA); CUC, Control Upper Case, 0/1 (FIELDATA); FIGS, Figures 1/11 (ITA2).
See LC, CLC, or LTRS. CUC is simply a side-effect of duplicating the entire alphabet with the "tag" bit.
UC has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity.
LC, Lower case, 4/2 (FIELDATA); Control LC, 0/2 (FIELDATA); LTRS, Letters, 1/15 (ITA2). CLC is a side-effect of duplicating the entire alphabet with the "tag" bit.
5-bit codes such as ITA2 have only 32 character positions; to encode the entire alphabet, digits, punctuation and format effectors two code tables are defined, and the UC/FIGS and LC/LTRS codes determine which table is in use, eg. determines the state of a small state machine. The decoding machine must "remember" the most recent FIGS or LTRS code received, and use the appropriate table when printing or accepting keyboard presses. As you can imagine, the loss of a FIGS or LTRS character, or the operator forgetting to press it, can make a mess. Real teletype geeks can tell when a line of FIGS codes is really LTRS. For convenience (whose is the question) many teletypes revert back to LTRS state whenever certain things happen, such as the receipt of a CR character, to add to the fun.
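The shift-state machine just described can be sketched in Python; the three-character table fragment below is only illustrative of the scheme, not a complete or authoritative ITA2 table:

```python
# A three-character fragment of ITA2, enough to show the scheme
# (illustrative, not a complete table).
LTRS_TABLE = {0b00011: "A", 0b11001: "B", 0b01110: "C"}
FIGS_TABLE = {0b00011: "-", 0b11001: "?", 0b01110: ":"}
LTRS, FIGS = 0b11111, 0b11011   # the two shift (case) codes

def decode(codes):
    """Stateful decode: remember the most recent LTRS/FIGS shift."""
    table, out = LTRS_TABLE, []   # assume we start in LTRS case
    for c in codes:
        if c == LTRS:
            table = LTRS_TABLE
        elif c == FIGS:
            table = FIGS_TABLE
        else:
            out.append(table.get(c, "?"))
    return "".join(out)

# The same code value means different things depending on shift state;
# lose one FIGS and an entire run of characters decodes wrongly:
assert decode([0b00011, 0b11001]) == "AB"
assert decode([FIGS, 0b00011, 0b11001]) == "-?"
assert decode([FIGS, 0b00011, LTRS, 0b00011]) == "-A"
```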
In a teleprinter or typewriter, the state machine is the position of the paper-carrying carriage; up or down, upper or lower positions, and historically the codes directly control the position of the print mechanism. If you are old enough you remember that typewriter keys have two characters on them; which one hits the paper depends on if the carriage is up or down.
FIGS and LTRS exist in both cases of ITA2.
Similarly, FIELDATA has two UC/LC codes; one each in the alphabetical and the supervisory code tables. I would guess that they are in fact treated exactly the same(0), and the redundancy of having two is brought on by the ambiguous nature of the "tag" bit (see the discussion under the FIELDATA table).
LC has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity.
HT, Horizontal Tabulation, 4/3 (FIELDATA); 0/9 (ASCII). Control HT, 0/3 (FIELDATA). CHT appears in the FIELDATA supervisory code because of the inherent duplication caused by the "tag" bit.
Horizontal Tabulation, or Tab, as it's called today. On a typewriter, you would move little metal tabs into a toothed rack, such that when you pressed the "tabulation" key, the carriage would slide quickly to the left, stopping at the next metal tab. For setting columns for forms, letterhead, etc. The same model is used today on "modern" software like Microsoft Word, only the little inscrutable icons are too small to see.
In FIELDATA, HT exists in both 6-bit "cases"; see the discussion under the FIELDATA table above. It also has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity.
CR, Carriage Return, 0/8 (ITA2); 4/4 (FIELDATA); 0/13 (ASCII). Control CR; 0/8 (FIELDATA). CCR appears in the FIELDATA supervisory code because of the inherent duplication caused by the "tag" bit.
Returns the typewriter/teleprinter printing carriage to its right-most position, so that the type mechanism will next print in the left-most column. (Though literally true, that was just to confuse you; it moves the cursor to the left edge of the screen in non-iron(ic) technology.) One of the few things that has improved in "user interfaces" in the last few decades is that you no longer have to "pad" CR's with NUL characters, because the electronic "carriage" (sic)(sick) is more or less instantaneous, unlike actual metal carriages which took tens of milliseconds to move. (We still have the "new line" problem, after all these decades, so don't think I'm being too cynical with this.)
It exists in both cases of ITA2.
Similarly, FIELDATA has two codes for this function, due to the redundancy provided by the "tag" 7th bit. When FIELDATA codes are treated as six bits, as in hardware decode, then the CR and CCR characters have the same code.
EOA, End Of Address 0/2 (ASCII-1963); STX, Start of Text 0/2 (ASCII-1967). STX is defined as the start of a text, and the end of a header(2). See the message formatting section for details.
EOT, End Of Transmission, 0/4 (ASCII). The conclusion of the transmission of one or more text messages(2). Likely the inheritor of the FIELDATA EOF function, but I have no evidence at this time.
See the message formatting section for details.
WRU, Who aRe yoU, 0/9 (ITA2); 0/5 (ASCII-1963); ENQ, Enquire, 0/5 (ASCII-1967). WRU/ENQ is sent to request a response from the receiver(2).
WRU the old name for this function. I love WRU. Many teletypes (the 33ASR for example) contain a little wheel with sixteen rows of eight tabs. It is essentially a "READ ONLY MEMORY"; up to 16 characters are stored by breaking off tabs in each row (no tabs broken is a NUL; all tabs broken is DEL). When the teletype receives a WRU/ENQ character over the line, this device is triggered, the little wheel rotates and each code is sent one by one back over the line, as if a human responded by typing the desired response. It seemed very high-tech at the time (as indeed it was).
For example, you want to place an order for widgets with the Acme Company. On your Telex teletype, you would dial the Acme Company's Telex number. After it connects (300 or 110 baud) you type Control-E. The Acme Company's teletype responds by transmitting the contents of its WRU device; on your teletype you then see:
You are now assured that the Telex you dialed really is the Acme Company. (You made up your order ahead of time by using your teletype in "LOCAL" mode, with the paper tape punch "ON"; you type your order in the form of a brief letter, using DEL to delete mistakes as necessary, taking your time, producing a paper tape copy of your order. After dialing, and verifying, you load the tape in the reader and press START; this transmits your order to Acme at full line speed. You are charged by the second so offline editing is a must.)
See also RU.
RU, aRe yoU, 0/6 (ASCII-1963). The ASCII-1963 code table implies this is a query such as "RU ACMECO" (see example in the WRU description), but no description or further data is given, and I have no experience with it. Occupies the ACK slot of ASCII-1967.
|BEL||0/11(4) or 0/5(4)||undef||0/7(4)||0/7(4)|
BEL, Bell, 0/11 or 0/5 (ITA2); 0/7 (ASCII). audible signal, or might control some other alarm or device(2). Out-of-band signaling? Transmission control? har har. On my Teletype Corp. Model 28, it's a brass or steel bell about 4" diameter, and makes a lovely sound.
It seems odd that FIELDATA does not define a BEL character; presumably it was considered an embedded function of underlying codes transported by FIELDATA, such as teleprinter codes.
BS, Backspace, 7/15 (FIELDATA); FE0, Format Effector 0, 0/8 (ASCII-1963); BS, Backspace, 0/8 (ASCII-1967). Today we think of a cursor on a screen moving one character position to the left; but its origin is to position a paper tape punch back over the character most recently punched, generally to be followed by a DEL character for deletion.
It is interesting to note that FIELDATA's BS code occupies ASCII's DEL slot, considering the general relationship of the functions involved.
For some reason ASCII-1963 leaves the what-became-BS position loose, as "Format Effector 0". It is probably knowable, but until I find some old proceedings or public discussion it'll remain a mystery. More recently and specifically ECMA-48(2) defines it thusly: 'BS causes the active data position to be moved one character position in the data component in the direction opposite to that of the implicit movement.' See how far we've come?
LF, Line Feed, 0/2 (ITA2); 0/10 (ASCII). As should be obvious, feeds the paper up one line. Oddly, no LF character is defined in FIELDATA, or at least in the definition I have. Whether this is an oversight or a typo is unknown to me. CR and LF are two very old legacies we still deal with today. The unix convention is the most sensible; the code for LF is called "newline", and its presence causes hardware drivers ("format effectors") to move the cursor to the left-most position, and advance one line down ("paper up"). It leaves CR free for barbaric cursor control.
VT, Vertical Tab, 0/11 (ASCII). Since business forms are a very common use of computing and telecommunications equipment, VT performs a function very analogous to HT, only in the vertical direction. Unlike a typewriter's metal tabs, VT was usually handled by a device called (are you ready) the VFU, or Vertical Forms Unit. It was usually a small paper tape loop installed on a reader inside a paper printing device, the tape having one row for every line on the paper form or sheet; for a standard American "letter size" form, 66 rows correspond to the 66 printable lines on the page. Punched on the tape were holes indicating where subsequent VT characters should "tab" to; eg. if you punch positions 12 and 18, the first VT will take you to line #12, the second VT will take you to line #18.
Part of forms control is the FF, or Form Feed character, which positions the paper to print on the first line of the next sheet or form. With FF, VT and LF you choose which line on the paper to print on.
You of course punch VFU tapes on your teletype, which every office has, right?
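Lacking a teletype and a tape punch, the VFU behavior above can be sketched in software. This is a minimal simulation under stated assumptions: a 66-line form and the hypothetical punch positions 12 and 18 from the example; all names are illustrative, not from any standard.

```python
# Minimal sketch of a Vertical Forms Unit (VFU) tape loop, as
# described above: one tape row per printable line, with holes
# punched at the "tab stop" lines. Hypothetical punch positions.
FORM_LENGTH = 66                 # lines per letter-size form
punched = {12, 18}               # hypothetical hole positions

def vertical_tab(current_line):
    """Advance to the next punched line after current_line,
    wrapping to the first stop on the next form if none remains."""
    stops = sorted(punched)
    for stop in stops:
        if stop > current_line:
            return stop
    return stops[0]   # feed to top of next form, first stop

line = 1
line = vertical_tab(line)   # first VT: line 12
line = vertical_tab(line)   # second VT: line 18
assert line == 18
```

A third VT from line 18 wraps around the tape loop, landing on line 12 of the next sheet, just as the physical loop would.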
FF, Form Feed, 0/12 (ASCII). The FF character causes printing devices to move the paper, or print head, to the first or top line on the next form or sheet of paper. There is some ambiguity as to where the print position is after a FF; the modern definition(2) says it should be the defined "home" position (eg. as if FF were followed by a CR). It is generally considered sensible to not assume anything, and issue a CR explicitly, just to be sure. Related vertical control characters are VT and LF.
SO, Shift Out, 0/14 (ASCII). SO indicates that the characters that follow should or can be interpreted differently; its exact meaning is implementation-dependent, but it could be used to change type face(5), or to (say) encapsulate ITA2 codes in an ASCII stream, until the receipt of an SI code.
May be related to FIELDATA INS or NIS.
SI, Shift In, 0/15 (ASCII). Causes the meaning or interpretation of the character stream to revert to its normal meanings. See also SO.
May be related to FIELDATA INS or NIS.
DLE, Data Link Escape, 1/0 (ASCII-1967).
DLE is used to "escape" control characters when they are part of data. If a control character such as ETX, for example, were to appear as data, the communications system would interpret it incorrectly as a transmission control. To prevent this, the DLE character is used to "escape" the data, to indicate to the communications system that the next character following is to be treated as data.
|SOH||header||STX||A B C||DLE||ETX||D E F||ETX||.....||EOT|
In the example above, the control character "ETX" which appears as data is escaped by the DLE, signalling to the communication system that it should be treated as data. Which characters need to be "escaped" depends on the communications system.
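The escaping scheme above is essentially what is now called byte-stuffing, and it can be sketched in a few lines. In this sketch only ETX and DLE itself are escaped; as noted above, which characters need escaping depends on the communications system.

```python
# Sketch of DLE byte-stuffing as described above. Here only ETX
# and DLE itself are escaped; real systems chose their own sets.
DLE, ETX = 0x10, 0x03

def stuff(data):
    """Prefix each in-band control byte with DLE."""
    out = []
    for b in data:
        if b in (DLE, ETX):
            out.append(DLE)   # escape: next byte is data
        out.append(b)
    return bytes(out)

def unstuff(data):
    """Undo stuff(): a byte following DLE is literal data."""
    out, escaped = [], False
    for b in data:
        if escaped:
            out.append(b)
            escaped = False
        elif b == DLE:
            escaped = True
        else:
            out.append(b)
    return bytes(out)

payload = bytes([0x41, 0x42, ETX, 0x43])   # "AB<ETX>C"
assert unstuff(stuff(payload)) == payload
```

Note that DLE itself must be escaped too, otherwise a literal DLE in the data would swallow the byte after it; the same reasoning recurs in every escape scheme since.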
DC0, Device Control 0, 1/0 (ASCII-1963). DC0 is a reserved code to perform code-unspecified functions on terminal equipment. This code was reassigned to DLE for ASCII-1967.
See also DC0, DC1, DC2, DC3, and DC4. By convention, DC3, in the "S" column, is "XOFF", or device/punch off, and DC1 is "XON", or device/punch on. I do not know the origin of this convention, but it is supported by 33ASRs.
DC1, Device Control 1, 1/1 (ASCII). DC1 has two meanings, one nearly obsolete today. DC1 is used to turn on an auxiliary device, such as a paper tape punch attached to a teletype, the most common, and obsolete, example. It can also be used to perform pretty much any other sort of thing, generally performing an "ON" or enable function(2).
The second meaning, still in use, is for start/stop "flow control" on a serial link with inadequately supported hardware functions. When used like this, it's usually called XON or X-ON; receipt of a DC1/XON character causes the sending device to start or to continue outputting text. Many computer console interfaces use XON/XOFF (oops I'm getting ahead of myself) to control the displaying of text on the console device.
DC2, Device Control 2, 1/2 (ASCII). A secondary control for auxiliary equipment; similar to DC1, it can be used to modify the behavior of the auxiliary device(2). Little used.
DC3, Device Control 3, 1/3 (ASCII). DC3 has two meanings, one nearly obsolete today. DC3 is used to turn off an auxiliary device, such as a paper tape punch attached to a teletype, the most common example. It can also be used to perform pretty much any other sort of thing, generally performing an "OFF" or disable function(2).
The second function, still in use, is for start/stop "flow control" on a serial link with inadequately supported hardware functions. When used like this, it's usually called XOFF or X-OFF; receipt of a DC3/XOFF character causes the sending device to stop or to pause outputting text. Many computer console interfaces use XON/XOFF to control the displaying of text on the console device.
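The XON/XOFF flow control described in the DC1 and DC3 entries amounts to a one-bit state machine on the sending side, which can be sketched as follows (the class and method names are illustrative; real implementations live in UART hardware or the tty driver):

```python
# Sketch of in-band XON/XOFF flow control as described above:
# receipt of DC3 (XOFF) pauses output, DC1 (XON) resumes it.
XON, XOFF = 0x11, 0x13    # DC1 = 1/1, DC3 = 1/3

class Sender:
    """The transmitting side of a flow-controlled serial link."""
    def __init__(self):
        self.paused = False

    def control(self, byte):
        """Process a control byte arriving from the receiver."""
        if byte == XOFF:
            self.paused = True
        elif byte == XON:
            self.paused = False

    def can_send(self):
        return not self.paused

s = Sender()
s.control(XOFF)
assert not s.can_send()
s.control(XON)
assert s.can_send()
```

Because the control codes travel in the same stream as the data, this scheme needs no extra wires, which is exactly why it survived on serial links "with inadequately supported hardware functions".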
DC4, Device Control 4, 1/4 (ASCII). A secondary control for auxiliary equipment; similar to DC3, it can be used to modify the behavior of the auxiliary device, such as pause, interrupt, etc(2). Little used.
ST, Stop, 6/15 (FIELDATA). I believe this to be the same function as DC4, but I cannot find my notes on it, so it will have to remain conjecture.
It has a graphical representation as well; it's common for there to be an agreed-upon 'meta-representation' of a non-printing character, but it's unusual to see it in a code table; it appears to be another FIELDATA ambiguity.
SYN, Synchronous Idle, 1/6 (ASCII). Transmitted more or less continuously to indicate that the line is not in use, but still active and functional; for systems that require a continuous stream of characters to be considered normally functional(2).
EBK, End of Block 3/4 (FIELDATA); ETB, End Transmission Block, 1/7 (ASCII-1967). Used to mark the end of a block, for transmission-control purposes(2). See the message format examples.
S0, Separator 0, 1/8, through S3, 1/11 (ASCII-1963). Information separator for formatted data, application-dependent. Of the eight separator characters in ASCII-1963, these first four were reassigned: S0 became CAN, Cancel; S1 became EM, End of Medium; S2 became SUB, Substitute; and S3 became ESC.
The other four separators were retained but given new names: S4 became FS, S5 became GS, S6 became RS, and S7 became US.
S4, Separator 4, 1/12 (ASCII-1963); FS, File Separator, 1/12 (ASCII-1967). Set aside as a general data record separator, given its new name of FS in ASCII-1967. 'It may delimit a data item called a file'(2), though the standard also states that the definition is implementation-dependent. Last I knew, a file was a collection of records, but in this context it could have another meaning.
S5, Separator 5, 1/13 (ASCII-1963); GS, Group Separator, 1/13 (ASCII-1967). Set aside as a general data record separator, given its new name of GS in ASCII-1967. 'It may delimit a data item called a group'(2), though the standard also states that the exact meaning is implementation-dependent. I am not certain where a group fits in this data hierarchy.
S6, Separator 6, 1/14 (ASCII-1963); RS, Record Separator, 1/14 (ASCII-1967). Set aside as a general data record separator, given its new name of RS in ASCII-1967. 'It may delimit a data item called a record'(2). I assume a record in this context is what it usually means, a distinct and discrete string of data, though the standard also states that the exact meaning is implementation-dependent.
S7, Separator 7, 1/15 (ASCII-1963); US, Unit Separator, 1/15 (ASCII-1967). Set aside as a general data record separator, given its new name of US in ASCII-1967. 'It may delimit a data item called a unit'(2), though the standard also states that the exact meaning is implementation-dependent.
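One plausible reading of the four separators is a strict nesting, units within records within groups within files, and that reading can be sketched as nested splitting. Bear in mind this nesting is my assumption; as noted in each entry, the exact semantics are implementation-dependent.

```python
# Sketch of one plausible FS/GS/RS/US hierarchy: units make up
# records, records make up groups, groups make up files. The
# standard leaves the exact semantics implementation-dependent.
FS, GS, RS, US = '\x1c', '\x1d', '\x1e', '\x1f'

def parse(stream):
    """Split a stream into files > groups > records > units."""
    return [[[record.split(US)
              for record in group.split(RS)]
             for group in file.split(GS)]
            for file in stream.split(FS)]

data = "a" + US + "b" + RS + "c" + GS + "d" + FS + "e"
files = parse(data)
assert files[0][0][0] == ["a", "b"]   # first record: two units
assert files[0][1][0] == ["d"]        # second group
assert files[1][0][0] == ["e"]        # second file
```

Flat in-band separators like these need no escaping or length counts, which made them attractive for the fixed-function hardware of the day; their descendants still turn up in formats like the magnetic stripe and barcode data standards.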
CAN, Cancel, 1/8 (ASCII-1967). Occupies the position of ASCII-1963 S0. CAN means that previously-sent data is in error and should be ignored; the details depend on the specific application(2).
LEM, Logical End of Media, 1/7 (ASCII-1963); EM, End of Medium, 1/9 (ASCII-1967).
EM marks the physical end of a medium (tape, etc), or the end of its usable portion, or a "logical" (today, "virtual") end-of-medium(2). Usage is implementation-dependent, but for example an EM character might be sent right before a tape runs out, or a modem disconnects; the EM would signal "the physical medium is about to end".
SUB, Substitute, 1/10 (ASCII-1967). Replaces ASCII-1963 generic separator character S2. A SUB character replaces a character that is in error, a sort of placeholder(2), produced by machinery, not humans. Presumably(0), if a character error such as a parity error is detected, the bad character in a text is replaced with a SUB, so that the receiving system can at least detect the damage; for example:
|THE QUICK BROWN FO||SUB||JUMPS OVER THE LAZY DOG|
A SUB character replaces the damaged character following the text "FO".
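The substitution just described can be sketched for a receiver that checks parity. The even-parity scheme here is an assumption for illustration; actual systems used even, odd, or no parity, and SUB's exact use was implementation-dependent.

```python
# Sketch of SUB substitution as described above: on receipt, any
# byte failing an (assumed) even-parity check is replaced with
# SUB (1/10), so downstream consumers can see where damage was.
SUB = 0x1A

def even_parity_ok(byte):
    """True if the 8-bit byte has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0

def receive(raw):
    out = []
    for b in raw:
        if even_parity_ok(b):
            out.append(b & 0x7F)   # strip the parity bit
        else:
            out.append(SUB)        # damaged: substitute
    return bytes(out)

# 0x41 ('A', two 1-bits) passes; 0x43 (three 1-bits) fails.
assert receive(bytes([0x41, 0x43])) == bytes([0x41, SUB])
```

The receiving end cannot repair the damaged character, but a SUB in the text at least flags exactly where the damage occurred, which is the whole point.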
Punched card systems, developed more or less at the same time as telegraphy, seem obvious to our story here; though they do have their own codes for letters, numbers and other symbols, cards are record oriented, with each card containing a block of data, such as household census data, or an item in a manufacturer's inventory, rather than human communication or correspondence (though that has been done too). Human communication is a serial stream of characters, without a formal format (though one can be imposed upon it; everyone rebels or cheats).
Cards are first and foremost designed for sorting information(11). They are very cumbersome for communications.
Character codes such as ASCII came from the need for people to communicate thoughts and ideas, not blocks of formatted data. Humans have a long history with written language, and communication character codes are but an extension of this. Punched cards are part of the history of calculating machinery, related, but not the same. Character codes for punched cards are intimately part of the cardboard medium; for example, holes too close together weaken the card! More importantly, there is a different mindset for handling holes-in-cards as data, involving looking for patterns of holes, independent of the meaning of the holes (whether number or letter) by doing mechanical collating, and in some systems like PEEKABOO, using knitting-needle like tools to order cards!
Punched cards were certainly used in computers, but early computers were not given character abilities at all until amazingly late (1960's) with a few exceptions such as the SAGE U.S. defense computer network. The demise of the punched card for data storage coincides with the demise of all mechanical data-handling equipment. It was clearly dying in the late 1960's, and all-but-dead by 1980. They live on vestigially in commercial airline ticket formats, parking lot pay systems, etc.
NOTE(1) Strictly speaking, "ASCII" is no longer in use. The 128 characters that make up "ASCII", of which 95 have graphic representations (eg. marks on paper or screen), were defined as-we-know-them-today in two documents; the international ECMA-6(3) in 1965 and X3.4-1967(6), also known as USASCII, in 1967. Since that time they were integrated into a series of international and national standards, and make up the basic subset of UNICODE and other "modern" standards. But that's not our interest here.
NOTE(2) A "one time pad" is a table of random data used to encrypt (make secret) or decrypt messages. The name comes from the fact that the tables are usually printed on pads of paper, and each sheet, containing a unique table of data, is used only once, then destroyed. It works as follows: Mary and John both have the same one-time pad. Mary encrypts her message by looking up each letter in the table of random data, writing them down. This written sheet is handed to John, who finds the scrambled letter in his one-time pad and finds the original letter, and decrypts the message. Anyone without the right one-time pad will find the written sheet utterly meaningless. This method is very secure, but securely distributing the pads is problematic. We all carry a "one time pad" of culture with us, and interpret what we see of the world using it.
NOTE(3) ...and probably get diminishing return on your investment. I already wonder at the usefulness of this thing. Having read Coded character sets: History & Development(13) I already feel better; Mackenzie wrote an entire book on this stuff, though it covers mainly codes of interest to IBM, his employer at the time, though it does provide solid, detailed information on the design criteria for ASCII.
NOTE(4) The pair of numbers are row/column; eg. 1/5 means the character in the code table indicated by row number 1, column number 5.
NOTE(5) I don't want to give the impression that the wire was treated in so formal a way; in practice all kinds of "states" were eked out of the wire; inter-word voltage=off spaces are longer than inter-letter; and of course when nothing is being sent at all the space is arbitrarily long. And I would bet(0) that use was made of all sorts of ad hoc things; voltage=on for long periods as out-of-band signals, etc. And later in its history, all kinds of multiplexing and multi-voltage schemes were used to increase the usage of expensive-to-install wires. However, the character code itself needs only the four basic states.
NOTE(6) It really does seem lost to history just how the characters are ordered in the code. If anyone has any actual information I'd love to hear it. Even plausible rumors may be entertained.
NOTE(7) I shouldn't even go there -- but in fact, the very few systems that will have bad problems with "Y2K" dates (the problem of legacy software encoding the date-year as two digits causing arithmetic over/underflow on date calcs) are pretty much limited to ancient COBOL programs from the second generation of computers. In 1980 I worked for a place whose payroll program was still running in IBM 1401 code, running on a 1401 emulator on a System/360, under the TSO operating system; the 360 and TSO were both horribly obsolete in 1980. I wouldn't be surprised if today they have something running a 360 emulator running the 1401 emulator running payroll. This same place had a CDC 6600, with card I/O -- only. And they were a large, well-funded defense contractor. Go figure. Few today can afford such folly.
NOTE(8) Apparently inspired by the cybernetic/biological/neurological trend of the 1940's and 1950's, the code- and protocol-converting devices were called "embolic" equipment. Even more amusing are the logical schematic designs of Turing and von Neumann, who used a modified "neuron" diagram to indicate digital logic functions, which were in fact very clean, easy to read, and modern looking. In fact, they became the basis for the now-common OR and AND logic blocks, with inverting dots, etc. But I digress, pointlessly.
NOTE(9) Given the state of the art at the time, mere thousands of different symbols approaches "infinity". Alan Turing would have approved of this change; in 1936, in his famous COMPUTABLE NUMBERS paper, one of the criteria for reasonableness for his hypothetical computing machine was a finite number of symbols; if there are too many symbols in use, they become too hard to tell one from another; eg. errors rapidly creep into the system.
Please note that while URLs are included here for convenience, they are extremely fickle things. By the time you read this, the URL may have changed. Don't despair! Before panicking, start from the home page of the site, follow links or look for a 'search' function. Each citation below attempts to give keywords that have located the referenced document using each site's 'search' function.
SOURCE(0) Author's arrogant assertion, offered without further substantiation.
SOURCE(1) The Amateur Radio Handbook, 1999, published by the American Radio Relay League (ARRL), Chapter 12. ARRL home is http://www.arrl.org; the current search page is http://www.arrl.org/htdig/, enter "ITA2 Baudot table" in the search box.
SOURCE(2) ECMA-48: Control Functions for Coded Character Sets. This and other ECMA documents are available for free from ECMA directly. ECMA home is http://www.ecma.ch. As a last resort, a non-authoritative copy is available here in PDF format, in case the ECMA website is unavailable to you.
SOURCE(3) ECMA-6: 7-bit coded Character Set This and other ECMA documents are available for free from ECMA directly. ECMA home is http://www.ecma.ch. As a last resort, a non-authoritative copy is available here in PDF format, in case the ECMA website is unavailable to you.
SOURCE(4) X3.4-1963, AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE, American Standards Association (ASA), 17 June 1963. Available here as page images of a copy of the document obtained from ANSI. Unlike the very enlightened ECMA, ANSI charges for copies of their standards; I paid US$30.00 for eleven (11) poorly xeroxed sheets of paper containing the X3.4-1963 standard.
SOURCE(5) ECMA-35: Character Code Structure and Extension Techniques. This and other ECMA documents are available for free from ECMA directly. ECMA home is http://www.ecma.ch. As a last resort, a non-authoritative copy is available here in PDF format, in case the ECMA website is unavailable to you.
SOURCE(6) ANSI X3.4-1967, AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE, The 1967 standard is no longer available from ANSI.
SOURCE(7) The Telegraph Office, http://fohnix.metronet.com/~nmcewen/ref.html. A wonderful site for telegraph lore, very thorough and detailed.
SOURCE(9) DATA TRANSMISSION EQUIPMENT CONCEPTS FOR FIELDATA, W. F. Leubbert, U.S. Army Research and Development Laboratory, Fort Monmouth, New Jersey, in 1959 PROCEEDINGS OF THE WESTERN JOINT COMPUTER CONFERENCE, pp. 189-196.
SOURCE(10) North American Data Communications Museum http://www.nadcomm.com/fiveunit/fiveunits.htm, for information on early teleprinters.
SOURCE(11) PUNCHED CARDS: Their application to Science and Industry, Robert S. Casey, ed., published by Reinhold Publishing Corp., 1958. From page 3 in the Introduction: "The two general types of punched cards, for hand sorting and machine sorting...". An amusing essay is titled, "Holes, Punches, Notches, Slots, and Logic".
SOURCE(12) The GREENKEYS mailing list, an informal group of long-time professional and amateur radio, telegraphy, radioteletype operators, and fans of obsolete but aesthetically-pleasing wood-and-metal communications apparatus. Their collective experience is probably counted in centuries.
SOURCE(13) Coded character sets, history & development by Charles E. Mackenzie, ISBN 0-201-14460-3 (Addison-Wesley, 1980).
SOURCE(14) Revised U.S.A. Standard code for information interchange by Fred W. Smith, in Western Union Technical Review, November 1967. Note that I am using this as the authoritative source on ASCII-1967, as ANSI no longer has available document X3.4-1967 and I cannot find a copy. Since Fred Smith appears to have been on the X3.4 committee, and he wrote articles for WUTR in 1963 and 1967, and WUTR is an in-house technical rag, it's probably quite reliable.
SOURCE(15) New U.S.A. Standard code for information interchange by Fred W. Smith, in Western Union Technical Review, April, 1964.
SOURCE(16) From the Smithsonian Institution's website. The original URL with the images has moved or is gone; it was at http://photo2.si.edu/infoage.html. Try http://photo2.si.edu or if all else fails, search or ask for National Museum of American History: People, Information & Technology Photographs of Samuel Morse's equipment, with captions.
SOURCE(17) ISO International Register Coded Character Set ISO-IR-006, submitted by ANSI, comprising the character set of X3.4-1968. Though not the same as having a copy of the actual standard, it is at least derived directly from it. This copy was taken from the Information Technology Standards Commission of Japan's website (an "English" language version button is available); a large collection of ISO International Reference Coded Character Sets can be found there.

OTHER SOURCES
I ALSO WISH TO THANK the following people and organizations: The GreenKeys mailing list participants, for their research, knowledge, and friendly tolerance of a youngster's stupid questions; D.R. House's North American Data Communications Museum web page for hosting Alan Hobbs, G8GOJ's code descriptions; dik t. winter's codes web page, a far broader collection of codes; George Hutchinson's RTTY web page; and various artists at ljubljana digital media lab, especially Vuk Cosik's ASCII art projects for inspiration. And thanks to Eric A. Hall for pointing out persistent errors in the ASCII-1967 section and pointing me to SOURCE(17).
Three cheers to the ECMA, an international standards body, doing the good work since 1961. All of their documents are free, online, and their staff is accessible and helpful.
Three raspberries go to the American National Standards Institute (ANSI), utterly unable to provide me with a copy of the X3.4-1967 standard. It seems they farmed out their "order fulfillment" (sic) to Global Engineering, and in doing so they now have no older, paper-only, standards available. But worse, rather than saying "sorry, unavailable" I've received the bureaucratic run-around since July 1999 (I write this mid-September). Lucky for me, I obtained a copy of X3.4-1963 from ANSI (8 June 1999); when I tried to order X3.4-1967 (7 July 1999) Global Engineering could not "access" (sic) the document. And ANSI had the nerve to charge US$30.00 for an 11-page xerox copy of X3.4-1963. Cf. ECMA.
Entire contents copyright Tom Jennings 1999-2001. All rights reserved.