My learning project was rewriting the bzip algorithm for the busybox project, an open source software development effort. I'll attempt to describe my learning in terms of Vygotsky's Zone of Proximal Development, although I've come to disagree with some aspects of the socio-constructivist viewpoint.
I kept a journal, but a better journal is simply the series of posts I made to the busybox development email list, which document my public communication with other project members while I worked. Some messages I posted to this list are at the end of this document, and are referred to within the text.
Since the new learning here is built on top of several years of previous experience, I'll start by providing a bit of background on the context and social structure within which I was working.
Busybox is a project for Linux systems that re-implements the standard Unix command line tools, making each one smaller and simpler, and combining them into a single swiss-army-knife style package. Busybox functionally replaces a half-dozen different standard Unix component packages from the Free Software Foundation, but despite having many packages' worth of functionality, the whole of busybox is smaller than any one of the FSF packages it replaces.
Busybox is in many ways a fairly standard open source software project. It's written in the "C" programming language (the lingua franca of operating system infrastructure, known by all systems programmers). It uses standard build commands and tools, and comes with its own documentation. An experienced programmer can more or less hit the ground running when first trying to use or modify busybox, with a minimal learning curve.
Busybox was originally designed for use in small, memory-constrained embedded systems (routers, toasters, microwaves, cell phones, etc.), and later became useful for booting an entire working Linux system from a single 3.5" floppy disk. Now, after a lot of development and testing in such smaller environments, busybox is gaining popularity as a general replacement for the comparatively bloated and unwieldy older implementations of the standard Unix toolkit (from the politically charged Free Software Foundation). Some people, myself included, believe that doing the same amount of work in half as many lines of computer code is an improvement in and of itself. Not only is brevity the soul of wit, but it's also more efficient and easier to maintain. (In this context, "maintain" is a term of art meaning to debug, enhance, support, and generally understand.)
I personally got involved with Busybox when I tried to use it for my own purposes as a replacement for the standard utilities on my Linux laptop, and found some of its capabilities incomplete and others inadequate. Specifically, the "sed" (stream editor) command, which searches for and replaces text (and performs other, similar modifications to text files) in an automated fashion, didn't work with some of the automated scripts I had access to. Busybox's sed was missing features, and some of the features it did implement didn't work in my test cases. Sed is used extensively for the automatic configuration of other software packages, which detect which of the many different hardware platforms they're being compiled on and adjust themselves appropriately. Using busybox as part of a general software development environment was a more ambitious use than powering a microwave, and exposed limitations in the existing implementation. Although many people besides myself wanted to use busybox in a development environment, they didn't have the time or energy to address its limitations. I did, so I got my hands dirty fixing busybox's "sed" implementation, which was maintained by an Australian named Glenn McGrath.
Fixing Busybox's "sed" was a comparatively easy project, because the sed command is well understood and documented as part of the Single Unix Specification. The sed specification is available on the internet at "http://www.opengroup.org/onlinepubs/007904975/utilities/sed.html", and although I had to learn a lot about how sed worked, it was mostly a question of reading the documentation and trying it out. And whenever I got stuck, I could ask Glenn, who could explain how busybox's existing sed implementation worked.
By the time I'd come to a reasonable stopping point with sed, I'd gotten to know the busybox development community, and was fairly regularly communicating with them in private email, on the mailing list, and through real-time chat (IRC). I'd also learned some of the project's cultural norms, such as the fact that from my perspective the development team was largely nocturnal (partly because of the number of people from Australia and Russia, but also because of the preferences of the US programmers in Utah and Louisiana). Since most of my free time was from 10 pm to 3 am Central time anyway (after class), this worked just fine for me.
So with a bit of momentum built up after the sed project, I was ready to tackle a new aspect of busybox.
Bzip is a data compression program. Data compression is a technique for removing redundant information from a file, in a way that can be easily reversed. The "compressed" version is represented in a more compact format (which takes up less space), from which a corresponding decompression program can recreate the original data.
The process is similar to translating a document from one language to another language, one which you don't personally understand but in which all the words are much shorter. Then when you need to use it, you translate it back to something you can understand.
Of course not all data has enough redundancy to be compressible, and some has so little redundancy that the translated version can actually be slightly larger than the original. But most types of commonly used information, such as text, graphics, and sound, are very compressible.
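For the technically inclined reader, here is a minimal run-length coding sketch in C (purely illustrative, and not busybox's or bzip's actual format): each run of a repeated byte becomes a (count, byte) pair, and decoding expands the pairs back into an exact copy of the original.

/* Minimal run-length coding sketch: illustrative only, not bzip's format. */
#include <stdio.h>
#include <string.h>

/* Encode src into (count, byte) pairs; returns the number of bytes written. */
static size_t rle_encode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && src[i + run] == src[i] && run < 255) run++;
        dst[out++] = (unsigned char)run;  /* run length, 1..255 */
        dst[out++] = src[i];              /* the repeated byte   */
        i += run;
    }
    return out;
}

/* Expand the (count, byte) pairs back into the original data. */
static size_t rle_decode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        for (int k = 0; k < src[i]; k++) dst[out++] = src[i + 1];
    return out;
}

int main(void)
{
    const unsigned char msg[] = "aaaaaabbbcccccccc";
    unsigned char packed[64], unpacked[64];

    size_t plen = rle_encode(msg, strlen((const char *)msg), packed);
    size_t ulen = rle_decode(packed, plen, unpacked);

    printf("original %zu bytes, packed %zu bytes, round trip %s\n",
           strlen((const char *)msg), plen,
           (ulen == strlen((const char *)msg) &&
            memcmp(msg, unpacked, ulen) == 0) ? "matches" : "differs");
    return 0;
}

Real compressors layer far more sophisticated transformations on top of ideas like this, but the essential property is the same: the translation must be fully reversible, so the round trip reproduces the original data exactly.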
The most common type of data compression program for Linux is "gzip", which is based on an algorithm called "deflate". Various "zip" programs for Windows are based on the same deflate algorithm, and it has many other common uses, such as in graphics files and certain types of internet data transmission. Deflate is fast and relatively efficient as compression algorithms go, so it is very widely used.
Bzip uses a very different algorithm, which produces a different type of file in a different "compressed language". A file created by bzip must be decompressed by bunzip; the traditional deflate/inflate code of zip and gzip wouldn't know what to do with it. The reason to use bzip is that it usually gets better compression than gzip, producing smaller files. The downside is that it requires more computing power to do so, taking much longer to process a file and needing more temporary memory while it works. Nevertheless, for long-term storage (backup files, software distribution through the internet, etc.), an extra 10 to 15 percent space savings is often worth taking four or five times as long to compress a file, especially as computers get faster. Thus bzip has become very popular for long-term storage of files.
Thanks to Glenn McGrath, busybox was already capable of decompressing a bzip file. Glenn had added bunzip to busybox after cleaning up the original "bunzip" source (written by a man named Julian Seward), removing and simplifying a lot of unnecessary code, and trimming the result to half its original size (i.e. half the size of Julian's version). Glenn's smaller version was still quite capable of recreating the original data from a bzip-compressed file.
But Glenn hadn't added the corresponding compression support, so although busybox could extract copies of the original files out of a bzip-compressed archive, it had no command to create those compressed archives; they had to come from elsewhere. This halfway status was kind of annoying: if I relied only on busybox I couldn't create bzip files, but if I installed the original bzip package, that came with its own decompression support, making busybox's version redundant. When putting together a system, I would either be doing without an important command or be including the same command twice.
On October 3, as I was finishing up my improvements to the "sed" command, I asked Glenn if he planned to add the missing bzip compression functionality to busybox. He said he had too much else to do, so I decided to do it myself. (See message 1.) In theory, it was a simple matter of copying over the functions from Julian Seward's code, and performing the same kinds of cleanups Glenn had done to make it smaller and simpler. To see what kind of cleanups those had been, I started reading through the bunzip code in busybox.
My own work on bunzip started out as a follow-up to Glenn's work, a similar non-intrusive cleanup nibbling at the edges of the code but leaving the central engine unmodified. I did this because I didn't understand how the decompression engine worked, and didn't want to introduce subtle flaws by making changes I didn't understand. (Such flaws might not show up in my testing, but could corrupt the data of someone else who used it months later, by producing a bad translation that didn't accurately reproduce the original.) What I was really trying to learn was how to bolt the corresponding compression code onto busybox, looking at the decompression code for a model of how to do it.
By the end of October 5, I'd figured out where most of the mess came from. (See message 2.) The technical details aren't important, but as I summed up at the time, "My brain hurts now." It became clear to me that the original bunzip code had been written by an academic mathematician rather than an engineer. He understood the mathematics of data compression, but was a very clumsy and over-cautious programmer who produced confusing and repetitive computer code.
Julian Seward is intelligent, but he treats C as a foreign language. You'd expect to find a math-to-C dictionary on his desk. The bzip code could be described as "Pidgin C", barely intelligible to programmers. It would actually have been easier if bzip were just badly written, since then you could throw most of it out and start over. But this wasn't quite the case. There were many places where subtle and clever things were being done, hidden among places where clumsy things were being done. At each non-obvious twist, you couldn't just assume he was being clumsy; you had to prove he wasn't being subtle. It was maddening to read.
I'd assumed that the limited nature of Glenn's early cleanup work had been due to a lack of time, and that undoubtedly contributed to it. But I quickly determined that the engine itself was twice as large as it needed to be, and even the simple things it was doing were handled in an over-complicated, roundabout manner. The point of busybox was small, clean, simple implementations of things, and this just wasn't anything like that ideal.
This was a very different undertaking than the earlier sed cleanup. With sed, there were a number of experts to fall back on. But nobody in the busybox project really understood how bzip worked. Glenn had simply taken a chunk of Julian Seward's code, cleaned it up a bit, and bolted it on. Julian Seward himself was not involved in the busybox project (although I did have a brief email exchange with him about licensing issues). This time, I was mostly on my own.
I could have left well enough alone; what was there worked, and you could say "if it ain't broke, don't fix it". But by that metric, the busybox project itself would never have been started. Busybox in general was simple and maintainable, but the bzip engine was not. Although I'd tried very hard to avoid digging into the main mess, I eventually decided my wrapper cleanups were just rearranging the deck chairs on the Titanic. If I was going to do it at all, I needed to do it right.
That evening (message 3), I finished and sent in my non-intrusive polishing of the wrapper code, and set to work reading the rest of the file.
Even when tackling the main body of bunzip, I had initially hoped I could take the bunzip engine apart, clean it out, and put it back together without having to actually learn too much about how it worked. Maybe if I removed the obvious problems I'd noticed, the rest of it would be clear and understandable and well designed. This turned out not to be the case. At many points, despite being able to see what the computer would do with the code, I had no idea _why_.
I understood some data compression theory already. Many years earlier, I'd played around with other data compression programs. I'd once translated the "deflate" algorithm (used by zip and gzip) from the "C" programming language to another language called "Java", and years before that I'd read a magazine article on the basic compression techniques "Huffman coding" and "run-length compression", although I was quite rusty with them. Bzip used both of those techniques, but in a novel way and in combination with new compression technology I'd never seen before.
Still, what little experience I had with data compression put me ahead of the other people on the busybox team, who hadn't dealt with it at all, so I set to work reading Julian's code. I put three or four hours an evening into it for a week, just to get through the impenetrable way Julian had expressed himself to the computer, and translated the original code as I went along into something I understood. The process was very similar to taking an impenetrable academic text written by a long-winded professor who's being paid by the word, and producing a "Cliff's Notes" version. I basically took such copious notes that I wound up paraphrasing the original work entirely. The hard part was understanding what any part of the original actually meant.
The two techniques I'd previously encountered, Huffman coding and run-length encoding, were comparatively simple. I remembered enough about them to recognize them in Julian's bunzip source code, although there was a lot of head-scratching and experimentation along the way. I also added a lot of instrumentation to the original messy-but-working version to figure out what the various components did and how they all fit together. Doing so, I mapped out the general infrastructure (the "plumbing" of the data flow), and identified two instances of run-length encoding and the Huffman coding infrastructure.
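To give a flavor of the first of those techniques, the following toy C program (a sketch of Huffman's general method using made-up symbol frequencies, not bzip2's actual table-generation code) repeatedly merges the two least frequent symbols into a tree and prints the resulting code lengths. Frequent symbols end up with short codes and rare symbols with long ones, which is where the space savings comes from.

/* Illustrative Huffman code-length builder: merge the two least frequent
 * nodes until one tree remains, then read code lengths off the tree. */
#include <stdio.h>

#define SYMS 5
#define NODES (2*SYMS - 1)

struct node { long freq; int left, right; };  /* left/right == -1 for a leaf */

static void assign_lengths(struct node *n, int i, int depth, int *len)
{
    if (n[i].left < 0) { len[i] = depth; return; }  /* leaf: record its depth */
    assign_lengths(n, n[i].left, depth + 1, len);
    assign_lengths(n, n[i].right, depth + 1, len);
}

int main(void)
{
    /* Hypothetical frequencies for symbols 0..4. */
    struct node n[NODES] = {{45,-1,-1},{13,-1,-1},{12,-1,-1},{16,-1,-1},{9,-1,-1}};
    int used[NODES] = {0}, len[NODES] = {0};
    int count = SYMS;

    /* Repeatedly merge the two lowest-frequency unused nodes. */
    while (count < NODES) {
        int a = -1, b = -1, i;
        for (i = 0; i < count; i++) {
            if (used[i]) continue;
            if (a < 0 || n[i].freq < n[a].freq) { b = a; a = i; }
            else if (b < 0 || n[i].freq < n[b].freq) b = i;
        }
        used[a] = used[b] = 1;
        n[count].freq = n[a].freq + n[b].freq;
        n[count].left = a;
        n[count].right = b;
        count++;
    }

    assign_lengths(n, NODES - 1, 0, len);
    for (int i = 0; i < SYMS; i++)
        printf("symbol %d: frequency %ld, code length %d bits\n",
               i, n[i].freq, len[i]);
    return 0;
}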
I also isolated a third technique, which was completely new to me. I could see what it was doing, but had no idea why. With no one to ask, I had to treat that part of the code as a magic incantation that had to be repeated more or less as-is, and clean up around it as best I could. I'd encountered a similarly unintelligible portion in the Huffman table generation, but there I'd known what the end result was supposed to be and what the starting point meant, and been able to simplify the bit in between with algebra-like transformations that resulted in provably equivalent, much smaller code. But this new technique was a brick wall: I had no idea what it was trying to accomplish, even after watching it work.
Luckily, I eventually stumbled across a comment in the code that mentioned that the new technique was called a "Burrows-Wheeler transform", and that was all I needed. The name allowed me to search the internet and find some excellent papers explaining it (including a September 1996 article from Dr. Dobb's Journal, and the original paper from the DEC researchers who invented the technique). As I posted on October 14th:
Ignore the Miss Piggy quote (I.E. "I don't understand any of this" undoing the burrows-wheeler transform). I found an article explaining the Burrows-Wheeler transform (in Dr. Dobbs Journal Sep 1996, article available at http://dogma.net/markn/articles/bwt/bwt.htm), and also the original DEC paper on the subject, which you can get as a PDF: http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html
This allowed me to get unblocked, and finally produce a "Cliff's Notes" version I felt confident in, because I understood what all the parts were doing. The above quote was part of the message where I shared the early results of my work with the rest of the busybox development team.
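As a rough illustration of what those papers describe (and emphatically not the heavily optimized sorting machinery bzip2 actually uses), the following C sketch computes a naive forward Burrows-Wheeler transform: sort every rotation of the input, keep the last column, and remember which row held the unrotated original. Run on "banana" it prints "nnbaaa" and index 3. The transform doesn't compress anything by itself, but it clusters identical bytes together, which makes the run-length and Huffman stages that follow far more effective.

/* Naive forward Burrows-Wheeler transform, for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;
static size_t len;

/* Compare two rotations of `text` identified by their starting offsets. */
static int rot_cmp(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < len; k++) {
        unsigned char ca = text[(i + k) % len], cb = text[(j + k) % len];
        if (ca != cb) return ca - cb;
    }
    return 0;
}

int main(void)
{
    text = "banana";                 /* sample input */
    len = strlen(text);
    size_t *rot = malloc(len * sizeof *rot);
    if (!rot) return 1;
    for (size_t i = 0; i < len; i++) rot[i] = i;

    qsort(rot, len, sizeof *rot, rot_cmp);

    size_t primary = 0;              /* row holding the unrotated input */
    printf("last column: ");
    for (size_t i = 0; i < len; i++) {
        if (rot[i] == 0) primary = i;
        /* Last character of the rotation starting at rot[i]. */
        putchar(text[(rot[i] + len - 1) % len]);
    }
    printf("\nprimary index: %zu\n", primary);
    free(rot);
    return 0;
}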
My "micro-bunzip" implementation initially ran more slowly than the original version, because I'd gone for simplicity and understandability over speed, and removed some of the tangled optimizations Julian Seward had knitted into his code. I explained my reasons for this on October 14th:
It's still a little over 3 seconds slower extracting the 2.6.0-test6 kernel tarball (81.5 seconds vs 78 seconds for the old one on my laptop), but oh well. I'm tired. Ship it.
Half the reason I did this was so other people could bang on it without divine intervention. Time for them to figure out how to optimize the thing. (Although I do suspect the memory usage could definitely be brought down from 4 megs. It's those top 3 bytes of dbuf, used for the burrows-wheeler sorting vector, that's eating most of it. But I'm not awake enough to think about it right now...)
As I'd suspected, a man named Manuel Novoa (who lives in Louisiana) stepped up and made his own improvements to my bzip version, speeding it up beyond the original. I coordinated with him, and with Glenn, to eventually produce a version that was both smaller and faster than the original. My simple version had the virtue of being understandable, which opened it up to being improved upon by the existing Busybox team.
Analysis

Computer programmers have a very different social organization from a traditional classroom. We're a lot like scientists heading off to the lab, working alone for long periods of time and then coming back to discuss it only when we're stuck or have already gotten past the difficult bit. Even our early learning is commonly done out of books, interacting with a computer rather than with other human beings.
Vygotsky's social arguments about learning requiring a cultural context are not obvious in the computer world, because so much of what we do didn't exist previously. (Ten years ago busybox did not exist; twenty years ago the Burrows-Wheeler transform did not exist; and thirty-five years ago Unix did not exist.) Most of us extend the edges of our knowledge without being led by others, because we're going into areas where others have never been.
Also, our "culture" is global. Chinese programmers, german programmers, and american programmers share common programming languages (and tend to discuss technical subjects in english, even when it's not the speaker's native language). So whether we were raised in single parent households or in a kibbutz matters primarily in whether or not we successfully gain entry to the world of programming, but matters much less once we're on the inside.
That said, there is a large existing body of knowledge we build on, and there is definitely a community of "open source" developers on the internet. And the point of most open source computer programs is that other people use them. (Why else bother to release them onto the internet?) The communities that develop them are fan clubs organized around the project, no different than model rocket enthusiasts or the people who organize science fiction conventions. Developers show off their new work around the virtual water cooler, "let me tell you about what I did last week..."
This goes to motivation. A lot has been written on Open Source motivation, most notably the book "The Cathedral and the Bazaar" by Eric Raymond. And computer programming promotes mastery orientation rather than performance orientation, since results are produced by repeated interaction with a machine, and only presented to another human when ready. But the other side of motivation is what kind of friction you're up against: it takes a lot of motivation to force yourself to do something hard or unpleasant. This brings us back to the zone of proximal development.
In my case, my zone of proximal development intersected the bunzip applet even better than that of Glenn, the person who originally contributed the applet to the project. Fiddling with bunzip wasn't as hard or unpleasant for me as it was for Glenn. The data compression work I'd done years earlier had prepared me for this project. The articles on the Burrows-Wheeler transform had been available on the internet for years, but Glenn hadn't delved deeply enough into the code to even identify the need for them.
I also had experience cleaning up more-or-less working but unintelligible computer code that was too brittle to modify. (I.e., it broke if you made even small changes, and nobody knew why.) My big preparation for this was a two-year stint at IBM, where I cleaned up an entire project that had deteriorated into a terrible state as a parade of temporary workers made various changes to it without staying long enough to learn what had come before. Eventually the manager in charge demanded that a permanent worker be assigned to stick around long enough to smooth the various cracks and seams into a coherent whole. The bunzip code was quite small in comparison to that long-ago IBM project.
So for me, cleaning up bunzip wasn't as novel as it would have been for other people. It was a chance to update my old knowledge of data compression, and a type of cleanup effort I knew how to tackle. Others in the project had skills to contribute which I didn't (such as Manuel's speed enhancements), but they couldn't apply those skills until my work brought the opportunity into their sphere of competence. Knowing that others were ready to build on my work, and that I was in a unique position to enable their further work, was very motivating to me.
Vygotsky's Zone of Proximal Development is usually viewed in the context of socio-constructivism, but I personally think this joining is somewhat artificial. Reading the Prawat and Floden article, for example, I got the impression of two distinct parts to Vygotsky: the zone of proximal development (what the learner is ready to learn), and socio-constructivism (the idea that knowledge creation is a shared rather than an individual experience). To me, the ZPD is an extension of the concept of prior knowledge. It's the area in which learners have the relevant background knowledge to understand new things without being held back by a missing foundation. Whether or not they interact with someone else to navigate the zone is a separate issue. In theory, they have the potential to figure it out on their own, presented with the correct problem or series of problems. They may go much more slowly, with no guarantee of success, but it is possible. Many people have taught themselves to play piano without lessons, or without even learning to read music.
Similarly, I remember a number of assumptions in the Bonk article, where the focus on culture and environment and their effect on the learner totally eclipses the fact that the learner adapts to the learning. Many times, claims that learning is most effective when differences are taken into account are really claims that teaching is most effective. Teaching and learning are two different things. Things can be learned without being taught.
The social nature of development is not universal: if it were, nobody would ever learn anything new. Gregor Mendel's work was not valued in his lifetime; he worked in isolation without collaborators, and much of his research was burnt after his death. Mozart performed his first concert at the age of six. The mathematical prodigy Blaise Pascal taught himself algebra as a child to help his tax-collector father (who initially thought his child was possessed by demons when he found out). Leonardo da Vinci basically had no peers, and designed a helicopter centuries before anyone else even conceived of one.
Scientists usually try to filter out cultural factors with double-blind experiments and the scientific method. Pointing out the prevalence of human artifacts in the centuries of existing knowledge we try to build upon is about as useful as pointing out that a human is always the one doing the learning. It may be true, but it doesn't help. Many important mathematical discoveries were made by people who thought the earth was flat. The first people to propose that the earth went around the sun did so because they worshipped the sun as a god. The progress of science through the ages has largely come from filtering out the political and religious assumptions that cling to unrelated knowledge. Perhaps socio-culturalism is basically a warning that knowledge tends to be tainted with unrelated beliefs, and that we should be on our guard against this. It seems to me that socio-culturalism celebrates what the scientific method is designed to filter out.
So I am a fan of the concept of the Zone of Proximal Development, without being a fan of socio-culturalism. Just because one person had both ideas does not mean that every idea that person had is equally right or equally wrong. I agree that there is a mental space through which a learner is well prepared to proceed at any given time, with all their background knowledge needs met, ready to face a given problem.
But I don't find viewing this background knowledge in terms of a culture to be a helpful metaphor.
Subject: Re: [BusyBox] bzip2 compress support in busybox?
From: Rob Landley
To: Glenn McGrath
CC: busybox@codepoet.org

On Friday 03 October 2003 02:58, Glenn McGrath wrote:
> I did the orginal commit for bzip2, i only concentrated on the
> decompression stuff because i figured thats wwhat it would mostly be
> used for.
>
> As long as there is an option to only support decompression and that
> adding compression doesnt increase the decompression size
> (without good reason) then you have my support.

Yup, the plan is to make "bzip2" be a separate menuconfig option. (And fix up tar to support "tar cj" when bzip2 support is enabled. To-do list item added...

> Glenn

Rob
Subject: [BusyBox] Start of bunzip2 cleanup.
From: Rob Landley
To: busybox@codepoet.org, bug1@optushome.com.au
Date: Sun, 5 Oct 2003 03:40:29 -0500

In preparation for adding bzip compression support, I've been reading through the bzip2 support, and decided cleaning it up would be a good thing. (If nothing else, the gratuitous malloc and free of three different structures for every block of compressed data is just evil.)

My original goal was to avoid touching anything in BZ2_bzDecompress() and up, and just clean up the wrappers around them (as Glenn already started to do). But then I hit the tangle of incestuous pointers between DState, bz_stream, and bzFile. if bzFile goes away and becomes local variables in uncompressStream (which it's sort of crying out for), then the strm member of DStream should logically be an inline bz_stream structure rather than a pointer to one, which also allows the state member of bz_stream to go away since that points to the DState it lives in. Each DState has exactly one bz_stream associated with it, and each bz_stream has exactly one DState. (Each bzFile has one of each, but that's an optional wrapper.) The fact bz_stream and DState are seperately allocated and have pointers to each other is a bad sign. This patch untangles them. (It's not really intended to reduce code size in and of itself, just make further cleanups possible.)

I'd appreciate some additional eyeballs confirming I didn't screw up, since I'm basically touching high-voltage code here. :)

With this patch, bz_stream becomes the first member of DState (I.E. no longer allocated separately), several circular pointer groupings are turned into trees, and all the references that need to be adjusted because of this are adjusted. All this is just basically paving the way for further cleanup. I fixed up code in this patch I intend to rip out soon. But this is the minimal patch that needs to touch things in BZ2_bzDecompress() and above, and does just the adjustment there that's necessary and very little else. (Okay, one whitespace cleanup, the removal of a duplicate #define, and adding the vi tabstop comment to the top of the file. But that's it.)

The really annoying part of this is that the argument passed to BZ2_bzDecompress() becomes a DState pointer instead of a bz_stream pointer. (I could continue to pass the bz_stream pointer and just typecast it, since it's the first member, but that's evil.) Since bz_stream already had to have a member intialized to point to a valid DState anyway, this isn't really anything more than a cosmetic change. But it's still annoying, since this is part of the library api bzip2 documents. (Admittedly, in our code it's a static function, but still. It's the principle of the thing.)

This patch compiled and had no errors doing:

./busybox bzcat ../../newfirmware/base/linux-2.6.0-test6.tar.bz2 | tar tv

The rest of the cleanup I currently have planned shouldn't touch anything in bzDecompress or above, just in the wrapper around bzDecompress. (I.E. bzReadOpen, bzReadClose, and read_bz2 can all be merged into uncompressStream and somewhere between 1/2 and 2/3 of the resulting mess can be just deleted.)

Note: the void * removed from bz_stream was actually a DState pointer, they just declared their structures in the wrong order and didn't have the type available yet.

I moved bzFile down past DState for a similar reason, although the only actual change to bzFile was swapping the DState * for a bz_stream pointer (and a name change to force the compiler to complain about any code that used it that I didn't see). The entire rest of the patch is fallout from this.

Rob
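To illustrate the kind of restructuring the message above describes (using simplified, made-up structure names rather than the real bzip2 ones), embedding one structure as the first member of another replaces two separately allocated objects holding pointers to each other with a single allocation and no circular references:

/* Sketch of inlining one structure inside another; names are hypothetical. */
#include <stdlib.h>

struct stream {            /* stands in for bz_stream */
    unsigned char *next_in;
    unsigned avail_in;
};

struct decomp_state {      /* stands in for DState */
    struct stream strm;    /* embedded inline, no longer a separate allocation */
    int block_size;
    /* ...the rest of the decompression engine's saved variables... */
};

int main(void)
{
    /* One allocation now covers both views of the same object. */
    struct decomp_state *ds = calloc(1, sizeof *ds);
    if (!ds) return 1;

    /* Code that only cares about the stream can still take &ds->strm,
     * and the state it belongs to is simply the enclosing structure,
     * so no back-pointer is needed. */
    struct stream *s = &ds->strm;
    s->avail_in = 0;

    free(ds);
    return 0;
}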
Subject: [BusyBox] Ow. Ouch. Pain.
From: Rob Landley <rob@landley.net>
To: busybox@codepoet.org
Date: Sun, 5 Oct 2003 05:33:04 -0500

I understand what BZ2_decompress is doing now. It's manually simulating a setjmp/longjmp that you can return from and resume later. (THAT's why the big switch statement at the beginning with all the CASE statements inside loops, the ->state member, and why it saves all the local variables into the big DState structure.)

It would be much, much, much, much, much more natural to just have get_bits know how to read data from the input stream (or do a callback or _something_). And have unRLE_obuf_to_output_FAST do the same for output, keeping in mind that the actual output code (with crc calculation) is currently not factored into a function, but is instead block copied three times.

My brain hurts now. A lot.

Rob

(Lemme guess: the guy who wrote bzip2 is an academic mathematician, not a professional programmer?)
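The pattern that message complains about, a hand-rolled resumable state machine, looks roughly like the following toy C sketch (invented names, much simplified): the function returns from the middle of its loop, and a saved state number lets the switch jump back in on the next call, which is why every local variable has to be stashed in the big saved structure instead of living on the stack.

/* Toy resumable state machine, in the spirit of the pattern described above. */
#include <stdio.h>

/* State saved across calls, standing in for the big DState structure. */
struct state {
    int resume_point;   /* which case to jump back into next time */
    int i;              /* loop counter preserved across calls     */
};

/* Each call does one step of work, then "returns from the middle" of the
 * loop by recording where to pick up again on the next call. */
static int step(struct state *s)
{
    switch (s->resume_point) {
    case 0:
        for (s->i = 0; s->i < 3; s->i++) {
            printf("working on item %d\n", s->i);
            s->resume_point = 1;
            return 1;           /* suspend: more work remains */
    case 1:;                    /* resume here on the next call */
        }
    }
    return 0;                   /* finished */
}

int main(void)
{
    struct state s = {0};
    while (step(&s))
        printf("  ...suspended, will resume\n");
    printf("done\n");
    return 0;
}

Done once, this is merely clever; done for an entire decompression engine, it makes the code very hard to follow, which is what the message above is reacting to.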