Introduction

In 2007 I won a fellowship from the Linux Foundation, which was offered in response to widespread complaints about a lack of good Linux kernel documentation. I spent the next seven months focusing on this problem. Unfortunately, stating a problem isn't the same as proposing a solution, and I spent almost half that time just trying to figure out what "fixing it" actually meant, and how to go about it.

The first big realization I came to is that contrary to expectations, the kernel's Documentation/ directory is just the tip of the iceberg. Most kernel documentation lives in web pages, magazine articles, online books, blog entries, wikis, conference papers, audio and video recordings of talks, standards, man pages, list archives, commit messages, and more. The problem isn't a LACK of documentation, it's that no human being can ever hope to read more than a tiny fraction of what's said, written, and recorded about the kernel on a daily basis.

This paper is not an attempt to summarize seven months of reading about why UTF-8 is a good internationalization format or how to implement firmware loading for statically linked device drivers. It's about turning the dark matter of kernel documentation into something you can browse.

The real task is editorial. Only after finding and indexing the existing mass of documentation can anyone figure out whether what's out there for a given topic is complete and up-to-date. You can't fill in the holes without first knowing where they are.

The editorial task

Google is great at finding things, but it doesn't tell you what to search for. There's more to teaching someone English than handing them a dictionary. Creating a proper index of the web's Linux kernel documentation is a huge undertaking. Just keeping it up to date _after_ its creation would be a project on par with any other major kernel subsystem. But without such an index showing where the holes are, writing new documentation to try to fill in those holes tends to reinvent the wheel.

Much of the documentation already out there is repetitive and overlapping. Half-hearted attempts to collate what's available into a single new comprehensive version usually just add one more variant to the pile. An editor who isn't already an expert on a given topic probably won't write a category killer document attracting patches instead of competition. (If you don't feel like contributing to someone else's existing document, why do you expect them to contribute to yours? Will you still be updating your version regularly six months from now?)

More to the point, author and editor are different jobs. Researching and writing new documentation isn't really an editorial task. An editor collects and organizes the submissions of others. (A real editor spends most of their time wading through a slush pile and saying "no" to most of it, or saying "no" to things their assistant editors pass up to them. Sound familiar? This is normal.) Providing a brief summary and a pile of links to existing documents is more effective than trying to become an expert on every single topic, and has the advantage of not making the clutter any _worse_.

The other big editorial problem is keeping documentation up to date. If code is a living thing, its documentation must also be. The best documentation about the innards of the 2.4 kernel is only passably accurate about early 2.6, and in many ways the first 2.6 release has more in common with 2.4 than with 2.6.25. But the "many eyeballs" effect of open source can be diluted by many targets. In a maze of twisty documents, all different, fixes that don't naturally funnel to a central integration and redistribution point get eaten by a grue.

On burning the Internet to a CD

It took me a while to realize the editorial nature of the kernel documentation problem. The obvious place to start when looking for Linux kernel documentation may be Google, but the next most obvious place is the Documentation directory in the kernel source tarball. And that comes with some enormous built-in assumptions.

The kernel tarball is the central repository for kernel source code, so it's easy to assume that Documentation/ is the central repository for all kernel documentation, or could easily be turned into such. This assumption cost me about 3 months.

First of all, the Documentation directory isn't even the only significant source of documentation within the kernel tarball itself. The kerneldoc entries in the source code (used by "make htmldocs") and the kconfig help entries (used by the help option of "make menuconfig") are each significant and completely separate sources of documentation. The files in Documentation/ seldom if ever refer to htmldocs or menuconfig help, and those seldom refer back to Documentation/ (or to each other).

This other information cannot easily be moved into the Documentation directory. The other sources of documentation in the kernel source are usually located near the things they document, and benefit from locality of reference. There's a reason they live where they do, as do over two dozen README files in the source code, the output of "make help", references to IETF RFC documents in source comments, and so on.

In addition, the data formats are different. Documentation/ consists primarily of flat text files, htmldocs uses structured source code comments to generate docbook (and from that HTML or PDF output), and kconfig is in its own format, which has dedicated viewer programs (such as menuconfig).

None of these is really an obvious choice for indexing the others. The flat text of Documentation/ does not lend itself to linking out the way HTML does, so at first glance htmldocs seems a better choice for an index. But the format of htmldocs is highly constrained by its origins as structured source comments; it's designed to do what it's currently doing and not much else. As a potential index, the kconfig help has both sets of disadvantages; it's flat text without hyperlinks and it's highly structured for a conflicting purpose. But of these, the Documentation directory seems the least bad choice.

Organized based on where passing strangers put things down last.

Documentation/ does not compile, give warnings, or break the build. It cannot easily be profiled or benchmarked. Because of this, the normal kernel build process doesn't naturally organize it very well. Here are a few of the files in the top level Documentation directory of the 2.6.25 kernel:

This is a small subset of the ~140 files at the top level, and doesn't include anything in the ~75 different subdirectories for busses, architectures, foreign language translations, subsystems, and so on.

A token attempt at organizing Documentation/ can be found in the 00-INDEX files in each subdirectory, containing a one line description of each file (example: "device-mapper/ - directory with info on Device Mapper."). Some directories have this file, some don't. Some files are listed, some aren't.

00-INDEX is better than nothing, but it mirrors a filesystem hierarchy without symlinks. A file like filesystems/ramfs-rootfs-initramfs.txt belongs both in "filesystems" and in "early-userspace", but it has to pick one.

Even the perennial question "where do I start?" has at least three answers: the oldest and in some ways still the best is the "README" file at the top of the kernel (not in Documentation), the next oldest is Documentation/kernel-docs.txt, and the newest is Documentation/HOWTO. None of them really provide a good introduction to the kernel's source code. For that I recommend Linux Kernel 2.4 Internals (http://www.moses.uklinux.net/patches/lki.html), which is woefully out of date and x86-specific but still the best I've found. It is not in the kernel tarball.

It is possible to clean up Documentation/ (albeit a fairly large undertaking, and I pushed a few patches to this effect which were generally greeted with resounding indifference). It's also possible to convert the Documentation directory to HTML (an even larger project). But ultimately, there's a larger philosophical problem.

My interview process for the Linux Foundation fellowship consisted of writing Documentation/rbtree.txt. Before doing so I pointed out that there was already an excellent article on Red Black Trees in the Linux Weekly News kernel archives, and another article about it on Wikipedia. But they weren't in the kernel tarball and thus (I was told) they didn't count, so I reinvented the wheel to get the job.

Documentation/ is based on the assumption that everything of interest will be merged into the kernel tarball. It already copies standards documents and HOWTOs with defined upstream locations, on the theory that having potentially out-of-date copies in the kernel tarball is superior to having a single canonical location for this information out on the web. The philosophy of Documentation/ is the same as for code: if out of tree drivers are bad, out of tree documentation must be bad.

This is the wrong philosophy for indexing documentation that lives on the web. The web has many formats (from pdf to flash) and Documentation has one. Web content has many licenses, the kernel is GPLv2 only. How does one apply CodingStyle to Linus Torvalds' Google video about the origins of git? The kernel source tarball is currently just under 50 megabytes, the mp3 audio recordings of OLS just for the year 2000 are a little under 90 megabytes.

Exporting kernel tarball docs to http://kernel.org/doc

The kernel-centric Documentation/ in the kernel tarball created a reciprocal problem: Not only did Documentation/ suck at indexing the web, but the web wasn't doing that great at indexing the kernel's built-in documentation either.

In early 2007 I did a Google search for ext2 filesystem format documentation, which didn't bring up Documentation/filesystems/ext2.txt in the first five pages of hits. I didn't even notice that file until a month later (after all, if its Google rank sucks, how good can it be?), because there was no canonical uncompressed location at which to find it on the web, and things like gitweb or the most recent release tarball were too transient to work up much of a ranking for any specific version. (Similarly, there was no standard web location for the current htmldocs, despite that being HTML!)

So one well-defined problem I needed to tackle was exporting the documentation already in the kernel tarball somewhere Google could find it. I requested a page on kernel.org, and received "http://kernel.org/doc". I copied the kernel's Documentation/* to "http://kernel.org/doc/Documentation", set up the "make htmldocs" tools on my laptop, and posted the results to "http://kernel.org/doc/htmldocs". Then I created a script to periodically update this documentation from the kernel repository (http://kernel.org/hg/linux-2.6) and checked this script into a new mercurial repository on my website (http://landley.net/hg/kdocs).

Over the months that followed, I improved my export script to harvest and export much more information from the kernel source. (See http://landley.net/hg/kdocs/file/tip/make/ for the scripts that do all this. If you check out the mercurial repository and run "make/make.sh --long" it'll try to reproduce this directory on your machine. You need mercurial, wget, pdftk, xmlto, and probably some other stuff.)

http://kernel.org/doc/Documentation

The way a web server shows a directory full of files isn't very informative, so I wrote a script that turns the 00-INDEX files in each Documentation subdirectory into a simple HTML index. This had the unfortunate side effect of hiding files in any directory with a 00-INDEX that doesn't list everything, so I wrote a script "make/doclinkcheck.py" that compares the generated HTML indexes against the contents of the directories and shows 404 errors and unindexed files. I sent lots of 00-INDEX patches to linux-kernel trying to fill in some of the gaps, but as of April 2008 doclinkcheck.py shows about 650 files improperly indexed in Documentation.
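The kind of check described above can be sketched in a few lines of Python. This is not the actual make/doclinkcheck.py, just a minimal illustration that works on the 00-INDEX file directly rather than on the generated HTML. It assumes each 00-INDEX entry is an unindented name followed by an indented "- description" line; the function names parse_index and check_directory are made up for this example.

```python
import os


def parse_index(text):
    """Return {name: description} for 00-INDEX style text.

    Assumed layout: an unindented entry name on one line, followed by one
    or more indented lines holding "- description".
    """
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":
            if current is None:
                continue  # description with no entry above it
            piece = line.strip()
            if piece.startswith("-"):
                piece = piece[1:].strip()
            entries[current] = (entries[current] + " " + piece).strip()
        else:
            # Directory entries are listed with a trailing slash.
            current = line.strip().rstrip("/")
            entries.setdefault(current, "")
    return entries


def check_directory(path):
    """Compare a directory's 00-INDEX against what is really on disk.

    Returns (missing, unindexed): names that are listed but absent (these
    would become 404 errors in a generated index) and names that exist
    but are not listed (these would be hidden by a generated index).
    """
    with open(os.path.join(path, "00-INDEX"),
              encoding="utf-8", errors="replace") as f:
        listed = set(parse_index(f.read()))
    actual = set(os.listdir(path))
    listed.discard("00-INDEX")
    actual.discard("00-INDEX")
    return sorted(listed - actual), sorted(actual - listed)
```

Running check_directory() over every subdirectory that has a 00-INDEX, and counting the names in both returned lists, gives the sort of "improperly indexed" total quoted above.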

http://kernel.org/doc/htmldocs

On the htmldocs front, the top level book index created by "make htmldocs" was unfortunate, so I had my script create a better one. I also wrote a quick script to create "one big html file" versions of each "book", and used the old trick that if "deviceiobook.html" is the one big ("nochunks") version, "deviceiobook" at the same location is a directory containing the many small pages ("chunks") version. The top level index lists both versions.
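The naming convention is easy to state in code. The helper below is hypothetical (it is not taken from the export scripts) and assumes the books are published directly under http://kernel.org/doc/htmldocs; it just shows how one book name yields both links for the top level index.

```python
def book_links(book, base="http://kernel.org/doc/htmldocs"):
    """Return the two URLs for one htmldocs book.

    "nochunks" is the one big HTML file; "chunks" is the directory of
    many small pages that sits next to it under the same name.
    """
    return {
        "nochunks": "%s/%s.html" % (base, book),
        "chunks": "%s/%s/" % (base, book),
    }
```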

http://kernel.org/kdocs/menuconfig

The kconfig help text is the third big source of kernel documentation, and the only human readable documentation on several topics, so I wrote make/menuconfig2html.py to parse the kconfig source files and produce HTML versions of the help text.

The resulting web pages organize information the same way menuconfig does. The first page selects architecture, the later pages show config symbols with one line descriptions. The symbol names link to help text extracted from the appropriate Kconfig file.

I attempted to organize it to reduce duplication to produce a "single point of truth" for Google to find easily and hopefully rank high. There are several index pages (since menuconfig shows different menus for different architectures), but each Kconfig file is translated to a single page of help text, and the indexes link to the same translated Kconfig files. Each HTML file is named after the source file it's generated from.
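To give a feel for what extracting help text from a Kconfig file involves, here is a minimal sketch. It is not make/menuconfig2html.py: it handles only "config" and "menuconfig" entries, their quoted prompt, and the indented text after "help" or "---help---", and it ignores menus, choices, and "source" lines entirely. The names parse_kconfig and help_to_html are invented for this example.

```python
import html
import re

ENTRY = re.compile(r"^(?:menu)?config\s+(\w+)")
PROMPT = re.compile(r'^\s+(?:bool|tristate|string|int|hex|prompt)\s+"([^"]*)"')


def indent_of(line):
    """Width of the leading whitespace, counting tabs as 8 columns."""
    expanded = line.expandtabs(8)
    return len(expanded) - len(expanded.lstrip())


def parse_kconfig(text):
    """Return a list of {"name", "prompt", "help"} dicts, in file order."""
    symbols = []
    current = None
    in_help = False
    help_indent = None
    for line in text.splitlines():
        if in_help:
            if not line.strip():
                current["help"].append("")
                continue
            depth = indent_of(line)
            if help_indent is None and depth > 0:
                help_indent = depth  # first help line sets the indentation
            if help_indent is not None and depth >= help_indent:
                current["help"].append(line.expandtabs(8)[help_indent:])
                continue
            in_help = False  # a dedent ends the help text; parse this line
        m = ENTRY.match(line)
        if m:
            current = {"name": m.group(1), "prompt": "", "help": []}
            symbols.append(current)
            continue
        if line.strip() and not line[0].isspace():
            current = None  # menu, choice, source, etc.: not handled here
            continue
        if current is None:
            continue
        m = PROMPT.match(line)
        if m:
            current["prompt"] = m.group(1)
        elif line.strip() in ("help", "---help---"):
            in_help = True
            help_indent = None
    return symbols


def help_to_html(symbols):
    """Render one page of help text, with an anchor per config symbol."""
    parts = []
    for sym in symbols:
        body = "\n".join(sym["help"]).strip()
        parts.append('<h2 id="%s">%s</h2>\n<p>%s</p>\n<pre>%s</pre>' % (
            html.escape(sym["name"], quote=True), html.escape(sym["name"]),
            html.escape(sym["prompt"]), html.escape(body)))
    return "\n".join(parts)
```

Because every symbol gets an anchor named after it, any number of per-architecture index pages can link into the same generated page, which is what keeps the help text itself in one place.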

http://kernel.org/kdocs/readme

The kernel source contains over two dozen README files outside of the Documentation directory. I collected them together into one directory.

The top level README is one answer to the perennial question "where do I start?" Another is "Documentation/HOWTO", or perhaps "Documentation/kernel-docs.txt". A better answer is probably [LINK] Linux Kernel Internals.

http://kernel.org/kdocs/rfc-linux.html

Many comments in the Linux kernel source code reference Internet Engineering Task Force Request For Comments (IETF RFC) standards documents, which live at "http://tools.ietf.org/html". I put together a script to grep the source code for RFC mentions and generate a page that links each RFC to the source files that mention it. (Seemed like a useful thing to do at the time.)
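A rough Python equivalent of that grep looks like the following. It is a sketch, not the script in the kdocs repository. The regular expression, the restriction to .c and .h files, the "rfcNNN" URL pattern under tools.ietf.org/html, and the source_base parameter for linking to source files are all assumptions made for illustration.

```python
import html
import os
import re
from collections import defaultdict

RFC_BASE = "http://tools.ietf.org/html"
# Matches "RFC 791", "RFC791" and "rfc-2460", but not identifiers such
# as rfc1042_header, because of the trailing word boundary.
RFC = re.compile(r"\bRFC[ -]?(\d{1,4})\b", re.IGNORECASE)


def find_rfc_mentions(root):
    """Map RFC number -> set of source files (relative to root) citing it."""
    mentions = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            # latin-1 never fails to decode, whatever bytes the file holds.
            with open(path, encoding="latin-1") as f:
                text = f.read()
            for m in RFC.finditer(text):
                mentions[int(m.group(1))].add(os.path.relpath(path, root))
    return mentions


def render_rfc_index(mentions, source_base):
    """Return an HTML list: each RFC, then the files that mention it."""
    lines = ["<ul>"]
    for number in sorted(mentions):
        links = ", ".join(
            '<a href="%s/%s">%s</a>' % (source_base, html.escape(p), html.escape(p))
            for p in sorted(mentions[number]))
        lines.append('<li><a href="%s/rfc%d">RFC %d</a>: %s</li>'
                     % (RFC_BASE, number, number, links))
    lines.append("</ul>")
    return "\n".join(lines)
```

Pointing find_rfc_mentions() at a checked-out kernel tree and passing the result to render_rfc_index() with the base URL of any browsable copy of the source produces a page of the kind described above.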

http://kernel.org/docs/make-help.txt

If you type "make help", you get a little more documentation, so make.sh puts it on the web.

Indexing kernel documentation on the internet

The next task was mining the internet for documentation and trying to put it in some coherent order. This is a huge undertaking, and I barely scratched the surface. What I did find was overwhelming, and had some common characteristics.

There are lots of existing indexes of documentation. I already mentioned the kernel tarball's Documentation/kernel-docs.txt. Linux Weekly News has an index of all the kernel articles it has written (at http://lwn.net/Kernel/Index/). Linux Journal magazine has online archives going back to its first issue (http://www.linuxjournal.com/xstatic/magazine/archives). The old kernel-traffic website is still up [LINK]. The Linux Documentation Project is mostly about userspace documentation.

Existing indexes: Linux Documentation Project, lwn.net, linuxjournal.com, ldd, Robert Love's book..., kernel traffic.

Each documentation repository indexes itself. Their aim is to provide new documentation. In the few cases that they reference and organize existing documentation written by others, they usually do so by mirroring it.

The Linux Documentation Project is closest to what I was looking for, but it's mostly focused on userspace.