changeset 579:53f0a143f244

Check in some notes on the elf spec, and update web page header to include links to more documentation.
author Rob Landley <rob@landley.net>
date Fri, 21 Mar 2008 20:21:49 -0500
parents b3f6400d0046
children 7ddeaeed4e94
files www/elfnotes.html www/header.html
diffstat 2 files changed, 195 insertions(+), 0 deletions(-) [+]
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/www/elfnotes.html	Fri Mar 21 20:21:49 2008 -0500
@@ -0,0 +1,193 @@
+<title>Rob's notes-to-self about the ELF file format.</title>
+
+<p>The actual spec is available <a href=http://www.muppetlabs.com/~breadbox/software/ELF.txt>here</a>,
+these are just my notes from reading it:</p>
+
+<p>There are three kinds of ELF files:</p>
+<ul>
+<li>Executables</li>
+<li>Shared libraries (*.so)</li>
+<li>Relocatable files (*.o)</li>
+</ul>
+
+<h2>Elf Header</h2>
+
+<p>Each ELF file starts with an <b>elf header</b>, which (for 32 bit elfses)
+looks like:</p>
+
+<pre>
+struct liv_tyler {
+  uint8_t ident[4];   // Always set to "0x7fELF"
+  uint8_t class;      // 1==32 bit, 2==64 bit
+  uint8_t endian;     // 1==little endian, 2==big endian
+  uint8_t version1;   // Set to 1
+  uint8_t pad[9];     // Unused (set to 0)
+  uint16_t type;      // 1==relocatable.o, 2==executable, 3==sharedobject.so
+  uint16_t machine;   // 2==sparc,3==x86,4==m68k,10==mips
+  uint32_t version2;  // Set to 1
+  uint32_t entry;     // Process entry point (virtual address)
+  uint32_t phoff;     // Program Header table (file offset, usually ehsize or 0)
+  uint32_t shoff;     // Section header table (file offset)
+  uint32_t flags;     // "processor-specific flags", whatever that is.
+  uint16_t ehsize;    // Set to 52 (elf header's size in bytes.  Why?)
+  uint16_t phentsize; // sizeof(program header table entry) == 32
+  uint16_t phnum;     // Count of entries in program header table (0 for *.o).
+  uint16_t shentsize; // sizeof(section header table entry) == 40
+  uint16_t shnum;     // Count of entries in section header table.
+  uint16_t shstrndx;  // Index in section header table of .shstrtab
+} eheader;
+</pre>
+
+<p>The "endian" byte indicates the endianness of the rest of the data.  The
+default entry value for x86 is 0x8048000.</p>
+
+<p>The above refers to three other data structures:</p>
+<ul>
+<li>The <b>program header table</b> - used to load program into memory and execute it.</li>
+<li>The <b>section header table</b> - used by compiler, linker, objdump, gdb...</li>
+<li>The <b>section header string table</b> - we'll get to this later.</li>
+</ul>
+
+<h2>Program Header Table</h2>
+
+<p>Executables and shared libraries have a program header table, *.o files do
+not (which is why you can't run 'em).  When an ELF file has a program header
+table, it usually starts right after the ELF header, I.E. eheader.phoff == 52.
+For *.o files both phoff and phnum are zero.</p>
+
+<p>The program header table is an array of program header structs, each of which
+describes a chunk of the file relevant to actually executing a program with
+this file.  The ELF spec says it must come before any other loadable segment
+in the file, although it doesn't say why.</p>
+
+<p>Each of the program header structs looks like:</p>
+
+<pre>
+struct orlando_bloom {
+  uint32_t type;    // LOAD=1, DYNAMIC=2, INTERP=3, NOTE=4, PHDR=6
+  uint32_t offset;  // Starting location of data in file.
+  uint32_t vaddr;   // virtual address to map the segment into memory at
+  uint32_t paddr;   // physical address (unused in Linux?)
+  uint32_t filesz;  // Number of bytes to load from file.
+  uint32_t memsz;   // Number of bytes to allocate in memory.
+  uint32_t flags;   // Or together: execute=1, write=2, read=4
+  uint32_t align;   // loader aligns vaddr to this, must be power of 2.
+} pheader;
+</pre>
+
+<p>If memsz > filesz then the memory at the end is zeroed.  (If memsz < filesz
+your ELF file is broken.)  The flags say what permissions to mmap it with.</p>
+
+<ul>
+
+<li><p>An INTERP program header (there can be only one per file, presumably
+Connor McLeod) indicates the dynamic linker, and the first thing an exec() call
+does is look for one of these and if it finds it, loads that program instead and
+passes it a filehandle to this one.  (Look at the uClibc dynamic linker source
+to see how that works.)</p></li>
+
+<li><p>A LOAD segment gets mapped into the program's address space.  (This is
+more or less a "normal" segment.)  The loader basically does:
+<blockquote>
+  mmap(ph.vaddr, ph.filesz, ph.flags, MAP_PRIVATE, fd, ph.offset)
+</blockquote>
+Plus the bit about extra zeroed memory at the end, if any.</p></li>
+
+<li><p>A DYNAMIC segment refers to something this program need to get out of a
+shared library.  The dynamic linker needs to fix these up before the program
+can run, which means static executables can't have any DYNAMIC
+segments.</p></li>
+
+<li><p>A NOTE section is a comment.  Don't bother.  Humans seldom read ELF
+files by hand.</p></li>
+
+<li><p>A PHDR section points back to the program header itself, showing where
+it lives in the file.  This makes life easier for debuggers, but is not
+actually required.</p>
+
+<h2>Section Headers:</h2>
+
+<p>The section header table is an array of these structures:</p>
+<pre>
+struct cate_blanchett {
+  uint32_t name;  // Name of section (index into section header string table)
+  uint32_t type;  // Type of this section (see below)
+  uint32_t flags; // Or together: 1 writeable, 2 occupies memory, 4 executable
+  uint32_t addr;  // Virtual address at which to map this section (0 if none)
+  uint32_t offset;// Start of section in file.
+  uint32_t size;  // Section's length in bytes
+  uint32_t link;  // varies based on type (usually section header index of
+                     a related string or symbol table)
+  uint32_t info;  // varies based on type
+  uint32_t addralign; // Alignment (must be power of 2)
+  uint32_t entsize;   // Size of entries (or 0)
+}
+</pre>
+
+<p>The "type" field can be one of:</p>
+<ul>
+<li>1 = <b>PROGBITS</b> - binary data, just map it in.</li>
+<li>2 = <b>SYMTAB</b> - Imports symbol table (what we need to run).</li>
+<li>3 = <b>STRTAB</b> - String table</li>
+<li>4 = <b>RELA</b> - Relocation entries "with explicit addends"</li>
+<li>5 = <b>HASH</b> - Symbol hash table (for dynamic linking)</li>
+<li>6 = <b>DYNAMIC</b> - Exports symbol table (we are a shared library)</li>
+<li>7 = <b>NOTE</b> - Comment</li>
+<li>8 = <b>NOBITS</b> - zeroed data (bss)</li>
+<li>9 = <b>REL</b> - Relocation entries "without explicit addends".  (What's an addend?)</li>
+<li>11 = <b>DYNSYM</b> - Brief version of SYMTAB (just the symbols the linker
+actually needs.)</li>
+</ul>
+
+<p>A section index is an index into the array of section headers.  Special
+indexes that _don't_ point into the array are:</p>
+
+<ul>
+<li><b>SHN_UNDEF</b> (0) - Undefined symbol</li>
+<li><b>SHN_ABS</b> (0xfff1) - Symbols living at an absolute address.</li>
+<li><b>SHN_COMMON</b> (0xfff2) - "common" symbols (Unallocated globals?  What?)</li>
+</ul>
+
+<p>Note that the first section header (0, also known as SHN_UNDEF) exists but
+can never be used and has all zeroed fields.  It's wasted space.  Bad spec
+developers, no biscuit!  (I wonder if anything actually _refers_ to it, and if
+not can I just cheat and omit it?  Not that tinycc really optimizes the size of
+the output binaries but come on guys, this is sad and pointless.  Maybe exec
+or ld-linux.so uses it internally?)</p>
+
+<h2>String Table</h2>
+</p>A STRTAB is a bunch of null terminated strings one after the other, the
+first of which is zero length (always starts with a NULL byte at position 0).
+The last string must be null terminated, can't end the section short.  (As a
+special case, a zero length STRTAB is allowed, and referring to offset zero in
+it refers to an empty string.)</p>
+
+<p>The "shstridx" field in the elf header is the index of a section header with
+a STRTAB for the section names.  The "name" field in each section header is
+an index into that string table.</p>
+
+<h2>Symbol Table</h2>
+
+<p>The symbol table is another array.  Executables don't actually need it
+(unless they contain debugging information), but object files and shared
+libraries do.</p>
+
+<p>The structure for symbol tables is:</p>
+
+<pre>
+typedef hugo_weaving {
+  uint32_t name;  // Index into the object string table (in sht.link), 0=none
+  uint32_t value; // Value of the symbol
+  uint32_t size;  // Size of symbol (0 for none or unknown)
+  uint8_t  info;  // High 4 bits symbol binding: 0=local, 1=global, 2=weak
+                  // Low 4 bits symbol type: data=1, func=2, section=3, file=4
+  uint8_t  other; // Set to 0
+  uint16_t shndx; // Section header table index for this symbol's section.
+};
+</pre>
+
+<p>And again, the spec insists that the first entry in each symbol table be
+wasted and zeroed out.</p>
+
+</pre>
+</blockquote>
--- a/www/header.html	Thu Mar 20 17:25:10 2008 -0500
+++ b/www/header.html	Fri Mar 21 20:21:49 2008 -0500
@@ -16,6 +16,8 @@
     <li><a href="index.html">News</a></li>
     <li><a href="/hg/tinycc/raw-file/tip/README">README</a></li>
     <li><a href="differences.html">Differences</a></li>
+    <li><a href="tinycc-doc.html">Out of date internals documentation</a></li>
+    <li><a href="elfnotes.html">Notes on the ELF Spec</a></li>
   </ul>
   <b>Download</b>
   <ul>