Source: arch/i386/kernel/setup.c
The call graph for this function is shown in Figure 2.3. This function gathers the information needed by the boot memory allocator to initialise itself. It is broken up into a number of distinct tasks.
 991 static unsigned long __init setup_memory(void)
 992 {
 993         unsigned long bootmap_size, start_pfn, max_low_pfn;
 994 
 995         /*
 996          * partially used pages are not usable - thus
 997          * we are rounding upwards:
 998          */
 999         start_pfn = PFN_UP(__pa(&_end));
1000 
1001         find_max_pfn();
1002 
1003         max_low_pfn = find_max_low_pfn();
1004 
1005 #ifdef CONFIG_HIGHMEM
1006         highstart_pfn = highend_pfn = max_pfn;
1007         if (max_pfn > max_low_pfn) {
1008                 highstart_pfn = max_low_pfn;
1009         }
1010         printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
1011                 pages_to_mb(highend_pfn - highstart_pfn));
1012 #endif
1013         printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
1014                 pages_to_mb(max_low_pfn));
1018         bootmap_size = init_bootmem(start_pfn, max_low_pfn);
1019 
1020         register_bootmem_low_pages(max_low_pfn);
1021 
1028         reserve_bootmem(HIGH_MEMORY, (PFN_PHYS(start_pfn) +
1029                 bootmap_size + PAGE_SIZE-1) - (HIGH_MEMORY));
1030 
1035         reserve_bootmem(0, PAGE_SIZE);
1036 
1037 #ifdef CONFIG_SMP
1043         reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
1044 #endif
1045 #ifdef CONFIG_ACPI_SLEEP
1046         /*
1047          * Reserve low memory region for sleep support.
1048          */
1049         acpi_reserve_bootmem();
1050 #endif
1051 #ifdef CONFIG_X86_LOCAL_APIC
1052         /*
1053          * Find and reserve possible boot-time SMP configuration:
1054          */
1055         find_smp_config();
1056 #endif
1057 #ifdef CONFIG_BLK_DEV_INITRD
1058         if (LOADER_TYPE && INITRD_START) {
1059                 if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
1060                         reserve_bootmem(INITRD_START, INITRD_SIZE);
1061                         initrd_start =
1062                                 INITRD_START ? INITRD_START + PAGE_OFFSET : 0;
1063                         initrd_end = initrd_start+INITRD_SIZE;
1064                 }
1065                 else {
1066                         printk(KERN_ERR "initrd extends beyond end of memory "
1067                                 "(0x%08lx > 0x%08lx)\ndisabling initrd\n",
1068                                 INITRD_START + INITRD_SIZE,
1069                                 max_low_pfn << PAGE_SHIFT);
1070                         initrd_start = 0;
1071                 }
1072         }
1073 #endif
1074 
1075         return max_low_pfn;
1076 }
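To make the page frame arithmetic concrete, such as the rounding at line 999 and the PFN_PHYS() conversion at line 1028, the following is a minimal standalone sketch of PFN_UP()-style macros. The macro bodies and the physical address used are assumptions modelled on the usual 2.4 i386 definitions, not copied from the kernel source.

/*
 * Standalone sketch of the page frame number (PFN) arithmetic used by
 * setup_memory(). The macro definitions are assumed reimplementations.
 */
#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)

/* Round an address up to the next page frame number */
#define PFN_UP(x)       (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
/* Convert a PFN back to a physical address */
#define PFN_PHYS(x)     ((x) << PAGE_SHIFT)

int main(void)
{
        /* Hypothetical physical address of the end of the kernel image */
        unsigned long pa_end = 0x2f5123;

        /* A partially used page is not usable, so round upwards */
        unsigned long start_pfn = PFN_UP(pa_end);

        printf("kernel ends at 0x%lx, first free PFN is %lu (0x%lx)\n",
               pa_end, start_pfn, PFN_PHYS(start_pfn));
        return 0;
}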
This is the top-level function which is used to initialise each of the zones. The size of the zones in PFNs was discovered during setup_memory() (See Section B.1.1). This function populates an array of zone sizes for passing to free_area_init().
323 static void __init zone_sizes_init(void)
324 {
325         unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
326         unsigned int max_dma, high, low;
327 
328         max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
329         low = max_low_pfn;
330         high = highend_pfn;
331 
332         if (low < max_dma)
333                 zones_size[ZONE_DMA] = low;
334         else {
335                 zones_size[ZONE_DMA] = max_dma;
336                 zones_size[ZONE_NORMAL] = low - max_dma;
337 #ifdef CONFIG_HIGHMEM
338                 zones_size[ZONE_HIGHMEM] = high - low;
339 #endif
340         }
341         free_area_init(zones_size);
342 }
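As a rough worked example of this split, the standalone sketch below divides a hypothetical memory layout into the three zone sizes. It assumes the conventional i386 values of a 16MiB DMA limit, 896MiB of low memory on a 2GiB machine and 4KiB pages; it mirrors the logic of zone_sizes_init() but is not kernel code.

/*
 * Standalone illustration of how zone_sizes_init() splits memory into
 * zones. The constants are assumptions, not values read from a kernel.
 */
#include <stdio.h>

#define PAGE_SHIFT      12
#define ZONE_DMA        0
#define ZONE_NORMAL     1
#define ZONE_HIGHMEM    2
#define MAX_NR_ZONES    3

int main(void)
{
        unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
        unsigned long max_dma = (16UL << 20) >> PAGE_SHIFT;   /* 4096 PFNs  */
        unsigned long low     = (896UL << 20) >> PAGE_SHIFT;  /* low memory */
        unsigned long high    = (2048UL << 20) >> PAGE_SHIFT; /* end of RAM */

        if (low < max_dma)
                zones_size[ZONE_DMA] = low;
        else {
                zones_size[ZONE_DMA] = max_dma;
                zones_size[ZONE_NORMAL] = low - max_dma;
                zones_size[ZONE_HIGHMEM] = high - low;
        }

        printf("DMA %lu, NORMAL %lu, HIGHMEM %lu pages\n",
               zones_size[ZONE_DMA], zones_size[ZONE_NORMAL],
               zones_size[ZONE_HIGHMEM]);
        return 0;
}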
This is the architecture-independent function for setting up a UMA architecture. It simply calls the core function passing the static contig_page_data as the node. NUMA architectures will use free_area_init_node() instead.
838 void __init free_area_init(unsigned long *zones_size)
839 {
840         free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 0, 0, 0);
841 }
There are two versions of this function. The first is almost identical to free_area_init(), except that it takes a different starting physical address. It exists for architectures that have only one node (so they use contig_page_data) but whose physical memory does not start at address 0.
This version of the function, called after the pagetable initialisation, is for initialising each pgdat in the system. The caller has the option of allocating its own local portion of the mem_map and passing it in as a parameter if it wants to optimise its location for the architecture. If it chooses not to, the map will be allocated later by free_area_init_core().
61 void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
62         unsigned long *zones_size, unsigned long zone_start_paddr,
63         unsigned long *zholes_size)
64 {
65         int i, size = 0;
66         struct page *discard;
67 
68         if (mem_map == (mem_map_t *)NULL)
69                 mem_map = (mem_map_t *)PAGE_OFFSET;
70 
71         free_area_init_core(nid, pgdat, &discard, zones_size, zone_start_paddr,
72                 zholes_size, pmap);
73         pgdat->node_id = nid;
74 
75         /*
76          * Get space for the valid bitmap.
77          */
78         for (i = 0; i < MAX_NR_ZONES; i++)
79                 size += zones_size[i];
80         size = LONG_ALIGN((size + 7) >> 3);
81         pgdat->valid_addr_bitmap = (unsigned long *)alloc_bootmem_node(pgdat, size);
82         memset(pgdat->valid_addr_bitmap, 0, size);
83 }
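The bitmap sizing at line 80 reserves one bit per page in the node, rounded up to whole bytes and then to a long boundary. The sketch below reproduces that arithmetic for a hypothetical node; LONG_ALIGN() here is an assumed reimplementation rather than the kernel's definition.

/*
 * Sketch of the valid_addr_bitmap size calculation: one bit per page,
 * rounded up to bytes, then aligned to a long.
 */
#include <stdio.h>

#define LONG_ALIGN(x)   (((x) + sizeof(long) - 1) & ~(sizeof(long) - 1))

int main(void)
{
        /* Hypothetical node with 128MiB of 4KiB pages */
        unsigned long pages = (128UL << 20) >> 12;      /* 32768 pages */
        unsigned long size  = LONG_ALIGN((pages + 7) >> 3);

        printf("%lu pages need a %lu byte valid bitmap\n", pages, size);
        return 0;
}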
This function is responsible for initialising all zones and allocating their local lmem_map within a node. In UMA architectures, this function is called in a way that will initialise the global mem_map array. In NUMA architectures, the array is treated as a virtual array that is sparsely populated.
684 void __init free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
685         unsigned long *zones_size, unsigned long zone_start_paddr,
686         unsigned long *zholes_size, struct page *lmem_map)
687 {
688         unsigned long i, j;
689         unsigned long map_size;
690         unsigned long totalpages, offset, realtotalpages;
691         const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
692 
693         if (zone_start_paddr & ~PAGE_MASK)
694                 BUG();
695 
696         totalpages = 0;
697         for (i = 0; i < MAX_NR_ZONES; i++) {
698                 unsigned long size = zones_size[i];
699                 totalpages += size;
700         }
701         realtotalpages = totalpages;
702         if (zholes_size)
703                 for (i = 0; i < MAX_NR_ZONES; i++)
704                         realtotalpages -= zholes_size[i];
705 
706         printk("On node %d totalpages: %lu\n", nid, realtotalpages);
This block is mainly responsible for calculating the total number of pages in the node by summing the sizes of each zone and subtracting the sizes of any holes.
708         /*
709          * Some architectures (with lots of mem and discontinous memory
710          * maps) have to search for a good mem_map area:
711          * For discontigmem, the conceptual mem map array starts from
712          * PAGE_OFFSET, we need to align the actual array onto a mem map
713          * boundary, so that MAP_NR works.
714          */
715         map_size = (totalpages + 1)*sizeof(struct page);
716         if (lmem_map == (struct page *)0) {
717                 lmem_map = (struct page *) alloc_bootmem_node(pgdat, map_size);
718                 lmem_map = (struct page *)(PAGE_OFFSET +
719                         MAP_ALIGN((unsigned long)lmem_map - PAGE_OFFSET));
720         }
721         *gmap = pgdat->node_mem_map = lmem_map;
722         pgdat->node_size = totalpages;
723         pgdat->node_start_paddr = zone_start_paddr;
724         pgdat->node_start_mapnr = (lmem_map - mem_map);
725         pgdat->nr_zones = 0;
726 
727         offset = lmem_map - mem_map;
This block allocates the local lmem_map if necessary and sets gmap. In UMA architectures, gmap is actually mem_map, so this is where the memory for it is allocated.
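The comment at line 712 says the array must be aligned onto a "mem map boundary" so that indexing by page frame number works. The sketch below shows the kind of rounding MAP_ALIGN() is presumed to perform, aligning the offset of the freshly allocated map to a multiple of sizeof(struct page); the macro body and struct layout here are assumptions for illustration, not the kernel's definitions.

/*
 * Illustration of aligning a mem_map offset onto a struct page boundary.
 * MAP_ALIGN() below is an assumed round-up; the real macro may differ.
 */
#include <stdio.h>

struct page_stub { unsigned long flags; void *lru[2]; };   /* stand-in */

#define MAP_ALIGN(x, sz) (((x) + (sz) - 1) / (sz) * (sz))

int main(void)
{
        unsigned long page_offset = 0xc0000000UL;  /* PAGE_OFFSET on i386 */
        unsigned long lmem_map    = 0xc1000013UL;  /* hypothetical alloc  */

        unsigned long aligned = page_offset +
                MAP_ALIGN(lmem_map - page_offset, sizeof(struct page_stub));

        printf("lmem_map 0x%lx aligned to 0x%lx\n", lmem_map, aligned);
        return 0;
}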
728         for (j = 0; j < MAX_NR_ZONES; j++) {
729                 zone_t *zone = pgdat->node_zones + j;
730                 unsigned long mask;
731                 unsigned long size, realsize;
732 
733                 zone_table[nid * MAX_NR_ZONES + j] = zone;
734                 realsize = size = zones_size[j];
735                 if (zholes_size)
736                         realsize -= zholes_size[j];
737 
738                 printk("zone(%lu): %lu pages.\n", j, size);
739                 zone->size = size;
740                 zone->name = zone_names[j];
741                 zone->lock = SPIN_LOCK_UNLOCKED;
742                 zone->zone_pgdat = pgdat;
743                 zone->free_pages = 0;
744                 zone->need_balance = 0;
745                 if (!size)
746                         continue;
This block starts a loop which initialises every zone_t within the node. The initialisation starts by setting the simpler fields for which values already exist.
752                 zone->wait_table_size = wait_table_size(size);
753                 zone->wait_table_shift =
754                         BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
755                 zone->wait_table = (wait_queue_head_t *)
756                         alloc_bootmem_node(pgdat, zone->wait_table_size
757                                                 * sizeof(wait_queue_head_t));
758 
759                 for(i = 0; i < zone->wait_table_size; ++i)
760                         init_waitqueue_head(zone->wait_table + i);
Initialise the wait table for this zone. Processes waiting on pages in the zone hash into this table to select a queue to wait on. This means that not all processes waiting on pages in the zone have to be woken when a page is unlocked, only the smaller subset sharing the same hash queue.
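The sketch below illustrates the general idea of hashing a page into a small power-of-two table of wait queues. The table size and hash function are stand-ins for illustration only; the kernel sizes the table with wait_table_size() and performs its own hash inside page_waitqueue().

/*
 * Standalone illustration of hashing pages into a per-zone wait table.
 */
#include <stdio.h>

#define WAIT_TABLE_SIZE 16          /* must be a power of two */

/* Crude multiplicative hash on the page's address (illustrative only) */
static unsigned int hash_page(const void *page)
{
        unsigned long h = (unsigned long)page;
        h ^= h >> 7;
        h *= 2654435761UL;          /* Knuth's multiplicative constant */
        return (h >> 16) & (WAIT_TABLE_SIZE - 1);
}

int main(void)
{
        int pages[4];               /* stand-ins for struct page entries */
        for (int i = 0; i < 4; i++)
                printf("page %d waits on queue %u\n", i, hash_page(&pages[i]));
        return 0;
}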
762                 pgdat->nr_zones = j+1;
763 
764                 mask = (realsize / zone_balance_ratio[j]);
765                 if (mask < zone_balance_min[j])
766                         mask = zone_balance_min[j];
767                 else if (mask > zone_balance_max[j])
768                         mask = zone_balance_max[j];
769                 zone->pages_min = mask;
770                 zone->pages_low = mask*2;
771                 zone->pages_high = mask*3;
772 
773                 zone->zone_mem_map = mem_map + offset;
774                 zone->zone_start_mapnr = offset;
775                 zone->zone_start_paddr = zone_start_paddr;
776 
777                 if ((zone_start_paddr >> PAGE_SHIFT) & (zone_required_alignment-1))
778                         printk("BUG: wrong zone alignment, it will crash\n");
779 
Calculate the watermarks for the zone and record the location of the zone. The watermarks are calculated as ratios of the zone size.
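As a worked example of the watermark calculation, the sketch below applies the same bounds to a hypothetical 500MiB zone. It assumes the 2.4 defaults of zone_balance_ratio = 128, zone_balance_min = 20 and zone_balance_max = 255; treat those defaults as assumptions of this illustration.

/*
 * Worked example of the pages_min/pages_low/pages_high calculation.
 */
#include <stdio.h>

int main(void)
{
        unsigned long realsize = (500UL << 20) >> 12;   /* 128000 pages */
        unsigned long ratio = 128, min = 20, max = 255;

        unsigned long mask = realsize / ratio;          /* 1000 */
        if (mask < min)
                mask = min;
        else if (mask > max)
                mask = max;                             /* clamped to 255 */

        printf("pages_min=%lu pages_low=%lu pages_high=%lu\n",
               mask, mask * 2, mask * 3);
        return 0;
}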
780                 /*
781                  * Initially all pages are reserved - free ones are freed
782                  * up by free_all_bootmem() once the early boot process is
783                  * done. Non-atomic initialization, single-pass.
784                  */
785                 for (i = 0; i < size; i++) {
786                         struct page *page = mem_map + offset + i;
787                         set_page_zone(page, nid * MAX_NR_ZONES + j);
788                         set_page_count(page, 0);
789                         SetPageReserved(page);
790                         INIT_LIST_HEAD(&page->list);
791                         if (j != ZONE_HIGHMEM)
792                                 set_page_address(page, __va(zone_start_paddr));
793                         zone_start_paddr += PAGE_SIZE;
794                 }
795 
796                 offset += size;
797                 for (i = 0; ; i++) {
798                         unsigned long bitmap_size;
799 
800                         INIT_LIST_HEAD(&zone->free_area[i].free_list);
801                         if (i == MAX_ORDER-1) {
802                                 zone->free_area[i].map = NULL;
803                                 break;
804                         }
805 
829                         bitmap_size = (size-1) >> (i+4);
830                         bitmap_size = LONG_ALIGN(bitmap_size+1);
831                         zone->free_area[i].map =
832                                 (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
833                 }
834         }
835         build_zonelists(pgdat);
836 }
This block initialises the free lists for the zone and allocates the bitmap used by the buddy allocator to record the state of page buddies.
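The bitmap size at lines 829-830 follows from the fact that one bit covers a pair of buddies: a zone of size pages has roughly size >> (i+1) buddy pairs at order i, which is size >> (i+4) bytes. The sketch below reproduces that calculation for a hypothetical zone; MAX_ORDER and LONG_ALIGN() mirror their usual 2.4 values and definitions but are assumptions here.

/*
 * Sketch of the buddy bitmap sizing: one bit per pair of buddies at each
 * order, converted to bytes and long-aligned.
 */
#include <stdio.h>

#define MAX_ORDER       10
#define LONG_ALIGN(x)   (((x) + sizeof(long) - 1) & ~(sizeof(long) - 1))

int main(void)
{
        unsigned long size = (896UL << 20) >> 12;       /* pages in zone */

        for (int i = 0; i < MAX_ORDER - 1; i++) {
                unsigned long bitmap_size = (size - 1) >> (i + 4);
                bitmap_size = LONG_ALIGN(bitmap_size + 1);
                printf("order %d: %lu byte bitmap\n", i, bitmap_size);
        }
        return 0;
}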
This builds the list of fallback zones for each zone in the requested node. This is for when an allocation cannot be satisfied and another zone is consulted. When this is finished, allocations from ZONE_HIGHMEM will fall back to ZONE_NORMAL. Allocations from ZONE_NORMAL will fall back to ZONE_DMA, which in turn has nothing to fall back on.
589 static inline void build_zonelists(pg_data_t *pgdat)
590 {
591         int i, j, k;
592 
593         for (i = 0; i <= GFP_ZONEMASK; i++) {
594                 zonelist_t *zonelist;
595                 zone_t *zone;
596 
597                 zonelist = pgdat->node_zonelists + i;
598                 memset(zonelist, 0, sizeof(*zonelist));
599 
600                 j = 0;
601                 k = ZONE_NORMAL;
602                 if (i & __GFP_HIGHMEM)
603                         k = ZONE_HIGHMEM;
604                 if (i & __GFP_DMA)
605                         k = ZONE_DMA;
606 
607                 switch (k) {
608                         default:
609                                 BUG();
610                         /*
611                          * fallthrough:
612                          */
613                         case ZONE_HIGHMEM:
614                                 zone = pgdat->node_zones + ZONE_HIGHMEM;
615                                 if (zone->size) {
616 #ifndef CONFIG_HIGHMEM
617                                         BUG();
618 #endif
619                                         zonelist->zones[j++] = zone;
620                                 }
621                         case ZONE_NORMAL:
622                                 zone = pgdat->node_zones + ZONE_NORMAL;
623                                 if (zone->size)
624                                         zonelist->zones[j++] = zone;
625                         case ZONE_DMA:
626                                 zone = pgdat->node_zones + ZONE_DMA;
627                                 if (zone->size)
628                                         zonelist->zones[j++] = zone;
629                 }
630                 zonelist->zones[j++] = NULL;
631         }
632 }
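To show how the resulting fallback lists are consumed, the sketch below builds a simplified zonelist and walks it the way the allocator would, trying each zone in turn until one can satisfy the request or the NULL terminator is reached. The types and the sufficiency check are simplified stand-ins for the kernel structures, not the real allocation path.

/*
 * Simplified stand-in showing how a zonelist built by build_zonelists()
 * is walked: zones are tried in order until one has enough free pages or
 * the NULL terminator ends the list.
 */
#include <stdio.h>
#include <stddef.h>

struct zone_stub {
        const char *name;
        unsigned long free_pages;
};

int main(void)
{
        struct zone_stub highmem = { "HighMem", 0 };     /* exhausted */
        struct zone_stub normal  = { "Normal",  32 };    /* has pages */
        struct zone_stub dma     = { "DMA",     128 };

        /* Fallback order a __GFP_HIGHMEM request would see */
        struct zone_stub *zonelist[] = { &highmem, &normal, &dma, NULL };

        unsigned long want = 8;
        for (struct zone_stub **z = zonelist; *z != NULL; z++) {
                if ((*z)->free_pages >= want) {
                        printf("allocated %lu pages from %s\n",
                               want, (*z)->name);
                        return 0;
                }
                printf("%s exhausted, falling back\n", (*z)->name);
        }
        printf("no zone could satisfy the request\n");
        return 0;
}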
B.2 Page Operations
  B.2.1 Locking Pages
    B.2.1.1 Function: lock_page()
    B.2.1.2 Function: __lock_page()
    B.2.1.3 Function: sync_page()
  B.2.2 Unlocking Pages
    B.2.2.1 Function: unlock_page()
  B.2.3 Waiting on Pages
    B.2.3.1 Function: wait_on_page()
    B.2.3.2 Function: ___wait_on_page()
This function tries to lock a page. If the page cannot be locked, it will cause the process to sleep until the page is available.
921 void lock_page(struct page *page)
922 {
923         if (TryLockPage(page))
924                 __lock_page(page);
925 }
This is called after TryLockPage() has failed. It will locate the waitqueue for this page and sleep on it until the lock can be acquired.
897 static void __lock_page(struct page *page)
898 {
899         wait_queue_head_t *waitqueue = page_waitqueue(page);
900         struct task_struct *tsk = current;
901         DECLARE_WAITQUEUE(wait, tsk);
902 
903         add_wait_queue_exclusive(waitqueue, &wait);
904         for (;;) {
905                 set_task_state(tsk, TASK_UNINTERRUPTIBLE);
906                 if (PageLocked(page)) {
907                         sync_page(page);
908                         schedule();
909                 }
910                 if (!TryLockPage(page))
911                         break;
912         }
913         __set_task_state(tsk, TASK_RUNNING);
914         remove_wait_queue(waitqueue, &wait);
915 }
This calls the filesystem-specific sync_page() operation to synchronise the page with its backing storage.
140 static inline int sync_page(struct page *page)
141 {
142         struct address_space *mapping = page->mapping;
143 
144         if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
145                 return mapping->a_ops->sync_page(page);
146         return 0;
147 }
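The function relies on the common "optional operation in an ops table" pattern: the hook is only called if the mapping provides one. The standalone sketch below demonstrates that pattern with simplified stand-in types and names; it is an illustration, not the kernel's address_space definition.

/*
 * Illustration of calling an optional hook through an operations table,
 * as sync_page() does with a_ops->sync_page.
 */
#include <stdio.h>

struct page_stub;

struct address_space_ops_stub {
        int (*sync_page)(struct page_stub *page);   /* may be NULL */
};

struct address_space_stub {
        const struct address_space_ops_stub *a_ops;
};

struct page_stub {
        struct address_space_stub *mapping;         /* NULL for anon pages */
};

static int my_fs_sync_page(struct page_stub *page)
{
        (void)page;
        printf("filesystem-specific sync ran\n");
        return 0;
}

static int sync_page_like(struct page_stub *page)
{
        struct address_space_stub *mapping = page->mapping;

        if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
                return mapping->a_ops->sync_page(page);
        return 0;               /* nothing to do */
}

int main(void)
{
        const struct address_space_ops_stub ops = { my_fs_sync_page };
        struct address_space_stub mapping = { &ops };
        struct page_stub page = { &mapping };
        struct page_stub anon_page = { NULL };

        sync_page_like(&page);       /* calls the hook   */
        sync_page_like(&anon_page);  /* silently returns */
        return 0;
}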
This function unlocks a page and wakes up any processes that may be waiting on it.
874 void unlock_page(struct page *page)
875 {
876         wait_queue_head_t *waitqueue = page_waitqueue(page);
877         ClearPageLaunder(page);
878         smp_mb__before_clear_bit();
879         if (!test_and_clear_bit(PG_locked, &(page)->flags))
880                 BUG();
881         smp_mb__after_clear_bit();
882 
883         /*
884          * Although the default semantics of wake_up() are
885          * to wake all, here the specific function is used
886          * to make it even more explicit that a number of
887          * pages are being waited on here.
888          */
889         if (waitqueue_active(waitqueue))
890                 wake_up_all(waitqueue);
891 }
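The lock/unlock pair follows a familiar pattern: try to set a bit, sleep on a queue if it is already set, and wake the waiters when the bit is cleared so they can retry. The userspace sketch below is only an analogue of that pattern built on pthreads; it is not kernel code and none of the names correspond to kernel symbols.

/*
 * Userspace analogue of the lock_page()/unlock_page() pattern: a bit is
 * set with a try-lock, waiters sleep on a queue, and the unlocker wakes
 * everyone so they can retry (compare wake_up_all() above).
 */
#include <pthread.h>
#include <stdio.h>

static int page_locked;                         /* stands in for PG_locked */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t waitqueue = PTHREAD_COND_INITIALIZER;

static void lock_page_like(void)
{
        pthread_mutex_lock(&lock);
        while (page_locked)                     /* retry after every wake-up */
                pthread_cond_wait(&waitqueue, &lock);
        page_locked = 1;                        /* acquired the "page lock" */
        pthread_mutex_unlock(&lock);
}

static void unlock_page_like(void)
{
        pthread_mutex_lock(&lock);
        page_locked = 0;
        pthread_cond_broadcast(&waitqueue);     /* like wake_up_all() */
        pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
        lock_page_like();
        printf("thread %ld holds the page lock\n", (long)arg);
        unlock_page_like();
        return NULL;
}

int main(void)
{
        pthread_t t[3];
        for (long i = 0; i < 3; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 3; i++)
                pthread_join(t[i], NULL);
        return 0;
}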
Source: include/linux/pagemap.h

This trivial function checks whether the page is locked. If it is, ___wait_on_page() is called to sleep until the page is unlocked.
94 static inline void wait_on_page(struct page * page)
95 {
96         if (PageLocked(page))
97                 ___wait_on_page(page);
98 }
This function is called after PageLocked() has been used to determine the page is locked. The calling process will probably sleep until the page is unlocked.
849 void ___wait_on_page(struct page *page)
850 {
851         wait_queue_head_t *waitqueue = page_waitqueue(page);
852         struct task_struct *tsk = current;
853         DECLARE_WAITQUEUE(wait, tsk);
854 
855         add_wait_queue(waitqueue, &wait);
856         do {
857                 set_task_state(tsk, TASK_UNINTERRUPTIBLE);
858                 if (!PageLocked(page))
859                         break;
860                 sync_page(page);
861                 schedule();
862         } while (PageLocked(page));
863         __set_task_state(tsk, TASK_RUNNING);
864         remove_wait_queue(waitqueue, &wait);
865 }