Appendix J Page Frame Reclamation
J.1 Page Cache Operations
This section addresses how pages are added and removed from the page cache and
LRU lists, both of which are heavily intertwined.
J.1.1 Adding Pages to the Page Cache
J.1.1.1 Function: add_to_page_cache
Source: mm/filemap.c
Acquire the lock protecting the page cache before calling
__add_to_page_cache(), which will add the page to the page hash
table and inode queue so that pages belonging to files can be found
quickly.
667 void add_to_page_cache(struct page * page,
struct address_space * mapping,
unsigned long offset)
668 {
669 spin_lock(&pagecache_lock);
670 __add_to_page_cache(page, mapping,
offset, page_hash(mapping, offset));
671 spin_unlock(&pagecache_lock);
672 lru_cache_add(page);
673 }
- 669Acquire the lock protecting the page hash and inode queues
- 670Call __add_to_page_cache() which performs the “real” work.
page_hash() hashes into the page hash table based on the mapping and
the offset within the file to select the bucket the page belongs
in. Pages which collide in the same bucket are chained together with the
page→next_hash and page→pprev_hash fields
- 671Release the lock protecting the hash and inode queue
- 672Add the page to the LRU lists with lru_cache_add()
(See Section J.2.1.1)
J.1.1.2 Function: add_to_page_cache_unique
Source: mm/filemap.c
In many respects, this function is very similar to
add_to_page_cache(). The principal difference is that this
function will check the page cache with the pagecache_lock spinlock
held before adding the page to the cache. It is intended for callers that
may race with another process to insert the same page into the cache, such
as add_to_swap_cache() (See Section K.2.1.1).
675 int add_to_page_cache_unique(struct page * page,
676 struct address_space *mapping, unsigned long offset,
677 struct page **hash)
678 {
679 int err;
680 struct page *alias;
681
682 spin_lock(&pagecache_lock);
683 alias = __find_page_nolock(mapping, offset, *hash);
684
685 err = 1;
686 if (!alias) {
687 __add_to_page_cache(page,mapping,offset,hash);
688 err = 0;
689 }
690
691 spin_unlock(&pagecache_lock);
692 if (!err)
693 lru_cache_add(page);
694 return err;
695 }
- 682Acquire the pagecache_lock for examining the cache
- 683Check if the page already exists in the cache with
__find_page_nolock() (See Section J.1.4.3)
- 686-689If the page does not exist in the cache, add it with
__add_to_page_cache() (See Section J.1.1.3)
- 691Release the pagecache_lock
- 692-693If the page did not already exist in the page cache, add it to
the LRU lists with lru_cache_add()(See Section J.2.1.1)
- 694Return 0 if this call entered the page into the page cache and 1 if
it already existed
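To show how this function is used, the following is a simplified sketch of a
caller inserting a freshly allocated page, loosely modelled on
page_cache_read() in mm/filemap.c. It is not a verbatim listing: the function
name read_page_into_cache() is invented, and page_cache_alloc() together with
the exact error handling are assumptions for illustration only.

/* Simplified caller sketch, not a verbatim kernel listing */
static int read_page_into_cache(struct file *file,
                                struct address_space *mapping,
                                unsigned long offset)
{
        struct page **hash = page_hash(mapping, offset);
        struct page *page = page_cache_alloc(mapping); /* assumed helper */

        if (!page)
                return -ENOMEM;

        if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
                /* We inserted the page: start the read, drop our reference */
                int error = mapping->a_ops->readpage(file, page);
                page_cache_release(page);
                return error;
        }

        /* Another process inserted the same page first, so free ours */
        page_cache_release(page);
        return 0;
}

The point of the pattern is that the loser of the race simply drops its own
page with page_cache_release() and the copy that won stays in the cache.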
J.1.1.3 Function: __add_to_page_cache
Source: mm/filemap.c
Clear all page flags, lock it, take a reference and add it to the inode and
hash queues.
653 static inline void __add_to_page_cache(struct page * page,
654 struct address_space *mapping, unsigned long offset,
655 struct page **hash)
656 {
657 unsigned long flags;
658
659 flags = page->flags & ~(1 << PG_uptodate |
1 << PG_error | 1 << PG_dirty |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked);
660 page->flags = flags | (1 << PG_locked);
661 page_cache_get(page);
662 page->index = offset;
663 add_page_to_inode_queue(mapping, page);
664 add_page_to_hash_queue(page, hash);
665 }
- 659Clear the page flags relating to the old contents of the page
(PG_uptodate, PG_error, PG_dirty, PG_referenced, PG_arch_1 and
PG_checked)
- 660Lock the page
- 661Take a reference to the page in case it gets freed prematurely
- 662Update the index so it is known what file offset this page
represents
- 663Add the page to the inode queue with
add_page_to_inode_queue() (See Section J.1.1.4).
This links the page via the page→list to the
clean_pages list in the address_space and points
the page→mapping to the same address_space
- 664Add it to the page hash with add_page_to_hash_queue()
(See Section J.1.1.5). The hash bucket was returned
by page_hash() in the parent function. The page hash allows page
cache pages to be found quickly without having to linearly search the
inode queue
J.1.1.4 Function: add_page_to_inode_queue
Source: mm/filemap.c
85 static inline void add_page_to_inode_queue(
struct address_space *mapping, struct page * page)
86 {
87 struct list_head *head = &mapping->clean_pages;
88
89 mapping->nrpages++;
90 list_add(&page->list, head);
91 page->mapping = mapping;
92 }
- 87When this function is called, the page is clean, so
mapping→clean_pages is the list of interest
- 89Increment the number of pages that belong to this mapping
- 90Add the page to the clean list
- 91Set the page→mapping field
J.1.1.5 Function: add_page_to_hash_queue
Source: mm/filemap.c
This adds the page to the top of the hash bucket headed by p. Bear
in mind that p is an element of the array
page_hash_table.
71 static void add_page_to_hash_queue(struct page * page,
struct page **p)
72 {
73 struct page *next = *p;
74
75 *p = page;
76 page->next_hash = next;
77 page->pprev_hash = p;
78 if (next)
79 next->pprev_hash = &page->next_hash;
80 if (page->buffers)
81 PAGE_BUG(page);
82 atomic_inc(&page_cache_size);
83 }
- 73Record the current head of the hash bucket in next
- 75Update the head of the hash bucket to be page
- 76Point page→next_hash to the old head of the hash
bucket
- 77Point page→pprev_hash to point to the array
element in page_hash_table
- 78-79If the bucket was not empty, point the old head's
pprev_hash field at the new page's next_hash field,
completing the insertion of the page into the linked list
- 80-81Check that the page entered has no associated buffers
- 82Increment page_cache_size which is the size of the
page cache
J.1.2 Deleting Pages from the Page Cache
J.1.2.1 Function: remove_inode_page
Source: mm/filemap.c
130 void remove_inode_page(struct page *page)
131 {
132 if (!PageLocked(page))
133 PAGE_BUG(page);
134
135 spin_lock(&pagecache_lock);
136 __remove_inode_page(page);
137 spin_unlock(&pagecache_lock);
138 }
- 132-133If the page is not locked, it is a bug
- 135Acquire the lock protecting the page cache
- 136__remove_inode_page() (See Section J.1.2.2)
is the top-level function for when the pagecache lock is held
- 137Release the pagecache lock
J.1.2.2 Function: __remove_inode_page
Source: mm/filemap.c
This is the top-level function for removing a page from the page cache for
callers with the pagecache_lock spinlock held. Callers that do not
have this lock acquired should call remove_inode_page().
124 void __remove_inode_page(struct page *page)
125 {
126 remove_page_from_inode_queue(page);
127 remove_page_from_hash_queue(page);
128 }
- 126remove_page_from_inode_queue() (See Section J.1.2.3)
removes the page from its address_space at
page→mapping
- 127remove_page_from_hash_queue() removes the page from the
hash table in page_hash_table
J.1.2.3 Function: remove_page_from_inode_queue
Source: mm/filemap.c
94 static inline void remove_page_from_inode_queue(struct page * page)
95 {
96 struct address_space * mapping = page->mapping;
97
98 if (mapping->a_ops->removepage)
99 mapping->a_ops->removepage(page);
100 list_del(&page->list);
101 page->mapping = NULL;
102 wmb();
103 mapping->nrpages--;
104 }
- 96Get the associated address_space for this
page
- 98-99Call the filesystem specific removepage() function if
one is available
- 100Delete the page from whatever list it belongs to in the
mapping such as the clean_pages list in most cases or
the dirty_pages in rarer cases
- 101Set the page→mapping to NULL as it is no longer
backed by any address_space
- 103Decrement the number of pages in the mapping
J.1.2.4 Function: remove_page_from_hash_queue
Source: mm/filemap.c
107 static inline void remove_page_from_hash_queue(struct page * page)
108 {
109 struct page *next = page->next_hash;
110 struct page **pprev = page->pprev_hash;
111
112 if (next)
113 next->pprev_hash = pprev;
114 *pprev = next;
115 page->pprev_hash = NULL;
116 atomic_dec(&page_cache_size);
117 }
- 109Get the next page after the page being removed
- 110Get the pprev page before the page being
removed. When the function completes, pprev will be linked to
next
- 112If this is not the end of the list, update
next→pprev_hash to point to pprev
- 114Similarly, point pprev forward to next.
page is now unlinked
- 116Decrement the size of the page cache
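The reason pprev_hash is a pointer to a struct page pointer, rather
than to a page, is that it lets a page be unlinked without knowing whether it
sits at the head of the bucket (where the back-pointer refers into
page_hash_table itself) or in the middle of the chain. The following
minimal userspace sketch, which is not kernel code, models both
add_page_to_hash_queue() and remove_page_from_hash_queue() with that layout.

/*
 * Minimal userspace sketch (not kernel code) of the hash-chain linkage
 * used by add_page_to_hash_queue() and remove_page_from_hash_queue().
 */
#include <assert.h>
#include <stddef.h>

struct page {
        unsigned long index;
        struct page *next_hash;
        struct page **pprev_hash;
};

static void add_to_bucket(struct page *page, struct page **bucket)
{
        struct page *next = *bucket;

        *bucket = page;                 /* new page becomes the head      */
        page->next_hash = next;
        page->pprev_hash = bucket;      /* points back into the table     */
        if (next)
                next->pprev_hash = &page->next_hash;
}

static void remove_from_bucket(struct page *page)
{
        struct page *next = page->next_hash;
        struct page **pprev = page->pprev_hash;

        if (next)
                next->pprev_hash = pprev;
        *pprev = next;                  /* unlink without knowing the head */
        page->pprev_hash = NULL;
}

int main(void)
{
        struct page *bucket = NULL;
        struct page a = { .index = 1 }, b = { .index = 2 };

        add_to_bucket(&a, &bucket);     /* bucket: a       */
        add_to_bucket(&b, &bucket);     /* bucket: b -> a  */
        remove_from_bucket(&a);         /* tail removal    */
        remove_from_bucket(&b);         /* head removal    */
        assert(bucket == NULL);
        return 0;
}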
J.1.3 Acquiring/Releasing Page Cache Pages
J.1.3.1 Function: page_cache_get
Source: include/linux/pagemap.h
31 #define page_cache_get(x) get_page(x)
- 31Simply call get_page() which uses
atomic_inc() to increment the page reference count
J.1.3.2 Function: page_cache_release
Source: include/linux/pagemap.h
32 #define page_cache_release(x) __free_page(x)
- 32Call __free_page() which decrements the page count. If
the count reaches 0, the page will be freed
J.1.4 Searching the Page Cache
J.1.4.1 Function: find_get_page
Source: include/linux/pagemap.h
Top level macro for finding a page in the page cache. It simply looks up the
page hash bucket with page_hash() and calls __find_get_page().
75 #define find_get_page(mapping, index) \
76 __find_get_page(mapping, index, page_hash(mapping, index))
- 76page_hash() locates an entry in the
page_hash_table based on the address_space and
offset
J.1.4.2 Function: __find_get_page
Source: mm/filemap.c
This function is responsible for finding a struct page given an entry in
page_hash_table as a starting point.
931 struct page * __find_get_page(struct address_space *mapping,
932 unsigned long offset, struct page **hash)
933 {
934 struct page *page;
935
936 /*
937 * We scan the hash list read-only. Addition to and removal from
938 * the hash-list needs a held write-lock.
939 */
940 spin_lock(&pagecache_lock);
941 page = __find_page_nolock(mapping, offset, *hash);
942 if (page)
943 page_cache_get(page);
944 spin_unlock(&pagecache_lock);
945 return page;
946 }
- 940Acquire the read-only page cache lock
- 941Call the page cache traversal function which presumes a lock is
held
- 942-943If the page was found, obtain a reference to it with
page_cache_get() (See Section J.1.3.1) so it is not
freed prematurely
- 944Release the page cache lock
- 945Return the page or NULL if not found
J.1.4.3 Function: __find_page_nolock
Source: mm/filemap.c
This function traverses the hash collision list looking for the page specified
by the address_space and offset.
443 static inline struct page * __find_page_nolock(
struct address_space *mapping,
unsigned long offset,
struct page *page)
444 {
445 goto inside;
446
447 for (;;) {
448 page = page->next_hash;
449 inside:
450 if (!page)
451 goto not_found;
452 if (page->mapping != mapping)
453 continue;
454 if (page->index == offset)
455 break;
456 }
457
458 not_found:
459 return page;
460 }
- 445Begin by examining the first page in the list
- 450-451If the page is NULL, the right one could not be found so return
NULL
- 452If the address_space does not match, move to the next
page on the collision list
- 454If the offset matches, break out of the loop so the page is returned
- 448Move to the next page on the hash list
- 459Return the found page or NULL if not
J.1.4.4 Function: find_lock_page
Source: include/linux/pagemap.h
This is the top level function for searching the page cache for a page and
having it returned in a locked state.
84 #define find_lock_page(mapping, index) \
85 __find_lock_page(mapping, index, page_hash(mapping, index))
- 85Call the core function __find_lock_page() after looking
up what hash bucket this page is using with page_hash()
J.1.4.5 Function: __find_lock_page
Source: mm/filemap.c
This function acquires the pagecache_lock spinlock before calling
the core function __find_lock_page_helper() to locate the page
and lock it.
1005 struct page * __find_lock_page (struct address_space *mapping,
1006 unsigned long offset, struct page **hash)
1007 {
1008 struct page *page;
1009
1010 spin_lock(&pagecache_lock);
1011 page = __find_lock_page_helper(mapping, offset, *hash);
1012 spin_unlock(&pagecache_lock);
1013 return page;
1014 }
- 1010Acquire the pagecache_lock spinlock
- 1011Call __find_lock_page_helper() which will search the
page cache and lock the page if it is found
- 1012Release the pagecache_lock spinlock
- 1013If the page was found, return it in a locked state, otherwise
return NULL
J.1.4.6 Function: __find_lock_page_helper
Source: mm/filemap.c
This function uses __find_page_nolock() to locate a page within
the page cache. If it is found, the page will be locked for returning to the
caller.
972 static struct page * __find_lock_page_helper(
struct address_space *mapping,
973 unsigned long offset, struct page *hash)
974 {
975 struct page *page;
976
977 /*
978 * We scan the hash list read-only. Addition to and removal from
979 * the hash-list needs a held write-lock.
980 */
981 repeat:
982 page = __find_page_nolock(mapping, offset, hash);
983 if (page) {
984 page_cache_get(page);
985 if (TryLockPage(page)) {
986 spin_unlock(&pagecache_lock);
987 lock_page(page);
988 spin_lock(&pagecache_lock);
989
990 /* Has the page been re-allocated while we slept? */
991 if (page->mapping != mapping || page->index != offset) {
992 UnlockPage(page);
993 page_cache_release(page);
994 goto repeat;
995 }
996 }
997 }
998 return page;
999 }
- 982Use __find_page_nolock()(See Section J.1.4.3)
to locate the page in the page cache
- 983-984If the page was found, take a reference to it
- 985Try and lock the page with TryLockPage(). This macro is
just a wrapper around test_and_set_bit() which attempts to set the
PG_locked bit in the page→flags
- 986-988If the lock failed, release the pagecache_lock
spinlock and call lock_page() (See Section B.2.1.1) to lock
the page. It is likely this function will sleep until the page lock is
acquired. When the page is locked, acquire the pagecache_lock
spinlock again
- 991If the mapping and index no longer match, it
means that this page was reclaimed while we were asleep. The page is unlocked
and the reference dropped before searching the page cache again
- 998Return the page in a locked state, or NULL if it was not in the page
cache
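The pattern here, of trylocking, dropping the lookup lock, sleeping on the
object lock and then rechecking the object's identity, is a general one. Below
is a minimal userspace model using pthreads rather than kernel locks;
find_lock_object(), struct object and its fields are illustrative names only,
not part of any kernel API.

/*
 * Userspace sketch (pthreads, not kernel code) of the locking pattern in
 * __find_lock_page_helper(): never sleep on a page lock while holding
 * the lock that protects the lookup structure.
 */
#include <pthread.h>
#include <stdio.h>

struct object {
        pthread_mutex_t lock;
        int mapping;            /* stands in for page->mapping */
        int index;              /* stands in for page->index   */
        int refcount;
};

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;

static struct object *find_lock_object(struct object *obj, int mapping, int index)
{
        pthread_mutex_lock(&lookup_lock);
repeat:
        if (obj->mapping != mapping || obj->index != index) {
                pthread_mutex_unlock(&lookup_lock);
                return NULL;
        }
        obj->refcount++;                        /* page_cache_get()       */
        if (pthread_mutex_trylock(&obj->lock)) {/* TryLockPage() failed   */
                pthread_mutex_unlock(&lookup_lock);
                pthread_mutex_lock(&obj->lock); /* lock_page(): may sleep */
                pthread_mutex_lock(&lookup_lock);
                /* has the object been re-used while we slept? */
                if (obj->mapping != mapping || obj->index != index) {
                        pthread_mutex_unlock(&obj->lock);
                        obj->refcount--;        /* page_cache_release()   */
                        goto repeat;
                }
        }
        pthread_mutex_unlock(&lookup_lock);
        return obj;                             /* returned locked */
}

int main(void)
{
        struct object o = { .lock = PTHREAD_MUTEX_INITIALIZER,
                            .mapping = 1, .index = 42, .refcount = 0 };
        struct object *found = find_lock_object(&o, 1, 42);

        printf("found=%p refcount=%d\n", (void *)found, o.refcount);
        if (found)
                pthread_mutex_unlock(&found->lock);
        return 0;
}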
J.2 LRU List Operations
J.2.1 Adding Pages to the LRU Lists
J.2.1.1 Function: lru_cache_add
Source: mm/swap.c
Adds a page to the LRU inactive_list.
58 void lru_cache_add(struct page * page)
59 {
60 if (!PageLRU(page)) {
61 spin_lock(&pagemap_lru_lock);
62 if (!TestSetPageLRU(page))
63 add_page_to_inactive_list(page);
64 spin_unlock(&pagemap_lru_lock);
65 }
66 }
- 60If the page is not already part of the LRU lists, add it
- 61Acquire the LRU lock
- 62-63Test and set the LRU bit. If it was clear, call
add_page_to_inactive_list()
- 64Release the LRU lock
J.2.1.2 Function: add_page_to_active_list
Source: include/linux/swap.h
Adds the page to the active_list
178 #define add_page_to_active_list(page) \
179 do { \
180 DEBUG_LRU_PAGE(page); \
181 SetPageActive(page); \
182 list_add(&(page)->lru, &active_list); \
183 nr_active_pages++; \
184 } while (0)
- 180The DEBUG_LRU_PAGE() macro will call BUG() if
the page is already on the LRU list or is already marked active
- 181Update the flags of the page to show it is active
- 182Add the page to the active_list
- 183Update the count of the number of pages in the active_list
J.2.1.3 Function: add_page_to_inactive_list
Source: include/linux/swap.h
Adds the page to the inactive_list
186 #define add_page_to_inactive_list(page) \
187 do { \
188 DEBUG_LRU_PAGE(page); \
189 list_add(&(page)->lru, &inactive_list); \
190 nr_inactive_pages++; \
191 } while (0)
- 188The DEBUG_LRU_PAGE() macro will call BUG() if
the page is already on the LRU list or is marked active
- 189Add the page to the inactive_list
- 190Update the count of the number of inactive pages on the list
J.2.2 Deleting Pages from the LRU Lists
J.2.2.1 Function: lru_cache_del
Source: mm/swap.c
Acquire the lock protecting the LRU lists before calling
__lru_cache_del().
90 void lru_cache_del(struct page * page)
91 {
92 spin_lock(&pagemap_lru_lock);
93 __lru_cache_del(page);
94 spin_unlock(&pagemap_lru_lock);
95 }
- 92Acquire the LRU lock
- 93__lru_cache_del() does the “real” work of removing
the page from the LRU lists
- 94Release the LRU lock
J.2.2.2 Function: __lru_cache_del
Source: mm/swap.c
Select which function is needed to remove the page from the LRU list.
75 void __lru_cache_del(struct page * page)
76 {
77 if (TestClearPageLRU(page)) {
78 if (PageActive(page)) {
79 del_page_from_active_list(page);
80 } else {
81 del_page_from_inactive_list(page);
82 }
83 }
84 }
- 77Test and clear the flag indicating the page is in the LRU
- 78-82If the page is on the LRU, select the appropriate removal
function
- 78-79If the page is active, then call
del_page_from_active_list() else delete from the inactive list
with del_page_from_inactive_list()
J.2.2.3 Function: del_page_from_active_list
Source: include/linux/swap.h
Remove the page from the active_list
193 #define del_page_from_active_list(page) \
194 do { \
195 list_del(&(page)->lru); \
196 ClearPageActive(page); \
197 nr_active_pages--; \
198 } while (0)
- 195Delete the page from the list
- 196Clear the flag indicating it is part of active_list. The
flag indicating it is part of the LRU list has already been cleared by
__lru_cache_del()
- 197Update the count of the number of pages in the active_list
J.2.2.4 Function: del_page_from_inactive_list
Source: include/linux/swap.h
200 #define del_page_from_inactive_list(page) \
201 do { \
202 list_del(&(page)->lru); \
203 nr_inactive_pages--; \
204 } while (0)
- 202Remove the page from the LRU list
- 203Update the count of the number of pages in the
inactive_list
J.2.3 Activating Pages
J.2.3.1 Function: mark_page_accessed
Source: mm/filemap.c
This marks that a page has been referenced. If the page is already on the
active_list or the referenced flag is clear, the referenced
flag will be simply set. If it is in the inactive_list and the
referenced flag has been set, activate_page() will be called to
move the page to the top of the active_list.
1332 void mark_page_accessed(struct page *page)
1333 {
1334 if (!PageActive(page) && PageReferenced(page)) {
1335 activate_page(page);
1336 ClearPageReferenced(page);
1337 } else
1338 SetPageReferenced(page);
1339 }
- 1334-1337If the page is on the inactive_list
(!PageActive()) and has been referenced recently
(PageReferenced()), activate_page() is called to move it
to the active_list
- 1338Otherwise, mark the page as referenced
J.2.3.2 Function: activate_page
Source: mm/swap.c
Acquire the LRU lock before calling activate_page_nolock() which
moves the page from the inactive_list to the active_list.
47 void activate_page(struct page * page)
48 {
49 spin_lock(&pagemap_lru_lock);
50 activate_page_nolock(page);
51 spin_unlock(&pagemap_lru_lock);
52 }
- 49Acquire the LRU lock
- 50Call the main work function
- 51Release the LRU lock
J.2.3.3 Function: activate_page_nolock
Source: mm/swap.c
Move the page from the inactive_list to the active_list
39 static inline void activate_page_nolock(struct page * page)
40 {
41 if (PageLRU(page) && !PageActive(page)) {
42 del_page_from_inactive_list(page);
43 add_page_to_active_list(page);
44 }
45 }
- 41Make sure the page is on the LRU and not already on the active_list
- 42-43Delete the page from the inactive_list and add to the
active_list
J.3 Refilling inactive_list
This section covers how pages are moved from the active lists to the
inactive lists.
J.3.1 Function: refill_inactive
Source: mm/vmscan.c
Move nr_pages from the active_list to the
inactive_list. The parameter nr_pages is calculated by
shrink_caches() and is a number which tries to keep the active list
two thirds the size of the page cache.
533 static void refill_inactive(int nr_pages)
534 {
535 struct list_head * entry;
536
537 spin_lock(&pagemap_lru_lock);
538 entry = active_list.prev;
539 while (nr_pages && entry != &active_list) {
540 struct page * page;
541
542 page = list_entry(entry, struct page, lru);
543 entry = entry->prev;
544 if (PageTestandClearReferenced(page)) {
545 list_del(&page->lru);
546 list_add(&page->lru, &active_list);
547 continue;
548 }
549
550 nr_pages--;
551
552 del_page_from_active_list(page);
553 add_page_to_inactive_list(page);
554 SetPageReferenced(page);
555 }
556 spin_unlock(&pagemap_lru_lock);
557 }
- 537Acquire the lock protecting the LRU list
- 538Take the last entry in the active_list
- 539-555Keep moving pages until nr_pages have been moved or the
active_list is empty
- 542Get the struct page for this entry
- 544-548Test and clear the referenced flag. If it has been referenced,
then it is moved back to the top of the active_list
- 550-553Move one page from the active_list to the inactive_list
- 554Mark it referenced so that if it is referenced again soon, it will
be promoted back to the active_list without requiring a second reference
- 556Release the lock protecting the LRU list
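Between them, mark_page_accessed() (See Section J.2.3.1) and
refill_inactive() implement the 2.4 page-aging policy: a page on the inactive
list needs two references before it is promoted, and it is demoted with the
referenced flag set so that a single further access is enough to promote it
again. The following userspace sketch, which is not kernel code and uses a
simplified struct page_state instead of real list manipulation, models just
that policy.

/*
 * Userspace sketch (not kernel code) of the page-aging policy
 * implemented by mark_page_accessed() and refill_inactive().
 */
#include <stdbool.h>
#include <stdio.h>

struct page_state {
        bool active;
        bool referenced;
};

/* Policy of mark_page_accessed() */
static void mark_accessed(struct page_state *p)
{
        if (!p->active && p->referenced) {
                p->active = true;       /* activate_page()       */
                p->referenced = false;
        } else {
                p->referenced = true;
        }
}

/* Policy applied to each page scanned by refill_inactive() */
static void refill_pass(struct page_state *p)
{
        if (!p->active)
                return;                 /* only active pages are scanned */
        if (p->referenced) {
                p->referenced = false;  /* PageTestandClearReferenced    */
                return;                 /* stays on the active list      */
        }
        p->active = false;              /* moved to the inactive list    */
        p->referenced = true;           /* SetPageReferenced             */
}

int main(void)
{
        struct page_state p = { false, false };

        mark_accessed(&p);      /* first touch: referenced only       */
        mark_accessed(&p);      /* second touch: promoted to active   */
        mark_accessed(&p);      /* touch while active: referenced set */
        printf("active=%d referenced=%d\n", p.active, p.referenced);

        refill_pass(&p);        /* referenced cleared, stays active   */
        refill_pass(&p);        /* demoted with referenced set        */
        printf("active=%d referenced=%d\n", p.active, p.referenced);
        return 0;
}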
J.4 Reclaiming Pages from the LRU Lists
This section covers how a page is reclaimed once it has been selected for
pageout.
J.4.1 Function: shrink_cache
Source: mm/vmscan.c
338 static int shrink_cache(int nr_pages, zone_t * classzone,
unsigned int gfp_mask, int priority)
339 {
340 struct list_head * entry;
341 int max_scan = nr_inactive_pages / priority;
342 int max_mapped = min((nr_pages << (10 - priority)),
max_scan / 10);
343
344 spin_lock(&pagemap_lru_lock);
345 while (--max_scan >= 0 &&
(entry = inactive_list.prev) != &inactive_list) {
- 338The parameters are as follows;
- nr_pages is the number of pages to swap out
- classzone is the zone we are interested in swapping pages out for. Pages
not belonging to this zone are skipped
- gfp_mask is the GFP mask determining what actions may be taken, such as
whether filesystem operations may be performed
- priority is the priority of the function. It starts at
DEF_PRIORITY (6) and decreases to the highest priority of 1
- 341The maximum number of pages to scan is the number of pages in the
inactive_list divided by the priority. At lowest priority, 1/6th
of the list may be scanned. At highest priority, the full list may be scanned
- 342The maximum number of process-mapped pages allowed is the smaller of
one tenth of the max_scan value and nr_pages *
2^(10-priority). If this many mapped pages are found, whole processes
will be swapped out (a worked example of both limits is given at the end of
this section)
- 344Lock the LRU list
- 345Keep scanning until max_scan pages have been scanned
or the inactive_list is empty
346 struct page * page;
347
348 if (unlikely(current->need_resched)) {
349 spin_unlock(&pagemap_lru_lock);
350 __set_current_state(TASK_RUNNING);
351 schedule();
352 spin_lock(&pagemap_lru_lock);
353 continue;
354 }
355
- 348-354Reschedule if the quanta has been used up
- 349Free the LRU lock as we are about to sleep
- 350Show we are still running
- 351Call schedule() so another process can be context switched
in
- 352Re-acquire the LRU lock
- 353Reiterate through the loop and take an entry from the
inactive_list again. As we slept, another process could have
changed what entries are on the list, which is why another entry has to be
taken with the spinlock held
356 page = list_entry(entry, struct page, lru);
357
358 BUG_ON(!PageLRU(page));
359 BUG_ON(PageActive(page));
360
361 list_del(entry);
362 list_add(entry, &inactive_list);
363
364 /*
365 * Zero page counts can happen because we unlink the pages
366 * _after_ decrementing the usage count..
367 */
368 if (unlikely(!page_count(page)))
369 continue;
370
371 if (!memclass(page_zone(page), classzone))
372 continue;
373
374 /* Racy check to avoid trylocking when not worthwhile */
375 if (!page->buffers && (page_count(page) != 1 || !page->mapping))
376 goto page_mapped;
- 356Get the struct page for this entry in the LRU
- 358-359It is a bug if the page either belongs to the
active_list or is currently marked as active
- 361-362Move the page to the top of the inactive_list so
that if the page is not freed, we can just continue knowing that it will be
simply examined later
- 368-369If the page count has already reached 0, skip over
it. In __free_pages(), the page count is dropped with
put_page_testzero() before __free_pages_ok() is
called to free it. This leaves a window where a page with a zero count is
left on the LRU before it is freed. There is a special case to trap this at
the beginning of __free_pages_ok()
- 371-372Skip over this page if it belongs to a zone we are not
currently interested in
- 375-376If the page is mapped by a process, then goto
page_mapped where max_mapped is decremented and the
next page is examined. If max_mapped reaches 0, process pages will
be swapped out
382 if (unlikely(TryLockPage(page))) {
383 if (PageLaunder(page) && (gfp_mask & __GFP_FS)) {
384 page_cache_get(page);
385 spin_unlock(&pagemap_lru_lock);
386 wait_on_page(page);
387 page_cache_release(page);
388 spin_lock(&pagemap_lru_lock);
389 }
390 continue;
391 }
Page is locked and the launder bit is set. In this case, it is the second
time this page has been found dirty. The first time it was scheduled for IO
and placed back on the list. This time we wait until the IO is complete and
then try to free the page.
- 382-383If we could not lock the page, the PG_launder bit
is set and the GFP flags allow the caller to perform FS operations, then...
- 384Take a reference to the page so it does not disappear while we
sleep
- 385Free the LRU lock
- 386Wait until the IO is complete
- 387Release the reference to the page. If it reaches 0, the page will be
freed
- 388Re-acquire the LRU lock
- 390Move to the next page
392
393 if (PageDirty(page) &&
is_page_cache_freeable(page) &&
page->mapping) {
394 /*
395 * It is not critical here to write it only if
396 * the page is unmapped beause any direct writer
397 * like O_DIRECT would set the PG_dirty bitflag
398 * on the phisical page after having successfully
399 * pinned it and after the I/O to the page is finished,
400 * so the direct writes to the page cannot get lost.
401 */
402 int (*writepage)(struct page *);
403
404 writepage = page->mapping->a_ops->writepage;
405 if ((gfp_mask & __GFP_FS) && writepage) {
406 ClearPageDirty(page);
407 SetPageLaunder(page);
408 page_cache_get(page);
409 spin_unlock(&pagemap_lru_lock);
410
411 writepage(page);
412 page_cache_release(page);
413
414 spin_lock(&pagemap_lru_lock);
415 continue;
416 }
417 }
This handles the case where a page is dirty, is not mapped by any process,
has no buffers and is backed by a file or device mapping. The page is cleaned
and will be reclaimed by the previous block of code when the IO is complete.
- 393PageDirty() checks the PG_dirty bit,
is_page_cache_freeable() will return true if it is not mapped
by any process and has no buffers
- 404Get a pointer to the necessary writepage() function for
this mapping or device
- 405-416This block of code can only be executed if a
writepage() function is available and the GFP flags allow file
operations
- 406-407Clear the dirty bit and mark that the page is being laundered
- 408Take a reference to the page so it will not be freed unexpectedly
- 409Unlock the LRU list
- 411Call the filesystem-specific writepage() function which is
taken from the address_space_operations belonging to
page→mapping
- 412Release the reference to the page
- 414-415Re-acquire the LRU list lock and move to the next page
424 if (page->buffers) {
425 spin_unlock(&pagemap_lru_lock);
426
427 /* avoid to free a locked page */
428 page_cache_get(page);
429
430 if (try_to_release_page(page, gfp_mask)) {
431 if (!page->mapping) {
438 spin_lock(&pagemap_lru_lock);
439 UnlockPage(page);
440 __lru_cache_del(page);
441
442 /* effectively free the page here */
443 page_cache_release(page);
444
445 if (--nr_pages)
446 continue;
447 break;
448 } else {
454 page_cache_release(page);
455
456 spin_lock(&pagemap_lru_lock);
457 }
458 } else {
459 /* failed to drop the buffers so stop here */
460 UnlockPage(page);
461 page_cache_release(page);
462
463 spin_lock(&pagemap_lru_lock);
464 continue;
465 }
466 }
Page has buffers associated with it that must be freed.
- 425Release the LRU lock as we may sleep
- 428Take a reference to the page
- 430Call try_to_release_page() which will attempt to release
the buffers associated with the page. Returns 1 if it succeeds
- 431-447This is the case where an anonymous page that was in the swap
cache has now had its buffers cleared and removed. As it was in the swap
cache, it was placed on the LRU by add_to_swap_cache() so remove
it now from the LRU and drop the reference to the page. In
swap_writepage(), it calls remove_exclusive_swap_page()
which will delete the page from the swap cache when there are no more
processes mapping the page. This block will free the page after the buffers
have been written out if it was backed by a swap file
- 438-440Take the LRU list lock, unlock the page, delete it from the
page cache and free it
- 445-446Update nr_pages to show a page has been freed and
move to the next page
- 447If nr_pages drops to 0, then exit the loop as the work
is completed
- 449-456If the page does have an associated mapping then simply drop
the reference to the page and re-acquire the LRU lock. More work will be
performed later to remove the page from the page cache at line 499
- 459-464If the buffers could not be freed, then unlock the page, drop
the reference to it, re-acquire the LRU lock and move to the next page
468 spin_lock(&pagecache_lock);
469
470 /*
471 * this is the non-racy check for busy page.
472 */
473 if (!page->mapping || !is_page_cache_freeable(page)) {
474 spin_unlock(&pagecache_lock);
475 UnlockPage(page);
476 page_mapped:
477 if (--max_mapped >= 0)
478 continue;
479
484 spin_unlock(&pagemap_lru_lock);
485 swap_out(priority, gfp_mask, classzone);
486 return nr_pages;
487 }
- 468From this point on, pages in the swap cache are likely to be
examined. The swap cache is protected by the pagecache_lock, which
must now be held
- 473-487An anonymous page with no buffers is mapped by a process
- 474-475Release the page cache lock and the page
- 477-478Decrement max_mapped. If it has not reached 0,
move to the next page
- 484-485Too many mapped pages have been found in the page cache. The LRU
lock is released and swap_out() is called to begin swapping out whole
processes
493 if (PageDirty(page)) {
494 spin_unlock(&pagecache_lock);
495 UnlockPage(page);
496 continue;
497 }
- 493-497The page has no references but could have been dirtied by
the last process to free it if the dirty bit was set in the PTE. It is left
in the page cache and will get laundered later. Once it has been cleaned,
it can be safely deleted
498
499 /* point of no return */
500 if (likely(!PageSwapCache(page))) {
501 __remove_inode_page(page);
502 spin_unlock(&pagecache_lock);
503 } else {
504 swp_entry_t swap;
505 swap.val = page->index;
506 __delete_from_swap_cache(page);
507 spin_unlock(&pagecache_lock);
508 swap_free(swap);
509 }
510
511 __lru_cache_del(page);
512 UnlockPage(page);
513
514 /* effectively free the page here */
515 page_cache_release(page);
516
517 if (--nr_pages)
518 continue;
519 break;
520 }
- 500-503If the page does not belong to the swap cache, it is part of the
inode queue so it is removed
- 504-508Remove it from the swap cache as there are no more references to
it
- 511Delete it from the page cache
- 512Unlock the page
- 515Free the page
- 517-518Decrement nr_pages and move to the next page if it
is not 0
- 519If it reaches 0, the work of the function is complete
521 spin_unlock(&pagemap_lru_lock);
522
523 return nr_pages;
524 }
- 521-524Function exit. Free the LRU lock and return the number of pages
left to free
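The two limits calculated at lines 341-342 both grow as the priority falls,
so each pass is allowed to do more work. The short userspace program below is
not kernel code; nr_inactive_pages is an invented example figure and nr_pages
is taken as SWAP_CLUSTER_MAX (32). It simply prints how max_scan and
max_mapped change as the priority drops from DEF_PRIORITY down to 1.

/*
 * Illustrative calculation (not kernel code) of the max_scan and
 * max_mapped limits computed at lines 341-342 of shrink_cache().
 */
#include <stdio.h>

int main(void)
{
        int nr_inactive_pages = 6000;   /* invented example figure */
        int nr_pages = 32;              /* SWAP_CLUSTER_MAX        */

        for (int priority = 6; priority >= 1; priority--) {
                int max_scan = nr_inactive_pages / priority;
                int shifted = nr_pages << (10 - priority);
                int max_mapped = shifted < max_scan / 10 ?
                                 shifted : max_scan / 10;

                printf("priority %d: max_scan %d max_mapped %d\n",
                       priority, max_scan, max_mapped);
        }
        return 0;
}

At priority 6 only 1000 of the 6000 inactive pages may be scanned and only
100 mapped pages are tolerated; at priority 1 the whole list may be scanned
and 600 mapped pages are tolerated before swap_out() is invoked.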
J.5 Shrinking all caches
J.5.1 Function: shrink_caches
Source: mm/vmscan.c
The call graph for this function is shown in Figure 10.4.
560 static int shrink_caches(zone_t * classzone, int priority,
unsigned int gfp_mask, int nr_pages)
561 {
562 int chunk_size = nr_pages;
563 unsigned long ratio;
564
565 nr_pages -= kmem_cache_reap(gfp_mask);
566 if (nr_pages <= 0)
567 return 0;
568
569 nr_pages = chunk_size;
570 /* try to keep the active list 2/3 of the size of the cache */
571 ratio = (unsigned long) nr_pages *
nr_active_pages / ((nr_inactive_pages + 1) * 2);
572 refill_inactive(ratio);
573
574 nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
575 if (nr_pages <= 0)
576 return 0;
577
578 shrink_dcache_memory(priority, gfp_mask);
579 shrink_icache_memory(priority, gfp_mask);
580 #ifdef CONFIG_QUOTA
581 shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
582 #endif
583
584 return nr_pages;
585 }
- 560The parameters are as follows;
- classzone is the zone that pages should be freed from
- priority determines how much work will be done to free pages
- gfp_mask determines what sort of actions may be taken
- nr_pages is the number of pages remaining to be freed
- 565-567Ask the slab allocator to free up some pages with
kmem_cache_reap() (See Section H.1.5.1). If enough are
freed, the function returns. Otherwise, nr_pages will be freed from
other caches
- 571-572Move pages from the active_list to the
inactive_list by calling refill_inactive() (See Section J.3.1).
The number of pages moved depends on how many pages need to be freed and on
keeping the active_list at about two thirds the size of the page cache
- 574-575Shrink the page cache, if enough pages are freed, return
- 578-582Shrink the dcache, icache and dqcache. These are small objects
in themselves but the cascading effect frees up a lot of disk buffers
- 584Return the number of pages remaining to be freed
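To make the calculation at line 571 concrete, take some invented figures:
with nr_pages = 32, nr_active_pages = 8000 and nr_inactive_pages = 3999, the
active list is two thirds of the page cache and
ratio = 32 * 8000 / ((3999 + 1) * 2) = 32, so roughly one page is moved to
the inactive list for every page that has to be freed. If the active list
grows to, say, 12000 pages against the same inactive list,
ratio = 32 * 12000 / 8000 = 48 and refill_inactive() moves pages more
aggressively until the 2:1 balance is restored.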
J.5.2 Function: try_to_free_pages
Source: mm/vmscan.c
This function cycles through all pgdats and tries to balance the
preferred allocation zone (usually ZONE_NORMAL) for each of them. This
function is only called from one place, buffer.c:free_more_memory()
when the buffer manager fails to create new buffers or grow existing ones. It
calls try_to_free_pages() with GFP_NOIO as the
gfp_mask.
This results in the first zone in
pg_data_t→node_zonelists having pages freed so that
buffers can grow. This array is the preferred order of zones to allocate
from and usually will begin with ZONE_NORMAL which is required by the
buffer manager. On NUMA architectures, some nodes may have ZONE_DMA
as the preferred zone if the memory bank is dedicated to IO devices and UML
also uses only this zone. As the buffer manager is restricted in the zones
it uses, there is no point balancing other zones.
607 int try_to_free_pages(unsigned int gfp_mask)
608 {
609 pg_data_t *pgdat;
610 zonelist_t *zonelist;
611 unsigned long pf_free_pages;
612 int error = 0;
613
614 pf_free_pages = current->flags & PF_FREE_PAGES;
615 current->flags &= ~PF_FREE_PAGES;
616
617 for_each_pgdat(pgdat) {
618 zonelist = pgdat->node_zonelists +
(gfp_mask & GFP_ZONEMASK);
619 error |= try_to_free_pages_zone(
zonelist->zones[0], gfp_mask);
620 }
621
622 current->flags |= pf_free_pages;
623 return error;
624 }
- 614-615This clears the PF_FREE_PAGES flag if it is set
so that pages freed by the process will be returned to the global pool rather
than reserved for the process itself
- 617-620Cycle through all nodes and call try_to_free_pages_zone()
for the preferred zone in each node
- 618This function is only called with GFP_NOIO as a
parameter. When ANDed with GFP_ZONEMASK, it will always result
in 0
- 622-623Restore the process flags and return the result
J.5.3 Function: try_to_free_pages_zone
Source: mm/vmscan.c
Try to free SWAP_CLUSTER_MAX pages from the requested zone. As
well as being used by kswapd, this function is the entry point for
the buddy allocator's direct-reclaim path.
587 int try_to_free_pages_zone(zone_t *classzone,
unsigned int gfp_mask)
588 {
589 int priority = DEF_PRIORITY;
590 int nr_pages = SWAP_CLUSTER_MAX;
591
592 gfp_mask = pf_gfp_mask(gfp_mask);
593 do {
594 nr_pages = shrink_caches(classzone, priority,
gfp_mask, nr_pages);
595 if (nr_pages <= 0)
596 return 1;
597 } while (--priority);
598
599 /*
600 * Hmm.. Cache shrink failed - time to kill something?
601 * Mhwahahhaha! This is the part I really like. Giggle.
602 */
603 out_of_memory();
604 return 0;
605 }
- 589Start with the lowest priority. Statically defined to be 6
- 590Try and free SWAP_CLUSTER_MAX pages. Statically
defined to be 32
- 592pf_gfp_mask() checks the PF_NOIO flag
in the current process flags. If no IO can be performed, it ensures there
are no incompatible flags in the GFP mask
- 593-597Starting with the lowest priority and increasing with each pass,
call shrink_caches() until nr_pages has been freed
- 595-596If enough pages were freed, return indicating that the work is
complete
- 603If enough pages could not be freed even at highest priority (where
at worst the full inactive_list is scanned) then check to see if
we are out of memory. If we are, then a process will be selected to be killed
- 604Return indicating that we failed to free enough pages
J.6 Swapping Out Process Pages
This section covers the path where too many process mapped pages have been
found in the LRU lists. This path will start scanning whole processes and
reclaiming the mapped pages.
J.6.1 Function: swap_out
Source: mm/vmscan.c
The call graph for this function is shown in Figure 10.5. This
function linearly searches through every process's page tables trying
to swap out SWAP_CLUSTER_MAX pages. The process
it starts with is the swap_mm and the starting address is
mm→swap_address.
296 static int swap_out(unsigned int priority, unsigned int gfp_mask,
zone_t * classzone)
297 {
298 int counter, nr_pages = SWAP_CLUSTER_MAX;
299 struct mm_struct *mm;
300
301 counter = mmlist_nr;
302 do {
303 if (unlikely(current->need_resched)) {
304 __set_current_state(TASK_RUNNING);
305 schedule();
306 }
307
308 spin_lock(&mmlist_lock);
309 mm = swap_mm;
310 while (mm->swap_address == TASK_SIZE || mm == &init_mm) {
311 mm->swap_address = 0;
312 mm = list_entry(mm->mmlist.next,
struct mm_struct, mmlist);
313 if (mm == swap_mm)
314 goto empty;
315 swap_mm = mm;
316 }
317
318 /* Make sure the mm doesn't disappear
when we drop the lock.. */
319 atomic_inc(&mm->mm_users);
320 spin_unlock(&mmlist_lock);
321
322 nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
323
324 mmput(mm);
325
326 if (!nr_pages)
327 return 1;
328 } while (--counter >= 0);
329
330 return 0;
331
332 empty:
333 spin_unlock(&mmlist_lock);
334 return 0;
335 }
- 301Set the counter so the process list is only scanned once
- 303-306Reschedule if the quanta has been used up to prevent CPU
hogging
- 308Acquire the lock protecting the mm list
- 309Start with the swap_mm. It is interesting this is never
checked to make sure it is valid. It is possible, albeit unlikely that the
process with the mm has exited since the last scan and
the slab holding the mm_struct has been reclaimed during a
cache shrink making the pointer totally invalid. The lack of bug reports
might be because the slab rarely gets reclaimed and would be difficult to
trigger in reality
- 310-316Move to the next process if the swap_address has
reached the TASK_SIZE or if the mm is the init_mm
- 311Start at the beginning of the process space
- 312Move to the next mm_struct on the mmlist
- 313-314If we have wrapped back around to swap_mm, there are no
processes that can be examined
- 315Record the swap_mm for the next pass
- 319Increase the reference count so that the mm does not get freed while
we are scanning
- 320Release the mm lock
- 322Begin scanning the mm with swap_out_mm()(See Section J.6.2)
- 324Drop the reference to the mm
- 326-327If the required number of pages has been freed, return success
- 328If the required number of pages was not freed, decrement the counter
and scan another process
- 330Return failure
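The swap_mm/swap_address pair behave as a resumable cursor over a circular
list of address spaces. The following userspace sketch, which is not kernel
code and uses invented names and sizes, models how the next mm to scan is
chosen and how a full wrap of the list is detected; the kernel's skipping of
init_mm and its locking are omitted.

/* Userspace sketch (not kernel code) of the swap_mm/swap_address cursor */
#include <stdio.h>

#define NR_MM      3
#define TASK_SIZE  100UL        /* pretend address-space size */

struct mm {
        unsigned long swap_address;     /* where the last scan stopped */
};

static struct mm mms[NR_MM];
static int swap_mm;                     /* index of the cursor mm */

/* Return the index of the next mm to scan, or -1 if all are exhausted */
static int next_mm_to_scan(void)
{
        int mm = swap_mm;

        while (mms[mm].swap_address == TASK_SIZE) {
                mms[mm].swap_address = 0;       /* rewind for a later pass  */
                mm = (mm + 1) % NR_MM;
                if (mm == swap_mm)
                        return -1;              /* wrapped: nothing to scan */
                swap_mm = mm;
        }
        return mm;
}

int main(void)
{
        mms[0].swap_address = TASK_SIZE;        /* already fully scanned */
        mms[1].swap_address = 40;               /* partly scanned        */

        int mm = next_mm_to_scan();
        if (mm < 0) {
                printf("nothing to scan\n");
                return 0;
        }
        printf("scan mm %d from address %lu\n", mm, mms[mm].swap_address);
        return 0;
}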
J.6.2 Function: swap_out_mm
Source: mm/vmscan.c
Walk through each VMA and call swap_out_vma() for each one.
256 static inline int swap_out_mm(struct mm_struct * mm, int count,
int * mmcounter, zone_t * classzone)
257 {
258 unsigned long address;
259 struct vm_area_struct* vma;
260
265 spin_lock(&mm->page_table_lock);
266 address = mm->swap_address;
267 if (address == TASK_SIZE || swap_mm != mm) {
268 /* We raced: don't count this mm but try again */
269 ++*mmcounter;
270 goto out_unlock;
271 }
272 vma = find_vma(mm, address);
273 if (vma) {
274 if (address < vma->vm_start)
275 address = vma->vm_start;
276
277 for (;;) {
278 count = swap_out_vma(mm, vma, address,
count, classzone);
279 vma = vma->vm_next;
280 if (!vma)
281 break;
282 if (!count)
283 goto out_unlock;
284 address = vma->vm_start;
285 }
286 }
287 /* Indicate that we reached the end of address space */
288 mm->swap_address = TASK_SIZE;
289
290 out_unlock:
291 spin_unlock(&mm->page_table_lock);
292 return count;
293 }
- 265Acquire the page table lock for this mm
- 266Start with the address contained in swap_address
- 267-271If the address is TASK_SIZE, it means that another
thread raced and scanned this process already. Increase mmcounter so that
swap_out() knows to go to another process
- 272Find the VMA for this address
- 273Presuming a VMA was found then ....
- 274-275Start at the beginning of the VMA
- 277-285Scan through this and each subsequent VMA calling
swap_out_vma() (See Section J.6.3) for each one. If the
requisite number of pages (count) is freed, then finish scanning and return
- 288Once the last VMA has been scanned, set swap_address
to TASK_SIZE so that this process will be skipped over by
swap_out() next time
J.6.3 Function: swap_out_vma
Source: mm/vmscan.c
Walk through this VMA and for each PGD in it, call swap_out_pgd().
227 static inline int swap_out_vma(struct mm_struct * mm,
struct vm_area_struct * vma,
unsigned long address, int count,
zone_t * classzone)
228 {
229 pgd_t *pgdir;
230 unsigned long end;
231
232 /* Don't swap out areas which are reserved */
233 if (vma->vm_flags & VM_RESERVED)
234 return count;
235
236 pgdir = pgd_offset(mm, address);
237
238 end = vma->vm_end;
239 BUG_ON(address >= end);
240 do {
241 count = swap_out_pgd(mm, vma, pgdir,
address, end, count, classzone);
242 if (!count)
243 break;
244 address = (address + PGDIR_SIZE) & PGDIR_MASK;
245 pgdir++;
246 } while (address && (address < end));
247 return count;
248 }
- 233-234Skip over this VMA if the VM_RESERVED flag is
set. This is used by some device drivers such as the SCSI generic driver
- 236Get the starting PGD for the address
- 238Mark where the end is and call BUG() if the starting
address is somehow past the end
- 240Cycle through PGDs until the end address is reached
- 241Call swap_out_pgd()(See Section J.6.4) keeping
count of how many more pages need to be freed
- 242-243If enough pages have been freed, break and return
- 244-245Move to the next PGD and move the address to the next PGD
aligned address
- 247Return the remaining number of pages to be freed
J.6.4 Function: swap_out_pgd
Source: mm/vmscan.c
Step through all PMDs in the supplied PGD and call swap_out_pmd() for each one.
197 static inline int swap_out_pgd(struct mm_struct * mm,
struct vm_area_struct * vma, pgd_t *dir,
unsigned long address, unsigned long end,
int count, zone_t * classzone)
198 {
199 pmd_t * pmd;
200 unsigned long pgd_end;
201
202 if (pgd_none(*dir))
203 return count;
204 if (pgd_bad(*dir)) {
205 pgd_ERROR(*dir);
206 pgd_clear(dir);
207 return count;
208 }
209
210 pmd = pmd_offset(dir, address);
211
212 pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK;
213 if (pgd_end && (end > pgd_end))
214 end = pgd_end;
215
216 do {
217 count = swap_out_pmd(mm, vma, pmd,
address, end, count, classzone);
218 if (!count)
219 break;
220 address = (address + PMD_SIZE) & PMD_MASK;
221 pmd++;
222 } while (address && (address < end));
223 return count;
224 }
- 202-203If there is no PGD, return
- 204-208If the PGD is bad, flag it as such and return
- 210Get the starting PMD
- 212-214Calculate the end to be the end of this PGD or the end of the
VMA being scanned, whichever is closer
- 216-222For each PMD in this PGD, call swap_out_pmd()
(See Section J.6.5). If enough pages get freed, break and return
- 223Return the number of pages remaining to be freed
J.6.5 Function: swap_out_pmd
Source: mm/vmscan.c
For each PTE in this PMD, call try_to_swap_out(). On completion,
mm→swap_address is updated to show where we finished, to
prevent the same page being examined soon after this scan.
158 static inline int swap_out_pmd(struct mm_struct * mm,
struct vm_area_struct * vma, pmd_t *dir,
unsigned long address, unsigned long end,
int count, zone_t * classzone)
159 {
160 pte_t * pte;
161 unsigned long pmd_end;
162
163 if (pmd_none(*dir))
164 return count;
165 if (pmd_bad(*dir)) {
166 pmd_ERROR(*dir);
167 pmd_clear(dir);
168 return count;
169 }
170
171 pte = pte_offset(dir, address);
172
173 pmd_end = (address + PMD_SIZE) & PMD_MASK;
174 if (end > pmd_end)
175 end = pmd_end;
176
177 do {
178 if (pte_present(*pte)) {
179 struct page *page = pte_page(*pte);
180
181 if (VALID_PAGE(page) && !PageReserved(page)) {
182 count -= try_to_swap_out(mm, vma,
address, pte,
page, classzone);
183 if (!count) {
184 address += PAGE_SIZE;
185 break;
186 }
187 }
188 }
189 address += PAGE_SIZE;
190 pte++;
191 } while (address && (address < end));
192 mm->swap_address = address;
193 return count;
194 }
- 163-164Return if there is no PMD
- 165-169If the PMD is bad, flag it as such and return
- 171Get the starting PTE
- 173-175Calculate the end to be the end of the PMD or the end of the
VMA, whichever is closer
- 177-191Cycle through each PTE
- 178Make sure the PTE is marked present
- 179Get the struct page for this PTE
- 181If it is a valid page and it is not reserved then ...
- 182Call try_to_swap_out()
- 183-186If enough pages have been swapped out, move the address to the
next page and break to return
- 189-190Move to the next page and PTE
- 192Update the swap_address to show where we last finished off
- 193Return the number of pages remaining to be freed
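All three walk functions advance through the address range with the same
rounding idiom: add the size of the unit being walked, then mask down to the
unit boundary, so the first, possibly partial, unit is handled and every
subsequent iteration starts on a boundary. A small userspace illustration is
below; the shift and the addresses are made-up example values rather than the
real x86 constants.

/* Userspace sketch (not kernel code) of the address rounding idiom */
#include <stdio.h>

#define UNIT_SHIFT 22UL                      /* example unit size (4MiB) */
#define UNIT_SIZE  (1UL << UNIT_SHIFT)
#define UNIT_MASK  (~(UNIT_SIZE - 1))

int main(void)
{
        unsigned long address = 0x00401234UL;  /* arbitrary start   */
        unsigned long end     = 0x01000000UL;  /* arbitrary VMA end */

        do {
                printf("scan unit starting at 0x%08lx\n", address);
                /* round up to the start of the next unit */
                address = (address + UNIT_SIZE) & UNIT_MASK;
        } while (address && (address < end));
        return 0;
}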
J.6.6 Function: try_to_swap_out
Source: mm/vmscan.c
This function tries to swap out a page from a process. It is quite a large
function so will be dealt with in parts. Broadly speaking they are
- Function preamble, ensure this is a page that should be swapped out
- Remove the page and PTE from the page tables
- Handle the case where the page is already in the swap cache
- Handle the case where the page is dirty or has associated buffers
- Handle the case where the page is being added to the swap cache
47 static inline int try_to_swap_out(struct mm_struct * mm,
struct vm_area_struct* vma,
unsigned long address,
pte_t * page_table,
struct page *page,
zone_t * classzone)
48 {
49 pte_t pte;
50 swp_entry_t entry;
51
52 /* Don't look at this pte if it's been accessed recently. */
53 if ((vma->vm_flags & VM_LOCKED) ||
ptep_test_and_clear_young(page_table)) {
54 mark_page_accessed(page);
55 return 0;
56 }
57
58 /* Don't bother unmapping pages that are active */
59 if (PageActive(page))
60 return 0;
61
62 /* Don't bother replenishing zones not under pressure.. */
63 if (!memclass(page_zone(page), classzone))
64 return 0;
65
66 if (TryLockPage(page))
67 return 0;
- 53-56If the VMA is locked in memory (VM_LOCKED) or the PTE shows
the page has been accessed recently (the accessed bit is tested and cleared
by ptep_test_and_clear_young()), call
mark_page_accessed() (See Section J.2.3.1) to make
the struct page reflect the age and return 0 to show it was not swapped out
- 59-60If the page is on the active_list, do not swap it out
- 63-64If the page belongs to a zone we are not interested in, do not
swap it out
- 66-67If the page is already locked for IO, skip it
74 flush_cache_page(vma, address);
75 pte = ptep_get_and_clear(page_table);
76 flush_tlb_page(vma, address);
77
78 if (pte_dirty(pte))
79 set_page_dirty(page);
80
- 74Call the architecture hook to flush this page from all CPUs
- 75Get the PTE from the page tables and clear it
- 76Call the architecture hook to flush the TLB
- 78-79If the PTE was marked dirty, mark the struct page
dirty so it will be laundered correctly
86 if (PageSwapCache(page)) {
87 entry.val = page->index;
88 swap_duplicate(entry);
89 set_swap_pte:
90 set_pte(page_table, swp_entry_to_pte(entry));
91 drop_pte:
92 mm->rss--;
93 UnlockPage(page);
94 {
95 int freeable =
page_count(page) - !!page->buffers <= 2;
96 page_cache_release(page);
97 return freeable;
98 }
99 }
Handle the case where the page is already in the swap cache
- 86Enter this block only if the page is already in the swap cache. Note
that it can also be entered by calling goto to the set_swap_pte
and drop_pte labels
- 87-88Fill in the index value for the swap entry.
swap_duplicate() verifies the swap identifier is valid and increases
the counter in the swap_map if it is
- 90Fill the PTE with information needed to get the page from swap
- 92Update RSS to show there is one less page being mapped by the
process
- 93Unlock the page
- 95The page is free-able if the count is currently 2 or less and has no
buffers. If the count is higher, it is either being mapped by other processes
or is a file-backed page and the “user” is the page cache
- 96Decrement the reference count and free the page if it reaches 0.
Note that if this is a file-backed page, it will not reach 0 even if there are
no processes mapping it. The page will be later reclaimed from the page cache
by shrink_cache() (See Section J.4.1)
- 97Return if the page was freed or not
115 if (page->mapping)
116 goto drop_pte;
117 if (!PageDirty(page))
118 goto drop_pte;
124 if (page->buffers)
125 goto preserve;
- 115-116If the page has an associated mapping, simply drop it from the
page tables. When no processes are mapping it, it will be reclaimed from the
page cache by shrink_cache()
- 117-118If the page is clean, it is safe to simply drop it
- 124-125If it has associated buffers due to a truncate followed by a
page fault, then re-attach the page and PTE to the page tables as it cannot
be handled yet
126
127 /*
128 * This is a dirty, swappable page. First of all,
129 * get a suitable swap entry for it, and make sure
130 * we have the swap cache set up to associate the
131 * page with that swap entry.
132 */
133 for (;;) {
134 entry = get_swap_page();
135 if (!entry.val)
136 break;
137 /* Add it to the swap cache and mark it dirty
138 * (adding to the page cache will clear the dirty
139 * and uptodate bits, so we need to do it again)
140 */
141 if (add_to_swap_cache(page, entry) == 0) {
142 SetPageUptodate(page);
143 set_page_dirty(page);
144 goto set_swap_pte;
145 }
146 /* Raced with "speculative" read_swap_cache_async */
147 swap_free(entry);
148 }
149
150 /* No swap space left */
151 preserve:
152 set_pte(page_table, pte);
153 UnlockPage(page);
154 return 0;
155 }
- 134Allocate a swap entry for this page
- 135-136If one could not be allocated, break out where the PTE and page
will be re-attached to the process page tables
- 141Add the page to the swap cache
- 142Mark the page as up to date in memory
- 143Mark the page dirty so that it will be written out to swap soon
- 144Goto set_swap_pte which will update the PTE with
information needed to get the page from swap later
- 147If the add to swap cache failed, it means that the page was placed
in the swap cache already by a readahead so drop the work done here
- 152Reattach the PTE to the page tables
- 153Unlock the page
- 154Return that no page was freed
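Because the branches above are spread over several labels and goto
statements, the following userspace sketch, which is not kernel code,
summarises the order in which try_to_swap_out() classifies a page once the
PTE has been cleared. The enum, the function name and the boolean parameters
are illustrative only.

/* Userspace sketch (not kernel code) of try_to_swap_out()'s decisions */
#include <stdbool.h>
#include <stdio.h>

enum action {
        SET_SWAP_PTE,   /* page already in swap cache: write swap PTE   */
        DROP_PTE,       /* page cache page or clean page: drop the PTE  */
        PRESERVE,       /* buffers attached or no swap: re-map the PTE  */
        ADD_TO_SWAP     /* dirty anonymous page: allocate a swap entry  */
};

static enum action classify(bool in_swap_cache, bool has_mapping,
                            bool dirty, bool has_buffers)
{
        if (in_swap_cache)
                return SET_SWAP_PTE;
        if (has_mapping)
                return DROP_PTE;
        if (!dirty)
                return DROP_PTE;
        if (has_buffers)
                return PRESERVE;
        return ADD_TO_SWAP;     /* may still fall back to PRESERVE
                                   if get_swap_page() fails */
}

int main(void)
{
        /* dirty anonymous page with no buffers: ADD_TO_SWAP */
        printf("%d\n", classify(false, false, true, false));
        return 0;
}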
J.7 Page Swap Daemon
This section details the main loops used by the kswapd daemon which
is woken-up when memory is low. The main functions covered are the ones that
determine if kswapd can sleep and how it determines which nodes
need balancing.
J.7.1 Initialising kswapd
J.7.1.1 Function: kswapd_init
Source: mm/vmscan.c
Start the kswapd kernel thread
767 static int __init kswapd_init(void)
768 {
769 printk("Starting kswapd\n");
770 swap_setup();
771 kernel_thread(kswapd, NULL, CLONE_FS
| CLONE_FILES
| CLONE_SIGNAL);
772 return 0;
773 }
- 770swap_setup()(See Section K.4.2) sets up how
many pages will be prefetched when reading from backing storage based on
the amount of physical memory
- 771Start the kswapd kernel thread
J.7.2 kswapd Daemon
J.7.2.1 Function: kswapd
Source: mm/vmscan.c
The main function of the kswapd kernel thread.
720 int kswapd(void *unused)
721 {
722 struct task_struct *tsk = current;
723 DECLARE_WAITQUEUE(wait, tsk);
724
725 daemonize();
726 strcpy(tsk->comm, "kswapd");
727 sigfillset(&tsk->blocked);
728
741 tsk->flags |= PF_MEMALLOC;
742
746 for (;;) {
747 __set_current_state(TASK_INTERRUPTIBLE);
748 add_wait_queue(&kswapd_wait, &wait);
749
750 mb();
751 if (kswapd_can_sleep())
752 schedule();
753
754 __set_current_state(TASK_RUNNING);
755 remove_wait_queue(&kswapd_wait, &wait);
756
762 kswapd_balance();
763 run_task_queue(&tq_disk);
764 }
765 }
- 725Call daemonize() which will make this a kernel thread,
remove the mm context, close all files and re-parent the process
- 726Set the name of the process
- 727Ignore all signals
- 741By setting this flag, the physical page allocator will always try to
satisfy requests for pages. As this process will always be trying to free
pages, it is worth satisfying requests
- 746-764Endlessly loop
- 747-748This adds kswapd to the wait queue in preparation
to sleep
- 750The memory barrier mb() ensures that all reads
and writes that occurred before this line will be visible to all CPUs
- 751kswapd_can_sleep()(See Section J.7.2.2)
cycles through all nodes and zones checking the need_balance
field. If any of them are set to 1, kswapd can not sleep
- 752By calling schedule(), kswapd will now sleep
until woken again by the physical page allocator in __alloc_pages()
(See Section F.1.3)
- 754-755Once woken up, kswapd is removed from the wait
queue as it is now running
- 762kswapd_balance()(See Section J.7.2.4) cycles
through all zones and calls
try_to_free_pages_zone()(See Section J.5.3) for
each zone that requires balance
- 763Run the IO task queue to start writing data out to disk
J.7.2.2 Function: kswapd_can_sleep
Source: mm/vmscan.c
Simple function to cycle through all pgdats to call
kswapd_can_sleep_pgdat() on each.
695 static int kswapd_can_sleep(void)
696 {
697 pg_data_t * pgdat;
698
699 for_each_pgdat(pgdat) {
700 if (!kswapd_can_sleep_pgdat(pgdat))
701 return 0;
702 }
703
704 return 1;
705 }
- 699-702for_each_pgdat() does exactly as the name
implies. It cycles through all available pgdat's and in this case calls
kswapd_can_sleep_pgdat() (See Section J.7.2.3)
for each. On the x86, there will only be one pgdat
J.7.2.3 Function: kswapd_can_sleep_pgdat
Source: mm/vmscan.c
Cycles through all zones to make sure none of them need
balance. The zone→need_balance flag is set by
__alloc_pages() when the number of free pages in the zone reaches
the pages_low watermark.
680 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
681 {
682 zone_t * zone;
683 int i;
684
685 for (i = pgdat->nr_zones-1; i >= 0; i--) {
686 zone = pgdat->node_zones + i;
687 if (!zone->need_balance)
688 continue;
689 return 0;
690 }
691
692 return 1;
693 }
- 685-689Simple for loop to cycle through all zones
- 686The node_zones field is an array of all available
zones so adding i gives the index
- 687-688If the zone does not need balance, continue
- 689Return 0 if any zone needs balance, indicating kswapd
can not sleep
- 692Return indicating kswapd can sleep if the for loop completes
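For reference, the other side of this handshake is in __alloc_pages()
(See Section F.1.3). The fragment below is a simplified reconstruction rather
than a verbatim listing: the allocator sets need_balance, issues a memory
barrier that pairs with the mb() in kswapd() and then wakes
kswapd_wait if kswapd is sleeping on it.

/* Simplified reconstruction of the waker side in __alloc_pages() */
classzone->need_balance = 1;
mb();
if (waitqueue_active(&kswapd_wait))
        wake_up_interruptible(&kswapd_wait);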
J.7.2.4 Function: kswapd_balance
Source: mm/vmscan.c
Continuously cycle through each pgdat until none require balancing
667 static void kswapd_balance(void)
668 {
669 int need_more_balance;
670 pg_data_t * pgdat;
671
672 do {
673 need_more_balance = 0;
674
675 for_each_pgdat(pgdat)
676 need_more_balance |= kswapd_balance_pgdat(pgdat);
677 } while (need_more_balance);
678 }
- 672-677Cycle through all pgdats until none of them report
that they need balancing
- 675For each pgdat, call kswapd_balance_pgdat() to check if
the node requires balancing. If any node required balancing,
need_more_balance will be set to 1
J.7.2.5 Function: kswapd_balance_pgdat
Source: mm/vmscan.c
This function will check if a node requires balance by examining
each of the zones in it. If any zone requires balancing,
try_to_free_pages_zone() will be called.
641 static int kswapd_balance_pgdat(pg_data_t * pgdat)
642 {
643 int need_more_balance = 0, i;
644 zone_t * zone;
645
646 for (i = pgdat->nr_zones-1; i >= 0; i--) {
647 zone = pgdat->node_zones + i;
648 if (unlikely(current->need_resched))
649 schedule();
650 if (!zone->need_balance)
651 continue;
652 if (!try_to_free_pages_zone(zone, GFP_KSWAPD)) {
653 zone->need_balance = 0;
654 __set_current_state(TASK_INTERRUPTIBLE);
655 schedule_timeout(HZ);
656 continue;
657 }
658 if (check_classzone_need_balance(zone))
659 need_more_balance = 1;
660 else
661 zone->need_balance = 0;
662 }
663
664 return need_more_balance;
665 }
- 646-662Cycle through each zone and call
try_to_free_pages_zone() (See Section J.5.3)
if it needs re-balancing
- 647node_zones is an array and i is an index within it
- 648-649Call schedule() if the quanta is expired to prevent
kswapd hogging the CPU
- 650-651If the zone does not require balance, move to the next one
- 652-657If the function returns 0, it means the
out_of_memory() function was called because a sufficient number
of pages could not be freed. kswapd sleeps for 1 second to give
the system a chance to reclaim the killed process's pages and perform
IO. The zone is marked as balanced so kswapd will ignore this zone
until the allocator function __alloc_pages() complains again
- 658-661If it was successful, check_classzone_need_balance()
is called to see if the zone requires further balancing or not
- 664Return 1 if one zone requires further balancing