Commit 15b44736 authored by Hugh Dickins, committed by Linus Torvalds

mm/lru: revise the comments of lru_lock

Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix
the incorrect comments in code.  Also fixed some zone->lru_lock comment
error from ancient time.  etc.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Link: https://lkml.kernel.org/r/1604566549-62481-20-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 2a5e4e34
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 8. LRU
 ======
-Each memcg has its own private LRU. Now, its handling is under global
-VM's control (means that it's handled under global pgdat->lru_lock).
-Almost all routines around memcg's LRU is called by global LRU's
-list management functions under pgdat->lru_lock.
-A special function is mem_cgroup_isolate_pages(). This scans
-memcg's private LRU and call __isolate_lru_page() to extract a page
-from LRU.
-(By __isolate_lru_page(), the page is removed from both of global and
-private LRU.)
+Each memcg has its own vector of LRUs (inactive anon, active anon,
+inactive file, active file, unevictable) of pages from each node,
+each LRU handled under a single lru_lock for that memcg and node.
 9. Typical Tests.
 =================
......
@@ -287,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
-lock_page_cgroup()/unlock_page_cgroup() should not be called under
-the i_pages lock.
-Other lock order is following:
-PG_locked.
-mm->page_table_lock
-pgdat->lru_lock
-lock_page_cgroup.
-In many cases, just lock_page_cgroup() is called.
-per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-pgdat->lru_lock, it has no lock of its own.
+Lock order is as follows:
+Page lock (PG_locked bit of page->flags)
+mm->page_table_lock or split pte_lock
+lock_page_memcg (memcg->move_lock)
+mapping->i_pages lock
+lruvec->lru_lock.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
......
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 4. Per-CPU Allocator Activity
 =============================
......
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
 memory x86_64 systems.
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages. This differentiation is only important while the pages are,
 in fact, evictable.
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element). The memory controller tracks the movement of pages to
 and from the unevictable list.
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with. vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
 unevictable list in mlock_vma_page().
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
......
@@ -79,7 +79,7 @@ struct page {
 struct { /* Page cache and anonymous pages */
 /**
  * @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
  * by the page owner.
  */
 struct list_head lru;
......
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines. There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
......
@@ -102,8 +102,8 @@
  * ->swap_lock (try_to_unmap_one)
  * ->private_lock (try_to_unmap_one)
  * ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
  * ->private_lock (page_remove_rmap->set_page_dirty)
  * ->i_pages lock (page_remove_rmap->set_page_dirty)
  * bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
......
@@ -28,12 +28,12 @@
  * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  * anon_vma->rwsem
  * mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  * swap_lock (in swap_duplicate, swap_info_get)
  * mmlist_lock (in mmput, drain_mmlist and others)
  * mapping->private_lock (in __set_page_dirty_buffers)
- * mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ * lock_page_memcg move_lock (in __set_page_dirty_buffers)
  * i_pages lock (widely used)
+ * lruvec->lru_lock (in lock_page_lruvec_irq)
  * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  * sb_lock (within inode_lock in fs/fs-writeback.c)
......
@@ -1613,14 +1613,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 /**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan: The number of eligible pages to look through on the list.
  * @lruvec: The LRU vector to pull pages from.
@@ -1814,25 +1816,11 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 }
 /*
- * This moves pages from @list to corresponding LRU list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_refcount against each page.
- * But we had to alter page->flags anyway.
+ * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
+ * On return, @list is reused as a list of pages to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 struct list_head *list)
 {
@@ -2010,6 +1998,23 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 return nr_reclaimed;
 }
+/*
+ * shrink_active_list() moves pages from the active LRU to the inactive LRU.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold lru_lock across the whole operation. But if
+ * the pages are mapped, the processing is slow (page_referenced()), so
+ * we should drop lru_lock around each page. It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_refcount against each page.
+ * But we had to alter page->flags anyway.
+ */
 static void shrink_active_list(unsigned long nr_to_scan,
 struct lruvec *lruvec,
 struct scan_control *sc,
......