ksm: introduce ksm_max_page_sharing per page deduplication limit

Without a max deduplication limit for each KSM page, the list of the rmap_items associated to each stable_node can grow infinitely large. During the rmap walk each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime. A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to order of a few msec. Just doing the cond_resched() during the rmap walks is not enough, the list size must have a limit too, otherwise the caller could get blocked in (schedule friendly) kernel computations for seconds, unexpectedly. There's room for optimization to significantly reduce the IPI delivery cost during the page_referenced(), but at least for page_migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we've still to walk the whole KSM rmap list, so we can't allow millions or billions (ulimited) number of entries in the KSM stable_node rmap_item lists. The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical as before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups". Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write protected memory content, even if each "dup" will be pointed by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected if compared to an unlimited max_sharing_limit. It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself. Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits, provides for a zero memory footprint increase on 64bit archs. The memory overhead of the per-KSM page stable_tree and per virtual mapping rmap_item is unchanged. Only after the max_page_sharing limit hits, we need to allocate a stable_tree "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup in the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree. During stable rbtree lookups the stable_node "chain" is identified as stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()). When dropping stable_nodes, the stable_node "dup" is identified as stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()). The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used elsewhere in any stable_node->head/node to avoid a clashes with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON in case the list_head implementation changes in the future. The STABLE_NODE_DUP is picked as a random negative value in stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup". The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate limit the stable_node_chain->hlist prunes. The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kind of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree. While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed by any page->mapping and they're effectively stable rbtree KSM self contained metadata. [akpm@linux-foundation.org: fix non-NUMA build] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the rmap_items associated to each stable_node can grow infinitely large. During the rmap walk each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime. A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to order of a few msec. Just doing the cond_resched() during the rmap walks is not enough, the list size must have a limit too, otherwise the caller could get blocked in (schedule friendly) kernel computations for seconds, unexpectedly. There's room for optimization to significantly reduce the IPI delivery cost during the page_referenced(), but at least for page_migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we've still to walk the whole KSM rmap list, so we can't allow millions or billions (ulimited) number of entries in the KSM stable_node rmap_item lists. The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical as before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups". Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write protected memory content, even if each "dup" will be pointed by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected if compared to an unlimited max_sharing_limit. It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself. Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits, provides for a zero memory footprint increase on 64bit archs. The memory overhead of the per-KSM page stable_tree and per virtual mapping rmap_item is unchanged. Only after the max_page_sharing limit hits, we need to allocate a stable_tree "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup in the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree. During stable rbtree lookups the stable_node "chain" is identified as stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()). When dropping stable_nodes, the stable_node "dup" is identified as stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()). The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used elsewhere in any stable_node->head/node to avoid a clashes with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON in case the list_head implementation changes in the future. The STABLE_NODE_DUP is picked as a random negative value in stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup". The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate limit the stable_node_chain->hlist prunes. The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kind of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree. While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed by any page->mapping and they're effectively stable rbtree KSM self contained metadata. [akpm@linux-foundation.org: fix non-NUMA build] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Evgheni Dereveanchin <ederevea@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2c653d0e · Andrea Arcangeli · Linus Torvalds · 172ffeb9 · 2c653d0e · 2c653d0e
Commit 2c653d0e authored Jul 06, 2017 by Andrea Arcangeli Committed by Linus Torvalds Jul 06, 2017
Expand all Hide whitespace changes
Inline Side-by-side

Showing with 730 additions and 66 deletions

Documentation/vm/ksm.txt Documentation/vm/ksm.txt +63 -0

mm/ksm.c mm/ksm.c +667 -66

No files found.
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -98,6 +98,50 @@ use_zero_pages   - specifies whether empty pages (i.e. allocated pages
                   it is only effective for pages merged after the change.
                   Default: 0 (normal KSM behaviour as in earlier releases)

+max_page_sharing - Maximum sharing allowed for each KSM page. This
+                   enforces a deduplication limit to avoid the virtual
+                   memory rmap lists to grow too large. The minimum
+                   value is 2 as a newly created KSM page will have at
+                   least two sharers. The rmap walk has O(N)
+                   complexity where N is the number of rmap_items
+                   (i.e. virtual mappings) that are sharing the page,
+                   which is in turn capped by max_page_sharing. So
+                   this effectively spread the the linear O(N)
+                   computational complexity from rmap walk context
+                   over different KSM pages. The ksmd walk over the
+                   stable_node "chains" is also O(N), but N is the
+                   number of stable_node "dups", not the number of
+                   rmap_items, so it has not a significant impact on
+                   ksmd performance. In practice the best stable_node
+                   "dup" candidate will be kept and found at the head
+                   of the "dups" list. The higher this value the
+                   faster KSM will merge the memory (because there
+                   will be fewer stable_node dups queued into the
+                   stable_node chain->hlist to check for pruning) and
+                   the higher the deduplication factor will be, but
+                   the slowest the worst case rmap walk could be for
+                   any given KSM page. Slowing down the rmap_walk
+                   means there will be higher latency for certain
+                   virtual memory operations happening during
+                   swapping, compaction, NUMA balancing and page
+                   migration, in turn decreasing responsiveness for
+                   the caller of those virtual memory operations. The
+                   scheduler latency of other tasks not involved with
+                   the VM operations doing the rmap walk is not
+                   affected by this parameter as the rmap walks are
+                   always schedule friendly themselves.
+
+stable_node_chains_prune_millisecs - How frequently to walk the whole
+                   list of stable_node "dups" linked in the
+                   stable_node "chains" in order to prune stale
+                   stable_nodes. Smaller milllisecs values will free
+                   up the KSM metadata with lower latency, but they
+                   will make ksmd use more CPU during the scan. This
+                   only applies to the stable_node chains so it's a
+                   noop if not a single KSM page hit the
+                   max_page_sharing yet (there would be no stable_node
+                   chains in such case).
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:

 pages_shared     - how many shared pages are being used
@@ -106,10 +150,29 @@ pages_unshared   - how many pages unique but repeatedly checked for merging
 pages_volatile   - how many pages changing too fast to be placed in a tree
 full_scans       - how many times all mergeable areas have been scanned

+stable_node_chains - number of stable node chains allocated, this is
+		     effectively the number of KSM pages that hit the
+		     max_page_sharing limit
+stable_node_dups   - number of stable node dups queued into the
+		     stable_node chains
+
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
 pages_volatile embraces several different kinds of activity, but a high
 proportion there would also indicate poor use of madvise MADV_MERGEABLE.

+The maximum possible page_sharing/page_shared ratio is limited by the
+max_page_sharing tunable. To increase the ratio max_page_sharing must
+be increased accordingly.
+
+The stable_node_dups/stable_node_chains ratio is also affected by the
+max_page_sharing tunable, and an high ratio may indicate fragmentation
+in the stable_node dups, which could be solved by introducing
+fragmentation algorithms in ksmd which would refile rmap_items from
+one stable_node dup to another stable_node dup, in order to freeup
+stable_node "dups" with few rmap_items in them, but that may increase
+the ksmd CPU usage and possibly slowdown the readonly computations on
+the KSM pages of the applications.
+
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
--- a/mm/ksm.c
+++ b/mm/ksm.c