Commit f9c0f1c3 authored by Chengming Zhou, committed by Andrew Morton

mm/zswap: add more comments in shrink_memcg_cb()

Patch series "mm/zswap: optimize zswap lru list", v2.

This series is motivated by observing the zswap lru list shrinking, where we
noted some unexpected return values from zswap_writeback_entry():

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'

There are some -ENOMEM failures because, when the swap entry is freed to the
per-cpu swap pool, the zswap entry is not invalidated/dropped.  The shrinker
then encounters these trashy zswap entries, which can't be reclaimed, and
-ENOMEM is returned.

So move the invalidation ahead, to the point where the swap entry is freed to
the per-cpu swap pool, since there is no benefit to leaving trashy zswap
entries on the zswap tree and lru list.

Another case is -EEXIST, which is seen more often with
!zswap_exclusive_loads_enabled, where the swapin folio leaves a compressed
copy on the tree and lru list.  That copy can't be reclaimed until the folio
is removed from the swapcache.

Changing to zswap_exclusive_loads_enabled mode invalidates the entry on folio
swapin.  This has its own drawback: if that folio is still clean in the
swapcache and is swapped out again, we need to compress it again.  Please see
the commit for details on why we chose exclusive load as the default for
zswap.

Another optimization for -EEXIST is the addition of LRU_STOP, which supports
terminating the shrinking process early to avoid evicting warmer regions.

Testing uses a kernel build in tmpfs, with one 50GB swapfile, zswap
shrinker_enabled, and memory.max set to 2GB.

                mm-unstable   zswap-optimize
real               63.90s       63.25s
user             1064.05s     1063.40s
sys               292.32s      270.94s

The main optimization is in sys CPU time, about a 7% improvement.


This patch (of 6):

Add more comments in shrink_memcg_cb() to describe the deref dance
implemented to fix the race between lru writeback and swapoff, and the
reason why we rotate the entry at the beginning.

Also fix the stale comments in zswap_writeback_entry(), and add more comments
to state that we only dereference the tree after we get the swapcache
reference.

Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com
Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-1-99d4084260a0@bytedance.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent e374ae2b
@@ -1207,10 +1207,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/*
 	 * folio is locked, and the swapcache is now secured against
-	 * concurrent swapping to and from the slot. Verify that the
-	 * swap entry hasn't been invalidated and recycled behind our
-	 * backs (our zswap_entry reference doesn't prevent that), to
-	 * avoid overwriting a new swap folio with old compressed data.
+	 * concurrent swapping to and from the slot, and concurrent
+	 * swapoff so we can safely dereference the zswap tree here.
+	 * Verify that the swap entry hasn't been invalidated and recycled
+	 * behind our backs, to avoid overwriting a new swap folio with
+	 * old compressed data. Only when this is successful can the entry
+	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
 	spin_lock(&tree->lock);
@@ -1263,22 +1265,29 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	int writeback_result;

 	/*
-	 * Rotate the entry to the tail before unlocking the LRU,
-	 * so that in case of an invalidation race concurrent
-	 * reclaimers don't waste their time on it.
+	 * As soon as we drop the LRU lock, the entry can be freed by
+	 * a concurrent invalidation. This means the following:
 	 *
-	 * If writeback succeeds, or failure is due to the entry
-	 * being invalidated by the swap subsystem, the invalidation
-	 * will unlink and free it.
+	 * 1. We extract the swp_entry_t to the stack, allowing
+	 *    zswap_writeback_entry() to pin the swap entry and
+	 *    then validate the zswap entry against that swap entry's
+	 *    tree using pointer value comparison. Only when that
+	 *    is successful can the entry be dereferenced.
+	 *
+	 * 2. Usually, objects are taken off the LRU for reclaim. In
+	 *    this case this isn't possible, because if reclaim fails
+	 *    for whatever reason, we have no means of knowing if the
+	 *    entry is alive to put it back on the LRU.
+	 *
+	 * So rotate it before dropping the lock. If the entry is
+	 * written back or invalidated, the free path will unlink
+	 * it. For failures, rotation is the right thing as well.
 	 *
 	 * Temporary failures, where the same entry should be tried
 	 * again immediately, almost never happen for this shrinker.
 	 * We don't do any trylocking; -ENOMEM comes closest,
 	 * but that's extremely rare and doesn't happen spuriously
 	 * either. Don't bother distinguishing this case.
-	 *
-	 * But since they do exist in theory, the entry cannot just
-	 * be unlinked, or we could leak it. Hence, rotate.
 	 */
 	list_move_tail(item, &l->list);
...