- 28 Mar, 2023 40 commits
-
Qi Zheng authored
For now, reparent_shrinker_deferred() is the only holder of the read lock of shrinker_rwsem. It already holds the global cgroup_mutex, so it will not be called in parallel. Therefore, in order to convert shrinker_rwsem to shrinker_mutex later, change it here to hold the write lock of shrinker_rwsem while reparenting.

Link: https://lkml.kernel.org/r/20230313112819.38938-7-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Qi Zheng authored
Like global and memcg slab shrink, also use SRCU to make count and scan operations in memory shrinker debugfs lockless.

Link: https://lkml.kernel.org/r/20230313112819.38938-6-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kirill Tkhai authored
After we make slab shrink lockless with SRCU, the longest sleep in unregister_shrinker() will be the wait for all do_shrink_slab() calls to finish. To avoid a long unbreakable action in unregister_shrinker(), add shrinker_srcu_generation to restore a check similar to the rwsem_is_contended() check that we had before. For memcg slab shrink, we unlock SRCU and continue iterating from the next shrinker id.

Link: https://lkml.kernel.org/r/20230313112819.38938-5-zhengqi.arch@bytedance.com
Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Qi Zheng authored
Like global slab shrink, this commit also uses SRCU to make memcg slab shrink lockless.

We can reproduce the down_read_trylock() hotspot through the following script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands. Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]  [k] down_read_trylock
  19.40%  [kernel]  [k] pv_native_safe_halt
  16.24%  [kernel]  [k] up_read
  15.70%  [kernel]  [k] shrink_slab
   4.69%  [kernel]  [k] _find_next_bit
   2.62%  [kernel]  [k] shrink_node
   1.78%  [kernel]  [k] shrink_lruvec
   0.76%  [kernel]  [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]  [k] _find_next_bit
  16.97%  [kernel]  [k] shrink_slab
  15.82%  [kernel]  [k] pv_native_safe_halt
   9.58%  [kernel]  [k] shrink_node
   8.31%  [kernel]  [k] shrink_lruvec
   5.64%  [kernel]  [k] do_shrink_slab
   3.88%  [kernel]  [k] mem_cgroup_iter

At the same time, we use the following perf command to capture IPC information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      454187219766      cycles          test                            ( +-  1.84% )
       78896433101      instructions    test #    0.17  insn per cycle  ( +-  0.44% )

        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      841954709443      cycles          test                            ( +- 15.80% )  (98.69%)
      527258677936      instructions    test #    0.63  insn per cycle  ( +- 15.11% )  (98.68%)

          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops very seriously when calling down_read_trylock() at high frequency. After using SRCU, the IPC is at a normal level.

Link: https://lkml.kernel.org/r/20230313112819.38938-4-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
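The "insn per cycle" figures in the perf stat output above are simply the instruction count divided by the cycle count. As a quick sanity check, the following minimal sketch recomputes them from the raw counters. The `ipc` helper name is hypothetical and not part of the patch, and the sketch assumes awk is available.

```bash
#!/bin/bash
# Hypothetical helper, not part of the patch: compute instructions per cycle
# from the raw "instructions" and "cycles" counts reported by perf stat.
ipc() {
    # $1: instruction count, $2: cycle count
    awk -v insns="$1" -v cycles="$2" 'BEGIN { printf "%.2f\n", insns / cycles }'
}

ipc 78896433101 454187219766     # counters from the "before" run
ipc 527258677936 841954709443    # counters from the "after" run
```

The two calls should print 0.17 and 0.63, matching the values perf reports before and after the patchset.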
-
Qi Zheng authored
The shrinker_rwsem is a global read-write lock in the shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases.

1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() fails).

   For example, the real workload mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start
   of an overcommitted node containing many starting
   containers after node crash (or many resuming containers
   after reboot for kernel update). In these cases memory
   pressure is huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mounting a filesystem), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked.

Even if there is no competitor when shrinking slab, there may still be a problem. If we have a long shrinker list and we do not reclaim enough memory with each shrinker, then down_read_trylock() may be called with high frequency. Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle).

Many times in the past ([2],[3],[4],[5]), people wanted to replace the shrinker_rwsem trylock with SRCU in the slab shrink, but all these patches were abandoned because SRCU was not unconditionally enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"), SRCU is unconditionally enabled. So it's time to use SRCU to protect readers who previously held shrinker_rwsem.

This commit uses SRCU to make global slab shrink lockless; the memcg slab shrink is handled in the subsequent patch.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/

Link: https://lkml.kernel.org/r/20230313112819.38938-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Qi Zheng authored
Patch series "make slab shrink lockless", v5.

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

  52.22%  [kernel]  [k] down_read_trylock
  19.60%  [kernel]  [k] up_read
   8.86%  [kernel]  [k] shrink_slab
   2.44%  [kernel]  [k] idr_find
   1.25%  [kernel]  [k] count_shadow_nodes
   1.18%  [kernel]  [k] shrink_lruvec
   0.71%  [kernel]  [k] mem_cgroup_iter
   0.71%  [kernel]  [k] shrink_node
   0.55%  [kernel]  [k] find_next_bit

And we used bpftrace to capture its calltrace as follows:

@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    pagecache_get_page+255
    filemap_fault+1361
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 1161690
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    balance_pgdat+690
    kswapd+389
    kthread+246
    ret_from_fork+31
]: 8424884
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    __do_page_cache_readahead+244
    filemap_fault+1674
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 20917631

We can see that down_read_trylock() of shrinker_rwsem is being called with high frequency at that time. Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle).

Moreover, the shrinker_rwsem is a global read-write lock in the shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases.

1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() fails).

   For example, the real workload mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start of an
   overcommitted node containing many starting containers after node crash
   (or many resuming containers after reboot for kernel update). In these
   cases memory pressure is huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mounting a filesystem), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All the above cases can be solved by replacing the shrinker_rwsem trylocks with SRCU.

2. Survey
=========

Before doing the code implementation, I found that there were many similar submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was finally merged into the linux-next branch, but failed on arm allnoconfig (without CONFIG_SRCU).

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break when rwsem_is_contended()) was chosen in the end. Christoph Hellwig suggested using SRCU, but SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there were some objections to enabling SRCU. Later, because Kirill's focus moved to other things, the patchset was no longer updated.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

We can see that almost all of these earlier submissions were abandoned because SRCU was not unconditionally enabled. But now SRCU has been unconditionally enabled by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands. Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]  [k] down_read_trylock
  19.40%  [kernel]  [k] pv_native_safe_halt
  16.24%  [kernel]  [k] up_read
  15.70%  [kernel]  [k] shrink_slab
   4.69%  [kernel]  [k] _find_next_bit
   2.62%  [kernel]  [k] shrink_node
   1.78%  [kernel]  [k] shrink_lruvec
   0.76%  [kernel]  [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]  [k] _find_next_bit
  16.97%  [kernel]  [k] shrink_slab
  15.82%  [kernel]  [k] pv_native_safe_halt
   9.58%  [kernel]  [k] shrink_node
   8.31%  [kernel]  [k] shrink_lruvec
   5.64%  [kernel]  [k] do_shrink_slab
   3.88%  [kernel]  [k] mem_cgroup_iter

At the same time, we use the following perf command to capture IPC information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      454187219766      cycles          test                            ( +-  1.84% )
       78896433101      instructions    test #    0.17  insn per cycle  ( +-  0.44% )

        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      841954709443      cycles          test                            ( +- 15.80% )  (98.69%)
      527258677936      instructions    test #    0.63  insn per cycle  ( +- 15.11% )  (98.68%)

          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops very seriously when calling down_read_trylock() at high frequency. After using SRCU, the IPC is at a normal level.

This patch (of 8):

To prepare for the subsequent lockless memcg slab shrink, add a map_nr_max field to struct shrinker_info to record its own real shrinker_nr_max.

Link: https://lkml.kernel.org/r/20230313112819.38938-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230313112819.38938-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Lorenzo Stoakes authored
Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and __free_pages(..., 0) to use alloc_page(), __get_free_page() and __free_page() respectively in core code.

Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Collingbourne authored
Code inspection reveals that PG_skip_kasan_poison is redundant with kasantag, because the former is intended to be set iff the latter is the match-all tag. It can also be observed that it's basically pointless to poison pages which have kasantag=0, because any pages with this tag would have been pointed to by pointers with match-all tags, so poisoning the pages would have little to no effect in terms of bug detection. Therefore, change the condition in should_skip_kasan_poison() to check kasantag instead, and remove PG_skip_kasan_poison and associated flags.

Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com
Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sebastian Andrzej Siewior authored
io_mapping_map_atomic_wc() disables preemption and pagefaults for historical reasons. The conversion to io_mapping_map_local_wc(), which only disables migration, cannot be done wholesale because quite a few call sites need to be updated to accommodate the changed semantics.

On PREEMPT_RT enabled kernels the io_mapping_map_atomic_wc() semantics are problematic due to the implicit disabling of preemption, which makes it impossible to acquire 'sleeping' spinlocks within the mapped atomic sections.

PREEMPT_RT has replaced preempt_disable() with migrate_disable() for more than a decade. It could be argued that this is a justification to do this unconditionally, but PREEMPT_RT covers only a limited number of architectures and it disables some functionality, which limits the coverage further.

Limit the replacement to PREEMPT_RT for now. This is also done for kmap_atomic().

Link: https://lkml.kernel.org/r/20230310162905.O57Pj7hh@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Link: https://lore.kernel.org/CAFLxGvw0WMxaMqYqJ5WgvVSbKHq2D2xcXTOgMCpgq9nDC-MWTQ@mail.gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
When experimenting with shmem, having the option to avoid swap becomes a useful mechanism. One of the *raves* about brd over shmem is that you can avoid swap, but that's not really a good reason to use brd if we can instead use shmem. Using brd has its own good reasons to exist, but just because "tmpfs" doesn't let you do that is not a great reason to avoid it if we can easily add support for it.

I don't add support for reconfiguring incompatible options, but if we really wanted to we could add support for that.

To avoid swap we use mapping_set_unevictable() upon inode creation, and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.

Link: https://lkml.kernel.org/r/20230309230545.2930737-7-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
Update the docs to reflect a bit better why some folks prefer tmpfs over ramfs and clarify a bit more about the difference between brd ramdisks.

While at it, add THP docs for tmpfs, both the mount options and the sysfs file.

Link: https://lkml.kernel.org/r/20230309230545.2930737-6-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
In theory, when info->flags & VM_LOCKED is set we should not be getting shmem_writepage() called, so we should be verifying this with a WARN_ON_ONCE(). Since we should not be swapping, it is best to ensure we also don't do the folio split earlier. So just move the check early to avoid folio splits in case it's a dubious call.

We also have a similar early bail when !total_swap_pages, so just move that earlier to avoid the possible folio split in the same situation.

Link: https://lkml.kernel.org/r/20230309230545.2930737-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
i915_gem requires huge folios to be split when swapping. However, we have a check for usage of writepages() to ensure it is used only for swap purposes later. Avoid the splits if we're not being called for reclaim, even if they should in theory not happen.

This makes the conditions easier to follow in shmem_writepage().

Link: https://lkml.kernel.org/r/20230309230545.2930737-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
shmem_writepage() sets up variables typically used *after* a possible huge page split. However, even if that does happen, the address space mapping should not change, and the inode does not change either. So it should be safe to set that from the very beginning.

This commit makes no functional changes.

Link: https://lkml.kernel.org/r/20230309230545.2930737-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luis Chamberlain authored
Patch series "tmpfs: add the option to disable swap", v2.

I'm doing this work as part of future experimentation with tmpfs and the page cache, but given that a common complaint about tmpfs is the inability to work without the page cache, I figured this might be useful to others. It turns out it is: at least Christian Brauner indicates systemd uses ramfs for a few use-cases because they don't want to use swap, so having this option would let them move over to using tmpfs for those small use cases, see systemd-creds(1).

To see if you hit swap:

mkswap /dev/nvme2n1
swapon /dev/nvme2n1
free -h

With swap - what we see today
=============================
mount -t tmpfs -o size=5G tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       2.6Gi       1.2Gi       2.2Gi       2.2Gi       1.2Gi
Swap:           99Gi       2.8Gi        97Gi

Without swap
============
free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       387Mi       3.4Gi       2.1Mi        57Mi       3.3Gi
Swap:           99Gi          0B        99Gi
mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       2.6Gi       1.2Gi       2.3Gi       2.3Gi       1.1Gi
Swap:           99Gi        21Mi        99Gi

The mix and match remount testing
=================================

# Cannot disable swap after it was first enabled:

mount -t tmpfs -o size=5G tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
       dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot disable swap on remount

# Remount with the same noswap option is OK:

mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
dmesg -c

# Trying to enable swap with a remount after it was first disabled:

mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
       dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot enable swap on remount if it was disabled on first mount

This patch (of 6):

Matthew notes we should not need to check the folio lock on the writepage() callback, so remove it. This sanity check has been lingering since linux-history days. We remove this as we tidy up the writepage() callback to make things a bit clearer.

Link: https://lkml.kernel.org/r/20230309230545.2930737-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20230309230545.2930737-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
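When repeating the remount tests above, it can help to confirm which options a mount actually ended up with. The following is a minimal sketch, assuming that a tmpfs mounted with noswap lists `noswap` in the options field of /proc/mounts. The `has_mount_option` helper is hypothetical and not part of this series; the optional third argument lets it read a saved copy of the mount table instead of the live one.

```bash
#!/bin/bash
# Hypothetical helper, not part of the patch series: succeed if any entry for
# the given mount point carries the given option in its comma-separated
# options field (4th column of a /proc/mounts style table).
has_mount_option() {
    # $1: mount point, $2: option name, $3: mount table (default /proc/mounts)
    local mnt="$1" opt="$2" table="${3:-/proc/mounts}"
    awk -v mnt="$mnt" -v opt="$opt" '
        $2 == mnt {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt)
                    found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$table"
}
```

For example, `has_mount_option /data-tmpfs noswap && echo "swap disabled"` would report whether the mount point used in the tests was mounted with noswap. Note that the mount point must be given without a trailing slash, as it appears in the mount table.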
-
Jingyu Wang authored
Link: https://lkml.kernel.org/r/20230309104813.170309-1-jingyuwang_vip@163.com
Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Danilo Krummrich authored
Fix missing EXPORT_SYMBOL_GPL() statement for mas_preallocate().

It isn't actually used by anything yet, but mas_preallocate() is part of the maple tree's 'Advanced API'. All other functions of this API are exported already.

Link: https://lkml.kernel.org/r/20230302011035.4928-1-dakr@redhat.com
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage.

Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time.

Link: https://lkml.kernel.org/r/20230307143125.27778-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Gang He <ghe@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Use filemap_write_and_wait_range to write back the range of the dirty page instead of write_one_page in preparation of removing write_one_page and eventually ->writepage.

Link: https://lkml.kernel.org/r/20230307143125.27778-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Gang He <ghe@suse.com>
Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Patch series "remove most callers of write_one_page", v4. This series removes most users of the write_one_page API. These helpers internally call ->writepage which we are gradually removing from the kernel. This patch (of 3): We do not need to write out modified directory blocks immediately when modifying them while the page is locked. It is enough to do the flush somewhat later, which has the added benefit that inode times can be flushed as well. It also allows us to stop depending on the write_one_page() function. Ported from an ext2 patch by Jan Kara. Link: https://lkml.kernel.org/r/20230307143125.27778-1-hch@lst.de Link: https://lkml.kernel.org/r/20230307143125.27778-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Jan Kara <jack@suse.cz> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
Ensure that KMSAN does not report false positives in instrumented callers of stack_depot_save(), stack_depot_print(), and stack_depot_fetch(). Link: https://lkml.kernel.org/r/20230306111322.205724-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: syzbot <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
KMSAN does not instrument stackdepot and may treat memory allocated by it as uninitialized. This is not a problem for KMSAN itself, because its functions calling stackdepot API are also not instrumented. But other kernel features (e.g. netdev tracker) may access stack depot from instrumented code, which will lead to false positives, unless we explicitly mark stackdepot outputs as initialized. Link: https://lkml.kernel.org/r/20230306111322.205724-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yue Zhao authored
The knob for cgroup v1 memory controller: memory.soft_limit_in_bytes is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent the compiler from doing anything funky. The access of memcg->soft_limit is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->soft_limit are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-5-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-5-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yue Zhao authored
The knob for cgroup v1 memory controller: memory.oom_control is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent the compiler from doing anything funky. The access of memcg->oom_kill_disable is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->oom_kill_disable are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-4-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-4-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yue Zhao authored
The knob for cgroup v1 memory controller: memory.swappiness is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent the compiler from doing anything funky. The access of memcg->swappiness and vm_swappiness is lockless, so both of them can be concurrently set at the same time as we are trying to read them. All occurrences of memcg->swappiness and vm_swappiness are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-3-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-3-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yue Zhao authored
Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing fixes", v2. This patch series helps to prevent load/store tearing in several cgroup knobs. As kindly pointed out by Michal Hocko and Roman Gushchin, the changelog has been rephrased. Besides, more knobs were checked, according to kind suggestions from Shakeel Butt and Muchun Song. This patch (of 4): The knob for cgroup v2 memory controller: memory.oom.group is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely (the knob is usually configured long before any workload hits actual memcg oom) but it is better to use READ_ONCE/WRITE_ONCE to prevent the compiler from doing anything funky. The access of memcg->oom_group is lockless, so it can be concurrently set at the same time as we are trying to read it. Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Zi Yan authored
Fix two inputs to check_anon_huge() and one if condition, so the tests work as expected. Link: https://lkml.kernel.org/r/20230306160907.16804-1-zi.yan@sent.com Fixes: c07c343c ("selftests/vm: dedup THP helpers") Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Tested-by: Zach O'Keefe <zokeefe@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Gerald Schaefer authored
s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sergey Senozhatsky authored
We keep the old fullness (3/4 threshold) reporting in zs_stats_size_show(). Switch from almost full/empty stats to fine-grained per inuse ratio (fullness group) reporting, which gives significantly more data on class fragmentation. Link: https://lkml.kernel.org/r/20230304034835.2082479-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sergey Senozhatsky authored
The zsmalloc compaction algorithm has the potential to waste some CPU cycles, particularly when compacting pages within the same fullness group. This is due to the way it selects the head page of the fullness list for source and destination pages, and how it reinserts those pages during each iteration. The algorithm may first use a page as a migration destination and then as a migration source, leading to an unnecessary back-and-forth movement of objects.

Consider the following fullness list:

    PageA PageB PageC PageD PageE

During the first iteration, the compaction algorithm will select PageA as the source and PageB as the destination. All of PageA's objects will be moved to PageB, and then PageA will be released while PageB is reinserted into the fullness list.

    PageB PageC PageD PageE

During the next iteration, the compaction algorithm will again select the head of the list as the source and destination, meaning that PageB will now serve as the source and PageC as the destination. This will result in the objects being moved away from PageB, the same objects that were just moved to PageB in the previous iteration.

To prevent this avalanche effect, the compaction algorithm should not reinsert the destination page between iterations. By doing so, the most optimal page will continue to be used and its usage ratio will increase, reducing internal fragmentation. The destination page should only be reinserted into the fullness list if:

- It becomes full
- No source page is available.

TEST
====

It's very challenging to reliably test this series. I ended up developing my own synthetic test that has 100% reproducibility. The test generates significant fragmentation (for each size class) and then performs compaction for each class individually and tracks the number of memcpy() in zs_object_copy(), so that we can compare the amount of work compaction does on a per-class basis.

Total amount of work (zram mm_stat objs_moved)
----------------------------------------------

Old fullness grouping, old compaction algorithm: 323977 memcpy() in zs_object_copy().
Old fullness grouping, new compaction algorithm: 262944 memcpy() in zs_object_copy().
New fullness grouping, new compaction algorithm: 213978 memcpy() in zs_object_copy().

Per-class compaction memcpy() comparison (T-test)
-------------------------------------------------

x Old fullness grouping, old compaction algorithm
+ Old fullness grouping, new compaction algorithm

      N   Min   Max  Median        Avg     Stddev
  x 140   349  3513    2461  2314.1214  806.03271
  + 140   289  2778    2006  1878.1714  641.02073
  Difference at 95.0% confidence
      -435.95 +/- 170.595
      -18.8387% +/- 7.37193%
      (Student's t, pooled s = 728.216)

x Old fullness grouping, old compaction algorithm
+ New fullness grouping, new compaction algorithm

      N   Min   Max  Median        Avg     Stddev
  x 140   349  3513    2461  2314.1214  806.03271
  + 140   226  2279    1644  1528.4143  524.85268
  Difference at 95.0% confidence
      -785.707 +/- 159.331
      -33.9527% +/- 6.88516%
      (Student's t, pooled s = 680.132)

Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
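The back-and-forth movement described in this commit can be illustrated with a toy model. The sketch below is not the kernel code: it represents a fullness list as a queue of "inuse" counts, assumes every zspage has the same capacity, and counts moved objects as a stand-in for memcpy() calls. The function names compact_old() and compact_new() are hypothetical.

```python
from collections import deque


def compact_old(inuse_counts, capacity):
    """Toy model of the old scheme: source and destination are both taken
    from the head of the list, and a non-full destination is reinserted
    at the head after every iteration."""
    pages = deque(inuse_counts)
    moved = 0
    while len(pages) >= 2:
        src = pages.popleft()
        dst = pages.popleft()
        n = min(src, capacity - dst)  # objects that fit into dst
        moved += n
        src -= n
        dst += n
        if dst < capacity:
            pages.appendleft(dst)  # dst goes back and becomes the next src
        if src > 0:
            pages.appendleft(src)  # src not drained (dst is full in this case)
    return moved


def compact_new(inuse_counts, capacity):
    """Toy model of the new scheme: the destination is held across
    iterations and given up only when it becomes full or when no source
    page is left."""
    pages = deque(inuse_counts)
    moved = 0
    dst = None
    while pages:
        src = pages.popleft()
        if dst is None:
            if not pages:
                break  # a single page left, nothing to compact into
            dst = pages.popleft()
        n = min(src, capacity - dst)
        moved += n
        src -= n
        dst += n
        if src > 0:
            pages.appendleft(src)
        if dst == capacity:
            dst = None  # full page leaves the list, pick a new dst next time
    return moved


if __name__ == "__main__":
    fullness_list = [2, 2, 2, 2, 2]  # PageA..PageE, capacity 10 objects each
    print(compact_old(fullness_list, 10), compact_new(fullness_list, 10))
```

On this uniform five-page list the old scheme moves 20 objects and the new one moves 8, because the old scheme keeps re-copying the objects it has just consolidated. The model only illustrates that effect; it says nothing about the measured numbers above.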
-
Sergey Senozhatsky authored
Each zspage maintains ->inuse counter which keeps track of the number of objects stored in the zspage. The ->inuse counter also determines the zspage's "fullness group" which is calculated as the ratio of the "inuse" objects to the total number of objects the zspage can hold (objs_per_zspage). The closer the ->inuse counter is to objs_per_zspage, the better.

Each size class maintains several fullness lists, that keep track of zspages of particular "fullness". Pages within each fullness list are stored in random order with regard to the ->inuse counter. This is because sorting the zspages by ->inuse counter each time obj_malloc() or obj_free() is called would be too expensive. However, the ->inuse counter is still a crucial factor in many situations.

For the two major zsmalloc operations, zs_malloc() and zs_compact(), we typically select the head zspage from the corresponding fullness list as the best candidate zspage. However, this assumption is not always accurate.

For the zs_malloc() operation, the optimal candidate zspage should have the highest ->inuse counter. This is because the goal is to maximize the number of ZS_FULL zspages and make full use of all allocated memory.

For the zs_compact() operation, the optimal source zspage should have the lowest ->inuse counter. This is because compaction needs to move objects in use to another page before it can release the zspage and return its physical pages to the buddy allocator. The fewer objects in use, the quicker compaction can release the zspage. Additionally, compaction is measured by the number of pages it releases.

This patch reworks the fullness grouping mechanism. Instead of having two groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage ratio above 3/4) - that result in too many zspages being included in the ALMOST_EMPTY group for specific classes, size classes maintain a larger number of fullness lists that give strict guarantees on the minimum and maximum ->inuse values within each group. Each group represents a 10% change in the ->inuse ratio compared to neighboring groups. In essence, there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up to 100%.

This enhances the selection of candidate zspages for both zs_malloc() and zs_compact(). A printout of the ->inuse counters of the first 7 zspages per (random) class fullness group:

    class-768 objs_per_zspage 16:
      fullness 100%: empty
      fullness  99%: empty
      fullness  90%: empty
      fullness  80%: empty
      fullness  70%: empty
      fullness  60%: 8 8 9 9 8 8 8
      fullness  50%: empty
      fullness  40%: 5 5 6 5 5 5 5
      fullness  30%: 4 4 4 4 4 4 4
      fullness  20%: 2 3 2 3 3 2 2
      fullness  10%: 1 1 1 1 1 1 1
      fullness   0%: empty

The zs_malloc() function searches through the groups of pages starting with the one having the highest usage ratio. This means that it always selects a zspage from the group with the least internal fragmentation (highest usage ratio) and makes it even less fragmented by increasing its usage ratio.

The zs_compact() function, on the other hand, begins by scanning the group with the highest fragmentation (lowest usage ratio) to locate the source page. The first available zspage is selected, and then the function moves downward to find a destination zspage in the group with the lowest internal fragmentation (highest usage ratio).

Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
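A minimal sketch of how such a grouping could be computed, written in Python rather than kernel C. The bucketing rule (integer percentage, divided by 10, rounded up to the next group) is an assumption chosen because it reproduces the class-768 printout in this commit message; the names fullness_group(), group_pages(), malloc_candidate() and compaction_source() are hypothetical and are not the patch's identifiers.

```python
# Group labels as they appear in the printout: 0%, 10%, ..., 90%, 99%, 100%.
GROUP_LABELS = (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99, 100)


def fullness_group(inuse, objs_per_zspage):
    """Map an ->inuse counter to a fullness group label (assumed rule)."""
    if inuse == 0:
        return 0
    if inuse == objs_per_zspage:
        return 100
    ratio = 100 * inuse // objs_per_zspage  # integer percentage, 0..99
    # "+ 1" rounds up, so a page with a single object never lands in 0%.
    return GROUP_LABELS[ratio // 10 + 1]


def group_pages(inuse_counts, objs_per_zspage):
    """Build per-group lists, keeping insertion order within each group."""
    groups = {label: [] for label in GROUP_LABELS}
    for inuse in inuse_counts:
        groups[fullness_group(inuse, objs_per_zspage)].append(inuse)
    return groups


def malloc_candidate(groups):
    """Allocation target: head of the busiest non-full group."""
    for label in reversed(GROUP_LABELS[:-1]):
        if groups[label]:
            return groups[label][0]
    return None


def compaction_source(groups):
    """Compaction source: head of the least busy non-empty, non-full group."""
    for label in GROUP_LABELS[1:-1]:
        if groups[label]:
            return groups[label][0]
    return None


if __name__ == "__main__":
    groups = group_pages([8, 5, 4, 2, 1, 9, 6, 3], 16)
    print(malloc_candidate(groups), compaction_source(groups))
```

With 16 objects per zspage this places counters 8 and 9 in the 60% group, 5 and 6 in 40%, 4 in 30%, 2 and 3 in 20%, and 1 in 10%, which matches the printout.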
-
Sergey Senozhatsky authored
Patch series "zsmalloc: fine-grained fullness and new compaction algorithm", v4. Existing zsmalloc page fullness grouping leads to suboptimal page selection for both zs_malloc() and zs_compact(). This patchset reworks zsmalloc fullness grouping/classification. Additionally, it implements a new compaction algorithm that is expected to use fewer CPU cycles (as it potentially does fewer memcpy-s in zs_object_copy()). Test (synthetic) results can be seen in patch 0003. This patch (of 4): This optimization has no effect. It only ensures that when a zspage was added to its corresponding fullness list, its "inuse" counter was higher or lower than the "inuse" counter of the zspage at the head of the list. The intention was to keep busy zspages at the head, so they could be filled up and moved to the ZS_FULL fullness group more quickly. However, this doesn't work as the "inuse" counter of a zspage can be modified by obj_free() but the zspage may still belong to the same fullness list. So, fix_fullness_group() won't change the zspage's position in relation to the head's "inuse" counter, leading to a largely random order of zspages within the fullness list. For instance, consider a printout of the "inuse" counters of the first 10 zspages in a class that holds 93 objects per zspage: ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52 As we can see, the zspage at the head of the fullness list has one of the lowest "inuse" counters. Remove this pointless "optimisation". Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
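Why a head-only comparison cannot keep a list ordered can be seen in a few lines of Python. This is a toy sketch, not the removed kernel code: the exact insertion positions (right behind the head, or as the new head) are an assumption, and insert_zspage() and free_objects() are hypothetical names.

```python
def insert_zspage(fullness_list, inuse):
    """Toy insertion rule: the new page is compared with the head only."""
    if fullness_list and inuse < fullness_list[0]:
        fullness_list.insert(1, inuse)  # less busy than the head: go behind it
    else:
        fullness_list.insert(0, inuse)  # at least as busy: become the new head


def free_objects(fullness_list, index, count):
    """obj_free() analogue: the counter drops, the list position does not change."""
    fullness_list[index] -= count


if __name__ == "__main__":
    pages = []
    for inuse in (60, 50, 40, 65):
        insert_zspage(pages, inuse)
    print(pages)            # order after insertions alone
    free_objects(pages, 0, 30)
    print(pages)            # the head is now the least busy page
```

After the four insertions the list is [65, 60, 40, 50], which is already unsorted behind the head. Freeing 30 objects from the head leaves [35, 60, 40, 50], so the page meant to be the busiest is now the emptiest while still sitting at the head.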
-
Jaewon Kim authored
Using order 4 pages would be helpful for IOMMU mapping, but trying to get order 4 pages can take quite a lot of time in the page allocator. From the perspective of responsiveness, the deterministic memory allocation speed, I think, is quite important.

The order 4 allocation with __GFP_RECLAIM may spend much time in reclaim and compaction logic. __GFP_NORETRY may also have an effect. These cause unpredictable delay.

To get reasonable allocation speed from dma-buf system heap, use HIGH_ORDER_GFP for order 4 to avoid reclaim. And let me remove meaningless __GFP_COMP for order 0.

According to my tests, order 4 with MID_ORDER_GFP could get more order 4 pages, but the elapsed times could be very long.

    time            order 8  order 4  order 0
        584 usec          0      160        0
     28,428 usec          0      160        0
    100,701 usec          0      160        0
     76,645 usec          0      160        0
     25,522 usec          0      160        0
     38,798 usec          0      160        0
     89,012 usec          0      160        0
     23,015 usec          0      160        0
     73,360 usec          0      160        0
     76,953 usec          0      160        0
     31,492 usec          0      160        0
     75,889 usec          0      160        0
     84,551 usec          0      160        0
     84,352 usec          0      160        0
     57,103 usec          0      160        0
     93,452 usec          0      160        0

If HIGH_ORDER_GFP is used for order 4, the number of order 4 pages could be decreased but the elapsed time results were quite stable and fast enough.

    time            order 8  order 4  order 0
      1,356 usec          0      155       80
      1,901 usec          0       11     2384
      1,912 usec          0        0     2560
      1,911 usec          0        0     2560
      1,884 usec          0        0     2560
      1,577 usec          0        0     2560
      1,366 usec          0        0     2560
      1,711 usec          0        0     2560
      1,635 usec          0       28     2112
        544 usec         10        0        0
        633 usec          2      128        0
        848 usec          0      160        0
        729 usec          0      160        0
      1,000 usec          0      160        0
      1,358 usec          0      160        0
      2,638 usec          0       31     2064

Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reviewed-by: John Stultz <jstultz@google.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
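The rows of both tables can be cross-checked with a little arithmetic: an order-n allocation covers 2^n base pages, so every mix of orders should add up to the same buffer size. The sketch below does that check for the distinct order mixes listed above. The 4 KiB base page size is an assumption (the commit message does not state it), and base_pages() is a hypothetical helper name.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages

# Distinct (order 8, order 4, order 0) page counts taken from the tables above.
ORDER_MIXES = [
    (0, 160, 0),
    (0, 155, 80),
    (0, 11, 2384),
    (0, 0, 2560),
    (0, 28, 2112),
    (10, 0, 0),
    (2, 128, 0),
    (0, 31, 2064),
]


def base_pages(order8, order4, order0):
    """Number of base (order 0) pages covered by a mix of allocation orders."""
    return (order8 << 8) + (order4 << 4) + order0


if __name__ == "__main__":
    for mix in ORDER_MIXES:
        pages = base_pages(*mix)
        print(mix, pages, pages * PAGE_SIZE // (1024 * 1024), "MiB")
```

Every row comes to 2560 base pages, which is 10 MiB under the 4 KiB assumption, so the two tables describe the same allocation size and differ only in how it was split across orders and how long that took.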
-
Alexander Potapenko authored
Add tests ensuring that memset16()/memset32()/memset64() are instrumented by KMSAN and correctly initialize the memory. Link: https://lkml.kernel.org/r/20230303141433.3422671-4-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
KMSAN must see as many memory accesses as possible to prevent false positive reports. Fall back to versions of memset16()/memset32()/memset64() implemented in lib/string.c instead of those written in assembly. Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
commit 5478afc5 ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR() to hide the uninitialized var from the compiler optimizations. However, OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit, so memcpy tests did not actually check the behavior of memcpy(), because they always contained a KMSAN report. Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers the memory with a barrier(), and add a test case for memcpy() that does not expect an error report. Also reflow kmsan_test.c with clang-format. Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
clang -fsanitize=kernel-memory already replaces calls to memset/memcpy/memmove and their __builtin_ versions with __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so there is no need to override them. In non-instrumented versions we are now required to leave memset() and friends intact, so we cannot replace them with __msan_XXX() functions. Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Explicit memcg uncharging is not needed when the memcg accounting has the same lifespan as the page/folio. That becomes the case for khugepaged after Yang & Zach's recent rework so the hpage will be allocated for each collapse rather than being cached. Clean up the explicit memcg uncharge in khugepaged failure path and leave that for put_page(). Link: https://lkml.kernel.org/r/20230303151218.311015-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Stevens <stevensd@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Anshuman Khandual authored
Since the following commit, arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). Change hugetlb_basic_tests() to call arch_make_huge_pte() directly, and update its relevant documentation entry as required. 'commit 16785bd7 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302114845.421674-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Anshuman Khandual authored
Since the following commit, arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). This just drops pte_mkhuge() from remove_migration_pte(), which has now become redundant. 'commit 16785bd7 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-