- 04 Jul, 2022 40 commits
-
-
Matthew Wilcox (Oracle) authored
Do the per-cpu dereferencing of the fbatches once which saves 14 bytes of text and several percpu relocations. Link: https://lkml.kernel.org/r/20220617175020.717127-12-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
The function is too long, so pull this complicated conditional out into cpu_needs_drain(). This ends up shrinking the text by 14 bytes, by allowing GCC to cache the result of calling per_cpu() instead of relocating each lookup individually. Link: https://lkml.kernel.org/r/20220617175020.717127-11-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
No change to generated code, but this struct no longer contains any pagevecs, and not all the folio batches it contains are lru. Link: https://lkml.kernel.org/r/20220617175020.717127-10-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Rename it to just 'activate', saving 696 bytes of text from removals of compound_page() and the pagevec_lru_move_fn() infrastructure. Inline need_activate_page_drain() into its only caller. Link: https://lkml.kernel.org/r/20220617175020.717127-9-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Using folios instead of pages removes several calls to compound_head(), shrinking the kernel by 1089 bytes of text. Link: https://lkml.kernel.org/r/20220617175020.717127-8-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Using folios instead of pages shrinks deactivate_page() and lru_deactivate_fn() by 778 bytes between them. Link: https://lkml.kernel.org/r/20220617175020.717127-7-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Use a folio throughout lru_deactivate_file_fn(), removing many hidden calls to compound_head(). Shrinks the kernel by 864 bytes of text. Link: https://lkml.kernel.org/r/20220617175020.717127-6-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
When adding folios to the LRU for the first time, the LRU flag will already be clear, so skip the test-and-clear part of moving from one LRU to another. Removes 285 bytes from kernel text, mostly due to removing __pagevec_lru_add(). Link: https://lkml.kernel.org/r/20220617175020.717127-5-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
__pagevec_lru_add has no callers outside swap.c, so make it static, and move it to a more logical position in the file. Link: https://lkml.kernel.org/r/20220617175020.717127-4-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Start converting the LRU from pagevecs to folio_batches. Combine the functionality of pagevec_add_and_need_flush() with pagevec_lru_move_fn() in the new folio_batch_add_and_move(). Convert the lru_rotate pagevec to a folio_batch. Adds 223 bytes total to kernel text, because we're duplicating infrastructure. This will be more than made up for in future patches. Link: https://lkml.kernel.org/r/20220617175020.717127-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Patch series "Convert the swap code to be more folio-based". There's still more to do with the swap code, but this reaps a lot of the folio benefit. More than 4kB of kernel text saved (with the UEK7 kernel config). I don't know how much that's going to translate into CPU savings, but some of those compound_head() calls are on every page free, so it should be noticable. It might even be noticable just from an I-cache consumption perspective. This patch (of 22): This is just a wrapper around release_pages() for now. Place the prototype in mm.h along with folio_put() and folio_put_refs(). Link: https://lkml.kernel.org/r/20220617175020.717127-1-willy@infradead.org Link: https://lkml.kernel.org/r/20220617175020.717127-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Remove a few hidden calls to compound_head, saving 76 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-6-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Remove a few hidden calls to compound_head, saving 411 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-5-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Remove a few hidden calls to compound_head, saving 387 bytes of text on my test configuration. Link: https://lkml.kernel.org/r/20220617154248.700416-4-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Remove a few hidden calls to compound_head, saving 279 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Patch series "nvert much of vmscan to folios" vmscan always operates on folios since it puts the pages on the LRU list. Switching all of these functions from pages to folios saves 1483 bytes of text from removing all the baggage around calling compound_page() and similar functions. This patch (of 5): This is a straightforward conversion which removes several hidden calls to compound_head, saving 330 bytes of kernel text. Link: https://lkml.kernel.org/r/20220617154248.700416-1-willy@infradead.org Link: https://lkml.kernel.org/r/20220617154248.700416-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we can try mapping anonymous pages in a private writable mapping writable if they are exclusive, the PTE is already dirty, and no special handling applies. Mapping the anonymous page writable is essentially the same thing the write fault handler would do in this case. Special handling is required for uffd-wp and softdirty tracking, so take care of that properly. Also, leave PROT_NONE handling alone for now; in the future, we could similarly extend the logic in do_numa_page() or use pte_mk_savedwrite() here. While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE) performance, it should also be a valuable optimization for uffd-wp, when un-protecting. This has been previously suggested by Peter Collingbourne in [1], relevant in the context of the Scudo memory allocator, before we had PageAnonExclusive. This commit doesn't add the same handling for PMDs (i.e., anonymous THP, anonymous hugetlb); benchmark results from Andrea indicate that there are minor performance gains, so it's might still be valuable to streamline that logic for all anonymous pages in the future. As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename it to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually happening. Micro-benchmark courtesy of Andrea: === #define _GNU_SOURCE #include <sys/mman.h> #include <stdlib.h> #include <string.h> #include <stdio.h> #include <unistd.h> #define SIZE (1024*1024*1024) int main(int argc, char *argv[]) { char *p; if (posix_memalign((void **)&p, sysconf(_SC_PAGESIZE)*512, SIZE)) perror("posix_memalign"), exit(1); if (madvise(p, SIZE, argc > 1 ? MADV_HUGEPAGE : MADV_NOHUGEPAGE)) perror("madvise"); explicit_bzero(p, SIZE); for (int loops = 0; loops < 40; loops++) { if (mprotect(p, SIZE, PROT_READ)) perror("mprotect"), exit(1); if (mprotect(p, SIZE, PROT_READ|PROT_WRITE)) perror("mprotect"), exit(1); explicit_bzero(p, SIZE); } } === Results on my Ryzen 9 3900X: Stock 10 runs (lower is better): AVG 6.398s, STDEV 0.043 Patched 10 runs (lower is better): AVG 3.780s, STDEV 0.026 === [1] https://lkml.kernel.org/r/20210429214801.2583336-1-pcc@google.com Link: https://lkml.kernel.org/r/20220614093629.76309-1-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Collingbourne <pcc@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Edward Liaw authored
On Android this test is getting stuck in an infinite loop due to indeterminate behavior: The local variables steps and signalled were being reset to 1 and 0 respectively after every jump back to sigsetjmp by siglongjmp in the signal handler. The test was incrementing them and expecting them to retain their incremented values. The documentation for siglongjmp says: All accessible objects have values as of the time sigsetjmp() was called, except that the values of objects of automatic storage duration which are local to the function containing the invocation of the corresponding sigsetjmp() which do not have volatile-qualified type and which are changed between the sigsetjmp() invocation and siglongjmp() call are indeterminate. Tagging steps and signalled with volatile enabled the test to pass. Link: https://lkml.kernel.org/r/20220613233321.431282-1-edliaw@google.comSigned-off-by: Edward Liaw <edliaw@google.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the usage of DAMON_LRU_SORT for admins. Link: https://lkml.kernel.org/r/20220613192301.8817-10-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. However, finding best parameters including the hotness/coldness thresholds, CPU quota, and watermarks could be challenging for some users. To make the scheme easy to be used without complex tuning for common situations, this commit implements a static kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. It proactively sorts LRU-lists using DAMON with conservatively chosen default values of the parameters. That is, the module under its default parameters will make no harm for common situations but provide some level of efficiency improvements for systems having clear hot/cold access pattern under a level of memory pressure while consuming only a limited small portion of CPU time. Link: https://lkml.kernel.org/r/20220613192301.8817-9-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the 'LRU_DEPRIO' scheme action for DAMON sysfs interface.` Link: https://lkml.kernel.org/r/20220613192301.8817-8-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit adds a new DAMON-based operation scheme action called 'LRU_DEPRIO' for physical address space. The action deprioritizes pages in the memory area of the target access pattern on their LRU lists. This is hence supposed to be used for rarely accessed (cold) memory regions so that cold pages could be more likely reclaimed first under memory pressure. Internally, it simply calls 'lru_deactivate()'. Using this with 'LRU_PRIO' action for hot pages, users can proactively sort LRU lists based on the access pattern. That is, it can make the LRU lists somewhat more trustworthy source of access temperature. As a result, efficiency of LRU-lists based mechanisms including the reclamation target selection could be improved. Link: https://lkml.kernel.org/r/20220613192301.8817-7-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit documents the 'lru_prio' scheme action for DAMON sysfs interface. Link: https://lkml.kernel.org/r/20220613192301.8817-6-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit adds a new DAMOS action called 'LRU_PRIO' for the physical address space. The action prioritizes pages in the memory regions of the user-specified target access pattern on their LRU lists. This is hence supposed to be used for frequently accessed (hot) memory regions so that hot pages could be more likely protected under memory pressure. Internally, it simply calls 'mark_page_accessed()'. Link: https://lkml.kernel.org/r/20220613192301.8817-5-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
This commit moves code for 'DAMOS_PAGEOUT' handling of the physical address space monitoring operations set to a separate function so that its caller, 'damon_pa_apply_scheme()', can be more easily extended for additional DAMOS actions later. Link: https://lkml.kernel.org/r/20220613192301.8817-4-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "Extend DAMOS for Proactive LRU-lists Sorting". Introduction ============ In short, this patchset 1) extends DAMON-based Operation Schemes (DAMOS) for low overhead data access pattern based LRU-lists sorting, and 2) implements a static kernel module for easy use of conservatively-tuned version of that using the extended DAMOS capability. Background ---------- As page-granularity access checking overhead could be significant on huge systems, LRU lists are normally not proactively sorted but partially and reactively sorted for special events including specific user requests, system calls and memory pressure. As a result, LRU lists are sometimes not so perfectly prepared to be used as a trustworthy access pattern source for some situations including reclamation target pages selection under sudden memory pressure. DAMON-based Proactive LRU-lists Sorting --------------------------------------- Because DAMON can identify access patterns of best-effort accuracy while inducing only user-specified range of overhead, using DAMON for Proactive LRU-lists Sorting (PLRUS) could be helpful for this situation. The idea is quite simple. Find hot pages and cold pages using DAMON, and prioritize hot pages while deprioritizing cold pages on their LRU-lists. This patchset extends DAMON to support such schemes by introducing a couple of new DAMOS actions for prioritizing and deprioritizing memory regions of specific access patterns on their LRU-lists. In detail, this patchset simply uses 'mark_page_accessed()' and 'deactivate_page()' functions for prioritization and deprioritization of pages on their LRU lists, respectively. To make the scheme easy to use without complex tuning for common situations, this patchset further implements a static kernel module called 'DAMON_LRU_SORT' using the extended DAMOS functionality. It proactively sorts LRU-lists using DAMON with conservatively chosen default hotness/coldness thresholds and small CPU usage quota limit. That is, the module under its default parameters will make no harm for common situation but provide some level of benefit for systems having clear hot/cold access pattern under only memory pressure while consuming only limited small portion of CPU time. Related Works ------------- Proactive reclamation is well known to be helpful for reducing non-optimal reclamation target selection caused performance drops. However, proactive reclamation is not a best option for some cases, because it could incur additional I/O. For an example, it could be prohitive for systems using storage devices that total number of writes is limited, or cloud block storages that charges every I/O. Some proactive reclamation approaches[1,2] induce a level of memory pressure using memcg files or swappiness while monitoring PSI. As reclamation target selection is still relying on the original LRU-lists mechanism, using DAMON-based proactive reclamation before inducing the proactive reclamation could allow more memory saving with same level of performance overhead, or less performance overhead with same level of memory saving. [1] https://blogs.oracle.com/linux/post/anticipating-your-memory-needs [2] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf Evaluation ========== In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. Setup ----- To show the effect of PLRUS, I run PARSEC3 and SPLASH-2X benchmarks under below variant systems and measure a few metrics including the runtime of each workload, number of system-wide major page faults, and system-wide memory PSI (some). - orig: v5.18-rc4 based mm-unstable kernel + this patchset, but no DAMON scheme applied. - mprs: Same to 'orig' but artificial memory pressure is induced. - plrus: Same to 'mprs' but a radically tuned PLRUS scheme is applied to the entire physical address space of the system. For the artificial memory pressure, I set 'memory.limit_in_bytes' to 75% of the running workload's peak RSS, wait 1 second, remove the pressure by setting it to 200% of the peak RSS, wait 10 seconds, and repeat the procedure until the workload finishes[1]. I use zram based swap device. The tests are automated[2]. [1] https://github.com/awslabs/damon-tests/blob/next/perf/runners/back/0009_memcg_pressure.sh [2] https://github.com/awslabs/damon-tests/blob/next/perf/full_once_config.sh Radically Tuned PLRUS --------------------- To show effect of PLRUS on the PARSEC3/SPLASH-2X workloads which runs for no long time, we use radically tuned version of PLRUS. The version asks DAMON to do the proactive LRU-lists sorting as below. 1. Find any memory regions shown some accesses (approximately >=20 accesses per 100 sampling) and prioritize pages of the regions on their LRU lists using up to 2% CPU time. Under the CPU time limit, prioritize regions having higher access frequency and kept the access frequency longer first. 2. Find any memory regions shown no access for at least >=5 seconds and deprioritize pages of the rgions on their LRU lists using up to 2% CPU time. Under the CPU time limit, deprioritize regions that not accessed for longer time first. Results ------- I repeat the tests 25 times and calculate average of the measured numbers. The results are as below: metric orig mprs plrus plrus/mprs runtime_seconds 190.06 292.83 281.87 0.96 pgmajfaults 852.55 8769420.00 7525040.00 0.86 memory_psi_some_us 106911.00 6943420.00 6220920.00 0.90 The first row is for legend. The first cell shows the metric that the following cells of the row shows. Second, third, and fourth cells show the metrics under the configs shown at the first row of the cell, and the fifth cell shows the metric under 'plrus' divided by the metric under 'mprs'. Second row shows the averaged runtime of the workloads in seconds. Third row shows the number of system-wide major page faults while the test was ongoing. Fourth row shows the system-wide memory pressure stall for some processes in microseconds while the test was ongoing. In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. We also confirmed the CPU usage of kdamond was 2.61% of single CPU, which is below 4% as expected. Sequence of Patches =================== The first and second patch cleans up DAMON debugfs interface and DAMOS_PAGEOUT handling code of physical address space monitoring operations implementation for easier extension of the code. The thrid and fourth patches implement a new DAMOS action called 'lru_prio', which prioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The fifth and sixth patches implement yet another new DAMOS action called 'lru_deprio', which deprioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The seventh patch implements a static kernel module called 'damon_lru_sort', which utilizes the DAMON-based proactive LRU-lists sorting under conservatively chosen default parameter. Finally, the eighth patch documents 'damon_lru_sort'. This patch (of 8): DAMON debugfs interface assumes users will write 'damos_action' value directly to the 'schemes' file. This makes adding new 'damos_action' in the middle of its definition breaks the backward compatibility of DAMON debugfs interface, as values of some 'damos_action' could be changed. To mitigate the situation, this commit adds mappings between the user inputs and 'damos_action' value and makes DAMON debugfs code uses those. Link: https://lkml.kernel.org/r/20220613192301.8817-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220613192301.8817-2-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
swap_cache_info are not statistics that could be easily used to tune system performance because they are not easily accessile. Also they can't provide really useful info when OOM occurs. Remove these statistics can also help mitigate unneeded global swap_cache_info cacheline contention. Link: https://lkml.kernel.org/r/20220608144031.829-4-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
si->inuse_pages could still be accessed concurrently now. The plain reads outside si->lock critical section, i.e. swap_show and si_swapinfo, which results in data races. READ_ONCE and WRITE_ONCE is used to fix such data races. Note these data races should be ok because they're just used for showing swap info. [linmiaohe@huawei.com: use WRITE_ONCE to pair with READ_ONCE] Link: https://lkml.kernel.org/r/20220625093346.48894-2-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220608144031.829-3-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
A get_random_bytes() function can cause a high contention if it is called across CPUs simultaneously. Because it shares one lock per all CPUs: <snip> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg &crng->lock: 663145 665886 0.05 8.85 261966.66 0.39 7188152 13731279 0.04 11.89 2181582.30 0.16 ----------- &crng->lock 307835 [<00000000acba59cd>] _extract_crng+0x48/0x90 &crng->lock 358051 [<00000000f0075abc>] _crng_backtrack_protect+0x32/0x90 ----------- &crng->lock 234241 [<00000000f0075abc>] _crng_backtrack_protect+0x32/0x90 &crng->lock 431645 [<00000000acba59cd>] _extract_crng+0x48/0x90 <snip> Switch from the get_random_bytes() to prandom_u32() that does not have any internal contention when a random value is needed for the tests. The reason is to minimize CPU cycles introduced by the test-suite itself from the vmalloc performance metrics. Link: https://lkml.kernel.org/r/20220607093449.3100-6-urezki@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
__find_vmap_area() finds a "vmap_area" based on passed address. It scan the specific "vmap_area_root" rb-tree. Extend the function with one extra argument, so any tree can be specified where the search has to be done. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-5-urezki@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
A vmap_area can travel between different places. For example attached/detached to/from different rb-trees. In order to prevent fancy bugs, initialize a VA's list node after it is removed from the list, so it pairs with VA's rb_node which is also initialized. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-4-urezki@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
It implies that __alloc_vmap_area() allocates only from the global vmap space, therefore a list-head and rb-tree, which represent a free vmap space, are not passed as parameters to this function and are accessed directly from this function. Extend the __alloc_vmap_area() and other dependent functions to have a possibility to allocate from different trees making an interface common and not specific. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-3-urezki@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
Patch series "Reduce a vmalloc internal lock contention preparation work". This small serias is preparation work to implement per-cpu vmalloc allocation in order to reduce a high internal lock contention. This series does not introduce any functional changes, it is only about preparation. This patch (of 5): Currently link_va() and unlik_va(), in order to figure out a tree type, compares a passed root value with a global free_vmap_area_root variable to distinguish the augmented rb-tree from a regular one. It is hard coded since such functions can manipulate only with specific "free_vmap_area_root" tree that represents a global free vmap space. Make it common by introducing "_augment" versions of both internal functions, so it is possible to deal with different trees. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20220607093449.3100-2-urezki@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Add a scan interface which allows to trigger scanning of a particular shrinker and specify memcg and numa node. It's useful for testing, debugging and profiling of a specific scan_objects() callback. Unlike alternatives (creating a real memory pressure and dropping caches via /proc/sys/vm/drop_caches) this interface allows to interact with only one shrinker at once. Also, if a shrinker is misreporting the number of objects (as some do), it doesn't affect scanning. [roman.gushchin@linux.dev: improve typing, fix arg count checking] Link: https://lkml.kernel.org/r/YpgKttTowT22mKPQ@carbon [akpm@linux-foundation.org: fix arg count checking] Link: https://lkml.kernel.org/r/20220601032227.4076670-7-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Add a simple tool which prints a sorted list of shrinker lists in the following format: (number of objects, shrinker name, cgroup). Example: $ ./memcg_shrinker.py -n 10 2090 sb-sysfs-26 /sys/fs/cgroup/system.slice 1809 sb-sysfs-26 /sys/fs/cgroup/system.slice/systemd-udevd.service 1044 sb-btrfs:vda2-24 /sys/fs/cgroup/system.slice/system-dbus\x2d:1.3\... 861 sb-btrfs:vda2-24 /sys/fs/cgroup/system.slice/system-dbus\x2d:1.3\... 804 sb-btrfs:vda2-24 /sys/fs/cgroup/system.slice 643 sb-btrfs:vda2-24 /sys/fs/cgroup/system.slice/firewalld.service 616 sb-cgroup2-30 /sys/fs/cgroup/init.scope 275 sb-sysfs-26 / 238 sb-proc-25 /sys/fs/cgroup/system.slice/systemd-journald.service 225 sb-proc-25 /sys/fs/cgroup/system.slice/abrtd.service Link: https://lkml.kernel.org/r/20220601032227.4076670-6-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Add a document describing the shrinker debugfs interface. Link: https://lkml.kernel.org/r/20220601032227.4076670-5-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castleReported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
This commit introduces the /sys/kernel/debug/shrinker debugfs interface which provides an ability to observe the state of individual kernel memory shrinkers. Because the feature adds some memory overhead (which shouldn't be large unless there is a huge amount of registered shrinkers), it's guarded by a config option (enabled by default). This commit introduces the "count" interface for each shrinker registered in the system. The output is in the following format: <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... ... To reduce the size of output on machines with many thousands cgroups, if the total number of objects on all nodes is 0, the line is omitted. If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed as cgroup inode id. If the shrinker is not numa-aware, 0's are printed for all nodes except the first one. This commit gives debugfs entries simple numeric names, which are not very convenient. The following commit in the series will provide shrinkers with more meaningful names. [akpm@linux-foundation.org: remove WARN_ON_ONCE(), per Roman] Reported-by: syzbot+300d27c79fe6d4cbcc39@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/20220601032227.4076670-3-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Patch series "mm: introduce shrinker debugfs interface", v5. The only existing debugging mechanism is a couple of tracepoints in do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't covering everything though: shrinkers which report 0 objects will never show up, there is no support for memcg-aware shrinkers. Shrinkers are identified by their scan function, which is not always enough (e.g. hard to guess which super block's shrinker it is having only "super_cache_scan"). To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab. For each shrinker registered in the system a directory is created. As now, the directory will contain only a "scan" file, which allows to get the number of managed objects for each memory cgroup (for memcg-aware shrinkers) and each numa node (for numa-aware shrinkers on a numa machine). Other interfaces might be added in the future. To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have meaningful names. This patch (of 5): Shrinker debugfs requires a way to represent memory cgroups without using full paths, both for displaying information and getting input from a user. Cgroup inode number is a perfect way, already used by bpf. This commit adds a couple of helper functions which will be used to handle memcg-aware shrinkers. Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Tianyu Li authored
When user specified more nodes than supported, get_nodes will access nmask array out of bounds. Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com Fixes: e130242d ("mm: simplify compat numa syscalls") Signed-off-by: Tianyu Li <tianyu.li@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-