- 03 Sep, 2021 40 commits
-
Yang Shi authored
In the current implementation of soft offline, if a non-LRU page is met, all the slab caches will be dropped to free the page and then offline it. But if the page is not a slab page, all the effort is wasted in vain. Even if it is a slab page, it is not guaranteed the page can be freed at all. However, the side effect and cost are quite high. It not only drops the slab caches, but may also drop a significant amount of page caches which are associated with inode caches. It could make most of the workingset gone just to offline one page. And the offline is not guaranteed to succeed at all; I really doubt the success rate for real life workloads. Furthermore, the worse consequence is that the system may be locked up and unusable, since the page cache release may queue a huge amount of work for memcg release. We actually ran into such an unpleasant case in our production environment. Firstly, the workqueue of memory_failure_work_func is locked up as below:

    BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
        in-flight: 409271:memory_failure_work_func
        pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
    workqueue mm_percpu_wq: flags=0x8
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
        pending: vmstat_update
    workqueue cgroup_destroy: flags=0x0
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
        pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few lockups due to the contention of worker pool lock with IRQ disabled, for example:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
    Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
    CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
    Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
    Workqueue: events memory_failure_work_func
    RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
    Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
    RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
    RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
    RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
    R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
    R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
    FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
    Call Trace:
     __queue_work+0xd6/0x3c0
     queue_work_on+0x1c/0x30
     uncharge_batch+0x10e/0x110
     mem_cgroup_uncharge_list+0x6d/0x80
     release_pages+0x37f/0x3f0
     __pagevec_release+0x1c/0x50
     __invalidate_mapping_pages+0x348/0x380
     inode_lru_isolate+0x10a/0x160
     __list_lru_walk_one+0x7b/0x170
     list_lru_walk_one+0x4a/0x60
     prune_icache_sb+0x37/0x50
     super_cache_scan+0x123/0x1a0
     do_shrink_slab+0x10c/0x2c0
     shrink_slab+0x1f1/0x290
     drop_slab_node+0x4d/0x70
     soft_offline_page+0x1ac/0x5b0
     memory_failure_work_func+0x6a/0x90
     process_one_work+0x19e/0x340
     worker_thread+0x30/0x360
     kthread+0x116/0x130

The lockup made the machine quite unusable. It also made most of the workingset gone: the reclaimable slab caches were reduced from 12G to 300MB, and the page caches were decreased from 17G to 4G. But the most disappointing thing is that all the effort doesn't offline the page, it just returns:

    soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU pages didn't pay off, so it doesn't make much sense to keep it considering the terrible side effect. Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: David Mackey <tdmackey@twitter.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
Since commit cb731d6c ("vmscan: per memory cgroup slab shrinkers"), shrink_node_slabs was renamed to drop_slab_node. And the doit argument was changed to forcekill since commit 6751ed65 ("x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults"). Update the obsolete comments accordingly. Link: https://lkml.kernel.org/r/20210814105131.48814-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
It's unnecessary to pass in a struct page **hpagep because it's never modified. Change it to use *hpage to simplify the code. Link: https://lkml.kernel.org/r/20210814105131.48814-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
If the first pte is equal to poisoned_pfn, i.e. check_hwpoisoned_entry() returns 1, the wrong ptep - 1 would be passed to pte_unmap_unlock(). Link: https://lkml.kernel.org/r/20210814105131.48814-3-linmiaohe@huawei.com Fixes: ad9c59c24095 ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
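A minimal sketch of the fixed pattern, assuming the pte walker in mm/memory-failure.c (variable names here are illustrative): keep the pte pointer that was originally mapped and hand that one to pte_unmap_unlock(), since ptep - 1 points before the mapping when the loop breaks on the very first pte.

    pte_t *ptep, *mapped_pte;

    mapped_pte = ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
    for (; addr != end; ptep++, addr += PAGE_SIZE) {
        /* returns 1 once the pte maps the poisoned pfn */
        if (check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
                                   poisoned_pfn, &tk) == 1)
            break;
    }
    /* unlock with the pte that was mapped, not ptep - 1 */
    pte_unmap_unlock(mapped_pte, ptl);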
-
Miaohe Lin authored
Patch series "Cleanups and fixup for hwpoison" This series contains cleanups to remove unneeded variable, fix some obsolete comments and so on. Also we fix potential pte_unmap_unlock on wrong pte. More details can be found in the respective changelogs. This patch (of 4): unmap_success is used to indicate whether page is successfully unmapped but it's irrelated with ZONE_DEVICE page and unmap_success is always true here. Remove this unneeded one. Link: https://lkml.kernel.org/r/20210814105131.48814-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210814105131.48814-2-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
George G. Davis authored
Some test_pages_isolated failure conditions don't include trace points. For debugging issues caused by "pinned" pages, make sure to trace all calls, whether they succeed or fail. One failure case did not result in a trace point, so add the missing failure case to the test_pages_isolated traces. Link: https://lkml.kernel.org/r/20210823202823.13765-1-george_davis@mentor.com Signed-off-by: George G. Davis <davis.george@siemens.com> Cc: Eugeniu Rosca <erosca@de.adit-jv.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vasily Averin authored
The obsolete in_interrupt() check also covers task context with BH disabled; it's better to use in_task() instead. Link: https://lkml.kernel.org/r/877caa99-1994-5545-92d2-d0bb2e394182@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
alloc_node_mem_map() is only ever called from free_area_init_node(), which is an __init function. Make the actual alloc_node_mem_map() also __init and its stub version static inline. Link: https://lkml.kernel.org/r/20210716064124.31865-1-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
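Sketched, the resulting arrangement looks roughly like the following (FLATMEM-only body elided; config guard and placement are assumptions based on the description above):

    #ifdef CONFIG_FLATMEM
    static void __init alloc_node_mem_map(struct pglist_data *pgdat)
    {
        /* allocate pgdat->node_mem_map from memblock for the FLATMEM case */
    }
    #else
    static inline void alloc_node_mem_map(struct pglist_data *pgdat) { }
    #endif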
-
Nico Pache authored
When compiling with -Werror, cc1 will warn that 'zone_id' may be used uninitialized in this function. Initialize zone_id to 0. It's safe to assume that if the code reaches this point it has at least one numa node with memory, so there is no need for an assertion before init_unavailable_range(). Link: https://lkml.kernel.org/r/20210716210336.1114114-1-npache@redhat.com Fixes: 122e093c ("mm/page_alloc: fix memory map initialization for descending nodes") Signed-off-by: Nico Pache <npache@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
Functions memblock_alloc_exact_nid_raw() and memblock_alloc_try_nid_raw() are intended for early memory allocation without the overhead of zeroing the allocated memory. Since these functions were used to allocate the memory map, they have ended up with the addition of a call to page_init_poison() that poisoned the allocated memory when CONFIG_PAGE_POISON was set. Since the memory map is now allocated using a dedicated memmap_alloc() function that takes care of the poisoning, remove page poisoning from the memblock_alloc_*_raw() functions. Link: https://lkml.kernel.org/r/20210714123739.16493-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
There are several places that allocate memory for the memory map: alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and __populate_section_memmap() for SPARSEMEM. The memory allocated in the FLATMEM case is zeroed and it is never poisoned, regardless of the CONFIG_PAGE_POISON setting. The memory allocated in the SPARSEMEM cases is not zeroed and it is implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set. Introduce a memmap_alloc() wrapper for the memblock allocators that will be used for both the FLATMEM and SPARSEMEM cases and will make memory map zeroing and poisoning consistent for different memory models. Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
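A sketch of what such a wrapper can look like, built from the memblock *_raw() allocators and page_init_poison() mentioned above (argument names and the exact poisoning condition are assumptions):

    void * __init memmap_alloc(phys_addr_t size, phys_addr_t align,
                               phys_addr_t min_addr, int nid, bool exact_nid)
    {
        void *ptr;

        if (exact_nid)
            ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
                                               MEMBLOCK_ALLOC_ACCESSIBLE, nid);
        else
            ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
                                             MEMBLOCK_ALLOC_ACCESSIBLE, nid);

        /* poisoning happens here for every memory model, not inside memblock */
        if (ptr && size > 0)
            page_init_poison(ptr, size);

        return ptr;
    }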
-
Mike Rapoport authored
The microblaze implementation of pte_alloc_one_kernel() used memblock_alloc_try_nid_raw() along with clear_page() to allocate a zeroed page during early setup. Replace calls of these functions with a call to memblock_alloc_try_nid() that already returns a zeroed page and respects the same allocation limits as memblock_alloc_try_nid_raw(). While at it, drop the early_get_page() wrapper that was only used in pte_alloc_one_kernel(). Link: https://lkml.kernel.org/r/20210714123739.16493-3-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
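Roughly, the result has the shape below; the allocation limit arguments are illustrative only, the real code keeps whatever limits early_get_page() used:

    pte_t * __ref pte_alloc_one_kernel(struct mm_struct *mm)
    {
        if (mem_init_done)
            return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);

        /* early boot: memblock_alloc_try_nid() already returns zeroed memory */
        return memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
                                      MEMBLOCK_LOW_LIMIT,
                                      MEMBLOCK_ALLOC_ACCESSIBLE,
                                      NUMA_NO_NODE);
    }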
-
Mike Rapoport authored
Patch series "mm: ensure consistency of memory map poisoning". Currently memory map allocation for FLATMEM case does not poison the struct pages regardless of CONFIG_PAGE_POISON setting. This happens because allocation of the memory map for FLATMEM and SPARSMEM use different memblock functions and those that are used for SPARSMEM case (namely memblock_alloc_try_nid_raw() and memblock_alloc_exact_nid_raw()) implicitly poison the allocated memory. Another side effect of this implicit poisoning is that early setup code that uses the same functions to allocate memory burns cycles for the memory poisoning even if it was not intended. These patches introduce memmap_alloc() wrapper that ensure that the memory map allocation is consistent for different memory models. This patch (of 4): Currently memory map for the holes is initialized only when SPARSEMEM memory model is used. Yet, even with FLATMEM there could be holes in the physical memory layout that have memory map entries. For instance, the memory reserved using e820 API on i386 or "reserved-memory" nodes in device tree would not appear in memblock.memory and hence the struct pages for such holes will be skipped during memory map initialization. These struct pages will be zeroed because the memory map for FLATMEM systems is allocated with memblock_alloc_node() that clears the allocated memory. While zeroed struct pages do not cause immediate problems, the correct behaviour is to initialize every page using __init_single_page(). Besides, enabling page poison for FLATMEM case will trigger PF_POISONED_CHECK() unless the memory map is properly initialized. Make sure init_unavailable_range() is called for both SPARSEMEM and FLATMEM so that struct pages representing memory holes would appear as PG_Reserved with any memory layout. [rppt@kernel.org: fix microblaze] Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.orgSigned-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
kasan_rcu_uaf() writes to freed memory via kasan_rcu_reclaim(), which is only safe with the GENERIC mode (as it uses quarantine). For other modes, this test corrupts kernel memory, which might result in a crash. Turn the write into a read. Link: https://lkml.kernel.org/r/b6f2c3bf712d2457c783fa59498225b66a634f62.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
copy_user_test() does writes past the allocated object. As a result, it corrupts kernel memory, which might lead to crashes with the HW_TAGS mode, as it neither uses quarantine nor redzones. (Technically, this test can't yet be enabled with the HW_TAGS mode, but this will be implemented in the future.) Adjust the test to only write memory within the aligned kmalloc object. Link: https://lkml.kernel.org/r/19bf3a5112ee65b7db88dc731643b657b816c5e8.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
Some KASAN tests use global variables to store function return values so that the compiler doesn't optimize away these functions. ksize_uaf() doesn't call any functions, so it doesn't need to use kasan_int_result. Use volatile accesses instead, to be consistent with other similar tests. Link: https://lkml.kernel.org/r/a1fc34faca4650f4a6e4dfb3f8d8d82c82eb953a.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
kmalloc_uaf_memset() writes to freed memory, which is only safe with the GENERIC mode (as it uses quarantine). For other modes, this test corrupts kernel memory, which might result in a crash. Only enable kmalloc_uaf_memset() for the GENERIC mode. Link: https://lkml.kernel.org/r/2e1c87b607b1292556cde3cab2764f108542b60c.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
The HW_TAGS mode doesn't check memmove for negative size. As a result, the kmalloc_memmove_invalid_size test corrupts memory, which can result in a crash. Disable this test with HW_TAGS KASAN. Link: https://lkml.kernel.org/r/088733a06ac21eba29aa85b6f769d2abd74f9638.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
kmalloc_oob_memset_*() tests do writes past the allocated objects. As a result, they corrupt memory, which might lead to crashes with the HW_TAGS mode, as it neither uses quarantine nor redzones. Adjust the tests to only write memory within the aligned kmalloc objects. Also add a comment mentioning that memset tests are designed to touch both valid and invalid memory. Link: https://lkml.kernel.org/r/64fd457668a16e7b58d094f14a165f9d5170c5a9.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
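For illustration, one of the memset tests can be shaped so the write crosses into an out-of-bounds granule while staying inside the aligned kmalloc object (a sketch; the size constant is an assumption):

    static void kmalloc_oob_memset_2(struct kunit *test)
    {
        char *ptr;
        size_t size = 128 - KASAN_GRANULE_SIZE;

        ptr = kmalloc(size, GFP_KERNEL);
        KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);

        /*
         * Touch the last valid byte plus one out-of-bounds byte; both lie
         * within the aligned object, so HW_TAGS does not corrupt memory.
         */
        KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 1, 0, 2));
        kfree(ptr);
    }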
-
Andrey Konovalov authored
Multiple KASAN tests do writes past the allocated objects or writes to freed memory. Turn these writes into reads to avoid corrupting memory. Otherwise, these tests might lead to crashes with the HW_TAGS mode, as it neither uses quarantine nor redzones. Link: https://lkml.kernel.org/r/c3cd2a383e757e27dd9131635fc7d09a48a49cf9.1628779805.git.andreyknvl@gmail.com Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Konovalov authored
Patch series "kasan: test: avoid crashing the kernel with HW_TAGS", v2. KASAN tests do out-of-bounds and use-after-free accesses. Running the tests works fine for the GENERIC mode, as it uses qurantine and redzones. But the HW_TAGS mode uses neither, and running the tests might crash the kernel. Rework the tests to avoid corrupting kernel memory. This patch (of 8): Rework kmalloc_oob_right() to do these bad access checks: 1. An unaligned access one byte past the requested kmalloc size (can only be detected by KASAN_GENERIC). 2. An aligned access into the first out-of-bounds granule that falls within the aligned kmalloc object. 3. Out-of-bounds access past the aligned kmalloc object. Test #3 deliberately uses a read access to avoid corrupting memory. Otherwise, this test might lead to crashes with the HW_TAGS mode, as it neither uses quarantine nor redzones. Link: https://lkml.kernel.org/r/cover.1628779805.git.andreyknvl@gmail.com Link: https://lkml.kernel.org/r/474aa8b7b538c6737a4c6d0090350af2e1776bef.1628779805.git.andreyknvl@gmail.comSigned-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Woody Lin authored
Move the boot parameter 'kasan.fault' from hw_tags.c to report.c, so it can support all KASAN modes - generic, and both tag-based. Link: https://lkml.kernel.org/r/20210713010536.3161822-1-woodylin@google.com Signed-off-by: Woody Lin <woodylin@google.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Chen Wandun authored
commit f608788c ("mm/vmalloc: use rb_tree instead of list for vread() lookups") used an rb_tree instead of a list to speed up lookup, but __find_vmap_area() tries to find a vmap_area that includes the target address; if the target address is smaller than the leftmost node in vmap_area_root, it will return NULL, and then vread() will read nothing. This behavior is different from the original semantics. The correct way is to find the first vmap_area that is bigger than the target addr, which is what find_vmap_area_exceed_addr() does. Link: https://lkml.kernel.org/r/20210714015959.3204871-1-chenwandun@huawei.com Fixes: f608788c ("mm/vmalloc: use rb_tree instead of list for vread() lookups") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Cc: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
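Roughly, such a lookup walks the rb-tree while remembering the lowest area that still ends above the address (a sketch, not the verbatim implementation):

    static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
    {
        struct vmap_area *va = NULL;
        struct rb_node *n = vmap_area_root.rb_node;

        while (n) {
            struct vmap_area *tmp;

            tmp = rb_entry(n, struct vmap_area, rb_node);
            if (tmp->va_end > addr) {
                va = tmp;                  /* best candidate so far */
                if (tmp->va_start <= addr)
                    break;                 /* addr lies inside this area */
                n = n->rb_left;
            } else {
                n = n->rb_right;
            }
        }

        return va;
    }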
-
Uladzislau Rezki (Sony) authored
In order to simulate different fixed sizes for vmalloc allocation, introduce a new parameter that sets the number of pages to be allocated for the "fix_size_alloc_test" test. By default 1 page is used unless a different number is specified over the new parameter. Link: https://lkml.kernel.org/r/20210710194151.21370-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
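The knob amounts to a module parameter consumed by fix_size_alloc_test(); a sketch under the assumption that the parameter is named nr_pages:

    static int nr_pages;
    module_param(nr_pages, int, 0644);
    MODULE_PARM_DESC(nr_pages,
        "Set number of pages for fix_size_alloc_test (default: 1)");

    static int fix_size_alloc_test(void)
    {
        void *ptr;
        int i;

        for (i = 0; i < test_loop_count; i++) {
            ptr = vmalloc((nr_pages > 0 ? nr_pages : 1) * PAGE_SIZE);
            if (!ptr)
                return -1;

            *((__u8 *)ptr) = 0;
            vfree(ptr);
        }

        return 0;
    }

It would then be loaded as, for example, "modprobe test_vmalloc nr_pages=1024" together with the module's existing test-selection parameters.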
-
Uladzislau Rezki (Sony) authored
Get rid of the gfpflags_allow_blocking() check from the vmalloc() path as it is supposed to be sleepable anyway. Thus remove it from alloc_vmap_area() as well as from vm_area_alloc_pages(). Link: https://lkml.kernel.org/r/20210707182639.31282-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
In case of simultaneous vmalloc allocations, for example 1GB each on 12 CPUs, my system is able to hit "BUG: soft lockup" with a !CONFIG_PREEMPT kernel.

    RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
    Call Trace:
     __vmalloc_node_range+0x11c/0x2d0
     __vmalloc_node+0x4b/0x70
     fix_size_alloc_test+0x44/0x60 [test_vmalloc]
     test_func+0xe7/0x1f0 [test_vmalloc]
     kthread+0x11a/0x140
     ret_from_fork+0x22/0x30

To address this issue, invoke the bulk-allocator many times until all pages are obtained, i.e. do batched page requests, adding cond_resched() in between to reschedule. The batch value is hard-coded and is 100 pages per call. Link: https://lkml.kernel.org/r/20210707182639.31282-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
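Conceptually the fix turns one huge bulk request into bounded batches with a reschedule point in between; a simplified sketch of the loop in the vmalloc page-allocation path (variable names are illustrative):

    while (nr_allocated < nr_pages) {
        unsigned int nr, nr_pages_request;

        /* hard-coded batch: at most 100 pages per bulk call */
        nr_pages_request = min(100U, nr_pages - nr_allocated);

        nr = alloc_pages_bulk_array_node(gfp, nid, nr_pages_request,
                                         pages + nr_allocated);
        nr_allocated += nr;

        /* reschedule point so !CONFIG_PREEMPT kernels don't soft lockup */
        cond_resched();

        /* zero or partial batch: fall back to the single page allocator */
        if (nr != nr_pages_request)
            break;
    }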
-
Miles Chen authored
Clarify pgdat_to_phys() by testing if pgdat == &contig_page_data when CONFIG_NUMA=n. We only expect contig_page_data in such case, so we use &contig_page_data directly instead of pgdat. No functional change intended when CONFIG_DEBUG_VM=n. Comment from Mark [1]: " ... and I reckon it'd be clearer and more robust to define pgdat_to_phys() in the same ifdefs as contig_page_data so that these, stay in-sync. e.g. have:

    | #ifdef CONFIG_NUMA
    | #define pgdat_to_phys(x) virt_to_phys(x)
    | #else /* CONFIG_NUMA */
    |
    | extern struct pglist_data contig_page_data;
    | ...
    | #define pgdat_to_phys(x) __pa_symbol(&contig_page_data)
    |
    | #endif /* CONFIG_NUMA */

" [1] https://lore.kernel.org/linux-arm-kernel/20210615131902.GB47121@C02TD0UTHF1T.local/ Link: https://lkml.kernel.org/r/20210723123342.26406-1-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
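A sketch of the clarified helper as described above (the return type and the exact assertion macro used are assumptions):

    static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
    {
    #ifdef CONFIG_NUMA
        return __pa(pgdat);
    #else
        /* !NUMA: the only pgdat that may exist is contig_page_data */
        VM_BUG_ON(pgdat != &contig_page_data);
        return __pa_symbol(&contig_page_data);
    #endif
    }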
-
Matthew Wilcox authored
cppcheck warns that we're possibly losing information by shifting an int. It's a false positive, because we don't allow for a NUMA node ID that large, but if we ever change SECTION_NID_SHIFT, it could become a problem, and in any case this is usually a legitimate warning. Fix it by adding the necessary cast, which makes the compiler generate the right code. Link: https://lkml.kernel.org/r/YOya+aBZFFmC476e@casper.infradead.org Link: https://lkml.kernel.org/r/202107130348.6LsVT9Nc-lkp@intel.com Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
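The fix boils down to widening the node id before the shift so the result is computed in unsigned long rather than int; for instance, the early-nid encoding helper would look like this (a sketch):

    static inline unsigned long sparse_encode_early_nid(int nid)
    {
        /* cast first: nid << SECTION_NID_SHIFT must not be evaluated in int */
        return ((unsigned long)nid << SECTION_NID_SHIFT);
    }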
-
Naoya Horiguchi authored
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bits 3 and 4 can be overlapped by the sub-field for the early NID, and can be unexpectedly set on NUMA systems. There are a few non-critical issues related to this: - Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces pfn_to_online_page() through the slow path, but doesn't actually break the kernel. - A kdump generation tool like makedumpfile uses this field to calculate the physical address to read. So wrong bits can make the tool access a wrong address and fail to create the kdump. This can be avoided by the tool, so it's not critical. To fix it, set SECTION_NID_SHIFT to 6, which is the minimum number of available bits of the section flag field. Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev Fixes: 1f90a347 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Kazuhito Hagio <k-hagio-ab@nec.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wang Wensheng <wangwensheng4@huawei.com> Cc: Rui Xiang <rui.xiang@huawei.com> Cc: Kazu <k-hagio-ab@nec.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Ohhoon Kwon authored
As the last users of __section_nr() are gone, let's remove the unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Ohhoon Kwon authored
With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts mem_section to section_nr could be costly since it iterates all section roots to check if the given mem_section is in its range. On the other hand, __nr_to_section() which converts section_nr to mem_section can be done in O(1). Let's pass section_nr instead of mem_section ptr to find_memory_block() in order to reduce needless iterations. Link: https://lkml.kernel.org/r/20210707150212.855-3-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Ohhoon Kwon authored
Patch series "mm: sparse: remove __section_nr() function", v4. This patch (of 3): With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts mem_section to section_nr could be costly since it iterates all section roots to check if the given mem_section is in its range. Since both callers of section_mark_present already know section_nr, let's also pass section_nr as well as mem_section in order to reduce costly translation. Link: https://lkml.kernel.org/r/20210707150212.855-1-ohoono.kwon@samsung.com Link: https://lkml.kernel.org/r/20210707150212.855-2-ohoono.kwon@samsung.comSigned-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Muchun Song authored
register_page_bootmem_info_section() is only called from __init functions, so mark it __init as well. Link: https://lkml.kernel.org/r/20210817042221.77172-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Chen Wandun authored
mremap accounts the delta between new_len and old_len in vma_to_resize(), and then calls move_vma() when expanding an existing memory mapping. In move_vma() there are two scenarios when do_munmap() is called:

 1. move_page_tables() from old_addr to new_addr succeeded
 2. move_page_tables() from old_addr to new_addr failed

In the first scenario, old_len should be accounted if do_munmap() fails, because the delta has already been accounted. In the second scenario, new_addr/new_len are assigned to old_addr/old_len when move_page_tables() fails, so do_munmap() actually tries to unmap new_addr; if it fails, new_len should be accounted, because an error code is returned from move_vma() and the delta stays unaccounted. What's more, since new_len == old_len here, accounting old_len is also OK. In summary, accounting old_len is correct when do_munmap() fails. Link: https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com Fixes: 51df7bcb ("mm/mremap: account memory on do_munmap() failure") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Dmitry Safonov <dima@arista.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
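A sketch of the corrected error path in move_vma() (the exact surrounding conditions are elided; the point is accounting old_len rather than new_len on do_munmap() failure):

    if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
        /* OOM: unable to split vma, just get accounts right */
        if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
            vm_acct_memory(old_len >> PAGE_SHIFT);
        excess = 0;
    }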
-
Liam R. Howlett authored
Using vma_lookup() verifies the start address is contained in the found vma. This results in easier-to-read code. Link: https://lkml.kernel.org/r/20210817135234.1550204-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
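The pattern, before and after, as a generic sketch rather than a specific call site:

    /* before: find_vma() may return a vma that starts above addr */
    vma = find_vma(mm, addr);
    if (!vma || addr < vma->vm_start)
        return -EFAULT;

    /* after: vma_lookup() returns a vma only if it contains addr */
    vma = vma_lookup(mm, addr);
    if (!vma)
        return -EFAULT;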
-
Luigi Rizzo authored
find_vma() and variants need protection when used. This patch adds mmap_assert_locked() calls in the functions. To make sure the invariant is satisfied, we also need to add a mmap_read_lock() around the get_user_pages_remote() call in get_arg_page(). The lock is not strictly necessary because the mm has been newly created, but the extra cost is limited because the same mutex was also acquired shortly before in __bprm_mm_init(), so it is hot and uncontended. [penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs] Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com Signed-off-by: Luigi Rizzo <lrizzo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
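A sketch of the get_arg_page() part of the change (gup flag setup and error handling elided):

    struct page *page;
    int ret;

    /*
     * bprm->mm is newly created and not yet visible to anyone else, but
     * take mmap_lock anyway so the mmap_assert_locked() check in
     * find_vma() and friends is satisfied.
     */
    mmap_read_lock(bprm->mm);
    ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
                                &page, NULL, NULL);
    mmap_read_unlock(bprm->mm);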
-
Greg Kroah-Hartman authored
fault_in_pages_writeable() and fault_in_pages_readable() treat the size parameter as unsigned, doing pointer math with the value, so make this explicit and change it to a size_t type, which is how all callers currently treat it anyway. This solves the issue where static checkers get nervous seeing pointer arithmetic happening with a signed value. Link: https://lkml.kernel.org/r/20210727111136.457638-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reported-by: Jordy Zomer <jordy@pwning.systems> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Huang Ying authored
Before commit c5b5a3dd ("mm: thp: refactor NUMA fault handling"), the TLB flushing was done in do_huge_pmd_numa_page() itself via flush_tlb_range(). But after that commit, the TLB flushing is done in migrate_pages() anyway, via the following code path:

    do_huge_pmd_numa_page
      migrate_misplaced_page
        migrate_pages

So now the TLB flushing code in do_huge_pmd_numa_page() has become unnecessary, and it is deleted in this patch to simplify the code. This is only a code cleanup, there's no visible performance difference. The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is deleted too, because migrate_pages() takes care of that as well when the CPU TLB is flushed. Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Hellwig authored
flush_kernel_dcache_page is a rather confusing interface that implements a subset of flush_dcache_page by not being able to properly handle page cache mapped pages. The only callers left are in the exec code, as all other previous callers were incorrect since they could have dealt with page cache pages. Replace the calls to flush_kernel_dcache_page with calls to flush_dcache_page, which for all architectures either does exactly the same thing, or contains one or more of the following: 1) an optimization to defer the cache flush for page cache pages not mapped into userspace 2) additional flushing for mapped page cache pages if cache aliases are possible Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: Geoff Levand <geoff@infradead.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Hellwig authored
Pages used in scatterlist can be mapped page cache pages (and often are), so we must use flush_dcache_page here instead of the more limited flush_kernel_dcache_page that is intended for highmem pages only. Also remove the PageSlab check given that page_mapping_file as used by the flush_dcache_page implementations already contains that check. Link: https://lkml.kernel.org/r/20210712060928.4161649-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Shi <alexs@kernel.org> Cc: Geoff Levand <geoff@infradead.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-