1. 29 Sep, 2022 2 commits
  2. 26 Sep, 2022 1 commit
  3. 23 Sep, 2022 4 commits
    • mm/slub: enable debugging memory wasting of kmalloc · 6edf2576
      Feng Tang authored
      The kmalloc API family is critical for mm, and by nature it rounds up
      the requested size to a fixed one (mostly a power of 2). Say a user
      requests memory for '2^n + 1' bytes; 2^(n+1) bytes may actually be
      allocated, so in the worst case around 50% of the memory space is
      wasted.
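
      As a minimal illustration (a hypothetical snippet, not part of this
      patch, assuming <linux/slab.h> is available), ksize() reports the size
      actually reserved for an allocation, so the rounding is easy to observe:

          void *p = kmalloc(1032, GFP_KERNEL);    /* 2^10 + 8 bytes requested */

          if (p) {
                  /* served from the kmalloc-2048 size class, ~1016 bytes unused */
                  pr_info("requested 1032, ksize() reports %zu\n", ksize(p));
                  kfree(p);
          }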
      
      The wastage is not a big issue for requests that get allocated and freed
      quickly, but it may cause problems for objects with a longer lifetime.
      
      We hit a kernel boot OOM panic (v5.10), and the dumped slab info showed:
      
          [   26.062145] kmalloc-2k            814056KB     814056KB
      
      Debugging showed a huge number of 'struct iova_magazine' objects, whose
      size is 1032 bytes (1024 + 8), so each allocation wastes 1016 bytes.
      Though the issue was solved by providing the right (bigger) amount of
      RAM, it is still nice to optimize the size (either use a kmalloc-friendly
      size or create a dedicated slab for it).
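
      A dedicated cache is the usual way to avoid the size-class rounding
      entirely; a rough sketch of that option (illustrative only, this patch
      does not touch the iova code):

          /* an exactly-sized cache wastes no per-object space */
          static struct kmem_cache *iova_magazine_cache;
          struct iova_magazine *mag;

          iova_magazine_cache = kmem_cache_create("iova_magazine",
                                                  sizeof(struct iova_magazine),
                                                  0, 0, NULL);
          mag = kmem_cache_alloc(iova_magazine_cache, GFP_KERNEL);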
      
      And from the lkml archive, there was another crash-kernel OOM case [1]
      back in 2019 which seems to be related to a similar slab waste
      situation, as the log is similar:
      
          [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
          [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
          ...
          [    4.857565] kmalloc-2048           59164KB      59164KB
      
      The crash kernel only has 256M of memory, and 59M is pretty big here.
      (Note: the related code has been changed and optimised in recent
      kernels [2]; these logs are just picked to demonstrate the problem, and
      a patch changing the structure's size to 1024 bytes has been merged.)
      
      So add a way to track each kmalloc's memory waste, and leverage the
      existing SLUB debug framework (specifically SLUB_STORE_USER) to show
      the call stack of the original allocation, so that users can evaluate
      the waste situation, identify hot spots and optimize accordingly, for
      better memory utilization.
      
      The waste info is integrated into the existing interface
      '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces'; one example for
      'kmalloc-4k' after boot is:
      
       126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
           __kmem_cache_alloc_node+0x11f/0x4e0
           __kmalloc_node+0x4e/0x140
           ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
           ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
           ixgbe_probe+0x165f/0x1d20 [ixgbe]
           local_pci_probe+0x78/0xc0
           work_for_cpu_fn+0x26/0x40
           ...
      
      which means that in the 'kmalloc-4k' slab there are 126 requests of
      2240 bytes each that got a 4KB slot (wasting 1856 bytes each and
      233856 bytes in total), from ixgbe_alloc_q_vector().
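
      (To actually see such entries, the cache needs SLUB_STORE_USER tracking
      enabled, typically via the 'slub_debug' boot parameter, e.g.
      'slub_debug=U'; this is an assumption about usual usage rather than
      something spelled out in this commit.)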
      
      And when the system runs a real workload, like multiple docker
      instances, the waste could be more severe.
      
      [1]. https://lkml.org/lkml/2019/8/12/266
      [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
      
      [Thanks Hyeonggon for pointing out several bugs about sorting/format]
      [Thanks Vlastimil for suggesting way to reduce memory usage of
       orig_size and keep it only for kmalloc objects]
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    • Merge branch 'slab/for-6.1/slub_validation_locking' into slab/for-next · 5959725a
      Vlastimil Babka authored
      My series [1] to fix validation races for caches with debugging enabled.
      
      By decoupling the debug cache operations further from the non-debug
      fastpaths, additional locking simplifications became possible and were
      done afterwards.
      
      Additional cleanup of PREEMPT_RT specific code on top, by Thomas Gleixner.
      
      [1] https://lore.kernel.org/all/20220823170400.26546-1-vbabka@suse.cz/
    • Merge branch 'slab/for-6.1/common_kmalloc' into slab/for-next · 3662c13e
      Vlastimil Babka authored
      The "common kmalloc v4" series [1] by Hyeonggon Yoo.
      
      - Improves the mm/slab_common.c wrappers to allow deleting duplicated
        code between SLAB and SLUB.
      - Large kmalloc() allocations in SLAB are passed to the page allocator
        like in SLUB, reducing the number of kmalloc caches.
      - Removes the {kmem_cache_alloc,kmalloc}_node variants of tracepoints,
        with the node id parameter added to the non-_node variants.
      - 8 files changed, 341 insertions(+), 651 deletions(-)
      
      [1] https://lore.kernel.org/all/20220817101826.236819-1-42.hyeyoo@gmail.com/
      
      --
      The merge resolves a trivial conflict in mm/slub.c with commit 5373b8a0
      ("kasan: call kasan_malloc() from __kmalloc_*track_caller()").
    • Merge branch 'slab/for-6.1/trivial' into slab/for-next · 0467ca38
      Vlastimil Babka authored
      Trivial fixes and cleanups:
      - unneeded variable removals, by ye xingchen
  4. 22 Sep, 2022 1 commit
    • mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context. · e45cc288
      Maurizio Lombardi authored
      Commit 5a836bf6 ("mm: slub: move flush_cpu_slab() invocations
      __free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
      invocations to the global workqueue to avoid a problem related to
      deactivate_slab()/__free_slab() being called from an IRQ context
      on PREEMPT_RT kernels.
      
      When the flush_all_cpus_locked() function is called from a task context,
      it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up
      flushing the global workqueue; this causes a dependency issue:
      
       workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
         is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
       WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
         check_flush_dependency+0x10a/0x120
       Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
       RIP: 0010:check_flush_dependency+0x10a/0x120
       [  453.262125] Call Trace:
       __flush_work.isra.0+0xbf/0x220
       ? __queue_work+0x1dc/0x420
       flush_all_cpus_locked+0xfb/0x120
       __kmem_cache_shutdown+0x2b/0x320
       kmem_cache_destroy+0x49/0x100
       bioset_exit+0x143/0x190
       blk_release_queue+0xb9/0x100
       kobject_cleanup+0x37/0x130
       nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
       nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]
      
      Fix this bug by creating a workqueue for the flush operation with
      the WQ_MEM_RECLAIM bit set.
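
      A rough sketch of that approach (the names and surrounding context are
      illustrative, not necessarily the exact patch):

          /* a dedicated workqueue with WQ_MEM_RECLAIM, created once at init */
          static struct workqueue_struct *flushwq;

          flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);

          /* queue the per-cpu flush work here instead of the system workqueue */
          queue_work_on(cpu, flushwq, &sfw->work);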
      
      Fixes: 5a836bf6 ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
  5. 19 Sep, 2022 1 commit
    • mm/slab_common: fix possible double free of kmem_cache · d71608a8
      Feng Tang authored
      When running a slub_debug test, kfence's 'test_memcache_typesafe_by_rcu'
      kunit test case causes a use-after-free error:
      
        BUG: KASAN: use-after-free in kobject_del+0x14/0x30
        Read of size 8 at addr ffff888007679090 by task kunit_try_catch/261
      
        CPU: 1 PID: 261 Comm: kunit_try_catch Tainted: G    B            N 6.0.0-rc5-next-20220916 #17
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x48
         print_address_description.constprop.0+0x87/0x2a5
         print_report+0x103/0x1ed
         kasan_report+0xb7/0x140
         kobject_del+0x14/0x30
         kmem_cache_destroy+0x130/0x170
         test_exit+0x1a/0x30
         kunit_try_run_case+0xad/0xc0
         kunit_generic_run_threadfn_adapter+0x26/0x50
         kthread+0x17b/0x1b0
         </TASK>
      
      The cause is inside kmem_cache_destroy():
      
      kmem_cache_destroy
          acquire lock/mutex
          shutdown_cache
              schedule_work(kmem_cache_release) (if RCU flag set)
          release lock/mutex
          kmem_cache_release (if RCU flag not set)
      
      With certain timing, the scheduled work can run before the next check of
      the RCU flag, which then reads a wrong value and leads to a double
      kmem_cache_release().
      
      Fix it by caching the RCU flag inside the protected area, just like 'refcnt'.
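
      A simplified sketch of the idea (the real kmem_cache_destroy() has more
      steps between these points):

          mutex_lock(&slab_mutex);
          /* sample both values while the cache cannot be released under us */
          refcnt = --s->refcount;
          rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU;
          /* shutdown_cache() may schedule kmem_cache_release() for RCU caches */
          mutex_unlock(&slab_mutex);

          /* use the cached flag; re-reading s->flags could race with the work */
          if (!refcnt && !rcu_set)
                  kmem_cache_release(s);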
      
      Fixes: 0495e337 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
  6. 16 Sep, 2022 6 commits
    • slub: Make PREEMPT_RT support less convoluted · 1f04b07d
      Thomas Gleixner authored
      The slub code already has a few helpers depending on PREEMPT_RT. Add a few
      more and get rid of the CONFIG_PREEMPT_RT conditionals all over the place.
      
      No functional change.
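
      The general pattern (illustrative only, not necessarily the exact
      helpers added by this patch) is to hide the RT/non-RT difference behind
      one definition instead of repeating the #ifdef at every use site:

          /* callers branch with a plain C 'if' instead of an #ifdef block */
          #ifndef CONFIG_PREEMPT_RT
          #define USE_LOCKLESS_FAST_PATH()        (true)
          #else
          #define USE_LOCKLESS_FAST_PATH()        (false)
          #endif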
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-mm@kvack.org
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    • mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock() · 5875e598
      Vlastimil Babka authored
      The PREEMPT_RT-specific disabling of irqs in __cmpxchg_double_slab()
      (through slab_[un]lock()) is unnecessary, as bit_spin_lock() disables
      preemption and that is sufficient on PREEMPT_RT, where no allocation/free
      operation is performed in hardirq context and so can't interrupt the
      current operation.
      
      That means we no longer need the slab_[un]lock() wrappers, so delete
      them and rename the current __slab_[un]lock() to slab_[un]lock().
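
      A simplified sketch of what the renamed helpers boil down to (an
      approximation, not a verbatim copy of the patch):

          static __always_inline void slab_lock(struct slab *slab)
          {
                  /* bit_spin_lock() disables preemption, enough on PREEMPT_RT */
                  bit_spin_lock(PG_locked, &slab_page(slab)->flags);
          }

          static __always_inline void slab_unlock(struct slab *slab)
          {
                  __bit_spin_unlock(PG_locked, &slab_page(slab)->flags);
          }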
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    • mm/slub: convert object_map_lock to non-raw spinlock · 4ef3f5a3
      Vlastimil Babka authored
      The only remaining user of object_map_lock is list_slab_objects().
      Obtaining the lock there used to happen under slab_lock() which implied
      disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the
      slab_lock() removed, we can convert it to a normal spinlock.
      
      Also remove the get_map()/put_map() wrappers as list_slab_objects()
      became their only remaining user.
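
      The gist of the conversion, as a sketch:

          /* with slab_lock() gone, a normal spinlock (sleeping on RT) is fine */
          static DEFINE_SPINLOCK(object_map_lock);    /* was DEFINE_RAW_SPINLOCK */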
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    • mm/slub: remove slab_lock() usage for debug operations · 41bec7c3
      Vlastimil Babka authored
      All alloc and free operations on debug caches are now serialized by
      n->list_lock, so we can remove slab_lock() usage in validate_slab()
      and list_slab_objects() as those also happen under n->list_lock.
      
      Note the usage in list_slab_objects() could happen even on non-debug
      caches, but only during cache shutdown, so there should not be any
      parallel freeing activity anymore, except for buggy slab users; in
      that case the slab_lock() would not help against the common
      cmpxchg-based fast paths (in non-debug caches) anyway.
      
      Also adjust documentation comments accordingly.
      Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
    • mm/slub: restrict sysfs validation to debug caches and make it safe · c7323a5a
      Vlastimil Babka authored
      Rongwei Wang reports [1] that cache validation triggered by writing to
      /sys/kernel/slab/<cache>/validate is racy against normal cache
      operations (e.g. freeing) in a way that can cause false positive
      inconsistency reports for caches with debugging enabled. The problem is
      that debugging actions that mark object free or active and actual
      freelist operations are not atomic, and the validation can see an
      inconsistent state.
      
      Whether or not a cache has debugging enabled, additional races involving
      n->nr_slabs are possible and can result in false reports of wrong slab
      counts.
      
      This patch attempts to solve these issues while not adding overhead to
      normal (especially fastpath) operations for caches that do not have
      debugging enabled. Such overhead would not be justified to make possible
      userspace-triggered validation safe. Instead, disable the validation for
      caches that don't have debugging enabled and make their sysfs validate
      handler return -EINVAL.
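
      A simplified sketch of a sysfs handler with that restriction (an
      approximation, not a verbatim copy of the patch):

          static ssize_t validate_store(struct kmem_cache *s,
                                        const char *buf, size_t length)
          {
                  int ret = -EINVAL;

                  /* only caches with debugging enabled can be validated safely */
                  if (buf[0] == '1' && kmem_cache_debug(s)) {
                          ret = validate_slab_cache(s);
                          if (ret >= 0)
                                  ret = length;
                  }
                  return ret;
          }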
      
      For caches that do have debugging enabled, we can instead extend the
      existing approach of not using percpu freelists to force all alloc/free
      operations onto the slow paths, where the debugging flags are checked
      and acted upon. There we can adjust the debug-specific paths to increase
      n->list_lock coverage against concurrent validation as necessary.
      
      The processing on free in free_debug_processing() already happens under
      n->list_lock so we can extend it to actually do the freeing as well and
      thus make it atomic against concurrent validation. As observed by
      Hyeonggon Yoo, we do not really need to take slab_lock() anymore here
      because all paths we could race with are protected by n->list_lock under
      the new scheme, so drop its usage here.
      
      The processing on alloc in alloc_debug_processing() currently doesn't
      take any locks, but we have to first allocate the object from a slab on
      the partial list (as debugging caches have no percpu slabs) and thus
      take the n->list_lock anyway. Add a function alloc_single_from_partial()
      that grabs just the allocated object instead of the whole freelist, and
      does the debug processing. The n->list_lock coverage again makes it
      atomic against validation and it is also ultimately more efficient than
      the current grabbing of freelist immediately followed by slab
      deactivation.
      
      To prevent races on n->nr_slabs updates, make sure that for caches with
      debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
      n->list_lock. When allocating a new slab for a debug cache, handle the
      allocation by a new function alloc_single_from_new_slab() instead of the
      current forced deactivation path.
      
      Neither of these changes affect the fast paths at all. The changes in
      slow paths are negligible for non-debug caches.
      
      [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/

      Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    • kasan: call kasan_malloc() from __kmalloc_*track_caller() · 5373b8a0
      Peter Collingbourne authored
      We were failing to call kasan_malloc() from __kmalloc_*track_caller()
      which was causing us to sometimes fail to produce KASAN error reports
      for allocations made using e.g. devm_kcalloc(), as the KASAN poison was
      not being initialized. Fix it.
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Cc: <stable@vger.kernel.org> # 5.15
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
  7. 08 Sep, 2022 1 commit
    • mm/slub: fix to return errno if kmalloc() fails · 7e9c323c
      Chao Yu authored
      In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to
      out-of-memory; if it fails, return the errno correctly rather than
      triggering a panic via BUG_ON():
      
      kernel BUG at mm/slub.c:5893!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      
      Call trace:
       sysfs_slab_add+0x258/0x260 mm/slub.c:5973
       __kmem_cache_create+0x60/0x118 mm/slub.c:4899
       create_cache mm/slab_common.c:229 [inline]
       kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
       kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
       f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
       f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
       f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
       mount_bdev+0x1b8/0x210 fs/super.c:1400
       f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
       legacy_get_tree+0x30/0x74 fs/fs_context.c:610
       vfs_get_tree+0x40/0x140 fs/super.c:1530
       do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
       path_mount+0x358/0x914 fs/namespace.c:3370
       do_mount fs/namespace.c:3383 [inline]
       __do_sys_mount fs/namespace.c:3591 [inline]
       __se_sys_mount fs/namespace.c:3568 [inline]
       __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
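
      The rough shape of such a fix (a sketch only; since create_unique_id()
      returns a string, an ERR_PTR is one natural way to propagate the errno,
      with the caller checking IS_ERR()):

          char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);

          if (!name)
                  return ERR_PTR(-ENOMEM);    /* instead of BUG_ON(!name) */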
      
      Cc: <stable@kernel.org>
      Fixes: 81819f0f ("SLUB core")
      Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Chao Yu <chao.yu@oppo.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
  8. 01 Sep, 2022 7 commits
  9. 25 Aug, 2022 1 commit
  10. 24 Aug, 2022 11 commits
  11. 23 Aug, 2022 2 commits
  12. 22 Aug, 2022 1 commit
  13. 21 Aug, 2022 2 commits