    commit cc09ee80
    Author: Linus Torvalds

    Merge tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux

    Pull SLUB updates from Vlastimil Babka:
     "SLUB: reduce irq disabled scope and make it RT compatible
    
      This series was initially inspired by Mel's pcplist local_lock
      rewrite, and also by an interest in better understanding SLUB's
      locking and the new locking primitives, their RT variants and
      implications. It makes SLUB compatible with PREEMPT_RT and
      generally more preemption-friendly, apparently without significant
      regressions, as the fast paths are not affected.
    
      The main changes to SLUB by this series:
    
       - irq disabling is now only done for the minimum amount of time
         needed to protect the strict kmem_cache_cpu fields, and as part
         of spin lock, local lock and bit lock operations to make them
         irq-safe
    
       - SLUB is fully PREEMPT_RT compatible
    
      The series should now be sufficiently tested in both RT and !RT
      configs, mainly thanks to Mike.
    
      The RFC/v1 version also got basic performance screening by Mel that
      didn't show major regressions. Mike's hackbench testing of v2 on
      !RT reported negligible differences [6]:
    
        virgin(ish) tip
        5.13.0.g60ab3ed-tip
                  7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
                   221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
                    16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
                    13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
            27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
             8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
             1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
                 5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )
    
                   0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
                   0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
                   0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )
    
        5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
                  7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
                   223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
                    16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
                    13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
            27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
             8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
             1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
                 5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )
    
                   0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
                   0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
                   0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)
    
      RT configs showed some throughput regressions, but that's an
      expected tradeoff for the preemption improvements through the RT
      mutex. It didn't prevent v2 from being incorporated into the 5.13
      RT tree [7], leading to testing exposure and bugfixes.
    
      Before the series, SLUB is lockless in both allocation and free
      fast paths, but elsewhere it disables irqs for considerable periods
      of time - especially in the allocation slowpath and bulk
      allocation, where irqs are re-enabled only when a new page from the
      page allocator is needed and the context allows blocking. The irq
      disabled sections can then include deactivate_slab(), which walks a
      full freelist and frees the slab back to the page allocator, or
      unfreeze_partials(), which goes through a list of percpu partial
      slabs. The RT tree currently has some patches mitigating these, but
      we can do much better in mainline too.
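
      As a rough sketch of that pre-series pattern (simplified and
      hypothetical, not the literal code), the slowpath keeps irqs off
      except for the conditional re-enabling around the page allocator:

        local_irq_save(flags);                  /* slowpath entry */
        /* ... */
        /* allocate_slab(): irqs re-enabled only if gfp allows blocking */
        if (gfpflags_allow_blocking(gfpflags))
                local_irq_enable();
        page = alloc_pages(gfpflags, order);
        if (gfpflags_allow_blocking(gfpflags))
                local_irq_disable();
        /* ... */
        local_irq_restore(flags);               /* slowpath exit */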
    
      Patches 1-6 are straightforward improvements or cleanups that could
      exist outside of this series too, but are prerequisites for it.
    
      Patches 7-9 are also preparatory code changes without functional
      changes, but not so useful without the rest of the series.
    
      Patch 10 simplifies the fast paths on systems with preemption,
      based on the (hopefully correct) observation that the current loops
      to verify tid are unnecessary.
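
      A sketch of what that means (simplified, not the literal diff):

        /* before (sketch): retry until tid and c come from the same cpu */
        do {
                tid = this_cpu_read(s->cpu_slab->tid);
                c = raw_cpu_ptr(s->cpu_slab);
        } while (IS_ENABLED(CONFIG_PREEMPTION) &&
                 unlikely(tid != READ_ONCE(c->tid)));

        /* after (sketch): plain reads are enough; if we migrate to
         * another cpu in between, the fastpath's final cmpxchg on
         * (freelist, tid) fails and the allocation retries */
        c = raw_cpu_ptr(s->cpu_slab);
        tid = READ_ONCE(c->tid);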
    
      Patches 11-20 focus on reducing irq disabled scope in the allocation
      slowpath:
    
       - patch 11 moves the disabling of irqs into ___slab_alloc() from
         its callers, which are the allocation slowpath and bulk
         allocation. Instead, these callers only disable preemption to
         stabilize the cpu (see the sketch after this list).
    
       - the following patches then gradually reduce the scope of
         disabled irqs in ___slab_alloc() and the functions called from
         there. As of patch 14, the re-enabling of irqs based on gfp
         flags before calling the page allocator is removed from
         allocate_slab(). As of patch 17, it's possible to reach the page
         allocator (in case the existing slabs are depleted) without
         disabling and re-enabling irqs a single time.
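
      The patch 11 calling convention, roughly (a sketch; get_cpu_ptr()
      pins the cpu by disabling preemption):

        /* before (sketch): callers disable irqs around the whole slowpath */
        local_irq_save(flags);
        p = ___slab_alloc(s, gfpflags, node, addr, c);
        local_irq_restore(flags);

        /* after (sketch): callers only disable preemption;
         * ___slab_alloc() disables irqs itself, where needed */
        c = get_cpu_ptr(s->cpu_slab);
        p = ___slab_alloc(s, gfpflags, node, addr, c);
        put_cpu_ptr(s->cpu_slab);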
    
      Patches 21-26 reduce the scope of disabled irqs in functions
      related to unfreezing percpu partial slabs.
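
      The resulting shape is roughly the following sketch (illustrative,
      not the actual code): detach the whole percpu partial list in a
      short irq disabled section, then process it with irqs enabled:

        local_irq_save(flags);
        partial_page = this_cpu_read(s->cpu_slab->partial);
        this_cpu_write(s->cpu_slab->partial, NULL);
        local_irq_restore(flags);

        while (partial_page) {          /* irqs are on again */
                struct page *next = partial_page->next;

                /* unfreeze the slab, or discard it back to the
                 * page allocator if it is empty */
                partial_page = next;
        }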
    
      Patch 27 is preparatory. Patch 28 is adopted from the RT tree and
      converts the flushing of percpu slabs on all cpus from using IPIs
      to a workqueue, so that the processing isn't happening with irqs
      disabled in the IPI handler. The flushing is not performance
      critical, so this should be acceptable.
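
      A sketch of the idea (names like flush_works and
      flush_cpu_slab_workfn are illustrative, not the actual
      identifiers):

        struct slub_flush_work {                /* sketch */
                struct work_struct work;
                struct kmem_cache *s;
        };
        static DEFINE_PER_CPU(struct slub_flush_work, flush_works);

        static void flush_cpu_slab_workfn(struct work_struct *w)
        {
                struct slub_flush_work *sfw =
                        container_of(w, struct slub_flush_work, work);

                /* flush sfw->s's cpu slab here, in task context
                 * with irqs enabled */
        }

        static void flush_all_sketch(struct kmem_cache *s)
        {
                int cpu;

                for_each_online_cpu(cpu) {
                        struct slub_flush_work *sfw = &per_cpu(flush_works, cpu);

                        sfw->s = s;
                        INIT_WORK(&sfw->work, flush_cpu_slab_workfn);
                        schedule_work_on(cpu, &sfw->work);
                }
                for_each_online_cpu(cpu)
                        flush_work(&per_cpu(flush_works, cpu).work);
        }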
    
      Patch 29 also comes from the RT tree and makes object_map_lock RT
      compatible.
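
      In essence (sketch):

        /* a raw_spinlock_t remains a real spinning lock on PREEMPT_RT,
         * where a plain spinlock_t becomes a sleeping lock */
        static DEFINE_RAW_SPINLOCK(object_map_lock);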
    
      Patch 30 makes slab_lock() irq-safe on RT, where we cannot rely on
      irqs being disabled from the list_lock spin lock usage.
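
      Roughly (a sketch; the real helper may carry extra annotations):

        static inline void slab_lock(struct page *page, unsigned long *flags)
        {
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_save(*flags);
                bit_spin_lock(PG_locked, &page->flags);
        }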
    
      Patch 31 changes the kmem_cache_cpu->partial handling in
      put_cpu_partial() from a cmpxchg loop to a short irq disabled
      section, which is what all other code modifying the field uses.
      This addresses a theoretical race scenario pointed out by Jann, and
      makes the critical section safe wrt RT local_lock semantics after
      the conversion in patch 33.
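
      Roughly (a sketch, not the literal diff):

        /* before (sketch): lockless update via a cmpxchg retry loop */
        do {
                oldpage = this_cpu_read(s->cpu_slab->partial);
                page->next = oldpage;
        } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
                 != oldpage);

        /* after (sketch): a short irq disabled section, like every
         * other writer of the ->partial field */
        local_irq_save(flags);
        page->next = this_cpu_read(s->cpu_slab->partial);
        this_cpu_write(s->cpu_slab->partial, page);
        local_irq_restore(flags);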
    
      Patch 32 changes preempt disable to migrate disable, so that the
      nested list_lock spinlock is safe to take on RT. Because
      migrate_disable() is a function call even on !RT, a small set of
      private wrappers is introduced to keep using the cheaper
      preempt_disable() on !PREEMPT_RT configurations. As of this patch,
      SLUB should already be compatible with RT's lock semantics.
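
      The wrappers look roughly like this (a sketch following the
      commit; details may differ):

        #ifdef CONFIG_PREEMPT_RT
        #define slub_get_cpu_ptr(var)           \
        ({                                      \
                migrate_disable();              \
                this_cpu_ptr(var);              \
        })
        #define slub_put_cpu_ptr(var)           \
        do {                                    \
                (void)(var);                    \
                migrate_enable();               \
        } while (0)
        #else
        #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)
        #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
        #endif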
    
      Finally, patch 33 replaces the irq disabled sections that protect
      kmem_cache_cpu fields in the slow paths with a local lock. However,
      on PREEMPT_RT this means the lockless fast paths can now preempt
      slow paths which don't expect that, so the local lock has to be
      taken also in the fast paths and they are no longer lockless. RT
      folks seem not to mind this tradeoff. The patch also updates the
      locking documentation in the file's comment"
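
    As a rough illustration of that final conversion (a sketch; the field
    layout is abbreviated):

      struct kmem_cache_cpu {
              void **freelist;        /* lockless fastpath freelist */
              unsigned long tid;      /* transaction id */
              struct page *page;      /* the cpu slab */
              local_lock_t lock;      /* taken in the slow paths; maps to
                                         irq disabling on !RT and to a
                                         per-cpu spinlock on RT */
      };

      /* slow path sketch */
      local_lock_irqsave(&s->cpu_slab->lock, flags);
      /* ... manipulate c->freelist / c->page ... */
      local_unlock_irqrestore(&s->cpu_slab->lock, flags);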
    
    Mike Galbraith and Mel Gorman verified that their earlier testing
    observations still hold for the final series:
    
    Link: https://lore.kernel.org/lkml/89ba4f783114520c167cc915ba949ad2c04d6790.camel@gmx.de/
    Link: https://lore.kernel.org/lkml/20210907082010.GB3959@techsingularity.net/
    
    * tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits)
      mm, slub: convert kmem_cpu_slab protection to local_lock
      mm, slub: use migrate_disable() on PREEMPT_RT
      mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
      mm, slub: make slab_lock() disable irqs with PREEMPT_RT
      mm: slub: make object_map_lock a raw_spinlock_t
      mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
      mm, slab: split out the cpu offline variant of flush_slab()
      mm, slub: don't disable irqs in slub_cpu_dead()
      mm, slub: only disable irq with spin_lock in __unfreeze_partials()
      mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
      mm, slub: detach whole partial list at once in unfreeze_partials()
      mm, slub: discard slabs in unfreeze_partials() without irqs disabled
      mm, slub: move irq control into unfreeze_partials()
      mm, slub: call deactivate_slab() without disabling irqs
      mm, slub: make locking in deactivate_slab() irq-safe
      mm, slub: move reset of c->page and freelist out of deactivate_slab()
      mm, slub: stop disabling irqs around get_partial()
      mm, slub: check new pages with restored irqs
      mm, slub: validate slab from partial list or page allocator before making it cpu slab
      mm, slub: restore irqs around calling new_slab()
      ...