    sched: Introduce per-memory-map concurrency ID · af7f588d
    Mathieu Desnoyers authored
    This feature allows the scheduler to expose a per-memory map concurrency
    ID to user-space. This concurrency ID is within the possible cpus range,
    and is temporarily (and uniquely) assigned while threads are actively
    running within a memory map. If a memory map has fewer threads than
    cores, or is limited to run on few cores concurrently through sched
    affinity or cgroup cpusets, the concurrency IDs will be values close
    to 0, thus allowing efficient use of user-space memory for per-cpu
    data structures.
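
    As an illustration of why the IDs stay close to 0, here is a minimal
    user-space sketch of a lowest-free-ID allocator. It is not the kernel
    code from this patch: the names mm_sketch, mm_cid_get_sketch and
    mm_cid_put_sketch, and the NR_POSSIBLE_CPUS value, are assumptions made
    for this example, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS 64 /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for the per-memory-map state. */
struct mm_sketch {
    bool cid_used[NR_POSSIBLE_CPUS]; /* true while a running thread holds the ID */
};

/* Hand out the lowest free ID, or -1 if every ID is taken. */
static int mm_cid_get_sketch(struct mm_sketch *mm)
{
    for (int cid = 0; cid < NR_POSSIBLE_CPUS; cid++) {
        if (!mm->cid_used[cid]) {
            mm->cid_used[cid] = true;
            return cid;
        }
    }
    return -1;
}

/* Release an ID when its thread stops running. */
static void mm_cid_put_sketch(struct mm_sketch *mm, int cid)
{
    assert(cid >= 0 && cid < NR_POSSIBLE_CPUS && mm->cid_used[cid]);
    mm->cid_used[cid] = false;
}
```

    With this scheme, a memory map that never has more than N threads
    running at once only ever sees IDs in the range 0..N-1, so user-space
    can size its per-cpu data structures for N slots rather than for every
    possible CPU.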
    
    This feature is meant to be exposed by a new rseq thread area field.
    
    The primary purpose of this feature is to do the heavy-lifting needed
    by memory allocators to allow them to use per-cpu data structures
    efficiently in the following situations:
    
    - Single-threaded applications,
    - Multi-threaded applications on large systems (many cores) with limited
      cpu affinity mask,
    - Multi-threaded applications on large systems (many cores) with
      restricted cgroup cpuset per container.
    
    One of the key concerns from scheduler maintainers is the overhead
    associated with additional spin locks or atomic operations in the
    scheduler fast-path. This is why the following optimization is
    implemented.
    
    On context switch between threads belonging to the same memory map,
    transfer the mm_cid from prev to next without any atomic ops. This
    takes care of use-cases involving frequent context switch between
    threads belonging to the same memory map.
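
    The hand-over described above can be sketched in user-space as follows.
    This is a simplified model under stated assumptions, not the code from
    this patch: task_sketch, mm_sketch, cid_get, cid_put and
    switch_mm_cid_sketch are hypothetical names, every task is assumed to
    have a memory map, and a plain counter (lock_count) stands in for the
    per-memory-map spin lock so that the lock-free path is observable.

```c
#include <assert.h>

#define NR_CIDS 8 /* hypothetical ID space for this sketch */

struct mm_sketch {
    unsigned int used_mask;   /* bit n set: ID n is currently held */
    unsigned long lock_count; /* counts simulated lock acquisitions */
};

struct task_sketch {
    struct mm_sketch *mm;
    int mm_cid; /* -1 while the task is not running */
};

/* Slow path: take the (simulated) lock and grab the lowest free ID. */
static int cid_get(struct mm_sketch *mm)
{
    mm->lock_count++;
    for (int cid = 0; cid < NR_CIDS; cid++) {
        if (!(mm->used_mask & (1u << cid))) {
            mm->used_mask |= 1u << cid;
            return cid;
        }
    }
    return -1;
}

/* Slow path: take the (simulated) lock and release the ID. */
static void cid_put(struct mm_sketch *mm, int cid)
{
    assert(cid >= 0 && cid < NR_CIDS);
    mm->lock_count++;
    mm->used_mask &= ~(1u << cid);
}

static void switch_mm_cid_sketch(struct task_sketch *prev,
                                 struct task_sketch *next)
{
    if (prev->mm == next->mm && prev->mm_cid >= 0) {
        /* Same memory map: hand the ID over without touching the lock. */
        next->mm_cid = prev->mm_cid;
        prev->mm_cid = -1;
        return;
    }
    /* Different memory maps: release prev's ID, allocate one for next. */
    if (prev->mm_cid >= 0) {
        cid_put(prev->mm, prev->mm_cid);
        prev->mm_cid = -1;
    }
    next->mm_cid = cid_get(next->mm);
}
```

    In this model, switching between two threads of the same memory map
    leaves lock_count unchanged, while switching across memory maps costs
    one simulated lock acquisition on each side.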
    
    Additional optimizations can be done if the spin locks added when
    context switching between threads belonging to different memory maps end
    up being a performance bottleneck. Those are left out of this patch
    though. A performance impact would have to be clearly demonstrated to
    justify the added complexity.
    
    The credit goes to Paul Turner (Google) for the original virtual cpu id
    idea. This feature is implemented based on the discussions with Paul
    Turner and Peter Oskolkov (Google), but I took the liberty to implement
    scheduler fast-path optimizations and my own NUMA-awareness scheme.
    Rumor has it that Google has been running a rseq vcpu_id extension
    internally in production for a year. The tcmalloc source code indeed has
    comments hinting at a vcpu_id prototype extension to the rseq system
    call [1].
    
    The following benchmarks do not show any significant overhead added to
    the scheduler context switch by this feature:
    
    * perf bench sched messaging (process)
    
    Baseline:                    86.5±0.3 ms
    With mm_cid:                 86.7±2.6 ms
    
    * perf bench sched messaging (threaded)
    
    Baseline:                    84.3±3.0 ms
    With mm_cid:                 84.7±2.6 ms
    
    * hackbench (process)
    
    Baseline:                    82.9±2.7 ms
    With mm_cid:                 82.9±2.9 ms
    
    * hackbench (threaded)
    
    Baseline:                    85.2±2.6 ms
    With mm_cid:                 84.4±2.9 ms
    
    [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com