• Linus Torvalds's avatar
    Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b167fdff
    Linus Torvalds authored
    Pull scheduler updates from Ingo Molnar:
    "Load-balancing improvements:
    
       - Improve NUMA balancing on AMD Zen systems for affine workloads.
    
       - Improve the handling of reduced-capacity CPUs in load-balancing.
    
       - Energy Model improvements: fix & refine all the energy fairness
         metrics (PELT), and remove the conservative threshold requiring 6%
         energy savings to migrate a task. Doing this improves power
         efficiency for most workloads, and also increases the reliability
         of energy-efficiency scheduling.
    
       - Optimize/tweak select_idle_cpu() to spend (much) less time
         searching for an idle CPU on overloaded systems. There's reports of
         several milliseconds spent there on large systems with large
         workloads ...
    
         [ Since the search logic changed, there might be behavioral side
           effects. ]
    
       - Improve NUMA imbalance behavior. On certain systems with spare
         capacity, initial placement of tasks is non-deterministic, and such
         an artificial placement imbalance can persist for a long time,
         hurting (and sometimes helping) performance.
    
         The fix is to make fork-time task placement consistent with runtime
         NUMA balancing placement.
    
         Note that some performance regressions were reported against this,
         caused by workloads that are not memory bandwith limited, which
         benefit from the artificial locality of the placement bug(s). Mel
         Gorman's conclusion, with which we concur, was that consistency is
         better than random workload benefits from non-deterministic bugs:
    
            "Given there is no crystal ball and it's a tradeoff, I think
             it's better to be consistent and use similar logic at both fork
             time and runtime even if it doesn't have universal benefit."
    
       - Improve core scheduling by fixing a bug in
         sched_core_update_cookie() that caused unnecessary forced idling.
    
       - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
         for newly woken tasks.
    
       - Fix a newidle balancing bug that introduced unnecessary wakeup
         latencies.
    
      ABI improvements/fixes:
    
       - Do not check capabilities and do not issue capability check denial
         messages when a scheduler syscall doesn't require privileges. (Such
         as increasing niceness.)
    
       - Add forced-idle accounting to cgroups too.
    
       - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
         (No existing tooling is known to have learned to rely on the
         previous behavior.)
    
       - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
    
      Optimizations:
    
       - Optimize & simplify leaf_cfs_rq_list()
    
       - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
    
      Misc fixes & cleanups:
    
       - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
    
       - Fix a full-NOHZ bug that can in some cases result in the tick not
         being re-enabled when the last SCHED_RT task is gone from a
         runqueue but there's still SCHED_OTHER tasks around.
    
       - Various PREEMPT_RT related fixes.
    
       - Misc cleanups & smaller fixes"
    
    * tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
      rseq: Kill process when unknown flags are encountered in ABI structures
      rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
      sched/core: Fix the bug that task won't enqueue into core tree when update cookie
      nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
      sched/core: Always flush pending blk_plug
      sched/fair: fix case with reduced capacity CPU
      sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
      sched/core: add forced idle accounting for cgroups
      sched/fair: Remove the energy margin in feec()
      sched/fair: Remove task_util from effective utilization in feec()
      sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
      sched/fair: Rename select_idle_mask to select_rq_mask
      sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
      sched/fair: Decay task PELT values during wakeup migration
      sched/fair: Provide u64 read for 32-bits arch helper
      sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
      sched: only perform capability check on privileged operation
      sched: Remove unused function group_first_cpu()
      sched/fair: Remove redundant word " *"
      selftests/rseq: check if libc rseq support is registered
      ...
    b167fdff
core.c 283 KB