1. 12 Sep, 2024 1 commit
    • sched: Move update_other_load_avgs() to kernel/sched/pelt.c · 902d67a2
      Tejun Heo authored
      96fd6c65 ("sched: Factor out update_other_load_avgs() from
      __update_blocked_others()") added update_other_load_avgs() in
      kernel/sched/syscalls.c right above effective_cpu_util(). This location
      didn't fit that well in the first place, and with 5d871a63 ("sched/fair:
      Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
      effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
      place.
      
      Relocate the function to kernel/sched/pelt.c where all its callees are.
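
      For reference, a sketch of the relocated helper's assumed shape (illustrative,
      not a verbatim copy) - every call it makes is a pelt.c helper:

        /* sketch: update the RT, DL, hw-pressure and IRQ PELT signals for @rq */
        bool update_other_load_avgs(struct rq *rq)
        {
                u64 now = rq_clock_pelt(rq);
                const struct sched_class *curr_class = rq->curr->sched_class;
                unsigned long hw_pressure = arch_scale_hw_pressure(cpu_of(rq));

                lockdep_assert_rq_held(rq);

                return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
                       update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
                       update_hw_load_avg(now, rq, hw_pressure) |
                       update_irq_load_avg(rq, 0);
        }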
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
  2. 11 Sep, 2024 5 commits
  3. 10 Sep, 2024 9 commits
    • sched_ext: Don't trigger ops.quiescent/runnable() on migrations · 513ed0c7
      Tejun Heo authored
      A task moving across CPUs should not trigger quiescent/runnable task state
      events as it stays runnable the whole time and merely stops on one CPU and
      starts on another. Suppress quiescent/runnable task state events if
      task_on_rq_migrating().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: David Vernet <void@manifault.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Synchronize bypass state changes with rq lock · 750a40d8
      Tejun Heo authored
      While the BPF scheduler is being unloaded, the following warning messages
      trigger sometimes:
      
       NOHZ tick-stop error: local softirq work is pending, handler #80!!!
      
      This is caused by the CPU entering idle while there are pending softirqs.
      The main culprit is the bypassing state assertion not being synchronized
      with rq operations. As the BPF scheduler cannot be trusted in the disable
      path, the first step is entering the bypass mode where the BPF scheduler is
      ignored and scheduling becomes global FIFO.
      
      This is implemented by making scx_ops_bypassing() return true. However, the
      transition isn't synchronized against anything, and it's possible for the
      enqueue and dispatch paths to have different ideas on whether bypass mode is
      on.
      
      Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
      modified while rq is locked.
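
      Roughly, the per-rq test then looks like the sketch below and is only
      meaningful while that rq's lock is held (illustrative; names per the
      description above):

        static inline bool scx_rq_bypassing(struct rq *rq)
        {
                /* SCX_RQ_BYPASSING is set and cleared only under the rq lock */
                return unlikely(rq->scx.flags & SCX_RQ_BYPASSING);
        }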
      
      This removes most of the NOHZ tick-stop messages, but not all of them. I
      believe the stragglers are from the sched core bug where pick_task_scx() can
      be called without preceding balance_scx(). Once that bug is fixed, we should
      verify that all occurrences of this error message are gone too.
      
      v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
          the per-cpu states are always synchronized with the global state.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Vernet <void@manifault.com>
    • sched/debug: Fix the runnable tasks output · 2cab4bd0
      Huang Shijie authored
      The current runnable tasks output looks like:
      
        runnable tasks:
         S            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
        -------------------------------------------------------------------------------------------------------------
         Ikworker/R-rcu_g     4         0.129049 E         0.620179           0.750000         0.002920         2   100         0.000000         0.002920         0.000000         0.000000 0 0 /
         Ikworker/R-sync_     5         0.125328 E         0.624147           0.750000         0.001840         2   100         0.000000         0.001840         0.000000         0.000000 0 0 /
         Ikworker/R-slub_     6         0.120835 E         0.628680           0.750000         0.001800         2   100         0.000000         0.001800         0.000000         0.000000 0 0 /
         Ikworker/R-netns     7         0.114294 E         0.634701           0.750000         0.002400         2   100         0.000000         0.002400         0.000000         0.000000 0 0 /
         I    kworker/0:1     9       508.781746 E       511.754666           3.000000       151.575240       224   120         0.000000       151.575240         0.000000         0.000000 0 0 /
      
      Which is messy. Remove the duplicate printing of sum_exec_runtime and
      tidy up the layout to make it look like:
      
        runnable tasks:
         S            task   PID       vruntime   eligible    deadline             slice          sum-exec      switches  prio         wait-time        sum-sleep       sum-block  node   group-id  group-path
        -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         I     kworker/0:3  1698       295.001459   E         297.977619           3.000000        38.862920         9     120         0.000000         0.000000         0.000000   0      0        /
         I     kworker/0:4  1702       278.026303   E         281.026303           3.000000         9.918760         3     120         0.000000         0.000000         0.000000   0      0        /
         S  NetworkManager  2646         0.377936   E           2.598104           3.000000        98.535880       314     120         0.000000         0.000000         0.000000   0      0        /system.slice/NetworkManager.service
         S       virtqemud  2689         0.541016   E           2.440104           3.000000        50.967960        80     120         0.000000         0.000000         0.000000   0      0        /system.slice/virtqemud.service
         S   gsd-smartcard  3058        73.604144   E          76.475904           3.000000        74.033320        88     120         0.000000         0.000000         0.000000   0      0        /user.slice/user-42.slice/session-c1.scope
      Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
      Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20240906053019.7874-1-shijie@os.amperecomputing.com
    • sched: Fix sched_delayed vs sched_core · c662e2b1
      Peter Zijlstra authored
      Completely analogous to commit dfa0a574 ("sched/uclamg: Handle
      delayed dequeue"), avoid double dequeue for the sched_core entries.
      
      Fixes: 152e11f6 ("sched/fair: Implement delayed dequeue")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • kernel/sched: Fix util_est accounting for DELAY_DEQUEUE · 729288bc
      Dietmar Eggemann authored
      Remove delayed tasks from util_est even if they are runnable.

      Exclude delayed tasks which are (a) migrating between rqs or (b) in a
      SAVE/RESTORE dequeue/enqueue.
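
      A sketch of the resulting enqueue-side condition (illustrative, not the exact
      hunk; the dequeue side tests DEQUEUE_SAVE analogously):

        /* skip util_est accounting for a delayed task when this enqueue is a
         * migration or the RESTORE half of a SAVE/RESTORE pair */
        if (!(p->se.sched_delayed &&
              (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
                util_est_enqueue(&rq->cfs, p);
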
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
    • kthread: Fix task state in kthread worker if being frozen · 6b9ccbc0
      Chen Yu authored
      When analyzing a kernel warning message, Peter pointed out that there is a race
      condition when the kworker is being frozen and falls into try_to_freeze() with
      TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze().
      Although the root cause is not related to freeze()[1], it is still worth fixing
      this issue ahead of time.
      
      One possible race scenario:
      
              CPU 0                                           CPU 1
              -----                                           -----
      
              // kthread_worker_fn
              set_current_state(TASK_INTERRUPTIBLE);
                                                             suspend_freeze_processes()
                                                               freeze_processes
                                                                 static_branch_inc(&freezer_active);
                                                               freeze_kernel_threads
                                                                 pm_nosig_freezing = true;
              if (work) { //false
                __set_current_state(TASK_RUNNING);
      
              } else if (!freezing(current)) //false, been frozen
      
                            freezing():
                            if (static_branch_unlikely(&freezer_active))
                              if (pm_nosig_freezing)
                                return true;
                schedule()
      	}
      
              // state is still TASK_INTERRUPTIBLE
              try_to_freeze()
                might_sleep() <--- warning
      
      Fix this by explicitly setting the task state to TASK_RUNNING before entering
      try_to_freeze().
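
      A sketch of the fixed section of kthread_worker_fn() (illustrative; the point
      is restoring the state on the branch that neither ran work nor scheduled):

        if (work) {
                __set_current_state(TASK_RUNNING);
                /* ... run the work item ... */
        } else if (!freezing(current)) {
                schedule();
        } else {
                /*
                 * Frozen in the meantime: still TASK_INTERRUPTIBLE here, but
                 * try_to_freeze() may sleep and expects TASK_RUNNING.
                 */
                __set_current_state(TASK_RUNNING);
        }

        try_to_freeze();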
      
      Fixes: b56c0d89 ("kthread: implement kthread_worker")
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
    • sched/pelt: Use rq_clock_task() for hw_pressure · 84d26528
      Chen Yu authored
      commit 97450eb9 ("sched/pelt: Remove shift of thermal clock")
      removed the decay_shift for hw_pressure. This commit uses the
      sched_clock_task() in sched_tick() while it replaces the
      sched_clock_task() with rq_clock_pelt() in __update_blocked_others().
      This could bring inconsistency. One possible scenario I can think of
      is in ___update_load_sum():
      
        u64 delta = now - sa->last_update_time
      
      'now' could be calculated by rq_clock_pelt() from
      __update_blocked_others(), and last_update_time was calculated by
      rq_clock_task() previously from sched_tick(). Usually the former lags behind
      the latter, which can cause a very large 'delta' and bring unexpected
      behavior.
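
      A sketch of the fixed call site in __update_blocked_others() (illustrative;
      the other *_load_avg() updates keep using rq_clock_pelt(rq)):

        /* hw_pressure is updated from sched_tick() with rq_clock_task(), so use
         * the same clock here instead of the PELT clock */
        decayed |= update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);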
      
      Fixes: 97450eb9 ("sched/pelt: Remove shift of thermal clock")
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
    • sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c · 5d871a63
      Vincent Guittot authored
      Move the effective_cpu_util() and sched_cpu_util() functions to fair.c,
      alongside the other utilization-related functions.
      
      No functional change.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
    • sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() · 3dcac251
      Peter Zijlstra authored
      Since commit b2a02fc4 ("smp: Optimize
      send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
      can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
      IPI without actually sending an interrupt. Even in cases where the IPI
      handler does not queue a task on the idle CPU, do_idle() will call
      __schedule() since need_resched() returns true in these cases.
      
      Introduce and use SM_IDLE to identify calls to __schedule() from
      schedule_idle() and shorten the idle re-entry time by skipping
      pick_next_task() when nr_running is 0 and the previous task is the idle
      task.
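
      A sketch of the fast path near the top of __schedule() (illustrative, using
      the names from this description):

        if (sched_mode == SM_IDLE) {
                /* nothing became runnable: keep running the idle task and skip
                 * the full pick_next_task() */
                if (!rq->nr_running) {
                        next = prev;
                        goto picked;
                }
        }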
      
      With the SM_IDLE fast-path, the time taken to complete a fixed set of
      IPIs using ipistorm improves noticeably. Following are the numbers
      from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
      3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
      running ipistorm between CPU8 and CPU16:
      
      cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
      
         ==================================================================
         Test          : ipistorm (modified)
         Units         : Normalized runtime
         Interpretation: Lower is better
         Statistic     : AMean
         ======================= Intel Ice Lake Xeon ======================
         kernel:				time [pct imp]
         tip:sched/core			1.00 [baseline]
         tip:sched/core + SM_IDLE		0.80 [20.51%]
         ==================== 3rd Generation AMD EPYC =====================
         kernel:				time [pct imp]
         tip:sched/core			1.00 [baseline]
         tip:sched/core + SM_IDLE		0.90 [10.17%]
         ==================================================================
      
      [ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
  4. 09 Sep, 2024 13 commits
    • scx_qmap: Implement highpri boosting · 2d285d56
      Tejun Heo authored
      Implement a silly boosting mechanism for nice -20 tasks. The only purpose is
      demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only
      works within SHARED_DSQ and makes only minor differences with increased
      dispatch batch (-b).
      
      This exercises moving tasks to a user DSQ and all local DSQs from
      ops.dispatch() and BPF timerfn.
      
      v2: - Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}().
      
          - Drop the workaround for the iterated tasks not being trusted by the
            verifier. The issue is fixed from BPF side.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: David Vernet <void@manifault.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
    • sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() · 4c30f5ce
      Tejun Heo authored
      Once a task is put into a DSQ, the allowed operations are fairly limited.
      Tasks in the built-in local and global DSQs are executed automatically and,
      ignoring dequeue, there is only one way a task in a user DSQ can be
      manipulated - scx_bpf_consume() moves the first task to the dispatching
      local DSQ. This inflexibility sometimes gets in the way and is an area where
      multiple feature requests have been made.
      
      Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
      DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
      user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
      which doesn't hold a rq lock, including BPF timers and SYSCALL programs.
      
      This is an expansion of an earlier patch which only allowed moving into the
      dispatching local DSQ:
      
        http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
      
      v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
          they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
          count limit and often won't be needed anyway. Instead provide
          scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
          only when needed and override the specified parameter for the subsequent
          dispatch.
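
      A hypothetical ops.dispatch() fragment modeled on the description above
      (example_dispatch and MY_DSQ are made-up names; BPF_FOR_EACH_ITER and
      bpf_for_each() are assumed to be the iteration helpers used by the in-tree
      example schedulers):

        void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
        {
                struct task_struct *p;

                bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
                        /* optionally override the slice for the next move */
                        scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER,
                                                            5000000);
                        /* move the first task found to this CPU's local DSQ */
                        if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                                      SCX_DSQ_LOCAL, 0))
                                break;
                }
        }
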
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: David Vernet <void@manifault.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
    • sched_ext: Compact struct bpf_iter_scx_dsq_kern · 6462dd53
      Tejun Heo authored
      struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
      using 5 of them. We want to add two more u64 fields but it's better if we do
      so while staying within scx_iter_scx_dsq to maintain binary compatibility.
      
      The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node
      field takes up three u64's but only one bit of the last u64 is used. Turn
      the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
      16 bits for flags, 32 bits for a u32 - for use by struct
      bpf_iter_scx_dsq_kern.
      
      This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
      into the cursor field reducing the struct size by a full u64.
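
      Roughly the intended layout after the change (a sketch; exact field names may
      differ):

        struct scx_dsq_list_node {
                struct list_head        node;
                u32                     flags;  /* only the lower 16 bits are used */
                u32                     priv;   /* usable by the iterator cursor */
        };

        struct bpf_iter_scx_dsq_kern {
                struct scx_dsq_list_node cursor; /* now also carries dsq_seq/flags */
                struct scx_dispatch_q   *dsq;
        } __attribute__((aligned(8)));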
      
      No behavior changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() · cf3e9443
      Tejun Heo authored
      - Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
      
      - Rename consume_local_task() to move_local_task_to_local_dsq() and remove
        task_unlink_from_dsq() and source DSQ unlocking from it.
      
      This is to make the migration code easier to reuse.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Move consume_local_task() upward · d434210e
      Tejun Heo authored
      So that the local case comes first and two CONFIG_SMP blocks can be merged.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() · 6557133e
      Tejun Heo authored
      All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
      task_unlink_from_dsq(). Also move the sanity check into it.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Reorder args for consume_local/remote_task() · 1389f490
      Tejun Heo authored
      Reorder args for consistency in the order of:
      
        current_rq, p, src_[rq|dsq], dst_[rq|dsq].
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Restructure dispatch_to_local_dsq() · 18f85699
      Tejun Heo authored
      Now that there's nothing left after the big if block, flip the if condition
      and unindent the body.
      
      No functional changes intended.
      
      v2: Add BUG() to clarify control can't reach the end of
          dispatch_to_local_dsq() in UP kernels per David.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling · 0aab2630
      Tejun Heo authored
      With the preceding update, the only return value which makes meaningful
      difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
      back to the global DSQ and the other, process_ddsp_deferred_locals(),
      doesn't do anything.
      
      It should always fall back to the global DSQ. Move the global DSQ fallback
      into dispatch_to_local_dsq() and remove the return value.
      
      v2: Patch title and description updated to reflect the behavior fix for
          process_ddsp_deferred_locals().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON · e683949a
      Tejun Heo authored
      find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
      Instead, each caller handles SCX_DSQ_LOCAL_ON before calling it. Move
      SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
      code in direct_dispatch() and dispatch_to_local_dsq().
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Refactor consume_remote_task() · 4d3ca89b
      Tejun Heo authored
      The tricky p->scx.holding_cpu handling was split across
      consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
      
      - All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
        consolidated documentation.
      
      - move_task_to_local_dsq() now implements straightforward task migration
        making it easier to use in other places.
      
      - dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The
        usage is updated accordingly. This makes the local and remote cases more
        symmetric.
      
      No functional changes intended.
      
      v2: s/task_rq/src_rq/ for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate · fdaedba2
      Tejun Heo authored
      Sleepables don't need to be in their own kfunc set as each is tagged with
      KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked, indicating that the rq lock
      is not held, and relocate it right above the "any" set. This will be used to
      add kfuncs that are allowed to be called from SYSCALL but not TRACING.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Add missing static to scx_dump_data · 3ac35279
      Tejun Heo authored
      scx_dump_data is only used inside ext.c but isn't declared static. Add the
      missing static.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
  5. 06 Sep, 2024 2 commits
  6. 04 Sep, 2024 10 commits
    • Merge branch 'bpf/master' into for-6.12 · 649e980d
      Tejun Heo authored
      Pull bpf/master to receive baebe9aa ("bpf: allow passing struct
      bpf_iter_<type> as kfunc arguments") and related changes in preparation for
      the DSQ iterator patchset.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Add a cgroup scheduler which uses flattened hierarchy · a4103eac
      Tejun Heo authored
      This patch adds scx_flatcg example scheduler which implements hierarchical
      weight-based cgroup CPU control by flattening the cgroup hierarchy into a
      single layer by compounding the active weight share at each level.
      
      This flattening of hierarchy can bring a substantial performance gain when
      the cgroup hierarchy is nested multiple levels. In a simple benchmark using
      wrk[8] on apache serving a CGI script calculating sha1sum of a small file,
      it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two
      apache instances competing with 2:1 weight ratio nested four levels deep.
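
      As a worked example with made-up numbers: if a leaf cgroup has weight 200 and
      its only active sibling has weight 600, its share at that level is
      200 / (200 + 600) = 25%; if its parent in turn holds a 50% share among the
      active children of the grandparent, the leaf's flattened (compounded) weight
      is 0.25 * 0.5 = 12.5% of the single top-level layer.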
      
      However, the gain comes at the cost of not being able to properly handle
      thundering herd of cgroups. For example, if many cgroups which are nested
      behind a low priority parent cgroup wake up around the same time, they may
      be able to consume more CPU cycles than they are entitled to. In many use
      cases, this isn't a real concern especially given the performance gain.
      Also, there are ways to mitigate the problem further by e.g. introducing an
      extra scheduling layer on cgroup delegation boundaries.
      
      v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of
            SCX_OPS_KNOB_CGROUP_WEIGHT.
      
      v4: - Revert reference counted kptr for cgv_node as the change caused easily
            reproducible stalls.
      
      v3: - Updated to reflect the core API changes including ops.init/exit_task()
            and direct dispatch from ops.select_cpu(). Fixes and improvements
            including additional statistics.
      
          - Use reference counted kptr for cgv_node instead of xchg'ing against
            stash location.
      
          - Dropped '-p' option.
      
      v2: - Use SCX_BUG[_ON]() to simplify error handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Add cgroup support · 81951366
      Tejun Heo authored
      Add sched_ext_ops operations to init/exit cgroups, and track task migrations
      and config changes. A BPF scheduler may not implement cgroup features or may
      implement only a subset of them. The implemented features can be indicated
      using %SCX_OPS_HAS_CGROUP_* flags. If cgroup configuration makes use of
      features that are not implemented, a warning is triggered.
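
      The cgroup-facing callbacks added to sched_ext_ops are roughly the following
      (a sketch of the signatures, not the authoritative prototypes):

        s32  (*cgroup_init)(struct cgroup *cgrp, struct scx_cgroup_init_args *args);
        void (*cgroup_exit)(struct cgroup *cgrp);
        s32  (*cgroup_prep_move)(struct task_struct *p,
                                 struct cgroup *from, struct cgroup *to);
        void (*cgroup_move)(struct task_struct *p,
                            struct cgroup *from, struct cgroup *to);
        void (*cgroup_cancel_move)(struct task_struct *p,
                                   struct cgroup *from, struct cgroup *to);
        void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);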
      
      While a BPF scheduler is being enabled and disabled, relevant cgroup
      operations are locked out using scx_cgroup_rwsem. This avoids situations
      like task prep taking place while the task is being moved across cgroups,
      making things easier for BPF schedulers.
      
      v7: - cgroup interface file visibility toggling is dropped in favor of just
            warning messages. Dynamically changing interface visibility caused more
            confusion than it helped.
      
      v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.
      
          - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
            !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.
      
      v5: - Flipped the locking order between scx_cgroup_rwsem and
            cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
            documentation around locking.
      
          - sched_move_task() takes an early exit if the source and destination
            are identical. This triggered the warning in scx_cgroup_can_attach()
            as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
            migration path so that ops.cgroup_prep_move() is skipped for identity
            migrations so that its invocations always match ops.cgroup_move()
            one-to-one.
      
      v4: - Example schedulers moved into their own patches.
      
          - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.
      
      v3: - Make scx_example_pair switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
          - scx_bpf_task_cgroup() is added to determine the current cgroup from
            CPU controller's POV. This allows BPF schedulers to accurately track
            CPU cgroup membership.
      
          - scx_example_flatcg added. This demonstrates flattened hierarchy
            implementation of CPU cgroup control and shows significant performance
            improvement when cgroups which are nested multiple levels are under
            competition.
      
      v2: - Build fixes for different CONFIG combinations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
    • sched: Introduce CONFIG_GROUP_SCHED_WEIGHT · e179e80c
      Tejun Heo authored
      sched_ext will soon add cgroup cpu.weight support. The cgroup interface code
      is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
      SCX may implement the feature, put the interface code behind the new
      CONFIG_GROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
      This allows either sched class to enable the interface code without adding
      more complex CONFIG tests.

      When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
      is added to support later CONFIG_GROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED builds.
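
      A sketch of what such a dummy could look like (illustrative only):

        #ifndef CONFIG_FAIR_GROUP_SCHED
        /* weight changes are a no-op when the fair class doesn't manage shares */
        static inline int sched_group_set_shares(struct task_group *tg,
                                                 unsigned long shares)
        {
                return 0;
        }
        #endif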
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Make cpu_shares_read_u64() use tg_weight() · 41082c1d
      Tejun Heo authored
      Move tg_weight() upward and make cpu_shares_read_u64() use it too. This
      makes the weight retrieval shared between cgroup v1 and v2 paths and will be
      used to implement cgroup support for sched_ext.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Expose css_tg() · 859dc4ec
      Tejun Heo authored
      A new BPF extensible sched_class will use css_tg() in the init and exit
      paths to visit all task_groups by walking cgroups.
      
      v4: __setscheduler_prio() is already exposed. Dropped from this patch.
      
      v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
          mechanism.
      
      v2: Expose SCHED_CHANGE_BLOCK() too and update the description.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable · a8532fac
      Tejun Heo authored
      During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
      on every task. To do this, it does get_task_struct() on each iterated task,
      drops the lock and then calls ops.init_task().
      
      However, a TASK_DEAD task may already have lost all its usage count and be
      waiting for an RCU grace period to be freed. If get_task_struct() is called
      on such a task, a use-after-free can happen. To avoid such situations,
      scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
      as they are never going to be scheduled again.
      
      Unfortunately, a racing sched_setscheduler(2) can grab the task before the
      task is unhashed and then continue to e.g. move the task from RT to SCX
      after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
      gone through scx_ops_init_task(), scx_ops_enable_task() called from
      switching_to_scx() triggers the following warning:
      
        sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
        WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
        ...
        RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x84e/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      As in the ops_disable path, it just doesn't seem like a good idea to leave
      any task in an inconsistent state, even when the task is dead. The root
      cause is ops_enable not being able to tell reliably whether a task is truly
      dead (no one else is looking at it and it's about to be freed) and was
      testing TASK_DEAD instead. Fix it by testing the task's usage count
      directly.
      
      - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
        tasks, @include_dead is removed from scx_task_iter_next_locked() along
        with dead task filtering.
      
      - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
        fails.
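
      A sketch of the helper's assumed shape - a failed tryget means the task has
      already dropped its last reference and must be skipped:

        static inline struct task_struct *tryget_task_struct(struct task_struct *t)
        {
                return refcount_inc_not_zero(&t->usage) ? t : NULL;
        }
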
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable · 61eeb9a9
      Tejun Heo authored
      scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
      calling scx_ops_exit_task() on all tasks including dead ones. This can leave
      a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.
      
      If another task was in the process of changing the TASK_DEAD task's
      scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
      done with the task, the task ends up calling scx_ops_disable_task() on the
      dead task which is in an inconsistent state triggering a warning:
      
        WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
        ...
        RIP: 0010:scx_ops_disable_task+0x12c/0x160
        ...
        Call Trace:
         <TASK>
         check_class_changed+0x2c/0x70
         __sched_setscheduler+0x8a0/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
        RIP: 0033:0x7f140d70ea5b
      
      There is no reason to leave dead tasks on SCX when unloading the BPF
      scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
      the dead ones from SCX.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Remove sched_class->switch_class() · 37cb049e
      Tejun Heo authored
      With sched_ext converted to use put_prev_task() for class switch detection,
      there's no user of switch_class() left. Drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • sched_ext: Remove switch_class_scx() · f422316d
      Tejun Heo authored
      Now that put_prev_task_scx() is called with @next on task switches, there's
      no reason to use sched_class.switch_class(). Rename switch_class_scx() to
      switch_class() and call it from put_prev_task_scx().
      Signed-off-by: Tejun Heo <tj@kernel.org>