  1. 18 Jun, 2024 8 commits
    • sched_ext: Implement runnable task stall watchdog · 8a010b81
      David Vernet authored
      The most common and critical way that a BPF scheduler can misbehave is by
      failing to run runnable tasks for too long. This patch implements a
      watchdog.
      
      * All tasks record when they become runnable.
      
      * A watchdog work periodically scans all runnable tasks. If any task has
        stayed runnable for too long, the BPF scheduler is aborted.
      
      * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
        BPF scheduler is aborted.
      
      Because the watchdog scans only the currently runnable tasks, and usually
      does so very infrequently, the overhead should be negligible.
      scx_qmap is updated so that it can be told to stall user and/or
      kernel tasks.
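
      The scan logic, as a rough sketch (field and function names below are
      illustrative and only loosely mirror the patch, not the exact
      implementation):

        static void scx_watchdog_workfn(struct work_struct *work)
        {
        	unsigned long timeout = scx_watchdog_timeout;	/* ops.timeout_ms, in jiffies */
        	struct task_struct *p;
        	int cpu;

        	for_each_online_cpu(cpu) {
        		struct rq *rq = cpu_rq(cpu);

        		raw_spin_rq_lock_irq(rq);
        		/* only the currently runnable tasks are scanned */
        		list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) {
        			/* each task recorded p->scx.runnable_at when it became runnable */
        			if (time_after(jiffies, p->scx.runnable_at + timeout)) {
        				scx_ops_error("runnable task stall (%s[%d])",
        					      p->comm, p->pid);
        				raw_spin_rq_unlock_irq(rq);
        				return;
        			}
        		}
        		raw_spin_rq_unlock_irq(rq);
        	}

        	/* let scheduler_tick() see that the watchdog itself checked in */
        	WRITE_ONCE(scx_watchdog_timestamp, jiffies);
        	queue_delayed_work(system_unbound_wq, to_delayed_work(work), timeout / 2);
        }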
      
      A detected task stall looks like the following:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
          scx_check_timeout_workfn+0x10e/0x1b0
          process_one_work+0x287/0x560
          worker_thread+0x234/0x420
          kthread+0xe9/0x100
          ret_from_fork+0x1f/0x30
      
      A detected watchdog stall:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
          scheduler_tick+0x2eb/0x340
          update_process_times+0x7a/0x90
          tick_sched_timer+0xd8/0x130
          __hrtimer_run_queues+0x178/0x3b0
          hrtimer_interrupt+0xfc/0x390
          __sysvec_apic_timer_interrupt+0xb7/0x2b0
          sysvec_apic_timer_interrupt+0x90/0xb0
          asm_sysvec_apic_timer_interrupt+0x1b/0x20
          default_idle+0x14/0x20
          arch_cpu_idle+0xf/0x20
          default_idle_call+0x50/0x90
          do_idle+0xe8/0x240
          cpu_startup_entry+0x1d/0x20
          kernel_init+0x0/0x190
          start_kernel+0x0/0x392
          start_kernel+0x324/0x392
          x86_64_start_reservations+0x2a/0x2c
          x86_64_start_kernel+0x104/0x109
          secondary_startup_64_no_verify+0xce/0xdb
      
      Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
      inline scx_notify_sched_tick().
      
      v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
            being called before forward progress was guaranteed and thus could
            lead to system lockup. Relocated.
      
          - While enabling, it was comparing msecs against jiffies without
            conversion leading to spurious load failures on lower HZ kernels.
            Fixed.
      
          - runnable list management is now used by core bypass logic and moved to
            the patch implementing sched_ext core.
      
      v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
            against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
            conversion leading to spurious load failures in lower HZ kernels.
            Fixed.
      
      v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
            jiffies. Fix by using jiffies for everything.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
    • sched_ext: Implement BPF extensible scheduler class · f0e1a064
      Tejun Heo authored
      Implement a new scheduler class sched_ext (SCX), which allows scheduling
      policies to be implemented as BPF programs to achieve the following:
      
      1. Ease of experimentation and exploration: Enabling rapid iteration of new
         scheduling policies.
      
      2. Customization: Building application-specific schedulers which implement
         policies that are not applicable to general-purpose schedulers.
      
      3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
         policies in production environments.
      
      sched_ext leverages BPF’s struct_ops feature to define a structure which
      exports function callbacks and flags to BPF programs that wish to implement
      scheduling policies. The struct_ops structure exported by sched_ext is
      struct sched_ext_ops, and is conceptually similar to struct sched_class. The
      role of sched_ext is to map the complex sched_class callbacks to the simpler
      and more ergonomic struct sched_ext_ops callbacks.
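
      To give a sense of the interface's shape, a minimal global-FIFO scheduler
      could look roughly like the sketch below (modeled on the example schedulers
      added later in the series; the header path and helper macros are assumed):

        /* minimal.bpf.c - illustrative sketch, not one of the in-tree examples */
        #include <scx/common.bpf.h>

        char _license[] SEC("license") = "GPL";

        /* Every callback that is left out falls back to a simple default. */
        void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
        {
        	/* Queue everything on the shared global DSQ with the default slice. */
        	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
        }

        SEC(".struct_ops.link")
        struct sched_ext_ops minimal_ops = {
        	.enqueue	= (void *)minimal_enqueue,
        	.name		= "minimal",
        };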
      
      For more detailed discussion on the motivations and overview, please refer
      to the cover letter.
      
      Later patches will also add several example schedulers and documentation.
      
      This patch implements the minimum core framework to enable implementation of
      BPF schedulers. Subsequent patches will gradually add functionalities
      including safety guarantee mechanisms, nohz and cgroup support.
      
      include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
      top, each operation should be self-explanatory. The following are worth
      noting:
      
      - Both "sched_ext" and its shorthand "scx" are used. If the identifier
        already has "sched" in it, "ext" is used; otherwise, "scx".
      
      - In sched_ext_ops, only .name is mandatory. Every operation is optional and
        if omitted a simple but functional default behavior is provided.
      
      - A new policy constant SCHED_EXT is added and a task can select sched_ext
        by invoking sched_setscheduler(2) with the new policy constant. However,
        if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
        and the task is scheduled by CFS. When the BPF scheduler is loaded, all
        tasks which have the SCHED_EXT policy are switched to sched_ext (see the
        userspace sketch after this list).
      
      - To bridge the workflow imbalance between the scheduler core and
        sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
        queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
        one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
        convenience and need not be used by a scheduler that doesn't require it.
        SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
        the next task on the CPU. The BPF scheduler can manage an arbitrary number
        of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
      
      - sched_ext guarantees system integrity no matter what the BPF scheduler
        does. To enable this, each task's ownership is tracked through
        p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
        can always recover and revert all tasks back to CFS. See p->scx.ops_state
        and scx_tasks.
      
      - A task is not tied to its rq while enqueued. This decouples CPU selection
        from queueing and allows sharing a scheduling queue across an arbitrary
        subset of CPUs. This adds some complexities as a task may need to be
        bounced between rq's right before it starts executing. See
        dispatch_to_local_dsq() and move_task_to_local_dsq().
      
      - One complication that arises from the above weak association between task
        and rq is that synchronizing with dequeue() gets complicated as dequeue()
        may happen anytime while the task is enqueued and the dispatch path might
        need to release the rq lock to transfer the task. Solving this requires a
        bit of complexity. See the logic around p->scx.sticky_cpu and
        p->scx.ops_qseq.
      
      - Both enable and disable paths are a bit complicated. The enable path
        switches all tasks without blocking to avoid issues which can arise from
        partially switched states (e.g. the switching task itself being starved).
        The disable path can't trust the BPF scheduler at all, so it also has to
        guarantee forward progress without blocking. See scx_ops_enable() and
        scx_ops_disable_workfn().
      
      - When sched_ext is disabled, static_branches are used to shut down the
        entry points from hot paths.
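
      For the SCHED_EXT policy selection mentioned in the list above, a userspace
      snippet would look roughly as follows (the numeric value of SCHED_EXT is
      assumed here for the case where the libc headers don't define it yet):

        #include <sched.h>
        #include <stdio.h>

        #ifndef SCHED_EXT
        #define SCHED_EXT 7	/* assumed value of the new policy constant */
        #endif

        int main(void)
        {
        	struct sched_param param = { .sched_priority = 0 };

        	/*
        	 * With no BPF scheduler loaded this behaves like SCHED_NORMAL;
        	 * once a scheduler is attached, the task runs under sched_ext.
        	 */
        	if (sched_setscheduler(0, SCHED_EXT, &param))
        		perror("sched_setscheduler");
        	else
        		printf("policy is now %d\n", sched_getscheduler(0));
        	return 0;
        }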
      
      v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
            scx_ops_enable_mutex which can lead to deadlocks in the disable path.
            Fixed.
      
          - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
            to use-after-free.
      
          - Consolidated per-cpu variable usages and other cleanups.
      
      v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
            groups can be expressed. Later CPU hotplug operations are put into
            their own group.
      
          - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
            which allows temporarily putting the system into simple FIFO
            scheduling mode bypassing the BPF scheduler. In addition to the shut
            down path, this will also be used to isolate the BPF scheduler across
            PM events. Enabling and disabling the bypass mode requires iterating
            all runnable tasks. rq->scx.runnable_list addition is moved from the
            later watchdog patch.
      
          - ops.prep_enable() is replaced with ops.init_task() and
            ops.enable/disable() are now called whenever the task enters and
            leaves sched_ext instead of when the task becomes schedulable on
            sched_ext and stops being so. A new operation - ops.exit_task() - is
            called when the task stops being schedulable on sched_ext.
      
          - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
            removes the need for communicating local dispatch decision made by
            ops.select_cpu() to ops.enqueue() via per-task storage.
            SCX_KF_SELECT_CPU is added to support the change.
      
          - SCX_TASK_ENQ_LOCAL which told the BPF scheduler that
            scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
            was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
            if it finds a suitable idle CPU. If such behavior is not desired,
            users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
            bool out param.
      
          - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
            queueing many tasks on a local DSQ, which made tasks execute in order
            while other CPUs stayed idle and made some hackbench numbers really
            bad. Fixed.
      
          - The current state of sched_ext can now be monitored through files
            under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
            to enable monitoring on kernels which don't enable debugfs.
      
          - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
            be NULL and a BPF scheduler which derefs the pointer without checking
            could crash the kernel. Tell BPF. This is currently a bit ugly. A
            better way to annotate this is expected in the future.
      
          - scx_exit_info updated to carry pointers to message buffers instead of
            embedding them directly. This decouples buffer sizes from API so that
            they can be changed without breaking compatibility.
      
          - exit_code added to scx_exit_info. This is used to indicate different
            exit conditions on non-error exits and will be used to handle e.g. CPU
            hotplugs.
      
          - The patch "sched_ext: Allow BPF schedulers to switch all eligible
            tasks into sched_ext" is folded in and the interface is changed so
            that partial switching is indicated with a new ops flag
            %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary
            and in turn SCX_KF_INIT. ops.init() is now called with
            SCX_KF_SLEEPABLE.
      
          - Code reorganized so that only the parts necessary to integrate with
            the rest of the kernel are in the header files.
      
          - Changes to reflect the BPF and other kernel changes including the
            addition of bpf_sched_ext_ops.cfi_stubs.
      
      v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
            instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
            load_acquire/store_release is now unsigned long instead of u64.
      
          - Fix the bug where bpf_scx_btf_struct_access() was allowing write
            access to arbitrary fields.
      
          - Distinguish kfuncs which can be called from any sched_ext ops and from
            anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
            sched_ext ops.
      
          - Rename "type" to "kind" in scx_exit_info to make it easier to use on
            languages in which "type" is a reserved keyword.
      
          - Since cff9b233 ("kernel/sched: Modify initial boot task idle
            setup"), PF_IDLE is not set on idle tasks which haven't been online
            yet which made scx_task_iter_next_filtered() include those idle tasks
            in iterations leading to oopses. Update scx_task_iter_next_filtered()
            to directly test p->sched_class against idle_sched_class instead of
            using is_idle_task() which tests PF_IDLE.
      
          - Other updates to match upstream changes such as adding const to
            set_cpumask() param and renaming check_preempt_curr() to
            wakeup_preempt().
      
      v4: - SCHED_CHANGE_BLOCK replaced with the previous
            sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
            because upstream is adopting a different generic cleanup mechanism.
            Once that lands, the code will be adapted accordingly.
      
          - task_on_scx() used to test whether a task should be switched into SCX,
            which is confusing. Renamed to task_should_scx(). task_on_scx() now
            tests whether a task is currently on SCX.
      
          - scx_has_idle_cpus is barely used anymore and replaced with direct
            check on the idle cpumask.
      
          - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
            fully idle cores.
      
          - ops.enable() now sees up-to-date p->scx.weight value.
      
          - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
            schedulers expecting ->select_cpu() call.
      
          - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
            of the scheduler.
      
      v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
            without polling p->scx.weight.
      
          - move_task_to_local_dsq() was losing SCX-specific enq_flags when
            enqueueing the task on the target dsq because it goes through
            activate_task() which loses the upper 32bit of the flags. Carry the
            flags through rq->scx.extra_enq_flags.
      
          - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
            and scx_bpf_task_cpu() now use the new KF_RCU instead of
            KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
      
          - The kfunc helper access control mechanism implemented through
            sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
            used when invoking scx_ops operations.
      
      v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
            called from put_prev_task_scx() and pick_next_task_scx() as necessary.
            To determine whether balance_scx() should be called from
            put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
            comment in put_prev_task_scx() for details.
      
          - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
            with SCHED_CHANGE_BLOCK().
      
          - Unused all_dsqs list removed. This was a left-over from previous
            iterations.
      
          - p->scx.kf_mask is added to track and enforce which kfunc helpers are
            allowed. Also, init/exit sequences are updated to make some kfuncs
            always safe to call regardless of the current BPF scheduler state.
            Combined, this should make all the kfuncs safe.
      
          - BPF now supports sleepable struct_ops operations. Hacky workaround
            removed and operations and kfunc helpers are tagged appropriately.
      
          - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
            and friends are added so that BPF schedulers can use the idle masks
            with the generic helpers. This replaces the hacky kfunc helpers added
            by a separate patch in V1.
      
          - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
            enabled. This restriction will be removed by a later patch which adds
            core-sched support.
      
          - Add MAINTAINERS entries and other misc changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Co-authored-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
    • sched_ext: Add boilerplate for extensible scheduler class · a7a9fc54
      Tejun Heo authored
      This adds dummy implementations of sched_ext interfaces which interact with
      the scheduler core and hook them in the correct places. As they're all
      dummies, this doesn't cause any behavior changes. This is split out to help
      reviewing.
      
      v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched: Factor out cgroup weight conversion functions · 4f9c7ca8
      Tejun Heo authored
      Factor out sched_weight_from/to_cgroup() which convert between scheduler
      shares and cgroup weight. No functional change. The factored out functions
      will be used by a new BPF extensible sched_class so that the weights can be
      exposed to the BPF programs in a way which is consistent with cgroup weights and
      easier to interpret.
      
      The weight conversions will be used regardless of cgroup usage. It's just
      borrowing the cgroup weight range as it's more intuitive.
      CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
      the conversion helpers can always be defined.
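
      A sketch of the conversion, assuming the scheduler's nice-0 weight scale of
      1024 and the cgroup weight range [CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX] =
      [1, 10000] with CGROUP_WEIGHT_DFL = 100 (the in-tree helpers may differ in
      detail):

        static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
        {
        	/* cgroup weight 100 (DFL) maps to scheduler weight 1024 */
        	return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
        }

        static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
        {
        	return clamp_t(unsigned long,
        		       DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
        		       CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
        }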
      
      v2: The helpers are now defined regardless of CONFIG_CGROUPS.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched: Add sched_class->switching_to() and expose check_class_changing/changed() · d8c7bc2e
      Tejun Heo authored
      When a task switches to a new sched_class, the prev and new classes are
      notified through ->switched_from() and ->switched_to(), respectively, after
      the switching is done.
      
      A new BPF extensible sched_class will have callbacks that allow the BPF
      scheduler to keep track of relevant task states (like priority and cpumask).
      Those callbacks aren't called while a task is on a different sched_class.
      When a task comes back, we want to tell the BPF programs the up-to-date state
      before the task gets enqueued, so we need a hook which is called before the
      switching is committed.
      
      This patch adds ->switching_to() which is called during sched_class switch
      through check_class_changing() before the task is restored. Also, this patch
      exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
      used by the new BPF extensible sched_class to implement implicit sched_class
      switching which is used e.g. when falling back to CFS when the BPF scheduler
      fails or unloads.
      
      This is a prep patch and doesn't cause any behavior changes. The new
      operation and exposed functions aren't used yet.
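
      A sketch of the new hook's call site, shaped as the description above
      suggests (details may differ from the actual patch):

        static inline void check_class_changing(struct rq *rq, struct task_struct *p,
        					const struct sched_class *prev_class)
        {
        	/*
        	 * Called before the class change is committed, i.e. before the
        	 * task is restored/requeued, so the new class can observe and
        	 * record up-to-date task state.
        	 */
        	if (prev_class != p->sched_class && p->sched_class->switching_to)
        		p->sched_class->switching_to(rq, p);
        }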
      
      v3: Refreshed on top of tip:sched/core.
      
      v2: Improve patch description w/ details on planned use.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched: Add sched_class->reweight_task() · e83edbf8
      Tejun Heo authored
      Currently, during a task weight change, sched core directly calls
      reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
      sched_class operation instead. CFS's reweight_task() is renamed to
      reweight_task_fair() and now called through sched_class.
      
      While it turns a direct call into an indirect one, set_load_weight() isn't
      called from a hot path and this change shouldn't cause any noticeable
      difference. This will be used to implement reweight_task for a new BPF
      extensible sched_class so that it can keep its cached task weight
      up-to-date.
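
      The shape of the change, roughly (the callback's exact signature is assumed
      here and may differ from the patch):

        /* New operation in struct sched_class (illustrative):
         *	void (*reweight_task)(struct rq *rq, struct task_struct *p, int newprio);
         */
        static void set_load_weight(struct task_struct *p, bool update_load)
        {
        	int prio = p->static_prio - MAX_RT_PRIO;

        	/* (lookup of sched_prio_to_weight[]/sched_prio_to_wmult[] elided) */

        	/* Previously a direct call to fair.c's reweight_task(p, prio). */
        	if (update_load && p->sched_class->reweight_task)
        		p->sched_class->reweight_task(task_rq(p), p, prio);
        }
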
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() · 304b3f2b
      Tejun Heo authored
      A new BPF extensible sched_class will need more control over the forking
      process. It wants to be able to fail from sched_cgroup_fork() after the new
      task's sched_task_group is initialized so that the loaded BPF program can
      prepare the task once its cgroup association is established and reject the
      fork if e.g. an allocation fails.
      
      Allow sched_cgroup_fork() to fail by making it return int instead of void
      and adding sched_cancel_fork() to undo sched_fork() in the error path.
      
      sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
      behavior changes.
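
      The intended caller-side pattern in copy_process() would look roughly like
      this (the label name and surrounding error unwinding are illustrative):

        	retval = sched_cgroup_fork(p, args);	/* now returns int */
        	if (retval)
        		goto bad_fork_cancel_sched;
        	/* ... */

        bad_fork_cancel_sched:
        	sched_cancel_fork(p);			/* undo the earlier sched_fork() */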
      
      v2: Patch description updated to detail the expected use.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched: Restructure sched_class order sanity checks in sched_init() · df268382
      Tejun Heo authored
      Currently, sched_init() checks that the sched_class'es are in the expected
      order by testing each adjacency which is a bit brittle and makes it
      cumbersome to add optional sched_class'es. Instead, let's verify whether
      they're in the expected order using sched_class_above() which is what
      matters.
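
      The resulting checks, sketched with the existing sched_class_above() helper
      (the stop class only exists on SMP; the exact set of assertions may differ):

        #ifdef CONFIG_SMP
        	BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
        #endif
        	BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
        	BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
        	BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
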
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
  2. 05 Jun, 2024 1 commit
  3. 27 May, 2024 2 commits
    • sched: Fix spelling in comments · 402de7fc
      Ingo Molnar authored
      Do a spell-checking pass.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c · 04746ed8
      Ingo Molnar authored
      core.c has become rather large, so move most scheduler syscall
      related functionality into a separate file, syscalls.c.
      
      This is about ~15% of core.c's raw linecount.
      
      Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
      rt_effective_prio(), uclamp_none(), uclamp_se_set()
      and uclamp_bucket_id() inlines to kernel/sched/sched.h.
      
      Internally export the __sched_setscheduler(), __sched_setaffinity(),
      __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
      check_class_changed(), splice_balance_callbacks() and balance_callbacks()
      methods to better facilitate this.
      
      Move the new file's build to build_policy.c, because it fits there
      semantically, but also because it's the smallest of the 4 build units
      under an allmodconfig build:
      
        -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
        -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
        -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
        -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i
      
      This better balances build time for scheduler subsystem rebuilds.
      
      I build-tested this new file as a standalone syscalls.o file for a bit,
      to make sure all the encapsulations & abstractions are robust.
      
      Also update/add my copyright notices to these files.
      
      Build time measurements:
      
       # -Before/+After:
      
       kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null
      
       Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):
      
       -    71,938,508,607      cycles                                                                  ( +-  0.17% )
       +    71,992,916,493      cycles                                                                  ( +-  0.22% )
       -   106,214,780,964      instructions                     #    1.48  insn per cycle              ( +-  0.01% )
       +   105,450,231,154      instructions                     #    1.46  insn per cycle              ( +-  0.01% )
       -     5,878,232,620 ns   duration_time                                                           ( +-  0.38% )
       +     5,290,085,069 ns   duration_time                                                           ( +-  0.21% )
      
       -            5.8782 +- 0.0221 seconds time elapsed  ( +-  0.38% )
       +            5.2901 +- 0.0111 seconds time elapsed  ( +-  0.21% )
      
      Build time improvement of -11.1% (duration_time) is expected: the
      parallel build time of the scheduler subsystem is determined by the
      largest, slowest to build object file, which is kernel/sched/core.o.
      By moving ~15% of its complexity into another build unit, we reduced
      build time by -11%.
      
      Measured cycles spent on building is within its ~0.2% stddev noise envelope.
      
      The -0.7% reduction in instructions spent on building the scheduler is
      statistically reliable and somewhat surprising - I can only speculate:
      maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
      (core.c), and it's an overall win to balance the linecount a bit.
      
      Anyway, this might be a data point that suggests that reducing the linecount
      of our largest files will improve not just code readability and maintainability,
      but might also improve build times a bit.
      
      Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:
      
        # -Before/+After:
      
        kepler:~/tip> size vmlinux
           text	   data	    bss	    dec	    hex	filename
        -26475475	10439178	1740804	38655457	24dd5e1	vmlinux
        +26476003	10439178	1740804	38655985	24dd7f1	vmlinux
      
        kepler:~/tip> size kernel/sched/built-in.a
           text	   data	    bss	    dec	    hex	filename
        - 76056	  30025	    489	 106570	  1a04a	kernel/sched/core.o (ex kernel/sched/built-in.a)
        + 63452	  29453	    489	  93394	  16cd2	kernel/sched/core.o (ex kernel/sched/built-in.a)
          44299	   2181	    104	  46584	   b5f8	kernel/sched/fair.o (ex kernel/sched/built-in.a)
        - 42764	   3424	    120	  46308	   b4e4	kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
        + 55651	   4044	    120	  59815	   e9a7	kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
          44866	  12655	   2192	  59713	   e941	kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
      
      This is primarily due to the extra functions exported, and the size
      gets exaggerated somewhat by __pfx CFI function padding:
      
      	ffffffff810cc710 <__pfx_enqueue_task>:
      	ffffffff810cc710:	90                   	nop
      	ffffffff810cc711:	90                   	nop
      	ffffffff810cc712:	90                   	nop
      	ffffffff810cc713:	90                   	nop
      	ffffffff810cc714:	90                   	nop
      	ffffffff810cc715:	90                   	nop
      	ffffffff810cc716:	90                   	nop
      	ffffffff810cc717:	90                   	nop
      	ffffffff810cc718:	90                   	nop
      	ffffffff810cc719:	90                   	nop
      	ffffffff810cc71a:	90                   	nop
      	ffffffff810cc71b:	90                   	nop
      	ffffffff810cc71c:	90                   	nop
      	ffffffff810cc71d:	90                   	nop
      	ffffffff810cc71e:	90                   	nop
      	ffffffff810cc71f:	90                   	nop
      
      AFAICS the cost is primarily not to core.o and fair.o though (which contain
      most of the performance-sensitive scheduler functions), but only to syscalls.o,
      whose functions get called with much lower frequency - so I think this is an
      acceptable trade-off for better code separation.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org
  4. 17 May, 2024 1 commit
  5. 24 Apr, 2024 3 commits
  6. 12 Mar, 2024 2 commits
  7. 24 Feb, 2024 1 commit
  8. 16 Feb, 2024 1 commit
  9. 15 Feb, 2024 3 commits
  10. 05 Feb, 2024 1 commit
  11. 27 Dec, 2023 1 commit
  12. 23 Nov, 2023 1 commit
    • sched/cpufreq: Rework schedutil governor performance estimation · 9c0b4bb7
      Vincent Guittot authored
      The current method of taking uclamp hints into account when estimating the
      target frequency can end up in a situation where the selected target
      frequency is higher than the uclamp hints even though there is no real
      need. Such cases mainly happen because we currently mix the traditional
      scheduler utilization signal with the uclamp performance hints. By adding
      these 2 metrics, we lose important information when it comes to selecting
      the target frequency, and we have to make assumptions which can't fit all
      cases.
      
      Rework the interface between the scheduler and schedutil governor in order
      to propagate all information down to the cpufreq governor.
      
      effective_cpu_util() interface changes and now returns the actual
      utilization of the CPU with 2 optional inputs:
      
      - The minimum performance for this CPU; typically the capacity needed to
        handle the deadline task and the interrupt pressure, but also the
        uclamp_min request when available.
      
      - The maximum targeted performance for this CPU, which reflects the
        maximum level that we would like not to exceed. By default it will be
        the CPU capacity, but it can be reduced because of some performance
        hints set with uclamp. The value can be lower than the actual
        utilization and/or the min performance level.
      
      A new sugov_effective_cpu_perf() interface is also available to compute
      the final performance level that is targeted for the CPU, after applying
      some cpufreq headroom and taking into account all inputs.
      
      With these 2 functions, schedutil is now able to decide when it must go
      above uclamp hints. It now also has a generic way to get the min
      performance level.
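
      Roughly, the new helper combines the three inputs as follows (a sketch; the
      headroom scaling and exact form may differ from the actual implementation,
      with map_util_perf() standing in for the existing cpufreq headroom factor):

        static unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
        					      unsigned long min, unsigned long max)
        {
        	/* Add the cpufreq/DVFS headroom to the actual utilization. */
        	actual = map_util_perf(actual);

        	/* No need to target more than what is actually needed ... */
        	if (actual < max)
        		max = actual;

        	/* ... while still honoring the minimum performance request. */
        	return max(min, max);
        }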
      
      The dependency between energy model and cpufreq governor and its headroom
      policy doesn't exist anymore.
      
      eenv_pd_max_util() asks schedutil for the targeted performance after
      applying the impact of the waking task.
      
      [ mingo: Refined the changelog & C comments. ]
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael@kernel.org>
      Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
  13. 15 Nov, 2023 4 commits
  14. 24 Oct, 2023 3 commits
  15. 18 Oct, 2023 1 commit
  16. 13 Oct, 2023 1 commit
    • sched: Fix stop_one_cpu_nowait() vs hotplug · f0498d2a
      Peter Zijlstra authored
      Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
      hotplug stress-test -- notably affine_move_task() remains stuck in
      wait_for_completion(), leading to a hung-task detector warning.
      
      Specifically, it was reported that stop_one_cpu_nowait(.fn =
      migration_cpu_stop) returns false -- this stopper is responsible for
      the matching complete().
      
      The race scenario is:
      
      	CPU0					CPU1
      
      					// doing _cpu_down()
      
        __set_cpus_allowed_ptr()
          task_rq_lock();
      					takedown_cpu()
      					  stop_machine_cpuslocked(take_cpu_down..)
      
      					<PREEMPT: cpu_stopper_thread()
      					  MULTI_STOP_PREPARE
      					  ...
          __set_cpus_allowed_ptr_locked()
            affine_move_task()
              task_rq_unlock();
      
        <PREEMPT: cpu_stopper_thread()\>
          ack_state()
      					  MULTI_STOP_RUN
      					    take_cpu_down()
      					      __cpu_disable();
      					      stop_machine_park();
      						stopper->enabled = false;
      					 />
         />
      	stop_one_cpu_nowait(.fn = migration_cpu_stop);
                if (stopper->enabled) // false!!!
      
      That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
      stopper thread gets a chance to preempt and allows the cpu-down for
      the target CPU to complete.
      
      OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
      issue a wakeup, it must not be run under the scheduler locks.
      
      Solve this apparent contradiction by keeping preemption disabled over
      the unlock + queue_stopper combination:
      
      	preempt_disable();
      	task_rq_unlock(...);
      	if (!stop_pending)
      	  stop_one_cpu_nowait(...)
      	preempt_enable();
      
      This respects the lock ordering constraints while still avoiding the
      above race. That is, if we find the CPU is online under rq-lock, the
      targeted stop_one_cpu_nowait() must succeed.
      
      Apply this pattern to all similar stop_one_cpu_nowait() invocations.
      
      Fixes: 6d337eab ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
      Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
      Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
  17. 09 Oct, 2023 1 commit
  18. 07 Oct, 2023 2 commits
  19. 03 Oct, 2023 1 commit
  20. 29 Sep, 2023 1 commit
  21. 24 Sep, 2023 1 commit