18 Jun, 2024 (24 commits)
    • sched_ext: Implement SCX_KICK_WAIT · 90e55164
      David Vernet authored
      If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
      the kicked CPU to enter the scheduler. See the following for example usage:
      
        https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
      
      v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
      
          - Include SCX_KICK_WAIT related information in debug dump.
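      The wait handshake can be modeled in userspace: the kicker snapshots the
      target CPU's scheduling-path sequence number and waits for it to advance.
      A minimal Python sketch with invented names (Cpu, pseq, kick_cpu_wait)
      standing in for the kernel's per-CPU state, not the actual implementation:

```python
import threading

class Cpu:
    """Userspace stand-in for a CPU with a scheduling-path sequence counter."""
    def __init__(self):
        self.pseq = 0                      # bumped each time the CPU enters the scheduler
        self.cond = threading.Condition()

    def enter_scheduler(self):
        with self.cond:
            self.pseq += 1
            self.cond.notify_all()

def kick_cpu_wait(cpu, started):
    # SCX_KICK_WAIT semantics: snapshot the target's sequence number, then
    # wait until the target has entered the scheduling path at least once.
    with cpu.cond:
        seen = cpu.pseq
        started.set()                      # let the "kicked" CPU proceed
        while cpu.pseq == seen:
            cpu.cond.wait()

cpu, started = Cpu(), threading.Event()
waiter = threading.Thread(target=kick_cpu_wait, args=(cpu, started))
waiter.start()
started.wait()          # ensure the snapshot was taken before kicking
cpu.enter_scheduler()   # the kicked CPU enters the scheduler, releasing the waiter
waiter.join(timeout=5)
```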
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Track tasks that are subjects of the in-flight SCX operation · 36454023
      Tejun Heo authored
      When some SCX operations are in flight, it is known that the subject task's
      rq lock is held throughout, which makes it safe to access certain fields of
      the task - e.g. its current task_group. We want to add SCX kfunc helpers
      that can make use of this guarantee - e.g. to help determine the CPU
      cgroup currently associated with a task from its current task_group.

      As it'd be dangerous to call such a helper on a task which isn't rq lock
      protected, the helper should be able to verify the input task and reject
      it accordingly. This patch adds sched_ext_entity.kf_tasks[], which tracks
      the tasks currently being operated on by a terminal SCX operation. The new
      SCX_CALL_OP_[2]TASK[_RET]() macros can be used when invoking SCX operations
      which take tasks as arguments, and scx_kf_allowed_on_arg_tasks() can be
      used by kfunc helpers to verify the input task status.
      
      Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
      is currently limited to terminal SCX operations. If needed in the future,
      this restriction can be removed by moving the tracking to the task side
      with a couple of per-task counters.
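      The tracking scheme can be sketched with simplified stand-ins for the
      kernel macros; kf_tasks, call_op_task and kf_allowed_on_arg_task are
      illustrative names, not the actual code:

```python
# Slots for the (at most two) tasks the in-flight terminal operation is
# operating on; cleared again when the operation returns.
kf_tasks = [None, None]

def call_op_task(op, *tasks):
    """Stand-in for SCX_CALL_OP_[2]TASK(): record the subject tasks for the
    duration of the operation so kfunc helpers can verify their arguments."""
    for i, t in enumerate(tasks):
        kf_tasks[i] = t
    try:
        return op(*tasks)
    finally:
        for i in range(len(tasks)):
            kf_tasks[i] = None

def kf_allowed_on_arg_task(task):
    """Stand-in for scx_kf_allowed_on_arg_tasks(): accept only tasks that are
    subjects of the in-flight operation (and hence rq-lock protected)."""
    return task in kf_tasks
```

      A helper invoked from inside an operation sees its argument in kf_tasks
      and accepts it; the same call outside any operation is rejected.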
      
      v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
    • sched_ext: Implement tickless support · 22a92020
      Tejun Heo authored
      Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
      to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into
      tickless operation.
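      The per-tick check reduces to a comparison against the sentinel; a minimal
      sketch assuming an illustrative SCX_SLICE_INF value:

```python
SCX_SLICE_INF = 2**64 - 1   # assumed sentinel for "infinite slice" (~U64_MAX)

def sched_tick_needed(curr_slice):
    # A CPU whose current task has an infinite slice can go tickless; slice
    # expiry then has to come from elsewhere (scx_central uses a BPF timer).
    return curr_slice != SCX_SLICE_INF

print(sched_tick_needed(20_000_000))     # normal 20ms slice: keep ticking
print(sched_tick_needed(SCX_SLICE_INF))  # infinite slice: stop the tick
```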
      
      scx_central is updated to use tickless operation for all tasks, expiring
      slices with a BPF timer instead. It also uses the SCX_ENQ_PREEMPT flag and
      the task state tracking added by the previous patches.
      
      Currently, there is no way to pin the timer on the central CPU, so it may
      end up on one of the worker CPUs; however, outside of that, the worker CPUs
      can go tickless both while running sched_ext tasks and idling.
      
      With schbench running, scx_central shows:
      
        root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     142024        656        664        449   Local timer interrupts
        LOC:     161663        663        665        449   Local timer interrupts
      
      Without it:
      
        root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     188778       3142       3793       3993   Local timer interrupts
        LOC:     198993       5314       6323       6438   Local timer interrupts
      
      While scx_central itself is too barebones to be useful as a production
      scheduler, a more featureful central scheduler can be built using the same
      approach. Google's experience shows that such an approach can have
      significant benefits for certain applications such as VM hosting.
      
      v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.
      
      v3: Pin the central scheduler's timer on the central_cpu using
          BPF_F_TIMER_CPU_PIN.
      
      v2: Convert to BPF inline iterators.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Add task state tracking operations · 1c29f854
      Tejun Heo authored
      Being able to track task runnable and running state transitions is useful
      for a variety of purposes including latency tracking and load factor
      calculation.
      
      Currently, BPF schedulers don't have a good way of tracking these
      transitions. Becoming runnable can be determined from ops.enqueue() but
      becoming quiescent can only be inferred from the lack of subsequent enqueue.
      Also, as the local dsq can have multiple tasks and some events are handled
      in the sched_ext core, it's difficult to determine when a given task starts
      and stops executing.
      
      This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
      .quiescent() operations to track the task runnable and running state
      transitions. They're mostly self-explanatory; however, we want to ensure
      that running <-> stopping transitions are always contained within runnable
      <-> quiescent transitions, which is a bit different from how the scheduler
      core behaves and adds a bit of complication. See the comment in
      dequeue_task_scx().
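      The nesting invariant can be modeled with a small wrapper around the four
      ops; a sketch with simplified names, not kernel code:

```python
class StateTracker:
    """Enforce that running <-> stopping is always nested inside
    runnable <-> quiescent, regardless of call order from the core."""
    def __init__(self, ops):
        self.ops = ops
        self.runnable = False
        self.running = False

    def enqueue(self, p):
        if not self.runnable:
            self.runnable = True
            self.ops["runnable"](p)

    def set_running(self, p):
        self.enqueue(p)                 # .running() implies .runnable() first
        self.running = True
        self.ops["running"](p)

    def dequeue(self, p):
        if self.running:                # emit .stopping() before .quiescent()
            self.running = False
            self.ops["stopping"](p)
        if self.runnable:
            self.runnable = False
            self.ops["quiescent"](p)

events = []
ops = {k: (lambda p, k=k: events.append(k))
       for k in ("runnable", "running", "stopping", "quiescent")}
t = StateTracker(ops)
t.set_running("p")   # task starts executing directly (e.g. local dispatch)
t.dequeue("p")
```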
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Make watchdog handle ops.dispatch() looping stall · 0922f54f
      Tejun Heo authored
      The dispatch path retries if the local DSQ is still empty after
      ops.dispatch() either dispatched or consumed a task. This is both out of
      necessity and for convenience. It has to retry because the dispatch path
      might lose tasks to dequeues while the rq lock is released to migrate
      tasks across CPUs, and the retry mechanism makes the ops.dispatch()
      implementation easier as it only needs to make some forward progress on
      each iteration.
      
      However, this makes it possible for ops.dispatch() to stall CPUs by
      repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
      the watchdog or sysrq handler can't run and the system can't be saved. Let's
      address the issue by breaking out of the dispatch loop after 32 iterations.
      
      It is unlikely but not impossible for ops.dispatch() to legitimately go
      over the iteration limit. We want to come back to the dispatch path in
      such cases as not doing so risks stalling the CPU by idling with runnable
      tasks pending. As the previous task is still current in balance_scx(),
      resched_curr() doesn't do anything - it will just get cleared. Let's
      instead use scx_bpf_kick_cpu(), which will trigger a reschedule after
      switching to the next task, which will likely be the idle task.
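      A userspace model of the bounded retry loop; MAX_DISPATCH_RETRIES and
      balance_one are illustrative names for the 32-iteration breakout described
      above:

```python
MAX_DISPATCH_RETRIES = 32   # mirror of the 32-iteration breakout

def balance_one(local_dsq, dispatch):
    """Pop the next task from the local DSQ, retrying dispatch() while it
    makes forward progress, but never more than MAX_DISPATCH_RETRIES times so
    a misbehaving scheduler cannot pin the CPU in this loop."""
    for _ in range(MAX_DISPATCH_RETRIES):
        if local_dsq:
            return local_dsq.pop(0)
        if not dispatch(local_dsq):     # nothing left to dispatch at all
            return None
    return None                         # limit hit: bail out, CPU gets re-kicked

calls = []
def bad_dispatch(local_dsq):
    calls.append(1)   # "dispatches" only ineligible tasks: local DSQ stays empty
    return True

result = balance_one([], bad_dispatch)
```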
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
    • sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU · 037df2a3
      Tejun Heo authored
      This patch adds a new example scheduler, scx_central, which demonstrates
      central scheduling where one CPU is responsible for making all scheduling
      decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
      scheduling decisions for all CPUs in the system, queues tasks on the
      appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
      turn preempt the central CPU when they need tasks to run.
      
      Currently, every CPU depends on its own tick to expire the current task. A
      follow-up patch implementing tickless support for sched_ext will allow the
      worker CPUs to go full tickless so that they can run completely undisturbed.
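      The division of labor can be sketched as a toy model; central_dispatch is
      an invented name, and the kick is reduced to recording which workers would
      be preempted:

```python
from collections import deque

def central_dispatch(global_q, local_qs):
    """One 'central CPU' loop iteration: feed every empty per-CPU local queue
    from the global queue; workers only ever pop their own local queue."""
    kicked = []
    for cpu, lq in enumerate(local_qs):
        if not lq and global_q:
            lq.append(global_q.popleft())
            kicked.append(cpu)          # would preempt/kick the worker here
    return kicked

global_q = deque(["t0", "t1", "t2"])
local_qs = [deque(), deque(["busy"]), deque()]
kicked = central_dispatch(global_q, local_qs)
```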
      
      v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
            buffer if too many tasks are dispatched to the fallback DSQ.
      
          - Use the new SCX_KICK_IDLE to wake up non-central CPUs.
      
          - Dropped '-p' option.
      
      v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
            to simplify error handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
    • sched_ext: Implement scx_bpf_kick_cpu() and task preemption support · 81aae789
      Tejun Heo authored
      It's often useful to wake up and/or trigger reschedule on other CPUs. This
      patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
      kick the target CPU into the scheduling path.
      
      As a sched_ext task relinquishes its CPU only after its slice is depleted,
      this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
      slice of the target CPU's current task to guarantee that sched_ext's
      scheduling path runs on the CPU.
      
      If SCX_KICK_IDLE is specified, the target CPU is kicked iff it is idle, to
      guarantee that the target CPU goes through at least one full sched_ext
      scheduling cycle after the kick. This can be used to wake up idle CPUs
      without incurring unnecessary overhead when the target is already busy.
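      The flag semantics can be sketched as follows; the bit values are
      assumptions for illustration, only the behavior mirrors the description
      above:

```python
SCX_KICK_IDLE    = 1 << 0   # assumed bit values, for illustration only
SCX_KICK_PREEMPT = 1 << 1

def kick_cpu(cpu, flags):
    """Model of scx_bpf_kick_cpu() flag handling on the target CPU."""
    if flags & SCX_KICK_IDLE and not cpu["idle"]:
        return False                    # idle-only kick: skip busy CPUs
    if flags & SCX_KICK_PREEMPT:
        cpu["curr_slice"] = 0           # clear the slice so the current task yields
    cpu["resched"] = True
    return True

busy = {"idle": False, "curr_slice": 5_000_000, "resched": False}
idle = {"idle": True, "curr_slice": 0, "resched": False}
```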
      
      As a demonstration of how backward compatibility can be supported using BPF
      CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
      __COMPAT_scx_bpf_kick_cpu_IDLE(), which uses SCX_KICK_IDLE if available
      and falls back to a regular kick otherwise. This allows schedulers to use
      the new SCX_KICK_IDLE while maintaining support for older kernels. The
      plan is to temporarily use compat helpers to ease API updates and drop
      them after a few kernel releases.
      
      v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
            schedulers so that they can support kernels without SCX_KICK_IDLE.
            This is useful as a demonstration of how new feature flags can be
            added in a backward compatible way.
      
          - kick_cpus_irq_workfn() reimplemented so that it touches the pending
            cpumasks only as necessary to reduce kicking overhead on machines with
            a lot of CPUs.
      
          - tools/sched_ext/include/scx/compat.bpf.h added.
      
      v4: - Move example scheduler to its own patch.
      
      v3: - Make scx_example_central switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
      v2: - Julia Lawall reported that scx_example_central can overflow the
            dispatch buffer and malfunction. As scheduling for other CPUs can't be
            handled by the automatic retry mechanism, fix by implementing an
            explicit overflow and retry handling.
      
          - Updated to use generic BPF cpumask helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • tools/sched_ext: Add scx_show_state.py · 1c3ae1cb
      Tejun Heo authored
      There are states which are interesting but don't quite fit the interface
      exposed under /sys/kernel/sched_ext. Add tools/scx_show_state.py to show
      them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
    • sched_ext: Print debug dump after an error exit · 07814a94
      Tejun Heo authored
      If a BPF scheduler triggers an error, the scheduler is aborted and the
      system is reverted to the built-in scheduler. In the process, a lot of
      information which may be useful for figuring out what happened can be lost.
      
      This patch adds a debug dump which captures information that may be useful
      for debugging, including runqueue and runnable thread states at the time
      of failure. The following shows a debug dump after triggering the watchdog:
      
        root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
        stats  : enq=1 dsp=0 delta=1 deq=0
        stats  : enq=90 dsp=90 delta=0 deq=0
        stats  : enq=156 dsp=156 delta=0 deq=0
        stats  : enq=218 dsp=218 delta=0 deq=0
        stats  : enq=255 dsp=255 delta=0 deq=0
        stats  : enq=271 dsp=271 delta=0 deq=0
        stats  : enq=284 dsp=284 delta=0 deq=0
        stats  : enq=293 dsp=293 delta=0 deq=0
      
        DEBUG DUMP
        ================================================================================
      
        kworker/u32:12[320] triggered exit kind 1026:
          runnable task stall (stress[1530] failed to run for 6.841s)
      
        Backtrace:
          scx_watchdog_workfn+0x136/0x1c0
          process_scheduled_works+0x2b5/0x600
          worker_thread+0x269/0x360
          kthread+0xeb/0x110
          ret_from_fork+0x36/0x40
          ret_from_fork_asm+0x1a/0x30
      
        QMAP FIFO[0]:
        QMAP FIFO[1]:
        QMAP FIFO[2]: 1436
        QMAP FIFO[3]:
        QMAP FIFO[4]:
      
        CPU states
        ----------
      
        CPU 0   : nr_run=1 ops_qseq=244
      	    curr=swapper/0[0] class=idle_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
          R stress[1530] -6841ms
      	scx_state/flags=3/0x1 ops_state/qseq=2/20
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=0
      
            asm_sysvec_apic_timer_interrupt+0x16/0x20
      
        CPU 2   : nr_run=2 ops_qseq=142
      	    curr=swapper/2[0] class=idle_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
          R sshd[1703] -5905ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/88
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=1
      
            __x64_sys_ppoll+0xf6/0x120
            do_syscall_64+0x7b/0x150
            entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
          R fish[1539] -4141ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/124
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=1
      
            futex_wait+0x60/0xe0
            do_futex+0x109/0x180
            __x64_sys_futex+0x117/0x190
            do_syscall_64+0x7b/0x150
            entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        CPU 3   : nr_run=2 ops_qseq=162
      	    curr=kworker/u32:12[320] class=ext_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
         *R kworker/u32:12[320] +0ms
      	scx_state/flags=3/0xd ops_state/qseq=0/0
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=0
      
            scx_dump_state+0x613/0x6f0
            scx_ops_error_irq_workfn+0x1f/0x40
            irq_work_run_list+0x82/0xd0
            irq_work_run+0x14/0x30
            __sysvec_irq_work+0x40/0x140
            sysvec_irq_work+0x60/0x70
            asm_sysvec_irq_work+0x16/0x20
            scx_watchdog_workfn+0x15f/0x1c0
            process_scheduled_works+0x2b5/0x600
            worker_thread+0x269/0x360
            kthread+0xeb/0x110
            ret_from_fork+0x36/0x40
            ret_from_fork_asm+0x1a/0x30
      
          R kworker/3:2[1436] +0ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/160
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=08
      
            QMAP: force_local=0
      
            kthread+0xeb/0x110
            ret_from_fork+0x36/0x40
            ret_from_fork_asm+0x1a/0x30
      
        CPU 7   : nr_run=0 ops_qseq=76
      	    curr=swapper/7[0] class=idle_sched_class
      
      
        ================================================================================
      
        EXIT: runnable task stall (stress[1530] failed to run for 6.841s)
      
      It shows that CPU 3 was running the watchdog when it triggered the error
      condition and the scx_qmap thread has been queued on CPU 0 for over 5
      seconds but failed to run. It also prints out scx_qmap-specific information
      - e.g. which tasks are queued on each FIFO - using the dump_*() ops. This
      dump has proved pretty useful for developing and debugging BPF schedulers.
      
      A debug dump is generated automatically when the BPF scheduler exits due to
      an error. The debug buffer used in such cases is sized by
      sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
      the available buffer, the output is truncated and marked accordingly.
      
      Debug dump output can also be read through the sched_ext_dump tracepoint.
      When read through the tracepoint, there is no length limit.
      
      SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
      is loaded. This is non-destructive - the scheduler keeps running afterwards.
      The output can be read through the sched_ext_dump tracepoint.
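      The truncation behavior can be sketched as a bounded append; dump_to_buffer
      and the marker text are illustrative, only the 32k default comes from the
      description:

```python
DEFAULT_EXIT_DUMP_LEN = 32 * 1024   # sched_ext_ops.exit_dump_len default (32k)

def dump_to_buffer(chunks, limit=DEFAULT_EXIT_DUMP_LEN):
    """Append dump chunks into a bounded buffer; if the dump overruns the
    buffer, truncate the output and mark it accordingly."""
    marker = " [trunc]"
    out = ""
    for c in chunks:
        if len(out) + len(c) > limit:
            keep = max(0, limit - len(marker))
            return (out + c)[:keep] + marker
        out += c
    return out
```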
      
      v2: - The size of exit debug dump buffer can now be customized using
            sched_ext_ops.exit_dump_len.
      
          - sched_ext_ops.dump*() added to enable dumping of BPF scheduler
            specific information.
      
          - Tracepoint output and SysRq-D triggering added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
    • sched_ext: Print sched_ext info when dumping stack · 1538e339
      David Vernet authored
      It would be useful to see what the sched_ext scheduler state is, and what
      scheduler is running, when we're dumping a task's stack. This patch
      therefore adds a new print_scx_info() function that's called in the same
      context as print_worker_info() and print_stop_info(). An example dump
      follows.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000999
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP
        CPU: 13 PID: 2047 Comm: insmod Tainted: G           O       6.6.0-work-10323-gb58d4cae8e99-dirty #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
        Sched_ext: qmap (enabled+all), task: runnable_at=-17ms
        RIP: 0010:init_module+0x9/0x1000 [test_module]
        ...
      
      v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as
            it's now used by core implementation.
      
          - Convert jiffy delta to msecs using jiffies_to_msecs() instead of
            multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in
            jiffies_delta_msecs().
      
      v2: - We are now using scx_ops_enable_state_str[] outside
            CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the
            top. This was reported by Changwoo and Andrea.
      Signed-off-by: David Vernet <void@manifault.com>
      Reported-by: Changwoo Min <changwoo@igalia.com>
      Reported-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT · 7bb6f081
      Tejun Heo authored
      BPF schedulers might not want to schedule certain tasks - e.g. kernel
      threads. This patch adds p->scx.disallow which can be set by BPF schedulers
      in such cases. The field can be changed anytime and setting it in
      ops.prep_enable() guarantees that the task can never be scheduled by
      sched_ext.
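      The enforcement point can be sketched as follows; setscheduler_ext is an
      invented name, and the policy number 7 is taken from the /proc/self/sched
      output in the example below:

```python
import errno

SCHED_EXT = 7   # policy number as shown in /proc/self/sched in the example

def setscheduler_ext(task):
    """Model of the switch-to-SCHED_EXT path: a task whose p->scx.disallow is
    set is rejected with -EPERM (seen as 'Permission denied' by set-scx)."""
    if task.get("scx_disallow"):
        return -errno.EPERM
    task["policy"] = SCHED_EXT
    return 0

allowed = {"policy": 0}
denied = {"policy": 0, "scx_disallow": True}
```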
      
      scx_qmap is updated with the -d option to disallow a specific PID:
      
        # echo $$
        1092
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    0
        ext.enabled                                  :                    0
        # ./set-scx 1092
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    7
        ext.enabled                                  :                    0
      
      Run "scx_qmap -p -d 1092" in another terminal.
      
        # cat /sys/kernel/sched_ext/nr_rejected
        1
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    0
        ext.enabled                                  :                    0
        # ./set-scx 1092
        setparam failed for 1092 (Permission denied)
      
      - v4: Refreshed on top of tip:sched/core.
      
      - v3: Update description to reflect /sys/kernel/sched_ext interface change.
      
      - v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to
            accommodate 32bit archs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Barret Rhoden <brho@google.com>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Implement runnable task stall watchdog · 8a010b81
      David Vernet authored
      The most common and critical way that a BPF scheduler can misbehave is by
      failing to run runnable tasks for too long. This patch implements a
      watchdog.
      
      * All tasks record when they become runnable.
      
      * A watchdog work periodically scans all runnable tasks. If any task has
        stayed runnable for too long, the BPF scheduler is aborted.
      
      * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
        BPF scheduler is aborted.
      
      Because the watchdog scans only the currently runnable tasks, and does so
      very infrequently, the overhead should be negligible. scx_qmap is updated
      so that it can be told to stall user and/or kernel tasks.
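      The scan can be modeled as follows; watchdog_scan is an invented name and
      timestamps are in milliseconds:

```python
def watchdog_scan(runnable_at, now_ms, timeout_ms):
    """Scan currently-runnable tasks; report the first one that has been
    runnable longer than the allowed timeout, else None."""
    for name, since in runnable_at.items():
        delta = now_ms - since
        if delta > timeout_ms:
            return f"runnable task stall ({name} failed to run for {delta / 1000:.3f}s)"
    return None

# Task "stress" became runnable at t=0ms; at t=6841ms with a 5s timeout
# the watchdog reports the stall, matching the message format above.
msg = watchdog_scan({"stress": 0}, 6841, 5000)
```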
      
      A detected task stall looks like the following:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
          scx_check_timeout_workfn+0x10e/0x1b0
          process_one_work+0x287/0x560
          worker_thread+0x234/0x420
          kthread+0xe9/0x100
          ret_from_fork+0x1f/0x30
      
      A detected watchdog stall:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
          scheduler_tick+0x2eb/0x340
          update_process_times+0x7a/0x90
          tick_sched_timer+0xd8/0x130
          __hrtimer_run_queues+0x178/0x3b0
          hrtimer_interrupt+0xfc/0x390
          __sysvec_apic_timer_interrupt+0xb7/0x2b0
          sysvec_apic_timer_interrupt+0x90/0xb0
          asm_sysvec_apic_timer_interrupt+0x1b/0x20
          default_idle+0x14/0x20
          arch_cpu_idle+0xf/0x20
          default_idle_call+0x50/0x90
          do_idle+0xe8/0x240
          cpu_startup_entry+0x1d/0x20
          kernel_init+0x0/0x190
          start_kernel+0x0/0x392
          start_kernel+0x324/0x392
          x86_64_start_reservations+0x2a/0x2c
          x86_64_start_kernel+0x104/0x109
          secondary_startup_64_no_verify+0xce/0xdb
      
      Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
      inline scx_notify_sched_tick().
      
      v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
            being called before forward progress was guaranteed and thus could
            lead to system lockup. Relocated.
      
          - While enabling, it was comparing msecs against jiffies without
            conversion leading to spurious load failures on lower HZ kernels.
            Fixed.
      
          - runnable list management is now used by core bypass logic and moved to
            the patch implementing sched_ext core.
      
      v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
            against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
            conversion leading to spurious load failures in lower HZ kernels.
            Fixed.
      
      v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
            jiffies. Fix by using jiffies for everything.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
    • sched_ext: Add sysrq-S which disables the BPF scheduler · 79e10440
      Tejun Heo authored
      This enables the admin to abort the BPF scheduler and revert to CFS anytime.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Add scx_simple and scx_example_qmap example schedulers · 2a52ca7c
      Tejun Heo authored
      Add two simple example BPF schedulers - simple and qmap.
      
      * simple: In terms of scheduling, it behaves identically to not having any
        operation implemented at all. The two operations it implements exist only
        to improve visibility and exit handling. On certain homogeneous
        configurations, it can actually perform pretty well.
      
      * qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
        on BPF maps for scheduling. While not very practical, this is useful as a
        simple example and will be used to demonstrate different features.
      
      v7: - Compat helpers stripped out in preparation for upstreaming as the
            upstreamed patchset will be the baseline. Utility macros that can be
            used to implement compat features are kept.

          - Explicitly disable map autoattach on struct_ops to avoid trying to
            attach twice while maintaining compatibility with older libbpf.
      
      v6: - Common header files reorganized and cleaned up. Compat helpers are
            added to demonstrate how schedulers can maintain backward
            compatibility with older kernels while making use of newly added
            features.
      
          - simple_select_cpu() added to keep track of the number of local
            dispatches. This is needed because the default ops.select_cpu()
            implementation is updated to dispatch directly and won't call
            ops.enqueue().
      
          - Updated to reflect the sched_ext API changes. Switching all tasks is
            the default behavior now and scx_qmap supports partial switching when
            `-p` is specified.
      
          - tools/sched_ext/Kconfig dropped. This will be included in the doc
            instead.
      
      v5: - Improve Makefile. Build artifacts are now collected into a separate
            dir which can be changed. Install and help targets are added and
            clean actually cleans everything.
      
          - MEMBER_VPTR() improved to ease access to structs. ARRAY_ELEM_PTR()
            and RESIZABLE_ARRAY() are added to support resizable arrays in .bss.
      
          - Add scx_common.h which provides common utilities to user code such as
            SCX_BUG[_ON]() and RESIZE_ARRAY().
      
          - Use SCX_BUG[_ON]() to simplify error handling.
      
      v4: - Dropped _example prefix from scheduler names.
      
      v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
            to ease later additions. Comment updates.
      
          - Added declarations for BPF inline iterators. In the future, hopefully,
            these will be consolidated into a generic BPF header so that they
            don't need to be replicated here.
      
      v2: - Updated with the generic BPF cpumask helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Implement BPF extensible scheduler class · f0e1a064
      Tejun Heo authored
      Implement a new scheduler class sched_ext (SCX), which allows scheduling
      policies to be implemented as BPF programs to achieve the following:
      
      1. Ease of experimentation and exploration: Enabling rapid iteration of new
         scheduling policies.
      
      2. Customization: Building application-specific schedulers which implement
         policies that are not applicable to general-purpose schedulers.
      
      3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
         policies in production environments.
      
      sched_ext leverages BPF’s struct_ops feature to define a structure which
      exports function callbacks and flags to BPF programs that wish to implement
      scheduling policies. The struct_ops structure exported by sched_ext is
      struct sched_ext_ops, and is conceptually similar to struct sched_class. The
      role of sched_ext is to map the complex sched_class callbacks to the more
      simple and ergonomic struct sched_ext_ops callbacks.
      
      For more detailed discussion on the motivations and overview, please refer
      to the cover letter.
      
      Later patches will also add several example schedulers and documentation.
      
      This patch implements the minimum core framework to enable implementation of
      BPF schedulers. Subsequent patches will gradually add functionalities
      including safety guarantee mechanisms, nohz and cgroup support.
      
      include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
      top, each operation should be self-explanatory. The following are worth
      noting:
      
      - Both "sched_ext" and its shorthand "scx" are used. If the identifier
        already has "sched" in it, "ext" is used; otherwise, "scx".
      
      - In sched_ext_ops, only .name is mandatory. Every operation is optional and
        if omitted a simple but functional default behavior is provided.
      
      - A new policy constant SCHED_EXT is added and a task can select sched_ext
        by invoking sched_setscheduler(2) with the new policy constant. However,
        if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
        and the task is scheduled by CFS. When the BPF scheduler is loaded, all
        tasks which have the SCHED_EXT policy are switched to sched_ext.
      
      - To bridge the workflow imbalance between the scheduler core and
        sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
        queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
        one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
        convenience and need not be used by a scheduler that doesn't require it.
        SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
        the next task on the CPU. The BPF scheduler can manage an arbitrary number
        of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
      
      - sched_ext guarantees system integrity no matter what the BPF scheduler
        does. To enable this, each task's ownership is tracked through
        p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
        can always recover and revert all tasks back to CFS. See p->scx.ops_state
        and scx_tasks.
      
      - A task is not tied to its rq while enqueued. This decouples CPU selection
        from queueing and allows sharing a scheduling queue across an arbitrary
        subset of CPUs. This adds some complexities as a task may need to be
        bounced between rq's right before it starts executing. See
        dispatch_to_local_dsq() and move_task_to_local_dsq().
      
      - One complication that arises from the above weak association between task
        and rq is that synchronizing with dequeue() gets complicated as dequeue()
        may happen anytime while the task is enqueued and the dispatch path might
        need to release the rq lock to transfer the task. Solving this requires a
        bit of complexity. See the logic around p->scx.sticky_cpu and
        p->scx.ops_qseq.
      
      - Both enable and disable paths are a bit complicated. The enable path
        switches all tasks without blocking to avoid issues which can arise from
        partially switched states (e.g. the switching task itself being starved).
        The disable path can't trust the BPF scheduler at all, so it also has to
        guarantee forward progress without blocking. See scx_ops_enable() and
        scx_ops_disable_workfn().
      
      - When sched_ext is disabled, static_branches are used to shut down the
        entry points from hot paths.
      
      v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
            scx_ops_enable_mutex which can lead to deadlocks in the disable path.
            Fixed.
      
          - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
            to use-after-free.
      
          - Consolidated per-cpu variable usages and other cleanups.
      
      v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
            groups can be expressed. Later CPU hotplug operations are put into
            their own group.
      
          - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
            which allows temporarily putting the system into simple FIFO
            scheduling mode bypassing the BPF scheduler. In addition to the shut
            down path, this will also be used to isolate the BPF scheduler across
            PM events. Enabling and disabling the bypass mode requires iterating
            all runnable tasks. rq->scx.runnable_list addition is moved from the
            later watchdog patch.
      
          - ops.prep_enable() is replaced with ops.init_task() and
            ops.enable/disable() are now called whenever the task enters and
            leaves sched_ext instead of when the task becomes schedulable on
            sched_ext and stops being so. A new operation - ops.exit_task() - is
            called when the task stops being schedulable on sched_ext.
      
          - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
            removes the need for communicating local dispatch decision made by
            ops.select_cpu() to ops.enqueue() via per-task storage.
            SCX_KF_SELECT_CPU is added to support the change.
      
          - SCX_TASK_ENQ_LOCAL which told the BPF scheduler that
            scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
            was removed. Instead, scx_select_cpu_dfl() now dispatches directly
            if it finds a suitable idle CPU. If such behavior is not desired,
            users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
            bool out param.
      
          - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
            queueing many tasks on a local DSQ, which made tasks execute in
            order while other CPUs stayed idle and made some hackbench numbers
            really bad. Fixed.
      
          - The current state of sched_ext can now be monitored through files
            under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
            to enable monitoring on kernels which don't enable debugfs.
      
          - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
            be NULL and a BPF scheduler which derefs the pointer without checking
            could crash the kernel. Tell BPF. This is currently a bit ugly. A
            better way to annotate this is expected in the future.
      
          - scx_exit_info updated to carry pointers to message buffers instead of
            embedding them directly. This decouples buffer sizes from API so that
            they can be changed without breaking compatibility.
      
          - exit_code added to scx_exit_info. This is used to indicate different
            exit conditions on non-error exits and will be used to handle e.g. CPU
            hotplugs.
      
          - The patch "sched_ext: Allow BPF schedulers to switch all eligible
            tasks into sched_ext" is folded in and the interface is changed so
            that partial switching is indicated with a new ops flag
            %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary
            and in turn SCX_KF_INIT. ops.init() is now called with
            SCX_KF_SLEEPABLE.
      
          - Code reorganized so that only the parts necessary to integrate with
            the rest of the kernel are in the header files.
      
          - Changes to reflect the BPF and other kernel changes including the
            addition of bpf_sched_ext_ops.cfi_stubs.
      
      v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
            instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
            load_acquire/store_release is now unsigned long instead of u64.
      
          - Fix the bug where bpf_scx_btf_struct_access() was allowing write
            access to arbitrary fields.
      
          - Distinguish kfuncs which can be called from any sched_ext ops and from
            anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
            sched_ext ops.
      
          - Rename "type" to "kind" in scx_exit_info to make it easier to use on
            languages in which "type" is a reserved keyword.
      
          - Since cff9b233 ("kernel/sched: Modify initial boot task idle
            setup"), PF_IDLE is not set on idle tasks which haven't been online
            yet which made scx_task_iter_next_filtered() include those idle tasks
            in iterations leading to oopses. Update scx_task_iter_next_filtered()
            to directly test p->sched_class against idle_sched_class instead of
            using is_idle_task() which tests PF_IDLE.
      
          - Other updates to match upstream changes such as adding const to
            set_cpumask() param and renaming check_preempt_curr() to
            wakeup_preempt().
      
      v4: - SCHED_CHANGE_BLOCK replaced with the previous
            sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
            because upstream is adopting a different generic cleanup mechanism.
            Once that lands, the code will be adapted accordingly.
      
          - task_on_scx() used to test whether a task should be switched into SCX,
            which is confusing. Renamed to task_should_scx(). task_on_scx() now
            tests whether a task is currently on SCX.
      
          - scx_has_idle_cpus is barely used anymore and replaced with direct
            check on the idle cpumask.
      
          - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
            fully idle cores.
      
          - ops.enable() now sees up-to-date p->scx.weight value.
      
          - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
            schedulers expecting ->select_cpu() call.
      
          - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
            of the scheduler.
      
      v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
            without polling p->scx.weight.
      
          - move_task_to_local_dsq() was losing SCX-specific enq_flags when
            enqueueing the task on the target dsq because it goes through
            activate_task() which loses the upper 32bit of the flags. Carry the
            flags through rq->scx.extra_enq_flags.
      
          - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
            and scx_bpf_task_cpu() now use the new KF_RCU instead of
            KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
      
          - The kfunc helper access control mechanism implemented through
            sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
            used when invoking scx_ops operations.
      
      v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
            called from put_prev_task_scx() and pick_next_task_scx() as necessary.
            To determine whether balance_scx() should be called from
            put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
            comment in put_prev_task_scx() for details.
      
          - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
            with SCHED_CHANGE_BLOCK().
      
          - Unused all_dsqs list removed. This was a left-over from previous
            iterations.
      
          - p->scx.kf_mask is added to track and enforce which kfunc helpers are
            allowed. Also, init/exit sequences are updated to make some kfuncs
            always safe to call regardless of the current BPF scheduler state.
            Combined, this should make all the kfuncs safe.
      
          - BPF now supports sleepable struct_ops operations. Hacky workaround
            removed and operations and kfunc helpers are tagged appropriately.
      
          - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
            and friends are added so that BPF schedulers can use the idle masks
            with the generic helpers. This replaces the hacky kfunc helpers added
            by a separate patch in V1.
      
          - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
            enabled. This restriction will be removed by a later patch which adds
            core-sched support.
      
          - Add MAINTAINERS entries and other misc changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Co-authored-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
      f0e1a064
    • sched_ext: Add boilerplate for extensible scheduler class · a7a9fc54
      Tejun Heo authored
      This adds dummy implementations of sched_ext interfaces which interact with
      the scheduler core and hook them in the correct places. As they're all
      dummies, this doesn't cause any behavior changes. This is split out to help
      reviewing.
      
      v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      a7a9fc54
    • sched: Add normal_policy() · 2c8d046d
      Tejun Heo authored
      A new BPF extensible sched_class will need to dynamically change how a task
      picks its sched_class. For example, if the loaded BPF scheduler progs fail,
      the tasks will be forced back on CFS even if the task's policy is set to the
      new sched_class. To support such mapping, add normal_policy() which wraps
      testing for %SCHED_NORMAL. This doesn't cause any behavior changes.
      
      v2: Update the description with more details on the expected use.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      2c8d046d
    • sched: Factor out update_other_load_avgs() from __update_blocked_others() · 96fd6c65
      Tejun Heo authored
      RT, DL, thermal and irq load and utilization metrics need to be decayed and
      updated periodically and before consumption to keep the numbers reasonable.
      This is currently done from __update_blocked_others() as a part of the fair
      class load balance path. Let's factor it out to update_other_load_avgs().
      Pure refactor. No functional changes.
      
      This will be used by the new BPF extensible scheduling class to ensure that
      the above metrics are properly maintained.
      
      v2: Refreshed on top of tip:sched/core.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      96fd6c65
    • sched: Factor out cgroup weight conversion functions · 4f9c7ca8
      Tejun Heo authored
      Factor out sched_weight_from/to_cgroup() which convert between scheduler
      shares and cgroup weight. No functional change. The factored out functions
      will be used by a new BPF extensible sched_class so that the weights can be
      exposed to the BPF programs in a way which is consistent with cgroup weights and
      easier to interpret.
      
      The weight conversions will be used regardless of cgroup usage. It's just
      borrowing the cgroup weight range as it's more intuitive.
      CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
      the conversion helpers can always be defined.
      
      v2: The helpers are now defined regardless of CONFIG_CGROUPS.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      4f9c7ca8
    • sched: Add sched_class->switching_to() and expose check_class_changing/changed() · d8c7bc2e
      Tejun Heo authored
      When a task switches to a new sched_class, the prev and new classes are
      notified through ->switched_from() and ->switched_to(), respectively, after
      the switching is done.
      
      A new BPF extensible sched_class will have callbacks that allow the BPF
      scheduler to keep track of relevant task states (like priority and cpumask).
      Those callbacks aren't called while a task is on a different sched_class.
      When a task comes back, we want to tell the BPF progs the up-to-date state
      before the task gets enqueued, so we need a hook which is called before the
      switching is committed.
      
      This patch adds ->switching_to() which is called during sched_class switch
      through check_class_changing() before the task is restored. Also, this patch
      exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
      used by the new BPF extensible sched_class to implement implicit sched_class
      switching which is used e.g. when falling back to CFS when the BPF scheduler
      fails or unloads.
      
      This is a prep patch and doesn't cause any behavior changes. The new
      operation and exposed functions aren't used yet.
      
      v3: Refreshed on top of tip:sched/core.
      
      v2: Improve patch description w/ details on planned use.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      d8c7bc2e
    • sched: Add sched_class->reweight_task() · e83edbf8
      Tejun Heo authored
      Currently, during a task weight change, sched core directly calls
      reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
      sched_class operation instead. CFS's reweight_task() is renamed to
      reweight_task_fair() and now called through sched_class.
      
      While it turns a direct call into an indirect one, set_load_weight() isn't
      called from a hot path and this change shouldn't cause any noticeable
      difference. This will be used to implement reweight_task for a new BPF
      extensible sched_class so that it can keep its cached task weight
      up-to-date.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      e83edbf8
    • sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() · 304b3f2b
      Tejun Heo authored
      A new BPF extensible sched_class will need more control over the forking
      process. It wants to be able to fail from sched_cgroup_fork() after the new
      task's sched_task_group is initialized, so that the loaded BPF program can
      prepare the task after its cgroup association is established and reject the
      fork if e.g. an allocation fails.
      
      Allow sched_cgroup_fork() to fail by making it return int instead of void
      and adding sched_cancel_fork() to undo sched_fork() in the error path.
      
      sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
      behavior changes.
      
      v2: Patch description updated to detail the expected use.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      304b3f2b
    • sched: Restructure sched_class order sanity checks in sched_init() · df268382
      Tejun Heo authored
      Currently, sched_init() checks that the sched_class'es are in the expected
      order by testing each adjacency, which is a bit brittle and makes it
      cumbersome to add optional sched_class'es. Instead, let's verify whether
      they're in the expected order using sched_class_above() which is what
      matters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      df268382
    • 8cce4759
  2. 17 Jun, 2024 7 commits
    • Merge branch 'bpf-support-resilient-split-btf' · f6afdaf7
      Andrii Nakryiko authored
      Alan Maguire says:
      
      ====================
      bpf: support resilient split BTF
      
      Split BPF Type Format (BTF) provides huge advantages in that kernel
      modules only have to provide type information for types that they do not
      share with the core kernel; for core kernel types, split BTF refers to
      core kernel BTF type ids.  So for a STRUCT sk_buff, a module that
      uses that structure (or a pointer to it) simply needs to refer to the
      core kernel type id, saving the need to define the structure and its many
      dependents.  This cuts down on duplication and makes BTF as compact
      as possible.
      
      However, there is a downside.  This scheme requires the references from
      split BTF to base BTF to be valid not just at encoding time, but at use
      time (when the module is loaded).  Even a small change in kernel types
      can perturb the type ids in core kernel BTF, and - if the new reproducible
      BTF option is not used - pahole's parallel processing of compilation units
      can lead to different type ids for the same kernel if the BTF is
      regenerated.
      
      So we have a robustness problem for split BTF for cases where a module is
      not always compiled at the same time as the kernel.  This problem is
      particularly acute for distros which generally want module builders to be
      able to compile a module for the lifetime of a Linux stable-based release,
      and have it continue to be valid over the lifetime of that release, even
      as changes in data structures (and hence BTF types) accrue.  Today it's not
      possible to generate BTF for modules that works beyond the initial
      kernel it is compiled against - kernel bugfixes etc invalidate the split
      BTF references to vmlinux BTF, and BTF is no longer usable for the
      module.
      
      The goal of this series is to provide options to provide additional
      context for cases like this.  That context comes in the form of
      distilled base BTF; it stands in for the base BTF, and contains
      information about the types referenced from split BTF, but not their
      full descriptions.  The modified split BTF will refer to type ids in
      this .BTF.base section, and when the kernel loads such modules it
      will use that .BTF.base to map references from split BTF to the
      equivalent current vmlinux base BTF types.  Once this relocation
      process has succeeded, the module BTF available in /sys/kernel/btf
      will look exactly as if it was built with the current vmlinux;
      references to base types will be fixed up etc.
      
      A module builder - using this series along with the pahole changes -
      can then build a module with distilled base BTF via an out-of-tree
      module build, i.e.
      
      make -C . M=path/2/module
      
      The module will have a .BTF section (the split BTF) and a
      .BTF.base section.  The latter is small in size - distilled base
      BTF does not need full struct/union/enum information for named
      types for example.  For 2667 modules built with distilled base BTF,
      the average size observed was 1556 bytes (stddev 1563).  The overall
      size added to these 2667 modules was 5.3MB.
      
      Note that for the in-tree modules, this approach is not needed as
      split and base BTF in the case of in-tree modules are always built
      and re-built together.
      
      The series first focuses on generating split BTF with distilled base
      BTF; then relocation support is added to allow split BTF with
      an associated distilled base to be relocated with a new base BTF.
      
      Next Eduard's patch allows BTF ELF parsing to work with both
      .BTF and .BTF.base sections; this ensures that bpftool will be
      able to dump BTF for a module with a .BTF.base section for example,
      or indeed dump relocated BTF where a module and a "-B vmlinux"
      is supplied.
      
      Then we add support to resolve_btfids to ignore base BTF - i.e.
      to avoid relocation - if a .BTF.base section is found.  This ensures
      the .BTF.ids section is populated with ids relative to the distilled
      base (these will be relocated as part of module load).
      
      Finally the series supports storage of .BTF.base data/size in modules
      and supports sharing of relocation code with the kernel to allow
      relocation of module BTF.  For the kernel, this relocation
      process happens at module load time, and we relocate split BTF
      references to point at types in the current vmlinux BTF.  As part of
      this, .BTF.ids references need to be mapped also.
      
      So concretely, what happens is
      
      - we generate split BTF in the .BTF section of a module that refers to
        types in the .BTF.base section as base types; the latter are not full
        type descriptions but provide information about the base type.  So
        a STRUCT sk_buff would be represented as a FWD struct sk_buff in
        distilled base BTF for example.
      - when the module is loaded, the split BTF is relocated with vmlinux
        BTF; in the case of the FWD struct sk_buff, we find the STRUCT sk_buff
        in vmlinux BTF and map all split BTF references to the distilled base
        FWD sk_buff, replacing them with references to the vmlinux BTF
        STRUCT sk_buff.
      
      A previous approach to this problem [1] utilized standalone BTF for such
      cases - where the BTF is not defined relative to base BTF so there is no
      relocation required.  The problem with that approach is that from
      the verifier perspective, some types are special, and having a custom
      representation of a core kernel type that did not necessarily match the
      current representation is not tenable.  So the approach taken here was
      to preserve the split BTF model while minimizing the representation of
      the context needed to relocate split and current vmlinux BTF.
      
      To generate distilled .BTF.base sections the associated dwarves
      patch (to be applied on the "next" branch there) is needed [3]
      Without it, things will still work but modules will not be built
      with a .BTF.base section.
      
      Changes since v5[4]:
      
      - Update search of distilled types to return the first occurrence
        of a string (or a string+size pair); this allows us to iterate
        over all matches in distilled base BTF (Andrii, patch 3)
      - Update to use BTF field iterators (Andrii, patches 1, 3 and 8)
      - Update tests to cover multiple match and associated error cases
        (Eduard, patch 4)
      - Rename elf_sections_info to btf_elf_secs, remove use of
        libbpf_get_error(), reset btf->owns_base when relocation
        succeeds (Andrii, patch 5)
      
      Changes since v4[5]:
      
      - Moved embeddedness, duplicate name checks to relocation time
        and record struct/union size for all distilled struct/unions
        instead of using forwards.  This allows us to carry out
        type compatibility checks based on the base BTF we want to
        relocate with (Eduard, patches 1, 3)
      - Moved to using qsort() instead of qsort_r() as support for
        qsort_r() appears to be missing in Android libc (Andrii, patch 3)
      - Sorting/searching now incorporates size matching depending
        on BTF kind and embeddedness of struct/union (Eduard, Andrii,
        patch 3)
      - Improved naming of various types during relocation to avoid
        confusion (Andrii, patch 3)
      - Incorporated Eduard's patch (patch 5) which handles .BTF.base
        sections internally in btf_parse_elf().  This makes ELF parsing
        work with split BTF, split BTF with a distilled base, split
        BTF with a distilled base _and_ base BTF (by relocating) etc.
        Having this avoids the need for bpftool changes; it will work
        as-is with .BTF.base sections (Eduard, patch 4)
      - Updated resolve_btfids to _not_ relocate BTF for modules
        where a .BTF.base section is present; in that one case we
        do not want to relocate BTF as the .BTF.ids section should
        reflect ids in .BTF.base which will later be relocated on
        module load (Eduard, Andrii, patch 5)
      
      Changes since v3[6]:
      
      - distill now checks for duplicate-named struct/unions and records
        them as a sized struct/union to help identify which of the
        multiple base BTF structs/unions it refers to (Eduard, patch 1)
      - added test support for multiple name handling (Eduard, patch 2)
      - simplified the string mapping when updating split BTF to use
        base BTF instead of distilled base.  Since the only string
        references split BTF can make to base BTF are the names of
        the base types, create a string map from distilled string
        offset -> base BTF string offset and update string offsets
        by visiting all strings in split BTF; this saves having to
        do costly searches of base BTF (Eduard, patch 7,10)
      - fixed bpftool manpage and indentation issues (Quentin, patch 11)
      
      Also explored Eduard's suggestion of doing an implicit fallback
      to checking for .BTF.base section in btf__parse() when it is
      called to get base BTF.  However while it is doable, it turned
      out to be difficult operationally.  Since fallback is implicit
      we do not know the source of the BTF - was it from .BTF or
      .BTF.base? In bpftool, we want to try first standalone BTF,
      then split, then split with distilled base.  Having a way
      to explicitly request .BTF.base via btf__parse_opts() fits
      that model better.
      
      Changes since v2[7]:
      
      - submitted patch to use --btf_features in Makefile.btf for pahole
        v1.26 and later separately (Andrii).  That has landed in bpf-next
        now.
      - distilled base now encodes ENUM64 as fwd ENUM (size 8), eliminating
        the need for support for ENUM64 in btf__add_fwd (patch 1, Andrii)
      - moved to distilling only named types, augmenting split BTF with
        associated reference types; this simplifies greatly the distilled
        base BTF and the mapping operation between distilled and base
        BTF when relocating (most of the series changes, Andrii)
      - relocation now iterates over base BTF, looking for matches based
        on name in distilled BTF.  Distilled BTF is pre-sorted by name
        (Andrii, patch 8)
      - removed most redundant compatibility checks aside from struct
        size for base types/embedded structs and kind compatibility
        (since we only match on name) (Andrii, patch 8)
      - btf__parse_opts() now replaces btf_parse() internally in libbpf
        (Eduard, patch 3)
      
      Changes since RFC [8]:
      
      - updated terminology; we replace the clunky "base reference" BTF with
        distilling base BTF into a .BTF.base section.  Similarly, BTF
        reconciliation becomes BTF relocation (Andrii, most patches)
      - add distilled base BTF by default for out-of-tree modules
        (Alexei, patch 8)
      - distill algorithm updated to record size of embedded struct/union
        by recording it as a 0-vlen STRUCT/UNION with size preserved
        (Andrii, patch 2)
      - verify size match on relocation for such STRUCT/UNIONs (Andrii,
        patch 9)
      - with embedded STRUCT/UNION recording size, we can have bpftool
        dump a header representation using .BTF.base + .BTF sections
        rather than special-casing and refusing to use "format c" for
        that case (patch 5)
      - match enum with enum64 and vice versa (Andrii, patch 9)
      - ensure that resolve_btfids works with BTF without .BTF.base
        section (patch 7)
      - update tests to cover embedded types, arrays and function
        prototypes (patches 3, 12)
      
      [1] https://lore.kernel.org/bpf/20231112124834.388735-14-alan.maguire@oracle.com/
      [2] https://lore.kernel.org/bpf/20240501175035.2476830-1-alan.maguire@oracle.com/
      [3] https://lore.kernel.org/bpf/20240517102714.4072080-1-alan.maguire@oracle.com/
      [4] https://lore.kernel.org/bpf/20240528122408.3154936-1-alan.maguire@oracle.com/
      [5] https://lore.kernel.org/bpf/20240517102246.4070184-1-alan.maguire@oracle.com/
      [6] https://lore.kernel.org/bpf/20240510103052.850012-1-alan.maguire@oracle.com/
      [7] https://lore.kernel.org/bpf/20240424154806.3417662-1-alan.maguire@oracle.com/
      [8] https://lore.kernel.org/bpf/20240322102455.98558-1-alan.maguire@oracle.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240613095014.357981-1-alan.maguire@oracle.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      f6afdaf7
    • resolve_btfids: Handle presence of .BTF.base section · 6ba77385
      Alan Maguire authored
      Now that btf_parse_elf() handles .BTF.base section presence,
      we need to ensure that resolve_btfids uses .BTF.base when present
      rather than the vmlinux base BTF passed in via the -B option.
      Detect .BTF.base section presence and unset the base BTF path
      to ensure that BTF ELF parsing will do the right thing.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com
      6ba77385
    • libbpf: Make btf_parse_elf process .BTF.base transparently · c86f180f
      Eduard Zingerman authored
      Update btf_parse_elf() to check if .BTF.base section is present.
      The logic is as follows:
      
        if .BTF.base section exists:
           distilled_base := btf_new(.BTF.base)
        if distilled_base:
           btf := btf_new(.BTF, .base_btf=distilled_base)
           if base_btf:
              btf_relocate(btf, base_btf)
        else:
           btf := btf_new(.BTF)
        return btf
      
      In other words:
      - if .BTF.base section exists, load BTF from it and use it as a base
        for .BTF load;
      - if base_btf is specified and the .BTF.base section exists, relocate the
        newly loaded .BTF against base_btf.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com
      c86f180f
    • selftests/bpf: Extend distilled BTF tests to cover BTF relocation · affdeb50
      Alan Maguire authored
      Ensure relocated BTF looks as expected; in this case identical to
      original split BTF, with a few duplicate anonymous types added to
      split BTF by the relocation process.  Also add relocation tests
      for edge cases like missing type in base BTF and multiple types
      of the same name.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com
      affdeb50
    • libbpf: Split BTF relocation · 19e00c89
      Alan Maguire authored
      Map the distilled base BTF type ids and string offsets referenced by
      split BTF to the base BTF passed in; if the mapping succeeds,
      reparent the split BTF to that base BTF.
      
      Relocation is done by first verifying that distilled base BTF
      only consists of named INT, FLOAT, ENUM, FWD, STRUCT and
      UNION kinds; then we sort these to speed lookups.  Once sorted,
      the base BTF is iterated, and for each relevant kind we check
      for an equivalent in distilled base BTF.  When found, the
      mapping from distilled -> base BTF id and string offset is recorded.
      In establishing mappings, we need to ensure we check STRUCT/UNION
      size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION,
      and when duplicate names exist for the same STRUCT/UNION.  Otherwise
      size is ignored in matching STRUCT/UNIONs.
      
      Once all mappings are established, we can update type ids
      and string offsets in split BTF and reparent it to the new base.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com
      19e00c89
    • selftests/bpf: Test distilled base, split BTF generation · eb20e727
      Alan Maguire authored
      Test generation of split+distilled base BTF, ensuring that
      
      - named base BTF STRUCTs and UNIONs are represented as 0-vlen
        STRUCT/UNIONs with size preserved
      - named ENUM[64]s are represented as 0-vlen named ENUM[64]s
      - anonymous struct/unions are represented in full in split BTF
      - anonymous enums are represented in full in split BTF
      - types unreferenced from split BTF are not present in distilled
        base BTF
      
      Also test that with vmlinux BTF and split BTF based upon it,
      we only represent needed base types referenced from split BTF
      in distilled base.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com
      eb20e727
    • libbpf: Add btf__distill_base() creating split BTF with distilled base BTF · 58e185a0
      Alan Maguire authored
      To support more robust split BTF, supplemental context is needed for
      the base BTF type ids that split BTF refers to.  Without such context,
      a simple shuffling of base BTF type ids (without any other
      significant change) invalidates the split BTF.  This patch stores the
      additional context needed to make split BTF more robust.
      
      This context comes in the form of distilled base BTF providing minimal
      information (name and - in some cases - size) for base INTs, FLOATs,
      STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that
      points at that base and contains any additional types needed (such as
      TYPEDEF, PTR and anonymous STRUCT/UNION declarations).  This
      information constitutes the minimal BTF representation needed to
      disambiguate or remove split BTF references to base BTF.  The rules
      are as follows:
      
      - INT, FLOAT, FWD are recorded in full.
      - if a named base BTF STRUCT or UNION is referred to from split BTF, it
        will be encoded as a zero-member sized STRUCT/UNION (preserving
        size for later relocation checks).  Only base BTF STRUCT/UNIONs
        that are either embedded in split BTF STRUCT/UNIONs or that have
        multiple STRUCT/UNION instances of the same name will _need_ size
        checks at relocation time, but as it is possible a different set of
        types will be duplicates in the later to-be-resolved base BTF,
        we preserve size information for all named STRUCT/UNIONs.
      - if an ENUM[64] is named, an ENUM forward representation (an ENUM
        with no values) of the same size is used.
      - in all other cases, the type is added to the new split BTF.
      
      Avoiding struct/union/enum/enum64 expansion is important to keep the
      distilled base BTF representation to a minimum size.
      
      When successful, new representations of the distilled base BTF and new
      split BTF that refers to it are returned.  Both need to be freed by the
      caller.
      
      So to take a simple example, with split BTF with a type referring
      to "struct sk_buff", we will generate distilled base BTF with a
      0-member STRUCT sk_buff of the appropriate size, and the split BTF
      will refer to it instead.
      
      Tools like pahole can utilize such split BTF to populate the .BTF
      section (split BTF) and an additional .BTF.base section.  Then
      when the split BTF is loaded, the distilled base BTF can be used
      to relocate split BTF to reference the current (and possibly changed)
      base BTF.
      
      So for example if "struct sk_buff" was id 502 when the split BTF was
      originally generated, we can use the distilled base BTF to see that
      id 502 refers to a "struct sk_buff" and replace instances of id 502
      with the current (relocated) base BTF sk_buff type id.
      
      Distilled base BTF is small; when building a kernel with all modules
      using distilled base BTF as a test, overall module size grew by only
      5.3MB total across ~2700 modules.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com
      58e185a0
  3. 14 Jun, 2024 4 commits
  4. 13 Jun, 2024 5 commits