• Tejun Heo's avatar
    sched_ext: Print debug dump after an error exit · 07814a94
    Tejun Heo authored
    If a BPF scheduler triggers an error, the scheduler is aborted and the
    system is reverted to the built-in scheduler. In the process, a lot of
    information which may be useful for figuring out what happened can be lost.
    
    This patch adds debug dump which captures information which may be useful
    for debugging including runqueue and runnable thread states at the time of
    failure. The following shows a debug dump after triggering the watchdog:
    
      root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
      stats  : enq=1 dsp=0 delta=1 deq=0
      stats  : enq=90 dsp=90 delta=0 deq=0
      stats  : enq=156 dsp=156 delta=0 deq=0
      stats  : enq=218 dsp=218 delta=0 deq=0
      stats  : enq=255 dsp=255 delta=0 deq=0
      stats  : enq=271 dsp=271 delta=0 deq=0
      stats  : enq=284 dsp=284 delta=0 deq=0
      stats  : enq=293 dsp=293 delta=0 deq=0
    
      DEBUG DUMP
      ================================================================================
    
      kworker/u32:12[320] triggered exit kind 1026:
        runnable task stall (stress[1530] failed to run for 6.841s)
    
      Backtrace:
        scx_watchdog_workfn+0x136/0x1c0
        process_scheduled_works+0x2b5/0x600
        worker_thread+0x269/0x360
        kthread+0xeb/0x110
        ret_from_fork+0x36/0x40
        ret_from_fork_asm+0x1a/0x30
    
      QMAP FIFO[0]:
      QMAP FIFO[1]:
      QMAP FIFO[2]: 1436
      QMAP FIFO[3]:
      QMAP FIFO[4]:
    
      CPU states
      ----------
    
      CPU 0   : nr_run=1 ops_qseq=244
    	    curr=swapper/0[0] class=idle_sched_class
    
        QMAP: dsp_idx=1 dsp_cnt=0
    
        R stress[1530] -6841ms
    	scx_state/flags=3/0x1 ops_state/qseq=2/20
    	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
    	cpus=ff
    
          QMAP: force_local=0
    
          asm_sysvec_apic_timer_interrupt+0x16/0x20
    
      CPU 2   : nr_run=2 ops_qseq=142
    	    curr=swapper/2[0] class=idle_sched_class
    
        QMAP: dsp_idx=1 dsp_cnt=0
    
        R sshd[1703] -5905ms
    	scx_state/flags=3/0x9 ops_state/qseq=2/88
    	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
    	cpus=ff
    
          QMAP: force_local=1
    
          __x64_sys_ppoll+0xf6/0x120
          do_syscall_64+0x7b/0x150
          entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
        R fish[1539] -4141ms
    	scx_state/flags=3/0x9 ops_state/qseq=2/124
    	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
    	cpus=ff
    
          QMAP: force_local=1
    
          futex_wait+0x60/0xe0
          do_futex+0x109/0x180
          __x64_sys_futex+0x117/0x190
          do_syscall_64+0x7b/0x150
          entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
      CPU 3   : nr_run=2 ops_qseq=162
    	    curr=kworker/u32:12[320] class=ext_sched_class
    
        QMAP: dsp_idx=1 dsp_cnt=0
    
       *R kworker/u32:12[320] +0ms
    	scx_state/flags=3/0xd ops_state/qseq=0/0
    	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
    	cpus=ff
    
          QMAP: force_local=0
    
          scx_dump_state+0x613/0x6f0
          scx_ops_error_irq_workfn+0x1f/0x40
          irq_work_run_list+0x82/0xd0
          irq_work_run+0x14/0x30
          __sysvec_irq_work+0x40/0x140
          sysvec_irq_work+0x60/0x70
          asm_sysvec_irq_work+0x16/0x20
          scx_watchdog_workfn+0x15f/0x1c0
          process_scheduled_works+0x2b5/0x600
          worker_thread+0x269/0x360
          kthread+0xeb/0x110
          ret_from_fork+0x36/0x40
          ret_from_fork_asm+0x1a/0x30
    
        R kworker/3:2[1436] +0ms
    	scx_state/flags=3/0x9 ops_state/qseq=2/160
    	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
    	cpus=08
    
          QMAP: force_local=0
    
          kthread+0xeb/0x110
          ret_from_fork+0x36/0x40
          ret_from_fork_asm+0x1a/0x30
    
      CPU 7   : nr_run=0 ops_qseq=76
    	    curr=swapper/7[0] class=idle_sched_class
    
    
      ================================================================================
    
      EXIT: runnable task stall (stress[1530] failed to run for 6.841s)
    
    It shows that CPU 3 was running the watchdog when it triggered the error
    condition and the scx_qmap thread has been queued on CPU 0 for over 5
    seconds but failed to run. It also prints out scx_qmap specific information
    - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops.
    This dump has proved pretty useful for developing and debugging BPF
    schedulers.
    
    Debug dump is generated automatically when the BPF scheduler exits due to an
    error. The debug buffer used in such cases is determined by
    sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
    the available buffer, the output is truncated and marked accordingly.
    
    Debug dump output can also be read through the sched_ext_dump tracepoint.
    When read through the tracepoint, there is no length limit.
    
    SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
    is loaded. This is non-destructive - the scheduler keeps running afterwards.
    The output can be read through the sched_ext_dump tracepoint.
    
    v2: - The size of exit debug dump buffer can now be customized using
          sched_ext_ops.exit_dump_len.
    
        - sched_ext_ops.dump*() added to enable dumping of BPF scheduler
          specific information.
    
        - Tracpoint output and SysRq-D triggering added.
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
    07814a94
sched_ext.h 536 Bytes