    rcu: Defer RCU kthreads wakeup when CPU is dying · e787644c
    Frederic Weisbecker authored
    When the CPU goes idle for the last time during the CPU down hotplug
    process, RCU reports a final quiescent state for the current CPU. If
    this quiescent state propagates up to the top, some tasks may then be
    woken up to complete the grace period: the main grace period kthread
    and/or the expedited main workqueue (or kworker).
    
    If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
    arm the RT bandwidth timer on the local offline CPU. Since this happens
    after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, the
    timer gets ignored. Therefore if the RCU kthreads are waiting for RT
    bandwidth to be available, they may never be actually scheduled.
    
    This triggers TREE03 rcutorture hangs:
    
    	 rcu: INFO: rcu_preempt self-detected stall on CPU
    	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
    	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
    	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
    	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
    	 rcu: RCU grace-period kthread stack dump:
    	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
    	 Call Trace:
    	  <TASK>
    	  __schedule+0x2eb/0xa80
    	  schedule+0x1f/0x90
    	  schedule_timeout+0x163/0x270
    	  ? __pfx_process_timeout+0x10/0x10
    	  rcu_gp_fqs_loop+0x37c/0x5b0
    	  ? __pfx_rcu_gp_kthread+0x10/0x10
    	  rcu_gp_kthread+0x17c/0x200
    	  kthread+0xde/0x110
    	  ? __pfx_kthread+0x10/0x10
    	  ret_from_fork+0x2b/0x40
    	  ? __pfx_kthread+0x10/0x10
    	  ret_from_fork_asm+0x1b/0x30
    	  </TASK>
    
    The situation can't be solved with just unpinning the timer. The hrtimer
    infrastructure and the nohz heuristics involved in finding the best
    remote target for an unpinned timer would then also need to handle
    enqueues from an offline CPU in the most horrendous way.
    
    So fix this on the RCU side instead and defer the wake up to an online
    CPU if it's too late for the local one.
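
The deferral described above can be modeled in userspace as a rough sketch. This is not the kernel patch itself: `cpu_online`, `wakeups_run_on`, `pick_online_cpu()` and `wake_up_online()` are hypothetical names invented for illustration, and the "wake up" is reduced to a per-CPU counter. The point is only the decision: run the wake up locally while the local CPU is still online, otherwise hand it to a CPU that is.

```c
#include <stdbool.h>

#define NR_CPUS 6

/* Hypothetical stand-ins for kernel state; not the real kernel API. */
static bool cpu_online[NR_CPUS] = { true, true, true, true, true, true };
static int wakeups_run_on[NR_CPUS];	/* wake ups executed per CPU */

/* Return the lowest-numbered online CPU, or -1 if there is none. */
static int pick_online_cpu(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_online[cpu])
			return cpu;
	return -1;
}

/*
 * Perform the wake up on this_cpu if it is still online. Otherwise
 * defer it to an online CPU, so that any timer armed as a side effect
 * lands on a CPU whose timers are still serviced.
 * Returns the CPU that ran the wake up, or -1 if no CPU was online.
 */
static int wake_up_online(int this_cpu)
{
	int target = this_cpu;

	if (!cpu_online[this_cpu])
		target = pick_online_cpu();	/* too late locally: defer */

	if (target >= 0)
		wakeups_run_on[target]++;

	return target;
}
```

In this model, marking CPU 4 offline and calling `wake_up_online(4)` runs the wake up on CPU 0, while a call from an online CPU stays local. A real implementation would also have to prevent the target from going offline between selection and execution, which the sketch ignores.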
    
    Reported-by: Paul E. McKenney <paulmck@kernel.org>
    Fixes: 5c0930cc ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
tree_exp.h 35.2 KB