    cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order · 45178ac0
    Peter Zijlstra authored
    
    
    Paul reported a very sporadic, rcutorture-induced workqueue failure.
    When the planets align, the workqueue rescuer's self-migrate fails and
    then triggers a WARN for running a work on the wrong CPU.
    
    Tejun then figured out that set_cpus_allowed_ptr()'s stop_one_cpu()
    call could be silently ignored: when stopper->enabled is false,
    stop_machine completes the work immediately without actually running
    it. Worse, it does not WARN about this (we really should fix that).
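
The failure mode described above can be illustrated with a small user-space model. This is only a sketch: `model_stopper`, `model_done`, `model_stop_one_cpu()` and `model_migrate()` are hypothetical names invented for illustration, not the kernel's data structures or API, and the model captures nothing beyond what this message states (a disabled stopper reports the work as complete without running it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not the kernel's struct cpu_stopper. */
struct model_stopper {
	bool enabled;		/* plays the role of stopper->enabled */
};

struct model_done {
	bool completed;		/* the caller's wait would return */
	bool executed;		/* the callback actually ran */
	int ret;		/* callback result, 0 if it never ran */
};

typedef int (*model_stop_fn)(void *arg);

/*
 * Model of handing work to a per-CPU stopper. If the stopper is not
 * enabled, the work is reported complete without the callback running,
 * and nothing warns about it.
 */
static struct model_done model_stop_one_cpu(struct model_stopper *s,
					    model_stop_fn fn, void *arg)
{
	struct model_done done = { .completed = false, .executed = false, .ret = 0 };

	assert(s != NULL && fn != NULL);

	if (s->enabled) {
		done.ret = fn(arg);
		done.executed = true;
	}
	done.completed = true;	/* signalled in both cases */
	return done;
}

/* Stand-in for the migration step: record the CPU the task moved to. */
static int model_migrate(void *arg)
{
	int *cpu = arg;

	*cpu = 1;
	return 0;
}
```

With `enabled` false, a caller sees `completed` set while the value passed to `model_migrate()` is untouched, which mirrors the rescuer believing it migrated when it did not.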
    
    It turns out there is a small window where a freshly onlined CPU is
    marked 'online' but doesn't yet have the stopper task running:
    
    	BP				AP
    
    	bringup_cpu()
    	  __cpu_up(cpu, idle)	 -->	start_secondary()
    					...
    					cpu_startup_entry()
    	  bringup_wait_for_ap()
    	    wait_for_ap_thread() <--	  cpuhp_online_idle()
    					  while (1)
    					    do_idle()
    
    					... available to run kthreads ...
    
    	    stop_machine_unpark()
    	      stopper->enabled = true;
    
    Close this by moving the stop_machine_unpark() into
    cpuhp_online_idle(), such that the stopper thread is ready before we
    start the idle loop and schedule.
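
The effect of the reordering can be sketched as a sequential toy model of the two bring-up orders. All names here (`model_cpu`, `model_check()`, `model_bringup_old()`, `model_bringup_new()`) are hypothetical and exist only for illustration; the real code runs on two CPUs concurrently, so this only shows which step becomes visible first in each ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct model_cpu {
	bool online;		/* CPU is marked 'online' */
	bool stopper_enabled;	/* stopper thread unparked and enabled */
	bool can_run_kthreads;	/* AP has entered the idle loop */
	bool window_seen;	/* schedulable while the stopper was disabled */
};

/* Record whether the CPU can run kthreads before its stopper is ready. */
static void model_check(struct model_cpu *c)
{
	if (c->can_run_kthreads && !c->stopper_enabled)
		c->window_seen = true;
}

/* Old order: the AP starts idling, then the BP unparks the stopper. */
static bool model_bringup_old(void)
{
	struct model_cpu c = {0};

	c.online = true;
	c.can_run_kthreads = true;	/* AP: while (1) do_idle() */
	model_check(&c);
	c.stopper_enabled = true;	/* BP: stop_machine_unpark() */
	model_check(&c);
	return c.window_seen;
}

/* New order: the stopper is unparked before the idle loop starts. */
static bool model_bringup_new(void)
{
	struct model_cpu c = {0};

	c.online = true;
	c.stopper_enabled = true;	/* AP: unpark in cpuhp_online_idle() */
	model_check(&c);
	c.can_run_kthreads = true;	/* AP: while (1) do_idle() */
	model_check(&c);
	return c.window_seen;
}
```

In this model `model_bringup_old()` reports the window and `model_bringup_new()` does not, because the stopper is enabled before anything can be scheduled on the new CPU.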

    Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
    Debugged-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: "Paul E. McKenney" <paulmck@kernel.org>