    rcu: Fix grace-period-stall bug on large systems with CPU hotplug · b668c9cf
    Paul E. McKenney authored
    When the last CPU of a given leaf rcu_node structure goes
    offline, all of the tasks queued on that leaf rcu_node structure
    (due to having blocked in their current RCU read-side critical
    sections) are requeued onto the root rcu_node structure.  This
    requeuing is carried out by rcu_preempt_offline_tasks().
    However, it is possible that these queued tasks are the only
    thing preventing the leaf rcu_node structure from reporting a
    quiescent state up the rcu_node hierarchy.  Unfortunately, the
    old code would fail to do this reporting, resulting in a
    grace-period stall given the following sequence of events:
    
    1.	Kernel built for more than 32 CPUs on 32-bit systems or for more
    	than 64 CPUs on 64-bit systems, so that there is more than one
    	rcu_node structure.  (Or CONFIG_RCU_FANOUT is artificially set
    	to a number smaller than CONFIG_NR_CPUS.)
    
    2.	The kernel is built with CONFIG_TREE_PREEMPT_RCU.
    
    3.	A task running on a CPU associated with a given leaf rcu_node
    	structure blocks while in an RCU read-side critical section
    	-and- that CPU has not yet passed through a quiescent state
    	for the current RCU grace period.  This will cause the task
    	to be queued on the leaf rcu_node's blocked_tasks[] array, in
    	particular, on the element of this array corresponding to the
    	current grace period.
    
    4.	Each of the remaining CPUs corresponding to this same leaf rcu_node
    	structure passes through a quiescent state.  However, the task is
    	still in its RCU read-side critical section, so these quiescent
    	states cannot be reported further up the rcu_node hierarchy.
    	Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
    	field are now zero.
    
    5.	Each of the remaining CPUs goes offline.  (The events in steps
    	#4 and #5 can happen in any order as long as each CPU passes
    	through a quiescent state before going offline.)
    
    6.	When the last CPU goes offline, __rcu_offline_cpu() will invoke
    	rcu_preempt_offline_tasks(), which will move the task to the
    	root rcu_node structure, but without reporting a quiescent state
    	up the rcu_node hierarchy (and this failure to report a quiescent
    	state is the bug).
    
    	But because this leaf rcu_node structure's ->qsmask field is
    	already zero and its ->blocked_tasks[] entries are all empty,
    	force_quiescent_state() will skip this rcu_node structure.
    
    	Therefore, grace periods are now hung.
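
    The sequence above can be replayed in a small user-space model.  This
    is only a sketch: struct toy_node and every toy_*() function below are
    hypothetical stand-ins invented for illustration, not the kernel's
    rcu_node code, and the model assumes a single leaf with two CPUs whose
    sibling leaves have already reported to the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an rcu_node structure (NOT the kernel's definition). */
struct toy_node {
	unsigned long qsmask;    /* children/CPUs still owing a quiescent state */
	int blocked;             /* tasks blocking the current grace period */
	unsigned long grpmask;   /* this node's bit in parent->qsmask */
	struct toy_node *parent; /* NULL for the root */
};

/* A CPU passes through a quiescent state.  The report goes upward only
 * if no CPU bit and no blocked task still holds up this node. */
static void toy_cpu_quiet(struct toy_node *leaf, unsigned long cpu_bit)
{
	leaf->qsmask &= ~cpu_bit;
	if (leaf->qsmask == 0 && leaf->blocked == 0 && leaf->parent)
		leaf->parent->qsmask &= ~leaf->grpmask;
}

/* Old last-CPU-offline behaviour as described in step 6: requeue the
 * tasks on the root, but report nothing up the hierarchy. */
static void toy_offline_tasks_old(struct toy_node *leaf, struct toy_node *root)
{
	root->blocked += leaf->blocked;
	leaf->blocked = 0;
}

/* Simplified scan: a node with nothing outstanding is skipped. */
static bool toy_scan_would_visit(const struct toy_node *n)
{
	return n->qsmask != 0 || n->blocked != 0;
}

/* The grace period can end once the root has nothing outstanding. */
static bool toy_gp_can_end(const struct toy_node *root)
{
	return root->qsmask == 0 && root->blocked == 0;
}

/* Replay steps 3-6, then let the reader finish.  Returns true if the
 * modelled grace period is stuck. */
static bool toy_replay_stall(void)
{
	struct toy_node root = { .qsmask = 0x1, .blocked = 0,
				 .grpmask = 0, .parent = NULL };
	struct toy_node leaf = { .qsmask = 0x3, .blocked = 0,
				 .grpmask = 0x1, .parent = &root };

	leaf.blocked = 1;                    /* step 3: reader blocks */
	toy_cpu_quiet(&leaf, 0x1);           /* step 4: both CPUs pass */
	toy_cpu_quiet(&leaf, 0x2);           /*         quiescent states */
	toy_offline_tasks_old(&leaf, &root); /* steps 5-6: last CPU offline */

	assert(root.blocked == 1);
	root.blocked--;                      /* reader leaves its section */

	/* The leaf's bit in the root is still set, yet nothing will ever
	 * visit the leaf to clear it. */
	return !toy_gp_can_end(&root) && !toy_scan_would_visit(&leaf);
}
```

    In this model toy_replay_stall() returns true: the root still waits on
    the leaf's bit even after the reader has finished, and the scan has no
    reason to look at the leaf again.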
    
    This patch abstracts some code out of rcu_read_unlock_special(),
    calling the result task_quiet() by analogy with cpu_quiet(), and
    invokes task_quiet() from both rcu_read_unlock_special() and
    __rcu_offline_cpu().  Invoking task_quiet() from
    __rcu_offline_cpu() reports the quiescent state up the rcu_node
    hierarchy, fixing the bug.  This ends up requiring a separate
    lock_class_key per level of the rcu_node hierarchy, which this
    patch also provides.
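
    The shape of the fix can be sketched in the same kind of user-space
    model.  Again, struct toy_node and the toy_*() functions are
    hypothetical illustrations under simplifying assumptions (no locking,
    one leaf below the root); they are not the patch's actual code.  The
    point is only that one shared helper is called from both the
    reader-exit path and the offline path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an rcu_node structure (NOT the kernel's definition). */
struct toy_node {
	unsigned long qsmask;    /* children/CPUs still owing a quiescent state */
	int blocked;             /* tasks blocking the current grace period */
	unsigned long grpmask;   /* this node's bit in parent->qsmask */
	struct toy_node *parent; /* NULL for the root */
};

/* Hypothetical analogue of the shared helper: once neither a CPU bit
 * nor a blocked task holds up this node, clear its bit in the parent
 * and repeat one level up. */
static void toy_task_quiet(struct toy_node *n)
{
	while (n->qsmask == 0 && n->blocked == 0 && n->parent) {
		n->parent->qsmask &= ~n->grpmask;
		n = n->parent;
	}
}

/* Reader-exit path: drop one blocked task, then try to report. */
static void toy_reader_unblock(struct toy_node *n)
{
	assert(n->blocked > 0);
	n->blocked--;
	toy_task_quiet(n);
}

/* Offline path with the fix: requeue on the root, then report for the
 * leaf that has just been emptied. */
static void toy_offline_tasks_fixed(struct toy_node *leaf,
				    struct toy_node *root)
{
	root->blocked += leaf->blocked;
	leaf->blocked = 0;
	toy_task_quiet(leaf);
}

/* The grace period can end once the root has nothing outstanding. */
static bool toy_gp_can_end(const struct toy_node *root)
{
	return root->qsmask == 0 && root->blocked == 0;
}

/* Start from the state after step 5 (all leaf CPU bits clear, one
 * reader queued on the leaf) and replay the last-CPU offline. */
static bool toy_replay_fixed(void)
{
	struct toy_node root = { .qsmask = 0x1, .blocked = 0,
				 .grpmask = 0, .parent = NULL };
	struct toy_node leaf = { .qsmask = 0, .blocked = 1,
				 .grpmask = 0x1, .parent = &root };

	toy_offline_tasks_fixed(&leaf, &root);
	if (toy_gp_can_end(&root))
		return false;        /* must still wait for the reader */

	toy_reader_unblock(&root);   /* reader leaves its section */
	return toy_gp_can_end(&root);
}
```

    In this model toy_replay_fixed() returns true: the leaf's bit in the
    root is cleared at offline time, the requeued reader alone holds up
    the root, and the grace period can end as soon as that reader exits.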
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference: <12589088301770-git-send-email->
    Signed-off-by: Ingo Molnar <mingo@elte.hu>