    rcu: Check and report missed fqs timer wakeup on RCU stall · 683954e5
    Neeraj Upadhyay authored
    For a new grace period request, the RCU GP kthread transitions through
    the following states:
    
    a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS]
    
    The RCU_GP_WAIT_GPS state is where the GP kthread waits for a request
    for a new GP.  Once it receives a request (for example, when a new RCU
    callback is queued), the GP kthread transitions to RCU_GP_DONE_GPS.
    
    b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF]
    
    Grace period initialization starts in rcu_gp_init(), which records the
    start of the new GP in rcu_state.gp_seq and transitions to RCU_GP_ONOFF.
    
    c. [RCU_GP_ONOFF] -> [RCU_GP_INIT]
    
    The purpose of the RCU_GP_ONOFF state is to apply the online/offline
    information that was buffered for any CPUs that recently came online or
    went offline.  This state is maintained in per-leaf rcu_node bitmasks,
    with the buffered state in ->qsmaskinitnext and the state for the upcoming
    GP in ->qsmaskinit.  At the end of this RCU_GP_ONOFF state, each bit in
    ->qsmaskinit will correspond to a CPU that must pass through a quiescent
    state before the upcoming grace period is allowed to complete.
    
    However, a leaf rcu_node structure with an all-zeroes ->qsmaskinit
    cannot necessarily be ignored.  In preemptible RCU, there might well be
    tasks still in RCU read-side critical sections that were first preempted
    while running on one of the CPUs managed by this structure.  Such tasks
    will be queued on this structure's ->blkd_tasks list.  Only after this
    list fully drains can this leaf rcu_node structure be ignored, and even
    then only if none of its CPUs have come back online in the meantime.
    Once that happens, the ->qsmaskinit masks further up the tree will be
    updated to exclude this leaf rcu_node structure.
    
    Once the ->qsmaskinitnext and ->qsmaskinit fields have been updated
    as needed, the GP kthread transitions to RCU_GP_INIT.
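
    The leaf-level bookkeeping described for RCU_GP_ONOFF can be sketched
    roughly as follows.  This is a simplified illustration and not the
    kernel's code: struct leaf_sketch, nr_blkd_tasks (a counter standing
    in for the ->blkd_tasks list), and both helper functions are
    hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a leaf rcu_node structure. */
struct leaf_sketch {
	unsigned long qsmaskinitnext;	/* Buffered online/offline state. */
	unsigned long qsmaskinit;	/* CPUs the upcoming GP must wait on. */
	int nr_blkd_tasks;		/* Assumed stand-in for ->blkd_tasks. */
};

/* Snapshot the buffered CPU-hotplug state for the upcoming grace period. */
static void leaf_apply_onoff(struct leaf_sketch *leaf)
{
	leaf->qsmaskinit = leaf->qsmaskinitnext;
}

/*
 * A leaf may be ignored only if none of its CPUs are online and no
 * preempted readers remain queued on it.
 */
static bool leaf_can_be_ignored(const struct leaf_sketch *leaf)
{
	return leaf->qsmaskinit == 0 && leaf->nr_blkd_tasks == 0;
}
```

    In this sketch, a leaf whose CPUs have all gone offline still blocks
    the grace period for as long as nr_blkd_tasks is nonzero.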
    
    d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS]
    
    The purpose of the RCU_GP_INIT state is to copy each ->qsmaskinit to
    the ->qsmask field within each rcu_node structure.  This copying is done
    breadth-first from the root to the leaves.  Why not just copy directly
    from ->qsmaskinitnext to ->qsmask?  Because the ->qsmaskinitnext masks
    can change in the meantime as additional CPUs come online or go offline.
    Such changes would result in inconsistencies in the ->qsmask fields up and
    down the tree, which could in turn result in too-short grace periods or
    grace-period hangs.  These issues are avoided by snapshotting the leaf
    rcu_node structures' ->qsmaskinitnext fields into their ->qsmaskinit
    counterparts, generating a consistent set of ->qsmaskinit fields
    throughout the tree, and only then copying these consistent ->qsmaskinit
    fields to their ->qsmask counterparts.
    
    Once this initialization step is complete, the GP kthread transitions
    to RCU_GP_WAIT_FQS, where it waits to do a force-quiescent-state scan
    on the one hand or for the end of the grace period on the other.
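
    The copy performed in RCU_GP_INIT might be pictured as below.  This
    is a minimal sketch that assumes the rcu_node tree is stored as an
    array in breadth-first order with the root at index 0; struct
    node_sketch and gp_init_copy_masks() are hypothetical names, and the
    locking that the real code needs is omitted.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an rcu_node structure. */
struct node_sketch {
	unsigned long qsmaskinit;	/* Consistent snapshot for this GP. */
	unsigned long qsmask;		/* Bits still awaiting quiescent states. */
};

/*
 * Copy each ->qsmaskinit to ->qsmask.  Because nodes[] is assumed to be
 * laid out breadth-first, walking the array in index order visits the
 * root before any of the leaves.
 */
static void gp_init_copy_masks(struct node_sketch *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		nodes[i].qsmask = nodes[i].qsmaskinit;
}
```

    The point of the sketch is that ->qsmask is derived only from the
    already-consistent ->qsmaskinit snapshot, never from ->qsmaskinitnext.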
    
    e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS]
    
    The RCU_GP_WAIT_FQS state waits for one of three things:  (1) An
    explicit request to do a force-quiescent-state scan, (2) The end of
    the grace period, or (3) A short interval of time, after which it
    will do a force-quiescent-state (FQS) scan.  The explicit request can
    come from rcutorture or from any CPU that has too many RCU callbacks
    queued (see the qhimark kernel parameter and the RCU_GP_FLAG_OVLD
    flag).  The aforementioned "short interval of time" is specified by the
    jiffies_till_first_fqs boot parameter for a given grace period's first
    FQS scan and by the jiffies_till_next_fqs boot parameter for later FQS
    scans.
    
    Either way, once the wait is over, the GP kthread transitions to
    RCU_GP_DOING_FQS.
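
    The choice of wait interval can be sketched as follows.  The two
    field names mirror the boot parameters named above, but struct
    fqs_params_sketch and fqs_wait_jiffies() are hypothetical and only
    illustrate the selection rule.

```c
#include <stdbool.h>

/* Hypothetical container for the two FQS-related boot parameters. */
struct fqs_params_sketch {
	unsigned long jiffies_till_first_fqs;
	unsigned long jiffies_till_next_fqs;
};

/*
 * Pick the timeout for the RCU_GP_WAIT_FQS wait: one interval for a
 * grace period's first FQS scan, another for all later scans.
 */
static unsigned long fqs_wait_jiffies(const struct fqs_params_sketch *p,
				      bool first_scan)
{
	return first_scan ? p->jiffies_till_first_fqs
			  : p->jiffies_till_next_fqs;
}
```

    The timeout only bounds the wait.  An explicit FQS request or the
    end of the grace period can end the wait sooner.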
    
    f. [RCU_GP_DOING_FQS] -> [RCU_GP_CLEANUP]
    
    The RCU_GP_DOING_FQS state performs an FQS scan.  Each such scan carries
    out two functions for any CPU whose bit is still set in its leaf rcu_node
    structure's ->qsmask field, that is, for any CPU that has not yet reported
    a quiescent state for the current grace period:
    
      i.  Report quiescent states on behalf of CPUs that have been observed
          to be idle (from an RCU perspective) since the beginning of the
          grace period.
    
      ii. If the current grace period is too old, take various actions to
          encourage holdout CPUs to pass through quiescent states, including
          enlisting the aid of any calls to cond_resched() and might_sleep(),
          and even including IPIing the holdout CPUs.
    
    These checks are skipped for any leaf rcu_node structure with an all-zero
    ->qsmask field.  However, such structures are subject to RCU priority
    boosting if there are tasks on a given structure blocking the current
    grace period.  The end of the grace period is detected when the root
    rcu_node structure's ->qsmask is zero and when there are no longer any
    preempted tasks blocking the current grace period.  (No, this last check
    is not redundant.  To see this, consider an rcu_node tree having exactly
    one structure that serves as both root and leaf.)
    
    Once the end of the grace period is detected, the GP kthread transitions
    to RCU_GP_CLEANUP.
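
    To make the end-of-grace-period check concrete, here is a minimal
    sketch for the single-node tree mentioned above, where one structure
    is both root and leaf.  All names are hypothetical, the idle check is
    reduced to a caller-supplied array, and step ii (encouraging holdout
    CPUs) is left out.

```c
#include <stdbool.h>

#define NCPUS_SKETCH 4

/* Hypothetical node that serves as both root and leaf. */
struct single_node_sketch {
	unsigned long qsmask;		/* CPUs that have not yet reported. */
	int nr_blocking_tasks;		/* Preempted readers blocking the GP. */
};

/*
 * Report quiescent states on behalf of holdout CPUs observed to be
 * idle.  Returns the number of CPUs reported during this scan.
 */
static int fqs_scan(struct single_node_sketch *rnp,
		    const bool cpu_idle[NCPUS_SKETCH])
{
	int reported = 0;

	for (int cpu = 0; cpu < NCPUS_SKETCH; cpu++) {
		unsigned long bit = 1UL << cpu;

		if ((rnp->qsmask & bit) && cpu_idle[cpu]) {
			rnp->qsmask &= ~bit;
			reported++;
		}
	}
	return reported;
}

/* The GP ends only when no CPU and no preempted task still blocks it. */
static bool gp_ended(const struct single_node_sketch *root)
{
	return root->qsmask == 0 && root->nr_blocking_tasks == 0;
}
```

    With a single node, ->qsmask can reach zero while a preempted reader
    is still queued, which is why the second condition in gp_ended() is
    not redundant.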
    
    g. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED]
    
    The RCU_GP_CLEANUP state marks the end of the grace period by updating
    the rcu_state structure's ->gp_seq field and also all rcu_node structures'
    ->gp_seq fields.  As before, the rcu_node tree is traversed in breadth-first
    order.  Once this update is complete, the GP kthread transitions
    to the RCU_GP_CLEANED state.
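
    One way to picture the ->gp_seq update is a counter whose low-order
    bits say whether a grace period is in progress.  The sketch below
    assumes a two-bit state field, which is believed to match the
    kernel's encoding but should be checked against kernel/rcu/rcu.h.
    All names here are hypothetical.

```c
/* Assumed layout: low two bits hold the state, the rest hold a counter. */
#define SEQ_CTR_SHIFT_SKETCH	2
#define SEQ_STATE_MASK_SKETCH	((1UL << SEQ_CTR_SHIFT_SKETCH) - 1)

/* Mark the start of a grace period: state goes from 0 to 1. */
static void seq_start_sketch(unsigned long *sp)
{
	*sp += 1;
}

/* Mark the end: clear the state bits and advance the counter by one. */
static void seq_end_sketch(unsigned long *sp)
{
	*sp = (*sp | SEQ_STATE_MASK_SKETCH) + 1;
}

/* Nonzero while a grace period is in progress. */
static unsigned long seq_state_sketch(unsigned long s)
{
	return s & SEQ_STATE_MASK_SKETCH;
}

/* Number of grace periods completed so far. */
static unsigned long seq_ctr_sketch(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT_SKETCH;
}
```

    Under this assumed encoding, a stalled grace period shows up as a
    sequence value whose state bits stay nonzero.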
    
    h. [RCU_GP_CLEANED] -> [RCU_GP_INIT]
    
    Once in the RCU_GP_CLEANED state, the GP kthread immediately transitions
    into the RCU_GP_INIT state.
    
    i. The role of timers.
    
    If there is at least one idle CPU, and if timers are not firing, the
    transition from RCU_GP_DOING_FQS to RCU_GP_CLEANUP will never happen.
    Timers can fail to fire for a number of reasons, including issues in
    timer configuration, issues in the timer framework, and failure to handle
    softirqs (for example, when there is a storm of interrupts).  Whatever the
    reason, if the timers fail to fire, the GP kthread will never be awakened,
    resulting in RCU CPU stall warnings and eventually in OOM.
    
    However, an RCU CPU stall warning has a large number of potential causes,
    as documented in Documentation/RCU/stallwarn.rst.  This commit therefore
    adds analysis to the RCU CPU stall-warning code to emit an additional
    message if the cause of the stall is likely to be timer failure.
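
    The kind of check involved can be sketched as below.  This is an
    illustration of the idea only, not the code added to tree_stall.h:
    the structure, its fields, the slack parameter, and both functions
    are hypothetical names.

```c
#include <stdbool.h>

/* States of the GP kthread, in the order described above. */
enum gp_state_sketch {
	GP_WAIT_GPS, GP_DONE_GPS, GP_ONOFF, GP_INIT,
	GP_WAIT_FQS, GP_DOING_FQS, GP_CLEANUP, GP_CLEANED
};

/* Hypothetical snapshot taken by the stall-warning code. */
struct gp_snapshot_sketch {
	enum gp_state_sketch state;
	unsigned long now;		/* Current time, in jiffies. */
	unsigned long fqs_deadline;	/* When the FQS wait should have ended. */
	bool kthread_runnable;		/* Has the GP kthread been awakened? */
};

/* Wraparound-tolerant "a is after b" comparison for jiffies-like values. */
static bool after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Suspect a missed timer wakeup if the GP kthread is still waiting in
 * the FQS wait state well past its deadline and was never awakened.
 */
static bool fqs_timer_wakeup_missed(const struct gp_snapshot_sketch *s,
				    unsigned long slack)
{
	return s->state == GP_WAIT_FQS &&
	       after_sketch(s->now, s->fqs_deadline + slack) &&
	       !s->kthread_runnable;
}
```

    The slack argument keeps the sketch from flagging a wakeup that is
    merely a little late.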
    
    Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
tree_stall.h 26.4 KB