    xfs: don't use BMBT btree split workers for IO completion · c85007e2
    Dave Chinner authored
    When we split a BMBT due to record insertion, we offload it to a
    worker thread because we can be deep in the stack when we try to
    allocate a new block for the BMBT. Allocation can use several
    kilobytes of stack (full memory reclaim, swap and/or IO path can
    end up on the stack during allocation) and we can already be several
    kilobytes deep in the stack when we need to split the BMBT.
    
    A recent workload demonstrated a deadlock in this BMBT split
    offload. It requires several things to happen at once:
    
    1. Two inodes need a BMBT split at the same time: one must be for
    unwritten extent conversion from IO completion, the other must be
    from extent allocation.
    
    2. There must be no xfs_alloc_wq worker threads available in the
    worker pool.
    
    3. There must be sustained severe memory shortages such that new
    kworker threads cannot be allocated to the xfs_alloc_wq pool for
    both threads that need split work to be run.
    
    4. The split work from the unwritten extent conversion must run
    first.
    
    5. When the BMBT block allocation runs from the split work, it must
    loop over all AGs and, for each one, either fail to trylock the AGF
    or find that the AGF it locked has no space available for a single
    block allocation.
    
    6. The BMBT allocation must then attempt to lock the AGF that the
    second task queued to the rescuer thread already has locked before
    it finds an AGF it can allocate from.
    
    At this point, we have an ABBA deadlock between tasks queued on the
    xfs_alloc_wq rescuer thread and a locked AGF, i.e. the queued task
    holding the AGF lock can't be run by the rescuer thread until the
    task the rescuer thread is running gets the AGF lock....
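    
    The cycle described above can be sketched as a tiny wait-for model.
    This is an illustration only, not kernel code: the enum, struct and
    function names below are hypothetical, and the "inline" task is an
    assumption about how a split behaves when it is not queued to the
    worker pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the resources involved in the deadlock. */
enum resource { RES_NONE, RES_RESCUER, RES_AGF };

/* Hypothetical model of a task: what it holds and what it waits for. */
struct task {
	const char *name;
	enum resource holds;
	enum resource wants;
};

/*
 * Return true if the two tasks form an ABBA cycle: each one holds a
 * resource and waits for the resource the other one holds.
 */
static bool abba_deadlock(const struct task *a, const struct task *b)
{
	assert(a != NULL && b != NULL);
	if (a->holds == RES_NONE || b->holds == RES_NONE)
		return false;
	return a->wants == b->holds && b->wants == a->holds;
}

/* Conversion split running on the rescuer thread, waiting for the AGF. */
static const struct task conversion_split = {
	.name = "unwritten conversion split",
	.holds = RES_RESCUER,
	.wants = RES_AGF,
};

/* Allocation split queued behind the rescuer with the AGF already locked. */
static const struct task allocation_split = {
	.name = "allocation split",
	.holds = RES_AGF,
	.wants = RES_RESCUER,
};

/* Assumption: a split run inline occupies no worker pool resource. */
static const struct task inline_conversion_split = {
	.name = "inline conversion split",
	.holds = RES_NONE,
	.wants = RES_AGF,
};
```

    In this model the first two tasks wait on each other forever, while
    a conversion split that never occupies the rescuer thread cannot be
    part of the cycle.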
    
    This is a highly improbable series of events, but there it is.
    
    There are a couple of ways to fix this, but the easiest is to ensure
    that we only punt tasks with a locked AGF that holds enough space
    for the BMBT block allocations to the worker thread.
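    
    A minimal sketch of that rule, assuming the split code can tell
    whether the caller holds a locked AGF with space reserved. The
    struct and function names are hypothetical and do not reflect the
    actual fields or checks in xfs_btree.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state a BMBT split can inspect. */
struct split_ctx {
	bool agf_locked;     /* caller already holds a locked AGF */
	bool space_reserved; /* that AGF has room for the BMBT blocks */
};

/*
 * Decide whether a BMBT split should be punted to the worker thread.
 * Only callers that hold a locked AGF with enough space are offloaded;
 * everyone else runs the split on their own stack.
 */
static bool should_offload_split(const struct split_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->agf_locked && ctx->space_reserved;
}

/* Extent allocation: AGF locked and space reserved, so offload. */
static const struct split_ctx extent_allocation = {
	.agf_locked = true,
	.space_reserved = true,
};

/* Unwritten extent conversion in IO completion: no locked AGF. */
static const struct split_ctx io_completion = {
	.agf_locked = false,
	.space_reserved = false,
};
```

    Under this rule a task queued to the worker pool never has to go
    looking for another AGF, so it cannot block on an AGF held by a
    task queued behind it.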
    
    This works for unwritten extent conversion in IO completion (which
    doesn't have a locked AGF and space reservations) because we have
    tight control over the IO completion stack. It is typically only 6
    functions deep when xfs_btree_split() is called because we've
    already offloaded the IO completion work to a worker thread and
    hence we don't need to worry about stack overruns here.
    
    The other place we can be called for a BMBT split without a
    preceding allocation is __xfs_bunmapi() when punching out the
    center of an existing extent. We don't remove extents in the IO
    path, so these operations don't tend to be called with a lot of
    stack consumed. Hence we don't really need to ship the split off to
    a worker thread in these cases, either.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_btree.c 132 KB