• Darrick J. Wong's avatar
    xfs: change the order in which child and parent defer ops are finished · 27dada07
    Darrick J. Wong authored
    The defer ops code has been finishing items in the wrong order -- if a
    top level defer op creates items A and B, and finishing item A creates
    more defer ops A1 and A2, we'll put the new items on the end of the
    chain and process them in the order A B A1 A2.  This is kind of weird,
    since it's convenient for programmers to be able to think of A and B as
    an ordered sequence where all the sub-tasks for A must finish before we
    move on to B, e.g. A A1 A2 D.
    
    Right now, our log intent items are not so complex that this matters,
    but this will become important for the atomic extent swapping patchset.
    In order to maintain correct reference counting of extents, we have to
    unmap and remap extents in that order, and we want to complete that work
    before moving on to the next range that the user wants to swap.  This
    patch fixes defer ops to satsify that requirement.
    
    The primary symptom of the incorrect order was noticed in an early
    performance analysis of the atomic extent swap code.  An astonishingly
    large number of deferred work items accumulated when userspace requested
    an atomic update of two very fragmented files.  The cause of this was
    traced to the same ordering bug in the inner loop of
    xfs_defer_finish_noroll.
    
    If the ->finish_item method of a deferred operation queues new deferred
    operations, those new deferred ops are appended to the tail of the
    pending work list.  To illustrate, say that a caller creates a
    transaction t0 with four deferred operations D0-D3.  The first thing
    defer ops does is roll the transaction to t1, leaving us with:
    
    t1: D0(t0), D1(t0), D2(t0), D3(t0)
    
    Let's say that finishing each of D0-D3 will create two new deferred ops.
    After finish D0 and roll, we'll have the following chain:
    
    t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
    
    d4 and d5 were logged to t1.  Notice that while we're about to start
    work on D1, we haven't actually completed all the work implied by D0
    being finished.  So far we've been careful (or lucky) to structure the
    dfops callers such that D1 doesn't depend on d4 or d5 being finished,
    but this is a potential logic bomb.
    
    There's a second problem lurking.  Let's see what happens as we finish
    D1-D3:
    
    t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
    t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
    t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    
    Let's say that d4-d11 are simple work items that don't queue any other
    operations, which means that we can complete each d4 and roll to t6:
    
    t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    ...
    t11: d10(t4), d11(t4)
    t12: d11(t4)
    <done>
    
    When we try to roll to transaction #12, we're holding defer op d11,
    which we logged way back in t4.  This means that the tail of the log is
    pinned at t4.  If the log is very small or there are a lot of other
    threads updating metadata, this means that we might have wrapped the log
    and cannot get roll to t11 because there isn't enough space left before
    we'd run into t4.
    
    Let's shift back to the original failure.  I mentioned before that I
    discovered this flaw while developing the atomic file update code.  In
    that scenario, we have a defer op (D0) that finds a range of file blocks
    to remap, creates a handful of new defer ops to do that, and then asks
    to be continued with however much work remains.
    
    So, D0 is the original swapext deferred op.  The first thing defer ops
    does is rolls to t1:
    
    t1: D0(t0)
    
    We try to finish D0, logging d1 and d2 in the process, but can't get all
    the work done.  We log a done item and a new intent item for the work
    that D0 still has to do, and roll to t2:
    
    t2: D0'(t1), d1(t1), d2(t1)
    
    We roll and try to finish D0', but still can't get all the work done, so
    we log a done item and a new intent item for it, requeue D0 a second
    time, and roll to t3:
    
    t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
    
    If it takes 48 more rolls to complete D0, then we'll finally dispense
    with D0 in t50:
    
    t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
    
    We then try to roll again to get a chain like this:
    
    t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
    ...
    t152: d102(t50)
    <done>
    
    Notice that in rolling to transaction #51, we're holding on to a log
    intent item for d1 that was logged in transaction #1.  This means that
    the tail of the log is pinned at t1.  If the log is very small or there
    are a lot of other threads updating metadata, this means that we might
    have wrapped the log and cannot roll to t51 because there isn't enough
    space left before we'd run into t1.  This is of course problem #2 again.
    
    But notice the third problem with this scenario: we have 102 defer ops
    tied to this transaction!  Each of these items are backed by pinned
    kernel memory, which means that we risk OOM if the chains get too long.
    
    Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
    future; problem #2 applies (rarely) to the current upstream, and problem
    #3 applies to work under development.
    
    This is not how incremental deferred operations were supposed to work.
    The dfops design of logging in the same transaction an intent-done item
    and a new intent item for the work remaining was to make it so that we
    only have to juggle enough deferred work items to finish that one small
    piece of work.  Deferred log item recovery will find that first
    unfinished work item and restart it, no matter how many other intent
    items might follow it in the log.  Therefore, it's ok to put the new
    intents at the start of the dfops chain.
    
    For the first example, the chains look like this:
    
    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)
    
    For the second example, the chains look like this:
    
    t1: D0(t0)
    t2: d1(t1), d2(t1), D0'(t1)
    t3: d2(t1), D0'(t1)
    t4: D0'(t1)
    t5: d1(t4), d2(t4), D0''(t4)
    ...
    t148: D0<50 primes>(t147)
    t149: d101(t148), d102(t148)
    t150: d102(t148)
    <done>
    
    This actually sucks more for pinning the log tail (we try to roll to t10
    while holding an intent item that was logged in t1) but we've solved
    problem #1.  We've also reduced the maximum chain length from:
    
        sum(all the new items) + nr_original_items
    
    to:
    
        max(new items that each original item creates) + nr_original_items
    
    This solves problem #3 by sharply reducing the number of defer ops that
    can be attached to a transaction at any given time.  The change makes
    the problem of log tail pinning worse, but is improvement we need to
    solve problem #2.  Actually solving #2, however, is left to the next
    patch.
    
    Note that a subsequent analysis of some hard-to-trigger reflink and COW
    livelocks on extremely fragmented filesystems (or systems running a lot
    of IO threads) showed the same symptoms -- uncomfortably large numbers
    of incore deferred work items and occasional stalls in the transaction
    grant code while waiting for log reservations.  I think this patch and
    the next one will also solve these problems.
    
    As originally written, the code used list_splice_tail_init instead of
    list_splice_init, so change that, and leave a short comment explaining
    our actions.
    Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
    27dada07
xfs_defer.c 21.6 KB