    xfs: intent item whiteouts · 0d227466
    Dave Chinner authored
    When we log modifications based on intents, we add both intent
    and intent done items to the modification being made. These get
    written to the log to ensure that the operation is re-run if the
    intent done is not found in the log.
    
    However, for operations that complete wholly within a single
    checkpoint, the change in the checkpoint is atomic and will never
    need replay. In this case, we don't need to actually write the
    intent and intent done items to the journal because log recovery
    will never need to manually restart this modification.
    
    Log recovery currently handles intent/intent done matching by
    inserting the intent into the AIL, then removing it when a matching
    intent done item is found. Hence for all the intent-based operations
    that complete within a checkpoint, we spend all that time parsing
    the intent/intent done items just to cancel them and do nothing with
    them.
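
    The matching described above can be modelled in a few lines of
    userspace C. This is only a toy sketch, not the kernel's recovery
    code: struct log_rec, struct pending_set and recover_scan() are
    hypothetical names, and a flat array stands in for the AIL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record types seen while scanning the log. */
enum rec_type { REC_INTENT, REC_INTENT_DONE };

struct log_rec {
	enum rec_type	type;
	uint64_t	id;	/* ties an intent done to its intent */
};

#define MAX_PENDING 64

/* Toy stand-in for the AIL during recovery: a set of pending intent ids. */
struct pending_set {
	uint64_t	ids[MAX_PENDING];
	size_t		count;
};

static void pending_add(struct pending_set *p, uint64_t id)
{
	assert(p->count < MAX_PENDING);
	p->ids[p->count++] = id;
}

static void pending_remove(struct pending_set *p, uint64_t id)
{
	for (size_t i = 0; i < p->count; i++) {
		if (p->ids[i] == id) {
			/* Order does not matter, so fill the hole with the tail. */
			p->ids[i] = p->ids[--p->count];
			return;
		}
	}
}

/*
 * Scan records in log order: an intent is added to the pending set, a
 * matching intent done removes it again. Whatever is left at the end is
 * the set of operations that would need to be restarted.
 */
static size_t recover_scan(const struct log_rec *recs, size_t nrecs,
			   struct pending_set *p)
{
	p->count = 0;
	for (size_t i = 0; i < nrecs; i++) {
		if (recs[i].type == REC_INTENT)
			pending_add(p, recs[i].id);
		else
			pending_remove(p, recs[i].id);
	}
	return p->count;
}
```

    In this model, an intent whose done record appears in the same scan
    costs one add and one remove and leaves no trace in the result,
    which is the wasted work the paragraph above refers to.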
    
    Hence it follows that the only time we actually need intents in the
    log is when the modification crosses checkpoint boundaries in the
    log and so may only be partially complete in the journal. Hence if
    we commit an intent done item to the CIL and the intent item is in
    the same checkpoint, we don't actually have to write them to the
    journal because log recovery will always cancel the intents.
    
    We've never really worried about the overhead of logging intents
    unnecessarily like this because the intents we log are generally
    very much smaller than the change being made. For example, freeing
    an extent involves modifying at least two freespace btree blocks and
    the AGF,
    so the EFI/EFD overhead is only a small increase in space and
    processing time compared to the overall cost of freeing an extent.
    
    However, delayed attributes change this cost equation dramatically,
    especially for inline attributes. In the case of adding an inline
    attribute, we only log the inode core and attribute fork at present.
    With delayed attributes, we now log the attr intent which includes
    the name and value, the inode core and attr fork, and finally the
    attr intent done item. We increase the number of items we log from 1
    to 3, and the number of log vectors (regions) goes up from 3 to 7.
    Hence we triple the number of objects that the CIL has to process,
    and more than double the number of log vectors that need to be
    written to the journal.
    
    At scale, this means delayed attributes cause a non-pipelined CIL to
    become CPU bound processing all the extra items, resulting in a > 40%
    performance degradation on 16-way file+xattr create workloads.
    Pipelining the CIL (as per 5.15) reduces the performance degradation
    to 20%, but now the limitation is the rate at which the log items
    can be written to the iclogs and iclogs be dispatched for IO and
    completed.
    
    Even log IO completion is slowed down by these intents, because it
    now has to process 3x the number of items in the checkpoint.
    Processing completed intents is especially inefficient here, because
    we first insert the intent into the AIL, then remove it from the AIL
    when the intent done is processed. IOWs, we are also doing expensive
    operations in log IO completion we could completely avoid if we
    didn't log completed intent/intent done pairs.
    
    Enter log item whiteouts.
    
    When an intent done is committed, we can check to see if the
    associated intent is in the same checkpoint as we are currently
    committing the intent done to. If so, we can mark the intent log
    item with a whiteout and immediately free the intent done item
    rather than committing it to the CIL. We can basically skip the
    entire formatting and CIL insertion steps for the intent done item.
    
    However, we cannot remove the intent item from the CIL at this point
    because the unlocked per-cpu CIL item lists do not permit removal
    without holding the CIL context lock exclusively. Transaction commit
    only holds the context lock shared, hence the best we can do is mark
    the intent item with a whiteout so that the CIL push can release it
    rather than writing it to the log.
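
    As a rough illustration of the commit-time and push-time behaviour
    described above, here is a self-contained userspace model. It is a
    sketch under stated assumptions, not the code in this patch: struct
    item, struct cil, commit_intent(), commit_intent_done() and
    cil_push() are hypothetical names, locking is left out entirely, and
    a checkpoint is reduced to a sequence number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log item: just enough state to model whiteouts. */
struct item {
	bool		whiteout;	/* set on an intent that need not be written */
	uint64_t	ckpt_seq;	/* checkpoint it was committed to, 0 = none */
	struct item	*intent;	/* for an intent done: the intent it completes */
};

#define CIL_MAX 32

/* Hypothetical CIL: the items of the checkpoint currently being built. */
struct cil {
	uint64_t	seq;		/* current checkpoint sequence, starts at 1 */
	struct item	*items[CIL_MAX];
	size_t		count;
};

static void cil_insert(struct cil *cil, struct item *it)
{
	assert(cil->count < CIL_MAX);
	it->ckpt_seq = cil->seq;
	cil->items[cil->count++] = it;
}

/* Intents are always inserted; we cannot know yet whether they complete. */
static void commit_intent(struct cil *cil, struct item *intent)
{
	cil_insert(cil, intent);
}

/*
 * Returns true if the intent done was inserted into the CIL. Returns
 * false if the intent is in the same checkpoint: the intent is marked
 * with a whiteout and the caller can free the intent done immediately.
 * The intent itself stays on the list until the push.
 */
static bool commit_intent_done(struct cil *cil, struct item *done)
{
	struct item *intent = done->intent;

	if (intent && intent->ckpt_seq == cil->seq) {
		intent->whiteout = true;
		return false;
	}
	cil_insert(cil, done);
	return true;
}

/*
 * Push the checkpoint: items with a whiteout are released rather than
 * written. Returns the number of items that would reach the journal.
 */
static size_t cil_push(struct cil *cil)
{
	size_t written = 0;

	for (size_t i = 0; i < cil->count; i++) {
		if (!cil->items[i]->whiteout)
			written++;
	}
	cil->count = 0;
	cil->seq++;
	return written;
}
```

    In this model, committing an intent and its intent done before a
    push writes nothing, while pushing between the two commits writes
    the intent in one checkpoint and the intent done in the next.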
    
    This means we never write the intent to the log if the intent done
    has also been committed to the same checkpoint, but we'll always
    write the intent if the intent done has not been committed or has
    been committed to a different checkpoint. This will result in
    correct log recovery behaviour in all cases, without the overhead of
    logging unnecessary intents.
    
    This intent whiteout concept is generic - we can apply it to all
    intent/intent done pairs that have a direct 1:1 relationship. The
    way deferred ops iterate and relog intents means that all intents
    currently have a 1:1 relationship with their done intent, and hence
    we can apply this cancellation to all existing intent/intent done
    implementations.
    
    For delayed attributes with a 16-way 64kB xattr create workload,
    whiteouts reduce the amount of journalled metadata from ~2.5GB/s
    down to ~600MB/s and improve the creation rate from 9000/s to
    14000/s.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>