• Josef Bacik's avatar
    btrfs: check delayed refs when we're checking if a ref exists · 42fac187
    Josef Bacik authored
    In the patch 78c52d9e ("btrfs: check for refs on snapshot delete
    resume") I added some code to handle file systems that had been
    corrupted by a bug that incorrectly skipped updating the drop progress
    key while dropping a snapshot.  This code would check to see if we had
    already deleted our reference for a child block, and skip the deletion
    if we had already.
    
    Unfortunately there is a bug, as the check would only check the on-disk
    references.  I made an incorrect assumption that blocks in an already
    deleted snapshot that was having the deletion resume on mount wouldn't
    be modified.
    
    If we have 2 pending deleted snapshots that share blocks, we can easily
    modify the rules for a block.  Take the following example
    
    subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
    share references to block 1.  Block 1 will have 2 full references, one
    for subvolume a and one for subvolume b, and it belongs to subvolume a
    (btrfs_header_owner(block 1) == subvolume a).
    
    When deleting subvolume a, we will drop our full reference for block 1,
    and because we are the owner we will drop our full reference for all of
    block 1's children, convert block 1 to FULL BACKREF, and add a shared
    reference to all of block 1's children.
    
    Then we will start the snapshot deletion of subvolume b.  We look up the
    extent info for block 1, which checks delayed refs and tells us that
    FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
    because this is a resumed snapshot deletion, we call into
    check_ref_exists().  Because check_ref_exists() only looks at the disk,
    it doesn't find the shared backref for the child of block 1, and thus
    returns 0 and we skip deleting the reference for the child of block 1
    and continue.  This orphans the child of block 1.
    
    The fix is to lookup the delayed refs, similar to what we do in
    btrfs_lookup_extent_info().  However we only care about whether the
    reference exists or not.  If we fail to find our reference on disk, go
    look up the bytenr in the delayed refs, and if it exists look for an
    existing ref in the delayed ref head.  If that exists then we know we
    can delete the reference safely and carry on.  If it doesn't exist we
    know we have to skip over this block.
    
    This bug has existed since I introduced this fix, however requires
    having multiple deleted snapshots pending when we unmount.  We noticed
    this in production because our shutdown path stops the container on the
    system, which deletes a bunch of subvolumes, and then reboots the box.
    This gives us plenty of opportunities to hit this issue.  Looking at the
    history we've seen this occasionally in production, but we had a big
    spike recently thanks to faster machines getting jobs with multiple
    subvolumes in the job.
    
    Chris Mason wrote a reproducer which does the following
    
    mount /dev/nvme4n1 /btrfs
    btrfs subvol create /btrfs/s1
    simoop -E -f 4k -n 200000 -z /btrfs/s1
    while(true) ; do
    	btrfs subvol snap /btrfs/s1 /btrfs/s2
    	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
    	btrfs subvol snap /btrfs/s2 /btrfs/s3
    	btrfs balance start -dusage=80 /btrfs
    	btrfs subvol del /btrfs/s2 /btrfs/s3
    	umount /btrfs
    	btrfsck /dev/nvme4n1 || exit 1
    	mount /dev/nvme4n1 /btrfs
    done
    
    On the second loop this would fail consistently, with my patch it has
    been running for hours and hasn't failed.
    
    I also used dm-log-writes to capture the state of the failure so I could
    debug the problem.  Using the existing failure case to test my patch
    validated that it fixes the problem.
    
    Fixes: 78c52d9e ("btrfs: check for refs on snapshot delete resume")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    42fac187
extent-tree.c 181 KB