• Filipe Manana's avatar
    btrfs: fix race between reflinking and ordered extent completion · d4597898
    Filipe Manana authored
    While doing a reflink operation, if an ordered extent for a file range
    that does not overlap with the source and destination ranges of the
    reflink operation happens, we can end up having a failure in the reflink
    operation and return -EINVAL to user space.
    
    The following sequence of steps explains how this can happen:
    
    1) We have the page at file offset 315392 dirty (under delalloc);
    
    2) A reflink operation for this file starts, using the same file as both
       source and destination, the source range is [372736, 409600) (length of
       36864 bytes) and the destination range is [208896, 245760);
    
    3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
       and destination ranges, and wait for any ordered extents in those range
       to complete;
    
    4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
       the inode, but we neither wait for it to complete nor any ordered
       extents to complete. This results in starting delalloc for the page at
       file offset 315392 and creating an ordered extent for that single page
       range;
    
    5) We then move to btrfs_clone() and enter the loop to find file extent
       items to copy from the source range to destination range;
    
    6) In the first iteration we end up at last file extent item stored in
       leaf A:
    
       (...)
       item 131 key (143616 108 315392) itemoff 5101 itemsize 53
                extent data disk bytenr 1903988736 nr 73728
                extent data offset 12288 nr 61440 ram 73728
    
       This represents the file range [315392, 376832), which overlaps with
       the source range to clone.
    
       @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
       is therefore set to 376832 (315392 + 61440).
    
       @off (372736) is > key.offset (315392), so @new_key.offset is set to
       the value of @destoff (208896).
    
       @new_key.offset == @last_dest_end (208896) so @drop_start is set to
       208896 (@new_key.offset).
    
       @datal is adjusted to 4096, as @off is > @key.offset.
    
       So in this iteration we call btrfs_replace_file_extents() for the range
       [208896, 212991] (a single page, which is
       [@drop_start, @new_key.offset + @datal - 1]).
    
       @last_dest_end is set to 212992 (@new_key.offset + @datal =
       208896 + 4096 = 212992).
    
       Before the next iteration of the loop, @key.offset is set to the value
       376832, which is @next_key_min_offset;
    
    7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
       but this time pointing beyond the last slot of leaf A, as that's where
       a key with offset 376832 should be at if it existed. So end up calling
       btrfs_next_leaf();
    
    8) btrfs_next_leaf() releases the path, but before it searches again the
       tree for the next key/leaf, the ordered extent for the single page
       range at file offset 315392 completes. That results in trimming the
       file extent item we processed before, adjusting its key offset from
       315392 to 319488, reducing its length from 61440 to 57344 and inserting
       a new file extent item for that single page range, with a key offset of
       315392 and a length of 4096.
    
       Leaf A now looks like:
    
         (...)
         item 132 key (143616 108 315392) itemoff 4995 itemsize 53
                  extent data disk bytenr 1801666560 nr 4096
                  extent data offset 0 nr 4096 ram 4096
         item 133 key (143616 108 319488) itemoff 4942 itemsize 53
                  extent data disk bytenr 1903988736 nr 73728
                  extent data offset 16384 nr 57344 ram 73728
    
    9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
       at slot 133, since it's the first key that follows what was the last
       key we saw (143616 108 315392). In fact it's the same item we processed
       before, but its key offset was changed, so it counts as a new key;
    
    10) So now we have:
    
        @key.offset == 319488
        @datal == 57344
    
        @off (372736) is > key.offset (319488), so @new_key.offset is set to
        208896 (@destoff value).
    
        @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
        is set to 212992 (@last_dest_end value).
    
        @datal is adjusted to 4096 because @off > @key.offset.
    
        So in this iteration we call btrfs_replace_file_extents() for the
        invalid range of [212992, 212991] (which is
        [@drop_start, @new_key.offset + @datal - 1]).
    
        This range is empty, the end offset is smaller than the start offset
        so btrfs_replace_file_extents() returns -EINVAL, which we end up
        returning to user space and fail the reflink operation.
    
        This all happens because the range of this file extent item was
        already processed in the previous iteration.
    
    This scenario can be triggered very sporadically by fsx from fstests, for
    example with test case generic/522.
    
    So fix this by having btrfs_clone() skip file extent items that cover a
    file range that we have already processed.
    
    CC: stable@vger.kernel.org # 5.10+
    Reviewed-by: default avatarBoris Burkov <boris@bur.io>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    d4597898
reflink.c 27.9 KB