• Filipe Manana's avatar
    Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
    Filipe Manana authored
    In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
    after fsync log replay"), on fsync,  we started to always log all prealloc
    extents beyond an inode's i_size in order to avoid losing them after a
    power failure. However under some cases this can lead to the log replay
    code to create duplicate extent items, with different lengths, in the
    extent tree. That happens because, as of that commit, we can now log
    extent items based on extent maps that are not on the "modified" list
    of extent maps of the inode's extent map tree. Logging extent items based
    on extent maps is used during the fast fsync path to save time and for
    this to work reliably it requires that the extent maps are not merged
    with other adjacent extent maps - having the extent maps in the list
    of modified extents gives such guarantee.
    
    Consider the following example, captured during a long run of fsstress,
    which illustrates this problem.
    
    We have inode 271, in the filesystem tree (root 5), for which all of the
    following operations and discussion apply to.
    
    A buffered write starts at offset 312391 with a length of 933471 bytes
    (end offset at 1245862). At this point we have, for this inode, the
    following extent maps with the their field values:
    
    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
          block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
          block_len 376832, orig_block_len 376832
    em C, start 417792, orig_start 417792, len 782336, block_start
          18446744073709551613, block_len 0, orig_block_len 0
    em D, start 1200128, orig_start 1200128, len 835584, block_start
          1106776064, block_len 835584, orig_block_len 835584
    em E, start 2035712, orig_start 2035712, len 245760, block_start
          1107611648, block_len 245760, orig_block_len 245760
    
    Extent map A corresponds to a hole and extent maps D and E correspond to
    preallocated extents.
    
    Extent map D ends where extent map E begins (1106776064 + 835584 =
    1107611648), but these extent maps were not merged because they are in
    the inode's list of modified extent maps.
    
    An fsync against this inode is made, which triggers the fast path
    (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
    of the data previously written using buffered IO, and when the respective
    ordered extent finishes, btrfs_drop_extents() is called against the
    (aligned) range 311296..1249279. This causes a split of extent map D at
    btrfs_drop_extent_cache(), replacing extent map D with a new extent map
    D', also added to the list of modified extents,  with the following
    values:
    
    em D', start 1249280, orig_start of 1200128,
           block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
           orig_block_len 835584,
           block_len 786432 (835584 - (1249280 - 1200128))
    
    Then, during the fast fsync, btrfs_log_changed_extents() is called and
    extent maps D' and E are removed from the list of modified extents. The
    flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
    clear_em_logging() is called on each of them, and that makes extent map E
    to be merged with extent map D' (try_merge_map()), resulting in D' being
    deleted and E adjusted to:
    
    em E, start 1249280, orig_start 1200128, len 1032192,
          block_start 1106825216, block_len 1032192,
          orig_block_len 245760
    
    A direct IO write at offset 1847296 and length of 360448 bytes (end offset
    at 2207744) starts, and at that moment the following extent maps exist for
    our inode:
    
    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
          block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
          block_len 270336, orig_block_len 376832
    em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
          block_len 937984, orig_block_len 937984
    em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
          block_start 1106825216, block_len 1032192, orig_block_len 245760
    
    The dio write results in drop_extent_cache() being called twice. The first
    time for a range that starts at offset 1847296 and ends at offset 2035711
    (length of 188416), which results in a double split of extent map E,
    replacing it with two new extent maps:
    
    em F, start 1249280, orig_start 1200128, block_start 1106825216,
          block_len 598016, orig_block_len 598016
    em G, start 2035712, orig_start 1200128, block_start 1107611648,
          block_len 245760, orig_block_len 1032192
    
    It also creates a new extent map that represents a part of the requested
    IO (through create_io_em()):
    
    em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
    
    The second call to drop_extent_cache() has a range with a start offset of
    2035712 and end offset of 2207743 (length of 172032). This leads to
    replacing extent map G with a new extent map I with the following values:
    
    em I, start 2207744, orig_start 1200128, block_start 1107783680,
          block_len 73728, orig_block_len 1032192
    
    It also creates a new extent map that represents the second part of the
    requested IO (through create_io_em()):
    
    em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
    
    The dio write set the inode's i_size to 2207744 bytes.
    
    After the dio write the inode has the following extent maps:
    
    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
          block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
          block_len 270336, orig_block_len 376832
    em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
          block_len 937984, orig_block_len 937984
    em F, start 1249280, orig_start 1200128, len 598016,
          block_start 1106825216, block_len 598016, orig_block_len 598016
    em H, start 1847296, orig_start 1200128, len 188416,
          block_start 1107423232, block_len 188416, orig_block_len 835584
    em J, start 2035712, orig_start 2035712, len 172032,
          block_start 1107611648, block_len 172032, orig_block_len 245760
    em I, start 2207744, orig_start 1200128, len 73728,
          block_start 1107783680, block_len 73728, orig_block_len 1032192
    
    Now do some change to the file, like adding a xattr for example and then
    fsync it again. This triggers a fast fsync path, and as of commit
    471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
    log replay"), we use the extent map I to log a file extent item because
    it's a prealloc extent and it starts at an offset matching the inode's
    i_size. However when we log it, we create a file extent item with a value
    for the disk byte location that is wrong, as can be seen from the
    following output of "btrfs inspect-internal dump-tree":
    
     item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
         generation 22 type 2 (prealloc)
         prealloc data disk byte 1106776064 nr 1032192
         prealloc data offset 1007616 nr 73728
    
    Here the disk byte value corresponds to calculation based on some fields
    from the extent map I:
    
      1106776064 = block_start (1107783680) - 1007616 (extent_offset)
      extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
    
    The disk byte value of 1106776064 clashes with disk byte values of the
    file extent items at offsets 1249280 and 1847296 in the fs tree:
    
            item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                    generation 20 type 2 (prealloc)
                    prealloc data disk byte 1106776064 nr 835584
                    prealloc data offset 49152 nr 598016
            item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                    generation 20 type 1 (regular)
                    extent data disk byte 1106776064 nr 835584
                    extent data offset 647168 nr 188416 ram 835584
                    extent compression 0 (none)
            item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                    generation 20 type 1 (regular)
                    extent data disk byte 1107611648 nr 245760
                    extent data offset 0 nr 172032 ram 245760
                    extent compression 0 (none)
            item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                    generation 20 type 2 (prealloc)
                    prealloc data disk byte 1107611648 nr 245760
                    prealloc data offset 172032 nr 73728
    
    Instead of the disk byte value of 1106776064, the value of 1107611648
    should have been logged. Also the data offset value should have been
    172032 and not 1007616.
    After a log replay we end up getting two extent items in the extent tree
    with different lengths, one of 835584, which is correct and existed
    before the log replay, and another one of 1032192 which is wrong and is
    based on the logged file extent item:
    
     item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
        refs 2 gen 15 flags DATA
        extent data backref root 5 objectid 271 offset 1200128 count 2
     item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
        refs 1 gen 22 flags DATA
        extent data backref root 5 objectid 271 offset 1200128 count 1
    
    Obviously this leads to many problems and a filesystem check reports many
    errors:
    
     (...)
     checking extents
     Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
     extent item 1106776064 has multiple extent items
     ref mismatch on [1106776064 835584] extent item 2, found 3
     Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
     Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
     Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
     Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
     backpointer mismatch on [1106776064 835584]
     checking free space cache
     block group 1103101952 has wrong amount of free space
     failed to load free space cache for block group 1103101952
     checking fs roots
     (...)
    
    So fix this by logging the prealloc extents beyond the inode's i_size
    based on searches in the subvolume tree instead of the extent maps.
    
    Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    31d11b83
tree-log.c 163 KB