• Filipe Manana's avatar
    btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3
    Filipe Manana authored
    When we have an inode with a prealloc extent that starts at an offset
    lower than the i_size and there is another prealloc extent that starts at
    an offset beyond i_size, we can end up losing part of the first prealloc
    extent (the part that starts at i_size) and have an implicit hole if we
    fsync the file and then have a power failure.
    
    Consider the following example with comments explaining how and why it
    happens.
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
    
      # Create our test file with 2 consecutive prealloc extents, each with a
      # size of 128Kb, and covering the range from 0 to 256Kb, with a file
      # size of 0.
      $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
      $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
    
      # Fsync the file to record both extents in the log tree.
      $ xfs_io -c "fsync" /mnt/foo
    
      # Now do a redudant extent allocation for the range from 0 to 64Kb.
      # This will merely increase the file size from 0 to 64Kb. Instead we
      # could also do a truncate to set the file size to 64Kb.
      $ xfs_io -c "falloc 0 64K" /mnt/foo
    
      # Fsync the file, so we update the inode item in the log tree with the
      # new file size (64Kb). This also ends up setting the number of bytes
      # for the first prealloc extent to 64Kb. This is done by the truncation
      # at btrfs_log_prealloc_extents().
      # This means that if a power failure happens after this, a write into
      # the file range 64Kb to 128Kb will not use the prealloc extent and
      # will result in allocation of a new extent.
      $ xfs_io -c "fsync" /mnt/foo
    
      # Now set the file size to 256K with a truncate and then fsync the file.
      # Since no changes happened to the extents, the fsync only updates the
      # i_size in the inode item at the log tree. This results in an implicit
      # hole for the file range from 64Kb to 128Kb, something which fsck will
      # complain when not using the NO_HOLES feature if we replay the log
      # after a power failure.
      $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
    
    So instead of always truncating the log to the inode's current i_size at
    btrfs_log_prealloc_extents(), check first if there's a prealloc extent
    that starts at an offset lower than the i_size and with a length that
    crosses the i_size - if there is one, just make sure we truncate to a
    size that corresponds to the end offset of that prealloc extent, so
    that we don't lose the part of that extent that starts at i_size if a
    power failure happens.
    
    A test case for fstests follows soon.
    
    Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    f135cea3
tree-log.c 172 KB