• Filipe Manana's avatar
    Btrfs: fix incorrect file size after shrinking truncate and fsync · 22dcb30f
    Filipe Manana authored
    commit bf504110 upstream.
    
    If we do a shrinking truncate against an inode which is already present
    in the respective log tree and then rename it, as part of logging the new
    name we end up logging an inode item that reflects the old size of the
    file (the one which we previously logged) and not the new smaller size.
    The decision to preserve the size previously logged was added by commit
    1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
    inode") in order to avoid data loss after replaying the log. However that
    decision is only needed for the case the logged inode size is smaller then
    the current size of the inode, as explained in that commit's change log.
    If the current size of the inode is smaller then the previously logged
    size, we know a shrinking truncate happened and therefore need to use
    that smaller size.
    
    Example to trigger the problem:
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
    
      $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
      $ xfs_io -c "fsync" /mnt/foo
      $ xfs_io -c "truncate 3000" /mnt/foo
    
      $ mv /mnt/foo /mnt/bar
      $ xfs_io -c "fsync" /mnt/bar
    
      <power failure>
    
      $ mount /dev/sdb /mnt
      $ od -t x1 -A d /mnt/bar
      0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
      *
      0008000
    
    Once we rename the file, we log its name (and inode item), and because
    the inode was already logged before in the current transaction, we log it
    with a size of 8000 bytes because that is the size we previously logged
    (with the first fsync). As part of the rename, besides logging the inode,
    we do also sync the log, which is done since commit d4682ba0
    ("Btrfs: sync log after logging new name"), so the next fsync against our
    inode is effectively a no-op, since no new changes happened since the
    rename operation. Even if did not sync the log during the rename
    operation, the same problem (fize size of 8000 bytes instead of 3000
    bytes) would be visible after replaying the log if the log ended up
    getting synced to disk through some other means, such as for example by
    fsyncing some other modified file. In the example above the fsync after
    the rename operation is there just because not every filesystem may
    guarantee logging/journalling the inode (and syncing the log/journal)
    during the rename operation, for example it is needed for f2fs, but not
    for ext4 and xfs.
    
    Fix this scenario by, when logging a new name (which is triggered by
    rename and link operations), using the current size of the inode instead
    of the previously logged inode size.
    
    A test case for fstests follows soon.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
    CC: stable@vger.kernel.org # 4.4+
    Reported-by: default avatarSeulbae Kim <seulbae@gatech.edu>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    22dcb30f
tree-log.c 165 KB