    Btrfs: fix fsync data loss after append write · e4545de5
    Filipe Manana authored
    If we do an append write to a file (which increases its inode's i_size)
    that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
    and the previous transaction added a new hard link to the file, which sets
    the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
    the file, the inode's new i_size isn't logged. This has the consequence
    that after the fsync log is replayed, the file size remains what it was
    before the append write operation, which means users/applications will
    not be able to read the data that was successfully fsync'ed before.
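
    The user-visible sequence of operations that triggers the problem can also
    be written as plain syscalls. The sketch below is purely illustrative (the
    /mnt/scratch mount point is an assumption); the fstests case further down
    is the actual reproducer, since it also simulates the power failure with
    dm-flakey:

      #include <fcntl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
              char buf[32 * 1024];
              int fd = open("/mnt/scratch/foo", O_CREAT | O_RDWR, 0644);

              if (fd < 0)
                      exit(1);

              /* Initial data plus fsync; the fsync clears
               * BTRFS_INODE_NEEDS_FULL_SYNC from the btrfs inode. */
              memset(buf, 0xaa, sizeof(buf));
              pwrite(fd, buf, sizeof(buf), 0);
              fsync(fd);
              sync();

              /* The hard link sets BTRFS_INODE_COPY_EVERYTHING on the inode. */
              link("/mnt/scratch/foo", "/mnt/scratch/bar");

              /* Force a commit of the current transaction. */
              sync();

              /* Append write that grows i_size, followed by fsync. A power
               * loss right after this fsync is where unpatched kernels lose
               * the new i_size: after log replay the file is 32K again. */
              memset(buf, 0xbb, sizeof(buf));
              pwrite(fd, buf, sizeof(buf), 32 * 1024);
              fsync(fd);

              close(fd);
              return 0;
      }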
    
    This happens because neither the inode item nor the delayed inode gets
    its i_size updated when the append write is made - doing so would
    require starting a transaction in the buffered write path, something we
    intentionally do not do for performance reasons.
    
    Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
    set, the inode is logged with its current i_size (that is, log the
    in-memory inode into the log tree).
    
    This issue is not a recent regression and is easy to reproduce with the
    following test case for fstests:
    
      seq=`basename $0`
      seqres=$RESULT_DIR/$seq
      echo "QA output created by $seq"
    
      here=`pwd`
      tmp=/tmp/$$
      status=1	# failure is the default!
    
      _cleanup()
      {
              _cleanup_flakey
              rm -f $tmp.*
      }
      trap "_cleanup; exit \$status" 0 1 2 3 15
    
      # get standard environment, filters and checks
      . ./common/rc
      . ./common/filter
      . ./common/dmflakey
    
      # real QA test starts here
      _supported_fs generic
      _supported_os Linux
      _need_to_be_root
      _require_scratch
      _require_dm_flakey
      _require_metadata_journaling $SCRATCH_DEV
    
      _crash_and_mount()
      {
              # Simulate a crash/power loss.
              _load_flakey_table $FLAKEY_DROP_WRITES
              _unmount_flakey
              # Allow writes again and mount. This makes the fs replay its fsync log.
              _load_flakey_table $FLAKEY_ALLOW_WRITES
              _mount_flakey
      }
    
      rm -f $seqres.full
    
      _scratch_mkfs >> $seqres.full 2>&1
      _init_flakey
      _mount_flakey
    
      # Create the test file with some initial data and then fsync it.
      # The fsync here is only needed to trigger the issue in btrfs, as it causes
      # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                      -c "fsync" \
                      $SCRATCH_MNT/foo | _filter_xfs_io
      sync
    
      # Add a hard link to our file.
      # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
      # which is a necessary condition to trigger the issue.
      ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
    
      # Sync the filesystem to force a commit of the current btrfs transaction, this
      # is a necessary condition to trigger the bug on btrfs.
      sync
    
      # Now append more data to our file, increasing its size, and fsync the file.
      # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
      # write path did not update the inode item in the btree nor the delayed inode
      # item (in-memory structure) in the current transaction (created by the fsync
      # handler), the fsync did not record the inode's new i_size in the fsync
      # log/journal. This made the data unavailable after the fsync log/journal is
      # replayed.
      $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
                   -c "fsync" \
                   $SCRATCH_MNT/foo | _filter_xfs_io
    
      echo "File content after fsync and before crash:"
      od -t x1 $SCRATCH_MNT/foo
    
      _crash_and_mount
    
      echo "File content after crash and log replay:"
      od -t x1 $SCRATCH_MNT/foo
    
      status=0
      exit
    
    The expected file content, both before and after the crash/power failure,
    has the appended data available:
    
      0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
      *
      0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
      *
      0200000
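
    Note that the od offsets are octal, so 0100000 is 32 KiB and 0200000 is
    64 KiB. A minimal sketch for independently checking the replayed size
    (again assuming the /mnt/scratch mount point) is to stat the file after
    the remount:

      #include <stdio.h>
      #include <sys/stat.h>
      #include <sys/types.h>

      int main(void)
      {
              struct stat st;

              /* After log replay the file must still be 64 KiB; on unpatched
               * kernels it is back to the pre-append 32 KiB. */
              if (stat("/mnt/scratch/foo", &st) != 0)
                      return 1;
              printf("size after replay: %lld bytes (expected 65536)\n",
                     (long long)st.st_size);
              return st.st_size == 64 * 1024 ? 0 : 1;
      }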
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>