    Btrfs: fix rare chances for data loss when doing a fast fsync · aab15e8e
    Filipe Manana authored
    After the recent simplification of the fast fsync path by commit
    b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time") and
    commit e7175a69 ("btrfs: remove the wait ordered logic in the
    log_one_extent path"), there is a very short time window in which
    extents can be logged without writeback completing first, or logged
    without the respective data checksums. Both issues can only happen
    when doing a non-full (fast) fsync.
    
    As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
    inode and then wait for the writeback to complete before starting to log
    the inode. However, after we start writeback and before we acquire the
    inode's lock, more writes may happen and dirty more pages. If writeback
    is triggered for those pages while we are logging the inode (for example
    by the VM subsystem due to memory pressure, or by another concurrent
    fsync), we end up seeing the respective extent maps in the inode's list
    of modified extents and log matching file extent items without waiting
    for the respective ordered extents to complete, meaning that either of
    the following will happen:
    
    1) We log an extent after its writeback finishes but before its checksums
       are added to the csum tree, leading to -EIO errors when attempting to
       read the extent after a log replay.
    
    2) We log an extent before its writeback finishes.
       Therefore after the log replay we will have a file extent item pointing
       to an unwritten extent (and without the respective data checksums as
       well).
    
    This could not happen before the fast fsync path simplification, because
    for any extent we found in the list of modified extents, we would either
    wait for its respective ordered extent to finish writeback or, if it had
    not completed yet, collect its checksums for logging.
    
    Fix this by triggering writeback again after acquiring the inode's lock
    and before waiting for ordered extents to complete.
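
The ordering problem and the fix can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the kernel change itself: `toy_inode`, `start_writeback()`, `wait_ordered_extents()`, `dirty_extent()`, `log_extents()` and `simulate_fast_fsync()` are hypothetical names invented for illustration, and concurrency is modeled as a fixed sequence of steps rather than with real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one file extent. None of these names are btrfs symbols. */
enum extent_state {
    EXTENT_DIRTY,     /* pages dirtied, writeback not started yet */
    EXTENT_WRITEBACK, /* writeback started, ordered extent not complete */
    EXTENT_COMPLETE   /* data written and checksums added */
};

#define MAX_EXTENTS 8

struct toy_inode {
    enum extent_state extents[MAX_EXTENTS];
    size_t nr_extents;
};

/* Start writeback for every extent that is still only dirty. */
static void start_writeback(struct toy_inode *inode)
{
    for (size_t i = 0; i < inode->nr_extents; i++)
        if (inode->extents[i] == EXTENT_DIRTY)
            inode->extents[i] = EXTENT_WRITEBACK;
}

/* Wait for every extent currently under writeback to complete. */
static void wait_ordered_extents(struct toy_inode *inode)
{
    for (size_t i = 0; i < inode->nr_extents; i++)
        if (inode->extents[i] == EXTENT_WRITEBACK)
            inode->extents[i] = EXTENT_COMPLETE;
}

/* A writer dirties one more extent. */
static void dirty_extent(struct toy_inode *inode)
{
    assert(inode->nr_extents < MAX_EXTENTS);
    inode->extents[inode->nr_extents++] = EXTENT_DIRTY;
}

/* Log all extents; return how many were logged before they completed. */
static size_t log_extents(const struct toy_inode *inode)
{
    size_t unsafe = 0;

    for (size_t i = 0; i < inode->nr_extents; i++)
        if (inode->extents[i] != EXTENT_COMPLETE)
            unsafe++;
    return unsafe;
}

/*
 * Replay the sequence described above as fixed steps.
 * retrigger_after_lock selects the fixed ordering (true) or the
 * ordering with the race window (false).
 */
size_t simulate_fast_fsync(bool retrigger_after_lock)
{
    struct toy_inode inode = { .nr_extents = 0 };

    dirty_extent(&inode);          /* data written before the fsync */
    start_writeback(&inode);       /* fsync entry: first writeback trigger */
    dirty_extent(&inode);          /* racing write before the lock is taken */
    /* the inode lock would be acquired at this point */
    if (retrigger_after_lock)
        start_writeback(&inode);   /* the fix: trigger writeback again */
    wait_ordered_extents(&inode);  /* wait for what is in flight */
    start_writeback(&inode);       /* e.g. memory pressure during logging */
    return log_extents(&inode);
}
```

In this model, `simulate_fast_fsync(false)` reports one extent logged before completion, while `simulate_fast_fsync(true)` reports none, because the second trigger puts the late extent in flight before the wait.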
    
    Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
    Fixes: b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
file.c 88.2 KB