• Filipe Manana's avatar
    Btrfs: fix race between fsync and direct IO writes for prealloc extents · 0b901916
    Filipe Manana authored
    When we do a direct IO write against a preallocated extent (fallocate)
    that does not go beyond the i_size of the inode, we do the write operation
    without holding the inode's i_mutex (an optimization that landed in
    commit 38851cc1 ("Btrfs: implement unlocked dio write")). This allows
    for a very tiny time window where a race can happen with a concurrent
    fsync using the fast code path, as the direct IO write path creates first
    a new extent map (no longer flagged as a prealloc extent) and then it
    creates the ordered extent, while the fast fsync path first collects
    ordered extents and then it collects extent maps. This allows for the
    possibility of the fast fsync path to collect the new extent map without
    collecting the new ordered extent, and therefore logging an extent item
    based on the extent map without waiting for the ordered extent to be
    created and complete. This can result in a situation where after a log
    replay we end up with an extent not marked anymore as prealloc but it was
    only partially written (or not written at all), exposing random, stale or
    garbage data corresponding to the unwritten pages and without any
    checksums in the csum tree covering the extent's range.
    
    This is an extension of what was done in commit de0ee0ed ("Btrfs: fix
    race between fsync and lockless direct IO writes").
    
    So fix this by creating first the ordered extent and then the extent
    map, so that this way if the fast fsync patch collects the new extent
    map it also collects the corresponding ordered extent.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
    0b901916
inode.c 282 KB