• Filipe Manana's avatar
    Btrfs: fix race between block group relocation and nocow writes · f78c436c
    Filipe Manana authored
    Relocation of a block group waits for all existing tasks flushing
    dellaloc, starting direct IO writes and any ordered extents before
    starting the relocation process. However for direct IO writes that end
    up doing nocow (inode either has the flag nodatacow set or the write is
    against a prealloc extent) we have a short time window that allows for a
    race that makes relocation proceed without waiting for the direct IO
    write to complete first, resulting in data loss after the relocation
    finishes. This is illustrated by the following diagram:
    
               CPU 1                                     CPU 2
    
     btrfs_relocate_block_group(bg X)
    
                                                   direct IO write starts against
                                                   an extent in block group X
                                                   using nocow mode (inode has the
                                                   nodatacow flag or the write is
                                                   for a prealloc extent)
    
                                                   btrfs_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       --> can_nocow_extent() returns 1
    
       btrfs_inc_block_group_ro(bg X)
         --> turns block group into RO mode
    
       btrfs_wait_ordered_roots()
         --> returns and does not know about
             the DIO write happening at CPU 2
             (the task there has not created
              yet an ordered extent)
    
       relocate_block_group(bg X)
         --> rc->stage == MOVE_DATA_EXTENTS
    
         find_next_extent()
           --> returns extent that the DIO
               write is going to write to
    
         relocate_data_extent()
    
           relocate_file_extent_cluster()
    
             --> reads the extent from disk into
                 pages belonging to the relocation
                 inode and dirties them
    
                                                       --> creates DIO ordered extent
    
                                                     btrfs_submit_direct()
                                                       --> submits bio against a location
                                                           on disk obtained from an extent
                                                           map before the relocation started
    
       btrfs_wait_ordered_range()
         --> writes all the pages read before
             to disk (belonging to the
             relocation inode)
    
       relocation finishes
    
                                                     bio completes and wrote new data
                                                     to the old location of the block
                                                     group
    
    So fix this by tracking the number of nocow writers for a block group and
    make sure relocation waits for that number to go down to 0 before starting
    to move the extents.
    
    The same race can also happen with buffered writes in nocow mode since the
    patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
    when relocating", because we are no longer flushing all delalloc which
    served as a synchonization mechanism (due to page locking) and ensured
    the ordered extents for nocow buffered writes were created before we
    called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
    mode existed before that patch (no pages are locked or used during direct
    IO) and that fixed only races with direct IO writes that do cow.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
    f78c436c
extent-tree.c 296 KB