    btrfs: subpage: fix relocation potentially overwriting last page data · 9d9ea1e6
    Qu Wenruo authored
    [BUG]
    When using the following script, btrfs will report data corruption after
    one data balance with subpage support:
    
      mkfs.btrfs -f -s 4k $dev
      mount $dev -o nospace_cache $mnt
      $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
      sync
      btrfs balance start -d $mnt
      btrfs scrub start -B $mnt
    
    A similar problem can easily be observed in the btrfs/028 test case,
    where many balance operations fail with -EIO.
    
    [CAUSE]
    The above fsstress run results in the following data extent layout in
    the extent tree:
      item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
        refs 2 gen 7 flags DATA
        extent data backref root FS_TREE objectid 259 offset 1339392 count 1
        extent data backref root FS_TREE objectid 259 offset 647168 count 1
      item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
        block group used 102400 chunk_objectid 256 flags DATA
      item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
        refs 1 gen 7 flags DATA
        extent data backref root FS_TREE objectid 259 offset 729088 count 1
    
    Then when creating the data reloc inode, the data reloc inode will look
    like this:
    
    	0	32K	64K	96K 100K	104K
    	|<------ Extent A ----->|   |<- Ext B ->|
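
The offsets in the diagram can be re-derived from the extent tree dump. The sketch below is only an illustration: it assumes that a data extent's file offset in the data reloc inode is its bytenr minus the block group start, which is what the diagram implies. The helper name `reloc_inode_range` is hypothetical.

```python
# Values copied from the extent tree dump above.
BLOCK_GROUP_START = 13631488
EXTENTS = {
    "A": (13631488, 98304),  # item 10: (bytenr, length)
    "B": (13733888, 4096),   # item 12: (bytenr, length)
}
KIB = 1024


def reloc_inode_range(bytenr, length, bg_start=BLOCK_GROUP_START):
    """Return the [start, end) file range in KiB.

    Assumption: file offset = extent bytenr - block group start.
    """
    start = bytenr - bg_start
    return start // KIB, (start + length) // KIB


layout = {name: reloc_inode_range(*ext) for name, ext in EXTENTS.items()}
for name, (start, end) in layout.items():
    print(f"Extent {name}: [{start}K, {end}K)")
```

Under that assumption extent A covers [0, 96K) and extent B covers [100K, 104K), so both extents touch the 64K page [64K, 128K).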
    
    Then when we first try to relocate extent A, we set up the data reloc
    inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).
    
    For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
    with 0 and set it uptodate.
    
    Then when we come to extent B, we update i_size to 104K, then try to read
    page [64K, 128K).
    Then we find the page is already uptodate, so we skip the read.
    But range [96K, 128K) is filled with 0, not the real data.
    
    Then we write back the data reloc inode to disk, with 0 filling range
    [96K, 128K), corrupting the content of extent B.
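
The sequence above can be replayed with a small user-space model. This is a sketch, not kernel code: `disk`, `page_cache` and `read_page` are hypothetical stand-ins, and the 0xbb pattern is only a placeholder for the real content of extent B.

```python
KIB = 1024
PAGE_SIZE = 64 * KIB

# Backing data for file range [0, 128K) of the data reloc inode.
disk = bytearray(128 * KIB)
disk[100 * KIB:104 * KIB] = b"\xbb" * (4 * KIB)  # extent B: non-zero data

# Page index -> page content. Presence models "the whole page is uptodate".
page_cache = {}


def read_page(index, i_size):
    """Full-page read: skip cached pages, zero everything beyond i_size."""
    if index in page_cache:
        return page_cache[index]  # already uptodate, so the read is skipped
    start = index * PAGE_SIZE
    page = bytearray(disk[start:start + PAGE_SIZE])
    valid = max(0, min(PAGE_SIZE, i_size - start))
    page[valid:] = bytes(PAGE_SIZE - valid)  # zero-fill [i_size, page end)
    page_cache[index] = page
    return page


# Relocating extent A: i_size is 96K, both pages are read.
read_page(0, 96 * KIB)
read_page(1, 96 * KIB)

# Relocating extent B: i_size grows to 104K, but page 1 is a cache hit.
page = read_page(1, 104 * KIB)
off = 100 * KIB - PAGE_SIZE
stale = bytes(page[off:off + 4 * KIB])
print("extent B reads back as zeros:", stale == bytes(4 * KIB))
```

In this model the cached page still holds zeros for [100K, 104K) while the backing data does not, which is the content that would then be written back.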
    
    The behavior is caused by the fact that we still do full-page reads in
    the subpage case.
    
    The bug won't really happen for regular sectorsize, as one page only
    contains one sector.
    
    [FIX]
    This patch will fix the problem by invalidating range [i_size, PAGE_END]
    in prealloc_file_extent_cluster().
    
    So in the above example, when we preallocate the file extent for
    extent B, we will clear the uptodate bits for range [96K, 128K),
    allowing a later relocate_one_page() to re-read the needed range.
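
The range arithmetic of the fix can be sketched as follows. This is a user-space model under stated assumptions: a 64K page with 4K sectors, one uptodate bit per sector, and half-open ranges instead of the inclusive [i_size, PAGE_END] notation used above. `tail_range` and `clear_uptodate` are hypothetical helpers, not kernel functions.

```python
KIB = 1024
PAGE_SIZE = 64 * KIB
SECTOR_SIZE = 4 * KIB
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


def tail_range(i_size):
    """Return [i_size, end of its page), or None if i_size is page aligned."""
    if i_size % PAGE_SIZE == 0:
        return None
    page_end = (i_size // PAGE_SIZE + 1) * PAGE_SIZE
    return i_size, page_end


def clear_uptodate(bits, start, end):
    """Clear the per-sector uptodate bits covering [start, end) in one page."""
    page_start = (start // PAGE_SIZE) * PAGE_SIZE
    first = (start - page_start) // SECTOR_SIZE
    last = (end - page_start) // SECTOR_SIZE
    for sector in range(first, last):
        bits[sector] = False


# Page [64K, 128K) is fully uptodate after relocating extent A.
uptodate = [True] * SECTORS_PER_PAGE

# Before preallocating for extent B, invalidate the tail beyond i_size = 96K.
rng = tail_range(96 * KIB)
clear_uptodate(uptodate, *rng)

# File offsets (in KiB) of the sectors that now have to be read again.
needs_read = [64 + i * 4 for i, ok in enumerate(uptodate) if not ok]
print(rng, needs_read)
```

With i_size at 96K the cleared range is [96K, 128K), so the sector at 100K that backs extent B is no longer treated as uptodate and has to be read from disk.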
    
    There is a special note for the invalidating part.
    
    Since we're not calling the real btrfs_invalidatepage(), but just
    clearing the subpage and page uptodate bits, we can leave a page half
    dirty and half out of date.
    
    Reading such a page can cause a deadlock, as we normally expect a dirty
    page to be fully uptodate.
    
    Thus we flush and wait for the data reloc inode before doing the hacked
    invalidation.  This won't cause extra overhead, as we're going to
    write back the data later anyway.
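
The ordering requirement can be illustrated with a toy model. Everything here is hypothetical (the `Page` class and both helpers); it only encodes the expectation stated above, that a dirty page is fully uptodate.

```python
class Page:
    """Toy page: one dirty flag plus one uptodate bit per sector."""

    def __init__(self, sectors=16):
        self.dirty = True
        self.uptodate = [True] * sectors

    def invariant_holds(self):
        # Expectation: a dirty page is fully uptodate.
        return not self.dirty or all(self.uptodate)


def flush_and_wait(page):
    """Model writeback completing: the page is clean afterwards."""
    page.dirty = False


def clear_tail_uptodate(page, first_sector):
    """Model the hacked invalidation: only uptodate bits are cleared."""
    for i in range(first_sector, len(page.uptodate)):
        page.uptodate[i] = False


# Invalidating a still-dirty page leaves it half dirty, half out of date.
unsafe = Page()
clear_tail_uptodate(unsafe, 8)

# Flushing first means the page is clean when its tail is invalidated.
safe = Page()
flush_and_wait(safe)
clear_tail_uptodate(safe, 8)

print(unsafe.invariant_holds(), safe.invariant_holds())
```

Only the flush-first ordering keeps the expectation intact in this model.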
    
    Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
relocation.c 113 KB