    btrfs: raid56: do data csum verification during RMW cycle · 7a315072
    Qu Wenruo authored
    [BUG]
    For the following small script, btrfs will be unable to recover the
    content of file1:
    
      mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
    
      mount $dev1 $mnt
      xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
      md5sum $mnt/file1
      umount $mnt
    
      # Corrupt the above 64K data stripe.
      xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
      mount $dev1 $mnt
    
      # Write a new 64K, which should be in the other data stripe
      # And this is a sub-stripe write, which will cause RMW
      xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
      md5sum $mnt/file1
      umount $mnt
    
    The above md5sum would no longer match the first one.
    
    [CAUSE]
    There is a long-standing problem for raid56 (not limited to btrfs
    raid56): if we already have some corrupted on-disk data and then
    trigger a sub-stripe write (which needs an RMW cycle), it can cause
    further damage to the P/Q stripes.
    
      Disk 1: data 1 |0x000000000000| <- Corrupted
      Disk 2: data 2 |0x000000000000|
      Disk 3: parity |0xffffffffffff|
    
    In the above case, data 1 is already corrupted; the original data
    should be 64KiB of 0xff.
    
    At this stage, if we read data 1 and it has a data checksum, we can
    still recover it via the regular RAID56 recovery path.
    
    But if we now decide to write some data into data 2, we need to go
    through RMW.
    Let's say we want to write 64KiB of '0x00' into data 2.  We then read
    the on-disk data of data 1 and calculate the new parity, resulting in
    the following layout:
    
      Disk 1: data 1 |0x000000000000| <- Corrupted
      Disk 2: data 2 |0x000000000000| <- New '0x00' writes
      Disk 3: parity |0x000000000000| <- New parity
    
    Since the new parity is calculated using the *corrupted* data 1, we
    can no longer recover the correct data of data 1.  Thus the corruption
    is there forever.
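
    To make the parity math concrete, here is a minimal sketch in plain
    userspace C (not btrfs code); single bytes stand in for the stripes
    and XOR is the RAID5 parity calculation:

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint8_t data1_good = 0xff;  /* what data 1 should contain */
              uint8_t data1_disk = 0x00;  /* what is really on disk (corrupted) */
              uint8_t data2_old  = 0x00;
              uint8_t parity_old = data1_good ^ data2_old;  /* 0xff, still good */

              /* Before the RMW, parity can still rebuild data 1. */
              printf("data 1 rebuilt before RMW: 0x%02x\n", parity_old ^ data2_old);

              /* Sub-stripe write to data 2: read data 1 from disk, recompute parity. */
              uint8_t data2_new  = 0x00;
              uint8_t parity_new = data1_disk ^ data2_new;  /* poisoned by bad data 1 */

              /* After the RMW, "rebuilding" data 1 only reproduces the corruption. */
              printf("data 1 rebuilt after RMW:  0x%02x\n", parity_new ^ data2_new);
              return 0;
      }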
    
    [FIX]
    To solve the above problem, this patch does a full stripe data
    checksum verification at RMW time.
    
    This involves the following changes (a simplified sketch of the
    resulting RMW read flow follows the list):
    
    - Always read the full stripe (including data/P/Q) when doing RMW
      Previously we only read the missing data sectors, but since we may
      do data csum verification and recovery, we need to read everything
      out.

      Please note that, if we have a cached rbio, we don't need to read
      anything and can treat it the same as a full stripe write, since
      only a stripe whose data csums all match can be cached.
    
    - Verify the data csum during read.
      The targets are only the rbio stripe sectors, and only if the rbio
      already has csum_buf/csum_bitmap filled.

      Sectors which do not pass csum verification will have their bit
      set in error_bitmap.
    
    - Always call recover_sectors() after we read out all the sectors
      Since error_bitmap is updated during read, recover_sectors() can
      easily find all the bad sectors and try to recover them (if still
      under tolerance).

      And since recover_sectors() has already been migrated to use
      error_bitmap, it can skip vertical stripes which don't have any
      error.
    
    - Verify the repaired sectors against their csums in recover_vertical()
    
    - Rename rmw_read_and_wait() to rmw_read_wait_recover()
      Since we will always recover the sectors, the old name is no longer
      accurate.
    
      Furthermore, since recovery is already done in
      rmw_read_wait_recover(), we no longer need to call recover_sectors()
      inside rmw_rbio().
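
    The sketch below models that flow in plain, self-contained userspace
    C.  The struct and helper names are made up for illustration only;
    the real implementation lives in fs/btrfs/raid56.c
    (rmw_read_wait_recover(), recover_sectors()):

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NR_DATA_SECTORS 4

      struct rbio_sketch {
              bool     cached;                     /* all csums already verified */
              bool     has_csum[NR_DATA_SECTORS];  /* csum_bitmap equivalent */
              uint8_t  csum[NR_DATA_SECTORS];      /* expected csums (csum_buf) */
              uint8_t  data[NR_DATA_SECTORS];      /* on-disk data, 1 byte/sector */
              uint32_t error_bitmap;               /* bad sectors found during read */
      };

      /* Stand-in csum: just the data byte itself, enough for the sketch. */
      static uint8_t csum_of(uint8_t data)
      {
              return data;
      }

      static void read_full_stripe(struct rbio_sketch *rbio)
      {
              /* The real code reads all data and P/Q sectors here. */
              (void)rbio;
      }

      static int recover_and_reverify(struct rbio_sketch *rbio)
      {
              for (int i = 0; i < NR_DATA_SECTORS; i++) {
                      if (!(rbio->error_bitmap & (1u << i)))
                              continue;  /* vertical stripe has no error, skip it */
                      /* Rebuild from parity would happen here, then re-check the csum. */
                      printf("sector %d repaired before the new parity is written\n", i);
              }
              return 0;
      }

      static int rmw_read_sketch(struct rbio_sketch *rbio)
      {
              /* A cached rbio already passed csum checks: treat as full stripe write. */
              if (rbio->cached)
                      return 0;

              /* Read everything, not just the sectors missing for this write. */
              read_full_stripe(rbio);

              /* Verify data csums during read, mark failures in error_bitmap. */
              for (int i = 0; i < NR_DATA_SECTORS; i++) {
                      if (rbio->has_csum[i] && csum_of(rbio->data[i]) != rbio->csum[i])
                              rbio->error_bitmap |= 1u << i;
              }

              /* Repair (and re-verify) bad sectors before parity is recalculated. */
              return recover_and_reverify(rbio);
      }

      int main(void)
      {
              struct rbio_sketch rbio = {
                      .has_csum = { true, true, true, true },
                      .csum     = { 0xff, 0xff, 0xff, 0xff },
                      .data     = { 0x00, 0xff, 0xff, 0xff },  /* sector 0 corrupted */
              };

              return rmw_read_sketch(&rbio);
      }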
    
    Obviously this will have a performance impact, as we are doing more
    work during the RMW cycle:
    
    - Fetch the data checksums
    - Do checksum verification for all data stripes
    - Do checksum verification again after repair
    
    But for a full stripe write or a cached rbio we won't have the
    overhead at all, thus for a fully optimized RAID56 workload (always
    full stripe writes) there should be no extra overhead.
    
    To me, the extra overhead looks reasonable, as data consistency is way
    more important than performance.

    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>