• Qu Wenruo's avatar
    btrfs: do not clear page dirty inside extent_write_locked_range() · 97713b1a
    Qu Wenruo authored
    [BUG]
    For subpage + zoned case, the following workload can lead to rsv data
    leak at unmount time:
    
      # mkfs.btrfs -f -s 4k $dev
      # mount $dev $mnt
      # fsstress -w -n 8 -d $mnt -s 1709539240
      0/0: fiemap - no filename
      0/1: copyrange read - no filename
      0/2: write - no filename
      0/3: rename - no source filename
      0/4: creat f0 x:0 0 0
      0/4: creat add id=0,parent=-1
      0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
      0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
      0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
      0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
      # umount $mnt
    
    The dmesg includes the following rsv leak detection warning (all call
    trace skipped):
    
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
      ---[ end trace 0000000000000000 ]---
      BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
      ---[ end trace 0000000000000000 ]---
      BTRFS info (device sda): space_info DATA has 268218368 free, is not full
      BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
      BTRFS info (device sda): global_block_rsv: size 0 reserved 0
      BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
      BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
      BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
      BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
      ---[ end trace 0000000000000000 ]---
      BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
      BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
      BTRFS info (device sda): global_block_rsv: size 0 reserved 0
      BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
      BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
      BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
      BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
    
    Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
    append size of 64K, and the system has 64K page size.
    
    [CAUSE]
    I have added several trace_printk() to show the events (header skipped):
    
      > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
    
    The above lines show our buffered write has dirtied 3 pages of inode
    259 of root 5:
    
      704K             768K              832K              896K
      I           |////I/////////////////I///////////|     I
                  756K                               868K
    
      |///| is the dirtied range using subpage bitmaps. and 'I' is the page
      boundary.
    
      Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
      flag set.
    
      > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
    
    Then direct IO write starts, since the range [680K, 780K) covers the
    beginning part of the above dirty range, we need to writeback the
    two pages at 704K and 768K.
    
      > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
      > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
    
    Now the above 2 lines show that we're writing back for dirty range
    [756K, 756K + 64K).
    We only writeback 64K because the zoned device has max zone append size
    as 64K.
    
      > extent_write_locked_range: r/i=5/259 clear dirty for page=786432
    
    !!! The above line shows the root cause. !!!
    
    We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
    for the page 768K.
    This is because extent_write_locked_range() can go beyond the current
    locked page, here we hit the page at 768K and clear its page dirt.
    
    In fact this would lead to the desync between subpage dirty and page
    dirty flags.  We have the page dirty flag cleared, but the subpage range
    [820K, 832K) is still dirty.
    
    After the writeback of range [756K, 820K), the dirty flags look like
    this, as page 768K no longer has dirty flag set.
    
      704K             768K              832K              896K
      I                I      |          I/////////////|   I
                              820K                     868K
    
    This means we will no longer writeback range [820K, 832K), thus the
    reserved data/metadata space would never be properly released.
    
      > extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
    
    Now even though we try to start writeback for page 768K, since the
    page is not dirty, we completely skip it at extent_write_cache_pages()
    time.
    
      > btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
    
    Now the direct IO finished.
    
      > cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
      > extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
    
    Now we writeback the remaining dirty range, which is [832K, 868K).
    Causing the range [820K, 832K) never to be submitted, thus leaking the
    reserved space.
    
    This bug only affects subpage and zoned case.  For non-subpage and zoned
    case, we have exactly one sector for each page, thus no such partial dirty
    cases.
    
    For subpage and non-zoned case, we never go into run_delalloc_cow(), and
    normally all the dirty subpage ranges would be properly submitted inside
    __extent_writepage_io().
    
    [FIX]
    Just do not clear the page dirty at all inside extent_write_locked_range().
    As __extent_writepage_io() would do a more accurate, subpage compatible
    clear for page and subpage dirty flags anyway.
    
    Now the correct trace would look like this:
    
      > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
      > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
    
    The page dirty part is still the same 3 pages.
    
      > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
      > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
      > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
    
    And the writeback for the first 64K is still correct.
    
      > cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
      > extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
    
    Now with the fix, we can properly writeback the range [820K, 832K), and
    properly release the reserved data/metadata space.
    Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    97713b1a
extent_io.c 123 KB