1. 05 Dec, 2022 40 commits
    • btrfs: fix uninitialized variable in find_first_clear_extent_bit · 26df39a9
      Josef Bacik authored
      This was caught when syncing extent-io-tree.c into btrfs-progs.  This
      however isn't really a problem, the only way next would be uninitialized
      is if we found the range we were looking for, and in this case we don't
      care about next.  However it's a compile error, so fix it up.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix uninitialized parent in insert_state · d7c9e1be
      Josef Bacik authored
      I don't know how this isn't caught when we build this in the kernel, but
      while syncing extent-io-tree.c into btrfs-progs I got an error because
      parent could potentially be uninitialized when we link in a new node,
      specifically when the extent_io_tree is empty.  This means we could have
      garbage in the parent color.  I don't know what the ramifications are of
      that, but it's probably not great, so fix this by initializing parent to
      NULL.  I spot checked all of our other usages in btrfs and we appear to
      be doing the correct thing everywhere else.
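The pattern behind this fix can be sketched in userspace C. This is a minimal illustration on a plain binary search tree, not the kernel's rbtree code: struct toy_node, toy_link_node() and toy_insert() are hypothetical names made up for the example. The point is only that the search loop never runs for an empty tree, so parent has to start out as NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node of a plain (unbalanced) binary search tree, standing in for rb_node. */
struct toy_node {
	unsigned long start;
	struct toy_node *parent;
	struct toy_node *left;
	struct toy_node *right;
};

/* Hypothetical counterpart of linking a new node under 'parent' at '*link'. */
static void toy_link_node(struct toy_node *node, struct toy_node *parent,
			  struct toy_node **link)
{
	node->parent = parent;
	node->left = NULL;
	node->right = NULL;
	*link = node;
}

/* Returns 0 on success, -1 if a node with the same start already exists. */
static int toy_insert(struct toy_node **root, struct toy_node *node)
{
	struct toy_node **link = root;
	/* Must be initialized: the loop body never runs for an empty tree. */
	struct toy_node *parent = NULL;

	assert(root && node);
	while (*link) {
		parent = *link;
		if (node->start < parent->start)
			link = &parent->left;
		else if (node->start > parent->start)
			link = &parent->right;
		else
			return -1;
	}
	toy_link_node(node, parent, link);
	return 0;
}
```

Inserting into an empty tree leaves the new root with a NULL parent, which is what the initialization guarantees.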
      
      Fixes: c7e118cf ("btrfs: open code rbtree search in insert_state")
      CC: stable@vger.kernel.org # 6.0+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add might_sleep() annotations · a4c853af
      ChenXiaoSong authored
      Add annotations to functions that might sleep due to allocations or IO
      and could be called from various contexts. In case of btrfs_search_slot
      it's not obvious why it would sleep:
      
          btrfs_search_slot
            setup_nodes_for_search
              reada_for_balance
                btrfs_readahead_node_child
                  btrfs_readahead_tree_block
                    btrfs_find_create_tree_block
                      alloc_extent_buffer
                        kmem_cache_zalloc
                          /* allocate memory non-atomically, might sleep */
                          kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO)
                    read_extent_buffer_pages
                      submit_extent_page
                        /* disk IO, might sleep */
                        submit_one_bio
      
      Another example is update_qgroup_limit_item(), which might sleep in 3
      places, such as the path shown below:
      
        update_qgroup_limit_item
          btrfs_alloc_path
            /* allocate memory non-atomically, might sleep */
            kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS)
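As a rough userspace analogy of what such an annotation buys, the sketch below checks a "must not sleep" counter at the top of a function, so a bad caller is flagged even when the allocation happens not to block. Everything here (toy_enter_atomic(), toy_might_sleep(), toy_alloc_path() and the counters) is hypothetical and only mimics the idea; it is not the kernel's might_sleep() implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int atomic_depth;     /* > 0 while in a context that must not sleep */
static int sleep_violations; /* how many times the check fired */

static void toy_enter_atomic(void)
{
	atomic_depth++;
}

static void toy_leave_atomic(void)
{
	assert(atomic_depth > 0);
	atomic_depth--;
}

/* Report callers that could sleep while in a non-sleeping context. */
static void toy_might_sleep(const char *where)
{
	if (atomic_depth > 0) {
		sleep_violations++;
		fprintf(stderr, "BUG: %s may sleep in atomic context\n", where);
	}
}

/* The annotation sits first, so it fires whether or not calloc() blocks. */
static void *toy_alloc_path(void)
{
	toy_might_sleep("toy_alloc_path");
	return calloc(1, 64);
}
```

Calling toy_alloc_path() between toy_enter_atomic() and toy_leave_atomic() bumps sleep_violations, while a call outside that window does not.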
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add stack helpers for a few btrfs items · 054056bd
      Josef Bacik authored
      We don't have these defined in the kernel because we don't have any
      users of these helpers.  However we do use them in btrfs-progs, so
      define them to make keeping accessors.h in sync between progs and the
      kernel easier.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add nr_global_roots to the super block definition · 0c703003
      Josef Bacik authored
      We already have this defined in btrfs-progs, add it to the kernel to
      make it easier to sync these files into btrfs-progs.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove BTRFS_LEAF_DATA_OFFSET · 8009adf3
      Josef Bacik authored
      This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so
      remove this helper and replace its usage with the above statement.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helpers for manipulating leaf items and data · 637e3b48
      Josef Bacik authored
      We have some gnarly memmove and copy_extent_buffer calls for leaf
      manipulation.  This is because our item offsets aren't absolute, they're
      based on 0 being where the items start in the leaf, which is after the
      btrfs_header.  This means any manipulation of the data requires adding
      sizeof(struct btrfs_header) to the offsets we pull from the items.
      Moving the items themselves is easier as the helpers are absolute
      offsets, however we of course have to call the helpers to get the
      offsets for the item numbers.  This makes for
      copy_extent_buffer/memmove_extent_buffer calls that are kind of hard to
      reason about.
      
      Fix this by pushing this logic into helpers.  For data we'll only use
      the item provided offsets, and the helpers will use the
      BTRFS_LEAF_DATA_OFFSET addition for the offsets.  Additionally for the
      item manipulation simply pass in the item numbers, and then the helpers
      will call the offset helper to get the actual offset into the leaf.
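The offset arithmetic described above can be sketched on a toy leaf. This is an illustration under simplifying assumptions, not the btrfs helpers themselves: the 16-byte struct toy_header, the two-field struct toy_item, and the toy_* function names are all hypothetical. Data offsets stored in items are relative to the end of the header, while item slots are addressed by item number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LEAF_SIZE 256

struct toy_header { uint8_t pad[16]; };
/* 'offset' is relative to the end of the header, not to the leaf start. */
struct toy_item { uint32_t offset; uint32_t size; };
struct toy_leaf { uint8_t buf[TOY_LEAF_SIZE]; };

/* Absolute byte position of item slot 'nr' inside the leaf. */
static size_t toy_item_nr_offset(int nr)
{
	return sizeof(struct toy_header) + (size_t)nr * sizeof(struct toy_item);
}

/* Absolute byte position for an item-relative data offset. */
static size_t toy_leaf_data_pos(uint32_t item_offset)
{
	return sizeof(struct toy_header) + item_offset;
}

/* Move item data; callers pass offsets exactly as stored in the items. */
static void toy_memmove_leaf_data(struct toy_leaf *leaf, uint32_t dst,
				  uint32_t src, size_t len)
{
	assert(toy_leaf_data_pos(dst) + len <= TOY_LEAF_SIZE);
	assert(toy_leaf_data_pos(src) + len <= TOY_LEAF_SIZE);
	memmove(leaf->buf + toy_leaf_data_pos(dst),
		leaf->buf + toy_leaf_data_pos(src), len);
}

/* Move item slots; callers pass item numbers, not byte offsets. */
static void toy_memmove_leaf_items(struct toy_leaf *leaf, int dst_nr,
				   int src_nr, int nr_items)
{
	size_t len = (size_t)nr_items * sizeof(struct toy_item);

	assert(toy_item_nr_offset(dst_nr) + len <= TOY_LEAF_SIZE);
	assert(toy_item_nr_offset(src_nr) + len <= TOY_LEAF_SIZE);
	memmove(leaf->buf + toy_item_nr_offset(dst_nr),
		leaf->buf + toy_item_nr_offset(src_nr), len);
}
```

With helpers of this shape, call sites no longer add the header size by hand, which is the readability gain the commit message describes.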
      
      The diffstat makes this look like more code, but that's simply because I
      added comments for the helpers, it's net negative for the amount of
      code, and is easier to reason about.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add eb to btrfs_node_key_ptr_offset · e23efd8e
      Josef Bacik authored
      This is a change needed for extent tree v2, as we will be growing the
      header size.  This exists in btrfs-progs currently, and not having it
      makes syncing accessors.[ch] more problematic.  So make this change to
      set us up for extent tree v2 and match what btrfs-progs does to make
      syncing easier.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the extent buffer for the btrfs_item_nr helpers · 42c9419a
      Josef Bacik authored
      This is actually a change for extent tree v2, but it exists in
      btrfs-progs but not in the kernel.  This makes it annoying to sync
      accessors.h with btrfs-progs, and since this is the way I need it for
      extent-tree v2 simply update these helpers to take the extent buffer in
      order to make syncing possible now, and make the extent tree v2 stuff
      easier moving forward.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the csum helpers into ctree.h · 0e6c40eb
      Josef Bacik authored
      These got moved because of copy+paste, but this code exists in ctree.c,
      so move the declarations back into ctree.h.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move eb offset helpers into extent_io.h · 9b48adda
      Josef Bacik authored
      These are very specific to how the extent buffer is defined, so this
      differs between btrfs-progs and the kernel.  Make things easier by
      moving these helpers into extent_io.h so we don't have to worry about
      this when syncing ctree.h.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move file_extent_item helpers into file-item.h · 6bfd0ffa
      Josef Bacik authored
      These helpers use functions that are in multiple places, which makes it
      tricky to sync them into btrfs-progs.  Move them to file-item.h and then
      include file-item.h in places that use these helpers.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move leaf_data_end into ctree.c · 3a3178c7
      Josef Bacik authored
      This is only used in ctree.c, with the exception of zero'ing out extent
      buffers we're getting ready to write out.  In theory we shouldn't have
      an extent buffer with 0 items that we're writing out, however I'd rather
      be safe than sorry so open code it in extent_io.c, and then copy the
      helper into ctree.c.  This will make it easier to sync accessors.[ch]
      into btrfs-progs, as this requires a helper that isn't defined in
      accessors.h.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move root helpers back into ctree.h · 1fe5ebc4
      Josef Bacik authored
      These accidentally got brought into accessors.h, but belong with the
      btrfs_root definitions which are currently in ctree.h.  Move these to
      make it easier to sync accessors.[ch] into btrfs-progs.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move repair_io_failure to bio.c · bacf60e5
      Christoph Hellwig authored
      repair_io_failure ties directly into all the gory low-level details of
      mapping a bio with a logical address to the actual physical location.
      Move it right below btrfs_submit_bio to keep all the related logic
      together.
      
      Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
      repair_io_failure is available in a header.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split the bio submission path into a separate file · 103c1972
      Christoph Hellwig authored
      The code used by btrfs_submit_bio only interacts with the rest of
      volumes.c through __btrfs_map_block (which itself is a more generic
      version of two exported helpers) and does not really have anything
      to do with volumes.c.  Create a new bio.c file and a bio.h header
      going along with it for the btrfs_bio-based storage layer, which
      will grow even more going forward.
      
      Also update the file with my copyright notice given that a large
      part of the moved code was written or rewritten by me.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move struct btrfs_tree_parent_check out of disk-io.h · 27137fac
      Christoph Hellwig authored
      Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
      and various .c files don't have to include disk-io.h just for it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ use tree-checker.h for the structure ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: do data csum verification during RMW cycle · 7a315072
      Qu Wenruo authored
      [BUG]
      For the following small script, btrfs will be unable to recover the
      content of file1:
      
        mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
      
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
        md5sum $mnt/file1
        umount $mnt
      
        # Corrupt the above 64K data stripe.
        xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
        mount $dev1 $mnt
      
        # Write a new 64K, which should be in the other data stripe
        # And this is a sub-stripe write, which will cause RMW
        xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
        md5sum $mnt/file1
        umount $mnt
      
      Above md5sum would fail.
      
      [CAUSE]
      There is a long-standing problem for raid56 (not limited to btrfs
      raid56) that, if we already have some corrupted on-disk data, and then
      trigger a sub-stripe write (which needs an RMW cycle), it can cause
      further damage to the P/Q stripe.
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000|
        Disk 3: parity |0xffffffffffff|
      
      In above case, data 1 is already corrupted, the original data should be
      64KiB of 0xff.
      
      At this stage, if we read data 1, and it has a data checksum, we can still
      recover it via the regular RAID56 recovery path.
      
      But if we now decide to write some data into data 2, then we need to do
      RMW.
      Let's say we want to write 64KiB of '0x00' into data 2, then we read the
      on-disk data of data 1, calculate the new parity, resulting in the
      following layout:
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000| <- New '0x00' writes
        Disk 3: parity |0x000000000000| <- New Parity.
      
      But since the new parity is calculated using the *corrupted* data 1, we
      can no longer recover the correct content of data 1.  Thus the
      corruption is permanent.
      
      [FIX]
      To solve above problem, this patch will do a full stripe data checksum
      verification at RMW time.
      
      This involves the following changes:
      
      - Always read the full stripe (including data/P/Q) when doing RMW
        Before we only read the missing data sectors, but since we may do a
        data csum verification and recovery, we need to read everything out.
      
        Please note that, if we have a cached rbio, we don't need to read
        anything, and can treat it the same as full stripe write.
      
        This is because only a stripe whose csums all match can be cached.
      
      - Verify the data csum during read.
        Only the rbio stripe sectors are verified, and only if the rbio
        already has csum_buf/csum_bitmap filled.
      
        And sectors which cannot pass csum verification will have their bit
        set in error_bitmap.
      
      - Always call recover_sectors() after we read out all the sectors
        Since error_bitmap will be updated during read, recover_sectors()
        can easily find out all the bad sectors and try to recover (if still
        under tolerance).
      
        And since recover_sectors() is already migrated to use error_bitmap,
        it can skip vertical stripes which don't have any error.
      
      - Verify the repaired sectors against their csums in recover_vertical()
      
      - Rename rmw_read_and_wait() to rmw_read_wait_recover()
        Since we will always recover the sectors, the old name is no longer
        accurate.
      
        Furthermore since recovery is already done in rmw_read_wait_recover(),
        we no longer need to call recover_sectors() inside rmw_rbio().
      
      Obviously this will have a performance impact, as we are doing more
      work during RMW cycle:
      
      - Fetch the data checksums
      - Do checksum verification for all data stripes
      - Do checksum verification again after repair
      
      But for full stripe write or cached rbio we won't have the overhead at all,
      thus for a fully optimized RAID56 workload (always full stripe write),
      there should be no extra overhead.
      
      To me, the extra overhead looks reasonable, as data consistency is way
      more important than performance.
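The failure mode and the fix can be reproduced with a toy RAID5 stripe in userspace C. This is a sketch under loud assumptions: struct toy_stripe, toy_csum(), toy_xor() and toy_rmw_write_data2() are hypothetical, the stripe is 16 bytes instead of 64KiB, and toy_csum() is a trivial stand-in rather than crc32c. It only shows why verifying data 1 against its checksum before recomputing parity keeps the stripe recoverable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_LEN 16

struct toy_stripe {
	uint8_t data1[STRIPE_LEN];
	uint8_t data2[STRIPE_LEN];
	uint8_t parity[STRIPE_LEN];
	uint32_t csum1; /* checksum of the intended content of data1 */
};

/* Stand-in checksum for the example, NOT crc32c. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

static void toy_xor(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
	for (size_t i = 0; i < STRIPE_LEN; i++)
		dst[i] = a[i] ^ b[i];
}

/*
 * Sub-stripe write to data2 (an RMW cycle).  With 'verify' set, data1 is
 * checked against its csum first and rebuilt from the old data2 and old
 * parity if it is bad.  Returns 0 on success, -1 if data1 is unrecoverable.
 */
static int toy_rmw_write_data2(struct toy_stripe *s, const uint8_t *new_data2,
			       bool verify)
{
	assert(s && new_data2);
	if (verify && toy_csum(s->data1, STRIPE_LEN) != s->csum1) {
		uint8_t rebuilt[STRIPE_LEN];

		toy_xor(rebuilt, s->data2, s->parity);
		/* Verify the repaired content against the csum again. */
		if (toy_csum(rebuilt, STRIPE_LEN) != s->csum1)
			return -1;
		memcpy(s->data1, rebuilt, STRIPE_LEN);
	}
	memcpy(s->data2, new_data2, STRIPE_LEN);
	toy_xor(s->parity, s->data1, s->data2);
	return 0;
}
```

Without verification the new parity is computed from the corrupted data 1, so data2 XOR parity no longer yields the original 0xff content; with verification data 1 is repaired before the parity is rewritten.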
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: prepare data checksums for later RMW verification · c5a41562
      Qu Wenruo authored
      This is for later data checksum verification at RMW time.
      
      This patch will try to allocate the needed memory for a locked rbio if
      the rbio is for data exclusively (we don't want to handle mixed bg yet).
      The memory will be released when the rbio is finished.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce a bitmap based csum range search function · 97e38239
      Qu Wenruo authored
      Although we have an existing function, btrfs_lookup_csums_range(), to
      find all data checksums for a range, it's based on a btrfs_ordered_sum
      list.
      
      For the incoming RAID56 data checksum verification at RMW time, we don't
      want to waste time by allocating temporary memory.
      
      So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
      It will use bitmap based result, which will be a perfect fit for later
      RAID56 usage.
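A bitmap based result could look roughly like the sketch below: the caller supplies a checksum buffer and a bitmap, and each sector that has a checksum gets its slot filled and its bit set, with no temporary list allocated. This is only an illustration of the idea; struct toy_csum_item, toy_lookup_csums_bitmap() and its parameters are hypothetical and do not reflect the real signature of btrfs_lookup_csums_bitmap().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SECTORSIZE 4096u
#define TOY_CSUM_SIZE  4u

/* One stored checksum for the sector starting at 'bytenr' (sector aligned). */
struct toy_csum_item {
	uint64_t bytenr;
	uint8_t csum[TOY_CSUM_SIZE];
};

/*
 * For the sectors in [start, start + nr_sectors * TOY_SECTORSIZE), copy each
 * checksum that exists into its slot in csum_buf and set the matching bit in
 * *bitmap.  Returns the number of sectors that had a checksum.
 */
static int toy_lookup_csums_bitmap(const struct toy_csum_item *items,
				   size_t nr_items, uint64_t start,
				   unsigned int nr_sectors, uint8_t *csum_buf,
				   unsigned long *bitmap)
{
	int found = 0;

	/* This toy version keeps the whole bitmap in one unsigned long. */
	assert(nr_sectors <= 8 * sizeof(*bitmap));
	*bitmap = 0;
	for (size_t i = 0; i < nr_items; i++) {
		uint64_t idx;

		if (items[i].bytenr < start)
			continue;
		idx = (items[i].bytenr - start) / TOY_SECTORSIZE;
		if (idx >= nr_sectors)
			continue;
		memcpy(csum_buf + idx * TOY_CSUM_SIZE, items[i].csum,
		       TOY_CSUM_SIZE);
		*bitmap |= 1UL << idx;
		found++;
	}
	return found;
}
```

A caller can then test a single bit to learn whether a sector has a checksum to verify, and sectors without one simply keep a clear bit.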
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor checksum calculations in btrfs_lookup_csums_range() · cb649e81
      Qu Wenruo authored
      The refactoring involves the following parts:
      
      - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers
        As we have quite a few open-coded calculations, some of which are even
        split into two assignments just to fit the 80-character limit.
      
      - Remove the @csum_size parameter from max_ordered_sum_bytes()
        Csum size can be fetched from @fs_info.
        And we will use the csum_size_to_bytes() helper anyway.
      
      - Add a comment explaining how we handle the first search result
      
      - Use newly introduced helpers to cleanup btrfs_lookup_csums_range()
      
      - Move variables declaration to the minimal scope
      
      - Never mix number of sectors with bytes
        There are several locations doing things like:
      
      			size = min_t(size_t, csum_end - start,
      				     max_ordered_sum_bytes(fs_info));
      			...
      			size >>= fs_info->sectorsize_bits;
      
        Or
      
      			offset = (start - key.offset) >> fs_info->sectorsize_bits;
      			offset *= csum_size;
      
        Make sure these variables can only represent BYTES inside the
        function, by using the above bytes_to_csum_size() helpers.
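The two conversions could be sketched as follows. The helper names come from the commit message, but the signatures, the struct toy_fs_info type and the alignment assertions are assumptions made for this example and may differ from the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fs_info fields the helpers need. */
struct toy_fs_info {
	uint32_t sectorsize_bits; /* e.g. 12 for 4KiB sectors */
	uint32_t csum_size;       /* e.g. 4 for a 32-bit checksum */
};

/* Bytes of data -> bytes of checksum covering that data. */
static uint32_t bytes_to_csum_size(const struct toy_fs_info *fs_info,
				   uint32_t bytes)
{
	/* Assumed precondition: 'bytes' is sector aligned. */
	assert((bytes & ((1u << fs_info->sectorsize_bits) - 1)) == 0);
	return (bytes >> fs_info->sectorsize_bits) * fs_info->csum_size;
}

/* Bytes of checksum -> bytes of data those checksums cover. */
static uint32_t csum_size_to_bytes(const struct toy_fs_info *fs_info,
				   uint32_t csum_bytes)
{
	/* Assumed precondition: a whole number of checksums. */
	assert(csum_bytes % fs_info->csum_size == 0);
	return (csum_bytes / fs_info->csum_size) << fs_info->sectorsize_bits;
}
```

With 4KiB sectors and 4-byte checksums, 16KiB of data maps to 16 bytes of checksum, and every variable at the call site stays in bytes rather than switching to a sector count halfway through.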
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_io_context without GFP_NOFAIL · 9f0eac07
      Li zeming authored
      The __GFP_NOFAIL flag could loop indefinitely when allocating memory in
      alloc_btrfs_io_context. The callers starting from __btrfs_map_block
      already handle errors so it's safe to drop the flag.
      Signed-off-by: Li zeming <zeming@nfschina.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_dev_name() helper to handle missing devices better · cb3e217b
      Qu Wenruo authored
      [BUG]
      If dev-replace failed to re-construct its data/metadata, the kernel
      message would be incorrect for the missing device:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
      
      Note the "dev (efault)" in the second line, while the first line
      properly reports "<missing disk>".
      
      [CAUSE]
      Although dev-replace is using btrfs_dev_name(), the heavy lifting work
      is still done by scrub (scrub is reused by both dev-replace and regular
      scrub).
      
      Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
      only declared locally inside dev-replace.c.
      
      [FIX]
      Fix the output by:
      
      - Move the btrfs_dev_name() helper to volumes.h
      
      - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
        Only zoned code is not touched, as I'm not familiar with degraded
        zoned code.
      
      - Constify return value and parameter
      
      Now the output looks pretty sane:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
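The behaviour of such a helper can be sketched in userspace C as below. This is an illustration only: struct toy_device, toy_dev_name() and toy_format_rebuild_error() are hypothetical names, and the real helper works on struct btrfs_device with RCU-protected names. The idea is that every caller goes through one function that substitutes a fixed string for a missing device instead of dereferencing a name that is not there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified device record. */
struct toy_device {
	const char *name; /* NULL when there is no backing device */
	bool missing;
};

/* Always returns a printable string, even for a missing device. */
static const char *toy_dev_name(const struct toy_device *device)
{
	if (!device || device->missing || !device->name)
		return "<missing disk>";
	return device->name;
}

/* Format an error line the way the scrub message above is shaped. */
static int toy_format_rebuild_error(char *buf, size_t len,
				    const struct toy_device *device,
				    unsigned long long logical)
{
	assert(buf && len > 0);
	return snprintf(buf, len,
			"failed to rebuild valid logical %llu for dev %s",
			logical, toy_dev_name(device));
}
```

Because the return value and the parameter are const, callers can print the name but cannot modify the device or the string through this helper.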
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use cached state when looking for delalloc ranges with lseek · 3c32c721
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent,
      we will look for delalloc in that range, and one of the things we do for
      that is to find out ranges in the inode's io_tree marked with
      EXTENT_DELALLOC, using calls to count_range_bits().
      
      Typically there's a single, or few, searches in the io_tree for delalloc
      per lseek call. However it's common for applications to keep calling
      lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in
      a file, read the extents and skip holes in order to avoid unnecessary IO
      and save disk space by preserving holes.
      
      One popular user is the cp utility from coreutils. Starting with coreutils
      9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a
      file. Before 9.0, it used fiemap to figure out where holes and extents are
      in the source file. Another popular user is the tar utility when used with
      the --sparse / -S option to detect and preserve holes.
      
      Given that the pattern is to keep calling lseek with a start offset that
      matches the returned offset from the previous lseek call, we can benefit
      from caching the last extent state visited in count_range_bits() and use
      it for the next count_range_bits() from the next lseek call. For example,
      the following strace excerpt from running tar:
      
         $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw
         (...)
         lseek(5, 125019574272, SEEK_HOLE)       = 125024989184
         lseek(5, 125024989184, SEEK_DATA)       = 125024993280
         lseek(5, 125024993280, SEEK_HOLE)       = 125025239040
         lseek(5, 125025239040, SEEK_DATA)       = 125025255424
         lseek(5, 125025255424, SEEK_HOLE)       = 125025353728
         lseek(5, 125025353728, SEEK_DATA)       = 125025357824
         lseek(5, 125025357824, SEEK_HOLE)       = 125026766848
         lseek(5, 125026766848, SEEK_DATA)       = 125026770944
         lseek(5, 125026770944, SEEK_HOLE)       = 125027053568
         (...)
      
      Shows that pattern, which is the same as with cp from coreutils 9.0+.
      
      So start using a cached state for the delalloc searches in lseek, and
      store it in struct file's private data so that it can be reused across
      lseek calls.
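The access pattern that benefits from this caching is the usual SEEK_DATA/SEEK_HOLE loop, where each call starts at the offset the previous one returned. A minimal userspace sketch follows; walk_data_extents() is a hypothetical helper written for this example, and the fallback values for SEEK_DATA and SEEK_HOLE are the Linux ones, used only if the libc headers did not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only when the headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Walk the data extents of 'fd' from offset 0.  Each extent is printed to
 * 'out' when it is not NULL.  Returns the total number of data bytes, or -1
 * on error.
 */
static off_t walk_data_extents(int fd, FILE *out)
{
	off_t total = 0;
	off_t pos = 0;

	assert(fd >= 0);
	for (;;) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0) {
			/* ENXIO: no more data at or after 'pos'. */
			return errno == ENXIO ? total : -1;
		}
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		if (out)
			fprintf(out, "data [%lld, %lld)\n",
				(long long)data, (long long)hole);
		total += hole - data;
		/* The next search starts where this one ended. */
		pos = hole;
	}
}
```

Every iteration issues two lseek calls whose start offset equals the previous result, which is why remembering the last visited extent state across calls pays off.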
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      
      The following test was run before and after applying the whole patchset:
      
         $ cat test-cp.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         # coreutils 8.32, cp uses fiemap to detect holes and extents
         #CP_PROG=/usr/bin/cp
         # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents
         CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1024 * 1024 * 1024))
         echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M"
         # Create a very sparse file, where each extent has a length of 4K and
         # is preceded by a 4K hole and followed by another 4K hole.
         start=$(date +%s%N)
         echo -n > $MNT/foobar
         for ((off = 0; off < $FILE_SIZE; off += 8192)); do
                 xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
                 echo -ne "\r$off / $FILE_SIZE ..."
         done
         end=$(date +%s%N)
         echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)"
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and delalloc"
      
         # Flush all delalloc.
         sync
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and no delalloc"
      
         # Unmount and mount again to test the case without any metadata
         # loaded in memory.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds without data/metadata cached and no delalloc"
      
         umount $MNT
      
      The results, running on a box with a non-debug kernel (Debian's default
      kernel config), were the following:
      
      128M file, before patchset:
      
         cp took 16574 milliseconds with data/metadata cached and delalloc
         cp took 122 milliseconds with data/metadata cached and no delalloc
         cp took 20144 milliseconds without data/metadata cached and no delalloc
      
      128M file, after patchset:
      
         cp took 6277 milliseconds with data/metadata cached and delalloc
         cp took 109 milliseconds with data/metadata cached and no delalloc
         cp took 210 milliseconds without data/metadata cached and no delalloc
      
      512M file, before patchset:
      
         cp took 14369 milliseconds with data/metadata cached and delalloc
         cp took 429 milliseconds with data/metadata cached and no delalloc
         cp took 88034 milliseconds without data/metadata cached and no delalloc
      
      512M file, after patchset:
      
         cp took 12106 milliseconds with data/metadata cached and delalloc
         cp took 427 milliseconds with data/metadata cached and no delalloc
         cp took 824 milliseconds without data/metadata cached and no delalloc
      
      1G file, before patchset:
      
         cp took 10074 milliseconds with data/metadata cached and delalloc
         cp took 886 milliseconds with data/metadata cached and no delalloc
         cp took 181261 milliseconds without data/metadata cached and no delalloc
      
      1G file, after patchset:
      
         cp took 3320 milliseconds with data/metadata cached and delalloc
         cp took 880 milliseconds with data/metadata cached and no delalloc
         cp took 1801 milliseconds without data/metadata cached and no delalloc
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use cached state when looking for delalloc ranges with fiemap · b3e744fe
      Filipe Manana authored
      During fiemap, whenever we find a hole or prealloc extent, we will look
      for delalloc in that range, and one of the things we do for that is to
      find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
      calls to count_range_bits().
      
      Since we process file extents from left to right, if we have a file with
      several holes or prealloc extents, we benefit from keeping a cached extent
      state record for calls to count_range_bits(). Most of the time the last
      extent state record we visited in one call to count_range_bits() matches
      the first extent state record we will use in the next call to
      count_range_bits(), so there's a benefit here. So use an extent state
      record to cache results from count_range_bits() calls during fiemap.
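The benefit of the cached record can be illustrated with a sorted array standing in for the io_tree. This is a sketch under stated assumptions, not the extent-io-tree code: struct toy_state, struct toy_tree, toy_search() and toy_count_delalloc() are hypothetical, and a binary search over an array replaces the red black tree walk. When the cached record starts at or before the new range, the scan resumes there instead of searching from the top.

```c
#include <assert.h>
#include <stddef.h>

/* One extent state record; 'end' is inclusive. */
struct toy_state {
	unsigned long start;
	unsigned long end;
	int delalloc;
};

/* Records are sorted by start and do not overlap. */
struct toy_tree {
	const struct toy_state *recs;
	size_t nr;
};

/* Index of the first record with end >= start, or t->nr if none. */
static size_t toy_search(const struct toy_tree *t, unsigned long start)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->recs[mid].end < start)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count delalloc bytes in [start, end].  '*cached' holds the index of the
 * last record visited by a previous call, or (size_t)-1 if there is none.
 */
static unsigned long toy_count_delalloc(const struct toy_tree *t,
					unsigned long start, unsigned long end,
					size_t *cached)
{
	unsigned long total = 0;
	size_t i;

	assert(t && start <= end);
	if (cached && *cached < t->nr && t->recs[*cached].start <= start)
		i = *cached;              /* resume from the cached record */
	else
		i = toy_search(t, start); /* full search */

	for (; i < t->nr && t->recs[i].start <= end; i++) {
		const struct toy_state *r = &t->recs[i];

		if (r->end < start)
			continue;
		if (cached)
			*cached = i;
		if (r->delalloc) {
			unsigned long s = r->start > start ? r->start : start;
			unsigned long e = r->end < end ? r->end : end;

			total += e - s + 1;
		}
	}
	return total;
}
```

For left-to-right callers such as fiemap, consecutive ranges almost always pass the cached check, so the full search is skipped on every call after the first.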
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b3e744fe
    • Filipe Manana's avatar
      btrfs: update stale comment for count_range_bits() · 1ee51a06
      Filipe Manana authored
      The comment for count_range_bits() mentions that the search is fast if we
      are asking for a range with the EXTENT_DIRTY bit set. However that is no
      longer true since we don't use that bit and the optimization for that was
      removed in:
      
        commit 71528e9e ("btrfs: get rid of extent_io_tree::dirty_bytes")
      
      So remove that part of the comment mentioning the no longer existing
      optimized case, and, while at it, add proper documentation describing the
      purpose, arguments and return value of the function.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1ee51a06
    • Filipe Manana's avatar
      btrfs: allow passing a cached state record to count_range_bits() · 8c6e53a7
      Filipe Manana authored
      An inode's io_tree can be quite large and there are cases where due to
      delalloc it can have thousands of extent state records, which makes the
      red black tree have a depth of 10 or more, making the operation of
      count_range_bits() slow if we repeatedly call it for a range that starts
      where, or after, the previous one we called it for. Such use cases are
      when searching for delalloc in a file range that corresponds to a hole or
      a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.
      
      So introduce a cached state parameter to count_range_bits() which we use
      to store the last extent state record we visited, and then allow the
      caller to pass it again on its next call to count_range_bits(). The next
      patches in the series will make fiemap and lseek use the new parameter.
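
      The idea can be sketched in userspace as follows. This is only an
      illustration under simplifying assumptions: a sorted array stands in
      for the red black tree, and the names range_state, search_first() and
      count_bits_cached() are hypothetical, not the kernel's API. The cache
      is the index of the last record visited, and it is reused only when
      that record does not begin after the new search start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELALLOC_BIT 0x1u

/* Hypothetical stand-in for an extent state record (inclusive byte range). */
struct range_state {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/* Full search: index of the first record whose end is >= start. */
static size_t search_first(const struct range_state *recs, size_t n,
			   uint64_t start)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (recs[mid].end < start)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count the bytes in [start, end] that have any of @bits set.
 * @cached (may be NULL) holds the index of the last record visited by a
 * previous call; pass n (or any out-of-range index) when there is none.
 * Records must be sorted by offset and must not overlap.
 */
static uint64_t count_bits_cached(const struct range_state *recs, size_t n,
				  uint64_t start, uint64_t end,
				  unsigned int bits, size_t *cached)
{
	uint64_t total = 0;
	size_t i;

	assert(start <= end);

	/*
	 * The cached record is a valid starting point only if it does not
	 * begin after our range start; otherwise fall back to a full search.
	 */
	if (cached && *cached < n && recs[*cached].start <= start)
		i = *cached;
	else
		i = search_first(recs, n, start);

	for (; i < n && recs[i].start <= end; i++) {
		uint64_t lo, hi;

		if (recs[i].end < start)
			continue;
		if (cached)
			*cached = i;	/* remember the last record visited */
		if (!(recs[i].bits & bits))
			continue;
		lo = recs[i].start > start ? recs[i].start : start;
		hi = recs[i].end < end ? recs[i].end : end;
		total += hi - lo + 1;
	}
	return total;
}
```

      When ranges are queried from left to right, each call after the first
      starts at the cached record and skips the full search.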
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8c6e53a7
    • Filipe Manana's avatar
      btrfs: remove no longer used btrfs_next_extent_map() · cfd7a17d
      Filipe Manana authored
      There are no more users of btrfs_next_extent_map(), the previous patch
      in the series ("btrfs: search for delalloc more efficiently during
      lseek/fiemap") removed the last usage of the function, so delete it.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cfd7a17d
    • Filipe Manana's avatar
      btrfs: search for delalloc more efficiently during lseek/fiemap · 8ddc8274
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, we have to check if
      there's any delalloc in the range. We do it by searching for delalloc
      ranges in the inode's io_tree (for unflushed delalloc) and in the inode's
      extent map tree (for delalloc that is flushing).
      
      We avoid searching the extent map tree if the number of outstanding
      extents is 0, as in that case we can't have extent maps for our search
      range in the tree that correspond to delalloc that is flushing. However
      if we have any unflushed delalloc, due to buffered writes or mmap writes,
      then the outstanding extents counter is not 0 and we'll search the extent
      map tree. The tree may be large because it can have lots of extent maps
      that were loaded by reads or created by previous writes, therefore taking
      a significant time to search the tree, especially if we have a file with a
      lot of holes and/or prealloc extents.
      
      We can improve on this by searching the ordered extents tree of the
      inode instead of the extent map tree, since when delalloc is
      flushing we create an ordered extent along with the new extent map, while
      holding the respective file range locked in the inode's io_tree. The
      ordered extents tree is typically much smaller, since ordered extents have
      a short life and get removed from the tree once they are completed, while
      extent maps can stay for a very long time in the extent map tree, either
      created by previous writes or loaded by read operations.
      
      So use the ordered extents tree instead of the extent maps tree.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8ddc8274
    • Filipe Manana's avatar
      btrfs: skip unnecessary delalloc searches during lseek/fiemap · af979fd6
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, if we find that there is
      no delalloc marked in the inode's io_tree but there is delalloc due to
      an extent map in the extent map tree, then on the next iteration that calls
      find_delalloc_subrange() we can skip searching the io tree again, since
      on the first call we had no delalloc in the io tree for the whole range.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af979fd6
    • Filipe Manana's avatar
      btrfs: add an early exit when searching for delalloc range for lseek/fiemap · 40daf3e0
      Filipe Manana authored
      During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a
      range corresponding to a hole or a prealloc extent, if we find the whole
      range marked as delalloc in the inode's io_tree, then we can terminate
      immediately and avoid searching the extent map tree. If not, and if the
      found delalloc starts at our search start offset but ends before our
      search range's end, then we can adjust the search range for the search
      in the extent map tree. So implement those changes.
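
      The decision described above can be sketched as a small standalone
      helper. This is an illustration only: can_skip_second_search() and its
      parameters are hypothetical names, offsets are assumed to be inclusive
      byte offsets, and the result of the io_tree search is passed in as
      plain values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, all offsets inclusive.
 * Returns true when the second (extent map tree) search can be skipped
 * because the first search already found delalloc for the whole range.
 * Otherwise it may advance *search_start past the delalloc already found.
 */
static bool can_skip_second_search(uint64_t *search_start, uint64_t search_end,
				   bool found, uint64_t found_start,
				   uint64_t found_end)
{
	assert(*search_start <= search_end);

	/* Nothing found, or it does not begin at our start: search as is. */
	if (!found || found_start != *search_start)
		return false;

	/* Delalloc covers the whole range: early exit. */
	if (found_end >= search_end)
		return true;

	/* Delalloc covers a prefix: only the remainder needs searching. */
	*search_start = found_end + 1;
	return false;
}
```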
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40daf3e0
    • Filipe Manana's avatar
      btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree · 2c8f5e8c
      Filipe Manana authored
      We don't need to set the EXTENT_UPTODATE bit in an inode's io_tree to mark a
      range as uptodate, we rely on the pages themselves being uptodate - page
      reading is not triggered for already uptodate pages. Recently we removed
      most uses of EXTENT_UPTODATE for buffered IO with commit 52b029f4
      ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
      but there were a few leftovers, namely when reading from holes and
      successfully finishing read repair.
      
      These leftovers are unnecessarily making an inode's tree larger and deeper,
      slowing down searches on it. So remove all the leftovers.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2c8f5e8c
    • Qu Wenruo's avatar
      btrfs: move tree block parentness check into validate_extent_buffer() · 947a6299
      Qu Wenruo authored
      [BACKGROUND]
      Although both btrfs metadata and data have their read-time verification
      done at endio time (btrfs_validate_metadata_buffer() and
      btrfs_verify_data_csum()), metadata has extra verification, mostly
      parentness checks including first key/transid/owner_root/level, done in
      read_tree_block() and btrfs_read_extent_buffer().
      
      On the other hand, all the data verification is done at endio context.
      
      [ENHANCEMENT]
      This patch will make a new union in btrfs_bio, taking the space of the
      old data checksums, thus it will not increase the memory usage.
      
      With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
      pass the check parameter into read_extent_buffer_pages(), and before
      submitting the bio, we can copy the check structure into btrfs_bio.
      
      And finally at endio time, we can grab btrfs_bio::parent_check and pass
      it to validate_extent_buffer(), to move the remaining checks into it.
      
      This brings the following benefits:
      
      - Much simpler btrfs_read_extent_buffer()
        Now it only needs to iterate through all mirrors.
      
      - Simpler read-time transid check
        Previously we called verify_parent_transid() after reading the
        extent buffer.
        Now the transid check is done inside the endio function, where no
        other code can modify the content.
        Thus there is no need to use the extent lock anymore.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      947a6299
    • Qu Wenruo's avatar
      btrfs: concentrate all tree block parentness check parameters into one structure · 789d6a3a
      Qu Wenruo authored
      There are several different tree block parentness check parameters used
      across several helpers:
      
      - level
        Mandatory
      
      - transid
        In most cases it's mandatory, but there are several backref cases
        which skip this check.
      
      - owner_root
      - first_key
        Utilized by most top-down tree search routines. Otherwise can be
        skipped.
      
      Those four members are not always mandatory checks, and some of them
      have the same u64 type, which means that if some arguments get swapped
      the compiler will not catch it.
      
      Furthermore, if we're going to further expand the parentness check, we
      need to modify quite a few helpers just to add one more parameter.
      
      This patch will concentrate all these members into a structure called
      btrfs_tree_parent_check, and pass that structure to the following
      helpers:
      
      - btrfs_read_extent_buffer()
      - read_tree_block()
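
      A minimal userspace sketch of the pattern, for illustration. The
      struct and function names here (tree_parent_check, tree_block,
      check_parent()), the field layout, the "0 means skip" convention and
      the -1 error return are assumptions made for this example, not the
      kernel's definitions. Only the four members listed above come from
      the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified key, hypothetical layout. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* All parentness check parameters gathered in one place. */
struct tree_parent_check {
	uint64_t owner_root;	/* assumed: 0 means "skip this check" */
	uint64_t transid;	/* assumed: 0 means "skip this check" */
	struct key first_key;
	bool has_first_key;	/* first_key is only checked when set */
	uint8_t level;		/* always checked */
};

/* The fields of a tree block that the checks compare against. */
struct tree_block {
	uint64_t owner;
	uint64_t generation;
	struct key first_key;
	uint8_t level;
};

static bool key_equal(const struct key *a, const struct key *b)
{
	return a->objectid == b->objectid && a->type == b->type &&
	       a->offset == b->offset;
}

/* Returns 0 if the block matches all requested checks, -1 otherwise. */
static int check_parent(const struct tree_block *blk,
			const struct tree_parent_check *check)
{
	assert(blk && check);

	if (blk->level != check->level)
		return -1;
	if (check->transid && blk->generation != check->transid)
		return -1;
	if (check->owner_root && blk->owner != check->owner_root)
		return -1;
	if (check->has_first_key &&
	    !key_equal(&blk->first_key, &check->first_key))
		return -1;
	return 0;
}
```

      With designated initializers, a caller names every value it passes
      (for example { .level = 1, .transid = 100 }), so two u64 values can
      no longer be swapped silently, and adding a new check means adding a
      field rather than touching every helper's signature.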
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      789d6a3a
    • Anand Jain's avatar
      btrfs: move device->name RCU allocation and assign to btrfs_alloc_device() · bb21e302
      Anand Jain authored
      There is a repeating code section in the parent function after calling
      btrfs_alloc_device(), as below:
      
            name = rcu_string_strdup(path, GFP_...);
            if (!name) {
                    btrfs_free_device(device);
                    return ERR_PTR(-ENOMEM);
            }
            rcu_assign_pointer(device->name, name);
      
      Except in add_missing_dev() for obvious reasons.
      
      This patch consolidates that repeating code into btrfs_alloc_device()
      itself so that the parent functions don't have to duplicate it.
      This consolidation also helps to review issues regarding RCU lock
      violations with device->name.
      
      The parent functions device_list_add() and add_missing_dev() use
      GFP_NOFS for the allocation, whereas the rest of the parent functions
      use GFP_KERNEL, so bring in the NOFS allocation context using
      memalloc_nofs_save() in device_list_add(); add_missing_dev() is
      already doing it.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bb21e302
    • David Sterba's avatar
      btrfs: constify input buffer parameter in compression code · 3e09b5b2
      David Sterba authored
      The input buffers passed down to compression must never be changed, so
      switch the type to u8, as it's a raw byte buffer, and use const.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e09b5b2
    • Qu Wenruo's avatar
      btrfs: raid56: remove the old error tracking system · ad3daf1c
      Qu Wenruo authored
      Since all the recovery paths have been migrated to the new error bitmap
      based system, we can remove the old stripe number based system.
      
      This cleanup involves one behavior change:
      
      - Rebuild rbio can no longer be merged
        Previously a rebuild rbio (caused by retry after data csum mismatch)
        could be merged, if the error happened in the same stripe.
      
        But with the new error bitmap based solution, it's much harder to
        compare error bitmaps.
      
        So here we just don't merge rebuild rbio at all.
        This may introduce some performance impact at extreme corner cases,
        but we're willing to take it.
      
      Other than that, this patch will clean up the following members:
      
      - rbio::faila
      - rbio::failb
        They will be replaced by per-vertical stripe check, which is more
        accurate.
      
      - rbio::error
        It will be replaced by per-vertical stripe error bitmap check.
      
      - Allow get_rbio_vertical_errors() to accept NULL pointers for
        @faila and @failb
        Some call sites only want to check if we have errors beyond the
        tolerance.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad3daf1c
    • Qu Wenruo's avatar
      btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap · 75b47033
      Qu Wenruo authored
      Since we have rbio::error_bitmap to indicate exactly where the errors
      are (including read errors and csum mismatch errors), we can make the
      recovery path more accurate.
      
      For example:
      
                   0        32K       64K
           Data 1  |XXXXXXXX|         |
           Data 2  |        |XXXXXXXXX|
           Parity  |        |         |
      
      1) Get csum mismatch when reading data 1 [0, 32K)
      
      2) Mark corresponding range error
         The old code will mark the whole data 1 stripe as error.
         While the new code will only mark data 1 [0, 32K) as error.
      
      3) Recovery path
         The old code will recover data 1 [0, 64K), all using Data 2 and
         parity.
      
         This means, Data 1 [32K, 64K) will be corrupted data, as data 2
         [32K, 64K) is already corrupted.
      
         While the new code will only recover data 1 [0, 32K), as only
         that range has error so far.
      
      This new behavior can avoid populating rbio cache with incorrect data.
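
      The example above can be modelled in userspace with two data stripes
      and XOR parity. This is a sketch under stated assumptions, not the
      raid56 code: rbio_sketch, sector_bit(), compute_parity() and
      recover_flagged() are hypothetical names, and sectors are shrunk to 4
      bytes with 2 sectors per stripe to keep it small. Only sectors that
      are flagged in the error bitmap get rebuilt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4		/* tiny sectors, for illustration only */
#define SECTORS_PER_STRIPE 2
#define NR_DATA 2

struct rbio_sketch {
	uint8_t data[NR_DATA][SECTORS_PER_STRIPE][SECTOR_SIZE];
	uint8_t parity[SECTORS_PER_STRIPE][SECTOR_SIZE];
	uint32_t error_bitmap;	/* bit = stripe * SECTORS_PER_STRIPE + sector */
};

static uint32_t sector_bit(int stripe, int sector)
{
	return 1u << (stripe * SECTORS_PER_STRIPE + sector);
}

/* Parity is the XOR of the two data stripes, sector by sector. */
static void compute_parity(struct rbio_sketch *r)
{
	for (int s = 0; s < SECTORS_PER_STRIPE; s++)
		for (int b = 0; b < SECTOR_SIZE; b++)
			r->parity[s][b] = r->data[0][s][b] ^ r->data[1][s][b];
}

/*
 * Rebuild only the sectors of data stripe @stripe that are flagged in the
 * error bitmap, using the other data stripe and parity. Returns the number
 * of sectors rebuilt, or -1 if the other data stripe is flagged in the
 * same vertical stripe (two failures, beyond what XOR parity can repair).
 */
static int recover_flagged(struct rbio_sketch *r, int stripe)
{
	int other = 1 - stripe;
	int rebuilt = 0;

	assert(stripe == 0 || stripe == 1);

	for (int s = 0; s < SECTORS_PER_STRIPE; s++) {
		if (!(r->error_bitmap & sector_bit(stripe, s)))
			continue;	/* sector not flagged: leave it alone */
		if (r->error_bitmap & sector_bit(other, s))
			return -1;
		for (int b = 0; b < SECTOR_SIZE; b++)
			r->data[stripe][s][b] =
				r->data[other][s][b] ^ r->parity[s][b];
		r->error_bitmap &= ~sector_bit(stripe, s);
		rebuilt++;
	}
	return rebuilt;
}
```

      If data 2 is silently corrupted in its second sector, a whole-stripe
      rebuild of data 1 would overwrite the good second sector of data 1
      with garbage, while the per-sector version leaves it untouched.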
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      75b47033
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::error_bitmap · 2942a50d
      Qu Wenruo authored
      Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate
      which stripe(s) had IO errors.
      
      But that has some problems:
      
      - If one sector fails the csum check, the whole stripe containing the
        corruption will be marked as an error.
        This can reduce the chance of a successful recovery, like this:
      
                0  4K 8K
        Data 1  |XX|  |
        Data 2  |  |XX|
        Parity  |  |  |
      
        In above case, 0~4K in data 1 should be recovered using data 2 and
        parity, while 4K~8K in data 2 should be recovered using data 1 and
        parity.
      
        Currently if we trigger read on 0~4K of data 1, we will also recover
        4K~8K of data 1 using corrupted data 2 and parity, causing wrong
        result in rbio cache.
      
      - Harder to expand for future M-N scheme
        As we're limited to just faila/b, two corruptions.
      
      - Harder to expand to handle extra csum errors
        This can be problematic if we start to do csum verification.
      
      This patch will introduce an extra @error_bitmap, where each bit
      represents an error that happened for the corresponding sector.
      
      The choice to introduce a new error bitmap rather than reusing
      sector_ptr is to avoid an extra search between rbio::stripe_sectors[]
      and rbio::bio_sectors[].
      
      Since we can submit bios using sectors from both arrays, doing a proper
      search on both arrays would be more complex.
      
      Although the new bitmap will take extra memory, later we can remove
      things like @error and faila/b to save some memory.
      
      Currently the new error bitmap and the faila/b mechanism coexist; the
      error bitmap is only updated at endio time and at recovery entry.
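
      As an illustration of how a per-sector bitmap can answer the question
      faila/failb used to answer, the sketch below counts the errors in one
      vertical stripe (the same sector index across all stripes). The
      function vertical_errors(), its parameters and the bit layout
      (bit = stripe * sectors_per_stripe + sector) are assumptions for this
      example. The two output pointers are optional, for callers that only
      need the error count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the errors recorded for sector @sector_nr across all stripes.
 * If @faila / @failb are not NULL they receive the first and second
 * failed stripe numbers, or -1 when there is no such failure.
 */
static int vertical_errors(uint64_t error_bitmap, int nr_stripes,
			   int sectors_per_stripe, int sector_nr,
			   int *faila, int *failb)
{
	int found = 0;

	assert(nr_stripes * sectors_per_stripe <= 64);
	assert(sector_nr >= 0 && sector_nr < sectors_per_stripe);

	if (faila)
		*faila = -1;
	if (failb)
		*failb = -1;

	for (int stripe = 0; stripe < nr_stripes; stripe++) {
		int bit = stripe * sectors_per_stripe + sector_nr;

		if (!(error_bitmap & (UINT64_C(1) << bit)))
			continue;
		found++;
		if (found == 1 && faila)
			*faila = stripe;
		else if (found == 2 && failb)
			*failb = stripe;
	}
	return found;
}
```

      The return value can be compared against the profile's tolerance (one
      failure for a single parity stripe, two for dual parity), and it is
      not capped at two the way a faila/failb pair is.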
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2942a50d
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_add_delayed_iput · e55cf7ca
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e55cf7ca