1. 16 May, 2022 40 commits
    • Christoph Hellwig's avatar
      btrfs: use a normal workqueue for rmw_workers · 385de0ef
      Christoph Hellwig authored
      rmw_workers doesn't need ordered execution or thread disabling threshold
      (as the thresh parameter is less than DFT_THRESHOLD).
      
      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      385de0ef
    • Christoph Hellwig's avatar
      btrfs: use normal workqueues for scrub · be539518
      Christoph Hellwig authored
      All three scrub workqueues don't need ordered execution or thread
      disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
      Just switch to the normal workqueues that use a lot less resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be539518
    • Christoph Hellwig's avatar
      btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue · a31b4a43
      Christoph Hellwig authored
      Just let the one caller that wants optional WQ_HIGHPRI handling allocate
      a separate btrfs_workqueue for that.  This allows to rename struct
      __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and
      separate allocation for all btrfs_workqueue users and generally simplify
      the code.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a31b4a43
    • Qu Wenruo's avatar
      btrfs: raid56: enable subpage support for RAID56 · a7b8e39c
      Qu Wenruo authored
      Now the btrfs RAID56 infrastructure has migrated to use sector_ptr
      interface, it should be safe to enable subpage support for RAID56.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7b8e39c
    • Qu Wenruo's avatar
      btrfs: raid56: make alloc_rbio_essential_pages() subpage compatible · 3907ce29
      Qu Wenruo authored
      The non-compatible part is only the bitmap iteration part, now the
      bitmap size is extended to rbio::stripe_nsectors, not the old
      rbio::stripe_npages.
      
      Since we're here, also slightly improve the function by:
      
      - Rename @i to @stripe
      - Rename @bit to @sectornr
      - Move @page and @index into the inner loop
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3907ce29
    • Qu Wenruo's avatar
      btrfs: raid56: make steal_rbio() subpage compatible · d4e28d9b
      Qu Wenruo authored
      Function steal_rbio() will take all the uptodate pages from the source
      rbio to destination rbio.
      
      With the new stripe_sectors[] array, we also need to do the extra check:
      
      - Check sector::flags to make sure the full page is uptodate
        Now we don't use PageUptodate flag for subpage cases to indicate
        if the page is uptodate.
      
        Instead we need to check all the sectors belong to the page to be sure
        about whether it's full page uptodate.
      
        So here we introduce a new helper, full_page_sectors_uptodate() to do
        the check.
      
      - Update rbio::stripe_sectors[] to use the new page pointer
        We only need to change the page pointer, no need to change anything
        else.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4e28d9b
    • Qu Wenruo's avatar
      btrfs: raid56: make set_bio_pages_uptodate() subpage compatible · 5fdb7afc
      Qu Wenruo authored
      Unlike previous code, we can not directly set PageUptodate for stripe
      pages now.  Instead we have to iterate through all the sectors and set
      SECTOR_UPTODATE flag there.
      
      Introduce a new helper find_stripe_sector(), to do the work.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5fdb7afc
    • Qu Wenruo's avatar
      btrfs: raid56: remove btrfs_raid_bio::bio_pages array · ac26df8b
      Qu Wenruo authored
      The functionality is completely replaced by the new bio_sectors member,
      now it's time to remove the old member.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac26df8b
    • Qu Wenruo's avatar
      btrfs: raid56: make raid56_add_scrub_pages() subpage compatible · 6346f6bf
      Qu Wenruo authored
      This requires one extra parameter @pgoff for the function.
      
      In the current code base, scrub is still one page per sector, thus the
      new parameter will always be 0.
      
      It needs the extra subpage scrub optimization code to fully take
      advantage.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6346f6bf
    • Qu Wenruo's avatar
      btrfs: raid56: open code rbio_stripe_page_index() · f77183dc
      Qu Wenruo authored
      There is only one caller for that helper now, and we're definitely fine
      to open-code it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f77183dc
    • Qu Wenruo's avatar
      btrfs: raid56: make finish_rmw() subpage compatible · 1145059a
      Qu Wenruo authored
      With this function converted to subpage compatible sector interfaces,
      the following helper functions can be removed:
      
      - rbio_stripe_page()
      - rbio_pstripe_page()
      - rbio_qstripe_page()
      - page_in_rbio()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1145059a
    • Qu Wenruo's avatar
      btrfs: raid56: make __raid_recover_endio_io() subpage compatible · 07e4d380
      Qu Wenruo authored
      This involves:
      
      - Use sector_ptr interface to grab the pointers
      
      - Add sector->pgoff to pointers[]
      
      - Rebuild data using sectorsize instead of PAGE_SIZE
      
      - Use memcpy() to replace copy_page()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07e4d380
    • Qu Wenruo's avatar
      btrfs: raid56: make finish_parity_scrub() subpage compatible · 46900662
      Qu Wenruo authored
      The core is to convert direct page usage into sector_ptr usage, and
      use memcpy() to replace copy_page().
      
      For pointers usage, we need to convert it to kmap_local_page() +
      sector->pgoff.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      46900662
    • Qu Wenruo's avatar
      btrfs: raid56: make rbio_add_io_page() subpage compatible · 3e77605d
      Qu Wenruo authored
      Make rbio_add_io_page() subpage compatible, which involves:
      
      - Rename rbio_add_io_page() to rbio_add_io_sector()
        Although we still rely on PAGE_SIZE == sectorsize, so add a new
        ASSERT() inside rbio_add_io_sector() to make sure all pgoff is 0.
      
      - Introduce rbio_stripe_sector() helper
        The equivalent of rbio_stripe_page().
      
        This new helper has extra ASSERT()s to validate the stripe and sector
        number.
      
      - Introduce sector_in_rbio() helper
        The equivalent of page_in_rbio().
      
      - Rename @pagenr variables to @sectornr
      
      - Use rbio::stripe_nsectors when iterating the bitmap
      
      Please note that, this only changes the interface, the bios are still
      using full page for IO.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e77605d
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::bio_sectors · 00425dd9
      Qu Wenruo authored
      This new member is going to fully replace bio_pages in the future, but
      for now let's keep them co-exist, until the full switch is done.
      
      Currently cache_rbio_pages() and index_rbio_pages() will also populate
      the new array.
      
      And cache_rbio_pages() need to record which sectors are uptodate, so we
      also need to introduce sector_ptr::uptodate bit.
      
      To avoid extra memory usage, we let the new @uptodate bit to share bits
      with @pgoff.  Now pgoff only has at most 31 bits, which is already more
      than enough, as even for 256K page size, we only need 18 bits.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      00425dd9
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::stripe_sectors · eb357060
      Qu Wenruo authored
      The new member is an array of sector_ptr pointers, they will represent
      all sectors inside a full stripe (including P/Q).
      
      They co-operate with btrfs_raid_bio::stripe_pages:
      
      stripe_pages:   | Page 0, range [0, 64K)   | Page 1 ...
      stripe_sectors: |  |  | ...             |  |
                      |  |                    \- sector 15, page 0, pgoff=60K
                      |  \- sector 1, page 0, pgoff=4K
                      \---- sector 0, page 0, pfoff=0
      
      With such structure, we can represent subpage sectors without using
      extra pages.
      
      Here we introduce a new helper, index_stripe_sectors(), to update
      stripe_sectors[] to point to correct page and pgoff.
      
      So every time rbio::stripe_pages[] pointer gets updated, the new helper
      should be called.
      
      The following functions have to call the new helper:
      
      - steal_rbio()
      - alloc_rbio_pages()
      - alloc_rbio_parity_pages()
      - alloc_rbio_essential_pages()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb357060
    • Qu Wenruo's avatar
      btrfs: raid56: introduce new cached members for btrfs_raid_bio · 94efbe19
      Qu Wenruo authored
      The new members are all related to number of sectors, but the existing
      number of pages members are kept as is:
      
      - nr_sectors
        Total sectors of the full stripe including P/Q.
      
      - stripe_nsectors
        The sectors of a single stripe.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94efbe19
    • Qu Wenruo's avatar
      btrfs: raid56: make btrfs_raid_bio more compact · 29b06838
      Qu Wenruo authored
      There are a lot of members using much larger type in btrfs_raid_bio than
      necessary, like nr_pages which represents the total number of a full
      stripe.
      
      Instead of int (which is at least 32bits), u16 is already enough
      (max stripe length will be 256MiB, already beyond current RAID56 device
      number limit).
      
      So this patch will reduce the width of the following members:
      
      - stripe_len to u32
      - nr_pages to u16
      - nr_data to u8
      - real_stripes to u8
      - scrubp to u8
      - faila/b to s8
        As -1 is used to indicate no corruption
      
      This will slightly reduce the size of btrfs_raid_bio from 272 bytes to
      256 bytes, reducing 16 bytes usage.
      
      But please note that, when using btrfs_raid_bio, we allocate extra space
      for it to cover various pointer array, so the reduce memory is not
      really a big saving overall.
      
      As we're here modifying the comments already, update existing comments
      to current code standard.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      29b06838
    • Qu Wenruo's avatar
      btrfs: raid56: open code rbio_nr_pages() · 843de58b
      Qu Wenruo authored
      The function rbio_nr_pages() is only called once inside alloc_rbio(),
      there is no reason to make it dedicated helper.
      
      Furthermore, the return type doesn't match, the function return "unsigned
      long" which may not be necessary, while the only caller only uses "int".
      
      Since we're doing cleaning up here, also fix the type to "const unsigned
      int" for all involved local variables.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      843de58b
    • Qu Wenruo's avatar
      btrfs: reduce width for stripe_len from u64 to u32 · cc353a8b
      Qu Wenruo authored
      Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough
      for the usage.
      
      Furthermore, even in the future we choose to enlarge stripe length to
      larger values, I don't believe we would want stripe as large as 4G or
      larger.
      
      So this patch will reduce the width for all in-memory structures and
      parameters, this involves:
      
      - RAID56 related function argument lists
        This allows us to do direct division related to stripe_len.
        Although we will use bits shift to replace the division anyway.
      
      - btrfs_io_geometry structure
        This involves one change to simplify the calculation of both @stripe_nr
        and @stripe_offset, using div64_u64_rem().
        And add extra sanity check to make sure @stripe_offset is always small
        enough for u32.
      
        This saves 8 bytes for the structure.
      
      - map_lookup structure
        This convert @stripe_len to u32, which saves 8 bytes. (saved 4 bytes,
        and removed a 4-bytes hole)
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc353a8b
    • Christoph Hellwig's avatar
      btrfs: do not return errors from submit_bio_hook_t instances · ad357938
      Christoph Hellwig authored
      Both btrfs_repair_one_sector and submit_bio_one as the direct caller of
      one of the instances ignore errors as they expect the methods themselves
      to call ->bi_end_io on error.  Remove the unused and dangerous return
      value.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad357938
    • Christoph Hellwig's avatar
      btrfs: do not return errors from btrfs_submit_compressed_read · cb4411dd
      Christoph Hellwig authored
      btrfs_submit_compressed_read already calls ->bi_end_io on error and
      the caller must ignore the return value, so remove it.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb4411dd
    • Christoph Hellwig's avatar
      btrfs: do not return errors from btrfs_submit_metadata_bio · 94d9e11b
      Christoph Hellwig authored
      btrfs_submit_metadata_bio already calls ->bi_end_io on error and the
      caller must ignore the return value, so remove it.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94d9e11b
    • Christoph Hellwig's avatar
      btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bio · abf48d58
      Christoph Hellwig authored
      This argument is unused since commit 953651eb ("btrfs: factor out
      helper adding a page to bio") and commit 1b36294a ("btrfs: call
      submit_bio_hook directly for metadata pages") reworked the way metadata
      bio submission is handled.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      abf48d58
    • Christoph Hellwig's avatar
      btrfs: move btrfs_readpage to extent_io.c · 7aab8b32
      Christoph Hellwig authored
      Keep btrfs_readpage next to btrfs_do_readpage and the other address
      space operations.  This allows to keep submit_one_bio and
      struct btrfs_bio_ctrl file local in extent_io.c.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7aab8b32
    • Qu Wenruo's avatar
      btrfs: repair super block num_devices automatically · d201238c
      Qu Wenruo authored
      [BUG]
      There is a report that a btrfs has a bad super block num devices.
      
      This makes btrfs to reject the fs completely.
      
        BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
        BTRFS error (device sdd3): failed to read chunk tree: -22
        BTRFS error (device sdd3): open_ctree failed
      
      [CAUSE]
      During btrfs device removal, chunk tree and super block num devs are
      updated in two different transactions:
      
        btrfs_rm_device()
        |- btrfs_rm_dev_item(device)
        |  |- trans = btrfs_start_transaction()
        |  |  Now we got transaction X
        |  |
        |  |- btrfs_del_item()
        |  |  Now device item is removed from chunk tree
        |  |
        |  |- btrfs_commit_transaction()
        |     Transaction X got committed, super num devs untouched,
        |     but device item removed from chunk tree.
        |     (AKA, super num devs is already incorrect)
        |
        |- cur_devices->num_devices--;
        |- cur_devices->total_devices--;
        |- btrfs_set_super_num_devices()
           All those operations are not in transaction X, thus it will
           only be written back to disk in next transaction.
      
      So after the transaction X in btrfs_rm_dev_item() committed, but before
      transaction X+1 (which can be minutes away), a power loss happen, then
      we got the super num mismatch.
      
      This has been fixed by commit bbac5869 ("btrfs: remove device item
      and update super block in the same transaction").
      
      [FIX]
      Make the super_num_devices check less strict, converting it from a hard
      error to a warning, and reset the value to a correct one for the current
      or next transaction commit.
      
      As the number of device items is the critical information where the
      super block num_devices is only a cached value (and also useful for
      cross checking), it's safe to automatically update it. Other device
      related problems like missing device are handled after that and may
      require other means to resolve, like degraded mount. With this fix,
      potentially affected filesystems won't fail mount and require the manual
      repair by btrfs check.
      Reported-by: default avatarLuca Béla Palkovics <luca.bela.palkovics@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d201238c
    • Goldwyn Rodrigues's avatar
      btrfs: do not pass compressed_bio to submit_compressed_bio() · 46fbd18e
      Goldwyn Rodrigues authored
      Parameter struct compressed_bio is not used by the function
      submit_compressed_bio(). Remove it.
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      46fbd18e
    • Filipe Manana's avatar
      btrfs: avoid double search for block group during NOCOW writes · 2306e83e
      Filipe Manana authored
      When doing a NOCOW write, either through direct IO or buffered IO, we do
      two lookups for the block group that contains the target extent: once
      when we call btrfs_inc_nocow_writers() and then later again when we call
      btrfs_dec_nocow_writers() after creating the ordered extent.
      
      The lookups require taking a lock and navigating the red black tree used
      to track all block groups, which can take a non-negligible amount of time
      for a large filesystem with thousands of block groups, as well as lock
      contention and cache line bouncing.
      
      Improve on this by having a single block group search: making
      btrfs_inc_nocow_writers() return the block group to its caller and then
      have the caller pass that block group to btrfs_dec_nocow_writers().
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: remove search start argument from first_logical_byte()
        btrfs: use rbtree with leftmost node cached for tracking lowest block group
        btrfs: use a read/write lock for protecting the block groups tree
        btrfs: return block group directly at btrfs_next_block_group()
        btrfs: avoid double search for block group during NOCOW writes
      
      The following test was used to test these changes from a performance
      perspective:
      
         $ cat test.sh
         #!/bin/bash
      
         modprobe null_blk nr_devices=0
      
         NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
         mkdir $NULL_DEV_PATH
         if [ $? -ne 0 ]; then
             echo "Failed to create nullb0 directory."
             exit 1
         fi
         echo 2 > $NULL_DEV_PATH/submit_queues
         echo 16384 > $NULL_DEV_PATH/size # 16G
         echo 1 > $NULL_DEV_PATH/memory_backed
         echo 1 > $NULL_DEV_PATH/power
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
         LOOP_MNT="$MNT/loop"
         MOUNT_OPTIONS="-o ssd -o nodatacow"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [io_uring_writes]
         rw=randwrite
         fsync=0
         fallocate=posix
         group_reporting=1
         direct=1
         ioengine=io_uring
         iodepth=64
         bs=64k
         filesize=1g
         runtime=300
         time_based
         directory=$LOOP_MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         mkdir $LOOP_MNT
      
         truncate -s 4T $MNT/loopfile
         mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
         mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
      
         # Trigger the allocation of about 3500 data block groups, without
         # actually consuming space on underlying filesystem, just to make
         # the tree of block group large.
         fallocate -l 3500G $LOOP_MNT/filler
      
         fio /tmp/fio-job.ini
      
         umount $LOOP_MNT
         umount $MNT
      
         echo 0 > $NULL_DEV_PATH/power
         rmdir $NULL_DEV_PATH
      
      The test was run on a non-debug kernel (Debian's default kernel config),
      the result were the following.
      
      Before patchset:
      
        WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
      
      After patchset:
      
        WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
      
        +3.3% write throughput and +3.3% IO done in the same time period.
      
      The test has somewhat limited coverage scope, as with only NOCOW writes
      we get less contention on the red black tree of block groups, since we
      don't have the extra contention caused by COW writes, namely when
      allocating data extents, pinning and unpinning data extents, but on the
      hand there's access to tree in the NOCOW path, when incrementing a block
      group's number of NOCOW writers.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2306e83e
    • Filipe Manana's avatar
      btrfs: return block group directly at btrfs_next_block_group() · 8b01f931
      Filipe Manana authored
      At btrfs_next_block_group(), we have this long line with two statements:
      
        cache = btrfs_lookup_first_block_group(...); return cache;
      
      This makes it a bit harder to read due to two statements on the same
      line, so change that to directly return the result of the call to
      btrfs_lookup_first_block_group().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8b01f931
    • Filipe Manana's avatar
      btrfs: use a read/write lock for protecting the block groups tree · 16b0c258
      Filipe Manana authored
      Currently we use a spin lock to protect the red black tree that we use to
      track block groups. Most accesses to that tree are actually read only and
      for large filesystems, with thousands of block groups, it actually has
      a bad impact on performance, as concurrent read only searches on the tree
      are serialized.
      
      Read only searches on the tree are very frequent and done when:
      
      1) Pinning and unpinning extents, as we need to lookup the respective
         block group from the tree;
      
      2) Freeing the last reference of a tree block, regardless if we pin the
         underlying extent or add it back to free space cache/tree;
      
      3) During NOCOW writes, both buffered IO and direct IO, we need to check
         if the block group that contains an extent is read only or not and to
         increment the number of NOCOW writers in the block group. For those
         operations we need to search for the block group in the tree.
         Similarly, after creating the ordered extent for the NOCOW write, we
         need to decrement the number of NOCOW writers from the same block
         group, which requires searching for it in the tree;
      
      4) Decreasing the number of extent reservations in a block group;
      
      5) When allocating extents and freeing reserved extents;
      
      6) Adding and removing free space to the free space tree;
      
      7) When releasing delalloc bytes during ordered extent completion;
      
      8) When relocating a block group;
      
      9) During fitrim, to iterate over the block groups;
      
      10) etc;
      
      Write accesses to the tree, to add or remove block groups, are much less
      frequent as they happen only when allocating a new block group or when
      deleting a block group.
      
      We also use the same spin lock to protect the list of currently caching
      block groups. Additions to this list are made when we need to cache a
      block group, because we don't have a free space cache for it (or we have
      but it's invalid), and removals from this list are done when caching of
      the block group's free space finishes. These cases are also not very
      common, but when they happen, they happen only once when the filesystem
      is mounted.
      
      So switch the lock that protects the tree of block groups from a spinning
      lock to a read/write lock.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      16b0c258
    • Filipe Manana's avatar
      btrfs: use rbtree with leftmost node cached for tracking lowest block group · 08dddb29
      Filipe Manana authored
      We keep track of the start offset of the block group with the lowest start
      offset at fs_info->first_logical_byte. This requires explicitly updating
      that field every time we add, delete or lookup a block group to/from the
      red black tree at fs_info->block_group_cache_tree.
      
      Since the block group with the lowest start address happens to always be
      the one that is the leftmost node of the tree, we can use a red black tree
      that caches the left most node. Then when we need the start address of
      that block group, we can just quickly get the leftmost node in the tree
      and extract the start offset of that node's block group. This avoids the
      need to explicitly keep track of that address in the dedicated member
      fs_info->first_logical_byte, and it also allows the next patch in the
      series to switch the lock that protects the red black tree from a spin
      lock to a read/write lock - without this change it would be tricky
      because block group searches also update fs_info->first_logical_byte.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08dddb29
    • Filipe Manana's avatar
      btrfs: remove search start argument from first_logical_byte() · 0eb997bf
      Filipe Manana authored
      The search start argument passed to first_logical_byte() is always 0, as
      we always want to get the logical start address of the block group with
      the lowest logical start address. So remove it, as not only it is not
      necessary, it also makes the following patches that change the lock that
      protects the red black tree of block groups from a spin lock to a
      read/write lock.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0eb997bf
    • Qu Wenruo's avatar
      btrfs: return correct error number for __extent_writepage_io() · 44e5801f
      Qu Wenruo authored
      [BUG]
      If we hit an error from submit_extent_page() inside
      __extent_writepage_io(), we could still return 0 to the caller, and
      even trigger the warning in btrfs_page_assert_not_dirty().
      
      [CAUSE]
      In __extent_writepage_io(), if we hit an error from
      submit_extent_page(), we will just clean up the range and continue.
      
      This is completely fine for regular PAGE_SIZE == sectorsize, as we can
      only hit one sector in one page, thus after the error we're ensured to
      exit and @ret will be saved.
      
      But for subpage case, we may have other dirty subpage range in the page,
      and in the next loop, we may succeeded submitting the next range.
      
      In that case, @ret will be overwritten, and we return 0 to the caller,
      while we have hit some error.
      
      [FIX]
      Introduce @has_error and @saved_ret to record the first error we hit, so
      we will never forget what error we hit.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44e5801f
    • Qu Wenruo's avatar
      btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage() · 10f7f6f8
      Qu Wenruo authored
      [BUG]
      Test case generic/475 have a very high chance (almost 100%) to hit a fs
      hang, where a data page will never be unlocked and hang all later
      operations.
      
      [CAUSE]
      In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
      will try to do the cleanup for our current io range, and exit.
      
      This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.
      
      For subpage btrfs_do_readpage() will lock the full page first, which can
      contain several different sectors and extents:
      
       btrfs_do_readpage()
       |- begin_page_read()
       |  |- btrfs_subpage_start_reader();
       |     Now the page will have PAGE_SIZE / sectorsize reader pending,
       |     and the page is locked.
       |
       |- end_page_read() for different branches
       |  This function will reduce subpage readers, and when readers
       |  reach 0, it will unlock the page.
      
      But when submit_extent_page() failed, we only cleanup the current
      io range, while the remaining io range will never be cleaned up, and the
      page remains locked forever.
      
      [FIX]
      Update the error handling of submit_extent_page() to cleanup all the
      remaining subpage range before exiting the loop.
      
      Please note that, now submit_extent_page() can only fail due to
      sanity check in alloc_new_bio().
      
      Thus regular IO errors are impossible to trigger the error path.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      10f7f6f8
    • Qu Wenruo's avatar
      btrfs: avoid double clean up when submit_one_bio() failed · c9583ada
      Qu Wenruo authored
      [BUG]
      When running generic/475 with 64K page size and 4K sector size, it has a
      very high chance (almost 100%) to hang, with mostly data page locked but
      no one is going to unlock it.
      
      [CAUSE]
      With commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads"), if we failed to lookup checksum due to metadata IO error, we
      will return error for btrfs_submit_data_bio().
      
      This will cause the page to be unlocked twice in btrfs_do_readpage():
      
       btrfs_do_readpage()
       |- submit_extent_page()
       |  |- submit_one_bio()
       |     |- btrfs_submit_data_bio()
       |        |- if (ret) {
       |        |-     bio->bi_status = ret;
       |        |-     bio_endio(bio); }
       |               In the endio function, we will call end_page_read()
       |               and unlock_extent() to cleanup the subpage range.
       |
       |- if (ret) {
       |-        unlock_extent(); end_page_read() }
                 Here we unlock the extent and cleanup the subpage range
                 again.
      
      For unlock_extent(), it's mostly double unlock safe.
      
      But for end_page_read(), it's not, especially for subpage case,
      as for subpage case we will call btrfs_subpage_end_reader() to reduce
      the reader number, and use that to number to determine if we need to
      unlock the full page.
      
      If double accounted, it can underflow the number and leave the page
      locked without anyone to unlock it.
      
      [FIX]
      The commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads") itself is completely fine, it's our existing code not properly
      handling the error from bio submission hook properly.
      
      This patch will make submit_one_bio() to return void so that the callers
      will never be able to do cleanup when bio submission hook fails.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9583ada
    • Schspa Shi's avatar
      btrfs: use non-bh spin_lock in zstd timer callback · dd7382a2
      Schspa Shi authored
      This is an optimization for fix fee13fe9 ("btrfs: correct zstd
      workspace manager lock to use spin_lock_bh()")
      
      The critical region for wsm.lock is only accessed by the process context and
      the softirq context.
      
      Because in the soft interrupt, the critical section will not be
      preempted by the soft interrupt again, there is no need to call
      spin_lock_bh(&wsm.lock) to turn off the soft interrupt,
      spin_lock(&wsm.lock) is enough for this situation.
      Signed-off-by: default avatarSchspa Shi <schspa@gmail.com>
      [ minor comment update ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd7382a2
    • Filipe Manana's avatar
      btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode() · 49024388
      Filipe Manana authored
      We are still using the magic value of 2 at btrfs_create_new_inode(), but
      there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
      introduced in commit 528ee697 ("btrfs: put initial index value of a
      directory in a constant"). So change that to use the constant.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49024388
    • Qu Wenruo's avatar
      btrfs: simplify parameters of submit_read_repair() and rename · c0111c44
      Qu Wenruo authored
      Cleanup the function submit_read_repair() by:
      
      - Remove the fixed argument submit_bio_hook()
        The function is only called on buffered data read path, so the
        @submit_bio_hook argument is always btrfs_submit_data_bio().
      
        Since it's fixed, then there is no need to pass that argument at all.
      
      - Rename the function to submit_data_read_repair()
        Just to be more explicit on all the 3 things, data, read and repair.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0111c44
    • Christoph Hellwig's avatar
      btrfs: remove the zoned/zone_size union in struct btrfs_fs_info · 8e010b3d
      Christoph Hellwig authored
      Reading a value from a different member of a union is not just a great
      way to obfuscate code, but also creates an aliasing violation.  Switch
      btrfs_is_zoned to look at ->zone_size and remove the union.
      
      Note: union was to simplify the detection of zoned filesystem but now
      this is wrapped behind btrfs_is_zoned so we can drop the union.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e010b3d
    • Lv Ruyi's avatar
      btrfs: remove unnecessary check of iput argument · 8aa1e49e
      Lv Ruyi authored
      iput() already handles NULL and non-NULL parameter, so it is not needed
      to check that. This unifies all iput calls.
      Reported-by: default avatarZeal Robot <zealci@zte.com.cn>
      Signed-off-by: default avatarLv Ruyi <lv.ruyi@zte.com.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8aa1e49e