1. 21 Aug, 2023 40 commits
    • btrfs: add a helper to read the superblock metadata_uuid · 4844c366
      Anand Jain authored
      In some cases, we need to read the FSID from the superblock when the
      metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
      add a helper.
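      A minimal userspace sketch of such a helper, assuming a simplified
      superblock layout. The struct, the flag value and the function name are
      illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FSID_SIZE 16
/* Hypothetical bit standing in for the METADATA_UUID incompat feature. */
#define SKETCH_INCOMPAT_METADATA_UUID (1ULL << 10)

/* Simplified stand-in for the on-disk superblock, not the kernel layout. */
struct sketch_super_block {
	uint8_t fsid[SKETCH_FSID_SIZE];
	uint8_t metadata_uuid[SKETCH_FSID_SIZE];
	uint64_t incompat_flags;
};

/* Return metadata_uuid when the feature flag is set, otherwise the fsid. */
static const uint8_t *sketch_sb_metadata_uuid(const struct sketch_super_block *sb)
{
	assert(sb);
	if (sb->incompat_flags & SKETCH_INCOMPAT_METADATA_UUID)
		return sb->metadata_uuid;
	return sb->fsid;
}
```

      Callers then compare or copy SKETCH_FSID_SIZE bytes from the returned
      pointer without checking the feature flag themselves.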
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove v0 extent handling · 182741d2
      Qu Wenruo authored
      The v0 extent item has been deprecated for a long time, and we don't have
      any report from the community either.
      
      So it's time to remove the v0 extent specific error handling, and just
      treat them as regular extent tree corruption.
      
      This patch would remove the btrfs_print_v0_err() helper, and enhance the
      involved error handling to treat them just as any extent tree
      corruption. No reports regarding v0 extents have been seen since the
      graceful handling was added in 2018.
      
      This involves:
      
      - btrfs_backref_add_tree_node()
        This change is a little tricky, the new code is changed to only handle
        BTRFS_TREE_BLOCK_REF_KEY and BTRFS_SHARED_BLOCK_REF_KEY.
      
        But this is safe, as we have rejected any unknown inline refs through
        btrfs_get_extent_inline_ref_type().
        For keyed backrefs, we're safe to skip anything we don't know (that's
        if it can pass tree-checker in the first place).
      
      - btrfs_lookup_extent_info()
      - lookup_inline_extent_backref()
      - run_delayed_extent_op()
      - __btrfs_free_extent()
      - add_tree_block()
        Regular error handling of unexpected extent tree item, and abort
        transaction (if we have a trans handle).
      
      - remove_extent_data_ref()
        It's pretty much the same as the regular rejection of unknown backref
        key.
        But for this particular case, we can also remove a BUG_ON().
      
      - extent_data_ref_count()
        We can remove the BTRFS_EXTENT_REF_V0_KEY BUG_ON(), as it would be
        rejected by the only caller.
      
      - btrfs_print_leaf()
        Remove the handling for BTRFS_EXTENT_REF_V0_KEY.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: output extra debug info if we failed to find an inline backref · 7f72f505
      Qu Wenruo authored
      [BUG]
      Syzbot reported several warnings triggered inside
      lookup_inline_extent_backref().
      
      [CAUSE]
      As usual, the reproducer doesn't reliably trigger locally here, but at
      least we know the WARN_ON() is triggered when an inline backref can not
      be found, and it can only be triggered when @insert is true. (I.e.
      inserting a new inline backref, which means the backref should already
      exist)
      
      [ENHANCEMENT]
      After the WARN_ON(), dump all the parameters and the extent tree
      leaf to help debug.
      
      Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the !zoned assert into run_delalloc_cow · 76c5126e
      Christoph Hellwig authored
      Having the assert in the actual helper documents the pre-conditions
      much better than having it in the caller, so move it.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: consolidate the error handling in run_delalloc_nocow · 38dc8889
      Christoph Hellwig authored
      Share the calls to extent_clear_unlock_delalloc for btrfs_path allocation
      failure handling and the normal exit path.
      
      This relies on btrfs_free_path ignoring a NULL pointer, and the
      initialization of cur_offset to start at the beginning of the function.
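      The pattern can be sketched in userspace C, assuming free(NULL) plays
      the role of btrfs_free_path(NULL). Every sketch_* name and the
      fail_alloc switch are hypothetical, and the unlock helper only records
      the range it was asked to unlock.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of what the shared exit path unlocked. */
struct sketch_unlock_log {
	uint64_t start;
	uint64_t end;
	int calls;
};

/* Stand-in for the unlock call; it only records the range. */
static void sketch_unlock_range(struct sketch_unlock_log *log,
				uint64_t start, uint64_t end)
{
	log->start = start;
	log->end = end;
	log->calls++;
}

static int sketch_run_nocow(uint64_t start, uint64_t end, int fail_alloc,
			    struct sketch_unlock_log *log)
{
	uint64_t cur_offset = start;	/* set before anything can fail */
	void *path;
	int ret = 0;

	assert(log && start <= end);

	path = fail_alloc ? NULL : malloc(64);
	if (!path) {
		ret = -ENOMEM;
		goto error;	/* same label as every later failure */
	}

	cur_offset = end + 1;	/* pretend the whole range was processed */

error:
	/* One unlock call serves allocation failure and later errors alike. */
	if (ret && cur_offset < end)
		sketch_unlock_range(log, cur_offset, end);
	free(path);		/* free(NULL) is a no-op */
	return ret;
}
```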
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup the COW fallback logic in run_delalloc_nocow · 18f62b86
      Christoph Hellwig authored
      Use the block group pointer used to track the outstanding NOCOW writes as
      a boolean to remove the duplicate nocow variable, and keep it contained
      in the main loop to simplify the logic.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error handling when in a COW window in run_delalloc_nocow · 953fa5ce
      Christoph Hellwig authored
      When run_delalloc_nocow has cow_start set to a value other than (u64)-1,
      it has delayed COW writeback pending behind cur_offset.  When an error
      occurs in such a window, the range going back to cow_start and not just
      cur_offset needs to be unlocked, but only two error cases handle this
      correctly.  Move the code that unlocks the COW range to the common
      error handling label and document the logic.
      
      To make things even more complicated, cow_file_range as called by
      fallback_to_cow will unlock the range it is operating on when it fails as
      well, so we need to reset cow_start right after calling fallback_to_cow
      instead of only when it succeeded.
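      The rule for where the error path starts unlocking can be sketched as a
      small helper. The function and macro names are hypothetical; only the
      (u64)-1 sentinel and the cow_start/cur_offset relationship come from
      the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel meaning "no delayed COW window is pending". */
#define SKETCH_NO_COW_WINDOW ((uint64_t)-1)

/*
 * Start of the range the common error label has to unlock: the beginning
 * of the pending COW window if there is one, otherwise cur_offset.
 */
static uint64_t sketch_error_unlock_start(uint64_t cow_start, uint64_t cur_offset)
{
	/* A pending window always lies behind cur_offset. */
	assert(cow_start == SKETCH_NO_COW_WINDOW || cow_start <= cur_offset);

	if (cow_start != SKETCH_NO_COW_WINDOW)
		return cow_start;
	return cur_offset;
}
```

      Resetting cow_start to the sentinel right after the fallback call,
      whether or not it succeeded, keeps this helper from unlocking a range
      that the failed fallback has already unlocked.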
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: do not zone finish data relocation block group · 332581bd
      Naohiro Aota authored
      When multiple writes happen at once, we may need to sacrifice a currently
      active block group to be zone finished for a new allocation. We choose a
      block group with the least free space left, and zone finish it.
      
      To do the finishing, we need to send IOs for the already allocated region
      and wait for them and for any on-going IOs. Otherwise, these IOs fail
      because the zone is already finished by the time the IO reaches the device.

      However, if a block group dedicated to data relocation is zone
      finished, there is a chance that it is finished before an ongoing write IO
      reaches the device. That is because there is a timing gap between when an
      allocation is done (block_group->reservations == 0, as pre-allocation is
      done) and when an ordered extent is created as the relocation IO starts.
      Thus, if we finish the zone between them, the IOs can fail.
      
      We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
      avoid the zone finishing, because the data_reloc_bg may already have
      switched to a new block group while there are still ongoing write IOs to
      the old data_reloc_bg.
      
      So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
      indicate there is a data relocation allocation and/or ongoing write to the
      block group. The bit is set on allocation and cleared in end_io function of
      the last IO for the currently allocated region.
      
      Changing the timing of the bit setting also solves the issue of the bit
      being left set even after there is no IO going on. With the current code, if
      the data_reloc_bg switches after the last IO to the current data_reloc_bg,
      the bit is set at that point and nothing ever clears it. As a
      result, that block group is kept unallocatable for anything.
      
      Fixes: 343d8a30 ("btrfs: zoned: prevent allocation from previous data relocation BG")
      Fixes: 74e91b12 ("btrfs: zoned: zone finish unused block group")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set page extent mapped after read_folio in relocate_one_page · e7f1326c
      Josef Bacik authored
      One of the CI runs triggered the following panic
      
        assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:229!
        Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
        CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1
        pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
        pc : btrfs_subpage_assert+0xbc/0xf0
        lr : btrfs_subpage_assert+0xbc/0xf0
        sp : ffff800093213720
        x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000
        x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff
        x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880
        x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff
        x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028
        x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000
        x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c
        x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8
        x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000
        x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f
        Call trace:
         btrfs_subpage_assert+0xbc/0xf0
         btrfs_subpage_set_dirty+0x38/0xa0
         btrfs_page_set_dirty+0x58/0x88
         relocate_one_page+0x204/0x5f0
         relocate_file_extent_cluster+0x11c/0x180
         relocate_data_extent+0xd0/0xf8
         relocate_block_group+0x3d0/0x4e8
         btrfs_relocate_block_group+0x2d8/0x490
         btrfs_relocate_chunk+0x54/0x1a8
         btrfs_balance+0x7f4/0x1150
         btrfs_ioctl+0x10f0/0x20b8
         __arm64_sys_ioctl+0x120/0x11d8
         invoke_syscall.constprop.0+0x80/0xd8
         do_el0_svc+0x6c/0x158
         el0_svc+0x50/0x1b0
         el0t_64_sync_handler+0x120/0x130
         el0t_64_sync+0x194/0x198
        Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000)
      
      This is the same problem outlined in 17b17fcd ("btrfs:
      set_page_extent_mapped after read_folio in btrfs_cont_expand"), and the
      fix is the same.  I originally looked for the same pattern elsewhere in
      our code, but mistakenly skipped over this code because I saw the page
      cache readahead before we set_page_extent_mapped, not realizing that
      this was only in the !page case, that we can still end up with a
      !uptodate page and then do the btrfs_read_folio further down.
      
      The fix here is the same as the above mentioned patch, move the
      set_page_extent_mapped call to after the btrfs_read_folio() block to
      make sure that we have the subpage blocksize stuff setup properly before
      using the page.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: wait on uncached block groups on every allocation loop · cd361199
      Josef Bacik authored
      My initial fix for the generic/475 hangs was related to metadata, but
      our CI testing uncovered another case where we hang for similar reasons.
      We again have a task with a plug that is holding an outstanding request
      that is keeping the dm device from finishing its suspend, and that task
      is stuck in the allocator.
      
      This time it is stuck trying to allocate data, but we do not have a
      block group that matches the size class.  The larger loop in the
      allocator looks like this (simplified of course)
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached = btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
              ffe_ctl->have_caching_bg = true;
            do_allocation()
              btrfs_wait_block_group_cache_progress();
          }

          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      In my earlier fix we were trying to allocate from the block group, but
      we weren't waiting for the progress because we were only waiting for the
      free space to be >= the amount of free space we wanted.  My fix made it
      so we waited for forward progress to be made as well, so we would be
      sure to wait.
      
      This time however we did not have a block group that matched our size
      class, so what was happening was this
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached = btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
              ffe_ctl->have_caching_bg = true;
            if (size_class_doesn't_match())
              goto loop;
            do_allocation()
              btrfs_wait_block_group_cache_progress();
        loop:
            release_block_group(block_group);
          }

          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      The size_class_doesn't_match() part was true, so we'd just skip this
      block group and never wait for caching, and then because we found a
      caching block group we'd just go back and do the loop again.  We never
      sleep and thus never flush the plug and we have the same deadlock.
      
      Fix the logic for waiting on the block group caching to instead do it
      unconditionally when we goto loop.  This takes the logic out of the
      allocation step, so now the loop looks more like this
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached = btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
              ffe_ctl->have_caching_bg = true;
            if (size_class_doesn't_match())
              goto loop;
            do_allocation()
              btrfs_wait_block_group_cache_progress();
        loop:
            if (loop > LOOP_CACHING_NOWAIT && !ffe_ctl->retry_uncached &&
                !ffe_ctl->cached) {
              ffe_ctl->retry_uncached = true;
              btrfs_wait_block_group_cache_progress();
            }

            release_block_group(block_group);
          }

          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      This simplifies the logic a lot, and makes sure that if we're hitting
      uncached block groups we're always waiting on them at some point.
      
      I ran this through 100 iterations of generic/475, as this particular
      case was harder to hit than the previous one.
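      The new wait condition at the loop label can be isolated as a small
      predicate. This is a sketch with hypothetical sketch_* names; the
      struct carries only the three fields the condition reads, and the real
      wait is replaced by a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the first two allocator loop stages. */
enum sketch_loop_stage {
	SKETCH_LOOP_CACHING_NOWAIT,
	SKETCH_LOOP_CACHING_WAIT,
};

/* Only the fields the condition needs; not the kernel's control struct. */
struct sketch_ffe_ctl {
	bool cached;
	bool retry_uncached;
	int loop;
};

/*
 * Returns true when the caller should wait for caching progress on the
 * current block group. Runs for every block group, including the ones
 * skipped because of a size class mismatch.
 */
static bool sketch_should_wait_for_caching(struct sketch_ffe_ctl *ctl)
{
	assert(ctl);
	if (ctl->loop > SKETCH_LOOP_CACHING_NOWAIT && !ctl->retry_uncached &&
	    !ctl->cached) {
		ctl->retry_uncached = true;
		return true;
	}
	return false;
}
```

      Because the check no longer sits inside the allocation step, a task
      that skips every block group still sleeps at least once, which is what
      lets the plug get flushed.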
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use LIST_HEAD() to initialize the list_head · 84af994b
      Ruan Jinjie authored
      Use LIST_HEAD() to initialize the list_head instead of open-coding it.
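      For illustration, a userspace re-creation of the two forms. The macros
      below mirror the well-known kernel list definitions but are redefined
      here so the sketch stands alone; sketch_both_forms_empty() is a
      hypothetical demo function.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head whose next and prev point back to itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static int sketch_both_forms_empty(void)
{
	/* Open-coded form: declare first, initialize in a second step. */
	struct list_head open_coded;
	INIT_LIST_HEAD(&open_coded);

	/* LIST_HEAD() form: declaration and initialization in one line. */
	LIST_HEAD(declared);

	assert(declared.prev == &declared);
	return list_empty(&open_coded) && list_empty(&declared);
}
```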
      Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle errors properly in update_inline_extent_backref() · 25761430
      Qu Wenruo authored
      [PROBLEM]
      Inside function update_inline_extent_backref(), we have several
      BUG_ON()s along with some ASSERT()s which can be triggered by corrupted
      filesystem.
      
      [ANALYSIS]
      Most of those BUG_ON()s and ASSERT()s are just a way of handling
      unexpected on-disk data.
      
      Although we have tree-checker to rule out obviously incorrect extent
      tree blocks, it's not enough for these ones.  Thus we need proper error
      handling for them.
      
      [FIX]
      Thankfully all the callers of update_inline_extent_backref() would
      eventually handle the error by aborting the current transaction.
      So this patch would do the proper error handling by:
      
      - Make update_inline_extent_backref() return int
        The return value would be either 0 or -EUCLEAN.
      
      - Replace BUG_ON()s and ASSERT()s with proper error handling
        This includes:
        * Dump the bad extent tree leaf
        * Output an error message for the cause
          This would include the extent bytenr, num_bytes (if needed), the bad
          values and expected good values.
        * Return -EUCLEAN
      
        Note here we remove all the WARN_ON()s, as eventually the transaction
        would be aborted, thus a backtrace would be triggered anyway.
      
      - Better comments on why we expect refs == 1 and refs_to_mod == -1 for
        tree blocks
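      The general shape of replacing a BUG_ON() with a reported error can be
      sketched as follows. The function, its parameters and the message
      format are hypothetical; only the 0 / -EUCLEAN return convention comes
      from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, used as a fallback on other systems */
#endif

/*
 * Apply a reference count change. An impossible on-disk value is reported
 * and turned into -EUCLEAN instead of crashing the whole machine.
 */
static int sketch_update_refs(uint64_t bytenr, uint64_t *refs, int refs_to_mod)
{
	assert(refs);

	if (refs_to_mod < 0) {
		uint64_t drop = (uint64_t)(-(int64_t)refs_to_mod);

		if (*refs < drop) {
			/* Say what was found and what was expected. */
			fprintf(stderr,
				"extent %llu: have %llu refs, expected at least %llu\n",
				(unsigned long long)bytenr,
				(unsigned long long)*refs,
				(unsigned long long)drop);
			return -EUCLEAN;
		}
		*refs -= drop;
	} else {
		*refs += (uint64_t)refs_to_mod;
	}
	return 0;
}
```

      The caller is expected to abort the transaction on a non-zero return,
      so no separate warning is needed at the point of detection.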
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: re-enable metadata over-commit for zoned mode · 5b135b38
      Naohiro Aota authored
      Now that the activation has moved from reservation time to write time, we
      can re-enable metadata over-commit: we no longer need to ensure that
      all the reserved bytes are properly activated.

      Without metadata over-commit, performance suffers because
      delalloc items need to be flushed more often and more block
      groups allocated. Re-enabling metadata over-commit solves the issue.
      
      Fixes: 79417d04 ("btrfs: zoned: disable metadata overcommit for zoned")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: don't activate non-DATA BG on allocation · 5a7d107e
      Naohiro Aota authored
      Now that a non-DATA block group is activated at write time, don't
      activate it on allocation time.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: no longer count fresh BG region as zone unusable · 6a8ebc77
      Naohiro Aota authored
      Now that we switched to write time activation, we no longer need to (and
      must not) count the fresh region as zone unusable. This commit is similar
      to a revert of commit fa2068d7 ("btrfs: zoned: count fresh BG
      region as zone unusable").
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate metadata block group on write time · 13bb483d
      Naohiro Aota authored
      In the current implementation, block groups are activated at reservation
      time to ensure that all reserved bytes can be written to an active metadata
      block group. However, this approach has proven to be less efficient, as it
      activates block groups more frequently than necessary, putting pressure on
      the active zone resource and leading to potential issues such as early
      ENOSPC or hung_task.
      
      Another drawback of the current method is that it hampers metadata
      over-commit, and necessitates additional flush operations and block group
      allocations, resulting in decreased overall performance.
      
      To address these issues, this commit introduces write-time activation of
      metadata and system block groups. This involves reserving at least one
      active block group specifically for a metadata and system block group.
      
      Since metadata write-out is always allocated sequentially, when we need to
      write to a non-active block group, we can wait for the ongoing IOs to
      complete, activate a new block group, and then proceed with writing to the
      new block group.
      
      Fixes: b0931513 ("btrfs: zoned: activate metadata block group on flush_space")
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: reserve zones for an active metadata/system block group · a7e1ac7b
      Naohiro Aota authored
      Ensure a metadata and system block group can be activated on write time, by
      leaving a certain number of active zones when trying to activate a data
      block group.
      
      Zones for two metadata block groups (normal and tree-log) and one system
      block group are reserved, according to the profile type: two zones per
      block group on the DUP profile and one zone per block group otherwise.
      
      The reservation must be freed once a non-data block group is allocated. If
      not, we over-reserve the active zones and data block group activation will
      suffer. For the dynamic reservation count, we need to manage the
      reservation count per device.
      
      The reservation count variable is protected by
      fs_info->zone_active_bgs_lock.
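      The reservation arithmetic described above can be worked through in a
      few lines. This sketch assumes the metadata and system profiles are
      each checked for DUP independently; the function names are
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Two zones per block group on the DUP profile, one otherwise. */
static unsigned int sketch_zones_per_bg(bool dup)
{
	return dup ? 2 : 1;
}

/* Zones held back for two metadata block groups and one system block group. */
static unsigned int sketch_reserved_active_zones(bool metadata_dup, bool system_dup)
{
	/* Normal metadata block group plus the tree-log block group. */
	unsigned int metadata = 2 * sketch_zones_per_bg(metadata_dup);
	unsigned int system = 1 * sketch_zones_per_bg(system_dup);
	unsigned int total = metadata + system;

	assert(total >= 3);
	return total;
}
```

      With both profiles on DUP this reserves 6 zones; with neither on DUP it
      reserves 3. Data block group activation has to leave that many active
      zones free on the device.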
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: update meta write pointer on zone finish · c1c3c2bc
      Naohiro Aota authored
      On finishing a zone, the meta_write_pointer should be set to the end of the
      zone to reflect the actual write pointer position.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: defer advancing meta write pointer · 0356ad41
      Naohiro Aota authored
      We currently advance the meta_write_pointer in
      btrfs_check_meta_write_pointer(). That makes it necessary to revert it
      when locking the buffer failed. Instead, we can advance it just before
      sending the buffer.
      
      This is also necessary for the following commit, which needs
      to release the zoned_meta_io_lock to allow IOs to come in and wait for them
      to fill the currently active block group. If we advance the
      meta_write_pointer before locking the extent buffer, the following extent
      buffer can pass the meta_write_pointer check, resulting in an unaligned
      write failure.
      
      Advancing the pointer is still thread-safe as the extent buffer is locked.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: return int from btrfs_check_meta_write_pointer · 2ad8c051
      Naohiro Aota authored
      Now that we have writeback_control passed to
      btrfs_check_meta_write_pointer(), we can move the wbc condition in
      submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce block group context to btrfs_eb_write_context · 7db94301
      Naohiro Aota authored
      For metadata write out on the zoned mode, we call
      btrfs_check_meta_write_pointer() to check if an extent buffer to be written
      is aligned to the write pointer.
      
      We look up a block group containing the extent buffer for every extent
      buffer, which takes unnecessary effort as the writing extent buffers are
      mostly contiguous.
      
      Introduce "zoned_bg" to cache the block group being worked on.  Also, while
      at it, rename "cache" to "block_group".
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce struct to consolidate extent buffer write context · 861093ef
      Naohiro Aota authored
      Introduce btrfs_eb_write_context to consolidate writeback_control and the
      extent buffer context.  This will help with adding a block group context as
      well.
      
      While at it, move the eb context setting before
      btrfs_check_meta_write_pointer(). We can set it here because we anyway need
      to skip pages in the same eb if that eb is rejected by
      btrfs_check_meta_write_pointer().
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid start and commit empty transaction when flushing qgroups · 9c93c238
      Filipe Manana authored
      When flushing qgroups, we try to join a running transaction, with
      btrfs_join_transaction(), and then commit the transaction. However using
      btrfs_join_transaction() will result in creating a new transaction in case
      there isn't any running or if there's an existing one already committing.
      This is pointless as we only need to attach to an existing one that is
      not committing and in case there's an existing one committing, wait for
      its commit to complete. Creating and committing an empty transaction is
      wasteful, pointless IO and unnecessary rotation of the backup roots.
      
      So use btrfs_attach_transaction_barrier() instead, to avoid creating and
      committing empty transactions.
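      A toy state machine can illustrate the difference between the two
      calls. It is a sketch under simplifying assumptions: all sketch_* names
      are hypothetical, waiting is modelled as an immediate state change, and
      the attach variant reports -ENOENT when there is nothing to attach to.

```c
#include <assert.h>
#include <errno.h>

enum sketch_trans_state {
	SKETCH_TRANS_NONE,
	SKETCH_TRANS_RUNNING,
	SKETCH_TRANS_COMMITTING,
};

struct sketch_fs {
	enum sketch_trans_state state;
	unsigned int commits;		/* commits that completed */
	unsigned int empty_commits;	/* transactions created only to be committed */
};

static void sketch_finish_commit(struct sketch_fs *fs)
{
	fs->commits++;
	fs->state = SKETCH_TRANS_NONE;
}

/* Old behaviour: join, creating a new transaction if needed, then commit. */
static int sketch_flush_with_join(struct sketch_fs *fs)
{
	assert(fs);
	if (fs->state == SKETCH_TRANS_COMMITTING)
		sketch_finish_commit(fs);	/* in-flight commit completes */
	if (fs->state == SKETCH_TRANS_NONE) {
		fs->state = SKETCH_TRANS_RUNNING;	/* new, empty transaction */
		fs->empty_commits++;
	}
	sketch_finish_commit(fs);
	return 0;
}

/* Attach only to a running transaction; wait out one that is committing. */
static int sketch_attach_barrier(struct sketch_fs *fs)
{
	assert(fs);
	if (fs->state == SKETCH_TRANS_COMMITTING)
		sketch_finish_commit(fs);
	if (fs->state == SKETCH_TRANS_NONE)
		return -ENOENT;
	return 0;
}

/* New behaviour: commit only if there was something to attach to. */
static int sketch_flush_with_attach(struct sketch_fs *fs)
{
	int ret = sketch_attach_barrier(fs);

	if (ret == -ENOENT)
		return 0;	/* nothing running, nothing to commit */
	if (ret)
		return ret;
	sketch_finish_commit(fs);
	return 0;
}
```

      Starting from an idle filesystem, the join variant performs one empty
      commit while the attach variant performs none.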
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid start and commit empty transaction when starting qgroup rescan · 6705b48a
      Filipe Manana authored
      When starting a qgroup rescan, we try to join a running transaction, with
      btrfs_join_transaction(), and then commit the transaction. However using
      btrfs_join_transaction() will result in creating a new transaction in case
      there isn't any running or if there's an existing one already committing.
      This is pointless as we only need to attach to an existing one that is
      not committing and in case there's an existing one committing, wait for
      its commit to complete. Creating and committing an empty transaction is
      wasteful, pointless IO and unnecessary rotation of the backup roots.
      
      So use btrfs_attach_transaction_barrier() instead, to avoid creating and
      committing empty transactions.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid starting and committing empty transaction when flushing space · 2ee70ed1
      Filipe Manana authored
      When flushing space and we are in the COMMIT_TRANS state, we join a
      transaction with btrfs_join_transaction() and then commit the returned
      transaction. However btrfs_join_transaction() starts a new transaction if
      there is none currently open, which is pointless since committing a new,
      empty transaction doesn't achieve anything; it only wastes time and IO and
      creates an unnecessary rotation of the backup roots.
      
      So use btrfs_attach_transaction_barrier() to avoid starting a new
      transaction. This also waits for any ongoing transaction that is
      committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and
      therefore wait for all the extents that were pinned during the
      transaction's lifetime to be unpinned.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid starting new transaction when flushing delayed items and refs · 2391245a
      Filipe Manana authored
      When flushing space we join a transaction to flush delayed items and
      delayed references, in order to try to release space. However,
      btrfs_join_transaction() not only joins an existing transaction, it also
      starts a new transaction if there is none open. If there is no
      transaction open, we have neither delayed items nor delayed
      references, so creating a new transaction is a waste of time and IO, and
      creates an unnecessary rotation of the backup roots without gaining any
      benefits (including releasing space).
      
      So use btrfs_join_transaction_nostart() when attempting to flush delayed
      items and references.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge find_free_dev_extent() and find_free_dev_extent_start() · ed8947bc
      Filipe Manana authored
      There is no point in having find_free_dev_extent() because it's just a
      simple wrapper around find_free_dev_extent_start() which always passes a
      value of 0 for the search_start argument. Since there are no other callers
      of find_free_dev_extent_start(), remove find_free_dev_extent() and rename
      find_free_dev_extent_start() to find_free_dev_extent(), removing its
      search_start argument because it's always 0.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed8947bc
    • Filipe Manana's avatar
      btrfs: make find_free_dev_extent() static · 883647f4
      Filipe Manana authored
      The function find_free_dev_extent() is only used within volumes.c, so make
      it static and remove its prototype from volumes.h.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      883647f4
    • Filipe Manana's avatar
      btrfs: make btrfs_cleanup_fs_roots() static · 504b1596
      Filipe Manana authored
      btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static,
      remove its prototype from disk-io.h and move its definition above
      where it's used in disk-io.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      504b1596
    • Filipe Manana's avatar
      btrfs: fail priority metadata ticket with real fs error · 7e3bfd14
      Filipe Manana authored
      At priority_reclaim_metadata_space(), if we were not able to satisfy
      the ticket after going through the various flushing states and we notice
      the fs went into an error state, likely due to a transaction abort during
      the flushing, set the ticket's error to the error that caused the
      transaction abort instead of an unconditional -EROFS.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7e3bfd14
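The decision described in this commit, together with the related "don't steal space from global rsv after a transaction abort" change further down this log, can be sketched in a small userspace model. Everything prefixed with `toy_` is hypothetical and only stands in for the kernel structures; the `fs_error` field follows the member name given in the "store the error" commit message. This is a simplified sketch under those assumptions, not the kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_fs {
	int fs_error;                  /* 0 when healthy, negative errno otherwise */
	unsigned long long global_rsv; /* bytes available in the global reserve */
};

struct toy_ticket {
	unsigned long long bytes; /* bytes the ticket still needs */
	int error;                /* 0 on success, negative errno on failure */
};

/* Called after every flush state was tried and the ticket is still unsatisfied. */
void toy_finish_priority_ticket(struct toy_fs *fs, struct toy_ticket *ticket)
{
	assert(fs && ticket);
	if (fs->fs_error) {
		/* Report the error that aborted the transaction and do not steal. */
		ticket->error = fs->fs_error;
		return;
	}
	if (fs->global_rsv >= ticket->bytes) {
		/* Healthy fs: satisfy the ticket from the global reserve. */
		fs->global_rsv -= ticket->bytes;
		ticket->bytes = 0;
		ticket->error = 0;
		return;
	}
	ticket->error = -ENOSPC;
}

/* Usage helper: returns the ticket error for a given fs state. */
int toy_ticket_result(int fs_error, unsigned long long rsv,
		      unsigned long long need)
{
	struct toy_fs fs = { .fs_error = fs_error, .global_rsv = rsv };
	struct toy_ticket ticket = { .bytes = need, .error = 0 };

	toy_finish_priority_ticket(&fs, &ticket);
	return ticket.error;
}
```

With the fs in error state the ticket carries the stored error even when the reserve could have covered it, so the caller sees the original failure instead of a blanket -EROFS.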
    • Filipe Manana's avatar
      btrfs: return real error when orphan cleanup fails due to a transaction abort · a7f8de50
      Filipe Manana authored
      During mount we will call btrfs_orphan_cleanup() to remove any inodes that
      were previously deleted (have a link count of 0) but whose items we were
      not able to remove from the subvolume tree at the time. The removal of
      the items will happen by triggering eviction, when we do the final iput()
      on them at btrfs_orphan_cleanup(), which will end in the loop at
      btrfs_evict_inode() that truncates inode items.
      
      In a dire situation we may have a transaction abort due to -ENOSPC when
      attempting to truncate the inode items, and in that case the orphan item
      (key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and
      when we hit the next iteration of the while loop at btrfs_orphan_cleanup()
      we will find the same orphan item as before, and then we will return
      -EINVAL from btrfs_orphan_cleanup() through the following if statement:
      
          if (found_key.offset == last_objectid) {
             btrfs_err(fs_info,
                       "Error removing orphan entry, stopping orphan cleanup");
             ret = -EINVAL;
             goto out;
          }
      
      This makes the mount operation fail with -EINVAL, when it should have been
      -ENOSPC. This is confusing because -EINVAL might lead users into thinking
      they provided invalid mount options, for example.
      
      An example where this happens:
      
         $ mount test.img /mnt
         mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
      
         $ dmesg
         [ 2542.356934] BTRFS: device fsid 977fff75-1181-4d2b-a739-384fa710d16e devid 1 transid 47409973 /dev/loop0 scanned by mount (4459)
         [ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
         [ 2542.357461] BTRFS info (device loop0): disk space caching is enabled
         [ 2542.742287] BTRFS info (device loop0): auto enabling async discard
         [ 2542.764554] BTRFS info (device loop0): checking UUID tree
         [ 2551.743065] ------------[ cut here ]------------
         [ 2551.743068] BTRFS: Transaction aborted (error -28)
         [ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743311] Modules linked in: btrfs blake2b_generic (...)
         [ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1
         [ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
         [ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
         [ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743449] Code: 8b 43 0c (...)
         [ 2551.743451] RSP: 0018:ffff982c005a7c40 EFLAGS: 00010286
         [ 2551.743452] RAX: 0000000000000000 RBX: ffff88fc6e44b400 RCX: 0000000000000000
         [ 2551.743453] RDX: 0000000000000002 RSI: ffffffff8dff0878 RDI: 00000000ffffffff
         [ 2551.743454] RBP: ffff88fc51817208 R08: 0000000000000000 R09: ffff982c005a7ae0
         [ 2551.743455] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88fc43d2e570
         [ 2551.743456] R13: ffff88fc43d2e400 R14: ffff88fc8fb08ee0 R15: ffff88fc6e44b530
         [ 2551.743457] FS:  0000000000000000(0000) GS:ffff89035fbc0000(0000) knlGS:0000000000000000
         [ 2551.743458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [ 2551.743459] CR2: 00007fa8cdf2f6f4 CR3: 0000000124850003 CR4: 0000000000370ee0
         [ 2551.743462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [ 2551.743463] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [ 2551.743464] Call Trace:
         [ 2551.743472]  <TASK>
         [ 2551.743474]  ? __warn+0x80/0x130
         [ 2551.743478]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743520]  ? report_bug+0x1f4/0x200
         [ 2551.743523]  ? handle_bug+0x42/0x70
         [ 2551.743526]  ? exc_invalid_op+0x14/0x70
         [ 2551.743528]  ? asm_exc_invalid_op+0x16/0x20
         [ 2551.743532]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743574]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743576]  ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs]
         [ 2551.743609]  commit_cowonly_roots+0x1e9/0x260 [btrfs]
         [ 2551.743652]  btrfs_commit_transaction+0x42e/0xfa0 [btrfs]
         [ 2551.743693]  ? __pfx_autoremove_wake_function+0x10/0x10
         [ 2551.743697]  flush_space+0xf1/0x5d0 [btrfs]
         [ 2551.743743]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743745]  ? finish_task_switch+0x91/0x2a0
         [ 2551.743748]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743750]  ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs]
         [ 2551.743793]  btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs]
         [ 2551.743837]  process_one_work+0x1d9/0x3e0
         [ 2551.743844]  worker_thread+0x4a/0x3b0
         [ 2551.743847]  ? __pfx_worker_thread+0x10/0x10
         [ 2551.743849]  kthread+0xee/0x120
         [ 2551.743852]  ? __pfx_kthread+0x10/0x10
         [ 2551.743854]  ret_from_fork+0x29/0x50
         [ 2551.743860]  </TASK>
         [ 2551.743861] ---[ end trace 0000000000000000 ]---
         [ 2551.743863] BTRFS info (device loop0: state A): dumping space info:
         [ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full
         [ 2551.743868] BTRFS info (device loop0: state A): space_info total=13458472960, used=13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0
         [ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -51625984 free, is full
         [ 2551.743872] BTRFS info (device loop0: state A): space_info total=771751936, used=770146304, pinned=1605632, reserved=0, may_use=51625984, readonly=0 zone_unusable=0
         [ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has 14663680 free, is not full
         [ 2551.743875] BTRFS info (device loop0: state A): space_info total=14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
         [ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size 53231616 reserved 51544064
         [ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0
         [ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0
         [ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0
         [ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0
         [ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left
         [ 2551.743911] BTRFS info (device loop0: state EA): forced readonly
         [ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount
         [ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup
         [ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of aborted transaction.
         [ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22
      
      So make btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(),
      if it's set, and -EINVAL otherwise.
      
      For that same example, after this change, the mount operation fails with
      -ENOSPC:
      
         $ mount test.img /mnt
         mount: /mnt: mount(2) system call failed: No space left on device.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a7f8de50
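The loop check quoted in this commit message, plus the new return value, can be illustrated with a toy loop. The function and its parameters are hypothetical: `removal_fails` simulates an orphan item that stays in the tree, and `fs_error` plays the role of the value BTRFS_FS_ERROR() would report. It is a sketch of the control flow only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Toy model of the cleanup loop: each iteration looks up the next orphan.
 * When removal fails the same orphan is found again, which is detected by
 * comparing against the id seen in the previous iteration.
 */
int toy_orphan_cleanup(size_t nr_orphans, int removal_fails, int fs_error)
{
	unsigned long long last_objectid = 0;
	size_t remaining = nr_orphans;

	assert(fs_error <= 0);
	while (remaining > 0) {
		/* Toy lookup: the next orphan id is simply the remaining count. */
		unsigned long long found = remaining;

		if (found == last_objectid) {
			/* Same orphan twice: return the fs error if one is set. */
			return fs_error ? fs_error : -EINVAL;
		}
		last_objectid = found;
		if (!removal_fails)
			remaining--;
	}
	return 0;
}
```

In this model a stuck orphan on a filesystem that aborted with -ENOSPC yields -ENOSPC, while a stuck orphan with no recorded fs error still yields -EINVAL.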
    • Filipe Manana's avatar
      btrfs: store the error that turned the fs into error state · ae3364e5
      Filipe Manana authored
      Currently when we turn the fs into an error state, typically after a
      transaction abort, we don't store the error anywhere, we just set a bit
      (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the
      error state.
      
      There are cases where it would be useful to have access to the specific
      error in order to provide a more meaningful error to users/applications.
      This change adds a member to struct btrfs_fs_info to store the error and
      removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new
      member (fs_error) has a value of 0, otherwise its value is a negative
      errno value.
      
      Followup changes will make use of this new member.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae3364e5
    • Filipe Manana's avatar
      btrfs: don't steal space from global rsv after a transaction abort · 1b6948ac
      Filipe Manana authored
      When doing a priority metadata space reclaim, while we are going through
      the flush states and running their respective operations, it's possible
      that a transaction abort happened, for example when running delayed refs
      we hit -ENOSPC or in the critical section of transaction commit we failed
      with -ENOSPC or some other error. In these cases a transaction was aborted
      and the fs turned into error state. If that happened, then it makes no
      sense to steal from the global block reserve and return success to the
      caller if the stealing was successful - the caller will later get an
      error when attempting to modify the fs. Instead make the ticket fail if
      we have the fs in error state and don't attempt to steal from the global
      rsv, as stealing is not only pointless, skipping it also simplifies
      debugging some -ENOSPC problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1b6948ac
    • Filipe Manana's avatar
      btrfs: print available space across all block groups when dumping space info · 1ff9fee3
      Filipe Manana authored
      When dumping a space info, also sum the available space for all block
      groups and then print it. This is often useful for debugging -ENOSPC
      related problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1ff9fee3
    • Filipe Manana's avatar
      btrfs: print available space for a block group when dumping a space info · e50b122b
      Filipe Manana authored
      When dumping a space info, we iterate over all its block groups and then
      print their size and the amounts of bytes used, reserved, pinned, etc.
      When debugging -ENOSPC problems it's also useful to know how much space
      is available (free), so calculate that and print it as well.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e50b122b
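The commit message does not spell out the per-block-group formula, so the following sketch illustrates the same kind of arithmetic at the space-info level instead, using the counters from the dmesg output quoted earlier in this log. The formula (total minus used, pinned, reserved, may_use and readonly) is inferred from those logged numbers rather than taken from the kernel source, zone_unusable is left out because it is 0 in every sample, and the `toy_` names are hypothetical.

```c
#include <assert.h>

/* Counters as printed in the "space_info total=..." lines of the dump. */
struct toy_space_counters {
	long long total, used, pinned, reserved, may_use, readonly;
};

/* Free space inferred from the logged counters (an assumption, see above). */
long long toy_space_free(const struct toy_space_counters *c)
{
	assert(c);
	return c->total - c->used - c->pinned - c->reserved -
	       c->may_use - c->readonly;
}

/* Values copied from the DATA, METADATA and SYSTEM lines of the dump. */
const struct toy_space_counters toy_data =
	{ 13458472960LL, 13458137088LL, 143360, 0, 0, 65536 };
const struct toy_space_counters toy_metadata =
	{ 771751936, 770146304, 1605632, 0, 51625984, 0 };
const struct toy_space_counters toy_system =
	{ 14680064, 16384, 0, 0, 0, 0 };
```

The three results match the "has ... free" figures in the dump (126976, -51625984 and 14663680), which shows how an over-committed may_use drives the METADATA figure negative.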
    • Filipe Manana's avatar
      btrfs: print block group super and delalloc bytes when dumping space info · b92e8f54
      Filipe Manana authored
      When dumping a space info's block groups, also print the number of bytes
      used for super blocks and delalloc. This is often useful for debugging
      -ENOSPC problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b92e8f54
    • Filipe Manana's avatar
      btrfs: print target number of bytes when dumping free space · 4d2024e9
      Filipe Manana authored
      When dumping free space, with btrfs_dump_free_space(), we pass a bytes
      argument in order to count how many free space entries in the block group
      have a size greater than or equal to that number of bytes. We then print
      how many suitable entries we found, but we don't print the target number
      of bytes, we just say "bytes". Change the message to actually print the
      number of bytes, which makes debugging -ENOSPC issues a bit easier.
      
      Also slightly change the odd grammar and terminology: the sentence is
      ending with 'is', which doesn't make sense, and the term 'blocks' is
      confusing as we are referring to free space entries within the block
      group's free space cache.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d2024e9
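The counting that this message describes can be sketched as follows. The helper names, the sample sizes and the exact message wording are hypothetical; only the idea of counting entries at or above a target size and printing that target comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Count the free space entries whose size is at least `bytes`. */
size_t toy_count_entries_at_least(const unsigned long long *sizes, size_t n,
				  unsigned long long bytes)
{
	size_t count = 0;

	assert(sizes || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (sizes[i] >= bytes)
			count++;
	}
	return count;
}

/* Format a summary that names the target size (wording is illustrative). */
int toy_format_summary(char *buf, size_t len, size_t count,
		       unsigned long long bytes)
{
	return snprintf(buf, len,
			"%zu free space entries at or bigger than %llu bytes",
			count, bytes);
}

/* Sample entry sizes, in bytes, for a made-up block group. */
const unsigned long long toy_sizes[] = { 4096, 16384, 65536, 1024 };
```

Including the target in the message lets a reader of the log tell whether a count of zero means "no free space at all" or "nothing large enough for this request".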
    • Filipe Manana's avatar
      btrfs: update comment for btrfs_join_transaction_nostart() · 19288951
      Filipe Manana authored
      Update the comment for btrfs_join_transaction_nostart() to be clearer
      about how it works and how it's different from btrfs_attach_transaction().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      19288951
    • Filipe Manana's avatar
      btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART · 4490e803
      Filipe Manana authored
      When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
      running transaction we end up creating one. This goes against the purpose
      of TRANS_JOIN_NOSTART which is to join a running transaction if its state
      is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
      -ENOENT error and don't start a new transaction. So fix this to not create
      a new transaction if there's no running transaction at or below that
      state.
      
      CC: stable@vger.kernel.org # 4.14+
      Fixes: a6d155d2 ("Btrfs: fix deadlock between fiemap and transaction commits")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4490e803
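The intended difference between a plain join and a TRANS_JOIN_NOSTART join, as this message describes it, can be modelled in userspace. All `toy_` names are hypothetical, the three states are a reduced stand-in for the kernel's transaction states, and returning -EBUSY for a plain join against a transaction that is past the joinable state is a simplification made only for this toy.

```c
#include <assert.h>
#include <errno.h>

/* Reduced, hypothetical set of transaction states, in commit order. */
enum toy_state { TOY_STATE_RUNNING, TOY_STATE_COMMIT_START, TOY_STATE_COMMIT_DOING };

struct toy_fs {
	int has_trans;        /* nonzero when a transaction exists */
	enum toy_state state; /* state of that transaction */
	int started;          /* number of transactions started so far */
};

/* Returns 0 when the caller holds a transaction, a negative errno otherwise. */
int toy_join(struct toy_fs *fs, int nostart)
{
	assert(fs);
	if (fs->has_trans && fs->state <= TOY_STATE_COMMIT_START)
		return 0;       /* joined the existing transaction */
	if (nostart)
		return -ENOENT; /* nothing joinable and we must not start one */
	if (fs->has_trans)
		return -EBUSY;  /* toy simplification, see the lead-in */
	fs->has_trans = 1;
	fs->state = TOY_STATE_RUNNING;
	fs->started++;
	return 0;
}

/* Usage helper: result of one join against a given fs state. */
int toy_join_result(int has_trans, enum toy_state state, int nostart)
{
	struct toy_fs fs = { has_trans, state, 0 };

	return toy_join(&fs, nostart);
}

/* Usage helper: how many transactions a join on an idle fs starts. */
int toy_started_on_idle(int nostart)
{
	struct toy_fs fs = { 0, TOY_STATE_RUNNING, 0 };

	(void)toy_join(&fs, nostart);
	return fs.started;
}
```

The bug described above corresponds to the nostart path falling through to the code that starts a transaction; in the model, `toy_started_on_idle(1)` staying at zero is the fixed behavior.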
    • Qu Wenruo's avatar
      btrfs: refactor main loop in memmove_extent_buffer() · 096d2301
      Qu Wenruo authored
      [BACKGROUND]
      Currently memmove_extent_buffer() does a loop where it stops at any page
      boundary inside [dst_offset, dst_offset + len) or [src_offset,
      src_offset + len).
      
      This mostly allows us to use copy_pages(), but if we're going to use
      folios we will need to handle multi-page (the old behavior) or single
      folio (the new optimization).
      
      The current code would be a burden for future changes.
      
      [ENHANCEMENT]
      Instead of sticking with copy_pages(), here we utilize the new
      __write_extent_buffer() helper to handle the writes.
      
      Unlike the refactoring in memcpy_extent_buffer(), we cannot just rely
      on write_extent_buffer() and only handle page boundaries inside the src
      range.
      
      The function write_extent_buffer() itself is still doing forward
      writing, thus it cannot handle the following case: (already in the
      extent buffer memory operation tests, cross page overlapping run 2)
      
      	Src	Page boundary
      	|///////|
      	    |///|////|
      	    Dst
      
      In the above case, if we just follow page boundary in the src range, we
      have no need to do any split, just one __write_extent_buffer() with
      use_memmove = true.
      
      But __write_extent_buffer() would split the dst range into two,
      so it first copies the beginning part of the src range into the first half
      of the dst range.
      After this operation, the beginning of the dst range is already updated,
      causing corruption.
      
      So we have to follow the old behavior of handling both page boundaries.
      
      And since we're the last caller of copy_pages(), we can remove it
      completely.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      096d2301
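The corruption case drawn in the diagram can be reproduced with a flat buffer and a made-up page size. In this sketch `TOY_PAGE_SIZE` and the `toy_` functions are hypothetical: the forward variant splits only at dst page boundaries and copies the chunks front to back, while the backward variant splits at both src and dst boundaries and copies from the end, assuming dst is above src. It is an illustration of the overlap problem, not the extent buffer code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8

typedef void (*toy_move_fn)(unsigned char *buf, size_t dst, size_t src,
			    size_t len);

/* Forward chunks split at dst page boundaries; each chunk uses memmove(). */
void toy_move_forward(unsigned char *buf, size_t dst, size_t src, size_t len)
{
	while (len > 0) {
		size_t room = TOY_PAGE_SIZE - dst % TOY_PAGE_SIZE;
		size_t cur = len < room ? len : room;

		memmove(buf + dst, buf + src, cur);
		dst += cur;
		src += cur;
		len -= cur;
	}
}

/* Backward chunks split at both src and dst page boundaries (dst > src). */
void toy_move_backward(unsigned char *buf, size_t dst, size_t src, size_t len)
{
	size_t dst_end = dst + len;
	size_t src_end = src + len;

	assert(dst > src);
	while (len > 0) {
		/* Bytes between the start of the current page and each end. */
		size_t dst_room = (dst_end - 1) % TOY_PAGE_SIZE + 1;
		size_t src_room = (src_end - 1) % TOY_PAGE_SIZE + 1;
		size_t cur = len;

		if (cur > dst_room)
			cur = dst_room;
		if (cur > src_room)
			cur = src_room;
		dst_end -= cur;
		src_end -= cur;
		memmove(buf + dst_end, buf + src_end, cur);
		len -= cur;
	}
}

/* Move 8 bytes from offset 0 to offset 4 and compare with one memmove(). */
int toy_matches_memmove(toy_move_fn fn)
{
	unsigned char got[16], want[16];

	for (size_t i = 0; i < sizeof(got); i++)
		got[i] = want[i] = (unsigned char)i;
	memmove(want + 4, want, 8);
	fn(got, 4, 0, 8);
	return memcmp(got, want, sizeof(got)) == 0;
}
```

In the forward variant the first chunk overwrites bytes 4 to 7 before the second chunk reads them as its source, which is the corruption the message describes; copying the last chunk first avoids it.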