1. 25 Jul, 2022 40 commits
    • btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana authored
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if its writeout is starting.
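
      The difference between pinning and joining can be illustrated with a toy,
      single-threaded model. This is only a sketch under stated assumptions: the
      names ToyLogRoot, pin, try_join, start_writeout and rename_with are
      hypothetical and are not the kernel's API, and where the model returns
      False the real code would block on the root's log_mutex instead.

```python
class ToyLogRoot:
    """Toy model of a log root; not the kernel data structure."""

    def __init__(self):
        self.writers = 0               # tasks currently allowed to modify the log
        self.writeout_started = False  # a syncing task began writing the log out
        self.unsafe_updates = 0        # updates made while writeout was running

    def start_writeout(self):
        # The syncing task only checks for writers at this point in time.
        if self.writers:
            return False  # the real code would wait for the writers to finish
        self.writeout_started = True
        return True

    def pin(self):
        # Pinning only bumps the writer count; it never looks at writeout state.
        self.writers += 1
        return True

    def try_join(self):
        # Joining refuses while a writeout is in progress (the kernel would
        # block here rather than fail).
        if self.writeout_started:
            return False
        self.writers += 1
        return True

    def update(self):
        # Count log changes that race with a writeout in progress.
        if self.writeout_started:
            self.unsafe_updates += 1

    def release(self):
        self.writers -= 1


def rename_with(acquire_name):
    """Run the racy sequence using either 'pin' or 'try_join' to get access."""
    log = ToyLogRoot()
    log.start_writeout()  # another task starts syncing the log first
    if getattr(log, acquire_name)():
        log.update()
        log.release()
    return log.unsafe_updates
```

      In this model rename_with("pin") modifies the log while its writeout is in
      progress, whereas rename_with("try_join") does not touch it.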
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify error handling in btrfs_lookup_dentry · fc8b235f
      Nikolay Borisov authored
      In btrfs_lookup_dentry(), releasing the reference on sub_root and running
      orphan cleanup should only happen if the dentry found actually
      represents a subvolume. This can only be true in the 'else' branch:
      otherwise either fixup_tree_root_location() returned -ENOENT, in
      which case sub_root was not changed, or it returned a different errno,
      which means btrfs_get_fs_root() did not execute successfully, so again
      sub_root is still equal to root. So simplify all the branches
      by moving the code into the 'else'.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: always use the rbtree based inode ref management infrastructure · 0d8869fb
      Filipe Manana authored
      After the patch "btrfs: send: fix sending link commands for existing file
      paths", we now have two infrastructures to detect and eliminate duplicated
      inode references (due to names that got removed and re-added between the
      send and parent snapshots):
      
      1) One that works on a single inode ref/extref item;
      
      2) A new one that works across all ref/extref items for an inode, and
         it's also more efficient because even in the single ref/extref item
         case, it does not do a linear search for all the names encoded in the
         ref/extref item, it uses red black trees to speed up the search.
      
      There's no good reason to keep both infrastructures, we can use the new
      one everywhere, and it's always more efficient.
      
      So remove the old infrastructure and change all sites that are using it
      to use the new one.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix sending link commands for existing file paths · 3aa5bd36
      BingJing Chang authored
      There is a bug sending link commands for existing file paths. When we're
      processing an inode, we go over all references. All the new file paths are
      added to the "new_refs" list. And all the deleted file paths are added to
      the "deleted_refs" list. In the end, when we finish processing the inode,
      we iterate over all the items in the "new_refs" list and send link commands
      for those file paths. After that, we go over all the items in the
      "deleted_refs" list and send unlink commands for them. If there are
      duplicated file paths in both lists, we will try to create them before we
      remove them. Then the receiver gets an -EEXIST error when trying the link
      operations.
      
      Example for having duplicated file paths in both lists:
      
        $ btrfs subvolume create vol
      
        # create a file and 2000 hard links to the same inode
        $ touch vol/foo
        $ for i in {1..2000}; do link vol/foo vol/$i ; done
      
        # take a snapshot for a parent snapshot
        $ btrfs subvolume snapshot -r vol snap1
      
        # remove 2000 hard links and re-create the last 1000 links
        $ for i in {1..2000}; do rm vol/$i; done;
        $ for i in {1001..2000}; do link vol/foo vol/$i; done
      
        # take another one for a send snapshot
        $ btrfs subvolume snapshot -r vol snap2
      
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        link 1238 -> foo
        ERROR: link 1238 -> foo failed: File exists
      
      In this case, we will have the same file paths added to both lists. In the
      parent snapshot, reference paths {1..1237} are stored in inode references,
      but reference paths {1238..2000} are stored in inode extended references.
      In the send snapshot, all reference paths {1001..2000} are stored in inode
      references. During the incremental send, we process their inode references
      first. In record_changed_ref(), we iterate all its inode references in the
      send/parent snapshot. For every inode reference, we also use find_iref() to
      check whether the same file path also appears in the parent/send snapshot
      or not. Inode references {1238..2000} which appear in the send snapshot but
      not in the parent snapshot are added to the "new_refs" list. On the other
      hand, inode references {1..1000} which appear in the parent snapshot but
      not in the send snapshot are added to the "deleted_refs" list. Next, when
      we process their inode extended references, reference paths {1238..2000}
      are added to the "deleted_refs" list because all of them only appear in the
      parent snapshot. Now two lists contain items as below:
      "new_refs" list: {1238..2000}
      "deleted_refs" list: {1..1000}, {1238..2000}
      
      Reference paths {1238..2000} appear in both lists. And given the processing
      order described above, the receiver gets an -EEXIST error when trying
      the link operations.
      
      To fix the bug, the idea is to process the "deleted_refs" list before
      the "new_refs" list. However, it's not easy to reshuffle the processing
      order. For one thing, if we do so, we may unlink all the existing paths
      first, leaving no valid path to link from. It's also inefficient
      because we do a bunch of unlinks followed by links for the same paths.
      Moreover, it makes little sense to have duplicates in both lists: a
      reference path cannot be regarded as new if it has also been seen in
      the past, or we wouldn't call it a new path. However, it's also not a good
      idea to make find_iref() check a reference against all inode references
      and all inode extended references because it may result in large disk
      reads.
      
      So we introduce two rbtrees to make the references easier for lookups.
      And we also introduce record_new_ref_if_needed() and
      record_deleted_ref_if_needed() for changed_ref() to check and remove
      duplicated references early.
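
      The cancelling behaviour of the two helpers can be sketched in a few lines
      of Python, replaying the reference layout from the example above. This is
      an illustration only: plain dicts stand in for the two rbtrees, references
      are keyed by the bare link name rather than by directory and name, and
      RefTracker and diff_names are hypothetical names, not code from the patch.

```python
def diff_names(this_item, other_item):
    """Names present in this_item but not in the matching item of the other snapshot."""
    return sorted(set(this_item) - set(other_item))


class RefTracker:
    def __init__(self):
        self.new_refs = {}      # stands in for the rbtree of new refs
        self.deleted_refs = {}  # stands in for the rbtree of deleted refs

    def record_new_ref_if_needed(self, name):
        # A name already queued for deletion that shows up as new cancels out.
        if name in self.deleted_refs:
            del self.deleted_refs[name]
        else:
            self.new_refs[name] = True

    def record_deleted_ref_if_needed(self, name):
        # A name already queued as new that shows up as deleted cancels out.
        if name in self.new_refs:
            del self.new_refs[name]
        else:
            self.deleted_refs[name] = True


# Layout from the example: where each snapshot stores the link names.
parent_iref = range(1, 1238)      # parent snapshot, inode ref item
parent_extref = range(1238, 2001) # parent snapshot, inode extended ref item
send_iref = range(1001, 2001)     # send snapshot, inode ref item
send_extref = range(0)            # send snapshot has no extended refs

tracker = RefTracker()
# Inode ref items are processed first, then the extended ref items.
for name in diff_names(send_iref, parent_iref):
    tracker.record_new_ref_if_needed(name)
for name in diff_names(parent_iref, send_iref):
    tracker.record_deleted_ref_if_needed(name)
for name in diff_names(send_extref, parent_extref):
    tracker.record_new_ref_if_needed(name)
for name in diff_names(parent_extref, send_extref):
    tracker.record_deleted_ref_if_needed(name)
```

      With this bookkeeping, paths {1238..2000} cancel out, so only unlinks for
      {1..1000} remain and no link command is sent for an existing path.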

      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: introduce recorded_ref_alloc and recorded_ref_free · 71ecfc13
      BingJing Chang authored
      Introduce wrappers to allocate and free recorded_ref structures.

      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota authored
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate that a zone must be finished before allocation can proceed. The
      flag is set when the allocator detects that it cannot activate a new block
      group, and it is cleared once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: write out partially allocated region · 898793d9
      Naohiro Aota authored
      cow_file_range() works in an all-or-nothing way: if it fails to allocate an
      extent for a part of the given region, it gives up the whole region, including
      the successfully allocated parts. On top of cow_file_range(),
      run_delalloc_zoned() writes data for the region only when the whole region
      was allocated successfully.

      This all-or-nothing allocation and write-out are problematic when the
      available space in all the block groups gets tight under the active zone
      restriction. btrfs_reserve_extent() tries hard to utilize the space left in
      the active block groups, and finally gives up and fails with
      -ENOSPC. However, if we send IOs for the successfully allocated region, we
      can finish a zone and continue the rest of the allocation on a newly
      allocated block group.

      This patch implements the partial write-out for run_delalloc_zoned(). With
      this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
      do something to let further allocation progress, and reports the successfully
      allocated region through done_offset. Furthermore, the zoned extent allocator
      returns -EAGAIN to tell cow_file_range() to go back to the caller.
      
      Actually, we still need to wait for an IO to complete to continue the
      allocation. The next patch implements that part.
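
      The control flow described above can be sketched as a small userspace
      model. It is an illustration under stated assumptions, not the kernel
      implementation: the function names are reused only for readability, the
      free_space, extent_size and refill parameters are hypothetical, and
      done_offset is treated as an exclusive end offset for simplicity.

```python
import errno


def cow_file_range(start, end, free_space, extent_size):
    """Toy allocator: returns (ret, done_offset, remaining free_space)."""
    pos = start
    while pos < end:
        length = min(extent_size, end - pos)
        if length > free_space:
            # Report how far we got instead of undoing everything.
            return -errno.EAGAIN, pos, free_space
        free_space -= length
        pos += length
    return 0, end, free_space


def run_delalloc_zoned(start, end, free_space, extent_size, refill):
    """Write out whatever was allocated, then retry the rest of the range."""
    if refill <= 0:
        raise ValueError("the model needs refill > 0 to make progress")
    written = []
    while start < end:
        ret, done_offset, free_space = cow_file_range(
            start, end, free_space, extent_size)
        if done_offset > start:
            # Submit IO for the partially allocated region.
            written.append((start, done_offset))
        if ret == 0:
            break
        # Hypothetical: finishing a zone makes `refill` more bytes available.
        free_space += refill
        start = done_offset
    return written
```

      For example, run_delalloc_zoned(0, 10, 4, 2, 4) writes the range out in
      three pieces instead of failing the whole region on the first shortage.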
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate necessary block group · b6a98021
      Naohiro Aota authored
      There are two places where allocating a chunk is not enough. These two
      places are trying to ensure the space by allocating a chunk. To meet the
      condition for active_total_bytes, we also need to activate a block group
      there.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate metadata block group on flush_space · b0931513
      Naohiro Aota authored
      For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
      means we don't have enough space left in the active_total_bytes. Before
      allocating a new chunk, we can try to activate an existing block group
      in this case.
      
      Also, allocating a chunk is not enough to grant a ticket for metadata
      space on zoned filesystem: we need to activate the block group to
      increase the active_total_bytes.
      
      btrfs_zoned_activate_one_bg() implements the activation feature. It will
      activate a block group by (maybe) finishing a block group. It will give up
      activating a block group if it cannot finish any block group.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: disable metadata overcommit for zoned · 79417d04
      Naohiro Aota authored
      The metadata overcommit makes the space reservation flexible but it is also
      harmful to active zone tracking. Since we cannot finish a block group from
      the metadata allocation context, we might not activate a new block group
      and might not be able to actually write out the overcommit reservations.
      
      So, disable metadata overcommit for zoned filesystems. We will ensure
      the reservations are under active_total_bytes in the following patches.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce space_info->active_total_bytes · 6a921de5
      Naohiro Aota authored
      The active_total_bytes, like the total_bytes, accounts for the total bytes
      of active block groups in the space_info.
      
      With the introduction of active_total_bytes, we can check if the reserved
      bytes can be written to the block groups without activating a new block
      group. The check is necessary for metadata allocation on zoned
      filesystem. We cannot finish a block group, which may require waiting
      for the current transaction, from the metadata allocation context.
      Instead, we need to ensure the ongoing allocation (reserved bytes) fits
      in active block groups.
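
      A minimal sketch of that check, assuming a simplified view where each
      block group is a dict with hypothetical "length" and "active" fields. The
      helper names are illustrative and the comparison is a simplification of
      the kernel's reservation logic.

```python
def active_total_bytes(block_groups):
    """Sum the length of the block groups that are currently active."""
    return sum(bg["length"] for bg in block_groups if bg["active"])


def fits_in_active_block_groups(used, request, block_groups):
    """True if already used/reserved bytes plus the new request still fit
    in the active block groups, so no new activation is needed."""
    return used + request <= active_total_bytes(block_groups)
```

      With one active and one inactive block group of equal size, only the
      active one counts towards the limit.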

      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish least available block group on data bg allocation · 393f646e
      Naohiro Aota authored
      When we run out of active zones and there is not enough space left in any
      block group, we need to finish one block group to make room to activate a
      new block group.
      
      However, we cannot do this for metadata block groups because we can cause a
      deadlock by waiting for a running transaction commit. So, do that only for
      a data block group.
      
      Furthermore, the block group to be finished has two requirements. First,
      the block group must not have reserved bytes left. Having reserved bytes
      means we have an allocated region but did not yet send bios for it. If that
      region is allocated by the thread calling btrfs_zone_finish(), it results
      in a deadlock.
      
      Second, the block group to be finished must not be a SYSTEM block
      group. Finishing a SYSTEM block group easily breaks further chunk
      allocation by nullifying the SYSTEM free space.
      
      In certain cases, we cannot find any zone finish candidate, or
      btrfs_zone_finish() may fail. In that case, we fall back to splitting the
      allocation bytes and filling the last space left in the block groups.
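
      The selection rules above can be sketched as follows. This is a model
      under stated assumptions: block groups are plain dicts with hypothetical
      "type", "reserved", "zone_capacity" and "alloc_offset" fields, available
      space is taken as zone_capacity minus alloc_offset, and
      pick_block_group_to_finish is an illustrative name.

```python
def pick_block_group_to_finish(active_block_groups):
    """Pick the active block group with the least available space, skipping
    block groups with reserved bytes and SYSTEM block groups."""
    candidates = [
        bg for bg in active_block_groups
        if bg["reserved"] == 0 and bg["type"] != "SYSTEM"
    ]
    if not candidates:
        return None  # the caller falls back to splitting the allocation
    return min(candidates,
               key=lambda bg: bg["zone_capacity"] - bg["alloc_offset"])
```

      Returning None corresponds to the case where no zone finish candidate
      exists and the fallback path is taken.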
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: let can_allocate_chunk return error · bb9950d3
      Naohiro Aota authored
      For the later patch, convert the return type from bool to int and return
      errors. No functional changes.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use fs_info->max_extent_size in get_extent_max_capacity() · d7601566
      Naohiro Aota authored
      Use fs_info->max_extent_size also in get_extent_max_capacity() for
      completeness. This is only used for defrag and is not strictly necessary for
      fixing the metadata reservation size, but it still suppresses unnecessary
      defrag operations.

      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Naohiro Aota authored
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
      of extents needed, btrfs releases too much of the metadata reservation on
      its way to writing out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
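
      The effect on the estimate can be shown with a small calculation. This is
      a sketch: count_max_extents() is modelled as a round-up division, and the
      1 MiB max_zone_append_size used in the example is a hypothetical value
      chosen only for illustration.

```python
BTRFS_MAX_EXTENT_SIZE = 128 * 1024 * 1024  # 128 MiB


def count_max_extents(size, max_extent_size):
    """Number of extents needed to cover `size` bytes, rounding up."""
    return (size + max_extent_size - 1) // max_extent_size


delalloc_bytes = 128 * 1024 * 1024
zoned_max_extent_size = 1024 * 1024  # hypothetical max_zone_append_size

estimate_regular = count_max_extents(delalloc_bytes, BTRFS_MAX_EXTENT_SIZE)
estimate_zoned = count_max_extents(delalloc_bytes, zoned_max_extent_size)
```

      Under these assumptions the same 128 MiB of delalloc is estimated as a
      single extent with the old constant but as 128 extents with the zoned
      limit, which is why metadata reserved for one extent is not enough.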
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Naohiro Aota authored
      On zoned filesystem, data write out is limited by max_zone_append_size,
      and a large ordered extent is split according to the size of a bio. OTOH,
      the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
      metadata bytes to update and/or create the metadata items.

      The metadata reservation is done at e.g. btrfs_buffered_write() and then
      released as the estimate changes. Thus, if the number of extents
      increases massively, the reserved metadata can run out.
      
      The increase of the number of extents easily occurs on zoned filesystem
      if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
      following warning in a small RAM environment with metadata over-commit
      disabled (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
      regular vs zoned filesystems.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
      filesystem, it is set to fs_info->max_zone_append_size.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: revive max_zone_append_bytes · c2ae7b77
      Naohiro Aota authored
      This patch is basically a revert of commit 5a80d1c6 ("btrfs: zoned:
      remove max_zone_append_size logic"), but without unnecessary ASSERT and
      check. The max_zone_append_size will be used as a hint to estimate the
      number of extents to cover delalloc/writeback region in the later commits.
      
      The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
      this commit takes it into account when calculating max_zone_append_size. Technically, a
      bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
      contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
      as an upper limit of an extent size to calculate the number of extents
      needed to write data.
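
      The upper limit described above amounts to taking the smaller of two
      byte counts. The sketch below assumes 512-byte sectors and 4 KiB pages;
      the function name and its parameters are illustrative and do not mirror
      the kernel's exact code.

```python
SECTOR_SHIFT = 9  # 512-byte sectors
PAGE_SHIFT = 12   # assumes 4 KiB pages


def max_zone_append_size(max_zone_append_sectors, max_segments):
    """Upper bound of an extent size for estimating extent counts: limited by
    the device's zone append limit and by max_segments * PAGE_SIZE."""
    return min(max_zone_append_sectors << SECTOR_SHIFT,
               max_segments << PAGE_SHIFT)
```

      For instance, a device allowing 2048 sectors per zone append but only 128
      segments is bounded by the segment limit in this model.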

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • block: add bdev_max_segments() helper · 65ea1b66
      Naohiro Aota authored
      Add bdev_max_segments() like other queue parameters.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add optimized btrfs_ino() version for 64 bits systems · cf2404a9
      Filipe Manana authored
      Currently btrfs_ino() tries to use first the objectid of the inode's
      location key. This is to avoid truncation of the inode number on 32 bits
      platforms because the i_ino field of struct inode has the unsigned long
      type, while the objectid is a 64 bits unsigned type (u64) on every system.
      This logic was added in commit 33345d01 ("Btrfs: Always use 64bit
      inode number").
      
      However if we are running on a 64 bits system, we can always directly
      return the i_ino value from struct inode, which eliminates the need for
      the special if statement that tests for a location key type of
      BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as
      the objectid of the inode's location key, it may have a value of
      BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with
      subvolumes/snapshots inside them.
      
      So add a special version for 64 bits system that directly returns i_ino
      of struct inode. This eliminates one branch and reduces the overall code
      size, since btrfs_ino() is an inline function that is extensively used.
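
      The reasoning can be checked with a toy model of the two variants. This is
      a sketch, not the kernel code: ToyInode, make_inode and the two function
      names are hypothetical, i_ino is modelled as the objectid truncated to the
      width of unsigned long, and the constants are meant to match the btrfs
      headers but only serve as labels here.

```python
from dataclasses import dataclass

BTRFS_INODE_ITEM_KEY = 1
BTRFS_ROOT_ITEM_KEY = 132
BTRFS_EMPTY_SUBVOL_DIR_OBJECTID = 2


@dataclass
class ToyInode:
    location_objectid: int  # always a 64 bits value (u64)
    location_type: int
    i_ino: int              # unsigned long: 32 bits wide on 32 bits platforms


def make_inode(objectid, key_type, long_bits):
    """Build an inode whose i_ino is the objectid truncated to long_bits."""
    return ToyInode(objectid, key_type, objectid & ((1 << long_bits) - 1))


def btrfs_ino_generic(inode):
    """Variant needed on 32 bits: prefer the location key's objectid."""
    if inode.location_type == BTRFS_ROOT_ITEM_KEY:
        return inode.i_ino
    return inode.location_objectid


def btrfs_ino_64bit(inode):
    """Variant for 64 bits: i_ino can hold the full inode number."""
    return inode.i_ino
```

      On a 64 bits model both variants agree, including for the subvolume
      directory case, while on a 32 bits model only the generic variant returns
      the untruncated number.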
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1617487	 189240	  29032	1835759	 1c02ef	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1612028	 189180	  29032	1830240	 1bed60	fs/btrfs/btrfs.ko

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set the objectid of the btree inode's location key · adac5584
      Filipe Manana authored
      We currently don't use the location key of the btree inode, its content
      is set to zeroes, as it's a special inode that is not persisted (it has
      no inode item stored in any btree).
      
      At btrfs_ino(), an inline function used extensively in btrfs, we have
      this special check if the given inode's location objectid is 0, and if it
      is, we return the value stored in the VFS' inode i_ino field instead
      (which is BTRFS_BTREE_INODE_OBJECTID for the btree inode).
      
      To reduce the code at btrfs_ino(), we can simply set the objectid of the
      btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the
      need to check for the special case of the objectid being zero, with the
      side effect of reducing the overall code size and having less code to
      execute, as btrfs_ino() is an inline function.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620502	 189240	  29032	1838774	 1c0eb6	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1617487	 189240	  29032	1835759	 1c02ef	fs/btrfs/btrfs.ko

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace kmap_atomic() with kmap_local_page() · 4cb2e5e8
      Fabio M. De Francesco authored
      kmap_atomic() is being deprecated in favor of kmap_local_page() where it
      is feasible. With kmap_local_page() mappings are per thread, CPU local,
      and not globally visible.
      
      The last use of kmap_atomic() is in inode.c, where the context is
      atomic [1], and it can be safely replaced by kmap_local_page().
      
      Tested with xfstests on a QEMU + KVM 32-bit VM with 4GB RAM and booting a
      kernel with HIGHMEM64G enabled.
      
      [1] https://lore.kernel.org/linux-btrfs/20220601132545.GM20633@twin.jikos.cz/

      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4cb2e5e8
    • Fabio M. De Francesco's avatar
      btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio() · 5a6e6e7c
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zlib_decompress_bio()
      because in this function the mappings are per thread and are not visible
      in other contexts.
      
      Tested with xfstests on a QEMU + KVM 32-bit VM with 4GB of RAM and
      HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress".
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a6e6e7c
    • Fabio M. De Francesco's avatar
      btrfs: zlib: replace kmap() with kmap_local_page() in zlib_compress_pages() · 718e5855
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zlib_compress_pages()
      because in this function the mappings are per thread and are not visible
      in other contexts. Furthermore, drop the mappings of "out_page" which is
      allocated within zlib_compress_pages() with alloc_page(GFP_NOFS) and use
      page_address().
      
      Tested with xfstests on a QEMU + KVM 32-bit VM with 4GB of RAM booting
      a kernel with HIGHMEM64G enabled. This patch passes 26/26 tests of group
      "compress".
      
      CC: Qu Wenruo <wqu@suse.com>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      718e5855
    • Fabio M. De Francesco's avatar
      btrfs: zstd: replace kmap() with kmap_local_page() · ebd23482
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zstd.c because in this
      file the mappings are per thread and are not visible in other contexts.
      Also use plain page_address() on output pages allocated with the
      GFP_NOFS flag instead of calling kmap*() on them (since they are
      always allocated from ZONE_NORMAL).
      
      Tested with xfstests on a QEMU + KVM 32-bit VM with 4GB of RAM, booting a
      kernel with HIGHMEM64G enabled.
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ebd23482
    • Fabio M. De Francesco's avatar
      highmem: Make __kunmap_{local,atomic}() take const void pointer · 39ade048
      Fabio M. De Francesco authored
      __kunmap_{local,atomic}() currently take pointers to void. However, this
      is semantically incorrect, since these functions do not change the memory
      their arguments point to.
      
      Therefore, make these semantics explicit by modifying the
      __kunmap_{local,atomic}() prototypes to take pointers to const void.
      
      As a side effect, compilers may produce more efficient code.
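A minimal userspace sketch of why a `const void *` parameter helps callers. The `model_map()`, `model_unmap()` and `model_read_first()` helpers are hypothetical and only imitate the map/unmap pairing; they are not the highmem API.

```c
#include <assert.h>
#include <stddef.h>

static size_t live_mappings; /* counts outstanding mappings in this toy model */

/* Toy stand-in for a local mapping: hands back the buffer itself. */
static void *model_map(char *buf)
{
	live_mappings++;
	return buf;
}

/* Takes const void *: unmapping never writes through the pointer. */
static void model_unmap(const void *addr)
{
	assert(addr != NULL);
	live_mappings--;
}

/* A reader that only needs a read-only view of the mapped data. */
static char model_read_first(char *buf)
{
	const char *ro = model_map(buf);
	char c = ro[0];

	/* No cast needed; a void * parameter would discard the qualifier. */
	model_unmap(ro);
	return c;
}
```

With a plain `void *` parameter, passing `ro` would draw a discarded-qualifier warning or force a cast at each read-only call site.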
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Helge Deller <deller@gmx.de>  # parisc
      Suggested-by: David Sterba <dsterba@suse.cz>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      39ade048
    • Filipe Manana's avatar
      btrfs: don't fallback to buffered IO for NOWAIT direct IO writes · ac5e6669
      Filipe Manana authored
      Currently, for a direct IO write, if we need to fall back to buffered IO,
      either to satisfy the whole write operation or just a part of it, we do
      it in the current context even if it's a NOWAIT context. This is not ideal
      because we currently don't have support for NOWAIT semantics in the
      buffered IO path (we can block for several reasons), so we should instead
      return -EAGAIN to the caller, so that it knows it should retry (the whole
      operation or what's left of it) in a context where blocking is acceptable.
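The decision described above can be modelled roughly as follows. `struct model_write` and `model_dio_write()` are hypothetical names, and the return convention (bytes written or a negative errno) is an assumption made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one direct IO write attempt. */
struct model_write {
	size_t len;       /* total bytes requested */
	size_t direct_ok; /* bytes the direct IO path managed to write */
	bool nowait;      /* caller asked for non-blocking semantics */
};

/* Returns the bytes written, or -EAGAIN when the caller must retry. */
static long model_dio_write(const struct model_write *w)
{
	assert(w != NULL);

	/* Direct IO covered the whole request: no fallback needed. */
	if (w->direct_ok >= w->len)
		return (long)w->len;

	/*
	 * A buffered fallback would be required, and that path may block.
	 * In a NOWAIT context, report -EAGAIN instead of falling back.
	 */
	if (w->nowait)
		return -EAGAIN;

	/* Blocking is acceptable: the buffered path finishes the rest. */
	return (long)w->len;
}
```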
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ac5e6669
    • David Sterba's avatar
      btrfs: use enum for btrfs_block_rsv::type · 8bfc9b2c
      David Sterba authored
      The number of block group reserve types BTRFS_BLOCK_RSV_* is small and
      fits in a u8, with enough room left in case we want to add more.
      For type safety use the enum but make it 8 bits in the structure to save
      space.
      
      The structure size is now 48 on a release build, making a slight
      improvement in structures where it's embedded, like btrfs_fs_info or
      btrfs_inode.
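A rough sketch of storing an enum in 8 bits, assuming a compiler that accepts enum-typed bit-fields (GCC and Clang do). The struct and enumerator names are placeholders, and the member list is a simplified guess, not the real btrfs_block_rsv layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reserve types; the names are placeholders. */
enum model_rsv_type {
	MODEL_RSV_GLOBAL,
	MODEL_RSV_DELALLOC,
	MODEL_RSV_TRANS,
};

/* Plain enum member: typically as wide as an int. */
struct model_rsv_wide {
	uint64_t size;
	uint64_t reserved;
	bool full;
	bool failfast;
	enum model_rsv_type type;
};

/* Enum kept for type safety, but stored in an 8-bit bit-field. */
struct model_rsv_narrow {
	uint64_t size;
	uint64_t reserved;
	bool full;
	bool failfast;
	enum model_rsv_type type:8;
};

/* Initialize a reserve; the compiler sees the enum type at call sites. */
static void model_rsv_init(struct model_rsv_narrow *rsv,
			   enum model_rsv_type type)
{
	assert(type <= MODEL_RSV_TRANS);
	rsv->size = 0;
	rsv->reserved = 0;
	rsv->full = false;
	rsv->failfast = false;
	rsv->type = type;
}
```

In a layout this small, alignment padding can hide the saving, so the two model structs may end up the same size; the benefit depends on the neighbouring members of the real structure.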
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8bfc9b2c
    • David Sterba's avatar
      btrfs: switch btrfs_block_rsv::failfast to bool · 710d5921
      David Sterba authored
      Use a plain bool for the block reserve failfast status. The field was a
      short, chosen to save space over the earlier int, but there's no reason
      for it to be anything other than a bool.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      710d5921
    • David Sterba's avatar
      btrfs: switch btrfs_block_rsv::full to bool · c70c2c5b
      David Sterba authored
      Use a plain bool for the block reserve full status. The field was a
      short, chosen to save space over the earlier int, but there's no reason
      for it to be anything other than a bool.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c70c2c5b
    • Christoph Hellwig's avatar
      btrfs: do not return errors from btrfs_submit_dio_bio · 37899117
      Christoph Hellwig authored
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission and the other btrfs bio submission handlers do
      and avoids any confusion on who needs to handle errors.
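The error-handling convention described above can be illustrated with a small model. `struct model_bio`, `model_lower_submit()` and `model_submit()` are hypothetical; the point is only that the submit function returns void and completes the bio itself on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a bio. */
struct model_bio {
	int status;                            /* 0 or a negative errno */
	void (*end_io)(struct model_bio *bio); /* completion callback */
	int completed;                         /* bumped by the callback */
};

/* Completion handler: the single place where errors are observed. */
static void model_end_io(struct model_bio *bio)
{
	bio->completed++;
}

/* Hypothetical lower layer; a non-zero fail argument forces an error. */
static int model_lower_submit(struct model_bio *bio, int fail)
{
	(void)bio;
	return fail ? -EIO : 0;
}

/*
 * Returns void. On failure the bio is consumed here: the error is stored
 * in the bio and end_io is called, so callers have nothing to unwind.
 * On success, completion is left to the lower layer (not modelled).
 */
static void model_submit(struct model_bio *bio, int fail)
{
	int ret;

	assert(bio->end_io != NULL);
	ret = model_lower_submit(bio, fail);
	if (ret) {
		bio->status = ret;
		bio->end_io(bio);
	}
}
```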
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      37899117
    • Christoph Hellwig's avatar
      btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully · ea1f0ced
      Christoph Hellwig authored
      btrfs_wq_submit_bio is used for writeback under memory pressure.
      Instead of failing the I/O when we can't allocate the async_submit_bio,
      just punt back to the synchronous submission path.
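A sketch of the fallback idea, under the assumption that the async helper reports whether it managed to queue the work. All names here (`model_wq_submit()`, `model_submit_write()`, the counters) are invented for illustration, and `malloc()` stands in for the kernel allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical async work item. */
struct model_async_work {
	int payload;
};

static int sync_submissions;  /* how often the sync path ran */
static int async_submissions; /* how often work was queued */

/* alloc_ok simulates whether the allocation succeeds. */
static struct model_async_work *model_alloc(bool alloc_ok)
{
	return alloc_ok ? malloc(sizeof(struct model_async_work)) : NULL;
}

/* Returns true if the work was queued, false if the caller must go sync. */
static bool model_wq_submit(int payload, bool alloc_ok)
{
	struct model_async_work *work = model_alloc(alloc_ok);

	if (!work)
		return false;
	work->payload = payload;
	async_submissions++;
	free(work); /* a real queue would own this until the worker ran */
	return true;
}

/* Caller: try async first, punt to the synchronous path on failure. */
static void model_submit_write(int payload, bool alloc_ok)
{
	assert(payload >= 0);
	if (model_wq_submit(payload, alloc_ok))
		return;
	sync_submissions++; /* the IO still goes out, just synchronously */
}
```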
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ea1f0ced
    • Christoph Hellwig's avatar
      btrfs: simplify sync/async submission in btrfs_submit_data_write_bio · 82443fd5
      Christoph Hellwig authored
      btrfs_submit_data_write_bio special cases the reloc root because the
      checksums are preloaded, but only does so for the !sync case.  The sync
      case can't happen for data relocation, but just handling it more generally
      significantly simplifies the logic.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      82443fd5
    • Christoph Hellwig's avatar
      btrfs: raid56: transfer the bio counter reference to the raid submission helpers · b9af128d
      Christoph Hellwig authored
      Transfer the bio counter reference acquired by btrfs_submit_bio to
      raid56_parity_write and raid56_parity_recovery together with the bio
      that the reference was acquired for instead of acquiring another
      reference in those helpers and dropping the original one in
      btrfs_submit_bio.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9af128d
    • Christoph Hellwig's avatar
      btrfs: do not return errors from raid56_parity_recover · 6065fd95
      Christoph Hellwig authored
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      Also use the proper bool type for the generic_io argument.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6065fd95
    • Christoph Hellwig's avatar
      btrfs: do not return errors from raid56_parity_write · 31683f4a
      Christoph Hellwig authored
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31683f4a
    • Christoph Hellwig's avatar
      btrfs: do not return errors from btrfs_map_bio · 1a722d8f
      Christoph Hellwig authored
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches
      what the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      As this requires touching all the callers, rename the function to
      btrfs_submit_bio, which describes the functionality much better.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a722d8f
    • Qu Wenruo's avatar
      btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() · 462b0b2a
      Qu Wenruo authored
      For profiles other than RAID56, __btrfs_map_block() returns @map_length
      as min(stripe_end, logical + *length), which is also the same result
      from btrfs_get_io_geometry().
      
      But for RAID56, __btrfs_map_block() returns @map_length as stripe_len.
      
      This inconsistent behavior is going to hurt the upcoming bio splitting
      at btrfs_map_bio() time, as we will use @map_length as the bio split size.
      
      Fix this behavior by returning @map_length by the same calculation as
      for other profiles.
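As a worked illustration of the clamping described above, here is a sketch that reads the min() expression as an end offset and converts it to a length. That reading, the 64K value of `MODEL_STRIPE_LEN`, and the assumption that stripes start at logical offset 0 are all assumptions of this example, not statements about __btrfs_map_block().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stripe length for the example. */
#define MODEL_STRIPE_LEN (64ULL * 1024)

static uint64_t model_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* End of the stripe containing logical, with stripes starting at 0. */
static uint64_t model_stripe_end(uint64_t logical)
{
	return (logical / MODEL_STRIPE_LEN + 1) * MODEL_STRIPE_LEN;
}

/* Mapped length: the request clamped to the stripe boundary. */
static uint64_t model_map_length(uint64_t logical, uint64_t length)
{
	uint64_t end;

	assert(length > 0);
	end = model_min_u64(model_stripe_end(logical), logical + length);
	return end - logical;
}
```

A 16K request starting 4K before a stripe boundary maps only 4K in this model, which is the kind of value a bio split would need; always returning the full stripe length would overstate it.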
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      462b0b2a
    • Christoph Hellwig's avatar
      btrfs: raid56: use fixed stripe length everywhere · ff18a4af
      Christoph Hellwig authored
      The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN, yet some
      functions still pass it as an argument, which is not necessary. The fixed
      value has been used for a long time, and though the stripe length should
      be configurable by the super block member stripesize, this hasn't been
      implemented and would require more changes, so we don't need to keep this
      code around until then.
      
      Partially based on a patch from Qu Wenruo.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff18a4af
    • Filipe Manana's avatar
      btrfs: remove the inode cache check at btrfs_is_free_space_inode() · 0201fceb
      Filipe Manana authored
      The inode cache feature was removed in kernel 5.11, and we no longer have
      any code that reads from or writes to inode caches. We may still mount a
      filesystem that has inode caches, but they are ignored.
      
      Remove the check for an inode cache from btrfs_is_free_space_inode(),
      since we no longer have code to trigger reads from an inode cache or
      writes to an inode cache. The check at send.c is still needed, because
      in case we find a filesystem with an inode cache, we must ignore it.
      Also leave the checks at tree-checker.c, as they are sanity checks.
      
      This eliminates a dead branch and reduces the amount of code since it's
      in an inline function.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620662	 189240	  29032	1838934	 1c0f56	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620502	 189240	  29032	1838774	 1c0eb6	fs/btrfs/btrfs.ko
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0201fceb
    • Nikolay Borisov's avatar
      btrfs: sysfs: remove BIG_METADATA feature files · 74860816
      Nikolay Borisov authored
      This flag was merged in 3.10 and is effectively always-on. Its
      status depends on the host page size so there's another way to guarantee
      compatibility with old kernels.
      
      Due to a bug introduced in 6f93e834 ("btrfs: fix upper limit for
      max_inline for page size 64K") the flag is not persisted among features
      in the superblock so it's not reliable.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      74860816