1. 08 Dec, 2020 40 commits
    • Josef Bacik's avatar
      btrfs: cleanup extent buffer readahead · bfb484d9
      Josef Bacik authored
      We're going to pass around more information when we allocate extent
      buffers, in order to make that cleaner how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have it's arguments
      adjusted.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfb484d9
    • Josef Bacik's avatar
      btrfs: remove lockdep classes for the fs tree · 416e3445
      Josef Bacik authored
      We have this weird problem where our lockdep class is set after we
      read a tree block, which can race with concurrent readers and result in
      erroneous lockdep errors.  We want to set the lockdep class at
      allocation time if possible, but in certain cases we may not have the
      actual root owner, such as with relocation or any backref lookups.  This
      is only really a problem for reference counted trees, because all other
      trees have their root reference set in their extent reference.  Remove
      the fs tree specific lock class.  We need to still keep the reloc tree
      one, it's still reference counted, because replace_path will lock the
      reloc tree and the destination tree, and if they're both set to
      tree-<level> we'll have issues.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      416e3445
    • Pavel Begunkov's avatar
      btrfs: discard: reschedule work after sysfs param update · 3e48d8d2
      Pavel Begunkov authored
      After sysfs updates discard's iops_limit or kbps_limit it also needs to
      adjust current timer through rescheduling, otherwise the discard work
      may wait for a long time for the previous timer to expire or bumped by
      someone else.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e48d8d2
    • Pavel Begunkov's avatar
      btrfs: don't miss async discards after scheduled work override · df903e5d
      Pavel Begunkov authored
      If btrfs_discard_schedule_work() is called with override=true, it sets
      delay anew regardless how much time is left until the timer should have
      fired. If delays are long (that can happen, for example, with low
      kbps_limit), they might get constantly overridden without having a
      chance to run the discard work.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      df903e5d
    • Pavel Begunkov's avatar
      btrfs: discard: store async discard delay as ns not as jiffies · 6e88f116
      Pavel Begunkov authored
      Most delay calculations are done in ns or ms, so store
      discard_ctl->delay in ms and convert the final delay to jiffies only at
      the end.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e88f116
    • Pavel Begunkov's avatar
      btrfs: discard: speed up async discard up to iops_limit · e50404a8
      Pavel Begunkov authored
      Instead of using iops_limit only for cutting off extremes, calculate the
      discard delay directly from it, so it closely follows iops_limit and
      doesn't under-discard even though quotas are not saturated.
      
      The iops limit could be hit more often in some cases and could increase
      the discard rate.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e50404a8
    • Qu Wenruo's avatar
      btrfs: scrub: refactor scrub_find_csum() · 480a8ec8
      Qu Wenruo authored
      Function scrub_find_csum() is to locate the csum for bytenr @logical
      from sctx->csum_list.
      
      However it lacks a lot of comments to explain things like how the
      csum_list is organized and why we need to drop csum range which is
      before us.
      
      Refactor the function by:
      
      - Add more comments explaining the behavior
      - Add comment explaining why we need to drop the csum range
      - Put the csum copy in the main loop
        This is mostly for the incoming patches to make scrub_find_csum() able
        to find multiple checksums.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      480a8ec8
    • Qu Wenruo's avatar
      btrfs: scrub: remove the force parameter from scrub_pages · 96e63a45
      Qu Wenruo authored
      The @force parameter for scrub_pages() is to indicate whether we want to
      force bio submission.  Currently it's only used for the super block,
      and it can be easily determined by the @flags, so we can remove the
      parameter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96e63a45
    • Qu Wenruo's avatar
      btrfs: scrub: distinguish scrub page from regular page · 261d2dcb
      Qu Wenruo authored
      There are several call sites where we declare something like
      "struct scrub_page *page".
      
      This is confusing as we also use regular page in this code,
      rename it to 'spage' where applicable.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      261d2dcb
    • Qu Wenruo's avatar
      btrfs: pass bvec to csum_dirty_buffer instead of page · ac303b69
      Qu Wenruo authored
      Currently csum_dirty_buffer() uses page to grab extent buffer, but that
      only works for sector size == PAGE_SIZE case.
      
      For subpage we need page + page_offset to grab extent buffer.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac303b69
    • Qu Wenruo's avatar
      btrfs: extract extent buffer verification from btrfs_validate_metadata_buffer() · 77bf40a2
      Qu Wenruo authored
      Currently btrfs_validate_metadata_buffer() only needs to handle one
      extent buffer as currently one page maps to at most one extent buffer.
      
      For incoming subpage support, we need to extend the support where one
      page could contain multiple extent buffers.
      
      Split the function so we can call validate_extent_buffer on extent
      buffers independently.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77bf40a2
    • Qu Wenruo's avatar
      btrfs: make csum_tree_block() handle node smaller than page · a26663e7
      Qu Wenruo authored
      For subpage size support, metadata blocks of nodesize are smaller than
      one page and this needs to be handled when calculating the checksum.
      
      The checksummed start and length need to be adjusted but only for the
      first page:
      
      - start is simply offset in the page
      
      - length is nodesize (subpage) or PAGE_SIZE for all other cases
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a26663e7
    • Qu Wenruo's avatar
      btrfs: grab fs_info from extent_buffer in btrfs_mark_buffer_dirty · 2f4d60df
      Qu Wenruo authored
      Since commit f28491e0 ("Btrfs: move the extent buffer radix tree into
      the fs_info"), fs_info can be grabbed from extent_buffer directly.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f4d60df
    • Qu Wenruo's avatar
      btrfs: make buffer_radix take sector size units · 478ef886
      Qu Wenruo authored
      For subpage sector size support, one page can contain multiple tree
      blocks. The entries cannot be based on page size and index must be
      derived from the sectorsize. No change for page size == sector size.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      478ef886
    • Qu Wenruo's avatar
      btrfs: assert page mapping lock in attach_extent_buffer_page · 0d01e247
      Qu Wenruo authored
      When calling attach_extent_buffer_page(), either we're attaching
      anonymous pages, called from btrfs_clone_extent_buffer(),
      or we're attaching btree inode pages, called from alloc_extent_buffer().
      
      For the latter case, we should hold page->mapping->private_lock to avoid
      parallel changes to page->private.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d01e247
    • Josef Bacik's avatar
      btrfs: protect fs_info->caching_block_groups by block_group_cache_lock · bbb86a37
      Josef Bacik authored
      I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #101 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/3445 is trying to acquire lock:
        ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_write+0x3d/0x70
      	 btrfs_cache_block_group+0x2d5/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&space_info->groups_sem);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/3445:
         #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      while testing another lockdep fix.  This happens because we're using the
      commit_root_sem to protect fs_info->caching_block_groups, which creates
      a dependency on the groups_sem -> commit_root_sem, which is problematic
      because we will allocate blocks while holding tree roots.  Fix this by
      making the list itself protected by the fs_info->block_group_cache_lock.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbb86a37
    • Josef Bacik's avatar
      btrfs: load free space cache asynchronously · e747853c
      Josef Bacik authored
      While documenting the usage of the commit_root_sem, I noticed that we do
      not actually take the commit_root_sem in the case of the free space
      cache.  This is problematic because we're supposed to hold that sem
      while we're reading the commit roots, which is what we do for the free
      space cache.
      
      The reason I did it inline when I originally wrote the code was because
      there's the case of unpinning where we need to make sure that the free
      space cache is loaded if we're going to use the free space cache.  But
      we can accomplish the same thing by simply waiting for the cache to be
      loaded.
      
      Rework this code to load the free space cache asynchronously.  This
      allows us to greatly cleanup the caching code because now it's all
      shared by the various caching methods.  We also are now in a position to
      have the commit_root semaphore held while we're loading the free space
      cache.  And finally our modification of ->last_byte_to_unpin is removed
      because it can be handled in the proper way on commit.
      
      Some care must be taken when replaying the log, when we expect that the
      free space cache will be read entirely before we start excluding space
      to replay. This could lead to overwriting space during replay.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e747853c
    • Josef Bacik's avatar
      btrfs: load the free space cache inode extents from commit root · 4d7240f0
      Josef Bacik authored
      Historically we've allowed recursive locking specifically for the free
      space inode.  This is because we are only doing reads and know that it's
      safe.  However we don't actually need this feature, we can get away with
      reading the commit root for the extents.  In fact if we want to allow
      asynchronous loading of the free space cache we have to use the commit
      root, otherwise we will deadlock.
      
      Switch to using the commit root for the file extents.  These are only
      read at load time, and are replaced as soon as we start writing the
      cache out to disk.  The cache is never read again, so this is
      legitimate.  This matches what we do for the inode itself, as we read
      that from the commit root as well.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d7240f0
    • Josef Bacik's avatar
      btrfs: load free space cache into a temporary ctl · cd79909b
      Josef Bacik authored
      The free space cache has been special in that we would load it right
      away instead of farming the work off to a worker thread.  This resulted
      in some weirdness that had to be taken into account for this fact,
      namely that if we every found a block group being cached the fast way we
      had to wait for it to finish, because we could get the cache before it
      had been validated and we may throw the cache away.
      
      To handle this particular case instead create a temporary
      btrfs_free_space_ctl to load the free space cache into.  Then once we've
      validated that it makes sense, copy it's contents into the actual
      block_group->free_space_ctl.  This allows us to avoid the problems of
      needing to wait for the caching to complete, we can clean up the discard
      extent handling stuff in __load_free_space_cache, and we no longer need
      to do the merge_space_tree() because the space is added one by one into
      the real free_space_ctl.  This will allow further reworks of how we
      handle loading the free space cache.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd79909b
    • Josef Bacik's avatar
      btrfs: cleanup btrfs_discard_update_discardable usage · 66b53bae
      Josef Bacik authored
      This passes in the block_group and the free_space_ctl, but we can get
      this from the block group itself.  Part of this is because we call it
      from __load_free_space_cache, which can be called for the inode cache as
      well.
      
      Move that call into the block group specific load section, wrap it in
      the right lock that we need for the assertion (but otherwise this is
      safe without the lock because this happens in single-thread context).
      
      Fix up the arguments to only take the block group.  Add a lockdep_assert
      as well for good measure to make sure we don't mess up the locking
      again.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66b53bae
    • Josef Bacik's avatar
      btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range · 2ca08c56
      Josef Bacik authored
      Currently unpin_extent_range happens in the transaction commit context,
      so we are protected from ->last_byte_to_unpin changing while we're
      unpinning, because any new transactions would have to wait for us to
      complete before modifying ->last_byte_to_unpin.
      
      However in the future we may want to change how this works, for instance
      with async unpinning or other such TODO items.  To prepare for that
      future explicitly protect ->last_byte_to_unpin with the commit_root_sem
      so we are sure it won't change while we're doing our work.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ca08c56
    • Josef Bacik's avatar
      btrfs: update last_byte_to_unpin in switch_commit_roots · 27d56e62
      Josef Bacik authored
      While writing an explanation for the need of the commit_root_sem for
      btrfs_prepare_extent_commit, I realized we have a slight hole that could
      result in leaked space if we have to do the old style caching.  Consider
      the following scenario
      
       commit root
       +----+----+----+----+----+----+----+
       |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
       new commit root
       +----+----+----+----+----+----+----+
       |    |    |    |\\\\|    |    |\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
      Prior to this patch, we run btrfs_prepare_extent_commit, which updates
      the last_byte_to_unpin, and then we subsequently run
      switch_commit_roots.  In this example lets assume that
      caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
      means that cache->last_byte_to_unpin == 1.  Then we go and do the
      switch_commit_roots(), but in the meantime the caching thread has made
      some more progress, because we drop the commit_root_sem and re-acquired
      it.  Now caching_ctl->progress == 3.  We swap out the commit root and
      carry on to unpin.
      
      The race can happen like:
      
        1) The caching thread was running using the old commit root when it
           found the extent for [2, 3);
      
        2) Then it released the commit_root_sem because it was in the last
           item of a leaf and the semaphore was contended, and set ->progress
           to 3 (value of 'last'), as the last extent item in the current leaf
           was for the extent for range [2, 3);
      
        3) Next time it gets the commit_root_sem, will start using the new
           commit root and search for a key with offset 3, so it never finds
           the hole for [2, 3).
      
        So the caching thread never saw [2, 3) as free space in any of the
        commit roots, and by the time finish_extent_commit() was called for
        the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
        subrange [0, 1) to the free space cache, skipping [2, 3).
      
      In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
      but do not unpin [2,3).  However because caching_ctl->progress == 3 we
      do not see the newly freed section of [2,3), and thus do not add it to
      our free space cache.  This results in us missing a chunk of free space
      in memory (on disk too, unless we have a power failure before writing
      the free space cache to disk).
      
      Fix this by making sure the ->last_byte_to_unpin is set at the same time
      that we swap the commit roots, this ensures that we will always be
      consistent.
      
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ update changelog with Filipe's review comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27d56e62
    • Josef Bacik's avatar
      btrfs: do not shorten unpin len for caching block groups · 9076dbd5
      Josef Bacik authored
      While fixing up our ->last_byte_to_unpin locking I noticed that we will
      shorten len based on ->last_byte_to_unpin if we're caching when we're
      adding back the free space.  This is correct for the free space, as we
      cannot unpin more than ->last_byte_to_unpin, however we use len to
      adjust the ->bytes_pinned counters and such, which need to track the
      actual pinned usage.  This could result in
      WARN_ON(space_info->bytes_pinned) triggering at unmount time.
      
      Fix this by using a local variable for the amount to add to free space
      cache, and leave len untouched in this case.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9076dbd5
    • David Sterba's avatar
      btrfs: reorder extent buffer members for better packing · dc516164
      David Sterba authored
      After the rwsem replaced the tree lock implementation, the extent buffer
      got smaller but leaving some holes behind. By changing log_index type
      and reordering, we can squeeze the size further to 240 bytes, measured on
      release config on x86_64. Log_index spans only 3 values and needs to be
      signed.
      
      Before:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              short int                  log_index;            /*   112     2 */
      
              /* XXX 6 bytes hole, try to pack */
      
              struct page *              pages[16];            /*   120   128 */
      
              /* size: 248, cachelines: 4, members: 14 */
              /* sum members: 239, holes: 2, sum holes: 9 */
              /* forced alignments: 1 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      
      After:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
              s8                         log_index;            /*    69     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              struct page *              pages[16];            /*   112   128 */
      
              /* size: 240, cachelines: 4, members: 14 */
              /* sum members: 238, holes: 1, sum holes: 2 */
              /* forced alignments: 1 */
              /* last cacheline: 48 bytes */
      } __attribute__((__aligned__(8)));
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc516164
    • Josef Bacik's avatar
      btrfs: locking: rip out path->leave_spinning · b9729ce0
      Josef Bacik authored
      We no longer distinguish between blocking and spinning, so rip out all
      this code.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9729ce0
    • Josef Bacik's avatar
      btrfs: locking: remove all the blocking helpers · ac5887c8
      Josef Bacik authored
      Now that we're using a rw_semaphore we no longer need to indicate if a
      lock is blocking or not, nor do we need to flip the entire path from
      blocking to spinning.  Remove these helpers and all the places they are
      called.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac5887c8
    • David Sterba's avatar
      btrfs: scrub: remove local copy of csum_size from context · 2ae0c2d8
      David Sterba authored
      The context structure unnecessarily stores copy of the checksum size,
      that can be now easily obtained from fs_info.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ae0c2d8
    • David Sterba's avatar
      btrfs: check integrity: remove local copy of csum_size · 419b791c
      David Sterba authored
      The state structure unnecessarily stores copy of the checksum size, that
      can be now easily obtained from fs_info.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      419b791c
    • David Sterba's avatar
      btrfs: remove unnecessary local variables for checksum size · 713cebfb
      David Sterba authored
      Remove local variable that is then used just once and replace it with
      fs_info::csum_size.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      713cebfb
    • David Sterba's avatar
      btrfs: switch cached fs_info::csum_size from u16 to u32 · 223486c2
      David Sterba authored
      The fs_info value is 32bit, switch also the local u16 variables. This
      leads to a better assembly code generated due to movzwl.
      
      This simple change will shave some bytes on x86_64 and release config:
      
         text    data     bss     dec     hex filename
      1090000   17980   14912 1122892  11224c pre/btrfs.ko
      1089794   17980   14912 1122686  11217e post/btrfs.ko
      
      DELTA: -206
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      223486c2
    • David Sterba's avatar
      btrfs: use cached value of fs_info::csum_size everywhere · 55fc29be
      David Sterba authored
      btrfs_get_16 shows up in the system performance profiles (helper to read
      16bit values from on-disk structures). This is partially because of the
      checksum size that's frequently read along with data reads/writes, other
      u16 uses are from item size or directory entries.
      
      Replace all calls to btrfs_super_csum_size by the cached value from
      fs_info.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55fc29be
    • David Sterba's avatar
      btrfs: precalculate checksums per leaf once · fe5ecbe8
      David Sterba authored
      btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
      candidate for optimizations. After the 64bit division has been replaced
      by shift, there's still a calculation done each time the function is
      called: checksums per leaf.
      
      As this is a constant value for the entire filesystem lifetime, we
      can calculate it once at mount time and reuse. This also allows to
      reduce the division to 64bit/32bit as we know the constant will always
      fit the 32bit type.
      
      Replace the open-coded rounding up with a macro that internally handles
      the 64bit division and as it's now a short function, make it static
      inline (slight code increase, slight stack usage reduction).
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe5ecbe8
    • David Sterba's avatar
      btrfs: store precalculated csum_size in fs_info · 22b6331d
      David Sterba authored
      In many places we need the checksum size and it is inefficient to read
      it from the raw superblock. Store the value into fs_info, actual use
      will be in followup patches.  The size is u32 as it allows to generate
      better assembly than with u16.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      22b6331d
    • David Sterba's avatar
      btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits · 265fdfa6
      David Sterba authored
      The value of super_block::s_blocksize_bits is the same as
      fs_info::sectorsize_bits, but we don't need to do the extra dereferences
      in many functions and storing the bits as u32 (in fs_info) generates
      shorter assembly.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      265fdfa6
    • David Sterba's avatar
      btrfs: replace div_u64 by shift in free_space_bitmap_size · 098e6308
      David Sterba authored
      Change free_space_bitmap_size to take btrfs_fs_info so we can get the
      sectorsize_bits to do calculations.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      098e6308
    • David Sterba's avatar
      btrfs: use precalculated sectorsize_bits from fs_info · ab108d99
      David Sterba authored
      We do a lot of calculations where we divide or multiply by sectorsize.
      We also know and make sure that sectorsize is a power of two, so this
      means all divisions can be turned to shifts and avoid eg. expensive
      u64/u32 divisions.
      
      The type is u32 as it's more register friendly on x86_64 compared to u8
      and the resulting assembly is smaller (movzbl vs movl).
      
      There's also superblock s_blocksize_bits but it's usually one more
      pointer dereference farther than fs_info.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab108d99
    • Qu Wenruo's avatar
      btrfs: rename page_size to io_size in submit_extent_page · e940e9a7
      Qu Wenruo authored
      The variable @page_size in submit_extent_page() is not related to page
      size.
      
      It can already be smaller than PAGE_SIZE, so rename it to io_size to
      reduce confusion, this is especially important for later subpage
      support.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e940e9a7
    • Qu Wenruo's avatar
      btrfs: only require sector size alignment for page read · 8b8bbd46
      Qu Wenruo authored
      If we're reading partial page, btrfs will warn about this as read/write
      is always done in sector size, which now equals page size.
      
      But for the upcoming subpage read-only support, our data read is only
      aligned to sectorsize, which can be smaller than page size.
      
      Thus here we change the warning condition to check it against
      sectorsize, the behavior is not changed for regular sectorsize ==
      PAGE_SIZE case, and won't report error for subpage read.
      
      Also, pass the proper start/end with bv_offset for check_data_csum() to
      handle.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8b8bbd46
    • Qu Wenruo's avatar
      btrfs: rename pages_locked in process_pages_contig() · 12e3360f
      Qu Wenruo authored
      Function process_pages_contig() does not only handle page locking but
      also other operations.  Rename the local variable pages_locked to
      pages_processed to reduce confusion.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12e3360f
    • Qu Wenruo's avatar
      btrfs: sink parameter start and len to check_data_csum · 265d4ac0
      Qu Wenruo authored
      For check_data_csum(), the page we're using is directly from the inode
      mapping, thus it has valid page_offset().
      
      We can use (page_offset() + pg_off) to replace @start parameter
      completely, while the @len should always be sectorsize.
      
      Since we're here, also add some comment, as there are quite some
      confusion in words like start/offset, without explaining whether it's
      file_offset or logical bytenr.
      
      This should not affect the existing behavior, as for current sectorsize
      == PAGE_SIZE case, @pgoff should always be 0, and len is always
      PAGE_SIZE (or sectorsize from the dio read path).
      Reviewed-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      265d4ac0