1. 14 Sep, 2023 2 commits
    • Filipe Manana's avatar
      btrfs: refresh dir last index during a rewinddir(3) call · e60aa5da
      Filipe Manana authored
      When opening a directory we find what's the index of its last entry and
      then store it in the directory's file handle private data (struct
      btrfs_file_private::last_index), so that in the case new directory entries
      are added to a directory after an opendir(3) call we don't end up in an
      infinite loop (see commit 9b378f6a ("btrfs: fix infinite directory
      reads")) when calling readdir(3).
      
      However once rewinddir(3) is called, POSIX states [1] that any new
      directory entries added after the previous opendir(3) call, must be
      returned by subsequent calls to readdir(3):
      
        "The rewinddir() function shall reset the position of the directory
         stream to which dirp refers to the beginning of the directory.
         It shall also cause the directory stream to refer to the current
         state of the corresponding directory, as a call to opendir() would
         have done."
      
      We currently don't refresh the last_index field of the struct
      btrfs_file_private associated to the directory, so after a rewinddir(3)
      we are not returning any new entries added after the opendir(3) call.
      
      Fix this by finding the current last index of the directory when llseek
      is called against the directory.
      
      This can be reproduced by the following C program provided by Ian Johnson:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           rewinddir(dir);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      Reported-by: default avatarIan Johnson <ian@ianjohnson.dev>
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      Fixes: 9b378f6a ("btrfs: fix infinite directory reads")
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e60aa5da
    • Filipe Manana's avatar
      btrfs: set last dir index to the current last index when opening dir · 35795036
      Filipe Manana authored
      When opening a directory for reading it, we set the last index where we
      stop iteration to the value in struct btrfs_inode::index_cnt. That value
      does not match the index of the most recently added directory entry but
      it's instead the index number that will be assigned the next directory
      entry.
      
      This means that if after the call to opendir(3) new directory entries are
      added, a readdir(3) call will return the first new directory entry. This
      is fine because POSIX says the following [1]:
      
        "If a file is removed from or added to the directory after the most
         recent call to opendir() or rewinddir(), whether a subsequent call to
         readdir() returns an entry for that file is unspecified."
      
      For example for the test script from commit 9b378f6a ("btrfs: fix
      infinite directory reads"), where we have 2000 files in a directory, ext4
      doesn't return any new directory entry after opendir(3), while xfs returns
      the first 13 new directory entries added after the opendir(3) call.
      
      If we move to a shorter example with an empty directory when opendir(3) is
      called, and 2 files added to the directory after the opendir(3) call, then
      readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
      files (but in a different order). A test program for this, reported by
      Ian Johnson, is the following:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      
      To make this less odd, change the behaviour to never return new entries
      that were added after the opendir(3) call. This is done by setting the
      last_index field of the struct btrfs_file_private attached to the
      directory's file handle with a value matching btrfs_inode::index_cnt
      minus 1, since that value always matches the index of the next new
      directory entry and not the index of the most recently added entry.
      
      [1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html
      
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35795036
  2. 13 Sep, 2023 3 commits
    • Josef Bacik's avatar
      btrfs: don't clear uptodate on write errors · b595d259
      Josef Bacik authored
      We have been consistently seeing hangs with generic/648 in our subpage
      GitHub CI setup.  This is a classic deadlock, we are calling
      btrfs_read_folio() on a folio, which requires holding the folio lock on
      the folio, and then finding a ordered extent that overlaps that range
      and calling btrfs_start_ordered_extent(), which then tries to write out
      the dirty page, which requires taking the folio lock and then we
      deadlock.
      
      The hang happens because we're writing to range [1271750656, 1271767040),
      page index [77621, 77622], and page 77621 is !Uptodate.  It is also Dirty,
      so we call btrfs_read_folio() for 77621 and which does
      btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
      extent which is [1271644160, 1271746560), page index [77615, 77621].
      The page indexes overlap, but the actual bytes don't overlap.  We're
      holding the page lock for 77621, then call
      btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
      page, and tries to lock 77621 again and then we deadlock.
      
      The byte ranges do not overlap, but with subpage support if we clear
      uptodate on any portion of the page we mark the entire thing as not
      uptodate.
      
      We have been clearing page uptodate on write errors, but no other file
      system does this, and is in fact incorrect.  This doesn't hurt us in the
      !subpage case because we can't end up with overlapped ranges that don't
      also overlap on the page.
      
      Fix this by not clearing uptodate when we have a write error.  The only
      thing we should be doing in this case is setting the mapping error and
      carrying on.  This makes it so we would no longer call
      btrfs_read_folio() on the page as it's uptodate and eliminates the
      deadlock.
      
      With this patch we're now able to make it through a full fstests run on
      our subpage blocksize VMs.
      
      Note for stable backports: this probably goes beyond 6.1 but the code
      has been cleaned up and clearing the uptodate bit must be verified on
      each version independently.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b595d259
    • Bernd Schubert's avatar
      btrfs: file_remove_privs needs an exclusive lock in direct io write · 9af86694
      Bernd Schubert authored
      This was noticed by Miklos that file_remove_privs might call into
      notify_change(), which requires to hold an exclusive lock. The problem
      exists in FUSE and btrfs. We can fix it without any additional helpers
      from VFS, in case the privileges would need to be dropped, change the
      lock type to be exclusive and redo the loop.
      
      Fixes: e9adabb9 ("btrfs: use shared lock for direct writes within EOF")
      CC: Miklos Szeredi <miklos@szeredi.hu>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBernd Schubert <bschubert@ddn.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9af86694
    • Matthew Wilcox (Oracle)'s avatar
      btrfs: convert btrfs_read_merkle_tree_page() to use a folio · 06ed0935
      Matthew Wilcox (Oracle) authored
      Remove a number of hidden calls to compound_head() by using a folio
      throughout.  Also follow core kernel coding style by adding the folio to
      the page cache immediately after allocation instead of doing the read
      first, then adding it to the page cache.  This ordering makes subsequent
      readers block waiting for the first reader instead of duplicating the
      work only to throw it away when they find out they lost the race.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06ed0935
  3. 08 Sep, 2023 10 commits
    • Bhaskar Chowdhury's avatar
      MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org · 5facccc9
      Bhaskar Chowdhury authored
      The wiki has been archived and is not updated anymore. Remove or replace
      the links in files that contain it (MAINTAINERS, Kconfig, docs).
      Signed-off-by: default avatarBhaskar Chowdhury <unixbhaskar@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5facccc9
    • Filipe Manana's avatar
      btrfs: assert delayed node locked when removing delayed item · a57c2d4e
      Filipe Manana authored
      When removing a delayed item, or releasing which will remove it as well,
      we will modify one of the delayed node's rbtrees and item counter if the
      delayed item is in one of the rbtrees. This require having the delayed
      node's mutex locked, otherwise we will race with other tasks modifying
      the rbtrees and the counter.
      
      This is motivated by a previous version of another patch actually calling
      btrfs_release_delayed_item() after unlocking the delayed node's mutex and
      against a delayed item that is in a rbtree.
      
      So assert at __btrfs_remove_delayed_item() that the delayed node's mutex
      is locked.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a57c2d4e
    • Filipe Manana's avatar
      btrfs: remove BUG() after failure to insert delayed dir index item · 2c58c393
      Filipe Manana authored
      Instead of calling BUG() when we fail to insert a delayed dir index item
      into the delayed node's tree, we can just release all the resources we
      have allocated/acquired before and return the error to the caller. This is
      fine because all existing call chains undo anything they have done before
      calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending
      snapshots in the transaction commit path).
      
      So remove the BUG() call and do proper error handling.
      
      This relates to a syzbot report linked below, but does not fix it because
      it only prevents hitting a BUG(), it does not fix the issue where somehow
      we attempt to use twice the same index number for different index items.
      
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c58c393
    • Filipe Manana's avatar
      btrfs: improve error message after failure to add delayed dir index item · 91bfe310
      Filipe Manana authored
      If we fail to add a delayed dir index item because there's already another
      item with the same index number, we print an error message (and then BUG).
      However that message isn't very helpful to debug anything because we don't
      know what's the index number and what are the values of index counters in
      the inode and its delayed inode (index_cnt fields of struct btrfs_inode
      and struct btrfs_delayed_node).
      
      So update the error message to include the index number and counters.
      
      We actually had a recent case where this issue was hit by a syzbot report
      (see the link below).
      
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91bfe310
    • Qu Wenruo's avatar
      btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio · 5e0e8799
      Qu Wenruo authored
      [BUG]
      After commit 72a69cd0 ("btrfs: subpage: pack all subpage bitmaps
      into a larger bitmap"), the DEBUG section of btree_dirty_folio() would
      no longer compile.
      
      [CAUSE]
      If DEBUG is defined, we would do extra checks for btree_dirty_folio(),
      mostly to make sure the range we marked dirty has an extent buffer and
      that extent buffer is dirty.
      
      For subpage, we need to iterate through all the extent buffers covered
      by that page range, and make sure they all matches the criteria.
      
      However commit 72a69cd0 ("btrfs: subpage: pack all subpage bitmaps
      into a larger bitmap") changes how we store the bitmap, we pack all the
      16 bits bitmaps into a larger bitmap, which would save some space.
      
      This means we no longer have btrfs_subpage::dirty_bitmap, instead the
      dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a
      length of btrfs_subpage_info::bitmap_nr_bits.
      
      [FIX]
      Although I'm not sure if it still makes sense to maintain such code, at
      least let it compile.
      
      This patch would let us test the bits one by one through the bitmaps.
      
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e0e8799
    • Josef Bacik's avatar
      btrfs: check for BTRFS_FS_ERROR in pending ordered assert · 4ca8e03c
      Josef Bacik authored
      If we do fast tree logging we increment a counter on the current
      transaction for every ordered extent we need to wait for.  This means we
      expect the transaction to still be there when we clear pending on the
      ordered extent.  However if we happen to abort the transaction and clean
      it up, there could be no running transaction, and thus we'll trip the
      "ASSERT(trans)" check.  This is obviously incorrect, and the code
      properly deals with the case that the transaction doesn't exist.  Fix
      this ASSERT() to only fire if there's no trans and we don't have
      BTRFS_FS_ERROR() set on the file system.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ca8e03c
    • Filipe Manana's avatar
      btrfs: fix lockdep splat and potential deadlock after failure running delayed items · e110f891
      Filipe Manana authored
      When running delayed items we are holding a delayed node's mutex and then
      we will attempt to modify a subvolume btree to insert/update/delete the
      delayed items. However if have an error during the insertions for example,
      btrfs_insert_delayed_items() may return with a path that has locked extent
      buffers (a leaf at the very least), and then we attempt to release the
      delayed node at __btrfs_run_delayed_items(), which requires taking the
      delayed node's mutex, causing an ABBA type of deadlock. This was reported
      by syzbot and the lockdep splat is the following:
      
        WARNING: possible circular locking dependency detected
        6.5.0-rc7-syzkaller-00024-g93f5de5f #0 Not tainted
        ------------------------------------------------------
        syz-executor.2/13257 is trying to acquire lock:
        ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
      
        but task is already holding lock:
        ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
               __lock_release kernel/locking/lockdep.c:5475 [inline]
               lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781
               up_write+0x79/0x580 kernel/locking/rwsem.c:1625
               btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
               btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239
               search_leaf fs/btrfs/ctree.c:1986 [inline]
               btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230
               btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376
               btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline]
               btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
               __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111
               __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153
               flush_space+0x269/0xe70 fs/btrfs/space-info.c:723
               btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078
               process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600
               worker_thread+0xa63/0x1210 kernel/workqueue.c:2751
               kthread+0x2b8/0x350 kernel/kthread.c:389
               ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145
               ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
               check_prev_add kernel/locking/lockdep.c:3142 [inline]
               check_prevs_add kernel/locking/lockdep.c:3261 [inline]
               validate_chain kernel/locking/lockdep.c:3876 [inline]
               __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
               lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
               __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603
               __mutex_lock kernel/locking/mutex.c:747 [inline]
               mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799
               __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
               btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
               __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156
               btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276
               btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988
               vfs_fsync_range fs/sync.c:188 [inline]
               vfs_fsync fs/sync.c:202 [inline]
               do_fsync fs/sync.c:212 [inline]
               __do_sys_fsync fs/sync.c:220 [inline]
               __se_sys_fsync fs/sync.c:218 [inline]
               __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(btrfs-tree-00);
                                       lock(&delayed_node->mutex);
                                       lock(btrfs-tree-00);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by syz-executor.2/13257:
         #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline]
         #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287
         #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288
         #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
      
        stack backtrace:
        CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
         check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
         check_prev_add kernel/locking/lockdep.c:3142 [inline]
         check_prevs_add kernel/locking/lockdep.c:3261 [inline]
         validate_chain kernel/locking/lockdep.c:3876 [inline]
         __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
         lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
         __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603
         __mutex_lock kernel/locking/mutex.c:747 [inline]
         mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799
         __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
         btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
         __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156
         btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276
         btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988
         vfs_fsync_range fs/sync.c:188 [inline]
         vfs_fsync fs/sync.c:202 [inline]
         do_fsync fs/sync.c:212 [inline]
         __do_sys_fsync fs/sync.c:220 [inline]
         __se_sys_fsync fs/sync.c:218 [inline]
         __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7f3ad047cae9
        Code: 28 00 00 00 75 (...)
        RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
        RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8
         </TASK>
        ------------[ cut here ]------------
      
      Fix this by releasing the path before releasing the delayed node in the
      error path at __btrfs_run_delayed_items().
      
      Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e110f891
    • Josef Bacik's avatar
      btrfs: do not block starts waiting on previous transaction commit · 77d20c68
      Josef Bacik authored
      Internally I got a report of very long stalls on normal operations like
      creating a new file when auto relocation was running.  The reporter used
      the 'bpf offcputime' tracer to show that we would get stuck in
      start_transaction for 5 to 30 seconds, and were always being woken up by
      the transaction commit.
      
      Using my timing-everything script, which times how long a function takes
      and what percentage of that total time is taken up by its children, I
      saw several traces like this
      
      1083 took 32812902424 ns
              29929002926 ns 91.2110% wait_for_commit_duration
              25568 ns 7.7920e-05% commit_fs_roots_duration
              1007751 ns 0.00307% commit_cowonly_roots_duration
              446855602 ns 1.36182% btrfs_run_delayed_refs_duration
              271980 ns 0.00082% btrfs_run_delayed_items_duration
              2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
              9656 ns 2.9427e-05% switch_commit_roots_duration
              1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
              4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration
      
      Here I was only tracing functions that happen where we are between
      START_COMMIT and UNBLOCKED in order to see what would be keeping us
      blocked for so long.  The wait_for_commit() we do is where we wait for a
      previous transaction that hasn't completed it's commit.  This can
      include all of the unpin work and other cleanups, which tends to be the
      longest part of our transaction commit.
      
      There is no reason we should be blocking new things from entering the
      transaction at this point, it just adds to random latency spikes for no
      reason.
      
      Fix this by adding a PREP stage.  This allows us to properly deal with
      multiple committers coming in at the same time, we retain the behavior
      that the winner waits on the previous transaction and the losers all
      wait for this transaction commit to occur.  Nothing else is blocked
      during the PREP stage, and then once the wait is complete we switch to
      COMMIT_START and all of the same behavior as before is maintained.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77d20c68
    • Filipe Manana's avatar
      btrfs: release path before inode lookup during the ino lookup ioctl · ee34a82e
      Filipe Manana authored
      During the ino lookup ioctl we can end up calling btrfs_iget() to get an
      inode reference while we are holding on a root's btree. If btrfs_iget()
      needs to lookup the inode from the root's btree, because it's not
      currently loaded in memory, then it will need to lock another or the
      same path in the same root btree. This may result in a deadlock and
      trigger the following lockdep splat:
      
        WARNING: possible circular locking dependency detected
        6.5.0-rc7-syzkaller-00004-gf7757129 #0 Not tainted
        ------------------------------------------------------
        syz-executor277/5012 is trying to acquire lock:
        ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        but task is already holding lock:
        ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
               down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
               __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
               btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302
               btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955
               btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline]
               btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338
               btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline]
               open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494
               btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154
               btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519
               legacy_get_tree+0xef/0x190 fs/fs_context.c:611
               vfs_get_tree+0x8c/0x270 fs/super.c:1519
               fc_mount fs/namespace.c:1112 [inline]
               vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142
               btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579
               legacy_get_tree+0xef/0x190 fs/fs_context.c:611
               vfs_get_tree+0x8c/0x270 fs/super.c:1519
               do_new_mount+0x28f/0xae0 fs/namespace.c:3335
               do_mount fs/namespace.c:3675 [inline]
               __do_sys_mount fs/namespace.c:3884 [inline]
               __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-tree-01){++++}-{3:3}:
               check_prev_add kernel/locking/lockdep.c:3142 [inline]
               check_prevs_add kernel/locking/lockdep.c:3261 [inline]
               validate_chain kernel/locking/lockdep.c:3876 [inline]
               __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
               lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
               down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
               __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
               btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
               btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
               btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
               btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
               btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
               btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
               btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
               btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
               btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
               btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
               vfs_ioctl fs/ioctl.c:51 [inline]
               __do_sys_ioctl fs/ioctl.c:870 [inline]
               __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
               do_syscall_x64 arch/x86/entry/common.c:50 [inline]
               do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          rlock(btrfs-tree-00);
                                       lock(btrfs-tree-01);
                                       lock(btrfs-tree-00);
          rlock(btrfs-tree-01);
      
         *** DEADLOCK ***
      
        1 lock held by syz-executor277/5012:
         #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
      
        stack backtrace:
        CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
         check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
         check_prev_add kernel/locking/lockdep.c:3142 [inline]
         check_prevs_add kernel/locking/lockdep.c:3261 [inline]
         validate_chain kernel/locking/lockdep.c:3876 [inline]
         __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
         lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
         down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
         __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
         btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
         btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
         btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
         btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
         btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
         btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
         btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
         btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
         btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:870 [inline]
         __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7f0bec94ea39
      
      Fix this simply by releasing the path before calling btrfs_iget() as at
      point we don't need the path anymore.
      
      Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/
      Fixes: 23d0b79d ("btrfs: Add unprivileged version of ino_lookup ioctl")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ee34a82e
    • Filipe Manana's avatar
      btrfs: fix race between finishing block group creation and its item update · 2d6cd791
      Filipe Manana authored
      Commit 675dfe12 ("btrfs: fix block group item corruption after
      inserting new block group") fixed one race that resulted in not persisting
      a block group's item when its "used" bytes field decreases to zero.
      However there's another race that can happen in a much shorter time window
      that results in the same problem. The following sequence of steps explains
      how it can happen:
      
      1) Task A creates a metadata block group X, its "used" and "commit_used"
         fields are initialized to 0;
      
      2) Two extents are allocated from block group X, so its "used" field is
         updated to 32K, and its "commit_used" field remains as 0;
      
      3) Transaction commit starts, by some task B, and it enters
         btrfs_start_dirty_block_groups(). There it tries to update the block
         group item for block group X, which currently has its "used" field with
         a value of 32K and its "commit_used" field with a value of 0. However
         that fails since the block group item was not yet inserted, so at
         update_block_group_item(), the btrfs_search_slot() call returns 1, and
         then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
      
      4) The block group item is inserted by task A, when for example
         btrfs_create_pending_block_groups() is called when releasing its
         transaction handle. This results in insert_block_group_item() inserting
         the block group item in the extent tree (or block group tree), with a
         "used" field having a value of 32K and setting "commit_used", in struct
         btrfs_block_group, to the same value (32K);
      
      5) Task B jumps to the 'fail' label and then resets the "commit_used"
         field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
         returned from update_block_group_item(), we add the block group again
         to the list of dirty block groups, so that we will try again in the
         critical section of the transaction commit when calling
         btrfs_write_dirty_block_groups();
      
      6) Later the two extents from block group X are freed, so its "used" field
         becomes 0;
      
      7) If no more extents are allocated from block group X before we get into
         btrfs_write_dirty_block_groups(), then when we call
         update_block_group_item() again for block group X, we will not update
         the block group item to reflect that it has 0 bytes used, because the
         "used" and "commit_used" fields in struct btrfs_block_group have the
         same value, a value of 0.
      
         As a result after committing the transaction we have an empty block
         group with its block group item having a 32K value for its "used" field.
         This will trigger errors from fsck ("btrfs check" command) and after
         mounting again the fs, the cleaner kthread will not automatically delete
         the empty block group, since its "used" field is not 0. Possibly there
         are other issues due to this inconsistency.
      
         When this issue happens, the error reported by fsck is like this:
      
           [1/7] checking root items
           [2/7] checking extents
           block group [1104150528 1073741824] used 39796736 but extent items used 0
           ERROR: errors found in extent allocation tree or chunk allocation
           (...)
      
      So fix this by not resetting the "commit_used" field of a block group when
      we don't find the block group item at update_block_group_item().
      
      Fixes: 7248e0ce ("btrfs: skip update of block group item if used bytes are the same")
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d6cd791
  4. 22 Aug, 2023 1 commit
    • Naohiro Aota's avatar
      btrfs: zoned: skip splitting and logical rewriting on pre-alloc write · c02d35d8
      Naohiro Aota authored
      When doing a relocation, there is a chance that at the time of
      btrfs_reloc_clone_csums(), there is no checksum for the corresponding
      region.
      
      In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item
      and so ordered_extent's logical is set to some invalid value. Then,
      btrfs_lookup_block_group() in btrfs_zone_finish_endio() failed to find a
      block group and will hit an assert or a null pointer dereference as
      following.
      
      This can be reprodcued by running btrfs/028 several times (e.g, 4 to 16
      times) with a null_blk setup. The device's zone size and capacity is set to
      32 MB and the storage size is set to 5 GB on my setup.
      
          KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
          CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G        W          6.5.0-rc6-kts+ #1
          Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
          Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
          RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
          Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00
          > 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00
          RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206
          RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
          RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
          RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827
          R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000
          R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000
          FS:  0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0
          Call Trace:
           <TASK>
           ? die_addr+0x3c/0xa0
           ? exc_general_protection+0x148/0x220
           ? asm_exc_general_protection+0x22/0x30
           ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
           ? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs]
           btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs]
           ? rcu_is_watching+0x11/0xb0
           ? lock_release+0x47a/0x620
           ? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs]
           ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
           ? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs]
           ? __smp_call_single_queue+0x124/0x350
           ? rcu_is_watching+0x11/0xb0
           btrfs_work_helper+0x19f/0xc60 [btrfs]
           ? __pfx_try_to_wake_up+0x10/0x10
           ? _raw_spin_unlock_irq+0x24/0x50
           ? rcu_is_watching+0x11/0xb0
           process_one_work+0x8c1/0x1430
           ? __pfx_lock_acquire+0x10/0x10
           ? __pfx_process_one_work+0x10/0x10
           ? __pfx_do_raw_spin_lock+0x10/0x10
           ? _raw_spin_lock_irq+0x52/0x60
           worker_thread+0x100/0x12c0
           ? __kthread_parkme+0xc1/0x1f0
           ? __pfx_worker_thread+0x10/0x10
           kthread+0x2ea/0x3c0
           ? __pfx_kthread+0x10/0x10
           ret_from_fork+0x30/0x70
           ? __pfx_kthread+0x10/0x10
           ret_from_fork_asm+0x1b/0x30
           </TASK>
      
      On the zoned mode, writing to pre-allocated region means data relocation
      write. Such write always uses WRITE command so there is no need of splitting
      and rewriting logical address. Thus, we can just skip the function for the
      case.
      
      Fixes: cbfce4c7 ("btrfs: optimize the logical to physical mapping for zoned writes")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c02d35d8
  5. 21 Aug, 2023 24 commits
    • Josef Bacik's avatar
      btrfs: tests: test invalid splitting when skipping pinned drop extent_map · 92e1229b
      Josef Bacik authored
      This reproduces the bug fixed by "btrfs: fix incorrect splitting in
      btrfs_drop_extent_map_range", we were improperly calculating the range
      for the split extent.  Add a test that exercises this scenario and
      validates that we get the correct resulting extent_maps in our tree.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92e1229b
    • Josef Bacik's avatar
      btrfs: tests: add a test for btrfs_add_extent_mapping · f345dbdf
      Josef Bacik authored
      This helper is different from the normal add_extent_mapping in that it
      will stuff an em into a gap that exists between overlapping em's in the
      tree.  It appeared there was a bug so I wrote a self test to validate it
      did the correct thing when it worked with two side by side ems.
      Thankfully it is correct, but more testing is better.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f345dbdf
    • Josef Bacik's avatar
      btrfs: tests: add extent_map tests for dropping with odd layouts · 89c37604
      Josef Bacik authored
      While investigating weird problems with the extent_map I wrote a self
      test testing the various edge cases of btrfs_drop_extent_map_range.
      This can split in different ways and behaves different in each case, so
      test the various edge cases to make sure everything is functioning
      properly.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      89c37604
    • Qu Wenruo's avatar
      btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() · 4fe44f9d
      Qu Wenruo authored
      Currently the scrub_stripe_read_repair_worker() only does reads to
      rebuild the corrupted sectors, it doesn't do any writeback.
      
      The design is mostly to put writeback into a more ordered manner, to
      co-operate with dev-replace with zoned mode, which requires every write
      to be submitted in their bytenr order.
      
      However the writeback for repaired sectors into the original mirror
      doesn't need such strong sync requirement, as it can only happen for
      non-zoned devices.
      
      This patch would move the writeback for repaired sectors into
      scrub_stripe_read_repair_worker(), which removes two calls sites for
      repaired sectors writeback. (one from flush_scrub_stripes(), one from
      scrub_raid56_parity_stripe())
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fe44f9d
    • Qu Wenruo's avatar
      btrfs: scrub: don't go ordered workqueue for dev-replace · 39dc7bd9
      Qu Wenruo authored
      The workqueue fs_info->scrub_worker would go ordered workqueue if it's a
      device replace operation.
      
      However the scrub is relying on multiple workers to do data csum
      verification, and we always submit several read requests in a row.
      
      Thus there is no need to use ordered workqueue just for dev-replace.
      We have extra synchronization (the main thread will always
      submit-and-wait for dev-replace writes) to handle it for zoned devices.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39dc7bd9
    • Qu Wenruo's avatar
      btrfs: scrub: fix grouping of read IO · ae76d8e3
      Qu Wenruo authored
      [REGRESSION]
      There are several regression reports about the scrub performance with
      v6.4 kernel.
      
      On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
      v6.4 can only go 1GB/s, an obvious 66% performance drop.
      
      [CAUSE]
      Iostat shows a very different behavior between v6.3 and v6.4 kernel:
      
        Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
        nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
        nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00
      
      The upper one is v6.3 while the lower one is v6.4.
      
      There are several obvious differences:
      
      - Very few read merges
        This turns out to be a behavior change that we no longer do bio
        plug/unplug.
      
      - Very low aqu-sz
        This is due to the submit-and-wait behavior of flush_scrub_stripes(),
        and extra extent/csum tree search.
      
      Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
      to merge the reads, while SATA SSDs can not handle high queue depth well
      either.
      
      [FIX]
      For now this patch focuses on the read speed fix. Dev-replace replace
      speed needs more work.
      
      For the read part, we go two directions to fix the problems:
      
      - Re-introduce blk plug/unplug to merge read requests
        This is pretty simple, and the behavior is pretty easy to observe.
      
        This would enlarge the average read request size to 512K.
      
      - Introduce multi-group reads and no longer wait for each group
        Instead of the old behavior, which submits 8 stripes and waits for
        them, here we would enlarge the total number of stripes to 16 * 8.
        Which is 8M per device, the same limit as the old scrub in-flight
        bios size limit.
      
        Now every time we fill a group (8 stripes), we submit them and
        continue to next stripes.
      
        Only when the full 16 * 8 stripes are all filled, we submit the
        remaining ones (the last group), and wait for all groups to finish.
        Then submit the repair writes and dev-replace writes.
      
        This should enlarge the queue depth.
      
      This would greatly improve the merge rate (thus read block size) and
      queue depth:
      
      Before (with regression, and cached extent/csum path):
      
       Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
       nvme0n1p3 20666.00 1318240.00    10.00   0.05    0.08    63.79   1.63 100.00
      
      After (with all patches applied):
      
       nvme0n1p3  5165.00 2278304.00 30557.00  85.54    0.55   441.10   2.81 100.00
      
      i.e. 1287 to 2224 MB/s.
      
      CC: stable@vger.kernel.org # 6.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae76d8e3
    • Qu Wenruo's avatar
      btrfs: scrub: avoid unnecessary csum tree search preparing stripes · 3c771c19
      Qu Wenruo authored
      One of the bottleneck of the new scrub code is the extra csum tree
      search.
      
      The old code would only do the csum tree search for each scrub bio,
      which can be as large as 512KiB, thus they can afford to allocate a new
      path each time.
      
      But the new scrub code is doing csum tree search for each stripe, which
      is only 64KiB, this means we'd better re-use the same csum path during
      each search.
      
      This patch would introduce a per-sctx path for csum tree search, as we
      don't need to re-allocate the path every time we need to do a csum tree
      search.
      
      With this change we can further improve the queue depth and improve the
      scrub read performance:
      
      Before (with regression and cached extent tree path):
      
       Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
       nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00
      
      After (with both cached extent/csum tree path):
      
       nvme0n1p3 17759.00 1133280.00    10.00   0.06    0.08    63.81   1.50 100.00
      
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      CC: stable@vger.kernel.org # 6.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c771c19
    • Qu Wenruo's avatar
      btrfs: scrub: avoid unnecessary extent tree search preparing stripes · 1dc4888e
      Qu Wenruo authored
      Since commit e02ee89b ("btrfs: scrub: switch scrub_simple_mirror()
      to scrub_stripe infrastructure"), scrub no longer re-use the same path
      for extent tree search.
      
      This can lead to unnecessary extent tree search, especially for the new
      stripe based scrub, as we have way more stripes to prepare.
      
      This patch would re-introduce a shared path for extent tree search, and
      properly release it when the block group is scrubbed.
      
      This change alone can improve scrub performance slightly by reducing the
      time spend preparing the stripe thus improving the queue depth.
      
      Before (with regression):
      
       Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
       nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00
      
      After (with this patch):
      
       nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00
      
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      CC: stable@vger.kernel.org # 6.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1dc4888e
    • Lee Trager's avatar
      btrfs: copy dir permission and time when creating a stub subvolume · 94628ad9
      Lee Trager authored
      btrfs supports creating nested subvolumes however snapshots are not
      recursive.  When a snapshot is taken of a volume which contains a
      subvolume the subvolume is replaced with a stub subvolume which has the
      same name and uses inode number 2[1]. The stub subvolume kept the
      directory name but did not set the time or permissions of the stub
      subvolume. This resulted in all time information being the current time
      and ownership defaulting to root. When subvolumes and snapshots are
      created using unshare this results in a snapshot directory the user
      created but has no permissions for.
      
      Test case:
      
        [vmuser@archvm ~]# sudo -i
        [root@archvm ~]# mkdir -p /mnt/btrfs/test
        [root@archvm ~]# chown vmuser:users /mnt/btrfs/test/
        [root@archvm ~]# exit
        logout
        [vmuser@archvm ~]$ cd /mnt/btrfs/test
        [vmuser@archvm test]$ unshare --user --keep-caps --map-auto --map-root-user
        [root@archvm test]# btrfs subvolume create subvolume
        Create subvolume './subvolume'
        [root@archvm test]# btrfs subvolume create subvolume/subsubvolume
        Create subvolume 'subvolume/subsubvolume'
        [root@archvm test]# btrfs subvolume snapshot subvolume snapshot
        Create a snapshot of 'subvolume' in './snapshot'
        [root@archvm test]# exit
        logout
        [vmuser@archvm test]$ tree -ug
        [vmuser   users   ]  .
        ├── [vmuser   users   ]  snapshot
        │   └── [vmuser   users   ]  subsubvolume  <-- Without patch perm is root:root
        └── [vmuser   users   ]  subvolume
            └── [vmuser   users   ]  subsubvolume
      
        5 directories, 0 files
      
      [1] https://btrfs.readthedocs.io/en/latest/btrfs-subvolume.html#nested-subvolumesSigned-off-by: default avatarLee Trager <lee@trager.us>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94628ad9
    • Filipe Manana's avatar
      btrfs: remove pointless empty list check when reading delayed dir indexes · 6b604c9a
      Filipe Manana authored
      At btrfs_readdir_delayed_dir_index(), called when reading a directory, we
      have this check for an empty list to return immediately, but it's not
      needed since list_for_each_entry_safe(), called immediately after, is
      prepared to deal with an empty list, it simply does nothing. So remove
      the empty list check.
      
      Besides shorter source code, it also slightly reduces the binary text
      size:
      
        Before this change:
      
          $ size fs/btrfs/btrfs.ko
             text	   data	    bss	    dec	    hex	filename
          1609408	 167269	  16864	1793541	 1b5e05	fs/btrfs/btrfs.ko
      
        After this change:
      
          $ size fs/btrfs/btrfs.ko
             text	   data	    bss	    dec	    hex	filename
          1609392	 167269	  16864	1793525	 1b5df5	fs/btrfs/btrfs.ko
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b604c9a
    • Anand Jain's avatar
      btrfs: drop redundant check to use fs_devices::metadata_uuid · 67bc5ad0
      Anand Jain authored
      fs_devices::metadata_uuid value is already updated based on the
      super_block::METADATA_UUID flag for either fsid or metadata_uuid as
      appropriate. So, fs_devices::metadata_uuid can be used directly.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      67bc5ad0
    • Anand Jain's avatar
      btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super · 6bfe3959
      Anand Jain authored
      The function btrfs_validate_super() should verify the metadata_uuid in
      the provided superblock argument. Because, all its callers expect it to
      do that.
      
      Such as in the following stacks:
      
        write_all_supers()
         sb = fs_info->super_for_commit;
         btrfs_validate_write_super(.., sb)
           btrfs_validate_super(.., sb, ..)
      
        scrub_one_super()
      	btrfs_validate_super(.., sb, ..)
      
      And
         check_dev_super()
      	btrfs_validate_super(.., sb, ..)
      
      However, it currently verifies the fs_info::super_copy::metadata_uuid
      instead.  Fix this using the correct metadata_uuid in the superblock
      argument.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bfe3959
    • Anand Jain's avatar
      btrfs: use the correct superblock to compare fsid in btrfs_validate_super · d167aa76
      Anand Jain authored
      The function btrfs_validate_super() should verify the fsid in the provided
      superblock argument. Because, all its callers expect it to do that.
      
      Such as in the following stack:
      
         write_all_supers()
             sb = fs_info->super_for_commit;
             btrfs_validate_write_super(.., sb)
               btrfs_validate_super(.., sb, ..)
      
         scrub_one_super()
      	btrfs_validate_super(.., sb, ..)
      
      And
         check_dev_super()
      	btrfs_validate_super(.., sb, ..)
      
      However, it currently verifies the fs_info::super_copy::fsid instead,
      which is not correct.  Fix this using the correct fsid in the superblock
      argument.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d167aa76
    • Anand Jain's avatar
      btrfs: simplify memcpy either of metadata_uuid or fsid · 319baafc
      Anand Jain authored
      There is a helper which provides either metadata_uuid or fsid as per
      METADATA_UUID flag. So use it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      319baafc
    • Anand Jain's avatar
      btrfs: add a helper to read the superblock metadata_uuid · 4844c366
      Anand Jain authored
      In some cases, we need to read the FSID from the superblock when the
      metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
      add a helper.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4844c366
    • Qu Wenruo's avatar
      btrfs: remove v0 extent handling · 182741d2
      Qu Wenruo authored
      The v0 extent item has been deprecated for a long time, and we don't have
      any report from the community either.
      
      So it's time to remove the v0 extent specific error handling, and just
      treat them as regular extent tree corruption.
      
      This patch would remove the btrfs_print_v0_err() helper, and enhance the
      involved error handling to treat them just as any extent tree
      corruption. No reports regarding v0 extents have been seen since the
      graceful handling was added in 2018.
      
      This involves:
      
      - btrfs_backref_add_tree_node()
        This change is a little tricky, the new code is changed to only handle
        BTRFS_TREE_BLOCK_REF_KEY and BTRFS_SHARED_BLOCK_REF_KEY.
      
        But this is safe, as we have rejected any unknown inline refs through
        btrfs_get_extent_inline_ref_type().
        For keyed backrefs, we're safe to skip anything we don't know (that's
        if it can pass tree-checker in the first place).
      
      - btrfs_lookup_extent_info()
      - lookup_inline_extent_backref()
      - run_delayed_extent_op()
      - __btrfs_free_extent()
      - add_tree_block()
        Regular error handling of unexpected extent tree item, and abort
        transaction (if we have a trans handle).
      
      - remove_extent_data_ref()
        It's pretty much the same as the regular rejection of unknown backref
        key.
        But for this particular case, we can also remove a BUG_ON().
      
      - extent_data_ref_count()
        We can remove the BTRFS_EXTENT_REF_V0_KEY BUG_ON(), as it would be
        rejected by the only caller.
      
      - btrfs_print_leaf()
        Remove the handling for BTRFS_EXTENT_REF_V0_KEY.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      182741d2
    • Qu Wenruo's avatar
      btrfs: output extra debug info if we failed to find an inline backref · 7f72f505
      Qu Wenruo authored
      [BUG]
      Syzbot reported several warning triggered inside
      lookup_inline_extent_backref().
      
      [CAUSE]
      As usual, the reproducer doesn't reliably trigger locally here, but at
      least we know the WARN_ON() is triggered when an inline backref can not
      be found, and it can only be triggered when @insert is true. (I.e.
      inserting a new inline backref, which means the backref should already
      exist)
      
      [ENHANCEMENT]
      After the WARN_ON(), dump all the parameters and the extent tree
      leaf to help debug.
      
      Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f72f505
    • Christoph Hellwig's avatar
      btrfs: move the !zoned assert into run_delalloc_cow · 76c5126e
      Christoph Hellwig authored
      Having the assert in the actual helper documents the pre-conditions
      much better than having it in the caller, so move it.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      76c5126e
    • Christoph Hellwig's avatar
      btrfs: consolidate the error handling in run_delalloc_nocow · 38dc8889
      Christoph Hellwig authored
      Share the calls to extent_clear_unlock_delalloc for btrfs_path allocation
      failure handling and the normal exit path.
      
      This relies on btrfs_free_path ignoring a NULL pointer, and the
      initialization of cur_offset to start at the beginning of the function.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38dc8889
    • Christoph Hellwig's avatar
      btrfs: cleanup the COW fallback logic in run_delalloc_nocow · 18f62b86
      Christoph Hellwig authored
      Use the block group pointer used to track the outstanding NOCOW writes as
      a boolean to remove the duplicate nocow variable, and keep it contained
      in the main loop to simplify the logic.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18f62b86
    • Christoph Hellwig's avatar
      btrfs: fix error handling when in a COW window in run_delalloc_nocow · 953fa5ce
      Christoph Hellwig authored
      When run_delalloc_nocow has cow_start set to a value other than (u64)-1,
      it has delayed COW writeback pending behind cur_offset.  When an error
      occurs in such a window, the range going back to cow_start and not just
      cur_offset needs to be unlocked, but only two error cases handle this
      correctly  Move the code to handle unlock the COW range to the common
      error handling label and document the logic.
      
      To make things even more complicated, cow_file_range as called by
      fallback_to_cow will unlock the range it is operating on when it fails as
      well, so we need to reset cow_start right after caling fallback_to_cow
      instead of only when it succeeded.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      953fa5ce
    • Naohiro Aota's avatar
      btrfs: zoned: do not zone finish data relocation block group · 332581bd
      Naohiro Aota authored
      When multiple writes happen at once, we may need to sacrifice a currently
      active block group to be zone finished for a new allocation. We choose a
      block group with the least free space left, and zone finish it.
      
      To do the finishing, we need to send IOs for already allocated region
      and wait for them and on-going IOs. Otherwise, these IOs fail because the
      zone is already finished at the time the IO reach a device.
      
      However, if a block group dedicated to the data relocation is zone
      finished, there is a chance that finishing it before an ongoing write IO
      reaches the device. That is because there is timing gap between an
      allocation is done (block_group->reservations == 0, as pre-allocation is
      done) and an ordered extent is created when the relocation IO starts.
      Thus, if we finish the zone between them, we can fail the IOs.
      
      We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
      avoid the zone finishing. Because, the data_reloc_bg may already switch to
      a new block group, while there are still ongoing write IOs to the old
      data_reloc_bg.
      
      So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
      indicate there is a data relocation allocation and/or ongoing write to the
      block group. The bit is set on allocation and cleared in end_io function of
      the last IO for the currently allocated region.
      
      To change the timing of the bit setting also solves the issue that the bit
      being left even after there is no IO going on. With the current code, if
      the data_reloc_bg switches after the last IO to the current data_reloc_bg,
      the bit is set at this timing and there is no one clearing that bit. As a
      result, that block group is kept unallocatable for anything.
      
      Fixes: 343d8a30 ("btrfs: zoned: prevent allocation from previous data relocation BG")
      Fixes: 74e91b12 ("btrfs: zoned: zone finish unused block group")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      332581bd
    • Josef Bacik's avatar
      btrfs: set page extent mapped after read_folio in relocate_one_page · e7f1326c
      Josef Bacik authored
      One of the CI runs triggered the following panic
      
        assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:229!
        Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
        CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1
        pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
        pc : btrfs_subpage_assert+0xbc/0xf0
        lr : btrfs_subpage_assert+0xbc/0xf0
        sp : ffff800093213720
        x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000
        x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff
        x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880
        x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff
        x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028
        x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000
        x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c
        x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8
        x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000
        x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f
        Call trace:
         btrfs_subpage_assert+0xbc/0xf0
         btrfs_subpage_set_dirty+0x38/0xa0
         btrfs_page_set_dirty+0x58/0x88
         relocate_one_page+0x204/0x5f0
         relocate_file_extent_cluster+0x11c/0x180
         relocate_data_extent+0xd0/0xf8
         relocate_block_group+0x3d0/0x4e8
         btrfs_relocate_block_group+0x2d8/0x490
         btrfs_relocate_chunk+0x54/0x1a8
         btrfs_balance+0x7f4/0x1150
         btrfs_ioctl+0x10f0/0x20b8
         __arm64_sys_ioctl+0x120/0x11d8
         invoke_syscall.constprop.0+0x80/0xd8
         do_el0_svc+0x6c/0x158
         el0_svc+0x50/0x1b0
         el0t_64_sync_handler+0x120/0x130
         el0t_64_sync+0x194/0x198
        Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000)
      
      This is the same problem outlined in 17b17fcd ("btrfs:
      set_page_extent_mapped after read_folio in btrfs_cont_expand") , and the
      fix is the same.  I originally looked for the same pattern elsewhere in
      our code, but mistakenly skipped over this code because I saw the page
      cache readahead before we set_page_extent_mapped, not realizing that
      this was only in the !page case, that we can still end up with a
      !uptodate page and then do the btrfs_read_folio further down.
      
      The fix here is the same as the above mentioned patch, move the
      set_page_extent_mapped call to after the btrfs_read_folio() block to
      make sure that we have the subpage blocksize stuff setup properly before
      using the page.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7f1326c
    • Josef Bacik's avatar
      btrfs: wait on uncached block groups on every allocation loop · cd361199
      Josef Bacik authored
      My initial fix for the generic/475 hangs was related to metadata, but
      our CI testing uncovered another case where we hang for similar reasons.
      We again have a task with a plug that is holding an outstanding request
      that is keeping the dm device from finishing it's suspend, and that task
      is stuck in the allocator.
      
      This time it is stuck trying to allocate data, but we do not have a
      block group that matches the size class.  The larger loop in the
      allocator looks like this (simplified of course)
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached == btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
      	ffe_ctl->have_caching_bg = true;
            do_allocation()
      	btrfs_wait_block_group_cache_progress();
          }
      
          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      In my earlier fix we were trying to allocate from the block group, but
      we weren't waiting for the progress because we were only waiting for the
      free space to be >= the amount of free space we wanted.  My fix made it
      so we waited for forward progress to be made as well, so we would be
      sure to wait.
      
      This time however we did not have a block group that matched our size
      class, so what was happening was this
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached == btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
      	ffe_ctl->have_caching_bg = true;
            if (size_class_doesn't_match())
      	goto loop;
            do_allocation()
      	btrfs_wait_block_group_cache_progress();
        loop:
            release_block_group(block_group);
          }
      
          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      The size_class_doesn't_match() part was true, so we'd just skip this
      block group and never wait for caching, and then because we found a
      caching block group we'd just go back and do the loop again.  We never
      sleep and thus never flush the plug and we have the same deadlock.
      
      Fix the logic for waiting on the block group caching to instead do it
      unconditionally when we goto loop.  This takes the logic out of the
      allocation step, so now the loop looks more like this
      
        find_free_extent
          for_each_block_group {
            ffe_ctl->cached == btrfs_block_group_cache_done(bg)
            if (!ffe_ctl->cached)
      	ffe_ctl->have_caching_bg = true;
            if (size_class_doesn't_match())
      	goto loop;
            do_allocation()
      	btrfs_wait_block_group_cache_progress();
        loop:
            if (loop > LOOP_CACHING_NOWAIT && !ffe_ctl->retry_uncached &&
      	  !ffe_ctl->cached) {
      	 ffe_ctl->retry_uncached = true;
      	 btrfs_wait_block_group_cache_progress();
            }
      
            release_block_group(block_group);
          }
      
          if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
            go search again;
      
      This simplifies the logic a lot, and makes sure that if we're hitting
      uncached block groups we're always waiting on them at some point.
      
      I ran this through 100 iterations of generic/475, as this particular
      case was harder to hit than the previous one.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd361199