1. 05 Nov, 2020 1 commit
  2. 27 Oct, 2020 2 commits
    • Davidlohr Bueso's avatar
      btrfs: convert data_seqcount to seqcount_mutex_t · d5c82388
      Davidlohr Bueso authored
      By doing so we can associate the sequence counter to the chunk_mutex
      for lockdep purposes (compiled-out otherwise), the mutex is otherwise
      used on the write side.
      Also avoid explicitly disabling preemption around the write region as it
      will now be done automatically by the seqcount machinery based on the
      lock type.
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d5c82388
    • Johannes Thumshirn's avatar
      btrfs: don't fallback to buffered read if we don't need to · 0425e7ba
      Johannes Thumshirn authored
      Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
      switch to iomap for direct IO") we're calling generic_file_buffered_read()
      directly and not via generic_file_read_iter() anymore.
      
      If the read could read everything there is no need to bother calling
      generic_file_buffered_read(), like it is handled in
      generic_file_read_iter().
      
      If we call generic_file_buffered_read() in this case we can hit a
      situation where we do an invalid readahead and cause this UBSAN splat
      in fstest generic/091:
      
        run fstests generic/091 at 2020-10-21 10:52:32
        ================================================================================
        UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
        shift exponent 64 is too large for 64-bit type 'long unsigned int'
        CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         __dump_stack lib/dump_stack.c:77
         dump_stack+0x57/0x70 lib/dump_stack.c:118
         ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
         __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
         __roundup_pow_of_two ./include/linux/log2.h:57
         get_init_ra_size mm/readahead.c:318
         ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
         generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
         call_read_iter ./include/linux/fs.h:1876
         new_sync_read+0x102/0x180 fs/read_write.c:415
         vfs_read+0x11c/0x1a0 fs/read_write.c:481
         ksys_read+0x4f/0xc0 fs/read_write.c:615
         do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
        RIP: 0033:0x7fe87fee992e
        RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
        RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
        RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
        R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
        R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
        ================================================================================
        BTRFS info (device nullb0): has skinny extents
        BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
        BTRFS info (device nullb0): enabling ssd optimizations
      
      Fixes: f85781fb ("btrfs: switch to iomap for direct IO")
      Reviewed-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0425e7ba
  3. 26 Oct, 2020 7 commits
    • Josef Bacik's avatar
      btrfs: add a helper to read the tree_root commit root for backref lookup · 49d11bea
      Josef Bacik authored
      I got the following lockdep splat with tree locks converted to rwsem
      patches on btrfs/104:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #102 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/903 is trying to acquire lock:
        ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 caching_thread+0x53/0x5a0
      	 btrfs_work_helper+0xfa/0x520
      	 process_one_work+0x238/0x540
      	 worker_thread+0x55/0x3c0
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7b0
      	 btrfs_cache_block_group+0x1e0/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&caching_ctl->mutex);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/903:
         #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
        BTRFS info (device sdb): disk space caching is enabled
        BTRFS info (device sdb): has skinny extents
      
      This happens because qgroups does a backref lookup when we create a
      delayed ref.  From here it may have to look up a root from an indirect
      ref, which does a normal lookup on the tree_root, which takes the read
      lock on the tree_root nodes.
      
      To fix this we need to add a variant for looking up roots that searches
      the commit root of the tree_root.  Then when we do the backref search
      using the commit root we are sure to not take any locks on the tree_root
      nodes.  This gets rid of the lockdep splat when running btrfs/104.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49d11bea
    • Josef Bacik's avatar
      btrfs: drop the path before adding qgroup items when enabling qgroups · 5223cc60
      Josef Bacik authored
      When enabling qgroups we walk the tree_root and then add a qgroup item
      for every root that we have.  This creates a lock dependency on the
      tree_root and qgroup_root, which results in the following lockdep splat
      (with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-default+ #1299 Not tainted
        ------------------------------------------------------
        btrfs/24552 is trying to acquire lock:
        ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      
        but task is already holding lock:
        ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x3fb/0x730
      	 lock_acquire.part.0+0x6a/0x130
      	 down_read_nested+0x46/0x130
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
      	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
      	 btrfs_insert_item+0x6e/0x140 [btrfs]
      	 btrfs_create_tree+0x1cb/0x240 [btrfs]
      	 btrfs_quota_enable+0xcd/0x790 [btrfs]
      	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
      	 __x64_sys_ioctl+0x83/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-quota-00){++++}-{3:3}:
      	 check_prev_add+0x91/0xc30
      	 validate_chain+0x491/0x750
      	 __lock_acquire+0x3fb/0x730
      	 lock_acquire.part.0+0x6a/0x130
      	 down_read_nested+0x46/0x130
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
      	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
      	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
      	 add_qgroup_item.part.0+0x72/0x210 [btrfs]
      	 btrfs_quota_enable+0x3bb/0x790 [btrfs]
      	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
      	 __x64_sys_ioctl+0x83/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-root-00);
      				 lock(btrfs-quota-00);
      				 lock(btrfs-root-00);
          lock(btrfs-quota-00);
      
         *** DEADLOCK ***
      
        5 locks held by btrfs/24552:
         #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
         #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
         #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
         #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
         #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      
        stack backtrace:
        CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x77/0x97
         check_noncircular+0xf3/0x110
         check_prev_add+0x91/0xc30
         validate_chain+0x491/0x750
         __lock_acquire+0x3fb/0x730
         lock_acquire.part.0+0x6a/0x130
         ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
         ? lock_acquire+0xc4/0x140
         ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
         down_read_nested+0x46/0x130
         ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
         __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
         ? btrfs_root_node+0xd9/0x200 [btrfs]
         __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
         btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
         btrfs_search_slot+0xc3/0x9f0 [btrfs]
         btrfs_insert_empty_items+0x58/0xa0 [btrfs]
         add_qgroup_item.part.0+0x72/0x210 [btrfs]
         btrfs_quota_enable+0x3bb/0x790 [btrfs]
         btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
         __x64_sys_ioctl+0x83/0xa0
         do_syscall_64+0x2d/0x70
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by dropping the path whenever we find a root item, add the
      qgroup item, and then re-lookup the root item we found and continue
      processing roots.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5223cc60
    • Filipe Manana's avatar
      btrfs: fix readahead hang and use-after-free after removing a device · 66d204a1
      Filipe Manana authored
      Very sporadically I had test case btrfs/069 from fstests hanging (for
      years, it is not a recent regression), with the following traces in
      dmesg/syslog:
      
        [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
        [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
        [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
        [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
        [162513.514751] Call Trace:
        [162513.514761]  __schedule+0x5ce/0xd00
        [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.514771]  schedule+0x46/0xf0
        [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.514850]  ? finish_wait+0x90/0x90
        [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
        [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
        [162513.514894]  kthread+0x153/0x170
        [162513.514897]  ? kthread_stop+0x2c0/0x2c0
        [162513.514902]  ret_from_fork+0x22/0x30
        [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
        [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
        [162513.515682] Call Trace:
        [162513.515688]  __schedule+0x5ce/0xd00
        [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.515697]  schedule+0x46/0xf0
        [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.515716]  ? finish_wait+0x90/0x90
        [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.515761]  iterate_supers+0x87/0xf0
        [162513.515765]  ksys_sync+0x60/0xb0
        [162513.515768]  __do_sys_sync+0xa/0x10
        [162513.515771]  do_syscall_64+0x33/0x80
        [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.515781] RIP: 0033:0x7f5238f50bd7
        [162513.515782] Code: Bad RIP value.
        [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
        [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
        [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
        [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
        [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
        [162513.516620] Call Trace:
        [162513.516625]  __schedule+0x5ce/0xd00
        [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.516634]  schedule+0x46/0xf0
        [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.516650]  ? finish_wait+0x90/0x90
        [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
        [162513.516686]  __vfs_setxattr+0x66/0x80
        [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
        [162513.516697]  vfs_setxattr+0x6b/0x120
        [162513.516703]  setxattr+0x125/0x240
        [162513.516709]  ? lock_acquire+0xb1/0x480
        [162513.516712]  ? mnt_want_write+0x20/0x50
        [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
        [162513.516723]  ? preempt_count_add+0x49/0xa0
        [162513.516725]  ? __sb_start_write+0x19b/0x290
        [162513.516727]  ? preempt_count_add+0x49/0xa0
        [162513.516732]  path_setxattr+0xba/0xd0
        [162513.516739]  __x64_sys_setxattr+0x27/0x30
        [162513.516741]  do_syscall_64+0x33/0x80
        [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.516745] RIP: 0033:0x7f5238f56d5a
        [162513.516746] Code: Bad RIP value.
        [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
        [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
        [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
        [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
        [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
        [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
        [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
        [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
        [162513.517780] Call Trace:
        [162513.517786]  __schedule+0x5ce/0xd00
        [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.517796]  schedule+0x46/0xf0
        [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.517814]  ? finish_wait+0x90/0x90
        [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.517865]  iterate_supers+0x87/0xf0
        [162513.517869]  ksys_sync+0x60/0xb0
        [162513.517872]  __do_sys_sync+0xa/0x10
        [162513.517875]  do_syscall_64+0x33/0x80
        [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.517881] RIP: 0033:0x7f5238f50bd7
        [162513.517883] Code: Bad RIP value.
        [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
        [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
        [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
        [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
        [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
        [162513.519160] Call Trace:
        [162513.519165]  __schedule+0x5ce/0xd00
        [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.519174]  schedule+0x46/0xf0
        [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.519193]  ? finish_wait+0x90/0x90
        [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
        [162513.519230]  lookup_open+0x522/0x650
        [162513.519246]  path_openat+0x2b8/0xa50
        [162513.519270]  do_filp_open+0x91/0x100
        [162513.519275]  ? find_held_lock+0x32/0x90
        [162513.519280]  ? lock_acquired+0x33b/0x470
        [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
        [162513.519287]  ? _raw_spin_unlock+0x29/0x40
        [162513.519295]  do_sys_openat2+0x20d/0x2d0
        [162513.519300]  do_sys_open+0x44/0x80
        [162513.519304]  do_syscall_64+0x33/0x80
        [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.519309] RIP: 0033:0x7f5238f4a903
        [162513.519310] Code: Bad RIP value.
        [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
        [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
        [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
        [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
        [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
        [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
        [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
        [162513.520511] Call Trace:
        [162513.520516]  __schedule+0x5ce/0xd00
        [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.520525]  schedule+0x46/0xf0
        [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
        [162513.520548]  ? finish_wait+0x90/0x90
        [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
        [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
        [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
        [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
        [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
        [162513.520643]  ? do_sigaction+0xf3/0x240
        [162513.520645]  ? find_held_lock+0x32/0x90
        [162513.520648]  ? do_sigaction+0xf3/0x240
        [162513.520651]  ? lock_acquired+0x33b/0x470
        [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
        [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
        [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
        [162513.520662]  ? do_sigaction+0xf3/0x240
        [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
        [162513.520672]  __x64_sys_ioctl+0x83/0xb0
        [162513.520677]  do_syscall_64+0x33/0x80
        [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.520681] RIP: 0033:0x7fc3cd307d87
        [162513.520682] Code: Bad RIP value.
        [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
        [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
        [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
        [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
        [162513.520703]
      		  Showing all locks held in the system:
        [162513.520712] 1 lock held by khungtaskd/54:
        [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
        [162513.520728] 1 lock held by in:imklog/596:
        [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
        [162513.520782] 1 lock held by btrfs-transacti/1356167:
        [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
        [162513.520798] 1 lock held by btrfs/1356190:
        [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
        [162513.520805] 1 lock held by fsstress/1356184:
        [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520811] 3 locks held by fsstress/1356185:
        [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
        [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520833] 1 lock held by fsstress/1356196:
        [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520838] 3 locks held by fsstress/1356197:
        [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
        [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520858] 2 locks held by btrfs/1356211:
        [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
        [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
      
      This was weird because the stack traces show that a transaction commit,
      triggered by a device replace operation, is blocking trying to pause any
      running scrubs but there are no stack traces of blocked tasks doing a
      scrub.
      
      After poking around with drgn, I noticed there was a scrub task that was
      constantly running and blocking for shorts periods of time:
      
        >>> t = find_task(prog, 1356190)
        >>> prog.stack_trace(t)
        #0  __schedule+0x5ce/0xcfc
        #1  schedule+0x46/0xe4
        #2  schedule_timeout+0x1df/0x475
        #3  btrfs_reada_wait+0xda/0x132
        #4  scrub_stripe+0x2a8/0x112f
        #5  scrub_chunk+0xcd/0x134
        #6  scrub_enumerate_chunks+0x29e/0x5ee
        #7  btrfs_scrub_dev+0x2d5/0x91b
        #8  btrfs_ioctl+0x7f5/0x36e7
        #9  __x64_sys_ioctl+0x83/0xb0
        #10 do_syscall_64+0x33/0x77
        #11 entry_SYSCALL_64+0x7c/0x156
      
      Which corresponds to:
      
      int btrfs_reada_wait(void *handle)
      {
          struct reada_control *rc = handle;
          struct btrfs_fs_info *fs_info = rc->fs_info;
      
          while (atomic_read(&rc->elems)) {
              if (!atomic_read(&fs_info->reada_works_cnt))
                  reada_start_machine(fs_info);
              wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                (HZ + 9) / 10);
          }
      (...)
      
      So the counter "rc->elems" was set to 1 and never decreased to 0, causing
      the scrub task to loop forever in that function. Then I used the following
      script for drgn to check the readahead requests:
      
        $ cat dump_reada.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        def dump_re(re):
            nzones = re.nzones.value_()
            print(f're at {hex(re.value_())}')
            print(f'\t logical {re.logical.value_()}')
            print(f'\t refcnt {re.refcnt.value_()}')
            print(f'\t nzones {nzones}')
            for i in range(nzones):
                dev = re.zones[i].device
                name = dev.name.str.string_()
                print(f'\t\t dev id {dev.devid.value_()} name {name}')
            print()
      
        for _, e in radix_tree_for_each(fs_info.reada_tree):
            re = cast('struct reada_extent *', e)
            dump_re(re)
      
        $ drgn dump_reada.py
        re at 0xffff8f3da9d25ad8
                logical 38928384
                refcnt 1
                nzones 1
                       dev id 0 name b'/dev/sdd'
        $
      
      So there was one readahead extent with a single zone corresponding to the
      source device of that last device replace operation logged in dmesg/syslog.
      Also the ID of that zone's device was 0 which is a special value set in
      the source device of a device replace operation when the operation finishes
      (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
      confirming again that device /dev/sdd was the source of a device replace
      operation.
      
      Normally there should be as many zones in the readahead extent as there are
      devices, and I wasn't expecting the extent to be in a block group with a
      'single' profile, so I went and confirmed with the following drgn script
      that there weren't any single profile block groups:
      
        $ cat dump_block_groups.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        BTRFS_BLOCK_GROUP_DATA = (1 << 0)
        BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
        BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
        BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
        BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
        BTRFS_BLOCK_GROUP_DUP = (1 << 5)
        BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
        BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
        BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
        BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
        BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
      
        def bg_flags_string(bg):
            flags = bg.flags.value_()
            ret = ''
            if flags & BTRFS_BLOCK_GROUP_DATA:
                ret = 'data'
            if flags & BTRFS_BLOCK_GROUP_METADATA:
                if len(ret) > 0:
                    ret += '|'
                ret += 'meta'
            if flags & BTRFS_BLOCK_GROUP_SYSTEM:
                if len(ret) > 0:
                    ret += '|'
                ret += 'system'
            if flags & BTRFS_BLOCK_GROUP_RAID0:
                ret += ' raid0'
            elif flags & BTRFS_BLOCK_GROUP_RAID1:
                ret += ' raid1'
            elif flags & BTRFS_BLOCK_GROUP_DUP:
                ret += ' dup'
            elif flags & BTRFS_BLOCK_GROUP_RAID10:
                ret += ' raid10'
            elif flags & BTRFS_BLOCK_GROUP_RAID5:
                ret += ' raid5'
            elif flags & BTRFS_BLOCK_GROUP_RAID6:
                ret += ' raid6'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
                ret += ' raid1c3'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
                ret += ' raid1c4'
            else:
                ret += ' single'
      
            return ret
      
        def dump_bg(bg):
            print()
            print(f'block group at {hex(bg.value_())}')
            print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
            print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
      
        bg_root = fs_info.block_group_cache_tree.address_of_()
        for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
            dump_bg(bg)
      
        $ drgn dump_block_groups.py
      
        block group at 0xffff8f3d673b0400
               start 22020096 length 16777216
               flags 258 - system raid6
      
        block group at 0xffff8f3d53ddb400
               start 38797312 length 536870912
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4d9c00
               start 575668224 length 2147483648
               flags 257 - data raid6
      
        block group at 0xffff8f3d08189000
               start 2723151872 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3db70ff000
               start 2790260736 length 1073741824
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4dd800
               start 3864002560 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3d67037000
               start 3931111424 length 2147483648
               flags 257 - data raid6
        $
      
      So there were only 2 reasons left for having a readahead extent with a
      single zone: reada_find_zone(), called when creating a readahead extent,
      returned NULL either because we failed to find the corresponding block
      group or because a memory allocation failed. With some additional and
      custom tracing I figured out that on every further ocurrence of the
      problem the block group had just been deleted when we were looping to
      create the zones for the readahead extent (at reada_find_extent()), so we
      ended up with only one zone in the readahead extent, corresponding to a
      device that ends up getting replaced.
      
      So after figuring that out it became obvious why the hang happens:
      
      1) Task A starts a scrub on any device of the filesystem, except for
         device /dev/sdd;
      
      2) Task B starts a device replace with /dev/sdd as the source device;
      
      3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
         starting to scrub a stripe from block group X. This call to
         btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
         calls reada_add_block(), it passes the logical address of the extent
         tree's root node as its 'logical' argument - a value of 38928384;
      
      4) Task A then enters reada_find_extent(), called from reada_add_block().
         It finds there isn't any existing readahead extent for the logical
         address 38928384, so it proceeds to the path of creating a new one.
      
         It calls btrfs_map_block() to find out which stripes exist for the block
         group X. On the first iteration of the for loop that iterates over the
         stripes, it finds the stripe for device /dev/sdd, so it creates one
         zone for that device and adds it to the readahead extent. Before getting
         into the second iteration of the loop, the cleanup kthread deletes block
         group X because it was empty. So in the iterations for the remaining
         stripes it does not add more zones to the readahead extent, because the
         calls to reada_find_zone() returned NULL because they couldn't find
         block group X anymore.
      
         As a result the new readahead extent has a single zone, corresponding to
         the device /dev/sdd;
      
      4) Before task A returns to btrfs_reada_add() and queues the readahead job
         for the readahead work queue, task B finishes the device replace and at
         btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
         device /dev/sdg;
      
      5) Task A returns to reada_add_block(), which increments the counter
         "->elems" of the reada_control structure allocated at btrfs_reada_add().
      
         Then it returns back to btrfs_reada_add() and calls
         reada_start_machine(). This queues a job in the readahead work queue to
         run the function reada_start_machine_worker(), which calls
         __reada_start_machine().
      
         At __reada_start_machine() we take the device list mutex and for each
         device found in the current device list, we call
         reada_start_machine_dev() to start the readahead work. However at this
         point the device /dev/sdd was already freed and is not in the device
         list anymore.
      
         This means the corresponding readahead for the extent at 38928384 is
         never started, and therefore the "->elems" counter of the reada_control
         structure allocated at btrfs_reada_add() never goes down to 0, causing
         the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
      
      Note that the readahead request can be made either after the device replace
      started or before it started, however in pratice it is very unlikely that a
      device replace is able to start after a readahead request is made and is
      able to complete before the readahead request completes - maybe only on a
      very small and nearly empty filesystem.
      
      This hang however is not the only problem we can have with readahead and
      device removals. When the readahead extent has other zones other than the
      one corresponding to the device that is being removed (either by a device
      replace or a device remove operation), we risk having a use-after-free on
      the device when dropping the last reference of the readahead extent.
      
      For example if we create a readahead extent with two zones, one for the
      device /dev/sdd and one for the device /dev/sde:
      
      1) Before the readahead worker starts, the device /dev/sdd is removed,
         and the corresponding btrfs_device structure is freed. However the
         readahead extent still has the zone pointing to the device structure;
      
      2) When the readahead worker starts, it only finds device /dev/sde in the
         current device list of the filesystem;
      
      3) It starts the readahead work, at reada_start_machine_dev(), using the
         device /dev/sde;
      
      4) Then when it finishes reading the extent from device /dev/sde, it calls
         __readahead_hook() which ends up dropping the last reference on the
         readahead extent through the last call to reada_extent_put();
      
      5) At reada_extent_put() it iterates over each zone of the readahead extent
         and attempts to delete an element from the device's 'reada_extents'
         radix tree, resulting in a use-after-free, as the device pointer of the
         zone for /dev/sdd is now stale. We can also access the device after
         dropping the last reference of a zone, through reada_zone_release(),
         also called by reada_extent_put().
      
      And a device remove suffers the same problem, however since it shrinks the
      device size down to zero before removing the device, it is very unlikely to
      still have readahead requests not completed by the time we free the device,
      the only possibility is if the device has a very little space allocated.
      
      While the hang problem is exclusive to scrub, since it is currently the
      only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
      problem affects any path that triggers readhead, which includes
      btree_readahead_hook() and __readahead_hook() (a readahead worker can
      trigger readahed for the children of a node) for example - any path that
      ends up calling reada_add_block() can trigger the use-after-free after a
      device is removed.
      
      So fix this by waiting for any readahead requests for a device to complete
      before removing a device, ensuring that while waiting for existing ones no
      new ones can be made.
      
      This problem has been around for a very long time - the readahead code was
      added in 2011, device remove exists since 2008 and device replace was
      introduced in 2013, hard to pick a specific commit for a git Fixes tag.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66d204a1
    • Filipe Manana's avatar
      btrfs: fix use-after-free on readahead extent after failure to create it · 83bc1560
      Filipe Manana authored
      If we fail to find suitable zones for a new readahead extent, we end up
      leaving a stale pointer in the global readahead extents radix tree
      (fs_info->reada_tree), which can trigger the following trace later on:
      
        [13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
        [13367.696802] #PF: supervisor read access in kernel mode
        [13367.697249] #PF: error_code(0x0000) - not-present page
        [13367.697721] PGD 0 P4D 0
        [13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G        W         5.9.0-rc6-btrfs-next-69 #1
        [13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
        [13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
        [13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
        [13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
        [13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
        [13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
        [13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
        [13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
        [13367.704880] FS:  00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
        [13367.705438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
        [13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [13367.707686] Call Trace:
        [13367.708246]  ? ___slab_alloc+0x395/0x740
        [13367.708820]  ? reada_add_block+0xae/0xee0 [btrfs]
        [13367.709383]  lock_acquire+0xb1/0x480
        [13367.709955]  ? reada_add_block+0xe0/0xee0 [btrfs]
        [13367.710537]  ? reada_add_block+0xae/0xee0 [btrfs]
        [13367.711097]  ? rcu_read_lock_sched_held+0x5d/0x90
        [13367.711659]  ? kmem_cache_alloc_trace+0x8d2/0x990
        [13367.712221]  ? lock_acquired+0x33b/0x470
        [13367.712784]  _raw_spin_lock+0x34/0x80
        [13367.713356]  ? reada_add_block+0xe0/0xee0 [btrfs]
        [13367.713966]  reada_add_block+0xe0/0xee0 [btrfs]
        [13367.714529]  ? btrfs_root_node+0x15/0x1f0 [btrfs]
        [13367.715077]  btrfs_reada_add+0x117/0x170 [btrfs]
        [13367.715620]  scrub_stripe+0x21e/0x10d0 [btrfs]
        [13367.716141]  ? kvm_sched_clock_read+0x5/0x10
        [13367.716657]  ? __lock_acquire+0x41e/0x3970
        [13367.717184]  ? scrub_chunk+0x60/0x140 [btrfs]
        [13367.717697]  ? find_held_lock+0x32/0x90
        [13367.718254]  ? scrub_chunk+0x60/0x140 [btrfs]
        [13367.718773]  ? lock_acquired+0x33b/0x470
        [13367.719278]  ? scrub_chunk+0xcd/0x140 [btrfs]
        [13367.719786]  scrub_chunk+0xcd/0x140 [btrfs]
        [13367.720291]  scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
        [13367.720787]  ? finish_wait+0x90/0x90
        [13367.721281]  btrfs_scrub_dev+0x1ee/0x620 [btrfs]
        [13367.721762]  ? rcu_read_lock_any_held+0x8e/0xb0
        [13367.722235]  ? preempt_count_add+0x49/0xa0
        [13367.722710]  ? __sb_start_write+0x19b/0x290
        [13367.723192]  btrfs_ioctl+0x7f5/0x36f0 [btrfs]
        [13367.723660]  ? __fget_files+0x101/0x1d0
        [13367.724118]  ? find_held_lock+0x32/0x90
        [13367.724559]  ? __fget_files+0x101/0x1d0
        [13367.724982]  ? __x64_sys_ioctl+0x83/0xb0
        [13367.725399]  __x64_sys_ioctl+0x83/0xb0
        [13367.725802]  do_syscall_64+0x33/0x80
        [13367.726188]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [13367.726574] RIP: 0033:0x7fe1df7add87
        [13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
        [13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
        [13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
        [13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
        [13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
        [13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
        [13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
        [13367.732638] CR2: 00000000000000b0
        [13367.733166] ---[ end trace d298b6805556acd9 ]---
      
      What happens is the following:
      
      1) At reada_find_extent() we don't find any existing readahead extent for
         the metadata extent starting at logical address X;
      
      2) So we proceed to create a new one. We then call btrfs_map_block() to get
         information about which stripes contain extent X;
      
      3) After that we iterate over the stripes and create only one zone for the
         readahead extent - only one because reada_find_zone() returned NULL for
         all iterations except for one, either because a memory allocation failed
         or it couldn't find the block group of the extent (it may have just been
         deleted);
      
      4) We then add the new readahead extent to the readahead extents radix
         tree at fs_info->reada_tree;
      
      5) Then we iterate over each zone of the new readahead extent, and find
         that the device used for that zone no longer exists, because it was
         removed or it was the source device of a device replace operation.
         Since this left 'have_zone' set to 0, after finishing the loop we jump
         to the 'error' label, call kfree() on the new readahead extent and
         return without removing it from the radix tree at fs_info->reada_tree;
      
      6) Any future call to reada_find_extent() for the logical address X will
         find the stale pointer in the readahead extents radix tree, increment
         its reference counter, which can trigger the use-after-free right
         away or return it to the caller reada_add_block() that results in the
         use-after-free of the example trace above.
      
      So fix this by making sure we delete the readahead extent from the radix
      tree if we fail to setup zones for it (when 'have_zone = 0').
      
      Fixes: 31945021 ("btrfs: reada: bypass adding extent when all zone failed")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      83bc1560
    • Daniel Xu's avatar
      btrfs: tree-checker: validate number of chunk stripes and parity · 85d07fbe
      Daniel Xu authored
      If there's no parity and num_stripes < ncopies, a crafted image can
      trigger a division by zero in calc_stripe_length().
      
      The image was generated through fuzzing.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209587Signed-off-by: default avatarDaniel Xu <dxu@dxuuu.xyz>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      85d07fbe
    • Pujin Shi's avatar
      btrfs: tree-checker: fix incorrect printk format · cad69d13
      Pujin Shi authored
      This patch addresses a compile warning:
      
      fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
      fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
      
      Fixes: 1c2a07f5 ("btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarPujin Shi <shipujin.t@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cad69d13
    • Josef Bacik's avatar
      btrfs: drop the path before adding block group sysfs files · 7837fa88
      Josef Bacik authored
      Dave reported a problem with my rwsem conversion patch where we got the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-default+ #1297 Not tainted
        ------------------------------------------------------
        kswapd0/76 is trying to acquire lock:
        ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 kmem_cache_alloc+0x30/0x9c0
      	 alloc_inode+0x81/0x90
      	 iget_locked+0xcd/0x1a0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x210
      	 sysfs_get_tree+0x1a/0x50
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (kernfs_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x58/0x80
      	 sysfs_create_dir_ns+0x70/0xd0
      	 kobject_add_internal+0xbb/0x2d0
      	 kobject_add+0x7a/0xd0
      	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
      	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
      	 open_ctree+0x981/0x1108 [btrfs]
      	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (btrfs-extent-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 check_committed_ref+0x69/0x200 [btrfs]
      	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
      	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
      	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
      	 writepage_delalloc+0xae/0x160 [btrfs]
      	 __extent_writepage+0x262/0x420 [btrfs]
      	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
      	 extent_writepages+0x43/0x90 [btrfs]
      	 do_writepages+0x40/0xe0
      	 __writeback_single_inode+0x62/0x610
      	 writeback_sb_inodes+0x20f/0x500
      	 wb_writeback+0xef/0x4a0
      	 wb_do_writeback+0x49/0x2e0
      	 wb_workfn+0x81/0x340
      	 process_one_work+0x233/0x5d0
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-fs-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
      	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
      	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
      	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
      	 btrfs_mksubvol+0x2db/0x470 [btrfs]
      	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
      	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
      	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
      	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
      	 __x64_sys_ioctl+0x6f/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 validate_chain+0xa6e/0x2a20
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x16d/0x3b0
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x230/0x6a0
      	 balance_pgdat+0x325/0x750
      	 kswapd+0x206/0x4d0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/76:
         #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x77/0x97
         check_noncircular+0xff/0x110
         ? save_trace+0x50/0x470
         check_prev_add+0x91/0xc60
         validate_chain+0xa6e/0x2a20
         ? save_trace+0x50/0x470
         __lock_acquire+0x582/0xac0
         lock_acquire+0xca/0x430
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? __lock_acquire+0x582/0xac0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? btrfs_evict_inode+0x30b/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x3cc/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x16d/0x3b0
         shrink_slab+0xb1/0x2e0
         shrink_node+0x230/0x6a0
         balance_pgdat+0x325/0x750
         kswapd+0x206/0x4d0
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_mod_delayed_work+0xc0/0xc0
         ret_from_fork+0x1f/0x30
      
      This happens because we are still holding the path open when we start
      adding the sysfs files for the block groups, which creates a dependency
      on fs_reclaim via the tree lock.  Fix this by dropping the path before
      we start doing anything with sysfs.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7837fa88
  4. 16 Oct, 2020 1 commit
    • Filipe Manana's avatar
      btrfs: fix relocation failure due to race with fallocate · 1afc708d
      Filipe Manana authored
      When doing a fallocate() we have a short time window, after reserving an
      extent and before starting a transaction, where if relocation for the block
      group containing the reserved extent happens, we can end up missing the
      extent in the data relocation inode causing relocation to fail later.
      
      This only happens when we don't pass a transaction to the internal
      fallocate function __btrfs_prealloc_file_range(), which is for all the
      cases where fallocate() is called from user space (the internal use cases
      include space cache extent allocation and relocation).
      
      When the race triggers the relocation failure, it produces a trace like
      the following:
      
        [200611.995995] ------------[ cut here ]------------
        [200611.997084] BTRFS: Transaction aborted (error -2)
        [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs]
        [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...)
        [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs]
        [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...)
        [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction.
        [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286
        [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem
        [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000
        [200612.011078] BTRFS info (device sdb): forced readonly
        [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff
        [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000
        [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88
        [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08
        [200612.016182] FS:  00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000
        [200612.017174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0
        [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [200612.021520] Call Trace:
        [200612.022434]  btrfs_cow_block+0x10b/0x250 [btrfs]
        [200612.023407]  do_relocation+0x54e/0x7b0 [btrfs]
        [200612.024343]  ? do_raw_spin_unlock+0x4b/0xc0
        [200612.025280]  ? _raw_spin_unlock+0x29/0x40
        [200612.026200]  relocate_tree_blocks+0x3bc/0x6d0 [btrfs]
        [200612.027088]  relocate_block_group+0x2f3/0x600 [btrfs]
        [200612.027961]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
        [200612.028896]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
        [200612.029772]  btrfs_balance+0xb22/0x1790 [btrfs]
        [200612.030601]  ? btrfs_ioctl_balance+0x253/0x380 [btrfs]
        [200612.031414]  btrfs_ioctl_balance+0x2cf/0x380 [btrfs]
        [200612.032279]  btrfs_ioctl+0x620/0x36f0 [btrfs]
        [200612.033077]  ? _raw_spin_unlock+0x29/0x40
        [200612.033948]  ? handle_mm_fault+0x116d/0x1ca0
        [200612.034749]  ? up_read+0x18/0x240
        [200612.035542]  ? __x64_sys_ioctl+0x83/0xb0
        [200612.036244]  __x64_sys_ioctl+0x83/0xb0
        [200612.037269]  do_syscall_64+0x33/0x80
        [200612.038190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [200612.038976] RIP: 0033:0x7fa55d07ed87
        [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...)
        [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87
        [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003
        [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0
        [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614
        [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470
        [200612.046361] irq event stamp: 0
        [200612.047040] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
        [200612.048387] softirqs last  enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
        [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [200612.049722] ---[ end trace 49006c6876e65227 ]---
      
      The race happens like this:
      
      1) Task A starts an fallocate() (plain or zero range) and it calls
         __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL;
      
      2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to
         block group X;
      
      3) Before task A gets into btrfs_replace_file_extents(), through the call
         to insert_prealloc_file_extent(), task B starts relocation of block
         group X;
      
      4) Task B enters btrfs_relocate_block_group() and it sets block group X to
         RO mode;
      
      5) Task B enters relocate_block_group(), it calls prepare_to_relocate()
         whichs joins/starts a transaction and then commits the transaction;
      
      6) Task B then starts scanning the extent tree looking for extents that
         belong to block group X - it does not find yet the extent reserved by
         task A, since that extent was not yet added to the extent tree, as its
         delayed reference was not even yet created at this point;
      
      7) The data relocation inode ends up not having the extent reserved by
         task A associated to it;
      
      8) Task A then starts a transaction through btrfs_replace_file_extents(),
         inserts a file extent item in the subvolume tree pointing to the
         reserved extent and creates a delayed reference for it;
      
      9) Task A finishes and returns success to user space;
      
      10) Later on, while relocation is still in progress, the leaf where task A
          inserted the new file extent item is COWed, so we end up at
          __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in
          turn calls relocation.c:replace_file_extents();
      
      11) At relocation.c:replace_file_extents() we iterate over all the items in
          the leaf and find the file extent item pointing to the extent that was
          allocated by task A, and then call relocation.c:get_new_location(), to
          find the new location for the extent;
      
      12) However relocation.c:get_new_location() fails, returning -ENOENT,
          because it couldn't find a corresponding file extent item associated
          with the data relocation inode. This is because the extent was not seen
          in the extent tree at step 6). The -ENOENT error is propagated to
          __btrfs_cow_block(), which aborts the transaction.
      
      So fix this simply by decrementing the block group's number of reservations
      after calling insert_prealloc_file_extent(), as relocation waits for that
      counter to go down to zero before calling prepare_to_relocate() and start
      looking for extents in the extent tree.
      
      This issue only started to happen recently as of commit 8fccebfa
      ("btrfs: fix metadata reservation for fallocate that leads to transaction
      aborts"), because now we can reserve an extent before starting/joining a
      transaction, and previously we always did it after that, so relocation
      ended up waiting for a concurrent fallocate() to finish because before
      searching for the extents of the block group, it starts/joins a transaction
      and then commits it (at prepare_to_relocate()), which made it wait for the
      fallocate task to complete first.
      
      Fixes: 8fccebfa ("btrfs: fix metadata reservation for fallocate that leads to transaction aborts")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1afc708d
  5. 07 Oct, 2020 29 commits
    • Nikolay Borisov's avatar
      btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag · 1fd4033d
      Nikolay Borisov authored
      Commit 8d875f95 ("btrfs: disable strict file flushes for
      renames and truncates") eliminated the notion of ordered operations and
      instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
      indicating that a file's content should be synced to disk in case a
      file is truncated and any writes happen to it concurrently. In fact
      this intendend behavior was broken until it was fixed in
      f6dc45c7 ("Btrfs: fix filemap_flush call in btrfs_file_release").
      
      All things considered let's give the flag a more descriptive name. Also
      slightly reword comments.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fd4033d
    • Madhuparna Bhowmik's avatar
      btrfs: annotate device name rcu_string with __rcu · 8d1a7aae
      Madhuparna Bhowmik authored
      This patch fixes the following sparse errors in
      fs/btrfs/super.c in function btrfs_show_devname()
      
        fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces):
        fs/btrfs/super.c:    struct rcu_string [noderef] <asn:4> *
        fs/btrfs/super.c:    struct rcu_string *
      
      The error was because of the following line in function btrfs_show_devname():
      
        if (first_dev)
      	 seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\");
      
      Annotating the btrfs_device::name member with __rcu fixes the sparse
      error.
      Acked-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: default avatarMadhuparna Bhowmik <madhuparnabhowmik04@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d1a7aae
    • Anand Jain's avatar
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain authored
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96c2e067
    • Josef Bacik's avatar
      btrfs: cleanup cow block on error · 572c83ac
      Josef Bacik authored
      In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
      to a system lockup. It gets stuck trying to write back inodes, and the
      write back thread was trying to lock an extent buffer:
      
        $ cat /proc/2143497/stack
        [<0>] __btrfs_tree_lock+0x108/0x250
        [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
        [<0>] btree_write_cache_pages+0x15a/0x3b0
        [<0>] do_writepages+0x28/0xb0
        [<0>] __writeback_single_inode+0x54/0x5c0
        [<0>] writeback_sb_inodes+0x1e8/0x510
        [<0>] wb_writeback+0xcc/0x440
        [<0>] wb_workfn+0xd7/0x650
        [<0>] process_one_work+0x236/0x560
        [<0>] worker_thread+0x55/0x3c0
        [<0>] kthread+0x13a/0x150
        [<0>] ret_from_fork+0x1f/0x30
      
      This is because we got an error while COWing a block, specifically here
      
              if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                      ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                      if (ret) {
                              btrfs_abort_transaction(trans, ret);
                              return ret;
                      }
              }
      
        [16402.241552] BTRFS: Transaction aborted (error -2)
        [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
        [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
        [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
        [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
        [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
        [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
        [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
        [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
        [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
        [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
        [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
        [16402.257867] Call Trace:
        [16402.258072]  btrfs_cow_block+0x109/0x230
        [16402.258356]  btrfs_search_slot+0x530/0x9d0
        [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
        [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
        [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
        [16402.259949]  btrfs_replace_file_extents+0x190/0x820
        [16402.260873]  btrfs_clone+0x9ae/0xc00
        [16402.261139]  btrfs_extent_same_range+0x66/0x90
        [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
        [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
        [16402.262821]  vfs_dedupe_file_range+0x189/0x220
        [16402.263150]  do_vfs_ioctl+0x552/0x700
        [16402.263662]  __x64_sys_ioctl+0x62/0xb0
        [16402.264023]  do_syscall_64+0x33/0x40
        [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [16402.264862] RIP: 0033:0x7f542c7d15cb
        [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
        [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
        [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
        [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
        [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
        [16402.271498] irq event stamp: 0
        [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [16402.274338] ---[ end trace 737874a5a41a8236 ]---
        [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.276179] BTRFS info (device dm-9): forced readonly
        [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
        [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
      
      The problem here is that as soon as we allocate the new block it is
      locked and marked dirty in the btree inode.  This means that we could
      attempt to writeback this block and need to lock the extent buffer.
      However we're not unlocking it here and thus we deadlock.
      
      Fix this by unlocking the cow block if we have any errors inside of
      __btrfs_cow_block, and also free it so we do not leak it.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      572c83ac
    • Goldwyn Rodrigues's avatar
      btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · e3c57805
      Goldwyn Rodrigues authored
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jth@kernel.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3c57805
    • Goldwyn Rodrigues's avatar
      fs: remove no longer used dio_end_io() · c33fe275
      Goldwyn Rodrigues authored
      Since we removed the last user of dio_end_io() when btrfs got converted
      to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
      the helper function dio_end_io().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c33fe275
    • Josef Bacik's avatar
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik authored
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92e26df4
    • Josef Bacik's avatar
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik authored
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      124604eb
    • Nikolay Borisov's avatar
      btrfs: remove struct extent_io_ops · 905eb88b
      Nikolay Borisov authored
      It's no longer used just remove the function and any related code which
      was initialising it for inodes. No functional changes.
      
      Removing 8 bytes from extent_io_tree in turn reduces size of other
      structures where it is embedded, notably btrfs_inode where it reduces
      size by 24 bytes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      905eb88b
    • Nikolay Borisov's avatar
      btrfs: call submit_bio_hook directly for metadata pages · 1b36294a
      Nikolay Borisov authored
      No need to go through a function pointer indirection simply call
      submit_bio_hook directly by exporting and renaming the helper to
      btrfs_submit_metadata_bio. This makes the code more readable and should
      result in somewhat faster code due to no longer paying the price for
      specualtive attack mitigations that come with indirect function calls.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b36294a
    • Nikolay Borisov's avatar
      btrfs: stop calling submit_bio_hook for data inodes · 908930f3
      Nikolay Borisov authored
      Instead export and rename the function to btrfs_submit_data_bio and
      call it directly in submit_one_bio. This avoids paying the cost for
      speculative attacks mitigations and improves code readability.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      908930f3
    • Nikolay Borisov's avatar
      btrfs: don't opencode is_data_inode in end_bio_extent_readpage · be17b3af
      Nikolay Borisov authored
      Use the is_data_inode helper.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be17b3af
    • Nikolay Borisov's avatar
      btrfs: call submit_bio_hook directly in submit_one_bio · cd053744
      Nikolay Borisov authored
      BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
      - ordinary data inodes (including the freespace inode) and the btree
      inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
      never be called from submit_one_bio so just remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd053744
    • Nikolay Borisov's avatar
      btrfs: remove extent_io_ops::readpage_end_io_hook · 1f03d9cf
      Nikolay Borisov authored
      It's no longer used so let's remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f03d9cf
    • Nikolay Borisov's avatar
      btrfs: replace readpage_end_io_hook with direct calls · 9a446d6a
      Nikolay Borisov authored
      Don't call readpage_end_io_hook for the btree inode.  Instead of relying
      on indirect calls to implement metadata buffer validation simply check
      if the inode whose page we are processing equals the btree inode. If it
      does call the necessary function.
      
      This is an improvement in 2 directions:
      
      1. We aren't paying the penalty of indirect calls in a post-speculation
         attacks world.
      
      2. The function is now named more explicitly so it's obvious what's
         going on
      
      This is in preparation to removing struct extent_io_ops altogether.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a446d6a
    • Filipe Manana's avatar
      btrfs: send, recompute reference path after orphanization of a directory · 9c2b4e03
      Filipe Manana authored
      During an incremental send, when an inode has multiple new references we
      might end up emitting rename operations for orphanizations that have a
      source path that is no longer valid due to a previous orphanization of
      some directory inode. This causes the receiver to fail since it tries
      to rename a path that does not exists.
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/f1
        touch /mnt/sdi/f2
        mkdir /mnt/sdi/d1
        mkdir /mnt/sdi/d1/d2
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- f1                   (ino 257)
        # |----- f2                   (ino 258)
        # |----- d1/                  (ino 259)
        #        |----- d2/           (ino 260)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now do a series of changes such that:
        #
        # *) inode 258 has one new hardlink and the previous name changed
        #
        # *) both names conflict with the old names of two other inodes:
        #
        #    1) the new name "d1" conflicts with the old name of inode 259,
        #       under directory inode 256 (root)
        #
        #    2) the new name "d2" conflicts with the old name of inode 260
        #       under directory inode 259
        #
        # *) inodes 259 and 260 now have the old names of inode 258
        #
        # *) inode 257 is now located under inode 260 - an inode with a number
        #    smaller than the inode (258) for which we created a second hard
        #    link and swapped its names with inodes 259 and 260
        #
        ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
        mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
      
        # Swap d1 and f2.
        mv /mnt/sdi/d1 /mnt/sdi/tmp
        mv /mnt/sdi/f2 /mnt/sdi/d1
        mv /mnt/sdi/tmp /mnt/sdi/f2
      
        # Swap d2 and f2_link
        mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
        mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
        mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
      
        # Filesystem now looks like:
        #
        # .                                (ino 256)
        # |----- d1                        (ino 258)
        # |----- f2/                       (ino 259)
        #        |----- f2_link/           (ino 260)
        #        |       |----- f1         (ino 257)
        #        |
        #        |----- d2                 (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When executed the receive of the incremental stream fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
      
      This happens because:
      
      1) When processing inode 257 we end up computing the name for inode 259
         because it is an ancestor in the send snapshot, and at that point it
         still has its old name, "d1", from the parent snapshot because inode
         259 was not yet processed. We then cache that name, which is valid
         until we start processing inode 259 (or set the progress to 260 after
         processing its references);
      
      2) Later we start processing inode 258 and collecting all its new
         references into the list sctx->new_refs. The first reference in the
         list happens to be the reference for name "d1" while the reference for
         name "d2" is next (the last element of the list).
         We compute the full path "d1/d2" for this second reference and store
         it in the reference (its ->full_path member). The path used for the
         new parent directory was "d1" and not "f2" because inode 259, the
         new parent, was not yet processed;
      
      3) When we start processing the new references at process_recorded_refs()
         we start with the first reference in the list, for the new name "d1".
         Because there is a conflicting inode that was not yet processed, which
         is directory inode 259, we orphanize it, renaming it from "d1" to
         "o259-6-0";
      
      4) Then we start processing the new reference for name "d2", and we
         realize it conflicts with the reference of inode 260 in the parent
         snapshot. So we issue an orphanization operation for inode 260 by
         emitting a rename operation with a destination path of "o260-6-0"
         and a source path of "d1/d2" - this source path is the value we
         stored in the reference earlier at step 2), corresponding to the
         ->full_path member of the reference, however that path is no longer
         valid due to the orphanization of the directory inode 259 in step 3).
         This makes the receiver fail since the path does not exists, it should
         have been "o259-6-0/d2".
      
      Fix this by recomputing the full path of a reference before emitting an
      orphanization if we previously orphanized any directory, since that
      directory could be a parent in the new path. This is a rare scenario so
      keeping it simple and not checking if that previously orphanized directory
      is in fact an ancestor of the inode we are trying to orphanize.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c2b4e03
    • Filipe Manana's avatar
      btrfs: send, orphanize first all conflicting inodes when processing references · 98272bb7
      Filipe Manana authored
      When doing an incremental send it is possible that when processing the new
      references for an inode we end up issuing rename or link operations that
      have an invalid path, which contains the orphanized name of a directory
      before we actually orphanized it, causing the receiver to fail.
      
      The following reproducer triggers such scenario:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/a
        touch /mnt/sdi/b
        mkdir /mnt/sdi/testdir
        # We want "a" to have a lower inode number then "testdir" (257 vs 259).
        mv /mnt/sdi/a /mnt/sdi/testdir/a
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- testdir/             (ino 259)
        # |          |----- a         (ino 257)
        # |
        # |----- b                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now rename 259 to "testdir_2", then change the name of 257 to
        # "testdir" and make it a direct descendant of the root inode (256).
        # Also create a new link for inode 257 with the old name of inode 258.
        # By swapping the names and location of several inodes and create a
        # nasty dependency chain of rename and link operations.
        mv /mnt/sdi/testdir/a /mnt/sdi/a2
        touch /mnt/sdi/testdir/a
        mv /mnt/sdi/b /mnt/sdi/b2
        ln /mnt/sdi/a2 /mnt/sdi/b
        mv /mnt/sdi/testdir /mnt/sdi/testdir_2
        mv /mnt/sdi/a2 /mnt/sdi/testdir
      
        # Filesystem now looks like:
        #
        # .                            (ino 256)
        # |----- testdir_2/            (ino 259)
        # |          |----- a          (ino 260)
        # |
        # |----- testdir               (ino 257)
        # |----- b                     (ino 257)
        # |----- b2                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When running the reproducer, the receive of the incremental send stream
      fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: link b -> o259-6-0/a failed: No such file or directory
      
      The problem happens because of the following:
      
      1) Before we start iterating the list of new references for inode 257,
         we generate its current path and store it at @valid_path, done at
         the very beginning of process_recorded_refs(). The generated path
         is "o259-6-0/a", containing the orphanized name for inode 259;
      
      2) Then we iterate over the list of new references, which has the
         references "b" and "testdir" in that specific order;
      
      3) We process reference "b" first, because it is in the list before
         reference "testdir". We then issue a link operation to create
         the new reference "b" using a target path corresponding to the
         content at @valid_path, which corresponds to "o259-6-0/a".
         However we haven't yet orphanized inode 259, its name is still
         "testdir", and not "o259-6-0". The orphanization of 259 did not
         happen yet because we will process the reference named "testdir"
         for inode 257 only in the next iteration of the loop that goes
         over the list of new references.
      
      Fix the issue by having a preliminar iteration over all the new references
      at process_recorded_refs(). This iteration is responsible only for doing
      the orphanization of other inodes that have and old reference that
      conflicts with one of the new references of the inode we are currently
      processing. The emission of rename and link operations happen now in the
      next iteration of the new references.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98272bb7
    • Qu Wenruo's avatar
      btrfs: tree-checker: fix false alert caused by legacy btrfs root item · 1465af12
      Qu Wenruo authored
      Commit 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      introduced btrfs root item size check, however btrfs root item has two
      versions, the legacy one which just ends before generation_v2 member, is
      smaller than current btrfs root item size.
      
      This caused btrfs kernel to reject valid but old tree root leaves.
      
      Fix this problem by also allowing legacy root item, since kernel can
      already handle them pretty well and upgrade to newer root item format
      when needed.
      Reported-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Fixes: 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Tested-By: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1465af12
    • David Sterba's avatar
      btrfs: use unaligned helpers for stack and header set/get helpers · e97659ce
      David Sterba authored
      In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
      pointer assignment but we should use the helpers for unaligned access
      for clarity. It hasn't been a problem so far because of the natural
      alignment.
      
      Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
      from stack that has an aligned start but some members may not be aligned
      due to packing. This as well hasn't caused problems so far.
      
      Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e97659ce
    • David Sterba's avatar
      btrfs: free-space-cache: use unaligned helpers to access data · 6994ca36
      David Sterba authored
      The free space inode stores the tracking data, checksums etc, using the
      io_ctl structure and moving the pointers. The data are generally aligned
      to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
      for clarity we should use the proper helpers whenever a struct is
      initialized from io_ctl->cur pointer.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6994ca36
    • David Sterba's avatar
      btrfs: send: use helpers for unaligned access to header members · e2f896b3
      David Sterba authored
      The header is mapped onto the send buffer and thus its members may be
      potentially unaligned so use the helpers instead of directly assigning
      the pointers. This has worked so far but let's use the helpers to make
      that clear.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2f896b3
    • Qu Wenruo's avatar
      btrfs: use own btree inode io_tree owner id · 2c53a14d
      Qu Wenruo authored
      Btree inode is special compared to all other inode extent io_trees,
      although it has a btrfs inode, it doesn't have the track_uptodate bit at
      all.
      
      This means a lot of things like extent locking doesn't even need to be
      applied to btree io tree.
      
      Since it's so special, adds a new owner value for it to make debuging a
      little easier.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c53a14d
    • Johannes Thumshirn's avatar
      btrfs: reschedule when cloning lots of extents · 6b613cc9
      Johannes Thumshirn authored
      We have several occurrences of a soft lockup from fstest's generic/175
      testcase, which look more or less like this one:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x77/0xa0
         panic+0xfa/0x2cb
         watchdog_timer_fn.cold+0x85/0xa5
         ? lockup_detector_update_enable+0x50/0x50
         __hrtimer_run_queues+0x99/0x4c0
         ? recalibrate_cpu_khz+0x10/0x10
         hrtimer_run_queues+0x9f/0xb0
         update_process_times+0x28/0x80
         tick_handle_periodic+0x1b/0x60
         __sysvec_apic_timer_interrupt+0x76/0x210
         asm_call_on_stack+0x12/0x20
         </IRQ>
         sysvec_apic_timer_interrupt+0x7f/0x90
         asm_sysvec_apic_timer_interrupt+0x12/0x20
        RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
        RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
        RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
        RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
        RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
        R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
        R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
         ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
         btrfs_release_path+0x67/0x80 [btrfs]
         btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
         btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
         btrfs_clone+0x9ba/0xbd0 [btrfs]
         btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
         ? file_update_time+0xcd/0x120
         btrfs_remap_file_range+0x322/0x3b0 [btrfs]
         do_clone_file_range+0xb7/0x1e0
         vfs_clone_file_range+0x30/0xa0
         ioctl_file_clone+0x8a/0xc0
         do_vfs_ioctl+0x5b2/0x6f0
         __x64_sys_ioctl+0x37/0xa0
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f87977fc247
        RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
        RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
        RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
        R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
        R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
      
      All of these lockup reports have the call chain btrfs_clone_files() ->
      btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
      both source and destination extents locked and loops over the source
      extent to create the clones.
      
      Conditionally reschedule in the btrfs_clone() loop, to give some time back
      to other processes.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b613cc9
    • Denis Efremov's avatar
      btrfs: use kvcalloc for allocation in btrfs_ioctl_send() · bae12df9
      Denis Efremov authored
      Replace kvzalloc() call with kvcalloc() that also checks the size
      internally. There's a standalone overflow check in the function so we
      can return invalid parameter combination.  Use array_size() helper to
      compute the memory size for clone_sources_tmp.
      
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDenis Efremov <efremov@linux.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bae12df9
    • Denis Efremov's avatar
      btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send() · 8eb2fd00
      Denis Efremov authored
      btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
      The code was accidentally replaced with kzalloc() call [1]. Restore
      the original code by using kvzalloc() to allocate sctx->clone_roots.
      
      [1] https://patchwork.kernel.org/patch/9757891/#20529627
      
      Fixes: 818e010b ("btrfs: replace opencoded kvzalloc with the helper")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarDenis Efremov <efremov@linux.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8eb2fd00
    • Nikolay Borisov's avatar
      btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Nikolay Borisov authored
      The passed in ordered_extent struct is always well-formed and contains
      the inode making the explicit argument redundant.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0a43603
    • Nikolay Borisov's avatar
      btrfs: remove inode argument from add_pending_csums · 510f85ed
      Nikolay Borisov authored
      It's used to reference the csum root which can be done from the trans
      handle as well. Simplify the signature and while at it also remove the
      noinline attribute as the function uses only at most 16 bytes of stack
      space.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      510f85ed
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar