1. 27 Aug, 2020 5 commits
    • Josef Bacik's avatar
      btrfs: set the correct lockdep class for new nodes · ad244665
      Josef Bacik authored
      When flipping over to the rw_semaphore I noticed I'd get a lockdep splat
      in replace_path(), which is weird because we're swapping the reloc root
      with the actual target root.  Turns out this is because we're using the
      root->root_key.objectid as the root id for the newly allocated tree
      block when setting the lockdep class, however we need to be using the
      actual owner of this new block, which is saved in owner.
      
      The affected path is through btrfs_copy_root as all other callers of
      btrfs_alloc_tree_block (which calls init_new_buffer) have root_objectid
      == root->root_key.objectid .
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad244665
    • Josef Bacik's avatar
      btrfs: allocate scrub workqueues outside of locks · e89c4a9c
      Josef Bacik authored
      I got the following lockdep splat while testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00172-g021118712e59 #932 Not tainted
        ------------------------------------------------------
        btrfs/229626 is trying to acquire lock:
        ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
      
        but task is already holding lock:
        ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_scrub_dev+0x11c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_run_dev_stats+0x49/0x480
      	 commit_cowonly_roots+0xb5/0x2a0
      	 btrfs_commit_transaction+0x516/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_commit_transaction+0x4bb/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_record_root_in_trans+0x43/0x70
      	 start_transaction+0xd1/0x5d0
      	 btrfs_dirty_inode+0x42/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 perf_read+0x141/0x2c0
      	 vfs_read+0xad/0x1b0
      	 ksys_read+0x5f/0xe0
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x88/0x150
      	 perf_event_init+0x1db/0x20b
      	 start_kernel+0x3ae/0x53c
      	 secondary_startup_64+0xa4/0xb0
      
        -> #1 (pmus_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x4f/0x150
      	 cpuhp_invoke_callback+0xb1/0x900
      	 _cpu_up.constprop.26+0x9f/0x130
      	 cpu_up+0x7b/0xc0
      	 bringup_nonboot_cpus+0x4f/0x60
      	 smp_init+0x26/0x71
      	 kernel_init_freeable+0x110/0x258
      	 kernel_init+0xa/0x103
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (cpu_hotplug_lock){++++}-{0:0}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 cpus_read_lock+0x39/0xb0
      	 alloc_workqueue+0x378/0x450
      	 __btrfs_alloc_workqueue+0x15d/0x200
      	 btrfs_alloc_workqueue+0x51/0x160
      	 scrub_workers_get+0x5a/0x170
      	 btrfs_scrub_dev+0x18c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->scrub_lock);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->scrub_lock);
          lock(cpu_hotplug_lock);
      
         *** DEADLOCK ***
      
        2 locks held by btrfs/229626:
         #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
         #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        stack backtrace:
        CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? alloc_workqueue+0x378/0x450
         cpus_read_lock+0x39/0xb0
         ? alloc_workqueue+0x378/0x450
         alloc_workqueue+0x378/0x450
         ? rcu_read_lock_sched_held+0x52/0x80
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         ? start_transaction+0xd1/0x5d0
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ? do_sigaction+0x102/0x250
         ? lockdep_hardirqs_on_prepare+0xca/0x160
         ? _raw_spin_unlock_irq+0x24/0x30
         ? trace_hardirqs_on+0x1c/0xe0
         ? _raw_spin_unlock_irq+0x24/0x30
         ? do_sigaction+0x102/0x250
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because we're allocating the scrub workqueues under the
      scrub and device list mutex, which brings in a whole host of other
      dependencies.
      
      Because the work queue allocation is done with GFP_KERNEL, it can
      trigger reclaim, which can lead to a transaction commit, which in turns
      needs the device_list_mutex, it can lead to a deadlock. A different
      problem for which this fix is a solution.
      
      Fix this by moving the actual allocation outside of the
      scrub lock, and then only take the lock once we're ready to actually
      assign them to the fs_info.  We'll now have to cleanup the workqueues in
      a few more places, so I've added a helper to do the refcount dance to
      safely free the workqueues.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e89c4a9c
    • Josef Bacik's avatar
      btrfs: fix potential deadlock in the search ioctl · a48b73ec
      Josef Bacik authored
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
        ------------------------------------------------------
        compsize/11122 is trying to acquire lock:
        ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
      
        but task is already holding lock:
        ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-fs-00){++++}-{3:3}:
      	 down_write_nested+0x3b/0x70
      	 __btrfs_tree_lock+0x24/0x120
      	 btrfs_search_slot+0x756/0x990
      	 btrfs_lookup_inode+0x3a/0xb4
      	 __btrfs_update_delayed_inode+0x93/0x270
      	 btrfs_async_run_delayed_root+0x168/0x230
      	 btrfs_work_helper+0xd4/0x570
      	 process_one_work+0x2ad/0x5f0
      	 worker_thread+0x3a/0x3d0
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_delayed_update_inode+0x50/0x440
      	 btrfs_update_inode+0x8a/0xf0
      	 btrfs_dirty_inode+0x5b/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 copy_to_sk.isra.32+0x121/0x300
      	 search_ioctl+0x106/0x200
      	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
      	 btrfs_ioctl+0x106f/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-fs-00);
      				 lock(&delayed_node->mutex);
      				 lock(btrfs-fs-00);
          lock(&mm->mmap_lock#2);
      
         *** DEADLOCK ***
      
        1 lock held by compsize/11122:
         #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __might_fault+0x3e/0x90
         ? find_held_lock+0x72/0x90
         __might_fault+0x68/0x90
         ? __might_fault+0x3e/0x90
         _copy_to_user+0x1e/0x80
         copy_to_sk.isra.32+0x121/0x300
         ? btrfs_search_forward+0x2a6/0x360
         search_ioctl+0x106/0x200
         btrfs_ioctl_tree_search_v2+0x7b/0xf0
         btrfs_ioctl+0x106f/0x30a0
         ? __do_sys_newfstat+0x5a/0x70
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is we're doing a copy_to_user() while holding tree locks,
      which can deadlock if we have to do a page fault for the copy_to_user().
      This exists even without my locking changes, so it needs to be fixed.
      Rework the search ioctl to do the pre-fault and then
      copy_to_user_nofault for the copying.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a48b73ec
    • Josef Bacik's avatar
      btrfs: drop path before adding new uuid tree entry · 9771a5cf
      Josef Bacik authored
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
        ------------------------------------------------------
        btrfs-uuid/7955 is trying to acquire lock:
        ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        but task is already holding lock:
        ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-uuid-00){++++}-{3:3}:
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_uuid_tree_add+0x89/0x2d0
      	 btrfs_uuid_scan_kthread+0x330/0x390
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_find_root+0x45/0x1b0
      	 btrfs_read_tree_root+0x61/0x100
      	 btrfs_get_root_ref.part.50+0x143/0x630
      	 btrfs_uuid_tree_iterate+0x207/0x314
      	 btrfs_uuid_rescan_kthread+0x12/0x50
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-uuid-00);
      				 lock(btrfs-root-00);
      				 lock(btrfs-uuid-00);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        1 lock held by btrfs-uuid/7955:
         #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __btrfs_tree_read_lock+0x39/0x180
         ? btrfs_root_node+0x1c/0x1d0
         down_read_nested+0x3e/0x140
         ? __btrfs_tree_read_lock+0x39/0x180
         __btrfs_tree_read_lock+0x39/0x180
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x4bd/0x990
         btrfs_find_root+0x45/0x1b0
         btrfs_read_tree_root+0x61/0x100
         btrfs_get_root_ref.part.50+0x143/0x630
         btrfs_uuid_tree_iterate+0x207/0x314
         ? btree_readpage+0x20/0x20
         btrfs_uuid_rescan_kthread+0x12/0x50
         kthread+0x133/0x150
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This problem exists because we have two different rescan threads,
      btrfs_uuid_scan_kthread which creates the uuid tree, and
      btrfs_uuid_tree_iterate that goes through and updates or deletes any out
      of date roots.  The problem is they both do things in different order.
      btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
      into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
      then does a btrfs_get_fs_root() which can read from the tree_root.
      
      It's actually easy enough to not be holding the path in
      btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
      it further down and re-start the search when we loop.  So simply move
      the path release before we add our entry to the uuid tree.
      
      This also fixes a problem where we're holding a path open after we do
      btrfs_end_transaction(), which has it's own problems.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9771a5cf
    • Marcos Paulo de Souza's avatar
      btrfs: block-group: fix free-space bitmap threshold · e3e39c72
      Marcos Paulo de Souza authored
      [BUG]
      After commit 9afc6649 ("btrfs: block-group: refactor how we read one
      block group item"), cache->length is being assigned after calling
      btrfs_create_block_group_cache. This causes a problem since
      set_free_space_tree_thresholds calculates the free-space threshold to
      decide if the free-space tree should convert from extents to bitmaps.
      
      The current code calls set_free_space_tree_thresholds with cache->length
      being 0, which then makes cache->bitmap_high_thresh zero. This implies
      the system will always use bitmap instead of extents, which is not
      desired if the block group is not fragmented.
      
      This behavior can be seen by a test that expects to repair systems
      with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
      created FREE_SPACE_BITMAP.
      
      [FIX]
      Call set_free_space_tree_thresholds after setting cache->length. There
      is now a WARN_ON in set_free_space_tree_thresholds to help preventing
      the same mistake to happen again in the future.
      
      Link: https://github.com/kdave/btrfs-progs/issues/251
      Fixes: 9afc6649 ("btrfs: block-group: refactor how we read one block group item")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3e39c72
  2. 21 Aug, 2020 2 commits
    • Boris Burkov's avatar
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov authored
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is competely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must until 'btrfs subvolu sync' finished before
      activating the swapfile swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a84d5d42
    • Josef Bacik's avatar
      btrfs: check the right error variable in btrfs_del_dir_entries_in_log · fb2fecba
      Josef Bacik authored
      With my new locking code dbench is so much faster that I tripped over a
      transaction abort from ENOSPC.  This turned out to be because
      btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
      function sets err on error, and returns err.  So instead of properly
      marking the inode as needing a full commit, we were returning -ENOSPC
      and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
      variable so that we return the correct thing in the case of ENOSPC.
      
      The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
      can return -ENOENT if the dir item isn't in the tree log (which would
      happen if we hadn't fsync'ed this guy).  We actually handle that case in
      __btrfs_unlink_inode, so it's an expected error to get back.
      
      Fixes: 4a500fd1 ("Btrfs: Metadata ENOSPC handling for tree log")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note and comment about ENOENT ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb2fecba
  3. 19 Aug, 2020 4 commits
    • Filipe Manana's avatar
      btrfs: fix space cache memory leak after transaction abort · bbc37d6e
      Filipe Manana authored
      If a transaction aborts it can cause a memory leak of the pages array of
      a block group's io_ctl structure. The following steps explain how that can
      happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
         io_ctl_init(). Block group A is added to the io_list of transaction
         N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
      We can solve the leak simply by freeing the pages array right after
      releasing the pages (with the call to io_ctl_drop_pages()) at
      __btrfs_write_out_cache(), since we will never use it anymore after that
      and the pages array points to already released pages at that point, which
      is currently not a problem since no one will use it after that, but not a
      good practice anyway since it can easily lead to use-after-free issues.
      
      So fix this by freeing the pages array right after releasing the pages at
      __btrfs_write_out_cache().
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbc37d6e
    • David Sterba's avatar
      btrfs: use the correct const function attribute for btrfs_get_num_csums · 604997b4
      David Sterba authored
      The build robot reports
      
      compiler: h8300-linux-gcc (GCC) 9.3.0
         In file included from fs/btrfs/tests/extent-map-tests.c:8:
      >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          2166 | size_t __const btrfs_get_num_csums(void);
               |        ^~~~~~~
      
      The function attribute for const does not follow the expected scheme and
      in this case is confused with a const type qualifier.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      604997b4
    • Marcos Paulo de Souza's avatar
      btrfs: reset compression level for lzo on remount · 282dd7d7
      Marcos Paulo de Souza authored
      Currently a user can set mount "-o compress" which will set the
      compression algorithm to zlib, and use the default compress level for
      zlib (3):
      
        relatime,compress=zlib:3,space_cache
      
      If the user remounts the fs using "-o compress=lzo", then the old
      compress_level is used:
      
        relatime,compress=lzo:3,space_cache
      
      But lzo does not expose any tunable compression level. The same happens
      if we set any compress argument with different level, also with zstd.
      
      Fix this by resetting the compress_level when compress=lzo is
      specified.  With the fix applied, lzo is shown without compress level:
      
        relatime,compress=lzo,space_cache
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      282dd7d7
    • Johannes Thumshirn's avatar
      btrfs: handle errors from async submission · c965d640
      Johannes Thumshirn authored
      Btrfs' async submit mechanism is able to handle errors in the submission
      path and the meta-data async submit function correctly passes the error
      code to the caller.
      
      In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
      not handling the errors returned by btrfs_csum_one_bio() correctly though
      and simply call BUG_ON(). This is unnecessary as the caller of these two
      functions - run_one_async_start - correctly checks for the return values
      and sets the status of the async_submit_bio. The actual bio submission
      will be handled later on by run_one_async_done only if
      async_submit_bio::status is 0, so the data won't be written if we
      encountered an error in the checksum process.
      
      Simply return the error from btrfs_csum_one_bio() to the async submitters,
      like it's done in btree_submit_bio_start().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c965d640
  4. 12 Aug, 2020 1 commit
    • Qu Wenruo's avatar
      btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Qu Wenruo authored
      [BUG]
      The following script can lead to tons of beyond device boundary access:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that's
      already trimmed.
      
      So we check device->alloc_state by finding the first range which doesn't
      have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
      
      But if we shrunk the device, that bits are not cleared, thus we could
      easily got a range starts beyond the shrunk device size.
      
      This results the returned @start and @end are all beyond device size,
      then we call "end = min(end, device->total_bytes -1);" making @end
      smaller than device size.
      
      Then finally we goes "len = end - start + 1", totally underflow the
      result, and lead to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
        This is the root fix
      
      - Add extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond current
        device.
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c57dd1f2
  5. 11 Aug, 2020 1 commit
  6. 10 Aug, 2020 8 commits
    • Qu Wenruo's avatar
      btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups() · 62ab2cc0
      Qu Wenruo authored
      [BUG]
      Unmounting a btrfs filesystem with quota disabled will cause the
      following NULL pointer dereference:
      
        BTRFS info (device dm-5): has skinny extents
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        CPU: 7 PID: 637 Comm: umount Not tainted 5.8.0-rc7-next-20200731-custom #76
        RIP: 0010:kobject_del+0x6/0x20
        Call Trace:
         btrfs_sysfs_del_qgroups+0xac/0xf0 [btrfs]
         btrfs_free_qgroup_config+0x63/0x70 [btrfs]
         close_ctree+0x1f5/0x323 [btrfs]
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         exit_to_user_mode_prepare+0x18a/0x190
         syscall_exit_to_user_mode+0x4f/0x270
         do_syscall_64+0x45/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 37b7adca5c1d5c5d ]---
      
      [CAUSE]
      Commit 079ad2fb ("kobject: Avoid premature parent object freeing in
      kobject_cleanup()") changed kobject_del() that it no longer accepts NULL
      pointer.
      
      Before that commit, kobject_del() and kobject_put() all accept NULL
      pointers and just ignore such NULL pointers.
      
      But that mentioned commit needs to access the parent node, killing the
      old NULL pointer behavior.
      
      Unfortunately btrfs is relying on that hidden feature thus we will
      trigger such NULL pointer dereference.
      
      [FIX]
      Instead of just saving several lines, do proper fs_info->qgroups_kobj
      check before calling kobject_del() and kobject_put().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62ab2cc0
    • Boleyn Su's avatar
      btrfs: check correct variable after allocation in btrfs_backref_iter_alloc · c15c2ec0
      Boleyn Su authored
      The `if (!ret)` check will always be false and it may result in
      ret->path being dereferenced while it is a NULL pointer.
      
      Fixes: a37f232b ("btrfs: backref: introduce the skeleton of btrfs_backref_iter")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoleyn Su <boleynsu@google.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c15c2ec0
    • Josef Bacik's avatar
      btrfs: make sure SB_I_VERSION doesn't get unset by remount · faa00889
      Josef Bacik authored
      There's some inconsistency around SB_I_VERSION handling with mount and
      remount.  Since we don't really want it to be off ever just work around
      this by making sure we don't get the flag cleared on remount.
      
      There's a tiny cpu cost of setting the bit, otherwise all changes to
      i_version also change some of the times (ctime/mtime) so the inode needs
      to be synced. We wouldn't save anything by disabling it.
      Reported-by: default avatarEric Sandeen <sandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add perf impact analysis ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      faa00889
    • Filipe Manana's avatar
      btrfs: fix memory leaks after failure to lookup checksums during inode logging · 4f26433e
      Filipe Manana authored
      While logging an inode, at copy_items(), if we fail to lookup the checksums
      for an extent we release the destination path, free the ins_data array and
      then return immediately. However a previous iteration of the for loop may
      have added checksums to the ordered_sums list, in which case we leak the
      memory used by them.
      
      So fix this by making sure we iterate the ordered_sums list and free all
      its checksums before returning.
      
      Fixes: 3650860b ("Btrfs: remove almost all of the BUG()'s from tree-log.c")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4f26433e
    • Josef Bacik's avatar
      btrfs: don't show full path of bind mounts in subvol= · 3ef3959b
      Josef Bacik authored
      Chris Murphy reported a problem where rpm ostree will bind mount a bunch
      of things for whatever voodoo it's doing.  But when it does this
      /proc/mounts shows something like
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
      
      Despite subvolid=256 being subvol=/foo.  This is because we're just
      spitting out the dentry of the mount point, which in the case of bind
      mounts is the source path for the mountpoint.  Instead we should spit
      out the path to the actual subvol.  Fix this by looking up the name for
      the subvolid we have mounted.  With this fix the same test looks like
      this
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
      Reported-by: default avatarChris Murphy <chris@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ef3959b
    • David Sterba's avatar
      btrfs: fix messages after changing compression level by remount · 27942c99
      David Sterba authored
      Reported by Forza on IRC that remounting with compression options does
      not reflect the change in level, or at least it does not appear to do so
      according to the messages:
      
        mount -o compress=zstd:1 /dev/sda /mnt
        mount -o remount,compress=zstd:15 /mnt
      
      does not print the change to the level to syslog:
      
        [   41.366060] BTRFS info (device vda): use zstd compression, level 1
        [   41.368254] BTRFS info (device vda): disk space caching is enabled
        [   41.390429] BTRFS info (device vda): disk space caching is enabled
      
      What really happens is that the message is lost but the level is actualy
      changed.
      
      There's another weird output, if compression is reset to 'no':
      
        [   45.413776] BTRFS info (device vda): use no compression, level 4
      
      To fix that, save the previous compression level and print the message
      in that case too and use separate message for 'no' compression.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27942c99
    • Josef Bacik's avatar
      btrfs: only search for left_info if there is no right_info in try_merge_free_space · bf53d468
      Josef Bacik authored
      In try_to_merge_free_space we attempt to find entries to the left and
      right of the entry we are adding to see if they can be merged.  We
      search for an entry past our current info (saved into right_info), and
      then if right_info exists and it has a rb_prev() we save the rb_prev()
      into left_info.
      
      However there's a slight problem in the case that we have a right_info,
      but no entry previous to that entry.  At that point we will search for
      an entry just before the info we're attempting to insert.  This will
      simply find right_info again, and assign it to left_info, making them
      both the same pointer.
      
      Now if right_info _can_ be merged with the range we're inserting, we'll
      add it to the info and free right_info.  However further down we'll
      access left_info, which was right_info, and thus get a use-after-free.
      
      Fix this by only searching for the left entry if we don't find a right
      entry at all.
      
      The CVE referenced had a specially crafted file system that could
      trigger this use-after-free. However with the tree checker improvements
      we no longer trigger the conditions for the UAF.  But the original
      conditions still apply, hence this fix.
      
      Reference: CVE-2019-19448
      Fixes: 96303081 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf53d468
    • Qu Wenruo's avatar
      btrfs: inode: fix NULL pointer dereference if inode doesn't need compression · 1e6e238c
      Qu Wenruo authored
      [BUG]
      There is a bug report of NULL pointer dereference caused in
      compress_file_extent():
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
        NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
        LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
        Call Trace:
        [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
        [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
        [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
        [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
        [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
        [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
        [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
        ---[ end trace f16954aa20d822f6 ]---
      
      [CAUSE]
      For the following execution route of compress_file_range(), it's
      possible to hit NULL pointer dereference:
      
       compress_file_extent()
       |- pages = NULL;
       |- start = async_chunk->start = 0;
       |- end = async_chunk = 4095;
       |- nr_pages = 1;
       |- inode_need_compress() == false; <<< Possible, see later explanation
       |  Now, we have nr_pages = 1, pages = NULL
       |- cont:
       |- 		ret = cow_file_range_inline();
       |- 		if (ret <= 0) {
       |-		for (i = 0; i < nr_pages; i++) {
       |-			WARN_ON(pages[i]->mapping);	<<< Crash
      
      To enter above call execution branch, we need the following race:
      
          Thread 1 (chattr)     |            Thread 2 (writeback)
      --------------------------+------------------------------
                                | btrfs_run_delalloc_range
                                | |- inode_need_compress = true
                                | |- cow_file_range_async()
      btrfs_ioctl_set_flag()    |
      |- binode_flags |=        |
         BTRFS_INODE_NOCOMPRESS |
                                | compress_file_range()
                                | |- inode_need_compress = false
                                | |- nr_page = 1 while pages = NULL
                                | |  Then hit the crash
      
      [FIX]
      This patch will fix it by checking @pages before doing accessing it.
      This patch is only designed as a hot fix and easy to backport.
      
      More elegant fix may make btrfs only check inode_need_compress() once to
      avoid such race, but that would be another story.
      Reported-by: default avatarLuciano Chavez <chavez@us.ibm.com>
      Fixes: 4d3a800e ("btrfs: merge nr_pages input and output parameter in compress_pages")
      CC: stable@vger.kernel.org # 4.14.x: cecc8d90: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e6e238c
  7. 27 Jul, 2020 19 commits
    • Filipe Manana's avatar
      btrfs: do not set the full sync flag on the inode during page release · 5e548b32
      Filipe Manana authored
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we always set the full
      sync flag on the inode, which forces the next fsync to use a slower code
      path.
      
      This hurts performance for workloads that dirty an amount of data that
      exceeds or is very close to the system's RAM memory and do frequent fsync
      operations (like database servers can for example). In particular if there
      are concurrent fsyncs against different files, by falling back to a full
      fsync we do a lot more checksum lookups in the checksums btree, as we do
      it for all the extents created in the current transaction, instead of only
      the new ones since the last fsync. These checksums lookups not only take
      some time but, more importantly, they also cause contention on the
      checksums btree locks due to the concurrency with checksum insertions in
      the btree by ordered extents from other inodes.
      
      We actually don't need to set the full sync flag on the inode, because we
      only remove extent maps that are in the list of modified extents if they
      were created in a past transaction, in which case an fsync skips them as
      it's pointless to log them. So stop setting the full fsync flag on the
      inode whenever we remove an extent map.
      
      This patch is part of a patchset that consists of 3 patches, which have
      the following subjects:
      
      1/3 btrfs: fix race between page release and a fast fsync
      2/3 btrfs: release old extent maps during page release
      3/3 btrfs: do not set the full sync flag on the inode during page release
      
      Performance tests were ran against a branch (misc-next) containing the
      whole patchset. The test exercises a workload where there are multiple
      processes writing to files and fsyncing them (each writing and fsyncing
      its own file), and in total the amount of data dirtied ranges from 2x to
      4x the system's RAM memory (16GiB), so that the page release callback is
      invoked frequently.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following, and the last line printed by
      fio is pasted (includes aggregated throughput and test run time).
      
          *****************************************************
          ****     1 job, 32GiB file, fsync frequency 1     ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
      
      After patchset:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
      (+0.7% throughput, -0.8% run time)
      
          *****************************************************
          ****     2 jobs, 16GiB files, fsync frequency 1   ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
      
      After patchset:
      
      WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
      (+19.1% throughput, -16.1% runtime)
      
          *****************************************************
          ****     4 jobs, 8GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
      
      After patchset:
      
      WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
      (+37.8% throughput, -27.5% runtime)
      
          *****************************************************
          ****     8 jobs, 4GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
      
      After patchset:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
      (+74.7% throughput, -43.0% run time)
      
          *****************************************************
          ****    16 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
      
      After patchset:
      
      WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
      (+146.3% throughput, -59.5% run time)
      
          *****************************************************
          ****    32 jobs, 2GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
      
      After patchset:
      
      WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
      (+212.7% throughput, -68.0% run time)
      
          *****************************************************
          ****    64 jobs, 1GiB files, fsync frequency 1    ****
          *****************************************************
      
      Before patchset:
      
      WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
      
      After patchset:
      
      WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
      (+197.6% throughput, -66.4% run time)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e548b32
    • Filipe Manana's avatar
      btrfs: release old extent maps during page release · fbc2bd7e
      Filipe Manana authored
      When removing an extent map at try_release_extent_mapping(), called through
      the page release callback (btrfs_releasepage()), we never release an extent
      map that is in the list of modified extents. This is to prevent races with
      a concurrent fsync using the fast path, which could lead to not logging an
      extent created in the current transaction.
      
      However we can safely remove an extent map created in a past transaction
      that is still in the list of modified extents (because no one fsynced yet
      the inode after that transaction got commited), because such extents are
      skipped during an fsync as it is pointless to log them. This change does
      that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbc2bd7e
    • Filipe Manana's avatar
      btrfs: fix race between page release and a fast fsync · 3d6448e6
      Filipe Manana authored
      When releasing an extent map, done through the page release callback, we
      can race with an ongoing fast fsync and cause the fsync to miss a new
      extent and not log it. The steps for this to happen are the following:
      
      1) A page is dirtied for some inode I;
      
      2) Writeback for that page is triggered by a path other than fsync, for
         example by the system due to memory pressure;
      
      3) When the ordered extent for the extent (a single 4K page) finishes,
         we unpin the corresponding extent map and set its generation to N,
         the current transaction's generation;
      
      4) The btrfs_releasepage() callback is invoked by the system due to
         memory pressure for that no longer dirty page of inode I;
      
      5) At the same time, some task calls fsync on inode I, joins transaction
         N, and at btrfs_log_inode() it sees that the inode does not have the
         full sync flag set, so we proceed with a fast fsync. But before we get
         into btrfs_log_changed_extents() and lock the inode's extent map tree:
      
      6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
         and we remove the extent map for the new 4Kb extent, because it is
         neither pinned anymore nor locked. By calling remove_extent_mapping(),
         we remove the extent map from the list of modified extents, since the
         extent map does not have the logging flag set. We unlock the inode's
         extent map tree;
      
      7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
         locks the inode's extent map tree and iterates its list of modified
         extents, which no longer has the 4Kb extent in it, so it does not log
         the extent;
      
      8) The fsync finishes;
      
      9) Before transaction N is committed, a power failure happens. After
         replaying the log, the 4K extent of inode I will be missing, since
         it was not logged due to the race with try_release_extent_mapping().
      
      So fix this by teaching try_release_extent_mapping() to not remove an
      extent map if it's still in the list of modified extents.
      
      Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d6448e6
    • Johannes Thumshirn's avatar
      btrfs: open-code remount flag setting in btrfs_remount · 88c4703f
      Johannes Thumshirn authored
      When we're (re)mounting a btrfs filesystem we set the
      BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
      reclaim or defrags.
      
      This flag is set in btrfs_remount_prepare() called by btrfs_remount().
      As btrfs_remount_prepare() does nothing but setting this flag and
      doesn't have a second caller, we can just open-code the flag setting in
      btrfs_remount().
      
      Similarly do for so clearing of the flag by moving it out of
      btrfs_remount_cleanup() into btrfs_remount() to be symmetrical.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88c4703f
    • Josef Bacik's avatar
      btrfs: if we're restriping, use the target restripe profile · 162e0a16
      Josef Bacik authored
      Previously we depended on some weird behavior in our chunk allocator to
      force the allocation of new stripes, so by the time we got to doing the
      reduce we would usually already have a chunk with the proper target.
      
      However that behavior causes other problems and needs to be removed.
      First however we need to remove this check to only restripe if we
      already have those available profiles, because if we're allocating our
      first chunk it obviously will not be available.  Simply use the target
      as specified, and if that fails it'll be because we're out of space.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      162e0a16
    • Josef Bacik's avatar
      btrfs: don't adjust bg flags and use default allocation profiles · 349e120e
      Josef Bacik authored
      btrfs/061 has been failing consistently for me recently with a
      transaction abort.  We run out of space in the system chunk array, which
      means we've allocated way too many system chunks than we need.
      
      Chris added this a long time ago for balance as a poor mans restriping.
      If you had a single disk and then added another disk and then did a
      balance, update_block_group_flags would then figure out which RAID level
      you needed.
      
      Fast forward to today and we have restriping behavior, so we can
      explicitly tell the fs that we're trying to change the raid level.  This
      is accomplished through the normal get_alloc_profile path.
      
      Furthermore this code actually causes btrfs/061 to fail, because we do
      things like mkfs -m dup -d single with multiple devices.  This trips
      this check
      
      alloc_flags = update_block_group_flags(fs_info, cache->flags);
      if (alloc_flags != cache->flags) {
      	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
      
      in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
      not actually restriping, we keep forcing chunk allocation of RAID1
      chunks.  This eventually causes us to run out of system space and the
      file system aborts and flips read only.
      
      We don't need this poor mans restriping any more, simply use the normal
      get_alloc_profile helper, which will get the correct alloc_flags and
      thus make the right decision for chunk allocation.  This keeps us from
      allocating a billion system chunks and falling over.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      349e120e
    • Josef Bacik's avatar
      btrfs: fix lockdep splat from btrfs_dump_space_info · ab0db043
      Josef Bacik authored
      When running with -o enospc_debug you can get the following splat if one
      of the dump_space_info's trip
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G           OE
        ------------------------------------------------------
        dd/563090 is trying to acquire lock:
        ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      
        but task is already holding lock:
        ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&cache->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
      	 find_free_extent+0x7ef/0x13b0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x122/0x530 [btrfs]
      	 btrfs_cow_block+0x106/0x210 [btrfs]
      	 commit_cowonly_roots+0x55/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&space_info->lock){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
      	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
      	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
      	 clear_state_bit+0x81/0x1a0 [btrfs]
      	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
      	 clear_extent_bit+0x15/0x20 [btrfs]
      	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
      	 truncate_cleanup_page+0x47/0xe0
      	 truncate_inode_pages_range+0x238/0x840
      	 truncate_pagecache+0x44/0x60
      	 btrfs_setattr+0x202/0x5e0 [btrfs]
      	 notify_change+0x33b/0x490
      	 do_truncate+0x76/0xd0
      	 path_openat+0x687/0xa10
      	 do_filp_open+0x91/0x100
      	 do_sys_openat2+0x215/0x2d0
      	 do_sys_open+0x44/0x80
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&tree->lock#2){+.+.}-{2:2}:
      	 _raw_spin_lock+0x25/0x30
      	 find_first_extent_bit+0x32/0x150 [btrfs]
      	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
      	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
      	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
      	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
      	 commit_cowonly_roots+0x245/0x300 [btrfs]
      	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
      	 close_ctree+0xf9/0x2f5 [btrfs]
      	 generic_shutdown_super+0x6c/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20 [btrfs]
      	 deactivate_locked_super+0x36/0x70
      	 cleanup_mnt+0x104/0x160
      	 task_work_run+0x5f/0x90
      	 __prepare_exit_to_usermode+0x1bd/0x1c0
      	 do_syscall_64+0x5e/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 _raw_spin_lock+0x25/0x30
      	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
      	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
      	 btrfs_reserve_extent+0x176/0x180 [btrfs]
      	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
      	 cache_save_setup+0x28d/0x3b0 [btrfs]
      	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
      	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
      	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
      	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
      	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
      	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
      	 new_sync_write+0x11e/0x1b0
      	 vfs_write+0x1c9/0x200
      	 ksys_write+0x68/0xe0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &ctl->tree_lock --> &space_info->lock --> &cache->lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&cache->lock);
      				 lock(&space_info->lock);
      				 lock(&cache->lock);
          lock(&ctl->tree_lock);
      
         *** DEADLOCK ***
      
        6 locks held by dd/563090:
         #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
         #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
         #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
         #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
         #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
         #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
      
        stack backtrace:
        CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? wake_up_klogd.part.0+0x30/0x40
         lock_acquire+0xab/0x360
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         _raw_spin_lock+0x25/0x30
         ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_free_space+0x2b/0xa0 [btrfs]
         btrfs_dump_space_info+0xf4/0x120 [btrfs]
         btrfs_reserve_extent+0x176/0x180 [btrfs]
         __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
         ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
         cache_save_setup+0x28d/0x3b0 [btrfs]
         btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
         btrfs_commit_transaction+0xcc/0xac0 [btrfs]
         ? start_transaction+0xe0/0x5b0 [btrfs]
         btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
         btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
         btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
         ? ktime_get_coarse_real_ts64+0xa8/0xd0
         ? trace_hardirqs_on+0x1c/0xe0
         btrfs_file_write_iter+0x3cf/0x610 [btrfs]
         new_sync_write+0x11e/0x1b0
         vfs_write+0x1c9/0x200
         ksys_write+0x68/0xe0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because we're holding the block_group->lock while trying to dump
      the free space cache.  However we don't need this lock, we just need it
      to read the values for the printk, so move the free space cache dumping
      outside of the block group lock.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab0db043
    • Josef Bacik's avatar
      btrfs: move the chunk_mutex in btrfs_read_chunk_tree · 01d01caf
      Josef Bacik authored
      We are currently getting this lockdep splat in btrfs/161:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G            E
        ------------------------------------------------------
        mount/678048 is trying to acquire lock:
        ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
      
        but task is already holding lock:
        ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x8b/0x8f0
      	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
      	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
      	 ksys_ioctl+0x87/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 __mutex_lock+0x8b/0x8f0
      	 clone_fs_devices+0x4d/0x170 [btrfs]
      	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
      	 open_ctree+0xb7c/0x18ce [btrfs]
      	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 do_mount+0x7de/0xb30
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->chunk_mutex);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->chunk_mutex);
          lock(&fs_devs->device_list_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by mount/678048:
         #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
         #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? asm_sysvec_apic_timer_interrupt+0x12/0x20
         lock_acquire+0xab/0x360
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         __mutex_lock+0x8b/0x8f0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? rcu_read_lock_sched_held+0x52/0x60
         ? cpumask_next+0x16/0x20
         ? module_assert_mutex_or_preempt+0x14/0x40
         ? __module_address+0x28/0xf0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? static_obj+0x4f/0x60
         ? lockdep_init_map_waits+0x43/0x200
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         clone_fs_devices+0x4d/0x170 [btrfs]
         btrfs_read_chunk_tree+0x330/0x800 [btrfs]
         open_ctree+0xb7c/0x18ce [btrfs]
         ? super_setup_bdi_name+0x79/0xd0
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? vfs_parse_fs_string+0x84/0xb0
         ? rcu_read_lock_sched_held+0x52/0x60
         ? kfree+0x2b5/0x310
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         fc_mount+0xe/0x40
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x13b/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? rcu_read_lock_sched_held+0x52/0x60
         ? legacy_get_tree+0x30/0x50
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         do_mount+0x7de/0xb30
         ? memdup_user+0x4e/0x90
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
      then read the device, which takes the device_list_mutex.  The
      device_list_mutex needs to be taken before the chunk_mutex, so this is a
      problem.  We only really need the chunk mutex around adding the chunk,
      so move the mutex around read_one_chunk.
      
      An argument could be made that we don't even need the chunk_mutex here
      as it's during mount, and we are protected by various other locks.
      However we already have special rules for ->device_list_mutex, and I'd
      rather not have another special case for ->chunk_mutex.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      01d01caf
    • Josef Bacik's avatar
      btrfs: open device without device_list_mutex · 18c850fd
      Josef Bacik authored
      There's long existed a lockdep splat because we open our bdev's under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you do loopback devices at all
      suddenly the bd_mutex comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However it can also be modified by doing a scan on a device.  But this
      action is specifically protected by the uuid_mutex, which we are holding
      here.  We cannot race with opening at this point because we have the
      ->s_mount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to change
      the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18c850fd
    • Josef Bacik's avatar
      btrfs: sysfs: use NOFS for device creation · a47bd78d
      Josef Bacik authored
      Dave hit this splat during testing btrfs/078:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc6-default+ #1191 Not tainted
        ------------------------------------------------------
        kswapd0/75 is trying to acquire lock:
        ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      
        but task is already holding lock:
        ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 __kmalloc_track_caller+0x49/0x330
      	 kstrdup+0x2e/0x60
      	 __kernfs_new_node.constprop.0+0x44/0x250
      	 kernfs_new_node+0x25/0x50
      	 kernfs_create_link+0x34/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs]
      	 btrfs_init_new_device+0x44c/0x12b0 [btrfs]
      	 btrfs_ioctl+0xc3c/0x25c0 [btrfs]
      	 ksys_ioctl+0x68/0xa0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0xe0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 btrfs_chunk_alloc+0x137/0x3e0 [btrfs]
      	 find_free_extent+0xb44/0xfb0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x143/0x7a0 [btrfs]
      	 btrfs_cow_block+0x15f/0x310 [btrfs]
      	 push_leaf_right+0x150/0x240 [btrfs]
      	 split_leaf+0x3cd/0x6d0 [btrfs]
      	 btrfs_search_slot+0xd14/0xf70 [btrfs]
      	 btrfs_insert_empty_items+0x64/0xc0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs]
      	 btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs]
      	 btrfs_work_helper+0x2f9/0x650 [btrfs]
      	 process_one_work+0x22c/0x600
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x98/0xa20
      	 validate_chain+0xa8c/0x2a00
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      	 btrfs_evict_inode+0x3bf/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x175/0x420
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x192/0x600
      	 balance_pgdat+0x31f/0x750
      	 kswapd+0x206/0x510
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/75:
         #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x16f/0x190
         check_prev_add+0x98/0xa20
         validate_chain+0xa8c/0x2a00
         __lock_acquire+0x56f/0xaa0
         lock_acquire+0xa3/0x440
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? __lock_acquire+0x56f/0xaa0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? lock_acquire+0xa3/0x440
         ? btrfs_evict_inode+0x138/0x560 [btrfs]
         ? btrfs_evict_inode+0x2fe/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         btrfs_evict_inode+0x3bf/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x175/0x420
         shrink_slab+0xb1/0x2e0
         shrink_node+0x192/0x600
         balance_pgdat+0x31f/0x750
         kswapd+0x206/0x510
         ? _raw_spin_unlock_irqrestore+0x3e/0x50
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_stop+0x2a0/0x2a0
         ret_from_fork+0x1f/0x30
      
      This is because we're holding the chunk_mutex while adding this device
      and adding its sysfs entries.  We actually hold different locks in
      different places when calling this function, the dev_replace semaphore
      for instance in dev replace, so instead of moving this call around
      simply wrap it's operations in NOFS.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a47bd78d
    • Josef Bacik's avatar
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik authored
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: default avatarEric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbabd4a3
    • Josef Bacik's avatar
      btrfs: document special case error codes for fs errors · 59131393
      Josef Bacik authored
      We've had some discussions about what to do in certain scenarios for
      error codes, specifically EUCLEAN and EROFS.  Document these near the
      error handling code so its clear what their intentions are.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59131393
    • Josef Bacik's avatar
      btrfs: don't WARN if we abort a transaction with EROFS · f95ebdbe
      Josef Bacik authored
      If we got some sort of corruption via a read and call
      btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
      complain.  If a subsequent trans handle trips over this it'll get EROFS
      and then abort.  However at that point we're not aborting for the
      original reason, we're aborting because we've been flipped read only.
      We do not need to WARN_ON() here.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f95ebdbe
    • Filipe Manana's avatar
      btrfs: reduce contention on log trees when logging checksums · 3ebac17c
      Filipe Manana authored
      The possibility of extents being shared (through clone and deduplication
      operations) requires special care when logging data checksums, to avoid
      having a log tree with different checksum items that cover ranges which
      overlap (which resulted in missing checksums after replaying a log tree).
      Such problems were fixed in the past by the following commits:
      
      commit 40e046ac ("Btrfs: fix missing data checksums after replaying a
                            log tree")
      
      commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of
                            inodes with shared extents")
      
      Test case generic/588 exercises the scenario solved by the first commit
      (purely sequential and deterministic) while test case generic/457 often
      triggered the case fixed by the second commit (not deterministic, requires
      specific timings under concurrency).
      
      The problems were addressed by deleting, from the log tree, any existing
      checksums before logging the new ones. And also by doing the deletion and
      logging of the cheksums while locking the checksum range in an extent io
      tree (root->log_csum_range), to deal with the case where we have concurrent
      fsyncs against files with shared extents.
      
      That however causes more contention on the leaves of a log tree where we
      store checksums (and all the nodes in the paths leading to them), even
      when we do not have shared extents, or all the shared extents were created
      by past transactions. It also adds a bit of contention on the spin lock of
      the log_csums_range extent io tree of the log root.
      
      This change adds a 'last_reflink_trans' field to the inode to keep track
      of the last transaction where a new extent was shared between inodes
      (through clone and deduplication operations). It is updated for both the
      source and destination inodes of reflink operations whenever a new extent
      (created in the current transaction) becomes shared by the inodes. This
      field is kept in memory only, not persisted in the inode item, similar
      to other existing fields (last_unlink_trans, logged_trans).
      
      When logging checksums for an extent, if the value of 'last_reflink_trans'
      is smaller then the current transaction's generation/id, we skip locking
      the extent range and deletion of checksums from the log tree, since we
      know we do not have new shared extents. This reduces contention on the
      log tree's leaves where checksums are stored.
      
      The following script, which uses fio, was used to measure the impact of
      this change:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following (the last line of fio's output was
      pasted). Starting with 16 jobs is where a significant difference is
      observable in this particular setup and hardware (differences highlighted
      below). The very small differences for tests with less than 16 jobs are
      possibly just noise and random.
      
          **** 1 job, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
      
      after this change:
      
      WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
      
          **** 2 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
      
      after this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
      
          **** 4 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
      
      after this change:
      
      WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
      
          **** 8 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
      
      after this change:
      
      WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
      
          **** 16 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
      
      after this change:
      
      WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
      (+40.1% throughput, -28.5% runtime)
      
          **** 32 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
      
      after this change:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
      (+29.4% throughput, -22.7% runtime)
      
          **** 64 jobs, file size 512M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
      
      after this change:
      
      WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
      (+29.3% throughput, -22.7% runtime)
      
          **** 128 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
      
      after this change:
      
      WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
      (+121.2% throughput, -54.6% runtime)
      
          **** 256 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
      
      after this change:
      
      WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
      (+74.9% throughput, -42.8% runtime)
      
          **** 512 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
      
      after this change:
      
      WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
      (+65.1% throughput, -39.3% runtime)
      
          **** 1024 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
      
      after this change:
      
      WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
      (+48.7% throughput, -33.1% runtime)
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ebac17c
    • Nikolay Borisov's avatar
      btrfs: remove done label in writepage_delalloc · b69d1ee9
      Nikolay Borisov authored
      Since there is not common cleanup run after the label it makes it
      somewhat redundant.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b69d1ee9
    • Qu Wenruo's avatar
      btrfs: add comments for btrfs_reserve_flush_enum · fd7fb634
      Qu Wenruo authored
      This enum is the interface exposed to developers.
      
      Although we have a detailed comment explaining the whole idea of space
      flushing at the beginning of space-info.c, the exposed enum interface
      doesn't have any comment.
      
      Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
      BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
      not explained at all.
      
      So add some simple comments for these enums as a quick reference.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd7fb634
    • Qu Wenruo's avatar
      btrfs: relocation: review the call sites which can be interrupted by signal · 44d354ab
      Qu Wenruo authored
      Since most metadata reservation calls can return -EINTR when get
      interrupted by fatal signal, we need to review the all the metadata
      reservation call sites.
      
      In relocation code, the metadata reservation happens in the following
      sites:
      
      - btrfs_block_rsv_refill() in merge_reloc_root()
        merge_reloc_root() is a pretty critical section, we don't want to be
        interrupted by signal, so change the flush status to
        BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
        Since such change can be ENPSPC-prone, also shrink the amount of
        metadata to reserve least amount avoid deadly ENOSPC there.
      
      - btrfs_block_rsv_refill() in reserve_metadata_space()
        It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
        by signal.
      
      - btrfs_block_rsv_refill() in prepare_to_relocate()
      
      - btrfs_block_rsv_add() in prepare_to_relocate()
      
      - btrfs_block_rsv_refill() in relocate_block_group()
      
      - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
      
      - btrfs_start_transaction() in relocate_block_group()
      
      - btrfs_start_transaction() in create_reloc_inode()
        Can be interrupted by fatal signal and we can handle it easily.
        For these call sites, just catch the -EINTR value in btrfs_balance()
        and count them as canceled.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44d354ab
    • Qu Wenruo's avatar
      btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree · f3e3d9cc
      Qu Wenruo authored
      [BUG]
      There is a bug report about bad signal timing could lead to read-only
      fs during balance:
      
        BTRFS info (device xvdb): balance: start -d -m -s
        BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
        BTRFS info (device xvdb): found 12236 extents, stage: move data extents
        BTRFS info (device xvdb): relocating block group 71928119296 flags data
        BTRFS info (device xvdb): found 3 extents, stage: move data extents
        BTRFS info (device xvdb): found 3 extents, stage: update data pointers
        BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
        BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
        BTRFS info (device xvdb): forced readonly
        BTRFS info (device xvdb): balance: ended with status: -4
      
      [CAUSE]
      The direct cause is the -EINTR from the following call chain when a
      fatal signal is pending:
      
       relocate_block_group()
       |- clean_dirty_subvols()
          |- btrfs_drop_snapshot()
             |- btrfs_start_transaction()
                |- btrfs_delayed_refs_rsv_refill()
                   |- btrfs_reserve_metadata_bytes()
                      |- __reserve_metadata_bytes()
                         |- wait_reserve_ticket()
                            |- prepare_to_wait_event();
                            |- ticket->error = -EINTR;
      
      Normally this behavior is fine for most btrfs_start_transaction()
      callers, as they need to catch any other error, same for the signal, and
      exit ASAP.
      
      However for balance, especially for the clean_dirty_subvols() case, we're
      already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
      could cause a lot of unexpected problems.
      
      From the mentioned forced read-only report, to later balance error due
      to half dropped reloc trees.
      
      [FIX]
      Fix this problem by using btrfs_join_transaction() if
      btrfs_drop_snapshot() is called from relocation context.
      
      Since btrfs_join_transaction() won't get interrupted by signal, we can
      continue the cleanup.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>3
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3e3d9cc
    • Qu Wenruo's avatar
      btrfs: relocation: allow signal to cancel balance · 5cb502f4
      Qu Wenruo authored
      Although btrfs balance can be canceled with "btrfs balance cancel"
      command, it's still almost muscle memory to press Ctrl-C to cancel a
      long running btrfs balance.
      
      So allow btrfs balance to check signal to determine if it should exit.
      The cancellation points are in known location and we're only adding one
      more reason, so this should be safe.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5cb502f4