1. 08 Dec, 2020 40 commits
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Josef Bacik's avatar
      btrfs: remove extent_buffer::recursed · a55463c9
      Josef Bacik authored
      It is unused everywhere now, it can be removed.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a55463c9
    • Josef Bacik's avatar
      btrfs: remove the recurse parameter from __btrfs_tree_read_lock · 0ecae6ff
      Josef Bacik authored
      It is completely unused now, remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ecae6ff
    • Josef Bacik's avatar
      btrfs: use btrfs_tree_read_lock in btrfs_search_slot · fe596ca3
      Josef Bacik authored
      We no longer use recursion, so
      __btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
      Replace this call with the simple helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe596ca3
    • Josef Bacik's avatar
      btrfs: merge back btrfs_read_lock_root_node helpers · 1bb96598
      Josef Bacik authored
      We no longer have recursive locking and there's no need for separate
      helpers that allowed the transition to rwsem with minimal code changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bb96598
    • Josef Bacik's avatar
      btrfs: locking: remove the recursion handling code · 4048daed
      Josef Bacik authored
      Now that we're no longer using recursion, rip out all of the supporting
      code.  Follow up patches will clean up the callers of these functions.
      
      The extent_buffer::lock_owner is still retained as it allows safety
      checks in btrfs_init_new_buffer for the case that the free space cache
      is corrupted and we try to allocate a block that we are currently using
      and have locked in the path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4048daed
    • Josef Bacik's avatar
      btrfs: remove btrfs_path::recurse · 2f5239dc
      Josef Bacik authored
      With my async free space cache loading patches ("btrfs: load free space
      cache asynchronously") we no longer have a user of path->recurse and can
      remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f5239dc
    • Josef Bacik's avatar
      btrfs: unlock to current level in btrfs_next_old_leaf · 0e46318d
      Josef Bacik authored
      Filipe reported the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc2-btrfs-next-71 #1 Not tainted
        ------------------------------------------------------
        find/324157 is trying to acquire lock:
        ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        but task is already holding lock:
        ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
      	 lock_acquire+0xd8/0x490
      	 down_write_nested+0x44/0x120
      	 __btrfs_tree_lock+0x27/0x120 [btrfs]
      	 btrfs_search_slot+0x2a3/0xc50 [btrfs]
      	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
      	 insert_with_overflow+0x44/0x110 [btrfs]
      	 btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
      	 btrfs_setxattr+0xd6/0x4c0 [btrfs]
      	 btrfs_setxattr_trans+0x68/0x100 [btrfs]
      	 __vfs_setxattr+0x66/0x80
      	 __vfs_setxattr_noperm+0x70/0x200
      	 vfs_setxattr+0x6b/0x120
      	 setxattr+0x125/0x240
      	 path_setxattr+0xba/0xd0
      	 __x64_sys_setxattr+0x27/0x30
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 __lock_acquire+0x1689/0x3130
      	 lock_acquire+0xd8/0x490
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      	 btrfs_next_old_leaf+0x27d/0x580 [btrfs]
      	 btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
      	 iterate_dir+0x170/0x1c0
      	 __x64_sys_getdents64+0x83/0x140
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-00);
      				 lock(btrfs-tree-01#2/3);
      				 lock(btrfs-tree-00);
          lock(btrfs-tree-01#2/3);
      
         *** DEADLOCK ***
      
        5 locks held by find/324157:
         #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
         #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
         #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         check_noncircular+0xff/0x110
         ? mark_lock.part.0+0x468/0xe90
         check_prev_add+0x91/0xc60
         __lock_acquire+0x1689/0x3130
         ? kvm_clock_read+0x14/0x30
         ? kvm_sched_clock_read+0x5/0x10
         lock_acquire+0xd8/0x490
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         down_read_nested+0x45/0x220
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         btrfs_next_old_leaf+0x27d/0x580 [btrfs]
         btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
         iterate_dir+0x170/0x1c0
         __x64_sys_getdents64+0x83/0x140
         ? filldir+0x1d0/0x1d0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because btrfs_next_old_leaf searches down to our current
      key, and then walks up the path until we can move to the next slot, and
      then reads back down the path so we get the next leaf.
      
      However it doesn't unlock any lower levels until it replaces them with
      the new extent buffer.  This is technically fine, but of course causes
      lockdep to complain, because we could be holding locks on lower levels
      while locking upper levels.
      
      Fix this by dropping all nodes below the level that we use as our new
      starting point before we start reading back down the path.  This also
      allows us to drop the nested/recursive locking magic, because we're no
      longer locking two nodes at the same level anymore.
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e46318d
    • Josef Bacik's avatar
      btrfs: cleanup the locking in btrfs_next_old_leaf · ffeb03cf
      Josef Bacik authored
      We are carrying around this next_rw_lock from when we would do spinning
      vs blocking read locks.  Now that we have the rwsem locking we can
      simply use the read lock flag unconditionally and the read lock helpers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffeb03cf
    • Anand Jain's avatar
      btrfs: remove unused argument seed from btrfs_find_device · b2598edf
      Anand Jain authored
      Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
      check if the parameter seed is true in the function btrfs_find_device().
      This tells it whether to traverse the seed device list or not.
      
      After this commit, the argument is unused and can be removed.
      
      In device_list_add() it's not necessary because fs_devices always points
      to the device's fs_devices. So with the devid+uuid matching, it will
      find the right device and return, thus not needing to traverse seed
      devices.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2598edf
    • Anand Jain's avatar
      btrfs: drop never met disk total bytes check in verify_one_dev_extent · 3a160a93
      Anand Jain authored
      Drop the condition in verify_one_dev_extent,
      btrfs_device::disk_total_bytes is set even for a seed device. The
      comment is wrong, the size is properly set when cloning the device.
      
      Commit 1b3922a8 ("btrfs: Use real device structure to verify
      dev extent") introduced it but it's unclear why the total_disk_bytes
      was 0.
      
      Theoretically, all devices (including missing and seed) marked with the
      BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
      fill_device_from_item():
      
        open_ctree()
          btrfs_read_chunk_tree()
            read_one_dev()
              open_seed_device()
              fill_device_from_item()
      
      Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
      a bug to be fixed somewhere else and not in verify_one_dev_extent() as
      it's just a messenger. It is never expected that a total_disk_bytes
      shall be zero.
      
      The function fill_device_from_item() does the job of reading it from the
      item and updating btrfs_device::disk_total_bytes. So both the missing
      device and the seed devices do have their disk_total_bytes updated.
      btrfs_find_device can also return a device from fs_info->seed_list
      because it searches it as well.
      
      Furthermore, while removing the device if there is a power loss, we
      could have a device with its total_bytes = 0, that's still valid.
      
      Instead, introduce a check against maximum block device size in
      read_one_dev().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a160a93
    • Anand Jain's avatar
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain authored
      Commit cf89af14 ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi stage operation of
      btrfs_free_extra_devids() that does not need to check replace target
      anymore and we can remove the 'step' argument.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacce86a
    • Filipe Manana's avatar
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana authored
      There are several occasions where we do not update the inode's number of
      used bytes atomically, resulting in a concurrent stat(2) syscall to report
      a value of used blocks that does not correspond to a valid value, that is,
      a value that does not match neither what we had before the operation nor
      what we get after the operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for some userspace tools where they can consider a file
      with a non-zero size and zero used blocks as completely sparse and skip
      reading data, as reported/discussed a long time ago in some threads like
      the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                   echo -n "ERROR: unexpected used blocks"
                   echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes, it should
        # not because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be, does not match the number
      of used blocks before or after the clone/deduplication/zero operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has much less extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progres,
        # the number of used blocks/file size reported by stat should not change
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and locking the whole range (0 to (u64)-1) while it
      it computes the number of blocks used. But that would mean blocking
      stat(2), which is a very used syscall and expected to be fast, waiting
      for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2766ff61
    • Filipe Manana's avatar
      btrfs: fix race when defragmenting leads to unnecessary IO · 7f458a38
      Filipe Manana authored
      When defragmenting we skip ranges that have holes or inline extents, so that
      we don't do unnecessary IO and waste space. We do this check when calling
      should_defrag_range() at btrfs_defrag_file(). However we do it without
      holding the inode's lock. The reason we do it like this is to avoid
      blocking other tasks for too long, that possibly want to operate on other
      file ranges, since after the call to should_defrag_range() and before
      locking the inode, we trigger a synchronous page cache readahead. However
      before we were able to lock the inode, some other task might have punched
      a hole in our range, or we may now have an inline extent there, in which
      case we should not set the range for defrag anymore since that would cause
      unnecessary IO and make us waste space (i.e. allocating extents to contain
      zeros for a hole).
      
      So after we locked the inode and the range in the iotree, check again if
      we have holes or an inline extent, and if we do, just skip the range.
      
      I hit this while testing my next patch that fixes races when updating an
      inode's number of bytes (subject "btrfs: update the number of bytes used
      by an inode atomically"), and it depends on this change in order to work
      correctly. Alternatively I could rework that other patch to detect holes
      and flag their range with the 'new delalloc' bit, but this itself fixes
      an efficiency problem due a race that from a functional point of view is
      not harmful (it could be triggered with btrfs/062 from fstests).
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f458a38
    • Filipe Manana's avatar
      btrfs: refactor btrfs_drop_extents() to make it easier to extend · 5893dfb9
      Filipe Manana authored
      There are many arguments for __btrfs_drop_extents() and its wrapper
      btrfs_drop_extents(), which makes it hard to add more arguments to it and
      requires changing every caller. I have added a couple myself back in 2014
      commit 1acae57b ("Btrfs: faster file extent item replace operations")
      and therefore know firsthand that it is a bit cumbersome to add additional
      arguments to these functions.
      
      Since I will need to add more arguments in a subsequent bug fix, this
      change is preparatory work and adds a data structure that holds all the
      arguments, for both input and output, that are passed to this function,
      with some comments in the structure's definition mentioning what each
      field is and how it relates to other fields.
      
      Callers of this function need only to zero out the content of the
      structure and setup only the fields they need. This also removes the
      need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
      now we have a single function named btrfs_drop_extents() that takes a
      pointer to this new data structure (struct btrfs_drop_extents_args).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5893dfb9
    • Josef Bacik's avatar
      btrfs: set the lockdep class for extent buffers on creation · e114c545
      Josef Bacik authored
      Both Filipe and Fedora QA recently hit the following lockdep splat:
      
        WARNING: possible recursive locking detected
        5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1 Not tainted
        --------------------------------------------
        rsync/2610 is trying to acquire lock:
        ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        but task is already holding lock:
        ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0
      	 ----
          lock(&eb->lock);
          lock(&eb->lock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
        2 locks held by rsync/2610:
         #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
         #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        stack backtrace:
        CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x8b/0xb0
         __lock_acquire.cold+0x12d/0x2a4
         ? kvm_sched_clock_read+0x14/0x30
         ? sched_clock+0x5/0x10
         lock_acquire+0xc8/0x400
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         ? read_block_for_search.isra.0+0xdd/0x320
         _raw_read_lock+0x3d/0xa0
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_search_slot+0x616/0x9a0
         btrfs_lookup_dir_item+0x6c/0xb0
         btrfs_lookup_dentry+0xa8/0x520
         ? lockdep_init_map_waits+0x4c/0x210
         btrfs_lookup+0xe/0x30
         __lookup_slow+0x10f/0x1e0
         walk_component+0x11b/0x190
         path_lookupat+0x72/0x1c0
         filename_lookup+0x97/0x180
         ? strncpy_from_user+0x96/0x1e0
         ? getname_flags.part.0+0x45/0x1a0
         vfs_statx+0x64/0x100
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         __do_sys_newlstat+0x26/0x40
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? syscall_enter_from_user_mode+0x27/0x80
         ? syscall_enter_from_user_mode+0x27/0x80
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I have also seen a report of lockdep complaining about the lock class
      that was looked up being the same as the lock class on the lock we were
      using, but I can't find the report.
      
      These are problems that occur because we do not have the lockdep class
      set on the extent buffer until _after_ we read the eb in properly.  This
      is problematic for concurrent readers, because we will create the extent
      buffer, lock it, and then attempt to read the extent buffer.
      
      If a second thread comes in and tries to do a search down the same path
      they'll get the above lockdep splat because the class isn't set properly
      on the extent buffer.
      
      There was a good reason for this, we generally didn't know the real
      owner of the eb until we read it, specifically in refcounted roots.
      
      However now all refcounted roots have the same class name, so we no
      longer need to worry about this.  For non-refcounted trees we know
      which root we're on based on the parent.
      
      Fix this by setting the lockdep class on the eb at creation time instead
      of read time.  This will fix the splat and the weirdness where the class
      changes in the middle of locking the block.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e114c545
    • Josef Bacik's avatar
      btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      Josef Bacik authored
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fbaf258
    • Josef Bacik's avatar
      btrfs: pass the root owner and level around for readahead · 5d81230b
      Josef Bacik authored
      The readahead infrastructure does raw reads of extent buffers, but we're
      going to need to know their owner and level in order to set the lockdep
      key properly, so plumb in the infrastructure that we'll need to have
      this information when we start allocating extent buffers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d81230b
    • Josef Bacik's avatar
      btrfs: pass root owner to read_tree_block · 1b7ec85e
      Josef Bacik authored
      In order to properly set the lockdep class of a newly allocated block we
      need to know the owner of the block.  For non-refcounted trees this is
      straightforward, we always know in advance what tree we're reading from.
      For refcounted trees we don't necessarily know, however all refcounted
      trees share the same lockdep class name, tree-<level>.
      
      Fix all the callers of read_tree_block() to pass in the root objectid
      we're using.  In places like relocation and backref we could probably
      unconditionally use 0, but just in case use the root when we have it,
      otherwise use 0 in the cases we don't have the root as it's going to be
      a refcounted tree anyway.
      
      This is a preparation patch for further changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b7ec85e
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree · 182c79fc
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      182c79fc
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks · 3acfbd6a
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3acfbd6a
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap · 6b2cb7cb
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b2cb7cb
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_tree · c990ada2
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c990ada2
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in replace_path · 6b3426be
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b3426be
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in do_relocation · c9752536
      Josef Bacik authored
      We're open coding btrfs_read_node_slot in do_relocation, replace this
      with the proper helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9752536
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_reloc_tree · 8ef385bb
      Josef Bacik authored
      We do not need to call read_tree_block() here, simply use the
      btrfs_read_node_slot helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ef385bb
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in btrfs_realloc_node · 206983b7
      Josef Bacik authored
      We have this open-coded nightmare in btrfs_realloc_node that does
      the same thing that the normal read path does, which is to see if we
      have the eb in memory already, and if not read it, and verify the eb is
      uptodate.  Delete this open coding and simply use btrfs_read_node_slot.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      206983b7
    • Josef Bacik's avatar
      btrfs: cleanup extent buffer readahead · bfb484d9
      Josef Bacik authored
      We're going to pass around more information when we allocate extent
      buffers, in order to make that cleaner how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have it's arguments
      adjusted.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfb484d9
    • Josef Bacik's avatar
      btrfs: remove lockdep classes for the fs tree · 416e3445
      Josef Bacik authored
      We have this weird problem where our lockdep class is set after we
      read a tree block, which can race with concurrent readers and result in
      erroneous lockdep errors.  We want to set the lockdep class at
      allocation time if possible, but in certain cases we may not have the
      actual root owner, such as with relocation or any backref lookups.  This
      is only really a problem for reference counted trees, because all other
      trees have their root reference set in their extent reference.  Remove
      the fs tree specific lock class.  We need to still keep the reloc tree
      one, it's still reference counted, because replace_path will lock the
      reloc tree and the destination tree, and if they're both set to
      tree-<level> we'll have issues.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      416e3445
    • Pavel Begunkov's avatar
      btrfs: discard: reschedule work after sysfs param update · 3e48d8d2
      Pavel Begunkov authored
      After sysfs updates discard's iops_limit or kbps_limit it also needs to
      adjust current timer through rescheduling, otherwise the discard work
      may wait for a long time for the previous timer to expire or bumped by
      someone else.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e48d8d2
    • Pavel Begunkov's avatar
      btrfs: don't miss async discards after scheduled work override · df903e5d
      Pavel Begunkov authored
      If btrfs_discard_schedule_work() is called with override=true, it sets
      delay anew regardless how much time is left until the timer should have
      fired. If delays are long (that can happen, for example, with low
      kbps_limit), they might get constantly overridden without having a
      chance to run the discard work.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      df903e5d
    • Pavel Begunkov's avatar
      btrfs: discard: store async discard delay as ns not as jiffies · 6e88f116
      Pavel Begunkov authored
      Most delay calculations are done in ns or ms, so store
      discard_ctl->delay in ms and convert the final delay to jiffies only at
      the end.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e88f116
    • Pavel Begunkov's avatar
      btrfs: discard: speed up async discard up to iops_limit · e50404a8
      Pavel Begunkov authored
      Instead of using iops_limit only for cutting off extremes, calculate the
      discard delay directly from it, so it closely follows iops_limit and
      doesn't under-discard even though quotas are not saturated.
      
      The iops limit could be hit more often in some cases and could increase
      the discard rate.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e50404a8
    • Qu Wenruo's avatar
      btrfs: scrub: refactor scrub_find_csum() · 480a8ec8
      Qu Wenruo authored
      Function scrub_find_csum() is to locate the csum for bytenr @logical
      from sctx->csum_list.
      
      However it lacks a lot of comments to explain things like how the
      csum_list is organized and why we need to drop csum range which is
      before us.
      
      Refactor the function by:
      
      - Add more comments explaining the behavior
      - Add comment explaining why we need to drop the csum range
      - Put the csum copy in the main loop
        This is mostly for the incoming patches to make scrub_find_csum() able
        to find multiple checksums.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      480a8ec8
    • Qu Wenruo's avatar
      btrfs: scrub: remove the force parameter from scrub_pages · 96e63a45
      Qu Wenruo authored
      The @force parameter for scrub_pages() is to indicate whether we want to
      force bio submission.  Currently it's only used for the super block,
      and it can be easily determined by the @flags, so we can remove the
      parameter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96e63a45
    • Qu Wenruo's avatar
      btrfs: scrub: distinguish scrub page from regular page · 261d2dcb
      Qu Wenruo authored
      There are several call sites where we declare something like
      "struct scrub_page *page".
      
      This is confusing as we also use regular page in this code,
      rename it to 'spage' where applicable.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      261d2dcb