1. 08 Dec, 2020 40 commits
    • Anand Jain's avatar
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain authored
      Commit cf89af14 ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi stage operation of
      btrfs_free_extra_devids() that does not need to check replace target
      anymore and we can remove the 'step' argument.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacce86a
    • Filipe Manana's avatar
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana authored
      There are several occasions where we do not update the inode's number of
      used bytes atomically, resulting in a concurrent stat(2) syscall to report
      a value of used blocks that does not correspond to a valid value, that is,
      a value that does not match neither what we had before the operation nor
      what we get after the operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for some userspace tools where they can consider a file
      with a non-zero size and zero used blocks as completely sparse and skip
      reading data, as reported/discussed a long time ago in some threads like
      the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                   echo -n "ERROR: unexpected used blocks"
                   echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes, it should
        # not because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be, does not match the number
      of used blocks before or after the clone/deduplication/zero operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has much less extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progres,
        # the number of used blocks/file size reported by stat should not change
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and locking the whole range (0 to (u64)-1) while it
      it computes the number of blocks used. But that would mean blocking
      stat(2), which is a very used syscall and expected to be fast, waiting
      for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2766ff61
    • Filipe Manana's avatar
      btrfs: fix race when defragmenting leads to unnecessary IO · 7f458a38
      Filipe Manana authored
      When defragmenting we skip ranges that have holes or inline extents, so that
      we don't do unnecessary IO and waste space. We do this check when calling
      should_defrag_range() at btrfs_defrag_file(). However we do it without
      holding the inode's lock. The reason we do it like this is to avoid
      blocking other tasks for too long, that possibly want to operate on other
      file ranges, since after the call to should_defrag_range() and before
      locking the inode, we trigger a synchronous page cache readahead. However
      before we were able to lock the inode, some other task might have punched
      a hole in our range, or we may now have an inline extent there, in which
      case we should not set the range for defrag anymore since that would cause
      unnecessary IO and make us waste space (i.e. allocating extents to contain
      zeros for a hole).
      
      So after we locked the inode and the range in the iotree, check again if
      we have holes or an inline extent, and if we do, just skip the range.
      
      I hit this while testing my next patch that fixes races when updating an
      inode's number of bytes (subject "btrfs: update the number of bytes used
      by an inode atomically"), and it depends on this change in order to work
      correctly. Alternatively I could rework that other patch to detect holes
      and flag their range with the 'new delalloc' bit, but this itself fixes
      an efficiency problem due a race that from a functional point of view is
      not harmful (it could be triggered with btrfs/062 from fstests).
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f458a38
    • Filipe Manana's avatar
      btrfs: refactor btrfs_drop_extents() to make it easier to extend · 5893dfb9
      Filipe Manana authored
      There are many arguments for __btrfs_drop_extents() and its wrapper
      btrfs_drop_extents(), which makes it hard to add more arguments to it and
      requires changing every caller. I have added a couple myself back in 2014
      commit 1acae57b ("Btrfs: faster file extent item replace operations")
      and therefore know firsthand that it is a bit cumbersome to add additional
      arguments to these functions.
      
      Since I will need to add more arguments in a subsequent bug fix, this
      change is preparatory work and adds a data structure that holds all the
      arguments, for both input and output, that are passed to this function,
      with some comments in the structure's definition mentioning what each
      field is and how it relates to other fields.
      
      Callers of this function need only to zero out the content of the
      structure and setup only the fields they need. This also removes the
      need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
      now we have a single function named btrfs_drop_extents() that takes a
      pointer to this new data structure (struct btrfs_drop_extents_args).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5893dfb9
    • Josef Bacik's avatar
      btrfs: set the lockdep class for extent buffers on creation · e114c545
      Josef Bacik authored
      Both Filipe and Fedora QA recently hit the following lockdep splat:
      
        WARNING: possible recursive locking detected
        5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1 Not tainted
        --------------------------------------------
        rsync/2610 is trying to acquire lock:
        ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        but task is already holding lock:
        ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0
      	 ----
          lock(&eb->lock);
          lock(&eb->lock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
        2 locks held by rsync/2610:
         #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
         #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        stack backtrace:
        CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x8b/0xb0
         __lock_acquire.cold+0x12d/0x2a4
         ? kvm_sched_clock_read+0x14/0x30
         ? sched_clock+0x5/0x10
         lock_acquire+0xc8/0x400
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         ? read_block_for_search.isra.0+0xdd/0x320
         _raw_read_lock+0x3d/0xa0
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_search_slot+0x616/0x9a0
         btrfs_lookup_dir_item+0x6c/0xb0
         btrfs_lookup_dentry+0xa8/0x520
         ? lockdep_init_map_waits+0x4c/0x210
         btrfs_lookup+0xe/0x30
         __lookup_slow+0x10f/0x1e0
         walk_component+0x11b/0x190
         path_lookupat+0x72/0x1c0
         filename_lookup+0x97/0x180
         ? strncpy_from_user+0x96/0x1e0
         ? getname_flags.part.0+0x45/0x1a0
         vfs_statx+0x64/0x100
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         __do_sys_newlstat+0x26/0x40
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? syscall_enter_from_user_mode+0x27/0x80
         ? syscall_enter_from_user_mode+0x27/0x80
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I have also seen a report of lockdep complaining about the lock class
      that was looked up being the same as the lock class on the lock we were
      using, but I can't find the report.
      
      These are problems that occur because we do not have the lockdep class
      set on the extent buffer until _after_ we read the eb in properly.  This
      is problematic for concurrent readers, because we will create the extent
      buffer, lock it, and then attempt to read the extent buffer.
      
      If a second thread comes in and tries to do a search down the same path
      they'll get the above lockdep splat because the class isn't set properly
      on the extent buffer.
      
      There was a good reason for this, we generally didn't know the real
      owner of the eb until we read it, specifically in refcounted roots.
      
      However now all refcounted roots have the same class name, so we no
      longer need to worry about this.  For non-refcounted trees we know
      which root we're on based on the parent.
      
      Fix this by setting the lockdep class on the eb at creation time instead
      of read time.  This will fix the splat and the weirdness where the class
      changes in the middle of locking the block.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e114c545
    • Josef Bacik's avatar
      btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      Josef Bacik authored
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fbaf258
    • Josef Bacik's avatar
      btrfs: pass the root owner and level around for readahead · 5d81230b
      Josef Bacik authored
      The readahead infrastructure does raw reads of extent buffers, but we're
      going to need to know their owner and level in order to set the lockdep
      key properly, so plumb in the infrastructure that we'll need to have
      this information when we start allocating extent buffers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d81230b
    • Josef Bacik's avatar
      btrfs: pass root owner to read_tree_block · 1b7ec85e
      Josef Bacik authored
      In order to properly set the lockdep class of a newly allocated block we
      need to know the owner of the block.  For non-refcounted trees this is
      straightforward, we always know in advance what tree we're reading from.
      For refcounted trees we don't necessarily know, however all refcounted
      trees share the same lockdep class name, tree-<level>.
      
      Fix all the callers of read_tree_block() to pass in the root objectid
      we're using.  In places like relocation and backref we could probably
      unconditionally use 0, but just in case use the root when we have it,
      otherwise use 0 in the cases we don't have the root as it's going to be
      a refcounted tree anyway.
      
      This is a preparation patch for further changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b7ec85e
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree · 182c79fc
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      182c79fc
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks · 3acfbd6a
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3acfbd6a
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap · 6b2cb7cb
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b2cb7cb
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_tree · c990ada2
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c990ada2
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in replace_path · 6b3426be
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b3426be
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in do_relocation · c9752536
      Josef Bacik authored
      We're open coding btrfs_read_node_slot in do_relocation, replace this
      with the proper helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9752536
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_reloc_tree · 8ef385bb
      Josef Bacik authored
      We do not need to call read_tree_block() here, simply use the
      btrfs_read_node_slot helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ef385bb
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in btrfs_realloc_node · 206983b7
      Josef Bacik authored
      We have this open-coded nightmare in btrfs_realloc_node that does
      the same thing that the normal read path does, which is to see if we
      have the eb in memory already, and if not read it, and verify the eb is
      uptodate.  Delete this open coding and simply use btrfs_read_node_slot.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      206983b7
    • Josef Bacik's avatar
      btrfs: cleanup extent buffer readahead · bfb484d9
      Josef Bacik authored
      We're going to pass around more information when we allocate extent
      buffers, in order to make that cleaner how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have it's arguments
      adjusted.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfb484d9
    • Josef Bacik's avatar
      btrfs: remove lockdep classes for the fs tree · 416e3445
      Josef Bacik authored
      We have this weird problem where our lockdep class is set after we
      read a tree block, which can race with concurrent readers and result in
      erroneous lockdep errors.  We want to set the lockdep class at
      allocation time if possible, but in certain cases we may not have the
      actual root owner, such as with relocation or any backref lookups.  This
      is only really a problem for reference counted trees, because all other
      trees have their root reference set in their extent reference.  Remove
      the fs tree specific lock class.  We need to still keep the reloc tree
      one, it's still reference counted, because replace_path will lock the
      reloc tree and the destination tree, and if they're both set to
      tree-<level> we'll have issues.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      416e3445
    • Pavel Begunkov's avatar
      btrfs: discard: reschedule work after sysfs param update · 3e48d8d2
      Pavel Begunkov authored
      After sysfs updates discard's iops_limit or kbps_limit it also needs to
      adjust current timer through rescheduling, otherwise the discard work
      may wait for a long time for the previous timer to expire or bumped by
      someone else.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e48d8d2
    • Pavel Begunkov's avatar
      btrfs: don't miss async discards after scheduled work override · df903e5d
      Pavel Begunkov authored
      If btrfs_discard_schedule_work() is called with override=true, it sets
      delay anew regardless how much time is left until the timer should have
      fired. If delays are long (that can happen, for example, with low
      kbps_limit), they might get constantly overridden without having a
      chance to run the discard work.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      df903e5d
    • Pavel Begunkov's avatar
      btrfs: discard: store async discard delay as ns not as jiffies · 6e88f116
      Pavel Begunkov authored
      Most delay calculations are done in ns or ms, so store
      discard_ctl->delay in ms and convert the final delay to jiffies only at
      the end.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e88f116
    • Pavel Begunkov's avatar
      btrfs: discard: speed up async discard up to iops_limit · e50404a8
      Pavel Begunkov authored
      Instead of using iops_limit only for cutting off extremes, calculate the
      discard delay directly from it, so it closely follows iops_limit and
      doesn't under-discard even though quotas are not saturated.
      
      The iops limit could be hit more often in some cases and could increase
      the discard rate.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e50404a8
    • Qu Wenruo's avatar
      btrfs: scrub: refactor scrub_find_csum() · 480a8ec8
      Qu Wenruo authored
      Function scrub_find_csum() is to locate the csum for bytenr @logical
      from sctx->csum_list.
      
      However it lacks a lot of comments to explain things like how the
      csum_list is organized and why we need to drop csum range which is
      before us.
      
      Refactor the function by:
      
      - Add more comments explaining the behavior
      - Add comment explaining why we need to drop the csum range
      - Put the csum copy in the main loop
        This is mostly for the incoming patches to make scrub_find_csum() able
        to find multiple checksums.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      480a8ec8
    • Qu Wenruo's avatar
      btrfs: scrub: remove the force parameter from scrub_pages · 96e63a45
      Qu Wenruo authored
      The @force parameter for scrub_pages() is to indicate whether we want to
      force bio submission.  Currently it's only used for the super block,
      and it can be easily determined by the @flags, so we can remove the
      parameter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96e63a45
    • Qu Wenruo's avatar
      btrfs: scrub: distinguish scrub page from regular page · 261d2dcb
      Qu Wenruo authored
      There are several call sites where we declare something like
      "struct scrub_page *page".
      
      This is confusing as we also use regular page in this code,
      rename it to 'spage' where applicable.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      261d2dcb
    • Qu Wenruo's avatar
      btrfs: pass bvec to csum_dirty_buffer instead of page · ac303b69
      Qu Wenruo authored
      Currently csum_dirty_buffer() uses page to grab extent buffer, but that
      only works for sector size == PAGE_SIZE case.
      
      For subpage we need page + page_offset to grab extent buffer.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac303b69
    • Qu Wenruo's avatar
      btrfs: extract extent buffer verification from btrfs_validate_metadata_buffer() · 77bf40a2
      Qu Wenruo authored
      Currently btrfs_validate_metadata_buffer() only needs to handle one
      extent buffer as currently one page maps to at most one extent buffer.
      
      For incoming subpage support, we need to extend the support where one
      page could contain multiple extent buffers.
      
      Split the function so we can call validate_extent_buffer on extent
      buffers independently.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77bf40a2
    • Qu Wenruo's avatar
      btrfs: make csum_tree_block() handle node smaller than page · a26663e7
      Qu Wenruo authored
      For subpage size support, metadata blocks of nodesize are smaller than
      one page and this needs to be handled when calculating the checksum.
      
      The checksummed start and length need to be adjusted but only for the
      first page:
      
      - start is simply offset in the page
      
      - length is nodesize (subpage) or PAGE_SIZE for all other cases
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a26663e7
    • Qu Wenruo's avatar
      btrfs: grab fs_info from extent_buffer in btrfs_mark_buffer_dirty · 2f4d60df
      Qu Wenruo authored
      Since commit f28491e0 ("Btrfs: move the extent buffer radix tree into
      the fs_info"), fs_info can be grabbed from extent_buffer directly.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f4d60df
    • Qu Wenruo's avatar
      btrfs: make buffer_radix take sector size units · 478ef886
      Qu Wenruo authored
      For subpage sector size support, one page can contain multiple tree
      blocks. The entries cannot be based on page size and index must be
      derived from the sectorsize. No change for page size == sector size.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      478ef886
    • Qu Wenruo's avatar
      btrfs: assert page mapping lock in attach_extent_buffer_page · 0d01e247
      Qu Wenruo authored
      When calling attach_extent_buffer_page(), either we're attaching
      anonymous pages, called from btrfs_clone_extent_buffer(),
      or we're attaching btree inode pages, called from alloc_extent_buffer().
      
      For the latter case, we should hold page->mapping->private_lock to avoid
      parallel changes to page->private.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d01e247
    • Josef Bacik's avatar
      btrfs: protect fs_info->caching_block_groups by block_group_cache_lock · bbb86a37
      Josef Bacik authored
      I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #101 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/3445 is trying to acquire lock:
        ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_write+0x3d/0x70
      	 btrfs_cache_block_group+0x2d5/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&space_info->groups_sem);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/3445:
         #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      while testing another lockdep fix.  This happens because we're using the
      commit_root_sem to protect fs_info->caching_block_groups, which creates
      a dependency on the groups_sem -> commit_root_sem, which is problematic
      because we will allocate blocks while holding tree roots.  Fix this by
      making the list itself protected by the fs_info->block_group_cache_lock.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbb86a37
    • Josef Bacik's avatar
      btrfs: load free space cache asynchronously · e747853c
      Josef Bacik authored
      While documenting the usage of the commit_root_sem, I noticed that we do
      not actually take the commit_root_sem in the case of the free space
      cache.  This is problematic because we're supposed to hold that sem
      while we're reading the commit roots, which is what we do for the free
      space cache.
      
      The reason I did it inline when I originally wrote the code was because
      there's the case of unpinning where we need to make sure that the free
      space cache is loaded if we're going to use the free space cache.  But
      we can accomplish the same thing by simply waiting for the cache to be
      loaded.
      
      Rework this code to load the free space cache asynchronously.  This
      allows us to greatly cleanup the caching code because now it's all
      shared by the various caching methods.  We also are now in a position to
      have the commit_root semaphore held while we're loading the free space
      cache.  And finally our modification of ->last_byte_to_unpin is removed
      because it can be handled in the proper way on commit.
      
      Some care must be taken when replaying the log, when we expect that the
      free space cache will be read entirely before we start excluding space
      to replay. This could lead to overwriting space during replay.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e747853c
    • Josef Bacik's avatar
      btrfs: load the free space cache inode extents from commit root · 4d7240f0
      Josef Bacik authored
      Historically we've allowed recursive locking specifically for the free
      space inode.  This is because we are only doing reads and know that it's
      safe.  However we don't actually need this feature, we can get away with
      reading the commit root for the extents.  In fact if we want to allow
      asynchronous loading of the free space cache we have to use the commit
      root, otherwise we will deadlock.
      
      Switch to using the commit root for the file extents.  These are only
      read at load time, and are replaced as soon as we start writing the
      cache out to disk.  The cache is never read again, so this is
      legitimate.  This matches what we do for the inode itself, as we read
      that from the commit root as well.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d7240f0
    • Josef Bacik's avatar
      btrfs: load free space cache into a temporary ctl · cd79909b
      Josef Bacik authored
      The free space cache has been special in that we would load it right
      away instead of farming the work off to a worker thread.  This resulted
      in some weirdness that had to be taken into account for this fact,
      namely that if we every found a block group being cached the fast way we
      had to wait for it to finish, because we could get the cache before it
      had been validated and we may throw the cache away.
      
      To handle this particular case instead create a temporary
      btrfs_free_space_ctl to load the free space cache into.  Then once we've
      validated that it makes sense, copy it's contents into the actual
      block_group->free_space_ctl.  This allows us to avoid the problems of
      needing to wait for the caching to complete, we can clean up the discard
      extent handling stuff in __load_free_space_cache, and we no longer need
      to do the merge_space_tree() because the space is added one by one into
      the real free_space_ctl.  This will allow further reworks of how we
      handle loading the free space cache.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd79909b
    • Josef Bacik's avatar
      btrfs: cleanup btrfs_discard_update_discardable usage · 66b53bae
      Josef Bacik authored
      This passes in the block_group and the free_space_ctl, but we can get
      this from the block group itself.  Part of this is because we call it
      from __load_free_space_cache, which can be called for the inode cache as
      well.
      
      Move that call into the block group specific load section, wrap it in
      the right lock that we need for the assertion (but otherwise this is
      safe without the lock because this happens in single-thread context).
      
      Fix up the arguments to only take the block group.  Add a lockdep_assert
      as well for good measure to make sure we don't mess up the locking
      again.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66b53bae
    • Josef Bacik's avatar
      btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range · 2ca08c56
      Josef Bacik authored
      Currently unpin_extent_range happens in the transaction commit context,
      so we are protected from ->last_byte_to_unpin changing while we're
      unpinning, because any new transactions would have to wait for us to
      complete before modifying ->last_byte_to_unpin.
      
      However in the future we may want to change how this works, for instance
      with async unpinning or other such TODO items.  To prepare for that
      future explicitly protect ->last_byte_to_unpin with the commit_root_sem
      so we are sure it won't change while we're doing our work.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ca08c56
    • Josef Bacik's avatar
      btrfs: update last_byte_to_unpin in switch_commit_roots · 27d56e62
      Josef Bacik authored
      While writing an explanation for the need of the commit_root_sem for
      btrfs_prepare_extent_commit, I realized we have a slight hole that could
      result in leaked space if we have to do the old style caching.  Consider
      the following scenario
      
       commit root
       +----+----+----+----+----+----+----+
       |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
       new commit root
       +----+----+----+----+----+----+----+
       |    |    |    |\\\\|    |    |\\\\|
       +----+----+----+----+----+----+----+
       0    1    2    3    4    5    6    7
      
      Prior to this patch, we run btrfs_prepare_extent_commit, which updates
      the last_byte_to_unpin, and then we subsequently run
      switch_commit_roots.  In this example lets assume that
      caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
      means that cache->last_byte_to_unpin == 1.  Then we go and do the
      switch_commit_roots(), but in the meantime the caching thread has made
      some more progress, because we drop the commit_root_sem and re-acquired
      it.  Now caching_ctl->progress == 3.  We swap out the commit root and
      carry on to unpin.
      
      The race can happen like:
      
        1) The caching thread was running using the old commit root when it
           found the extent for [2, 3);
      
        2) Then it released the commit_root_sem because it was in the last
           item of a leaf and the semaphore was contended, and set ->progress
           to 3 (value of 'last'), as the last extent item in the current leaf
           was for the extent for range [2, 3);
      
        3) Next time it gets the commit_root_sem, will start using the new
           commit root and search for a key with offset 3, so it never finds
           the hole for [2, 3).
      
        So the caching thread never saw [2, 3) as free space in any of the
        commit roots, and by the time finish_extent_commit() was called for
        the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
        subrange [0, 1) to the free space cache, skipping [2, 3).
      
      In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
      but do not unpin [2,3).  However because caching_ctl->progress == 3 we
      do not see the newly freed section of [2,3), and thus do not add it to
      our free space cache.  This results in us missing a chunk of free space
      in memory (on disk too, unless we have a power failure before writing
      the free space cache to disk).
      
      Fix this by making sure the ->last_byte_to_unpin is set at the same time
      that we swap the commit roots, this ensures that we will always be
      consistent.
      
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ update changelog with Filipe's review comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27d56e62
    • Josef Bacik's avatar
      btrfs: do not shorten unpin len for caching block groups · 9076dbd5
      Josef Bacik authored
      While fixing up our ->last_byte_to_unpin locking I noticed that we will
      shorten len based on ->last_byte_to_unpin if we're caching when we're
      adding back the free space.  This is correct for the free space, as we
      cannot unpin more than ->last_byte_to_unpin, however we use len to
      adjust the ->bytes_pinned counters and such, which need to track the
      actual pinned usage.  This could result in
      WARN_ON(space_info->bytes_pinned) triggering at unmount time.
      
      Fix this by using a local variable for the amount to add to free space
      cache, and leave len untouched in this case.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9076dbd5
    • David Sterba's avatar
      btrfs: reorder extent buffer members for better packing · dc516164
      David Sterba authored
      After the rwsem replaced the tree lock implementation, the extent buffer
      got smaller but leaving some holes behind. By changing log_index type
      and reordering, we can squeeze the size further to 240 bytes, measured on
      release config on x86_64. Log_index spans only 3 values and needs to be
      signed.
      
      Before:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              short int                  log_index;            /*   112     2 */
      
              /* XXX 6 bytes hole, try to pack */
      
              struct page *              pages[16];            /*   120   128 */
      
              /* size: 248, cachelines: 4, members: 14 */
              /* sum members: 239, holes: 2, sum holes: 9 */
              /* forced alignments: 1 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      
      After:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
              s8                         log_index;            /*    69     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              struct page *              pages[16];            /*   112   128 */
      
              /* size: 240, cachelines: 4, members: 14 */
              /* sum members: 238, holes: 1, sum holes: 2 */
              /* forced alignments: 1 */
              /* last cacheline: 48 bytes */
      } __attribute__((__aligned__(8)));
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc516164