1. 05 Apr, 2022 8 commits
  2. 24 Mar, 2022 5 commits
    • Kaiwen Hu's avatar
      btrfs: prevent subvol with swapfile from being deleted · 60021bd7
      Kaiwen Hu authored
      A subvolume with an active swapfile must not be deleted otherwise it
      would not be possible to deactivate it.
      
      After the subvolume is deleted, we cannot swapoff the swapfile in this
      deleted subvolume because the path is unreachable.  The swapfile is
      still active and holding references, the filesystem cannot be unmounted.
      
      The test looks like this:
      
        mkfs.btrfs -f $dev > /dev/null
        mount $dev $mnt
      
        btrfs sub create $mnt/subvol
        touch $mnt/subvol/swapfile
        chmod 600 $mnt/subvol/swapfile
        chattr +C $mnt/subvol/swapfile
        dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
        mkswap $mnt/subvol/swapfile
        swapon $mnt/subvol/swapfile
      
        btrfs sub delete $mnt/subvol
        swapoff $mnt/subvol/swapfile  # failed: No such file or directory
        swapoff --all
      
        unmount $mnt                  # target is busy.
      
      To prevent above issue, we simply check that whether the subvolume
      contains any active swapfile, and stop the deleting process.  This
      behavior is like snapshot ioctl dealing with a swapfile.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarKaiwen Hu <kevinhu@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60021bd7
    • Josef Bacik's avatar
      btrfs: do not warn for free space inode in cow_file_range · a7d16d9a
      Josef Bacik authored
      This is a long time leftover from when I originally added the free space
      inode, the point was to catch cases where we weren't honoring the NOCOW
      flag.  However there exists a race with relocation, if we allocate our
      free space inode in a block group that is about to be relocated, we
      could trigger the COW path before the relocation has the opportunity to
      find the extents and delete the free space cache.  In production where
      we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
      5k times in a 2 week period, so not super common but enough that it's at
      the top of our metrics.
      
      We're properly handling the error here, and with us phasing out v1 space
      cache anyway just drop the WARN_ON_ONCE.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7d16d9a
    • Qu Wenruo's avatar
      btrfs: avoid defragging extents whose next extents are not targets · 75a36a7d
      Qu Wenruo authored
      [BUG]
      There is a report that autodefrag is defragging single sector, which
      is completely waste of IO, and no help for defragging:
      
         btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
      
      [CAUSE]
      In defrag_collect_targets(), we check if the current range (A) can be merged
      with next one (B).
      
      If mergeable, we will add range A into target for defrag.
      
      However there is a catch for autodefrag, when checking mergeability
      against range B, we intentionally pass 0 as @newer_than, hoping to get a
      higher chance to merge with the next extent.
      
      But in the next iteration, range B will looked up by defrag_lookup_extent(),
      with non-zero @newer_than.
      
      And if range B is not really newer, it will rejected directly, causing
      only range A being defragged, while we expect to defrag both range A and
      B.
      
      [FIX]
      Since the root cause is the difference in check condition of
      defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
      
      1. Pass @newer_than to defrag_check_next_extent()
      2. Pass @extent_thresh to defrag_check_next_extent()
      
      This makes the check between defrag_collect_targets() and
      defrag_check_next_extent() more consistent.
      
      While there is still some minor difference, the remaining checks are
      focus on runtime flags like writeback/delalloc, which are mostly
      transient and safe to be checked only in defrag_collect_targets().
      
      Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75a36a7d
    • Darrick J. Wong's avatar
      btrfs: fix fallocate to use file_modified to update permissions consistently · 05fd9564
      Darrick J. Wong authored
      Since the initial introduction of (posix) fallocate back at the turn of
      the century, it has been possible to use this syscall to change the
      user-visible contents of files.  This can happen by extending the file
      size during a preallocation, or through any of the newer modes (punch,
      zero range).  Because the call can be used to change file contents, we
      should treat it like we do any other modification to a file -- update
      the mtime, and drop set[ug]id privileges/capabilities.
      
      The VFS function file_modified() does all this for us if pass it a
      locked inode, so let's make fallocate drop permissions correctly.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05fd9564
    • Qu Wenruo's avatar
      btrfs: remove device item and update super block in the same transaction · bbac5869
      Qu Wenruo authored
      [BUG]
      There is a report that a btrfs has a bad super block num devices.
      
      This makes btrfs to reject the fs completely.
      
        BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
        BTRFS error (device sdd3): failed to read chunk tree: -22
        BTRFS error (device sdd3): open_ctree failed
      
      [CAUSE]
      During btrfs device removal, chunk tree and super block num devs are
      updated in two different transactions:
      
        btrfs_rm_device()
        |- btrfs_rm_dev_item(device)
        |  |- trans = btrfs_start_transaction()
        |  |  Now we got transaction X
        |  |
        |  |- btrfs_del_item()
        |  |  Now device item is removed from chunk tree
        |  |
        |  |- btrfs_commit_transaction()
        |     Transaction X got committed, super num devs untouched,
        |     but device item removed from chunk tree.
        |     (AKA, super num devs is already incorrect)
        |
        |- cur_devices->num_devices--;
        |- cur_devices->total_devices--;
        |- btrfs_set_super_num_devices()
           All those operations are not in transaction X, thus it will
           only be written back to disk in next transaction.
      
      So after the transaction X in btrfs_rm_dev_item() committed, but before
      transaction X+1 (which can be minutes away), a power loss happen, then
      we got the super num mismatch.
      
      [FIX]
      Instead of starting and committing a transaction inside
      btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and
      pass it to btrfs_rm_dev_item().
      
      And only commit the transaction after everything is done.
      Reported-by: default avatarLuca Béla Palkovics <luca.bela.palkovics@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbac5869
  3. 23 Mar, 2022 3 commits
    • Ethan Lien's avatar
      btrfs: fix qgroup reserve overflow the qgroup limit · b642b52d
      Ethan Lien authored
      We use extent_changeset->bytes_changed in qgroup_reserve_data() to record
      how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the
      bytes_changed is set as "unsigned int", and it will overflow if we try to
      fallocate a range larger than 4GiB. The result is we reserve less bytes
      and eventually break the qgroup limit.
      
      Unlike regular buffered/direct write, which we use one changeset for
      each ordered extent, which can never be larger than 256M.  For
      fallocate, we use one changeset for the whole range, thus it no longer
      respects the 256M per extent limit, and caused the problem.
      
      The following example test script reproduces the problem:
      
        $ cat qgroup-overflow.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        # Set qgroup limit to 2GiB.
        btrfs quota enable $MNT
        btrfs qgroup limit 2G $MNT
      
        # Try to fallocate a 3GiB file. This should fail.
        echo
        echo "Try to fallocate a 3GiB file..."
        fallocate -l 3G $MNT/3G.file
      
        # Try to fallocate a 5GiB file.
        echo
        echo "Try to fallocate a 5GiB file..."
        fallocate -l 5G $MNT/5G.file
      
        # See we break the qgroup limit.
        echo
        sync
        btrfs qgroup show -r $MNT
      
        umount $MNT
      
      When running the test:
      
        $ ./qgroup-overflow.sh
        (...)
      
        Try to fallocate a 3GiB file...
        fallocate: fallocate failed: Disk quota exceeded
      
        Try to fallocate a 5GiB file...
      
        qgroupid         rfer         excl     max_rfer
        --------         ----         ----     --------
        0/5           5.00GiB      5.00GiB      2.00GiB
      
      Since we have no control of how bytes_changed is used, it's better to
      set it to u64.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b642b52d
    • Johannes Thumshirn's avatar
      btrfs: zoned: remove left over ASSERT checking for single profile · 62ed0bf7
      Johannes Thumshirn authored
      With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
      groups") we started allowing DUP on metadata block groups, so the
      ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
      no longer valid and in fact even harmful.
      
      Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62ed0bf7
    • Johannes Thumshirn's avatar
      btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone · 0b9e6676
      Johannes Thumshirn authored
      btrfs_can_activate_zone() can be called with the device_list_mutex already
      held, which will lead to a deadlock:
      
      insert_dev_extents() // Takes device_list_mutex
      `-> insert_dev_extent()
       `-> btrfs_insert_empty_item()
        `-> btrfs_insert_empty_items()
         `-> btrfs_search_slot()
          `-> btrfs_cow_block()
           `-> __btrfs_cow_block()
            `-> btrfs_alloc_tree_block()
             `-> btrfs_reserve_extent()
              `-> find_free_extent()
               `-> find_free_extent_update_loop()
                `-> can_allocate_chunk()
                 `-> btrfs_can_activate_zone() // Takes device_list_mutex again
      
      Instead of using the RCU on fs_devices->device_list we
      can use fs_devices->alloc_list, protected by the chunk_mutex to traverse
      the list of active devices.
      
      We are in the chunk allocation thread. The newer chunk allocation
      happens from the devices in the fs_device->alloc_list protected by the
      chunk_mutex.
      
        btrfs_create_chunk()
          lockdep_assert_held(&info->chunk_mutex);
          gather_device_info
            list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
      
      Also, a device that reappears after the mount won't join the alloc_list
      yet and, it will be in the dev_list, which we don't want to consider in
      the context of the chunk alloc.
      
        [15.166572] WARNING: possible recursive locking detected
        [15.167117] 5.17.0-rc6-dennis #79 Not tainted
        [15.167487] --------------------------------------------
        [15.167733] kworker/u8:3/146 is trying to acquire lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
        [15.167733]
        [15.167733] but task is already holding lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.167733]
        [15.167733] other info that might help us debug this:
        [15.167733]  Possible unsafe locking scenario:
        [15.167733]
        [15.171834]        CPU0
        [15.171834]        ----
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]
        [15.171834]  *** DEADLOCK ***
        [15.171834]
        [15.171834]  May be due to missing lock nesting notation
        [15.171834]
        [15.171834] 5 locks held by kworker/u8:3/146:
        [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
        [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
        [15.179641]
        [15.179641] stack backtrace:
        [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
        [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
        [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
        [15.179641] Call Trace:
        [15.179641]  <TASK>
        [15.179641]  dump_stack_lvl+0x45/0x59
        [15.179641]  __lock_acquire.cold+0x217/0x2b2
        [15.179641]  lock_acquire+0xbf/0x2b0
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  __mutex_lock+0x8e/0x970
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? lock_is_held_type+0xd7/0x130
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? _raw_spin_unlock+0x24/0x40
        [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
        [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
        [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
        [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
        [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
        [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
        [15.187601]  ? lock_is_held_type+0xd7/0x130
        [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
        [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
        [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
        [15.192037]  flush_space+0x374/0x600 [btrfs]
        [15.192037]  ? find_held_lock+0x2b/0x80
        [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
        [15.192037]  ? lock_release+0x131/0x2b0
        [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
        [15.192037]  process_one_work+0x24c/0x5a0
        [15.192037]  worker_thread+0x4a/0x3d0
      
      Fixes: a85f05e5 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b9e6676
  4. 14 Mar, 2022 24 commits
    • Nikolay Borisov's avatar
      btrfs: zoned: put block group after final usage · d3e29967
      Nikolay Borisov authored
      It's counter-intuitive (and wrong) to put the block group _before_ the
      final usage in submit_eb_page. Fix it by re-ordering the call to
      btrfs_put_block_group after its final reference. Also fix a minor typo
      in 'implies'
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3e29967
    • Dongliang Mu's avatar
      btrfs: don't access possibly stale fs_info data in device_list_add · 79c9234b
      Dongliang Mu authored
      Syzbot reported a possible use-after-free in printing information
      in device_list_add.
      
      Very similar with the bug fixed by commit 0697d9a6 ("btrfs: don't
      access possibly stale fs_info data for printing duplicate device"),
      but this time the use occurs in btrfs_info_in_rcu.
      
        Call Trace:
         kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
         btrfs_printk+0x395/0x425 fs/btrfs/super.c:244
         device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957
         btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387
         btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by modifying device->fs_info to NULL too.
      
      Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarDongliang Mu <mudongliangabcd@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79c9234b
    • Niels Dossche's avatar
      btrfs: add lockdep_assert_held to need_preemptive_reclaim · bf7bd725
      Niels Dossche authored
      In a previous patch ("btrfs: extend locking to all space_info members
      accesses") the locking for the space_info members was extended in
      btrfs_preempt_reclaim_metadata_space because not all the member
      accesses that needed locks were actually locked (bytes_pinned et al).
      
      It was then suggested to also add a call to lockdep_assert_held to
      need_preemptive_reclaim. This function also works with space_info
      members. As of now, it has only two call sites which both hold the lock.
      Suggested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNiels Dossche <dossche.niels@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf7bd725
    • Qu Wenruo's avatar
      btrfs: verify the tranisd of the to-be-written dirty extent buffer · 3777369f
      Qu Wenruo authored
      [BUG]
      There is a bug report that a bitflip in the transid part of an extent
      buffer makes btrfs to reject certain tree blocks:
      
        BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22
      
      [CAUSE]
      Note the failed transid check, hex(262166) = 0x40016, while
      hex(22) = 0x16.
      
      It's an obvious bitflip.
      
      Furthermore, the reporter also confirmed the bitflip is from the
      hardware, so it's a real hardware caused bitflip, and such problem can
      not be detected by the existing tree-checker framework.
      
      As tree-checker can only verify the content inside one tree block, while
      generation of a tree block can only be verified against its parent.
      
      So such problem remain undetected.
      
      [FIX]
      Although tree-checker can not verify it at write-time, we still have a
      quick (but not the most accurate) way to catch such obvious corruption.
      
      Function csum_one_extent_buffer() is called before we submit metadata
      write.
      
      Thus it means, all the extent buffer passed in should be dirty tree
      blocks, and should be newer than last committed transaction.
      
      Using that we can catch the above bitflip.
      
      Although it's not a perfect solution, as if the corrupted generation is
      higher than the correct value, we have no way to catch it at all.
      Reported-by: default avatarChristoph Anton Mitterer <calestyo@scientia.org>
      Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarQu Wenruo <wqu@sus,ree.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3777369f
    • Qu Wenruo's avatar
      btrfs: unify the error handling of btrfs_read_buffer() · 9a4ffa1b
      Qu Wenruo authored
      There is one oddball error handling of btrfs_read_buffer():
      
      	ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key);
      	if (!ret) {
      		*eb_ret = tmp;
      		return 0;
      	}
      	free_extent_buffer(tmp);
      	btrfs_release_path(p);
      	return -EIO;
      
      While all other call sites check the error first.  Unify the behavior.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a4ffa1b
    • Qu Wenruo's avatar
      btrfs: unify the error handling pattern for read_tree_block() · 4eb150d6
      Qu Wenruo authored
      We had an error handling pattern for read_tree_block() like this:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 * Normally ended up with return or goto out.
      		 */
      	} else if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 * Normally also ended up with return or goto out;
      		 */
      	}
      
      This is fine, but if we want to add extra check for each
      read_tree_block(), the existing if-else-if is not that expandable and
      will take reader some seconds to figure out there is no extra branch.
      
      Here we change it to a more common way, without the extra else:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 */
      		return eb or goto out;
      	}
      	if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 */
      		return eb or goto out;
      	}
      
      This also removes some oddball call sites which uses some creative way
      to check error.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4eb150d6
    • Josef Bacik's avatar
      btrfs: factor out do_free_extent_accounting helper · 8f8aa4c7
      Josef Bacik authored
      __btrfs_free_extent() does all of the hard work of updating the extent
      ref items, and then at the end if we dropped the extent completely it
      does the cleanup accounting work.  We're going to only want to do that
      work for metadata with extent tree v2, so extract this bit into its own
      helper.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8f8aa4c7
    • Josef Bacik's avatar
      btrfs: remove last_ref from the extent freeing code · 5b2a54bb
      Josef Bacik authored
      This is a remnant of the work I did for qgroups a long time ago to only
      run for a block when we had dropped the last ref.  We haven't done that
      for years, but the code remains.  Drop this remnant.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b2a54bb
    • Josef Bacik's avatar
      btrfs: add a alloc_reserved_extent helper · 34666705
      Josef Bacik authored
      We duplicate this logic for both data and metadata, at this point we've
      already done our type specific extent root operations, this is just
      doing the accounting and removing the space from the free space tree.
      Extract this common logic out into a helper.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      34666705
    • Josef Bacik's avatar
      btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block · b3c958a3
      Josef Bacik authored
      Switch this to an ASSERT() and return the error in the normal case.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3c958a3
    • Filipe Manana's avatar
      btrfs: add and use helper for unlinking inode during log replay · 313ab753
      Filipe Manana authored
      During log replay there is this pattern of running delayed items after
      every inode unlink. To avoid repeating this several times, move the
      logic into an helper function and use it instead of calling
      btrfs_unlink_inode() followed by btrfs_run_delayed_items().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      313ab753
    • Niels Dossche's avatar
      btrfs: extend locking to all space_info members accesses · 06bae876
      Niels Dossche authored
      bytes_pinned is always accessed under space_info->lock, except in
      btrfs_preempt_reclaim_metadata_space, however the other members are
      accessed under that lock. The reserved member of the rsv's are also
      partially accessed under a lock and partially not. Move all these
      accesses into the same lock to ensure consistency.
      
      This could potentially race and lead to a flush instead of a commit but
      it's not a big problem as it's only for preemptive flush.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNiels Dossche <niels.dossche@ugent.be>
      Signed-off-by: default avatarNiels Dossche <dossche.niels@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06bae876
    • Naohiro Aota's avatar
      btrfs: zoned: mark relocation as writing · ca5e4ea0
      Naohiro Aota authored
      There is a hung_task issue with running generic/068 on an SMR
      device. The hang occurs while a process is trying to thaw the
      filesystem. The process is trying to take sb->s_umount to thaw the
      FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
      waiting for an ordered extent to finish. However, as the FS is frozen,
      the ordered extents never finish.
      
      Having an ordered extent while the FS is frozen is the root cause of
      the hang. The ordered extent is initiated from btrfs_relocate_chunk()
      which is called from btrfs_reclaim_bgs_work().
      
      This commit adds sb_*_write() around btrfs_relocate_chunk() call
      site. For the usual "btrfs balance" command, we already call it with
      mnt_want_file() in btrfs_ioctl_balance().
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.13+
      Link: https://github.com/naota/linux/issues/56Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca5e4ea0
    • Josef Bacik's avatar
      fs: allow cross-vfsmount reflink/dedupe · 9f5710bb
      Josef Bacik authored
      Currently we disallow reflink and dedupe if the two files aren't on the
      same vfsmount.  However we really only need to disallow it if they're
      not on the same super block.  It is very common for btrfs to have a main
      subvolume that is mounted and then different subvolumes mounted at
      different locations.  It's allowed to reflink between these volumes, but
      the vfsmount check disallows this.  Instead fix dedupe to check for the
      same superblock, and simply remove the vfsmount check for reflink as it
      already does the superblock check.
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f5710bb
    • Josef Bacik's avatar
      btrfs: remove the cross file system checks from remap · ae460f05
      Josef Bacik authored
      The sb check is already done in do_clone_file_range, and the mnt check
      (which will hopefully go away in a subsequent patch) is done in
      ioctl_file_clone().  Remove the check in our code and put an ASSERT() to
      make sure it doesn't change underneath us.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae460f05
    • Josef Bacik's avatar
      btrfs: pass btrfs_fs_info to btrfs_recover_relocation · 7eefae6b
      Josef Bacik authored
      We don't need a root here, we just need the btrfs_fs_info, we can just
      get the specific roots we need from fs_info.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7eefae6b
    • Josef Bacik's avatar
      btrfs: pass btrfs_fs_info for deleting snapshots and cleaner · 33c44184
      Josef Bacik authored
      We're passing a root around here, but we only really need the fs_info,
      so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead,
      and then fix up all the callers appropriately.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      33c44184
    • Sweet Tea Dorminy's avatar
      btrfs: add filesystems state details to error messages · c067da87
      Sweet Tea Dorminy authored
      When a filesystem goes read-only due to an error, multiple errors tend
      to be reported, some of which are knock-on failures. Logging fs_states,
      in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the
      first error from subsequent messages which may only exist due to an
      error state.
      
      Under the new format, most initial errors will look like:
      `BTRFS: error (device loop0) in ...`
      while subsequent errors will begin with:
      `error (device loop0: state E) in ...`
      
      An initial transaction abort error will look like
      `error (device loop0: state A) in ...`
      and subsequent messages will contain
      `(device loop0: state EA) in ...`
      
      In addition to the error states we can also print other states that are
      temporary, like remounting, device replace, or indicate a global state
      that may affect functionality.
      
      Now implemented:
      
      E - filesystem error detected
      A - transaction aborted
      L - log tree errors
      
      M - remounting in progress
      R - device replace in progress
      C - data checksums not verified (mounted with ignoredatacsums)
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c067da87
    • Filipe Manana's avatar
      btrfs: deal with unexpected extent type during reflinking · b2d9f2dc
      Filipe Manana authored
      Smatch complains about a possible dereference of a pointer that was not
      initialized:
      
          CC [M]  fs/btrfs/reflink.o
          CHECK   fs/btrfs/reflink.c
        fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'.
      
      This is because we are not dealing with the case where the type of a file
      extent has an unexpected value (not regular, not prealloc and not inline),
      in which case the transaction handle pointer is not initialized.
      
      Such unexpected type should be impossible, except in case of some memory
      corruption caused either by bad hardware or some software bug causing
      something like a buffer overrun.
      
      So ASSERT that if the extent type is neither regular nor prealloc, then
      it must be inline. Bail out with -EUCLEAN and a warning in case it is
      not. This silences smatch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2d9f2dc
    • Filipe Manana's avatar
      btrfs: fix unexpected error path when reflinking an inline extent · 1f4613cd
      Filipe Manana authored
      When reflinking an inline extent, we assert that its file offset is 0 and
      that its uncompressed length is not greater than the sector size. We then
      return an error if one of those conditions is not satisfied. However we
      use a return statement, which results in returning from btrfs_clone()
      without freeing the path and buffer that were allocated before, as well as
      not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
      inode.
      
      Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
      for each condition so that in case assertions are disabled, we get to
      known which of the unexpected conditions triggered the error.
      
      Fixes: a61e1e0d ("Btrfs: simplify inline extent handling when doing reflinks")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f4613cd
    • Filipe Manana's avatar
      btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Filipe Manana authored
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode reset its last_unlink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      23e3337f
    • Filipe Manana's avatar
      btrfs: voluntarily relinquish cpu when doing a full fsync · 96acb375
      Filipe Manana authored
      Doing a full fsync may require processing many leaves of metadata, which
      can take some time and result in a task monopolizing a cpu for too long.
      So add a cond_resched() after processing a leaf when doing a full fsync,
      while not holding any locks on any tree (a subvolume or a log tree).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96acb375
    • Filipe Manana's avatar
      btrfs: hold on to less memory when logging checksums during full fsync · 5b7ce5e2
      Filipe Manana authored
      When doing a full fsync, at copy_items(), we iterate over all extents and
      then collect their checksums into a list. After copying all the extents to
      the log tree, we then log all the previously collected checksums.
      
      Before the previous patch in the series (subject "btrfs: stop copying old
      file extents when doing a full fsync"), we had to do it this way, because
      while we were iterating over the items in the leaf of the subvolume tree,
      we were holding a write lock on a leaf of the log tree, so logging the
      checksums for an extent right after we collected them could result in a
      deadlock, in case the checksum items ended up in the same leaf.
      
      However after the previous patch in the series we now do a first iteration
      over all the items in the leaf of the subvolume tree before locking a path
      in the log tree, so we can now log the checksums right after we have
      obtained them. This avoids holding in memory all checksums for all extents
      in the leaf while copying items from the source leaf to the log tree. The
      amount of memory used to hold all checksums of the extents in a leaf can
      be significant. For example if a leaf has 200 file extent items referring
      to 1M extents, using the default crc32c checksums, would result in using
      over 200K of memory (not accounting for the extra overhead of struct
      btrfs_ordered_sum), with smaller or less extents it would be less, but
      it could be much more with more extents per leaf and/or much larger
      extents.
      
      So change copy_items() to log the checksums for an extent after looking
      them up, and then free their memory, as they are no longer necessary.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b7ce5e2
    • Filipe Manana's avatar
      btrfs: stop copying old file extents when doing a full fsync · 7f30c072
      Filipe Manana authored
      When logging an inode in full sync mode, we go over every leaf that was
      modified in the current transaction and has items associated to our inode,
      and then copy all those items into the log tree. This includes copying
      file extent items that were created and added to the inode in past
      transactions, which is useless and only makes use more leaf space in the
      log tree.
      
      It's common to have a file with many file extent items spanning many
      leaves where only a few file extent items are new and need to be logged,
      and in such case we log all the file extent items we find in the modified
      leaves.
      
      So change the full sync behaviour to skip over file extent items that are
      not needed. Those are the ones that match the following criteria:
      
      1) Have a generation older than the current transaction and the inode
         was not a target of a reflink operation, as that can copy file extent
         items from a past generation from some other inode into our inode, so
         we have to log them;
      
      2) Start at an offset within i_size - we must log anything at or beyond
         i_size, otherwise we would lose prealloc extents after log replay.
      
      The following script exercises a scenario where this happens, and it's
      somehow close enough to what happened often on a SQL Server workload which
      I had to debug sometime ago to fix an issue where a pattern of writes to
      prealloc extents and fsync resulted in fsync failing with -EIO (that was
      commit ea7036de ("btrfs: fix fsync failure and transaction abort
      after writes to prealloc extents")). In that particular case, we had large
      files that had random writes and were often truncated, which made the
      next fsync be a full sync.
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
        # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
        # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create a file with many extents. Use direct IO to make it faster
        # to create the file - using buffered IO we would have to fsync
        # after each write (terribly slow).
        echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
        xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
      
        # Commit the transaction, so every extent after this is from an
        # old generation.
        sync
      
        # Now rewrite only a few extents, which are all far spread apart from
        # each other (e.g. 1G / 32M = 32 extents).
        # After this only a few extents have a new generation, while all other
        # ones have an old generation.
        echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
        for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
            xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
        done
      
        # Fsync, the inode logged in full sync mode since it was never fsynced
        # before.
        echo "Fsyncing file..."
        xfs_io -c "fsync" $MNT/foobar
      
        umount $MNT
      
      And the following bpftrace program was running when executing the test
      script:
      
        $ cat bpf-script.sh
        #!/usr/bin/bpftrace
      
        k:btrfs_log_inode
        {
            @start_log_inode[tid] = nsecs;
        }
      
        kr:btrfs_log_inode
        /@start_log_inode[tid]/
        {
            @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
            delete(@start_log_inode[tid]);
        }
      
        k:btrfs_sync_log
        {
            @start_sync_log[tid] = nsecs;
        }
      
        kr:btrfs_sync_log
        /@start_sync_log[tid]/
        {
            $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
            printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
            printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
            delete(@start_sync_log[tid]);
            delete(@log_inode_dur[tid]);
            exit();
        }
      
      With 512M test file, before this patch:
      
        btrfs_log_inode() took 15218 us
        btrfs_sync_log()  took 1328 us
      
        Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
      
      With 512M test file, after this patch:
      
        btrfs_log_inode() took 14760 us
        btrfs_sync_log()  took 588 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 1G test file, before this patch:
      
        btrfs_log_inode() took 27301 us
        btrfs_sync_log()  took 1767 us
      
        Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
      
      With 1G test file, after this patch:
      
        btrfs_log_inode() took 26166 us
        btrfs_sync_log()  took 593 us
      
        Log tree has a single leaf, its total size is 16K
      
      With 2G test file, before this patch:
      
        btrfs_log_inode() took 50892 us
        btrfs_sync_log()  took 3127 us
      
        Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
      
      With 2G test file, after this patch:
      
        btrfs_log_inode() took 50126 us
        btrfs_sync_log()  took 586 us
      
        Log tree has a single leaf, its total size is 16K.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f30c072