1. 07 Oct, 2020 40 commits
    • Anand Jain's avatar
      btrfs: add btrfs_sysfs_remove_device helper · 985e233e
      Anand Jain authored
      btrfs_sysfs_remove_devices_dir() removes device link and devid kobject
      (sysfs entries) for a device or all the devices in the btrfs_fs_devices.
      In preparation to remove these sysfs entries for the seed as well, add
      a btrfs_sysfs_remove_device() helper function and avoid code
      duplication.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      985e233e
    • Anand Jain's avatar
      btrfs: add btrfs_sysfs_add_device helper · 178a16c9
      Anand Jain authored
      btrfs_sysfs_add_devices_dir() adds device link and devid kobject
      (sysfs entries) for a device or all the devices in the btrfs_fs_devices.
      In preparation to add these sysfs entries for the seed as well, add
      a btrfs_sysfs_add_device() helper function and avoid code duplication.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      178a16c9
    • Anand Jain's avatar
      btrfs: fix replace of seed device · c6a5d954
      Anand Jain authored
      If you replace a seed device in a sprouted fs, it appears to have
      successfully replaced the seed device, but if you look closely, it
      didn't.  Here is an example.
      
        $ mkfs.btrfs /dev/sda
        $ btrfstune -S1 /dev/sda
        $ mount /dev/sda /btrfs
        $ btrfs device add /dev/sdb /btrfs
        $ umount /btrfs
        $ btrfs device scan --forget
        $ mount -o device=/dev/sda /dev/sdb /btrfs
        $ btrfs replace start -f /dev/sda /dev/sdc /btrfs
        $ echo $?
        0
      
        BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
        BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished
      
        $ btrfs fi show
        Label: none  uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
      	  Total devices 2 FS bytes used 256.00KiB
      	  devid    1 size 3.00GiB used 520.00MiB path /dev/sdc
      	  devid    2 size 3.00GiB used 896.00MiB path /dev/sdb
      
        Label: none  uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
      	  Total devices 1 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 536.00MiB path /dev/sda
      
      So as per the replace start command and kernel log replace was successful.
      Now let's try to clean mount.
      
        $ umount /btrfs
        $ btrfs device scan --forget
      
        $ mount -o device=/dev/sdc /dev/sdb /btrfs
        mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
      
        [  636.157517] BTRFS error (device sdc): failed to read chunk tree: -2
        [  636.180177] BTRFS error (device sdc): open_ctree failed
      
      That's because per dev items it is still looking for the original seed
      device.
      
       $ btrfs inspect-internal dump-tree -d /dev/sdb
      
      	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
      		devid 1 total_bytes 3221225472 bytes_used 545259520
      		io_align 4096 io_width 4096 sector_size 4096 type 0
      		generation 6 start_offset 0 dev_group 0
      		seek_speed 0 bandwidth 0
      		uuid 59368f50-9af2-4b17-91da-8a783cc418d4  <--- seed uuid
      		fsid 10bd3202-0415-43af-96a8-d5409f310a7e  <--- seed fsid
      	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
      		devid 2 total_bytes 3221225472 bytes_used 939524096
      		io_align 4096 io_width 4096 sector_size 4096 type 0
      		generation 0 start_offset 0 dev_group 0
      		seek_speed 0 bandwidth 0
      		uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a  <- sprout uuid
      		fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid
      
      But the replaced target has the following uuid+fsid in its superblock
      which doesn't match with the expected uuid+fsid in its devitem.
      
        $ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid'
        generation	20
        dev_item.uuid	59368f50-9af2-4b17-91da-8a783cc418d4
        dev_item.fsid	ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match]
        dev_item.devid	1
      
      So if you provide the original seed device the mount shall be
      successful.  Which so long happening in the test case btrfs/163.
      
        $ btrfs device scan --forget
        $ mount -o device=/dev/sda /dev/sdb /btrfs
      
      Fix in this patch:
      If a seed is not sprouted then there is no replacement of it, because of
      its read-only filesystem with a read-only device. Similarly, in the case
      of a sprouted filesystem, the seed device is still read only. So, mark
      it as you can't replace a seed device, you can only add a new device and
      then delete the seed device. If replace is attempted then returns
      -EINVAL.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c6a5d954
    • Anand Jain's avatar
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain authored
      Systems booting without the initramfs seems to scan an unusual kind
      of device path (/dev/root). And at a later time, the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at this also update the duplicate device warning to include the
      process name and PID so the messages are consistent
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79dae17d
    • Josef Bacik's avatar
      btrfs: pretty print leaked root name · 457f1864
      Josef Bacik authored
      I'm a actual human being so am incapable of converting u64 to s64 in my
      head, so add a helper to get the pretty name of a root objectid and use
      that helper to spit out the name for any special roots for leaked roots,
      so I don't have to scratch my head and figure out which root I messed up
      the refs for.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      457f1864
    • Goldwyn Rodrigues's avatar
      btrfs: sysfs: export currently running exclusive operation · 66a2823c
      Goldwyn Rodrigues authored
      /sys/fs/<fsid>/exclusive_operation contains the currently executing
      exclusive operation. Add a sysfs_notify() when operation end, so
      userspace can be notified of exclusive operation is finished.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66a2823c
    • Goldwyn Rodrigues's avatar
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues authored
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running on why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3e1f96c
    • Josef Bacik's avatar
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik authored
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58: btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca10845a
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
      btrfs: convert btrfs_inode_sectorsize to take btrfs_inode · 6fee248d
      Nikolay Borisov authored
      It's counterintuitive to have a function named btrfs_inode_xxx which
      takes a generic inode. Also move the function to btrfs_inode.h so that
      it has access to the definition of struct btrfs_inode.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6fee248d
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Josef Bacik's avatar
      btrfs: use BTRFS_NESTED_NEW_ROOT for double splits · ca9d473a
      Josef Bacik authored
      I've made this change separate since it requires both of the newly added
      NESTED flags and I didn't want to slip it into one of those changes.
      
      If we do a double split of a node we can end up doing a
      BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it
      appears as a double lock.  Since we're maxed out on subclasses, use
      BTRFS_NESTED_NEW_ROOT if we had to do a double split.  This is OK
      because we won't have to do a double split if we had to insert a new
      root, and the new root would be at a higher level anyway.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca9d473a
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots · cf6f34aa
      Josef Bacik authored
      The way we add new roots is confusing from a locking perspective for
      lockdep.  We generally have the rule that we lock things in order from
      highest level to lowest, but in the case of adding a new level to the
      tree we actually allocate a new block for the root, which makes the
      locking go in reverse.  A similar issue exists for snapshotting, we cow
      the original root for the root of a new tree, however they're at the
      same level.  Address this by using BTRFS_NESTING_NEW_ROOT for these
      operations.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf6f34aa
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_SPLIT for split blocks · 4dff97e6
      Josef Bacik authored
      If we are splitting a leaf/node, we could do something like the
      following
      
      lock(leaf)  BTRFS_NESTING_NORMAL
        lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
          push from leaf -> left
            reset path to point to left
              split left
                allocate new block, lock block BTRFS_NESTING_SPLIT
      
      at the new block point we need to have a different nesting level,
      because we have already used either BTRFS_NESTING_LEFT or
      BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
      adjacent leaves.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4dff97e6
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW · bf59a5a2
      Josef Bacik authored
      For similar reasons as BTRFS_NESTING_COW, we need
      BTRFS_NESTING_LEFT/RIGHT_COW.  The pattern is this
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
            lock left -> BTRFS_NESTING_LEFT
              cow left -> BTRFS_NESTING_LEFT_COW
      
      We need this in order to indicate to lockdep that these locks are
      discrete and are being taken in a safe order.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf59a5a2
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT · bf77467a
      Josef Bacik authored
      Our lockdep maps are based on rootid+level, however in some cases we
      will lock adjacent blocks on the same level, namely in searching forward
      or in split/balance.  Because of this lockdep will complain, so we need
      a separate subclass to indicate to lockdep that these are different
      locks.
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
             lock left -> BTRFS_NESTING_LEFT
             lock right -> BTRFS_NESTING_RIGHT
      
      The above graph illustrates the need for this new nesting subclass.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf77467a
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik authored
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9631e4cc
    • Josef Bacik's avatar
      btrfs: add nesting tags to the locking helpers · fd7ba1c1
      Josef Bacik authored
      We will need these when we switch to an rwsem, so plumb in the
      infrastructure here to use later on.  I violate the 80 character limit
      some here because it'll be cleaned up later.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd7ba1c1
    • Josef Bacik's avatar
      btrfs: introduce btrfs_path::recurse · 51899412
      Josef Bacik authored
      Our current tree locking stuff allows us to recurse with read locks if
      we're already holding the write lock.  This is necessary for the space
      cache inode, as we could be holding a lock on the root_tree root when we
      need to cache a block group, and thus need to be able to read down the
      root_tree to read in the inode cache.
      
      We can get away with this in our current locking, but we won't be able
      to with a rwsem.  Handle this by purposefully annotating the places
      where we require recursion, so that in the future we can maybe come up
      with a way to avoid the recursion.  In the case of the free space inode,
      this will be superseded by the free space tree.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51899412
    • Josef Bacik's avatar
      btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed · 329ced79
      Josef Bacik authored
      Nested locking with lockdep and everything else refers to lock hierarchy
      within the same lock map.  This is how we indicate the same locks for
      different objects are ok to take in a specific order, for our use case
      that would be to take the lock on a leaf and then take a lock on an
      adjacent leaf.
      
      What ->lock_nested _actually_ refers to is if we happen to already be
      holding the write lock on the extent buffer and we're allowing a read
      lock to be taken on that extent buffer, which is recursion.  Rename this
      so we don't get confused when we switch to a rwsem and have to start
      using the _nested helpers.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      329ced79
    • Nikolay Borisov's avatar
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov authored
      Instead of opencoding filemap_write_and_wait simply call syncblockdev as
      it makes it abundantly clear what's going on and why this is used. No
      semantics changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9ba017f
    • Nikolay Borisov's avatar
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov authored
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ae312e9
    • Nikolay Borisov's avatar
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov authored
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      to simplify the check in error: label. No semantics changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44cab9ba
    • Nikolay Borisov's avatar
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov authored
      When adding a new device there's a mandatory check to see if a device is
      being duplicated to the filesystem it's added to. Since this is a
      read-only operations not necessary to take device_list_mutex and can simply
      make do with an rcu-readlock.
      
      Using just RCU is safe because there won't be another device add delete
      running in parallel as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4cfa9bd
    • Qu Wenruo's avatar
      btrfs: ctree: check key order before merging tree blocks · d16c702f
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at btrfs_del_csums():
      
        kernel BUG at fs/btrfs/ctree.c:3188!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
        RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
        RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
        RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
        R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
        FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
        Call Trace:
        truncate_one_csum+0xac/0xf0
        btrfs_del_csums+0x24f/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 93bf9db00e6c374e ]---
      
      [CAUSE]
      This crafted image has a tricky key order corruption:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
                ...
                key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
                key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
                ...
      
        leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
                item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
                        range start 73785344 end 75497472 length 1712128
                item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
                        range start 75497472 end 75501568 length 4096
                item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
                        range start 75501568 end 77283328 length 1781760
                item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
                        range start 77283328 end 77287424 length 4096
                item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
                        range start 4120596480 end 4120903680 length 307200
        leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
                item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
                        range start 77594624 end 79306752 length 1712128
                ...
      
      Note the item 4 key of leaf 29757440, which is obviously too large, and
      even larger than the first key of the next leaf.
      
      However it still follows the key order in that tree block, thus tree
      checker is unable to detect it at read time, since tree checker can only
      work inside one leaf, thus such complex corruption can't be detected in
      advance.
      
      [FIX]
      The next time to detect such problem is at tree block merge time,
      which is in push_node_left(), balance_node_right(), push_leaf_left() or
      push_leaf_right().
      
      Now we check if the key order of the right-most key of the left node is
      larger than the left-most key of the right node.
      
      By this we don't need to call the full tree-checker, while still keeping
      the key order correct as key order in each node is already checked by
      tree checker thus we only need to check the above two slots.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d16c702f
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() · 07cce5cf
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at insert_inline_extent_backref():
      
        kernel BUG at fs/btrfs/extent-tree.c:1857!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
        RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
        RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        __btrfs_inc_extent_ref.isra.64+0x7e/0x240
        ? btrfs_merge_delayed_refs+0xa5/0x330
        __btrfs_run_delayed_refs+0x653/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 2ad8b3de903cf825 ]---
      
      [CAUSE]
      Due to extent tree corruption (still valid by itself, but bad cross
      ref), we can allocate an extent which is still in extent tree.  The
      offending tree block of that case is from csum tree.  The newly
      allocated tree block is also for csum tree.
      
      Then we will try to insert a tree block ref for the existing tree block
      ref.
      
      For a tree extent item, tree block can never be shared directly by the
      same tree twice.  We have such BUG_ON() to prevent such problem, but
      this is not a proper error handling.
      
      [FIX]
      Replace that BUG_ON() with proper error message and leaf dump for debug
      build.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07cce5cf
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() · 1c2a07f5
      Qu Wenruo authored
      __btrfs_free_extent() is doing two things:
      
      1. Reduce the refs number of an extent backref
         Either it's an inline extent backref (inside EXTENT/METADATA item) or
         a keyed extent backref (SHARED_* item).
         We only need to locate that backref line, either reduce the number or
         remove the backref line completely.
      
      2. Update the refs count in EXTENT/METADATA_ITEM
      
      During step 1), we will try to locate the EXTENT/METADATA_ITEM without
      triggering another btrfs_search_slot() as fast path.
      
      Only when we fail to locate that item, we will trigger another
      btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
      updated/deleted the backref line.
      
      And we have a lot of strict checks on things like refs_to_drop against
      extent refs and special case checks for single ref extents.
      
      There are 7 BUG_ON()s, although they're doing correct checks, they can
      be triggered by crafted images.
      
      This patch improves the function:
      
      - Introduce two examples to show what __btrfs_free_extent() is doing
        One inline backref case and one keyed case.  Should cover most cases.
      
      - Kill all BUG_ON()s with proper error message and optional leaf dump
      
      - Add comment to show the overall flow
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
      [ The report triggers one BUG_ON() in __btrfs_free_extent() ]
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c2a07f5
    • Qu Wenruo's avatar
      btrfs: extent_io: do extra check for extent buffer read write functions · f98b6215
      Qu Wenruo authored
      Although we have start, len check for extent buffer reader/write (e.g.
      read_extent_buffer()), these checks have limitations:
      
      - No overflow check
        Values like start = 1024 len = -1024 can still pass the basic
         (start + len) > eb->len check.
      
      - Checks are not consistent
        For read_extent_buffer() we only check (start + len) against eb->len.
        While for memcmp_extent_buffer() we also check start against eb->len.
      
      - Different error reporting mechanism
        We use WARN() in read_extent_buffer() but BUG() in
        memcpy_extent_buffer().
      
      - Still modify memory if the request is obviously wrong
        In read_extent_buffer() even we find (start + len) > eb->len, we still
        call memset(dst, 0, len), which can easily cause memory access error
        if start + len overflows.
      
      To address above problems, this patch creates a new common function to
      check such access, check_eb_range().
      
      - Add overflow check
        This function checks start, start + len against eb->len and overflow
        check.
      
      - Unified checks
      
      - Unified error reports
        Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
        And also do btrfs_warn() message for non-debug build.
      
      - Exit ASAP if check fails
        No more possible memory corruption.
      
      - Add extra comment for @start @len used in those functions as it's
        sometimes confused with the logical addressing instead of a range
        inside the eb space
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
      [ Inspired by above report, the report itself is already addressed ]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use check_add_overflow ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f98b6215
    • Nikolay Borisov's avatar
      btrfs: rework error detection in init_tree_roots · 217f5004
      Nikolay Borisov authored
      To avoid duplicating 3 lines of code the error detection logic in
      init_tree_roots is somewhat quirky. It first checks for the presence of
      any error condition, then checks for the specific condition to perform
      any specific actions. That's spurious because directly checking for
      each respective error condition and doing the necessary steps is more
      obvious. While at it change the -EUCLEAN to -EIO in case the extent
      buffer is not read correctly, this is in line with other sites which
      return -EIO when the eb couldn't be read.
      
      Additionally it results in smaller code and the code reads
      more linearly:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
      Function                                     old     new   delta
      open_ctree                                 17243   17148     -95
      Total: Before=113104, After=113009, chg -0.08%
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      217f5004
    • Qu Wenruo's avatar
      btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo authored
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      It's pretty obvious that, we reserve qgroup meta rsv in
      btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
      release/convert calls in btrfs_subvolume_release_metadata().
      
      This leads to the leakage.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change the btrfs_subvolume_release_metadata() to
      accept btrfs_root, and record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e85fde51
    • Qu Wenruo's avatar
      btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode · b4c5d8fd
      Qu Wenruo authored
      For delayed inode facility, qgroup metadata is reserved for it, and
      later freed.
      
      However we're freeing more bytes than we reserved.
      In btrfs_delayed_inode_reserve_metadata():
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root,
      				fs_info->nodesize, true);
      		...
      		if (!ret) {
      			node->bytes_reserved = num_bytes;
      
      But in btrfs_delayed_inode_release_metadata():
      
      	if (qgroup_free)
      		btrfs_qgroup_free_meta_prealloc(node->root,
      				node->bytes_reserved);
      	else
      		btrfs_qgroup_convert_reserved_meta(node->root,
      				node->bytes_reserved);
      
      This means, we're always releasing more qgroup metadata rsv than we have
      reserved.
      
      This won't trigger selftest warning, as btrfs qgroup metadata rsv has
      extra protection against cases like quota enabled half-way.
      
      But we still need to fix this problem any way.
      
      This patch will use the same num_bytes for qgroup metadata rsv so we
      could handle it correctly.
      
      Fixes: f218ea6c ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4c5d8fd