1. 09 Dec, 2020 4 commits
    • David Sterba's avatar
      btrfs: drop casts of bio bi_sector · 1201b58b
      David Sterba authored
      Since commit 72deb455
      
       ("block: remove CONFIG_LBDAF") (5.2) the
      sector_t type is u64 on all arches and configs so we don't need to
      typecast it.  It used to be unsigned long and the result of sector size
      shifts were not guaranteed to fit in the type.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1201b58b
    • Naohiro Aota's avatar
      btrfs: implement log-structured superblock for ZONED mode · 12659251
      Naohiro Aota authored
      
      Superblock (and its copies) is the only data structure in btrfs which
      has a fixed location on a device. Since we cannot overwrite in a
      sequential write required zone, we cannot place superblock in the zone.
      One easy solution is limiting superblock and copies to be placed only in
      conventional zones.  However, this method has two downsides: one is
      reduced number of superblock copies. The location of the second copy of
      superblock is 256GB, which is in a sequential write required zone on
      typical devices in the market today.  So, the number of superblock and
      copies is limited to be two.  Second downside is that we cannot support
      devices which have no conventional zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, it reset the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
      
      The following zones are reserved as the circular buffer on ZONED btrfs.
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zones 1024 or zone at 256GB which is minimum, and
        next to it
      
      If these reserved zones are conventional, superblock is written fixed at
      the start of the zone without logging.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12659251
    • Naohiro Aota's avatar
      btrfs: check and enable ZONED mode · b70f5097
      Naohiro Aota authored
      
      Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
      enabled on the file system and if the file system consists of zoned
      devices with equal zone size.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b70f5097
    • Naohiro Aota's avatar
      btrfs: get zone information of zoned block devices · 5b316468
      Naohiro Aota authored
      
      If a zoned block device is found, get its zone information (number of
      zones and zone size).  To avoid costly run-time zone report
      commands to test the device zones type during block allocation, attach
      the seq_zones bitmap to the device structure to indicate if a zone is
      sequential or accept random writes. Also it attaches the empty_zones
      bitmap to indicate if a zone is empty or not.
      
      This patch also introduces the helper function btrfs_dev_is_sequential()
      to test if the zone storing a block is a sequential write required zone
      and btrfs_dev_is_empty_zone() to test if the zone is a empty zone.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b316468
  2. 08 Dec, 2020 8 commits
    • Anand Jain's avatar
      btrfs: remove unused argument seed from btrfs_find_device · b2598edf
      Anand Jain authored
      
      Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
      check if the parameter seed is true in the function btrfs_find_device().
      This tells it whether to traverse the seed device list or not.
      
      After this commit, the argument is unused and can be removed.
      
      In device_list_add() it's not necessary because fs_devices always points
      to the device's fs_devices. So with the devid+uuid matching, it will
      find the right device and return, thus not needing to traverse seed
      devices.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2598edf
    • Anand Jain's avatar
      btrfs: drop never met disk total bytes check in verify_one_dev_extent · 3a160a93
      Anand Jain authored
      Drop the condition in verify_one_dev_extent,
      btrfs_device::disk_total_bytes is set even for a seed device. The
      comment is wrong, the size is properly set when cloning the device.
      
      Commit 1b3922a8
      
       ("btrfs: Use real device structure to verify
      dev extent") introduced it but it's unclear why the total_disk_bytes
      was 0.
      
      Theoretically, all devices (including missing and seed) marked with the
      BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
      fill_device_from_item():
      
        open_ctree()
          btrfs_read_chunk_tree()
            read_one_dev()
              open_seed_device()
              fill_device_from_item()
      
      Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
      a bug to be fixed somewhere else and not in verify_one_dev_extent() as
      it's just a messenger. It is never expected that a total_disk_bytes
      shall be zero.
      
      The function fill_device_from_item() does the job of reading it from the
      item and updating btrfs_device::disk_total_bytes. So both the missing
      device and the seed devices do have their disk_total_bytes updated.
      btrfs_find_device can also return a device from fs_info->seed_list
      because it searches it as well.
      
      Furthermore, while removing the device if there is a power loss, we
      could have a device with its total_bytes = 0, that's still valid.
      
      Instead, introduce a check against maximum block device size in
      read_one_dev().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a160a93
    • Anand Jain's avatar
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain authored
      Commit cf89af14
      
       ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi stage operation of
      btrfs_free_extra_devids() that does not need to check replace target
      anymore and we can remove the 'step' argument.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacce86a
    • Josef Bacik's avatar
      btrfs: set the lockdep class for extent buffers on creation · e114c545
      Josef Bacik authored
      Both Filipe and Fedora QA recently hit the following lockdep splat:
      
        WARNING: possible recursive locking detected
        5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1 Not tainted
        --------------------------------------------
        rsync/2610 is trying to acquire lock:
        ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        but task is already holding lock:
        ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0
      	 ----
          lock(&eb->lock);
          lock(&eb->lock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
        2 locks held by rsync/2610:
         #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
         #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        stack backtrace:
        CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3
      
      .57.fc34.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x8b/0xb0
         __lock_acquire.cold+0x12d/0x2a4
         ? kvm_sched_clock_read+0x14/0x30
         ? sched_clock+0x5/0x10
         lock_acquire+0xc8/0x400
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         ? read_block_for_search.isra.0+0xdd/0x320
         _raw_read_lock+0x3d/0xa0
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_search_slot+0x616/0x9a0
         btrfs_lookup_dir_item+0x6c/0xb0
         btrfs_lookup_dentry+0xa8/0x520
         ? lockdep_init_map_waits+0x4c/0x210
         btrfs_lookup+0xe/0x30
         __lookup_slow+0x10f/0x1e0
         walk_component+0x11b/0x190
         path_lookupat+0x72/0x1c0
         filename_lookup+0x97/0x180
         ? strncpy_from_user+0x96/0x1e0
         ? getname_flags.part.0+0x45/0x1a0
         vfs_statx+0x64/0x100
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         __do_sys_newlstat+0x26/0x40
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? syscall_enter_from_user_mode+0x27/0x80
         ? syscall_enter_from_user_mode+0x27/0x80
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I have also seen a report of lockdep complaining about the lock class
      that was looked up being the same as the lock class on the lock we were
      using, but I can't find the report.
      
      These are problems that occur because we do not have the lockdep class
      set on the extent buffer until _after_ we read the eb in properly.  This
      is problematic for concurrent readers, because we will create the extent
      buffer, lock it, and then attempt to read the extent buffer.
      
      If a second thread comes in and tries to do a search down the same path
      they'll get the above lockdep splat because the class isn't set properly
      on the extent buffer.
      
      There was a good reason for this, we generally didn't know the real
      owner of the eb until we read it, specifically in refcounted roots.
      
      However now all refcounted roots have the same class name, so we no
      longer need to worry about this.  For non-refcounted trees we know
      which root we're on based on the parent.
      
      Fix this by setting the lockdep class on the eb at creation time instead
      of read time.  This will fix the splat and the weirdness where the class
      changes in the middle of locking the block.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e114c545
    • Josef Bacik's avatar
      btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      Josef Bacik authored
      
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fbaf258
    • Josef Bacik's avatar
      btrfs: cleanup extent buffer readahead · bfb484d9
      Josef Bacik authored
      
      We're going to pass around more information when we allocate extent
      buffers, in order to make that cleaner how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have it's arguments
      adjusted.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfb484d9
    • Anand Jain's avatar
      btrfs: create read policy framework · 33fd2f71
      Anand Jain authored
      
      As of now, we use the pid method to read striped mirrored data, which
      means process id determines the stripe id to read. This type of routing
      typically helps in a system with many small independent processes tying
      to read random data. On the other hand, the pid based read IO policy is
      inefficient because if there is a single process trying to read a large
      file, the overall disk bandwidth remains underutilized.
      
      So this patch introduces a read policy framework so that we could add
      more read policies, such as IO routing based on the device's wait-queue
      or manual when we have a read-preferred device or a policy based on the
      target storage caching.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      33fd2f71
    • Josef Bacik's avatar
      btrfs: introduce mount option rescue=ignorebadroots · 42437a63
      Josef Bacik authored
      
      In the face of extent root corruption, or any other core fs wide root
      corruption we will fail to mount the file system.  This makes recovery
      kind of a pain, because you need to fall back to userspace tools to
      scrape off data.  Instead provide a mechanism to gracefully handle bad
      roots, so we can at least mount read-only and possibly recover data from
      the file system.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42437a63
  3. 23 Nov, 2020 1 commit
    • Johannes Thumshirn's avatar
      btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn authored
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      setup yet, so we're accessing stale data, when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      
      
      Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0697d9a6
  4. 05 Nov, 2020 1 commit
    • Anand Jain's avatar
      btrfs: dev-replace: fail mount if we don't have replace item with target device · cf89af14
      Anand Jain authored
      
      If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
      item, then it means the filesystem is inconsistent state. This is either
      corruption or a crafted image.  Fail the mount as this needs a closer
      look what is actually wrong.
      
      As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
      item, in __btrfs_free_extra_devids() we determine that there is an
      extra device, and free those extra devices but continue to mount the
      device.
      However, we were wrong in keeping tack of the rw_devices so the syzbot
      testcase failed:
      
        WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x198/0x1fd lib/dump_stack.c:118
         panic+0x347/0x7c0 kernel/panic.c:231
         __warn.cold+0x20/0x46 kernel/panic.c:600
         report_bug+0x1bd/0x210 lib/bug.c:198
         handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
         exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
         asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
        RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
        RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
        RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
        RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
        R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
         close_fs_devices fs/btrfs/volumes.c:1193 [inline]
         btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
         open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
         btrfs_fill_super fs/btrfs/super.c:1316 [inline]
         btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
      
      The fix here is, when we determine that there isn't a replace item
      then fail the mount if there is a replace target device (devid 0).
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf89af14
  5. 27 Oct, 2020 1 commit
  6. 26 Oct, 2020 1 commit
    • Filipe Manana's avatar
      btrfs: fix readahead hang and use-after-free after removing a device · 66d204a1
      Filipe Manana authored
      
      Very sporadically I had test case btrfs/069 from fstests hanging (for
      years, it is not a recent regression), with the following traces in
      dmesg/syslog:
      
        [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
        [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
        [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
        [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
        [162513.514751] Call Trace:
        [162513.514761]  __schedule+0x5ce/0xd00
        [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.514771]  schedule+0x46/0xf0
        [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.514850]  ? finish_wait+0x90/0x90
        [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
        [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
        [162513.514894]  kthread+0x153/0x170
        [162513.514897]  ? kthread_stop+0x2c0/0x2c0
        [162513.514902]  ret_from_fork+0x22/0x30
        [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
        [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
        [162513.515682] Call Trace:
        [162513.515688]  __schedule+0x5ce/0xd00
        [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.515697]  schedule+0x46/0xf0
        [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.515716]  ? finish_wait+0x90/0x90
        [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.515761]  iterate_supers+0x87/0xf0
        [162513.515765]  ksys_sync+0x60/0xb0
        [162513.515768]  __do_sys_sync+0xa/0x10
        [162513.515771]  do_syscall_64+0x33/0x80
        [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.515781] RIP: 0033:0x7f5238f50bd7
        [162513.515782] Code: Bad RIP value.
        [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
        [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
        [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
        [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
        [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
        [162513.516620] Call Trace:
        [162513.516625]  __schedule+0x5ce/0xd00
        [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.516634]  schedule+0x46/0xf0
        [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.516650]  ? finish_wait+0x90/0x90
        [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
        [162513.516686]  __vfs_setxattr+0x66/0x80
        [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
        [162513.516697]  vfs_setxattr+0x6b/0x120
        [162513.516703]  setxattr+0x125/0x240
        [162513.516709]  ? lock_acquire+0xb1/0x480
        [162513.516712]  ? mnt_want_write+0x20/0x50
        [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
        [162513.516723]  ? preempt_count_add+0x49/0xa0
        [162513.516725]  ? __sb_start_write+0x19b/0x290
        [162513.516727]  ? preempt_count_add+0x49/0xa0
        [162513.516732]  path_setxattr+0xba/0xd0
        [162513.516739]  __x64_sys_setxattr+0x27/0x30
        [162513.516741]  do_syscall_64+0x33/0x80
        [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.516745] RIP: 0033:0x7f5238f56d5a
        [162513.516746] Code: Bad RIP value.
        [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
        [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
        [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
        [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
        [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
        [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
        [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
        [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
        [162513.517780] Call Trace:
        [162513.517786]  __schedule+0x5ce/0xd00
        [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.517796]  schedule+0x46/0xf0
        [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.517814]  ? finish_wait+0x90/0x90
        [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.517865]  iterate_supers+0x87/0xf0
        [162513.517869]  ksys_sync+0x60/0xb0
        [162513.517872]  __do_sys_sync+0xa/0x10
        [162513.517875]  do_syscall_64+0x33/0x80
        [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.517881] RIP: 0033:0x7f5238f50bd7
        [162513.517883] Code: Bad RIP value.
        [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
        [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
        [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
        [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
        [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
        [162513.519160] Call Trace:
        [162513.519165]  __schedule+0x5ce/0xd00
        [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.519174]  schedule+0x46/0xf0
        [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.519193]  ? finish_wait+0x90/0x90
        [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
        [162513.519230]  lookup_open+0x522/0x650
        [162513.519246]  path_openat+0x2b8/0xa50
        [162513.519270]  do_filp_open+0x91/0x100
        [162513.519275]  ? find_held_lock+0x32/0x90
        [162513.519280]  ? lock_acquired+0x33b/0x470
        [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
        [162513.519287]  ? _raw_spin_unlock+0x29/0x40
        [162513.519295]  do_sys_openat2+0x20d/0x2d0
        [162513.519300]  do_sys_open+0x44/0x80
        [162513.519304]  do_syscall_64+0x33/0x80
        [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.519309] RIP: 0033:0x7f5238f4a903
        [162513.519310] Code: Bad RIP value.
        [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
        [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
        [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
        [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
        [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
        [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
        [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
        [162513.520511] Call Trace:
        [162513.520516]  __schedule+0x5ce/0xd00
        [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.520525]  schedule+0x46/0xf0
        [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
        [162513.520548]  ? finish_wait+0x90/0x90
        [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
        [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
        [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
        [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
        [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
        [162513.520643]  ? do_sigaction+0xf3/0x240
        [162513.520645]  ? find_held_lock+0x32/0x90
        [162513.520648]  ? do_sigaction+0xf3/0x240
        [162513.520651]  ? lock_acquired+0x33b/0x470
        [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
        [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
        [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
        [162513.520662]  ? do_sigaction+0xf3/0x240
        [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
        [162513.520672]  __x64_sys_ioctl+0x83/0xb0
        [162513.520677]  do_syscall_64+0x33/0x80
        [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.520681] RIP: 0033:0x7fc3cd307d87
        [162513.520682] Code: Bad RIP value.
        [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
        [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
        [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
        [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
        [162513.520703]
      		  Showing all locks held in the system:
        [162513.520712] 1 lock held by khungtaskd/54:
        [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
        [162513.520728] 1 lock held by in:imklog/596:
        [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
        [162513.520782] 1 lock held by btrfs-transacti/1356167:
        [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
        [162513.520798] 1 lock held by btrfs/1356190:
        [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
        [162513.520805] 1 lock held by fsstress/1356184:
        [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520811] 3 locks held by fsstress/1356185:
        [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
        [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520833] 1 lock held by fsstress/1356196:
        [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520838] 3 locks held by fsstress/1356197:
        [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
        [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520858] 2 locks held by btrfs/1356211:
        [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
        [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
      
      This was weird because the stack traces show that a transaction commit,
      triggered by a device replace operation, is blocking trying to pause any
      running scrubs but there are no stack traces of blocked tasks doing a
      scrub.
      
      After poking around with drgn, I noticed there was a scrub task that was
      constantly running and blocking for shorts periods of time:
      
        >>> t = find_task(prog, 1356190)
        >>> prog.stack_trace(t)
        #0  __schedule+0x5ce/0xcfc
        #1  schedule+0x46/0xe4
        #2  schedule_timeout+0x1df/0x475
        #3  btrfs_reada_wait+0xda/0x132
        #4  scrub_stripe+0x2a8/0x112f
        #5  scrub_chunk+0xcd/0x134
        #6  scrub_enumerate_chunks+0x29e/0x5ee
        #7  btrfs_scrub_dev+0x2d5/0x91b
        #8  btrfs_ioctl+0x7f5/0x36e7
        #9  __x64_sys_ioctl+0x83/0xb0
        #10 do_syscall_64+0x33/0x77
        #11 entry_SYSCALL_64+0x7c/0x156
      
      Which corresponds to:
      
      int btrfs_reada_wait(void *handle)
      {
          struct reada_control *rc = handle;
          struct btrfs_fs_info *fs_info = rc->fs_info;
      
          while (atomic_read(&rc->elems)) {
              if (!atomic_read(&fs_info->reada_works_cnt))
                  reada_start_machine(fs_info);
              wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                (HZ + 9) / 10);
          }
      (...)
      
      So the counter "rc->elems" was set to 1 and never decreased to 0, causing
      the scrub task to loop forever in that function. Then I used the following
      script for drgn to check the readahead requests:
      
        $ cat dump_reada.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        def dump_re(re):
            nzones = re.nzones.value_()
            print(f're at {hex(re.value_())}')
            print(f'\t logical {re.logical.value_()}')
            print(f'\t refcnt {re.refcnt.value_()}')
            print(f'\t nzones {nzones}')
            for i in range(nzones):
                dev = re.zones[i].device
                name = dev.name.str.string_()
                print(f'\t\t dev id {dev.devid.value_()} name {name}')
            print()
      
        for _, e in radix_tree_for_each(fs_info.reada_tree):
            re = cast('struct reada_extent *', e)
            dump_re(re)
      
        $ drgn dump_reada.py
        re at 0xffff8f3da9d25ad8
                logical 38928384
                refcnt 1
                nzones 1
                       dev id 0 name b'/dev/sdd'
        $
      
      So there was one readahead extent with a single zone corresponding to the
      source device of that last device replace operation logged in dmesg/syslog.
      Also the ID of that zone's device was 0 which is a special value set in
      the source device of a device replace operation when the operation finishes
      (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
      confirming again that device /dev/sdd was the source of a device replace
      operation.
      
      Normally there should be as many zones in the readahead extent as there are
      devices, and I wasn't expecting the extent to be in a block group with a
      'single' profile, so I went and confirmed with the following drgn script
      that there weren't any single profile block groups:
      
        $ cat dump_block_groups.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        BTRFS_BLOCK_GROUP_DATA = (1 << 0)
        BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
        BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
        BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
        BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
        BTRFS_BLOCK_GROUP_DUP = (1 << 5)
        BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
        BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
        BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
        BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
        BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
      
        def bg_flags_string(bg):
            flags = bg.flags.value_()
            ret = ''
            if flags & BTRFS_BLOCK_GROUP_DATA:
                ret = 'data'
            if flags & BTRFS_BLOCK_GROUP_METADATA:
                if len(ret) > 0:
                    ret += '|'
                ret += 'meta'
            if flags & BTRFS_BLOCK_GROUP_SYSTEM:
                if len(ret) > 0:
                    ret += '|'
                ret += 'system'
            if flags & BTRFS_BLOCK_GROUP_RAID0:
                ret += ' raid0'
            elif flags & BTRFS_BLOCK_GROUP_RAID1:
                ret += ' raid1'
            elif flags & BTRFS_BLOCK_GROUP_DUP:
                ret += ' dup'
            elif flags & BTRFS_BLOCK_GROUP_RAID10:
                ret += ' raid10'
            elif flags & BTRFS_BLOCK_GROUP_RAID5:
                ret += ' raid5'
            elif flags & BTRFS_BLOCK_GROUP_RAID6:
                ret += ' raid6'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
                ret += ' raid1c3'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
                ret += ' raid1c4'
            else:
                ret += ' single'
      
            return ret
      
        def dump_bg(bg):
            print()
            print(f'block group at {hex(bg.value_())}')
            print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
            print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
      
        bg_root = fs_info.block_group_cache_tree.address_of_()
        for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
            dump_bg(bg)
      
        $ drgn dump_block_groups.py
      
        block group at 0xffff8f3d673b0400
               start 22020096 length 16777216
               flags 258 - system raid6
      
        block group at 0xffff8f3d53ddb400
               start 38797312 length 536870912
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4d9c00
               start 575668224 length 2147483648
               flags 257 - data raid6
      
        block group at 0xffff8f3d08189000
               start 2723151872 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3db70ff000
               start 2790260736 length 1073741824
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4dd800
               start 3864002560 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3d67037000
               start 3931111424 length 2147483648
               flags 257 - data raid6
        $
      
      So there were only 2 reasons left for having a readahead extent with a
      single zone: reada_find_zone(), called when creating a readahead extent,
      returned NULL either because we failed to find the corresponding block
      group or because a memory allocation failed. With some additional and
      custom tracing I figured out that on every further ocurrence of the
      problem the block group had just been deleted when we were looping to
      create the zones for the readahead extent (at reada_find_extent()), so we
      ended up with only one zone in the readahead extent, corresponding to a
      device that ends up getting replaced.
      
      So after figuring that out it became obvious why the hang happens:
      
      1) Task A starts a scrub on any device of the filesystem, except for
         device /dev/sdd;
      
      2) Task B starts a device replace with /dev/sdd as the source device;
      
      3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
         starting to scrub a stripe from block group X. This call to
         btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
         calls reada_add_block(), it passes the logical address of the extent
         tree's root node as its 'logical' argument - a value of 38928384;
      
      4) Task A then enters reada_find_extent(), called from reada_add_block().
         It finds there isn't any existing readahead extent for the logical
         address 38928384, so it proceeds to the path of creating a new one.
      
         It calls btrfs_map_block() to find out which stripes exist for the block
         group X. On the first iteration of the for loop that iterates over the
         stripes, it finds the stripe for device /dev/sdd, so it creates one
         zone for that device and adds it to the readahead extent. Before getting
         into the second iteration of the loop, the cleanup kthread deletes block
         group X because it was empty. So in the iterations for the remaining
         stripes it does not add more zones to the readahead extent, because the
         calls to reada_find_zone() returned NULL because they couldn't find
         block group X anymore.
      
         As a result the new readahead extent has a single zone, corresponding to
         the device /dev/sdd;
      
      4) Before task A returns to btrfs_reada_add() and queues the readahead job
         for the readahead work queue, task B finishes the device replace and at
         btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
         device /dev/sdg;
      
      5) Task A returns to reada_add_block(), which increments the counter
         "->elems" of the reada_control structure allocated at btrfs_reada_add().
      
         Then it returns back to btrfs_reada_add() and calls
         reada_start_machine(). This queues a job in the readahead work queue to
         run the function reada_start_machine_worker(), which calls
         __reada_start_machine().
      
         At __reada_start_machine() we take the device list mutex and for each
         device found in the current device list, we call
         reada_start_machine_dev() to start the readahead work. However at this
         point the device /dev/sdd was already freed and is not in the device
         list anymore.
      
         This means the corresponding readahead for the extent at 38928384 is
         never started, and therefore the "->elems" counter of the reada_control
         structure allocated at btrfs_reada_add() never goes down to 0, causing
         the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
      
      Note that the readahead request can be made either after the device replace
      started or before it started, however in pratice it is very unlikely that a
      device replace is able to start after a readahead request is made and is
      able to complete before the readahead request completes - maybe only on a
      very small and nearly empty filesystem.
      
      This hang however is not the only problem we can have with readahead and
      device removals. When the readahead extent has other zones other than the
      one corresponding to the device that is being removed (either by a device
      replace or a device remove operation), we risk having a use-after-free on
      the device when dropping the last reference of the readahead extent.
      
      For example if we create a readahead extent with two zones, one for the
      device /dev/sdd and one for the device /dev/sde:
      
      1) Before the readahead worker starts, the device /dev/sdd is removed,
         and the corresponding btrfs_device structure is freed. However the
         readahead extent still has the zone pointing to the device structure;
      
      2) When the readahead worker starts, it only finds device /dev/sde in the
         current device list of the filesystem;
      
      3) It starts the readahead work, at reada_start_machine_dev(), using the
         device /dev/sde;
      
      4) Then when it finishes reading the extent from device /dev/sde, it calls
         __readahead_hook() which ends up dropping the last reference on the
         readahead extent through the last call to reada_extent_put();
      
      5) At reada_extent_put() it iterates over each zone of the readahead extent
         and attempts to delete an element from the device's 'reada_extents'
         radix tree, resulting in a use-after-free, as the device pointer of the
         zone for /dev/sdd is now stale. We can also access the device after
         dropping the last reference of a zone, through reada_zone_release(),
         also called by reada_extent_put().
      
      And a device remove suffers the same problem, however since it shrinks the
      device size down to zero before removing the device, it is very unlikely to
      still have readahead requests not completed by the time we free the device,
      the only possibility is if the device has a very little space allocated.
      
      While the hang problem is exclusive to scrub, since it is currently the
      only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
      problem affects any path that triggers readhead, which includes
      btree_readahead_hook() and __readahead_hook() (a readahead worker can
      trigger readahed for the children of a node) for example - any path that
      ends up calling reada_add_block() can trigger the use-after-free after a
      device is removed.
      
      So fix this by waiting for any readahead requests for a device to complete
      before removing a device, ensuring that while waiting for existing ones no
      new ones can be made.
      
      This problem has been around for a very long time - the readahead code was
      added in 2011, device remove exists since 2008 and device replace was
      introduced in 2013, hard to pick a specific commit for a git Fixes tag.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66d204a1
  7. 07 Oct, 2020 24 commits
    • Anand Jain's avatar
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain authored
      
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96c2e067
    • Josef Bacik's avatar
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik authored
      
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92e26df4
    • Josef Bacik's avatar
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik authored
      
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      124604eb
    • Anand Jain's avatar
      btrfs: simplify gotos in open_seed_device · c83b60c0
      Anand Jain authored
      
      The function does not have a common exit block and returns immediatelly
      so there's no point having the goto. Remove the two cases.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c83b60c0
    • Anand Jain's avatar
      btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() · e493e8f9
      Anand Jain authored
      
      We can check the argument value directly, no need for the temporary
      variable.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e493e8f9
    • Anand Jain's avatar
      btrfs: use sprout device_list_mutex in btrfs_init_devices_late · e17125b5
      Anand Jain authored
      
      On a mounted sprout filesystem, all threads now are using the
      sprout::device_list_mutex, and this is the only code using the
      seed::device_list_mutex. This patch converts to use the sprouts
      fs_info->fs_devices->device_list_mutex.
      
      The same reasoning holds true here, that device delete is holding
      the sprout::device_list_mutex.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e17125b5
    • Anand Jain's avatar
      btrfs: split and refactor btrfs_sysfs_remove_devices_dir · 53f8a74c
      Anand Jain authored
      
      Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
      btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
      argument to indicate whether to free all devices or just one device.
      
      Export btrfs_sysfs_remove_device() as device operations outside of
      sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().
      
      btrfs_sysfs_remove_devices_dir() is renamed to
      btrfs_sysfs_remove_fs_devices() to suite its new role.
      
      Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
      so it is redeclared s static. And the same function had to be moved
      before its first caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53f8a74c
    • Anand Jain's avatar
      btrfs: simplify parameters of btrfs_sysfs_add_devices_dir · cd36da2e
      Anand Jain authored
      
      When we add a device we need to add it to sysfs, so instead of using the
      btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
      add a device or all of fs_devices, call the helper function directly
      btrfs_sysfs_add_device() and thus make it non-static.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd36da2e
    • Anand Jain's avatar
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain authored
      Systems booting without the initramfs seems to scan an unusual kind
      of device path (/dev/root). And at a later time, the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at this also update the duplicate device warning to include the
      process name and PID so the messages are consistent
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
      
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79dae17d
    • Goldwyn Rodrigues's avatar
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues authored
      
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running on why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3e1f96c
    • Josef Bacik's avatar
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik authored
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58
      
      : btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca10845a
    • Nikolay Borisov's avatar
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov authored
      
      Instead of opencoding filemap_write_and_wait simply call syncblockdev as
      it makes it abundantly clear what's going on and why this is used. No
      semantics changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9ba017f
    • Nikolay Borisov's avatar
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov authored
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10
      
       ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ae312e9
    • Nikolay Borisov's avatar
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov authored
      
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      to simplify the check in error: label. No semantics changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44cab9ba
    • Nikolay Borisov's avatar
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov authored
      
      When adding a new device there's a mandatory check to see if a device is
      being duplicated to the filesystem it's added to. Since this is a
      read-only operations not necessary to take device_list_mutex and can simply
      make do with an rcu-readlock.
      
      Using just RCU is safe because there won't be another device add delete
      running in parallel as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4cfa9bd
    • Josef Bacik's avatar
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik authored
      
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      425c6ed6
    • Josef Bacik's avatar
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik authored
      
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62cf5391
    • Nikolay Borisov's avatar
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov authored
      
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68abf360
    • Nikolay Borisov's avatar
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov authored
      
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      427c8fdd
    • Nikolay Borisov's avatar
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov authored
      
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      944d3f9f
    • Nikolay Borisov's avatar
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov authored
      
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4989c2f
    • Nikolay Borisov's avatar
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov authored
      
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54eed6ae
    • Nikolay Borisov's avatar
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov authored
      
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3712ccb7
    • Qu Wenruo's avatar
      btrfs: add owner and fs_info to alloc_state io_tree · 154f7cb8
      Qu Wenruo authored
      Commit 1c11b63e ("btrfs: replace pending/pinned chunks lists with io
      tree") introduced btrfs_device::alloc_state extent io tree, but it
      doesn't initialize the fs_info and owner member.
      
      This means the following features are not properly supported:
      
      - Fs owner report for insert_state() error
        Without fs_info initialized, although btrfs_err() won't panic, it
        won't output which fs is causing the error.
      
      - Wrong owner for trace events
        alloc_state will get the owner as pinned extents.
      
      Fix this by assiging proper fs_info and owner for
      btrfs_device::alloc_state.
      
      Fixes: 1c11b63e
      
       ("btrfs: replace pending/pinned chunks lists with io tree")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      154f7cb8