1. 25 Sep, 2019 1 commit
    • Qu Wenruo's avatar
      btrfs: relocation: fix use-after-free on dead relocation roots · 1fac4a54
      Qu Wenruo authored
      [BUG]
      One user reported a reproducible KASAN report about use-after-free:
      
        BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
        BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
        Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
      
        CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
        Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        Call Trace:
         dump_stack+0x7b/0xba
         print_address_description+0x6c/0x22e
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         __kasan_report.cold+0x1b/0x3b
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         kasan_report+0x12/0x17
         __asan_report_store8_noabort+0x17/0x20
         btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
         ? lock_repin_lock+0x400/0x400
         ? __kmem_cache_shutdown.cold+0x140/0x1ad
         ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         normal_work_helper+0x1bd/0xca0 [btrfs]
         ? process_one_work+0x819/0x1720
         ? kasan_check_read+0x11/0x20
         btrfs_endio_write_helper+0x12/0x20 [btrfs]
         process_one_work+0x8c9/0x1720
         ? pwq_dec_nr_in_flight+0x2f0/0x2f0
         ? worker_thread+0x1d9/0x1030
         worker_thread+0x98/0x1030
         kthread+0x2bb/0x3b0
         ? process_one_work+0x1720/0x1720
         ? kthread_park+0x120/0x120
         ret_from_fork+0x35/0x40
      
        Allocated by task 369692:
         __kasan_kmalloc.part.0+0x44/0xc0
         __kasan_kmalloc.constprop.0+0xba/0xc0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x138/0x260
         btrfs_read_tree_root+0x92/0x360 [btrfs]
         btrfs_read_fs_root+0x10/0xb0 [btrfs]
         create_reloc_root+0x47d/0xa10 [btrfs]
         btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_start_transaction+0x1e/0x20 [btrfs]
         __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
         btrfs_prealloc_file_range+0x13/0x20 [btrfs]
         prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
         relocate_file_extent_cluster+0x193/0xc30 [btrfs]
         relocate_data_extent+0x1f8/0x490 [btrfs]
         relocate_block_group+0x600/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 369692:
         __kasan_slab_free+0x14f/0x210
         kasan_slab_free+0xe/0x10
         kfree+0xd8/0x270
         btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
         clean_dirty_subvols+0x227/0x340 [btrfs]
         relocate_block_group+0x972/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff88856f671100
         which belongs to the cache kmalloc-4k of size 4096
        The buggy address is located 1552 bytes inside of
         4096-byte region [ffff88856f671100, ffff88856f672100)
        The buggy address belongs to the page:
        page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
        flags: 0x2ffff0000010200(slab|head)
        raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                 ^
         ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        BTRFS info (device sdi1): 1 enospc errors during balance
        BTRFS info (device sdi1): balance: ended with status: -28
      
      [CAUSE]
      The problem happens when finish_ordered_io() get called with balance
      still running, while the reloc root of that subvolume is already dead.
      (Tree is swap already done, but tree not yet deleted for possible qgroup
      usage.)
      
      That means root->reloc_root still exists, but that reloc_root can be
      under btrfs_drop_snapshot(), thus we shouldn't access it.
      
      The following race could cause the use-after-free problem:
      
                      CPU1              |                CPU2
      --------------------------------------------------------------------------
                                        | relocate_block_group()
                                        | |- unset_reloc_control(rc)
                                        | |- btrfs_commit_transaction()
      btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
      |- btrfs_join_transaction()       |    |
         |- record_root_in_trans()      |    |
            |- btrfs_init_reloc_root()  |    |
               |- if (root->reloc_root) |    |
               |                        |    |- root->reloc_root = NULL
               |                        |    |- btrfs_drop_snapshot(reloc_root);
               |- reloc_root->last_trans|
                       = trans->transid |
      	    ^^^^^^^^^^^^^^^^^^^^^^
                  Use after free
      
      [FIX]
      Fix it by the following modifications:
      
      - Test if the root has dead reloc tree before accessing root->reloc_root
        If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
        create or update root->reloc_tree
      
      - Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
        reloc tree
        To co-operate with above modification, so as long as
        BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
        reloc tree at record_root_in_trans().
      Reported-by: default avatarCebtenzzre <cebtenzzre@gmail.com>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fac4a54
  2. 24 Sep, 2019 4 commits
    • Filipe Manana's avatar
      Btrfs: fix race setting up and completing qgroup rescan workers · 13fc1d27
      Filipe Manana authored
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can lead to callers of the qgroup rescan wait
      ioctl to either not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
      							 complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However if other task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race is making test case btrfs/171 (from fstests) to fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
            --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
            @@ -1,2 +1,3 @@
             QA output created by 171
            +ERROR: quota rescan failed: Operation now in progress
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "qgroup quota
      rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
      commands 'qgroup assign' and 'qgroup remove' often call the rescan start
      ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
      if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion which
      re-initilizes its wait queue to an empty queue, therefore causing the
      rescan worker at CPU 2 to call complete_all() against an empty queue,
      never waking up the task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13fc1d27
    • Filipe Manana's avatar
      Btrfs: fix missing error return if writeback for extent buffer never started · 0607eb1d
      Filipe Manana authored
      If lock_extent_buffer_for_io() fails, it returns a negative value, but its
      caller btree_write_cache_pages() ignores such error. This means that a
      call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
      failed. We should make btree_write_cache_pages() notice such error values
      and stop immediatelly, making sure filemap_fdatawrite_range() returns an
      error to the transaction commit path. A failure from flush_write_bio()
      should also result in the endio callback end_bio_extent_buffer_writepage()
      being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
      there's no risk a transaction or log commit doesn't catch a writeback
      failure.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0607eb1d
    • Dennis Zhou's avatar
      btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer · eb5b64f1
      Dennis Zhou authored
      Before, if a eb failed to write out, we would end up triggering a
      BUG_ON(). As of f4340622 ("btrfs: extent_io: Move the BUG_ON() in
      flush_write_bio() one level up"), we no longer BUG_ON(), so we should
      make life consistent and add back the unwritten bytes to
      dirty_metadata_bytes.
      
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: default avatarFilipe Manana <fdmanana@kernel.org>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb5b64f1
    • Filipe Manana's avatar
      Btrfs: fix selftests failure due to uninitialized i_mode in test inodes · 9f7fec0b
      Filipe Manana authored
      Some of the self tests create a test inode, setup some extents and then do
      calls to btrfs_get_extent() to test that the corresponding extent maps
      exist and are correct. However btrfs_get_extent(), since the 5.2 merge
      window, now errors out when it finds a regular or prealloc extent for an
      inode that does not correspond to a regular file (its ->i_mode is not
      S_IFREG). This causes the self tests to fail sometimes, specially when
      KASAN, slub_debug and page poisoning are enabled:
      
        $ modprobe btrfs
        modprobe: ERROR: could not insert 'btrfs': Invalid argument
      
        $ dmesg
        [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
        [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
        [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
        [ 9414.692918] BTRFS: selftest: running extent only tests
        [ 9414.693061] BTRFS: selftest: running bitmap only tests
        [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
        [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
        [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
        [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
        [ 9414.697564] BTRFS: selftest: running extent I/O tests
        [ 9414.697583] BTRFS: selftest: running find delalloc tests
        [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
        [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
        [ 9415.124192] BTRFS: selftest: running inode tests
        [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
        [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
        [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
        [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
      
      This happens because the test inodes are created without ever initializing
      the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
      callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
      i_mode is done through the various callbacks used by the VFS to create
      new inodes (regular files, directories, symlinks, tmpfiles, etc), which
      all call btrfs_new_inode() which in turn calls inode_init_owner(), which
      sets the inode's i_mode. Since the tests only uses new_inode() to create
      the test inodes, the i_mode was never initialized.
      
      This always happens on a VM I used with kasan, slub_debug and many other
      debug facilities enabled. It also happened to someone who reported this
      on bugzilla (on a 5.3-rc).
      
      Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f7fec0b
  3. 09 Sep, 2019 35 commits