1. 17 Oct, 2019 1 commit
  2. 15 Oct, 2019 2 commits
    • Qu Wenruo's avatar
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo authored
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8702ba93
    • David Sterba's avatar
      btrfs: don't needlessly create extent-refs kernel thread · 80ed4548
      David Sterba authored
      The patch 32b593bf ("Btrfs: remove no longer used function to run
      delayed refs asynchronously") removed the async delayed refs but the
      thread has been created, without any use. Remove it to avoid resource
      consumption.
      
      Fixes: 32b593bf ("Btrfs: remove no longer used function to run delayed refs asynchronously")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      80ed4548
  3. 11 Oct, 2019 2 commits
  4. 08 Oct, 2019 1 commit
    • Austin Kim's avatar
      btrfs: silence maybe-uninitialized warning in clone_range · 431d3988
      Austin Kim authored
      GCC throws warning message as below:
      
      ‘clone_src_i_size’ may be used uninitialized in this function
      [-Wmaybe-uninitialized]
       #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                             ^
      fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
       u64 clone_src_i_size;
         ^
      The clone_src_i_size is only used as call-by-reference
      in a call to get_inode_info().
      
      Silence the warning by initializing clone_src_i_size to 0.
      
      Note that the warning is a false positive and reported by older versions
      of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
      patch is applied. Setting clone_src_i_size to 0 does not otherwise make
      sense and would not do any action in case the code changes in the future.
      Signed-off-by: default avatarAustin Kim <austindh.kim@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      431d3988
  5. 03 Oct, 2019 1 commit
  6. 01 Oct, 2019 4 commits
    • Josef Bacik's avatar
      btrfs: allocate new inode in NOFS context · 11a19a90
      Josef Bacik authored
      A user reported a lockdep splat
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       ------------------------------------------------------
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
       kmem_cache_alloc+0x1f/0x1c0
       btrfs_alloc_inode+0x1f/0x260
       alloc_inode+0x16/0xa0
       new_inode+0xe/0xb0
       btrfs_new_inode+0x70/0x610
       btrfs_symlink+0xd0/0x420
       vfs_symlink+0x9c/0x100
       do_symlinkat+0x66/0xe0
       do_syscall_64+0x55/0x1c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (sb_internal){.+.+}:
       __sb_start_write+0xf6/0x150
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       __list_lru_walk_one.isra.4+0x5c/0x100
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       kthread+0x147/0x160
       ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
       CPU0 CPU1
       ---- ----
       lock(fs_reclaim);
       lock(sb_internal);
       lock(fs_reclaim);
       lock(sb_internal);
      *** DEADLOCK ***
      
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       dump_stack+0x85/0xc7
       print_circular_bug.cold.40+0x1d9/0x235
       __lock_acquire+0x18b1/0x1f00
       lock_acquire+0xa6/0x170
       ? start_transaction+0x3a8/0x500
       __sb_start_write+0xf6/0x150
       ? start_transaction+0x3a8/0x500
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       ? var_wake_function+0x20/0x20
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       ? discard_new_inode+0xc0/0xc0
       __list_lru_walk_one.isra.4+0x5c/0x100
       ? discard_new_inode+0xc0/0xc0
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       ? __wake_up_common_lock+0x90/0x90
       kthread+0x147/0x160
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
       ret_from_fork+0x24/0x30
      
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() outside of this but
      for now just wrap it in memalloc_nofs_save().
      Reported-by: default avatarZdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11a19a90
    • Zygo Blaxell's avatar
      btrfs: fix balance convert to single on 32-bit host CPUs · 7a547890
      Zygo Blaxell authored
      Currently, the command:
      
      	btrfs balance start -dconvert=single,soft .
      
      on a Raspberry Pi produces the following kernel message:
      
      	BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
      
      This fails because we use is_power_of_2(unsigned long) to validate
      the new data profile, the constant for 'single' profile uses bit 48,
      and there are only 32 bits in a long on ARM.
      
      Fix by open-coding the check using u64 variables.
      
      Tested by completing the original balance command on several Raspberry
      Pis.
      
      Fixes: 818255fe ("btrfs: use common helper instead of open coding a bit test")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a547890
    • Josef Bacik's avatar
      btrfs: fix incorrect updating of log root tree · 4203e968
      Josef Bacik authored
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modification of the individual log root items are serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4203e968
    • Filipe Manana's avatar
      Btrfs: fix memory leak due to concurrent append writes with fiemap · c67d970f
      Filipe Manana authored
      When we have a buffered write that starts at an offset greater than or
      equals to the file's size happening concurrently with a full ranged
      fiemap, we can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap a buffered write, starting at offset 1Mb and with a length of
      4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c67d970f
  7. 27 Sep, 2019 2 commits
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · d4e20494
      Qu Wenruo authored
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for latter fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allow multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a two fold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() to catch the leakage as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4e20494
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · bab32fc0
      Qu Wenruo authored
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrsf_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
      than the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any data reserved space at all,
      causing a leakage.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bab32fc0
  8. 25 Sep, 2019 2 commits
    • Qu Wenruo's avatar
      btrfs: Fix a regression which we can't convert to SINGLE profile · fab27359
      Qu Wenruo authored
      [BUG]
      With v5.3 kernel, we can't convert to SINGLE profile:
      
        # btrfs balance start -f -dconvert=single $mnt
        ERROR: error during balancing '/mnt/btrfs': Invalid argument
        # dmesg -t | tail
        validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
        BTRFS error (device dm-3): balance: invalid convert data profile single
      
      [CAUSE]
      With the extra debug output added, it shows that the @allowed bit is
      lacking the special in-memory only SINGLE profile bit.
      
      Thus we fail at that (profile & ~allowed) check.
      
      This regression is caused by commit 081db89b ("btrfs: use raid_attr
      to get allowed profiles for balance conversion") and the fact that we
      don't use any bit to indicate SINGLE profile on-disk, but uses special
      in-memory only bit to help distinguish different profiles.
      
      [FIX]
      Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
      the same as it was and fix the regression.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Fixes: 081db89b ("btrfs: use raid_attr to get allowed profiles for balance conversion")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fab27359
    • Qu Wenruo's avatar
      btrfs: relocation: fix use-after-free on dead relocation roots · 1fac4a54
      Qu Wenruo authored
      [BUG]
      One user reported a reproducible KASAN report about use-after-free:
      
        BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
        BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
        Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
      
        CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
        Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        Call Trace:
         dump_stack+0x7b/0xba
         print_address_description+0x6c/0x22e
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         __kasan_report.cold+0x1b/0x3b
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         kasan_report+0x12/0x17
         __asan_report_store8_noabort+0x17/0x20
         btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
         ? lock_repin_lock+0x400/0x400
         ? __kmem_cache_shutdown.cold+0x140/0x1ad
         ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         normal_work_helper+0x1bd/0xca0 [btrfs]
         ? process_one_work+0x819/0x1720
         ? kasan_check_read+0x11/0x20
         btrfs_endio_write_helper+0x12/0x20 [btrfs]
         process_one_work+0x8c9/0x1720
         ? pwq_dec_nr_in_flight+0x2f0/0x2f0
         ? worker_thread+0x1d9/0x1030
         worker_thread+0x98/0x1030
         kthread+0x2bb/0x3b0
         ? process_one_work+0x1720/0x1720
         ? kthread_park+0x120/0x120
         ret_from_fork+0x35/0x40
      
        Allocated by task 369692:
         __kasan_kmalloc.part.0+0x44/0xc0
         __kasan_kmalloc.constprop.0+0xba/0xc0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x138/0x260
         btrfs_read_tree_root+0x92/0x360 [btrfs]
         btrfs_read_fs_root+0x10/0xb0 [btrfs]
         create_reloc_root+0x47d/0xa10 [btrfs]
         btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_start_transaction+0x1e/0x20 [btrfs]
         __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
         btrfs_prealloc_file_range+0x13/0x20 [btrfs]
         prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
         relocate_file_extent_cluster+0x193/0xc30 [btrfs]
         relocate_data_extent+0x1f8/0x490 [btrfs]
         relocate_block_group+0x600/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 369692:
         __kasan_slab_free+0x14f/0x210
         kasan_slab_free+0xe/0x10
         kfree+0xd8/0x270
         btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
         clean_dirty_subvols+0x227/0x340 [btrfs]
         relocate_block_group+0x972/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff88856f671100
         which belongs to the cache kmalloc-4k of size 4096
        The buggy address is located 1552 bytes inside of
         4096-byte region [ffff88856f671100, ffff88856f672100)
        The buggy address belongs to the page:
        page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
        flags: 0x2ffff0000010200(slab|head)
        raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                 ^
         ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        BTRFS info (device sdi1): 1 enospc errors during balance
        BTRFS info (device sdi1): balance: ended with status: -28
      
      [CAUSE]
      The problem happens when finish_ordered_io() get called with balance
      still running, while the reloc root of that subvolume is already dead.
      (Tree is swap already done, but tree not yet deleted for possible qgroup
      usage.)
      
      That means root->reloc_root still exists, but that reloc_root can be
      under btrfs_drop_snapshot(), thus we shouldn't access it.
      
      The following race could cause the use-after-free problem:
      
                      CPU1              |                CPU2
      --------------------------------------------------------------------------
                                        | relocate_block_group()
                                        | |- unset_reloc_control(rc)
                                        | |- btrfs_commit_transaction()
      btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
      |- btrfs_join_transaction()       |    |
         |- record_root_in_trans()      |    |
            |- btrfs_init_reloc_root()  |    |
               |- if (root->reloc_root) |    |
               |                        |    |- root->reloc_root = NULL
               |                        |    |- btrfs_drop_snapshot(reloc_root);
               |- reloc_root->last_trans|
                       = trans->transid |
      	    ^^^^^^^^^^^^^^^^^^^^^^
                  Use after free
      
      [FIX]
      Fix it by the following modifications:
      
      - Test if the root has dead reloc tree before accessing root->reloc_root
        If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
        create or update root->reloc_tree
      
      - Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
        reloc tree
        To co-operate with above modification, so as long as
        BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
        reloc tree at record_root_in_trans().
      Reported-by: default avatarCebtenzzre <cebtenzzre@gmail.com>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fac4a54
  9. 24 Sep, 2019 4 commits
    • Filipe Manana's avatar
      Btrfs: fix race setting up and completing qgroup rescan workers · 13fc1d27
      Filipe Manana authored
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can lead to callers of the qgroup rescan wait
      ioctl to either not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
      							 complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However if other task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race is making test case btrfs/171 (from fstests) to fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
            --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
            @@ -1,2 +1,3 @@
             QA output created by 171
            +ERROR: quota rescan failed: Operation now in progress
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "qgroup quota
      rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
      commands 'qgroup assign' and 'qgroup remove' often call the rescan start
      ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
      if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion which
      re-initilizes its wait queue to an empty queue, therefore causing the
      rescan worker at CPU 2 to call complete_all() against an empty queue,
      never waking up the task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13fc1d27
    • Filipe Manana's avatar
      Btrfs: fix missing error return if writeback for extent buffer never started · 0607eb1d
      Filipe Manana authored
      If lock_extent_buffer_for_io() fails, it returns a negative value, but its
      caller btree_write_cache_pages() ignores such error. This means that a
      call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
      failed. We should make btree_write_cache_pages() notice such error values
      and stop immediatelly, making sure filemap_fdatawrite_range() returns an
      error to the transaction commit path. A failure from flush_write_bio()
      should also result in the endio callback end_bio_extent_buffer_writepage()
      being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
      there's no risk a transaction or log commit doesn't catch a writeback
      failure.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0607eb1d
    • Dennis Zhou's avatar
      btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer · eb5b64f1
      Dennis Zhou authored
      Before, if a eb failed to write out, we would end up triggering a
      BUG_ON(). As of f4340622 ("btrfs: extent_io: Move the BUG_ON() in
      flush_write_bio() one level up"), we no longer BUG_ON(), so we should
      make life consistent and add back the unwritten bytes to
      dirty_metadata_bytes.
      
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: default avatarFilipe Manana <fdmanana@kernel.org>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb5b64f1
    • Filipe Manana's avatar
      Btrfs: fix selftests failure due to uninitialized i_mode in test inodes · 9f7fec0b
      Filipe Manana authored
      Some of the self tests create a test inode, setup some extents and then do
      calls to btrfs_get_extent() to test that the corresponding extent maps
      exist and are correct. However btrfs_get_extent(), since the 5.2 merge
      window, now errors out when it finds a regular or prealloc extent for an
      inode that does not correspond to a regular file (its ->i_mode is not
      S_IFREG). This causes the self tests to fail sometimes, specially when
      KASAN, slub_debug and page poisoning are enabled:
      
        $ modprobe btrfs
        modprobe: ERROR: could not insert 'btrfs': Invalid argument
      
        $ dmesg
        [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
        [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
        [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
        [ 9414.692918] BTRFS: selftest: running extent only tests
        [ 9414.693061] BTRFS: selftest: running bitmap only tests
        [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
        [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
        [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
        [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
        [ 9414.697564] BTRFS: selftest: running extent I/O tests
        [ 9414.697583] BTRFS: selftest: running find delalloc tests
        [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
        [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
        [ 9415.124192] BTRFS: selftest: running inode tests
        [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
        [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
        [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
        [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
      
      This happens because the test inodes are created without ever initializing
      the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
      callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
      i_mode is done through the various callbacks used by the VFS to create
      new inodes (regular files, directories, symlinks, tmpfiles, etc), which
      all call btrfs_new_inode() which in turn calls inode_init_owner(), which
      sets the inode's i_mode. Since the tests only uses new_inode() to create
      the test inodes, the i_mode was never initialized.
      
      This always happens on a VM I used with kasan, slub_debug and many other
      debug facilities enabled. It also happened to someone who reported this
      on bugzilla (on a 5.3-rc).
      
      Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f7fec0b
  10. 09 Sep, 2019 21 commits
    • Nikolay Borisov's avatar
      btrfs: Relinquish CPUs in btrfs_compare_trees · 6af112b1
      Nikolay Borisov authored
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This  can result in long
      loop chains without ever relinquishing the CPU. This causes softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6af112b1
    • Nikolay Borisov's avatar
      btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic · 65e99c43
      Nikolay Borisov authored
      Those function are simple boolean predicates there is no need to assign
      their return values to interim variables. Use them directly as
      predicates. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65e99c43
    • Johannes Thumshirn's avatar
      btrfs: create structure to encode checksum type and length · af024ed2
      Johannes Thumshirn authored
      Create a structure to encode the type and length for the known on-disk
      checksums.  This makes it easier to add new checksums later.
      
      The structure and helpers are moved from ctree.h so they don't occupy
      space in all headers including ctree.h. This save some space in the
      final object.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af024ed2
    • Johannes Thumshirn's avatar
      btrfs: turn checksum type define into an enum · e35b79a1
      Johannes Thumshirn authored
      Turn the checksum type definition into a enum. This eases later addition
      of new checksums.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e35b79a1
    • Josef Bacik's avatar
      btrfs: add enospc debug messages for ticket failure · 84fe47a4
      Josef Bacik authored
      When debugging weird enospc problems it's handy to be able to dump the
      space info when we wake up all tickets, and see what the ticket values
      are.  This helped me figure out cases where we were enospc'ing when we
      shouldn't have been.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84fe47a4
    • Josef Bacik's avatar
      btrfs: do not account global reserve in can_overcommit · 0096420a
      Josef Bacik authored
      We ran into a problem in production where a box with plenty of space was
      getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
      disk allocated, but their metadata space + global reserve was right at
      the size of their metadata chunk.
      
      In this case can_overcommit should be allowing allocations without
      problem, but there's logic in can_overcommit that doesn't allow us to
      overcommit if there's not enough real space to satisfy the global
      reserve.
      
      This is for historical reasons.  Before there were only certain places
      we could allocate chunks.  We could go to commit the transaction and not
      have enough space for our pending delayed refs and such and be unable to
      allocate a new chunk.  This would result in a abort because of ENOSPC.
      This code was added to solve this problem.
      
      However since then we've gained the ability to always be able to
      allocate a chunk.  So we can easily overcommit in these cases without
      risking a transaction abort because of ENOSPC.
      
      Also prior to now the global reserve really would be used because that's
      the space we relied on for delayed refs.  With delayed refs being
      tracked separately we no longer have to worry about running out of
      delayed refs space while committing.  We are much less likely to
      exhaust our global reserve space during transaction commit.
      
      Fix the can_overcommit code to simply see if our current usage + what we
      want is less than our current free space plus whatever slack space we
      have in the disk is.  This solves the problem we were seeing in
      production and keeps us from flushing as aggressively as we approach our
      actual metadata size usage.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0096420a
    • Josef Bacik's avatar
      btrfs: use btrfs_try_granting_tickets in update_global_rsv · 426551f6
      Josef Bacik authored
      We have some annoying xfstests tests that will create a very small fs,
      fill it up, delete it, and repeat to make sure everything works right.
      This trips btrfs up sometimes because we may commit a transaction to
      free space, but most of the free metadata space was being reserved by
      the global reserve.  So we commit and update the global reserve, but the
      space is simply added to bytes_may_use directly, instead of trying to
      add it to existing tickets.  This results in ENOSPC when we really did
      have space.  Fix this by calling btrfs_try_granting_tickets once we add
      back our excess space to wake any pending tickets.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      426551f6
    • Josef Bacik's avatar
      btrfs: always reserve our entire size for the global reserve · d792b0f1
      Josef Bacik authored
      While messing with the overcommit logic I noticed that sometimes we'd
      ENOSPC out when really we should have run out of space much earlier.  It
      turns out it's because we'll only reserve up to the free amount left in
      the space info for the global reserve, but that doesn't make sense with
      overcommit because we could be well above our actual size.  This results
      in the global reserve not carving out it's entire reservation, and thus
      not putting enough pressure on the rest of the infrastructure to do the
      right thing and ENOSPC out at a convenient time.  Fix this by always
      taking our full reservation amount for the global reserve.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d792b0f1
    • Josef Bacik's avatar
      btrfs: change the minimum global reserve size · 3593ce30
      Josef Bacik authored
      It made sense to have the global reserve set at 16M in the past, but
      since it is used less nowadays set the minimum size to the number of
      items we'll need to update the main trees we update during a transaction
      commit, plus some slop area so we can do unlinks if we need to.
      
      In practice this doesn't affect normal file systems, but for xfstests
      where we do things like fill up a fs and then rm * it can fall over in
      weird ways.  This enables us for more sane behavior at extremely small
      file system sizes.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3593ce30
    • Josef Bacik's avatar
      btrfs: rename btrfs_space_info_add_old_bytes · d05e4649
      Josef Bacik authored
      This name doesn't really fit with how the space reservation stuff works
      now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what
      the function is doing.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d05e4649
    • Josef Bacik's avatar
      btrfs: remove orig_bytes from reserve_ticket · def936e5
      Josef Bacik authored
      Now that we do not do partial filling of tickets simply remove
      orig_bytes, it is no longer needed.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      def936e5
    • Josef Bacik's avatar
      btrfs: fix may_commit_transaction to deal with no partial filling · 00c0135e
      Josef Bacik authored
      Now that we aren't partially filling tickets we may have some slack
      space left in the space_info.  We need to account for this in
      may_commit_transaction, otherwise we may choose to not commit the
      transaction despite it actually having enough space to satisfy our
      ticket.
      
      Calculate the free space we have in the space_info, if any, and subtract
      this from the ticket we have and use that amount to determine if we will
      need to commit to reclaim enough space.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      00c0135e
    • Josef Bacik's avatar
      btrfs: rework wake_all_tickets · 2341ccd1
      Josef Bacik authored
      Now that we no longer partially fill tickets we need to rework
      wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
      if any subsequent tickets are able to be satisfied.  If our tickets_id
      changes we know something happened and we can keep flushing.
      
      Also if we find a ticket that is smaller than the first ticket in our
      queue then we want to retry the flushing loop again in case
      may_commit_transaction() decides we could satisfy the ticket by
      committing the transaction.
      
      Rename this to maybe_fail_all_tickets() while we're at it, to better
      reflect what the function is actually doing.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2341ccd1
    • Josef Bacik's avatar
      btrfs: refactor the ticket wakeup code · 18fa2284
      Josef Bacik authored
      Now that btrfs_space_info_add_old_bytes simply checks if we can make the
      reservation and updates bytes_may_use, there's no reason to have both
      helpers in place.
      
      Factor out the ticket wakeup logic into it's own helper, make
      btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
      wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
      with the wakeup helper.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18fa2284
    • Josef Bacik's avatar
      btrfs: stop partially refilling tickets when releasing space · 91182645
      Josef Bacik authored
      btrfs_space_info_add_old_bytes is used when adding the extra space from
      an existing reservation back into the space_info to be used by any
      waiting tickets.  In order to keep us from overcommitting we check to
      make sure that we can still use this space for our reserve ticket, and
      if we cannot we'll simply subtract it from space_info->bytes_may_use.
      
      However this is problematic, because it assumes that only changes to
      bytes_may_use would affect our ability to make reservations.  Any
      changes to bytes_reserved would be missed.  If we were unable to make a
      reservation prior because of reserved space, but that reserved space was
      free'd due to unlink or truncate and we were allowed to immediately
      reclaim that metadata space we would still ENOSPC.
      
      Consider the example where we create a file with a bunch of extents,
      using up 2MiB of actual space for the new tree blocks.  Then we try to
      make a reservation of 2MiB but we do not have enough space to make this
      reservation.  The iput() occurs in another thread and we remove this
      space, and since we did not write the blocks we simply do
      space_info->bytes_reserved -= 2MiB.  We would never see this because we
      do not check our space info used, we just try to re-use the freed
      reservations.
      
      To fix this problem, and to greatly simplify the wakeup code, do away
      with this partial refilling nonsense.  Use
      btrfs_space_info_add_old_bytes to subtract the reservation from
      space_info->bytes_may_use, and then check the ticket against the total
      used of the space_info the same way we do with the initial reservation
      attempt.
      
      This keeps the reservation logic consistent and solves the problem of
      early ENOSPC in the case that we free up space in places other than
      bytes_may_use and bytes_pinned.  Thanks,
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91182645
    • Josef Bacik's avatar
      btrfs: add space reservation tracepoint for reserved bytes · a43c3835
      Josef Bacik authored
      I noticed when folding the trace_btrfs_space_reservation() tracepoint
      into the btrfs_space_info_update_* helpers that we didn't emit a
      tracepoint when doing btrfs_add_reserved_bytes().  I know this is
      because we were swapping bytes_may_use for bytes_reserved, so in my mind
      there was no reason to have the tracepoint there.  But now there is
      because we always emit the unreserve for the bytes_may_use side, and
      this would have broken if compression was on anyway.  Add a tracepoint
      to cover the bytes_reserved counter so the math still comes out right.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a43c3835
    • Josef Bacik's avatar
      btrfs: roll tracepoint into btrfs_space_info_update helper · f3e75e38
      Josef Bacik authored
      We duplicate this tracepoint everywhere we call these helpers, so update
      the helper to have the tracepoint as well.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3e75e38
    • Josef Bacik's avatar
      btrfs: do not allow reservations if we have pending tickets · ef1317a1
      Josef Bacik authored
      If we already have tickets on the list we don't want to steal their
      reservations.  This is a preparation patch for upcoming changes,
      technically this shouldn't happen today because of the way we add bytes
      to tickets before adding them to the space_info in most cases.
      
      This does not change the FIFO nature of reserve tickets, it simply
      allows us to enforce it in a different way.  Previously it was enforced
      because any new space would be added to the first ticket on the list,
      which would result in new reservations getting a reserve ticket.  This
      replaces that mechanism by simply checking to see if we have outstanding
      reserve tickets and skipping straight to adding a ticket for our
      reservation.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef1317a1
    • Omar Sandoval's avatar
      btrfs: stop clearing EXTENT_DIRTY in inode I/O tree · e182163d
      Omar Sandoval authored
      Since commit fee187d9 ("Btrfs: do not set EXTENT_DIRTY along with
      EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
      can simplify and stop trying to clear it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e182163d
    • Omar Sandoval's avatar
      btrfs: treat RWF_{,D}SYNC writes as sync for CRCs · f50cb7af
      Omar Sandoval authored
      The VFS indicates a synchronous write to ->write_iter() via
      iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
      (see iocb_flags()) or the RWF_* flags passed to a syscall like
      pwritev2() (see kiocb_set_rw_flags()).
      
      However, in btrfs_file_write_iter(), we're checking if a write is
      synchronous based only on the file; we use this to decide when to bump
      the sync_writers counter and thus do CRCs synchronously. Make sure we do
      this for all synchronous writes as determined by the VFS.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add const ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f50cb7af
    • Omar Sandoval's avatar
      btrfs: use correct count in btrfs_file_write_iter() · c09767a8
      Omar Sandoval authored
      generic_write_checks() may modify iov_iter_count(), so we must get the
      count after the call, not before. Using the wrong one has a couple of
      consequences:
      
      1. We check a longer range in check_can_nocow() for nowait than we're
         actually writing.
      2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
         can tell, this is harmless, but I might be missing something.
      
      These issues are pretty minor, but let's fix it before something more
      important trips on it.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c09767a8