1. 03 Jan, 2022 27 commits
    • Qu Wenruo's avatar
      btrfs: remove unnecessary @nr_written parameters · 83f1b680
      Qu Wenruo authored
      We use @nr_written to record how many pages have been started by
      btrfs_run_delalloc_range().
      
      Currently there are only two cases that would populate @nr_written:
      
      - Inline extent creation
      - Compressed write
      
      But both cases will also set @page_started to one.
      
      In fact, in writepage_delalloc() we have the following code, showing
      that @nr_written is really only utilized for above two cases:
      
      	/* did the fill delalloc function already unlock and start
      	 * the IO?
      	 */
      	if (page_started) {
      		/*
      		 * we've unlocked the page, so we can't update
      		 * the mapping's writeback index, just update
      		 * nr_to_write.
      		 */
      		wbc->nr_to_write -= nr_written;
      		return 1;
      	}
      
      But for such cases, writepage_delalloc() will return 1, and exit
      __extent_writepage() without going through __extent_writepage_io().
      
      Thus this means, inside __extent_writepage_io(), we always get
      @nr_written as 0.
      
      So this patch is going to remove the unnecessary parameter from the
      following functions:
      
      - writepage_delalloc()
      
        As @nr_written passed in is always the initial value 0.
      
        Although inside that function, we still need a local @nr_written
        to update wbc->nr_to_write.
      
      - __extent_writepage_io()
      
        As explained above, @nr_written passed in can only be 0.
      
        This also means we can remove one update_nr_written() call.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      83f1b680
    • Josef Bacik's avatar
      btrfs: change root to fs_info for btrfs_reserve_metadata_bytes · 9270501c
      Josef Bacik authored
      We used to need the root for btrfs_reserve_metadata_bytes to check the
      orphan cleanup state, but we no longer need that, we simply need the
      fs_info.  Change btrfs_reserve_metadata_bytes() to use the fs_info, and
      change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
      same as they simply call btrfs_reserve_metadata_bytes() and then
      manipulate the block_rsv that is being used.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9270501c
    • Josef Bacik's avatar
      btrfs: get rid of root->orphan_cleanup_state · 54230013
      Josef Bacik authored
      Now that we don't care about the stage of the orphan_cleanup_state,
      simply replace it with a bit on ->state to make sure we don't call the
      orphan cleanup every time we wander into this root.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54230013
    • Josef Bacik's avatar
      btrfs: remove global rsv stealing logic for orphan cleanup · 6dbdd578
      Josef Bacik authored
      This is very old code before we were stealing from the global reserve
      during evict.  We have proper ways to steal from the global reserve
      while we're evicting, so rip out this code as it's no longer necessary.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6dbdd578
    • Josef Bacik's avatar
      btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing code · ee6adbfd
      Josef Bacik authored
      I forgot to convert this over when I introduced the global reserve
      stealing code to the space flushing code.  Evict was simply trying to
      make its reservation and then if it failed it would steal from the
      global rsv, which is racey because it's outside of the normal ticketing
      code.
      
      Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
      and then make the priority flushing path do the steal for us.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ee6adbfd
    • Josef Bacik's avatar
      btrfs: check ticket->steal in steal_from_global_block_rsv · 1b0309ea
      Josef Bacik authored
      We're going to use this helper in the priority flushing loop, move this
      check into the helper to simplify the logic.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b0309ea
    • Josef Bacik's avatar
      btrfs: check for priority ticket granting before flushing · 9cd8dcdc
      Josef Bacik authored
      Since we're dropping locks before we enter the priority flushing loops
      we could have had our ticket granted before we got the space_info->lock.
      So add this check to avoid doing some extra flushing in the priority
      flushing cases.
      
      The case in priority_reclaim_metadata_space is an optimization.  Think
      we came in to reserve, we didn't have the space, we added our ticket to
      the list.  But at the same time somebody was waiting on the space_info
      lock to add space and do btrfs_try_granting_ticket(), so we drop the
      lock, get satisfied, come in to do our loop, and we have been
      satisfied.
      
      This is the priority reclaim path, so to_reclaim could be !0 still
      because we may have only satisfied the priority tickets and still left
      non priority tickets on the list.  We would then have to_reclaim but
      ->bytes == 0.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ add note about the optimization ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9cd8dcdc
    • Josef Bacik's avatar
      btrfs: handle priority ticket failures in their respective helpers · 9f35f76d
      Josef Bacik authored
      Currently the error case for the priority tickets is handled where we
      deal with all of the tickets, priority and non-priority.  This is OK in
      general, but it makes for some awkward locking.  We take and drop the
      space_info->lock back to back because of these different types of
      tickets.
      
      Rework the code to handle priority ticket failures in their respective
      helpers.  This allows us to be less wonky with our space_info->lock
      usage, and means that the main handler simply has to check
      ticket->error, as the ticket is guaranteed to be off any list and
      completely handled by the time it exits one of the handlers.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f35f76d
    • Naohiro Aota's avatar
      btrfs: zoned: cache reported zone during mount · 16beac87
      Naohiro Aota authored
      When mounting a device, we are reporting the zones twice: once for
      checking the zone attributes in btrfs_get_dev_zone_info and once for
      loading block groups' zone info in
      btrfs_load_block_group_zone_info(). With a lot of block groups, that
      leads to a lot of REPORT ZONE commands and slows down the mount
      process.
      
      This patch introduces a zone info cache in struct
      btrfs_zoned_device_info. The cache is populated while in
      btrfs_get_dev_zone_info() and used for
      btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
      commands. The zone cache is then released after loading the block
      groups, as it will not be much effective during the run time.
      
      Benchmark: Mount an HDD with 57,007 block groups
      Before patch: 171.368 seconds
      After patch: 64.064 seconds
      
      While it still takes a minute due to the slowness of loading all the
      block groups, the patch reduces the mount time by 1/3.
      
      Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      16beac87
    • Su Yue's avatar
      btrfs: remove unused parameter fs_devices from btrfs_init_workqueues · d21deec5
      Su Yue authored
      Since commit ba8a9d07 ("Btrfs: delete the entire async bio submission
      framework") removed submit workqueues, the parameter fs_devices is not used
      anymore.
      
      Remove it, no functional changes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d21deec5
    • Filipe Manana's avatar
      btrfs: reduce the scope of the tree log mutex during transaction commit · dfba78dc
      Filipe Manana authored
      In the transaction commit path we are acquiring the tree log mutex too
      early and we have a stale comment because:
      
      1) It mentions a function named btrfs_commit_tree_roots(), which does not
         exists anymore, it was the old name of commit_cowonly_roots(), renamed
         a very long time ago by commit 5d4f98a2 ("Btrfs: Mixed back
         reference  (FORWARD ROLLING FORMAT CHANGE)"));
      
      2) It mentions that we need to acquire the tree log mutex at that point
         to ensure we have no running log writers. That is not correct anymore,
         for many years at least, since we are guaranteed that we do not have
         any log writers at that point simply because we have set the state of
         the transaction to TRANS_STATE_COMMIT_DOING and have waited for all
         writers to complete - meaning no one can log until we change the state
         of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the
         transaction or start a new one will block until we do that state
         transition;
      
      3) The comment mentions a "trans mutex" which doesn't exists since 2011,
         commit a4abeea4 ("Btrfs: kill trans_mutex") removed it;
      
      4) The current use of the tree log mutex is to ensure proper serialization
         of super block writes - if someone started a new transaction and uses it
         for logging, it will wait for the previous transaction to write its
         super block before writing the super block when attempting to sync the
         log.
      
      So acquire the tree log mutex only when it's absolutely needed, before
      setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the
      stale comment, add some assertions and new comments where appropriate.
      
      Also, this has no effect on concurrency or performance, since the new
      start of the critical section is still when the transaction is in the
      state TRANS_STATE_COMMIT_DOING.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dfba78dc
    • Anand Jain's avatar
      btrfs: consolidate device_list_mutex in prepare_sprout to its parent · 849eae5e
      Anand Jain authored
      btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
      so that its parent function btrfs_init_new_device() can add the new sprout
      device to fs_info->fs_devices.
      
      Both btrfs_prepare_sprout() and btrfs_init_new_device() need
      device_list_mutex. But they are holding it separately, thus create a
      small race window. Close it and hold device_list_mutex across both
      functions btrfs_init_new_device() and btrfs_prepare_sprout().
      
      Split btrfs_prepare_sprout() into btrfs_init_sprout() and
      btrfs_setup_sprout(). This split is essential because device_list_mutex
      must not be held for allocations in btrfs_init_sprout() but must be held
      for btrfs_setup_sprout(). So now a common device_list_mutex can be used
      between btrfs_init_new_device() and btrfs_setup_sprout().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      849eae5e
    • Anand Jain's avatar
      btrfs: switch seeding_dev in init_new_device to bool · fd880809
      Anand Jain authored
      Declare int seeding_dev as a bool. Also, move its declaration a line
      below to adjust packing.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd880809
    • Omar Sandoval's avatar
      btrfs: send: remove unused type parameter to iterate_inode_ref_t · b1dea4e7
      Omar Sandoval authored
      Again, I don't think this was ever used since iterate_dir_item() is only
      used for xattrs. No functional change.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b1dea4e7
    • Omar Sandoval's avatar
      btrfs: send: remove unused found_type parameter to lookup_dir_item_inode() · eab67c06
      Omar Sandoval authored
      As far as I can tell, this was never used. No functional change.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eab67c06
    • Josef Bacik's avatar
      btrfs: rename btrfs_item_end_nr to btrfs_item_data_end · dc2e724e
      Josef Bacik authored
      The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
      the offset of the end of the data the item points to.  In fact all of
      the helpers that we use btrfs_item_end_nr() use data in their name, like
      BTRFS_LEAF_DATA_SIZE() and leaf_data().  Rename to btrfs_item_data_end()
      to make it clear what this helper is giving us.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc2e724e
    • Josef Bacik's avatar
      btrfs: remove the btrfs_item_end() helper · 5a08663d
      Josef Bacik authored
      We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
      be collapsed.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a08663d
    • Josef Bacik's avatar
      btrfs: drop the _nr from the item helpers · 3212fa14
      Josef Bacik authored
      Now that all call sites are using the slot number to modify item values,
      rename the SETGET helpers to raw_item_*(), and then rework the _nr()
      helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
      rename all of the callers to the new helpers.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3212fa14
    • Josef Bacik's avatar
      btrfs: introduce item_nr token variant helpers · 74794207
      Josef Bacik authored
      The last remaining place where we have the pattern of
      
      	item = btrfs_item_nr(slot)
      	<do something with the item>
      
      are the token helpers.  Handle this by introducing token helpers that
      will do the btrfs_item_nr() work inside of the helper itself, and then
      convert all users of the btrfs_item token helpers to the new _nr()
      variants.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74794207
    • Josef Bacik's avatar
      btrfs: make btrfs_file_extent_inline_item_len take a slot · 437bd07e
      Josef Bacik authored
      Instead of getting the btrfs_item for this, simply pass in the slot of
      the item and then use the btrfs_item_size_nr() helper inside of
      btrfs_file_extent_inline_item_len().
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      437bd07e
    • Josef Bacik's avatar
      btrfs: add btrfs_set_item_*_nr() helpers · c91666b1
      Josef Bacik authored
      We have the pattern of
      
      	item = btrfs_item_nr(slot);
      	btrfs_set_item_*(leaf, item);
      
      in a bunch of places in our code.  Fix this by adding
      btrfs_set_item_*_nr() helpers which will do the appropriate work, and
      replace those calls with
      
      	btrfs_set_item_*_nr(leaf, slot);
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c91666b1
    • Josef Bacik's avatar
      btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhere · 227f3cd0
      Josef Bacik authored
      We have this pattern in a lot of places
      
      	item = btrfs_item_nr(slot);
      	btrfs_item_size(leaf, item);
      
      when we could simply use
      
      	btrfs_item_size(leaf, slot);
      
      Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
      _nr variation of the helpers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      227f3cd0
    • Filipe Manana's avatar
      btrfs: remove no longer needed logic for replaying directory deletes · ccae4a19
      Filipe Manana authored
      Now that we log only dir index keys when logging a directory, we no longer
      need to deal with dir item keys in the log replay code for replaying
      directory deletes. This is also true for the case when we replay a log
      tree created by a kernel that still logs dir items.
      
      So remove the remaining code of the replay of directory deletes algorithm
      that deals with dir item keys.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ccae4a19
    • Filipe Manana's avatar
      btrfs: only copy dir index keys when logging a directory · 339d0354
      Filipe Manana authored
      Currently, when logging a directory, we copy both dir items and dir index
      items from the fs/subvolume tree to the log tree. Both items have exactly
      the same data (same struct btrfs_dir_item), the difference lies in the key
      values, where a dir index key contains the index number of a directory
      entry while the dir item key does not, as it's used for doing fast lookups
      of an entry by name, while the former is used for sorting entries when
      listing a directory.
      
      We can exploit that and log only the dir index items, since they contain
      all the information needed to correctly add, replace and delete directory
      entries when replaying a log tree. Logging only the dir index items is
      also backward and forward compatible: an unpatched kernel (without this
      change) can correctly replay a log tree generated by a patched kernel
      (with this patch), and a patched kernel can correctly replay a log tree
      generated by an unpatched kernel.
      
      The backward compatibility is ensured because:
      
      1) For inserting a new dentry: a dentry is only inserted when we find a
         new dir index key - we can only insert if we know the dir index offset,
         which is encoded in the dir index key's offset;
      
      2) For deleting dentries: during log replay, before adding or replacing
         dentries, we first replay dentry deletions. Whenever we find a dir item
         key or a dir index key in the subvolume/fs tree that is not logged in
         a range for which the log tree is authoritative, we do the unlink of
         the dentry, which removes both the existing dir item key and the dir
         index key. Therefore logging just dir index keys is enough to ensure
         dentry deletions are correctly replayed;
      
      3) For dentry replacements: they work when we log only dir index keys
         and this is mostly due to a combination of 1) and 2). If we replace a
         dentry with name "foobar" to point from inode A to inode B, then we
         know the dir index key for the new dentry is different from the old
         one, as it has an index number (key offset) larger than the old one.
         This results in replaying a deletion, through replay_dir_deletes(),
         that causes the old dentry to be removed, both the dir item key and
         the dir index key, as mentioned at 2). Then when processing the new
         dir index key, we add the new dentry, adding both a new dir item key
         and a new index key pointing to inode B, as stated in 1).
      
      The forward compatibility, the ability for a patched kernel to replay a
      log created by an older, unpatched kernel, comes from the changes required
      for making sure we are able to replay a log that only contains dir index
      keys - we simply ignore every dir item key we find.
      
      So modify directory logging to log only dir index items, and modify the
      log replay process to ignore dir item keys, from log trees created by an
      unpatched kernel, and process only with dir index keys. This reduces the
      amount of logged metadata by about half, and therefore the time spent
      logging or fsyncing large directories (less CPU time and less IO).
      
      The following test script was used to measure this change:
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         NUM_NEW_FILES=1000000
         NUM_FILE_DELETES=10000
      
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         mkdir $MNT/testdir
      
         for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
                 echo -n > $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
         # sync to force transaction commit and wipeout the log.
         sync
      
         del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
         for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
                 rm -f $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
         echo
      
         umount $MNT
      
      The tests were run on a physical machine, with a non-debug kernel (Debian's
      default kernel config), for different values of $NUM_NEW_FILES and
      $NUM_FILE_DELETES, and the results were the following:
      
      ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 8412 ms after adding 1000000 files
      dir fsync took 500 ms after deleting 10000 files
      
      ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
      dir fsync took 269 ms after deleting 10000 files    (-46.2%)
      
      ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 745 ms after adding 100000 files
      dir fsync took 59 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 404 ms after adding 100000 files   (-45.8%)
      dir fsync took 31 ms after deleting 1000 files    (-47.5%)
      
      ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 67 ms after adding 10000 files
      dir fsync took 9 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 36 ms after adding 10000 files   (-46.3%)
      dir fsync took 5 ms after deleting 1000 files   (-44.4%)
      
      ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 9 ms after adding 1000 files
      dir fsync took 4 ms after deleting 100 files
      
      ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 7 ms after adding 1000 files     (-22.2%)
      dir fsync took 3 ms after deleting 100 files    (-25.0%)
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      339d0354
    • Nikolay Borisov's avatar
      btrfs: remove spurious unlock/lock of unused_bgs_lock · 17130a65
      Nikolay Borisov authored
      Since both unused block groups and reclaim bgs lists are protected by
      unused_bgs_lock then free them in the same critical section without
      doing an extra unlock/lock pair.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      17130a65
    • Filipe Manana's avatar
      btrfs: fix deadlock between quota enable and other quota operations · 232796df
      Filipe Manana authored
      When enabling quotas, we attempt to commit a transaction while holding the
      mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other
      quota operations such as:
      
      - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
      
      - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
      
      This is because these operations join a transaction and after that they
      attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
      after joining or starting a transaction is a pattern followed everywhere
      in qgroups, so the quota enablement operation is the one at fault here,
      and should not commit a transaction while holding that mutex.
      
      Fix this by making the transaction commit while not holding the mutex.
      We are safe from two concurrent tasks trying to enable quotas because
      we are serialized by the rw semaphore fs_info->subvol_sem at
      btrfs_ioctl_quota_ctl(), which is the only call site for enabling
      quotas.
      
      When this deadlock happens, it produces a trace like the following:
      
        INFO: task syz-executor:25604 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
        btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
        btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
        btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f86920b2c4d
        RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
        RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
        RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
        R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
        INFO: task syz-executor:25628 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
        __mutex_lock_common kernel/locking/mutex.c:669 [inline]
        __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
        btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
        btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
        btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reported-by: default avatarHao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
      Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
      CC: stable@vger.kernel.org # 4.19
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      232796df
    • Filipe Manana's avatar
      btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      Filipe Manana authored
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data blocks groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if it that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
      btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes
      the inneficiency of doing unnecessary data reservations.
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f0bfa76a
  2. 02 Jan, 2022 6 commits
    • Linus Torvalds's avatar
      Linux 5.16-rc8 · c9e6606c
      Linus Torvalds authored
      c9e6606c
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of... · 24a0b220
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix TUI exit screen refresh race condition in 'perf top'.
      
       - Fix parsing of Intel PT VM time correlation arguments.
      
       - Honour CPU filtering command line request of a script's switch events
         in 'perf script'.
      
       - Fix printing of switch events in Intel PT python script.
      
       - Fix duplicate alias events list printing in 'perf list', noticed on
         heterogeneous arm64 systems.
      
       - Fix return value of ids__new(), users expect NULL for failure, not
         ERR_PTR(-ENOMEM).
      
      * tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf top: Fix TUI exit screen refresh race condition
        perf pmu: Fix alias events list
        perf scripts python: intel-pt-events.py: Fix printing of switch events
        perf script: Fix CPU filtering of a script's switch events
        perf intel-pt: Fix parsing of VM time correlation arguments
        perf expr: Fix return value of ids__new()
      24a0b220
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 859431ac
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Better input validation for compat ioctls and a documentation bugfix
        for 5.16"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        Docs: Fixes link to I2C specification
        i2c: validate user data in compat ioctl
      859431ac
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1286cc48
      Linus Torvalds authored
      Pull x86 fix from Borislav Petkov:
      
       - Use the proper CONFIG symbol in a preprocessor check.
      
      * tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/build: Use the proper name CONFIG_FW_LOADER
      1286cc48
    • yaowenbin's avatar
      perf top: Fix TUI exit screen refresh race condition · 64f18d2d
      yaowenbin authored
      When the following command is executed several times, a coredump file is
      generated.
      
      	$ timeout -k 9 5 perf top -e task-clock
      	*******
      	*******
      	*******
      	0.01%  [kernel]                  [k] __do_softirq
      	0.01%  libpthread-2.28.so        [.] __pthread_mutex_lock
      	0.01%  [kernel]                  [k] __ll_sc_atomic64_sub_return
      	double free or corruption (!prev) perf top --sort comm,dso
      	timeout: the monitored command dumped core
      
      When we terminate "perf top" using sending signal method,
      SLsmg_reset_smg() called. SLsmg_reset_smg() resets the SLsmg screen
      management routines by freeing all memory allocated while it was active.
      
      However SLsmg_reinit_smg() maybe be called by another thread.
      
      SLsmg_reinit_smg() will free the same memory accessed by
      SLsmg_reset_smg(), thus it results in a double free.
      
      SLsmg_reinit_smg() is called already protected by ui__lock, so we fix
      the problem by adding pthread_mutex_trylock of ui__lock when calling
      SLsmg_reset_smg().
      Signed-off-by: default avatarWenyu Liu <liuwenyu7@huawei.com>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: wuxu.wu@huawei.com
      Link: http://lore.kernel.org/lkml/a91e3943-7ddc-f5c0-a7f5-360f073c20e6@huawei.comSigned-off-by: default avatarHewenliang <hewenliang4@huawei.com>
      Signed-off-by: default avataryaowenbin <yaowenbin1@huawei.com>
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      64f18d2d
    • John Garry's avatar
      perf pmu: Fix alias events list · e0257a01
      John Garry authored
      Commit 0e0ae874 ("perf list: Display hybrid PMU events with cpu
      type") changes the event list for uncore PMUs or arm64 heterogeneous CPU
      systems, such that duplicate aliases are incorrectly listed per PMU
      (which they should not be), like:
      
        # perf list
        ...
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        ...
      
      Notice how the events are listed twice.
      
      The named commit changed how we remove duplicate events, in that events
      for different PMUs are not treated as duplicates. I suppose this is to
      handle how "Each hybrid pmu event has been assigned with a pmu name".
      
      Fix PMU alias listing by restoring behaviour to remove duplicates for
      non-hybrid PMUs.
      
      Fixes: 0e0ae874 ("perf list: Display hybrid PMU events with cpu type")
      Signed-off-by: default avatarJohn Garry <john.garry@huawei.com>
      Tested-by: default avatarZhengjun Xing <zhengjun.xing@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/1640103090-140490-1-git-send-email-john.garry@huawei.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      e0257a01
  3. 01 Jan, 2022 1 commit
  4. 31 Dec, 2021 6 commits