1. 09 Dec, 2020 32 commits
    • btrfs: skip space_cache v1 setup when not using it · af456a2c
      Boris Burkov authored
      If we are not using space cache v1, we should not create the free space
      object or free space inodes. This comes up when we delete the existing
      free space objects/inodes when migrating to v2, only to see them get
      recreated for every dirtied block group.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove free space items when disabling space cache v1 · 36b216c8
      Boris Burkov authored
      When the filesystem transitions from space cache v1 to v2 or to
      nospace_cache, it removes the old cached data, but does not remove
      the FREE_SPACE items nor the free space inodes they point to. This
      doesn't cause any issues besides being a bit inefficient, since these
      items no longer do anything useful.
      
      To fix it, when we are mounting, and plan to disable the space cache,
      destroy each block group's free space item and free space inode.
      The code to remove the items is lifted from the existing use case of
      removing the block group, with a light adaptation to handle whether or
      not we have already looked up the free space inode.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: warn when remount will not change the free space tree · 2838d255
      Boris Burkov authored
      If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
      clear the free space tree. This can be surprising, so print a warning
      to dmesg to make the failure more visible. It is also important that
      the space cache options (SPACE_CACHE, FREE_SPACE_TREE) stay consistent,
      so set them to match the current on-disk state (which won't be
      changing).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
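A minimal sketch of the rule this commit message describes, assuming hypothetical helper names and plain boolean flags rather than the kernel's real remount state: only a ro->rw remount can create or clear the free space tree, so a requested change on any other transition is the case that deserves a warning.

```c
#include <stdbool.h>

/* Hypothetical helper: true only for a ro->rw remount, the one transition
 * on which the free space tree can be created or cleared. */
bool remount_can_change_free_space_tree(bool was_ro, bool will_be_ro)
{
    return was_ro && !will_be_ro;
}

/* Hypothetical helper: a change was requested but the transition cannot
 * honor it, which is when a warning would be printed. */
bool should_warn_on_remount(bool change_requested, bool was_ro, bool will_be_ro)
{
    return change_requested &&
           !remount_can_change_free_space_tree(was_ro, will_be_ro);
}
```

With these helpers, a rw->rw remount that asks for a different space cache would report a warning, while a ro->rw remount would not.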
    • btrfs: use superblock state to print space_cache mount option · 04c41559
      Boris Burkov authored
      To make the contents of /proc/mounts better match the actual state of
      the filesystem, base the display of the space cache mount options off
      the contents of the super block rather than the last mount options
      passed in. Since there are many scenarios where the mount will ignore a
      space cache option, simply showing the passed in option is misleading.
      
      For example, if we mount with -o remount,space_cache=v2 on a read-write
      file system without an existing free space tree, we won't build a free
      space tree, but /proc/mounts will read space_cache=v2 (until we mount
      again and it goes away).
      
      cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
      space_cache=v2, and if neither is the case, we print nospace_cache.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
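The mapping from super block state to the displayed option can be sketched as below. This is an illustration only: the struct, the field names, and the function are hypothetical, the returned strings reuse the labels from the commit message rather than the exact text the kernel prints, and checking v1 before v2 is an assumption for the case where both are set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the on-disk super block state. */
struct sb_state {
    uint64_t cache_generation;  /* set only while space cache v1 is in use */
    bool has_free_space_tree;   /* stands in for the FREE_SPACE_TREE bit */
};

/* Choose the option label from on-disk state, not from the last options
 * passed to mount. */
const char *space_cache_option(const struct sb_state *sb)
{
    if (sb->cache_generation > 0)
        return "space_cache=v1";
    if (sb->has_free_space_tree)
        return "space_cache=v2";
    return "nospace_cache";
}
```

Because the label depends only on what is on disk, a remount that ignores the requested option leaves the displayed value unchanged.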
    • btrfs: keep sb cache_generation consistent with space_cache · 94846229
      Boris Burkov authored
      When mounting, btrfs uses the cache_generation in the super block to
      determine if space cache v1 is in use. However, by mounting with
      nospace_cache or space_cache=v2, it is possible to disable space cache
      v1, which does not result in un-setting cache_generation back to 0.
      
      In order to base some logic, like mount option printing in /proc/mounts,
      on the current state of the space cache rather than just the values of
      the mount option, keep the value of cache_generation consistent with the
      status of space cache v1.
      
      We ensure that cache_generation > 0 iff the file system is using
      space_cache v1. This requires committing a transaction on any mount
      which changes whether we are using v1. (v1->nospace_cache, v1->v2,
      nospace_cache->v1, v2->v1).
      
      Since the mechanism for writing out the cache generation is transaction
      commit, but we want some finer grained control over when we un-set it,
      we can't just rely on the SPACE_CACHE mount option. Instead, introduce an
      fs_info flag that mount can use when it wants to unset the generation.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
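The four transitions listed in this commit message share one property: the mount changes whether v1 is in use. A small sketch of that test, with a hypothetical enum and function name that do not come from the kernel source:

```c
#include <stdbool.h>

/* Hypothetical modes mirroring nospace_cache, space_cache=v1 and v2. */
enum space_cache_mode { CACHE_NONE, CACHE_V1, CACHE_V2 };

/* True when the mount flips v1 usage, so cache_generation has to be
 * rewritten through a transaction commit to keep the invariant
 * "cache_generation > 0 iff v1 is in use". */
bool v1_usage_changes(enum space_cache_mode on_disk,
                      enum space_cache_mode requested)
{
    return (on_disk == CACHE_V1) != (requested == CACHE_V1);
}
```

Under this sketch, v1->nospace_cache, v1->v2, nospace_cache->v1 and v2->v1 all return true, while nospace_cache->v2 does not touch v1 and returns false.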
    • btrfs: clear free space tree on ro->rw remount · 8b228324
      Boris Burkov authored
      A user might want to revert to v1 or nospace_cache on a root filesystem,
      and much like turning on the free space tree, that can only be done by
      remounting from ro->rw. Support clearing the free space tree on such
      mounts by moving it into the shared remount logic.
      
      Since the CLEAR_CACHE option sticks around across remounts, this change
      would result in clearing the tree forever on every remount, which is
      not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
      clear at mount end, which has the other bonus of not cluttering the
      /proc/mounts output with clear_cache.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clear oneshot options on mount and remount · 8cd29088
      Boris Burkov authored
      Some options only apply during mount time and are cleared at the end
      of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
      fits the bill, and this is a preparation patch for also clearing that
      option.
      
      One subtlety is that the current code only resets USEBACKUPROOT on rw
      mounts, but the option is meaningfully "consumed" by a ro mount, so it
      feels appropriate to clear in that case as well. A subsequent read-write
      remount would not go through open_ctree, which is the only place that
      checks the option, so the change should be benign.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
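One way to picture "oneshot" options is as a bitmask that is stripped from the mount options once mount or remount has consumed them. The sketch below is hypothetical: the bit values and names are invented for illustration and are not the kernel's flag definitions, and it includes CLEAR_CACHE in the set only because the message says that option also fits the bill.

```c
#include <stdint.h>

/* Hypothetical option bits; the real kernel values are not shown here. */
#define OPT_USEBACKUPROOT (1u << 0)
#define OPT_CLEAR_CACHE   (1u << 1)
#define OPT_SPACE_CACHE   (1u << 2)

/* Options that only make sense for the mount that consumed them. */
#define ONESHOT_OPTS (OPT_USEBACKUPROOT | OPT_CLEAR_CACHE)

/* Drop the oneshot bits and keep every persistent option untouched. */
uint32_t clear_oneshot_options(uint32_t mount_opt)
{
    return mount_opt & ~ONESHOT_OPTS;
}
```

Calling this at the end of both rw and ro mounts matches the message's point that a ro mount also consumes the option.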
    • btrfs: create free space tree on ro->rw remount · 5011139a
      Boris Burkov authored
      When a user attempts to remount a btrfs filesystem with
      'mount -o remount,space_cache=v2', that operation silently succeeds.
      Unfortunately, this is misleading, because the remount does not create
      the free space tree. /proc/mounts will incorrectly show space_cache=v2,
      but on the next mount, the file system will revert to the old
      space_cache.
      
      For now, we handle only the easier case, where the existing mount is
      read-only and the new mount is read-write. In that case, we can create
      the free space tree without contending with the block groups changing
      as we go.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only mark bg->needs_free_space if free space tree is on · 997e3e2e
      Boris Burkov authored
      If we attempt to create a free space tree while any block groups have
      needs_free_space set, we will double add the new free space item
      and hit EEXIST. Previously, we only created the free space tree on a new
      mount, so we never hit the case, but if we try to create it on a
      remount, such block groups could exist and trip us up.
      
      We don't do anything with this field unless the free space tree is
      enabled, so there is no harm in not setting it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
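A rough sketch of the gating this commit describes, using a hypothetical pared-down struct and function name: the flag is set on a new block group only while the free space tree is enabled, so creating the tree later on a remount finds no block group that would add its free space item a second time.

```c
#include <stdbool.h>

/* Hypothetical, pared-down block group carrying only the relevant field. */
struct block_group {
    bool needs_free_space;
};

/* Flag a new block group only when the free space tree is enabled;
 * otherwise leave the field alone, since nothing reads it. */
void mark_new_block_group(struct block_group *bg, bool free_space_tree_enabled)
{
    if (free_space_tree_enabled)
        bg->needs_free_space = true;
}
```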
    • btrfs: start orphan cleanup on ro->rw remount · 8f1c21d7
      Boris Burkov authored
      When we mount a rw filesystem, we start the orphan cleanup process in
      tree root and filesystem tree. However, when we remount a ro file system
      rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
      on tree_root and fs_root to the shared rw mount routine to effectively
      add them on ro->rw remount.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lift read-write mount setup from mount and remount · 44c0ca21
      Boris Burkov authored
      Mounting rw and remounting from ro to rw naturally share invariants and
      functionality which result in a correctly set up rw filesystem. Luckily,
      there is even a strong unity in the code which implements them. In
      mount's open_ctree, these operations mostly happen after an early return
      for ro file systems, and in remount, they happen in a section devoted to
      remounting ro->rw, after some remount specific validation passes.
      
      However, there are unfortunately a few differences. There are small
      deviations in the order of some of the operations, remount does not
      start orphan cleanup in root_tree or fs_tree, remount does not create
      the free space tree, and remount does not handle "one-shot" mount
      options like clear_cache and uuid tree rescan.
      
      Since we want to add building the free space tree to remount, and also
      to start the same orphan cleanup process on a filesystem mounted as ro
      then remounted rw, we would benefit from unifying the logic between the
      two code paths.
      
      This patch only lifts the existing common functionality, and leaves a
      natural path for fixing the discrepancies.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not block inode logging for so long during transaction commit · 47876f7c
      Filipe Manana authored
      Early on during a transaction commit we acquire the tree_log_mutex and
      hold it until after we write the super blocks. But before writing the
      extent buffers dirtied by the transaction and the super blocks we unblock
      the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
      fs_info->running_transaction to NULL.
      
      This means that after that and before writing the super blocks, new
      transactions can start. However if any transaction wants to log an inode,
      it will block waiting for the transaction commit to write its dirty
      extent buffers and the super blocks because the tree_log_mutex is only
      released after those operations are complete, and starting a new log
      transaction blocks on that mutex (at start_log_trans()).
      
      Writing the dirty extent buffers and the super blocks can take a very
      significant amount of time to complete, but we could allow the tasks
      wanting to log an inode to proceed with most of their steps:
      
      1) create the log trees
      2) log metadata in the trees
      3) write their dirty extent buffers
      
      They only need to wait for the previous transaction commit to complete
      (write its super blocks) before they attempt to write their super blocks,
      otherwise we could end up with a corrupt filesystem after a crash.
      
      So change start_log_trans() to use the root tree's log_mutex to serialize
      for the creation of the log root tree instead of using the tree_log_mutex,
      and make btrfs_sync_log() acquire the tree_log_mutex before writing the
      super blocks. This allows for inode logging to wait much less time when
      there is a previous transaction that is still committing, often not having
      to wait at all, as by the time when we try to sync the log the previous
      transaction already wrote its super blocks.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      The following script that uses dbench was used to measure the impact of
      the whole patchset:
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      The test was run on a machine with 12 cores, 64G of RAM, using an NVMe
      device and a non-debug kernel configuration (Debian's default).
      
      Before patch set:
      
       Operation       Count    AvgLat    MaxLat
       -----------------------------------------
       NTCreateX    11277211     0.250    85.340
       Close         8283172     0.002     6.479
       Rename         477515     1.935    86.026
       Unlink        2277936     0.770    87.071
       Deltree           256    15.732    81.379
       Mkdir             128     0.003     0.009
       Qpathinfo    10221180     0.056    44.404
       Qfileinfo     1789967     0.002     4.066
       Qfsinfo       1874399     0.003     9.176
       Sfileinfo      918589     0.061    10.247
       Find          3951758     0.341    54.040
       WriteX        5616547     0.047    85.079
       ReadX        17676028     0.005     9.704
       LockX           36704     0.003     1.800
       UnlockX         36704     0.002     0.687
       Flush          790541    14.115   676.236
      
      Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
      
      After patch set:
      
       Operation       Count    AvgLat    MaxLat
       -----------------------------------------
       NTCreateX    12687926     0.171    86.526
       Close         9320780     0.002     8.063
       Rename         537253     1.444    78.576
       Unlink        2561827     0.559    87.228
       Deltree           374    11.499    73.549
       Mkdir             187     0.003     0.005
       Qpathinfo    11500300     0.061    36.801
       Qfileinfo     2017118     0.002     7.189
       Qfsinfo       2108641     0.003     4.825
       Sfileinfo     1033574     0.008     8.065
       Find          4446553     0.408    47.835
       WriteX        6335667     0.045    84.388
       ReadX        19887312     0.003     9.215
       LockX           41312     0.003     1.394
       UnlockX         41312     0.002     1.425
       Flush          889233    13.014   623.259
      
      Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
      
      +12.7% throughput, -8.2% max latency
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
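The summary line "+12.7% throughput, -8.2% max latency" appears to match a symmetric percentage difference (the change divided by the mean of the two values) rather than a change relative to the "before" run, which would give roughly +13.6% and -7.8%. That reading is an inference from the reported numbers, not something the commit message states, and the function name below is hypothetical.

```c
/* Symmetric percentage difference: change relative to the mean of both
 * values. This is an assumed formula that happens to reproduce the
 * figures quoted in the commit message. */
double symmetric_pct_change(double before, double after)
{
    return 100.0 * (after - before) / ((before + after) / 2.0);
}
```

Feeding it the throughput pair (1179.19, 1339.32) and the max latency pair (676.240, 623.265) from the dbench output gives values that round to the quoted figures.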
    • btrfs: fix race leading to unnecessary transaction commit when logging inode · 639bd575
      Filipe Manana authored
      When logging an inode we may often have to fall back to a full transaction
      commit, either because a new block group was allocated, there is some case
      we cannot deal with without a transaction commit, or some error like an
      ENOMEM happened. However, after we fall back to a transaction commit, we
      have a time window where the next attempt to log any inode can commit
      the next transaction unnecessarily, adding overhead and increasing
      latency.
      
      A sequence of steps that leads to this issue is the following:
      
      1) The current open transaction has a generation of 1000;
      
      2) A new block group is allocated, and as a consequence we must make sure
         any attempts to commit a log fall back to a transaction commit, so
         btrfs_set_log_full_commit() is called from btrfs_make_block_group().
         This sets fs_info->last_trans_log_full_commit to 1000;
      
      3) Task A is holding a handle on transaction 1000 and tries to log inode X.
         Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
         which returns true, since fs_info->last_trans_log_full_commit has a
         value of 1000. So we end up returning EAGAIN and propagating it up to
         btrfs_sync_file(), where we commit transaction 1000;
      
      4) The transaction commit task (task A) sets the transaction state to
         unblocked (TRANS_STATE_UNBLOCKED);
      
      5) Some other task, task B, starts a new transaction with a generation of
         1001;
      
      6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
      
      7) Transaction 1000 has not fully committed yet, we are still writing all
         the extent buffers it created;
      
      8) Some new task, task C, starts an fsync of inode Y, gets a handle for
         transaction 1001, and it gets to btrfs_log_inode_parent() which does
         the following check:
      
           if (fs_info->last_trans_log_full_commit > last_committed) {
               ret = 1;
               goto end_no_trans;
           }
      
         At that point last_trans_log_full_commit has a value of 1000 and
         last_committed (value of fs_info->last_trans_committed) has a value of
         999, since transaction 1000 has not yet committed - it is either still
         writing out dirty extent buffers, its super blocks or unpinning
         extents.
      
         As a consequence we return 1, which gets propagated up to
         btrfs_sync_file(), which will then call btrfs_commit_transaction()
         for transaction 1001.
      
         The result is an unnecessary second transaction commit: we
         previously committed transaction 1000 and now commit transaction 1001
         as well, resulting in more overhead and increased latency.
      
      So fix this double transaction commit issue simply by removing that check,
      because all we need to do is wait for the previous transaction to finish
      its commit, which we already do later when starting the log transaction at
      start_log_trans(), because there we acquire the tree_log_mutex lock, which
      is held by a transaction commit and only released after the transaction
      commits its super blocks.
      
      Another issue that check has is that it reads last_trans_log_full_commit
      without using READ_ONCE(), which is incorrect since that member of
      struct btrfs_fs_info is always updated with WRITE_ONCE() through the
      helper btrfs_set_log_full_commit().
      
      This double transaction commit issue can actually be triggered quite often
      in long runs of dbench, since besides the creation of new block groups
      that force inode logging to fall back to a transaction commit, there are
      cases where dbench asks to fsync a directory which had files in it that
      were previously renamed or subdirectories that were removed, causing
      inode logging to fall back to a full transaction commit.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
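The window described in step 8 can be replayed with plain integers. The sketch below is a simplified model, with a hypothetical struct and function name, of the check that the commit removes: it keeps answering "commit the transaction" until last_trans_committed catches up, even though transaction 1000 is already being committed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fs_info counters in the example. */
struct fs_state {
    uint64_t last_trans_log_full_commit;
    uint64_t last_trans_committed;
};

/* Model of the removed check: true means the fsync falls back to a
 * full transaction commit. */
bool old_check_forces_commit(const struct fs_state *fs)
{
    return fs->last_trans_log_full_commit > fs->last_trans_committed;
}
```

With last_trans_log_full_commit at 1000 and last_trans_committed still at 999 the model returns true, which corresponds to the unnecessary commit of transaction 1001; once last_trans_committed reaches 1000 it returns false.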
    • btrfs: fix race that makes inode logging fallback to transaction commit · 47d3db41
      Filipe Manana authored
      When logging an inode while the previous transaction is still committing,
      we have a time window where we can end up incorrectly thinking an inode has
      its last_unlink_trans field set to a value greater than the last committed
      transaction, which makes the logging fall back to a full transaction
      commit, which is usually much more expensive than doing a log commit.
      
      The race is described by the following steps:
      
      1) We are at transaction 1000;
      
      2) We modify an inode X (a directory) using transaction 1000 and set its
         last_unlink_trans field to 1000, because for example we removed one
         of its subdirectories;
      
      3) We create a new inode Y with a dentry in inode X using transaction 1000,
         so its generation field is set to 1000;
      
      4) The commit for transaction 1000 is started by task A;
      
      5) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      6) Some task starts a new transaction with a generation of 1001;
      
      7) We do some modification to inode Y (using transaction 1001);
      
      8) The transaction 1000 commit starts unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      9) Task B starts an fsync on inode Y, and gets a handle for transaction
         1001. When it gets to check_parent_dirs_for_sync() it does the checking
         of the ancestor dentries because the following check does not evaluate
         to true:
      
             if (S_ISREG(inode->vfs_inode.i_mode) &&
                 inode->generation <= last_committed &&
                 inode->last_unlink_trans <= last_committed)
                     goto out;
      
         The generation value for inode Y is 1000 and last_committed, which has
         the value read from fs_info->last_trans_committed, has a value of 999,
         so that check evaluates to false and we proceed to check the ancestor
         inodes.
      
         Once we get to the first ancestor, inode X, we call
         btrfs_must_commit_transaction() on it, which evaluates to true:
      
         static bool btrfs_must_commit_transaction(...)
         {
             struct btrfs_fs_info *fs_info = inode->root->fs_info;
             bool ret = false;
      
             mutex_lock(&inode->log_mutex);
             if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                 /*
                  * Make sure any commits to the log are forced to be full
                  * commits.
                  */
                 btrfs_set_log_full_commit(trans);
                 ret = true;
             }
          (...)
      
          because inode X's last_unlink_trans has a value of 1000 and
          fs_info->last_trans_committed still has a value of 999, it returns
          true to check_parent_dirs_for_sync(), making it return 1 which is
          propagated up to btrfs_sync_file(), causing it to fall back to a full
          transaction commit of transaction 1001.
      
          We should not have fallen back to commit transaction 1001, since inode
          X had last_unlink_trans set to 1000 and the super blocks for
          transaction 1000 were already written. So while not resulting in a
          functional problem, it leads to a lot more work and higher latencies
          for a fsync since committing a transaction is usually more expensive
          than committing a log (if other filesystem changes happened under that
          transaction).
      
      A similar problem happens when logging directories, for the same reason:
      btrfs_must_commit_transaction() returns true on an inode whose
      last_unlink_trans has the generation of the previous transaction while
      that transaction is still committing, unpinning its freed extents.
      
      So fix this by comparing last_unlink_trans with the id of the current
      transaction instead of fs_info->last_trans_committed.
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename and rmdir operations (both update the field
      last_unlink_trans of an inode) and fsyncs of files and directories.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
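The difference between the two comparisons can be sketched with the numbers from the example. Both function names are hypothetical, and the use of >= in the second one is an assumption: the message only says that last_unlink_trans is compared with the id of the current transaction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Comparison quoted in the message: against fs_info->last_trans_committed. */
bool must_commit_old(uint64_t last_unlink_trans, uint64_t last_trans_committed)
{
    return last_unlink_trans > last_trans_committed;
}

/* Assumed shape of the fix: against the id of the current transaction. */
bool must_commit_new(uint64_t last_unlink_trans, uint64_t transid)
{
    return last_unlink_trans >= transid;
}
```

For inode X, last_unlink_trans is 1000. The old comparison against 999 demands a full commit, while the comparison against the current transaction id 1001 does not.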
    • btrfs: fix race that causes unnecessary logging of ancestor inodes · 4d6221d7
      Filipe Manana authored
      When logging an inode and we are checking if we need to log ancestors that
      are new, if the previous transaction is still committing we have a time
      window where we can unnecessarily log ancestor inodes that were created in
      the previous transaction.
      
      The race is described by the following steps:
      
      1) We are at transaction 1000;
      
      2) Directory inode X is created, its generation is set to 1000;
      
      3) The commit for transaction 1000 is started by task A;
      
      4) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      5) Inode Y, a regular file, is created under directory inode X, this
         results in starting a new transaction with a generation of 1001;
      
      6) The transaction 1000 commit is unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
      
      8) Task B ends up at log_all_new_ancestors() and then because inode Y has
         only one hard link, ends up at log_new_ancestors_fast(). There it reads
         a value of 999 from fs_info->last_trans_committed, and sees that the
         parent inode X has a generation of 1000, so we end up logging inode X:
      
           if (inode->generation > fs_info->last_trans_committed) {
               ret = btrfs_log_inode(trans, root, inode,
                                     LOG_INODE_EXISTS, ctx);
               (...)
      
         which is not necessary since it was created in the past transaction,
         with a generation of 1000, and that transaction has already committed
         its super blocks - it's still unpinning extents so it has not yet
         updated fs_info->last_trans_committed from 999 to 1000.
      
         So this just causes us to spend more time logging and allocating and
         writing more tree blocks for the log tree.
      
      So fix this by comparing an inode's generation with the generation of the
      transaction our transaction handle refers to - if the inode's generation
      matches the generation of the current transaction then we know it is a
      new inode we need to log, otherwise don't log it.
      
      This case is often hit when running dbench for a long enough duration.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
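A small model of the before and after decisions, using hypothetical function names and the values from the steps above. The old test follows the quoted snippet; the new test follows the rule stated in the fix, that an ancestor is logged only when its generation matches the generation of the current transaction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old decision: log the ancestor if it is newer than last_trans_committed. */
bool should_log_ancestor_old(uint64_t inode_generation,
                             uint64_t last_trans_committed)
{
    return inode_generation > last_trans_committed;
}

/* New decision: log the ancestor only if it was created in the
 * transaction our handle refers to. */
bool should_log_ancestor_new(uint64_t inode_generation, uint64_t transid)
{
    return inode_generation == transid;
}
```

Directory X has generation 1000. Against last_trans_committed at 999 the old test logs it; against the handle's transaction 1001 the new test skips it.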
    • btrfs: fix race that results in logging old extents during a fast fsync · 5f96bfb7
      Filipe Manana authored
      When logging the extents of an inode during a fast fsync, we have a time
      window where we can log extents that are from the previous transaction and
      already persisted. This only makes us waste time unnecessarily.
      
      The following sequence of steps shows how this can happen:
      
      1) We are at transaction 1000;
      
      2) An ordered extent E from inode I completes, that is it has gone through
         btrfs_finish_ordered_io(), and it set the extent maps' generation to
         1000 when we unpin the extent, which is the generation of the current
         transaction;
      
      3) The commit for transaction 1000 is started by task A;
      
      4) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      5) Some change is made to inode I, resulting in creation of a new
         transaction with a generation of 1001;
      
      6) The transaction 1000 commit starts unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      7) Task B starts an fsync on inode I, and when it gets to
         btrfs_log_changed_extents() sees the extent map for extent E in the
         list of modified extents. It sees the extent map has a generation of
         1000 and fs_info->last_trans_committed has a value of 999, so it
         proceeds to logging the respective file extent item and all the
         checksums covering its range.
      
         So we end up wasting time since the extent was already persisted and
         is reachable through the trees pointed to by the super block committed
         by transaction 1000.
      
      So just fix this by comparing the extent map's generation against the
      generation of the transaction handle: if it is smaller than the id in
      the handle, we know the extent was already persisted and we do not need
      to log it.
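
      The difference between the two checks can be sketched as follows. This
      is a minimal illustration, not the kernel code: the struct layouts and
      helper names are hypothetical, and only the idea of comparing the extent
      map's generation with the transaction handle's id comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct extent_map_sketch {
	uint64_t generation;	/* transaction that last modified the extent */
};

struct trans_handle_sketch {
	uint64_t transid;	/* id of the transaction this handle belongs to */
};

/* Old check: compare against the last fully committed transaction. */
static bool needs_logging_old(const struct extent_map_sketch *em,
			      uint64_t last_trans_committed)
{
	return em->generation > last_trans_committed;
}

/* New check: anything older than the handle's transaction is persisted. */
static bool needs_logging_new(const struct extent_map_sketch *em,
			      const struct trans_handle_sketch *trans)
{
	return em->generation >= trans->transid;
}
```

      With the values from the steps above (extent map generation 1000,
      last_trans_committed 999, handle in transaction 1001), the old check
      decides to log the extent and the new one skips it.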
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f96bfb7
    • Filipe Manana's avatar
      btrfs: fix race causing unnecessary inode logging during link and rename · de53d892
      Filipe Manana authored
      When we are doing a rename or a link operation for an inode that was logged
      in the previous transaction and that transaction is still committing, we
      have a time window where we incorrectly consider that the inode was logged
      previously in the current transaction and therefore decide to log it to
      update it in the log. The following steps give an example of how this
      happens during a link operation:
      
      1) Inode X is logged in transaction 1000, so its logged_trans field is set
         to 1000;
      
      2) Task A starts to commit transaction 1000;
      
      3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
      
      4) Task B starts a link operation for inode X, and as a consequence it
         starts transaction 1001;
      
      5) Task A is still committing transaction 1000, therefore the value stored
         at fs_info->last_trans_committed is still 999;
      
      6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
         fs_info->last_trans_committed and because the logged_trans field of
         inode X has a value of 1000, the function does not return immediately,
         instead it proceeds to logging the inode, which should not happen
         because the inode was logged in the previous transaction (1000) and
         not in the current one (1001).
      
      This is not a functional problem, just wasted time and space logging an
      inode that does not need to be logged, contributing to higher latency
      for link and rename operations.
      
      So fix this by comparing the inode's logged_trans field with the
      generation of the current transaction instead of comparing with the value
      stored in fs_info->last_trans_committed.
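
      A small sketch of the two comparisons, using plain integers instead of
      the kernel structures. The function names are hypothetical and the exact
      comparison operators are an assumption; the text only states which two
      values are compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old check: "was the inode logged after the last committed transaction?" */
static bool log_new_name_old(uint64_t logged_trans,
			     uint64_t last_trans_committed)
{
	return logged_trans > last_trans_committed;
}

/*
 * New check: "was the inode logged in the transaction we are running in?"
 * logged_trans cannot exceed the current transaction id, so >= behaves
 * like an equality test here.
 */
static bool log_new_name_new(uint64_t logged_trans, uint64_t cur_transid)
{
	return logged_trans >= cur_transid;
}
```

      For the link example above (logged_trans 1000, last_trans_committed
      999, current transaction 1001), the old check returns true and the
      inode is logged needlessly, while the new check returns false.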
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename operations.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de53d892
    • David Sterba's avatar
      btrfs: remove recalc_thresholds from free space ops · fa598b06
      David Sterba authored
      After removing the inode number cache that was using the free space
      cache code, we can remove at least the recalc_thresholds callback from
      the ops. Both code and tests use the same callback function. It's moved
      before its first use.
      
      The use_bitmaps callback is still needed by tests to create some
      extents/bitmap setup.
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa598b06
    • Nikolay Borisov's avatar
      btrfs: always set NODATASUM/NODATACOW in __create_free_space_inode · f0d1219d
      Nikolay Borisov authored
      Since it's being used solely for the freespace cache, unconditionally
      set the flags required for it.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f0d1219d
    • Nikolay Borisov's avatar
      btrfs: remove crc_check logic from free space · 7dbdb443
      Nikolay Borisov authored
      Following removal of the ino cache, io_ctl_init will be called only on
      behalf of the freespace inode. In this case we always want to check
      CRCs, so the conditional code that depended on io_ctl::check_crc can be
      removed.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7dbdb443
    • Nikolay Borisov's avatar
      btrfs: remove inode number cache feature · 5297199a
      Nikolay Borisov authored
      It's been deprecated since commit b547a88e ("btrfs: start
      deprecation of mount option inode_cache") which enumerates the reasons.
      
      A filesystem that uses the feature (mount -o inode_cache) tracks the
      inode numbers in bitmaps; that data stays on the filesystem after this
      patch. The size is roughly 5MiB for 1M inodes [1], which is considered
      small enough to be left there. Removal of the leftover data can be
      implemented in btrfs-progs if needed.
      
      [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      5297199a
    • Nikolay Borisov's avatar
      btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid · abadc1fc
      Nikolay Borisov authored
      The former is going away as part of the inode map removal so switch
      callers to btrfs_find_free_objectid. No functional changes since with
      INODE_MAP disabled (default) find_free_objectid was called anyway.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      abadc1fc
    • Nikolay Borisov's avatar
      btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c · ec7d6dfd
      Nikolay Borisov authored
      Those functions are going to be used even after the inode cache is
      removed, so move them to a more appropriate place.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec7d6dfd
    • David Sterba's avatar
      btrfs: drop casts of bio bi_sector · 1201b58b
      David Sterba authored
      Since commit 72deb455 ("block: remove CONFIG_LBDAF") (5.2) the
      sector_t type is u64 on all arches and configs, so we don't need to
      typecast it.  It used to be unsigned long and the results of sector size
      shifts were not guaranteed to fit in the type.
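
      To illustrate why the casts used to matter, here is a small user-space
      sketch. It models the old 32-bit sector type with uint32_t, which is an
      assumption standing in for unsigned long on a 32-bit configuration; the
      function names are hypothetical.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Sector type only 32 bits wide: the shift result wraps around. */
static uint32_t sector_to_bytes_32(uint32_t sector)
{
	return sector << SECTOR_SHIFT;	/* high bits are silently lost */
}

/* Same narrow type, but widened with a cast before shifting. */
static uint64_t sector_to_bytes_cast(uint32_t sector)
{
	return (uint64_t)sector << SECTOR_SHIFT;
}

/* Sector type already 64 bits wide: no cast is needed. */
static uint64_t sector_to_bytes_64(uint64_t sector)
{
	return sector << SECTOR_SHIFT;
}
```

      For sector 0x00800000 (byte offset 4GiB) the first helper wraps to 0,
      while the other two return 4294967296.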
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1201b58b
    • Naohiro Aota's avatar
      btrfs: implement log-structured superblock for ZONED mode · 12659251
      Naohiro Aota authored
      The superblock (and its copies) is the only data structure in btrfs
      which has a fixed location on a device. Since we cannot overwrite data
      in a sequential write required zone, we cannot place the superblock in
      such a zone. One easy solution is to limit the superblock and its
      copies to conventional zones.  However, this method has two downsides.
      The first is a reduced number of superblock copies: the second copy of
      the superblock is located at 256GB, which is in a sequential write
      required zone on typical devices in the market today, so the number of
      superblocks (primary plus copies) is limited to two.  The second
      downside is that we cannot support devices which have no conventional
      zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, it starts writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, it resets the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
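
      The position lookup described above can be sketched as follows. This is
      an illustration under stated assumptions, not the kernel
      implementation: the types and names are hypothetical, and "older" is
      assumed to be decided by comparing the generation stored in the last
      superblock of each zone.

```c
#include <stdint.h>

enum zone_state { ZONE_EMPTY, ZONE_PARTIAL, ZONE_FULL };

struct sb_zone {
	enum zone_state state;
	uint64_t start;		/* first byte of the zone */
	uint64_t wp;		/* write pointer: next writable byte */
	uint64_t last_gen;	/* generation of the last superblock in the zone */
};

/*
 * Return the byte offset where the next superblock should be written,
 * or -1 if the two zones are in an unexpected combination of states.
 */
static int64_t next_sb_offset(const struct sb_zone z[2])
{
	if (z[0].state == ZONE_EMPTY && z[1].state == ZONE_EMPTY)
		return (int64_t)z[0].start;	/* nothing written yet */

	if (z[0].state == ZONE_FULL && z[1].state == ZONE_FULL) {
		/* Both full: reuse the zone holding the older superblock. */
		if (z[0].last_gen > z[1].last_gen)
			return (int64_t)z[1].start;
		return (int64_t)z[0].start;
	}

	if (z[0].state != ZONE_FULL &&
	    (z[1].state == ZONE_EMPTY || z[1].state == ZONE_FULL))
		return (int64_t)z[0].wp;	/* still filling the first zone */

	if (z[0].state == ZONE_FULL)
		return (int64_t)z[1].wp;	/* first zone full, fill the second */

	return -1;
}
```

      The latest superblock is the one just before the returned offset. In
      the both-full case the chosen zone has to be reset before it is written
      again.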
      
      The following zones are reserved as the circular buffer on ZONED btrfs:
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zone 1024 or the zone at 256GB, whichever is
        smaller, and the zone next to it
      
      If these reserved zones are conventional, the superblock is written at
      a fixed location at the start of the zone, without logging.
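
      The zone numbers listed above could be computed along these lines. The
      helper name and the macro are hypothetical, and the sketch assumes a
      power-of-two zone size expressed as a shift.

```c
#include <stdint.h>

#define SB_COPY2_OFFSET (256ULL << 30)	/* 256GB: location of the second copy */

/*
 * First of the two zones reserved for superblock copy `mirror`
 * (0 = primary, 1 = first copy, 2 = second copy).
 * `zone_size_shift` is log2 of the zone size in bytes.
 */
static uint32_t sb_first_zone(unsigned int zone_size_shift, int mirror)
{
	uint64_t zone_at_256g;

	switch (mirror) {
	case 0:
		return 0;
	case 1:
		return 16;
	case 2:
		zone_at_256g = SB_COPY2_OFFSET >> zone_size_shift;
		return zone_at_256g < 1024 ? (uint32_t)zone_at_256g : 1024;
	default:
		return 0;	/* unknown mirror: not handled in this sketch */
	}
}
```

      With 1GiB zones the second copy starts at zone 256, and with 128MiB
      zones it is capped at zone 1024. The second zone of each pair is the
      returned zone number plus one.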
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      12659251
    • Naohiro Aota's avatar
      btrfs: disallow mixed-bg in ZONED mode · a589dde0
      Naohiro Aota authored
      Placing both data and metadata in a block group is impossible in ZONED
      mode. For data, we can allocate a space for it and write it immediately
      after the allocation. For metadata, however, we cannot do that, because
      the logical addresses are recorded in other metadata buffers to build up
      the trees. As a result, a data buffer can be placed after a metadata
      buffer, which is not written yet. Writing out the data buffer will break
      the sequential write rule.
      
      Check and disallow MIXED_BG with ZONED mode.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a589dde0
    • Naohiro Aota's avatar
      btrfs: disable fallocate in ZONED mode · f1569c4c
      Naohiro Aota authored
      fallocate() is implemented by reserving actual extents instead of
      making space reservations. This can result in exposing the sequential
      write constraint of host-managed zoned block devices to the application,
      which would break the POSIX semantics of the fallocated file.  To avoid
      this, report fallocate() as not supported when in ZONED mode for now.
      
      In the future, we may be able to implement "in-memory" fallocate() in
      ZONED mode by utilizing space_info->bytes_may_use or similar, so this
      returns EOPNOTSUPP.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f1569c4c
    • Naohiro Aota's avatar
      btrfs: disallow NODATACOW in ZONED mode · d206e9c9
      Naohiro Aota authored
      NODATACOW implies overwriting the file data on a device, which is
      impossible in sequential write required zones. Disable NODATACOW both
      globally (the mount option) and per file (the NODATACOW attribute, by
      masking FS_NOCOW_FL).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d206e9c9
    • Naohiro Aota's avatar
      btrfs: disallow space_cache in ZONED mode · 5d1ab66c
      Naohiro Aota authored
      As updates to the space cache v1 are in-place, the space cache cannot be
      located over sequential zones and there are no guarantees that the device
      will have enough conventional zones to store this cache. Resolve this
      problem by completely disabling the space cache v1.  This does not
      introduce any problems with sequential block groups: all the free space
      is located after the allocation pointer and there is no free space
      before the pointer, so there is no need for such a cache.
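
      The reasoning about sequential block groups can be made concrete with a
      short sketch: when allocation only ever advances a pointer, free space
      follows directly from that pointer. The struct and helper names are
      hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct seq_block_group {
	uint64_t start;		/* logical start of the block group */
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* allocation pointer, relative to start */
};

/* Everything after the allocation pointer is free; nothing before it is. */
static uint64_t seq_bg_free_bytes(const struct seq_block_group *bg)
{
	return bg->length - bg->alloc_offset;
}

/* True if the logical address `addr` lies in the free part of the group. */
static bool seq_bg_is_free(const struct seq_block_group *bg, uint64_t addr)
{
	return addr >= bg->start + bg->alloc_offset &&
	       addr < bg->start + bg->length;
}
```

      No per-extent bookkeeping is involved, which is why a separate free
      space cache adds nothing for such block groups.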
      
      Note: we can technically use free-space-tree (space cache v2) in ZONED
      mode. But, since ZONED mode now always allocates extents in a block
      group sequentially regardless of underlying device zone type, there is
      no use in enabling and maintaining the tree.
      
      For the same reason, NODATACOW is also disabled.
      
      In summary, ZONED will disable:
      
      | Disabled features | Reason                                              |
      |-------------------+-----------------------------------------------------|
      | RAID/DUP          | Cannot handle two zone append writes to different   |
      |                   | zones                                               |
      |-------------------+-----------------------------------------------------|
      | space_cache (v1)  | In-place updating                                   |
      | NODATACOW         | In-place updating                                   |
      |-------------------+-----------------------------------------------------|
      | fallocate         | Reserved extent will be a write hole                |
      |-------------------+-----------------------------------------------------|
      | MIXED_BG          | Allocated metadata region will be write holes for   |
      |                   | data writes                                         |
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5d1ab66c
    • Naohiro Aota's avatar
      btrfs: introduce max_zone_append_size · 862931c7
      Naohiro Aota authored
      The zone append write command has a restriction on the maximum IO size
      it accepts. This is because a zone append write command cannot be split,
      as we ask the device to place the data into a specific target zone and
      the device responds with the actual written location of the data.
      
      Introduce max_zone_append_size to zone_info and fs_info to track the
      value, so we can limit all I/O to a zoned block device that we want to
      write using the zone append command to the device's limits.
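
      As a rough illustration of what limiting I/O to the device's limit
      means, the sketch below cuts a write into pieces that each fit into one
      zone append command. The helper names are hypothetical and the real
      code operates on bios rather than plain byte counts.

```c
#include <stdint.h>

/* Largest chunk of `remaining` bytes that fits in a single zone append. */
static uint64_t next_append_len(uint64_t remaining,
				uint64_t max_zone_append_size)
{
	return remaining < max_zone_append_size ? remaining
						: max_zone_append_size;
}

/* Number of zone append commands needed to write `len` bytes. */
static uint64_t append_cmds_needed(uint64_t len,
				   uint64_t max_zone_append_size)
{
	uint64_t cmds = 0;

	if (max_zone_append_size == 0)
		return 0;	/* no limit known: not handled in this sketch */

	while (len > 0) {
		len -= next_append_len(len, max_zone_append_size);
		cmds++;
	}
	return cmds;
}
```

      A 1MiB write against a 256KiB limit needs four commands, each of which
      the device may place at a different location inside the target zone.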
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      862931c7
    • Naohiro Aota's avatar
      btrfs: check and enable ZONED mode · b70f5097
      Naohiro Aota authored
      Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
      enabled on the file system and if the file system consists of zoned
      devices with equal zone size.
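
      A simplified sketch of such a check is shown below. It is not
      btrfs_check_zoned_mode() itself: the function name, the representation
      of devices as an array of zone sizes, and the choice of -EINVAL as the
      error are assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * zone_sizes[i] is the zone size of device i in bytes, or 0 for a regular
 * (non-zoned) device. Returns 0 if the set of devices is acceptable.
 */
static int check_zoned_mode_sketch(bool zoned_flag,
				   const uint64_t *zone_sizes, size_t ndev)
{
	uint64_t zone_size = 0;

	if (!zoned_flag)
		return 0;	/* this sketch only checks ZONED filesystems */

	for (size_t i = 0; i < ndev; i++) {
		if (zone_sizes[i] == 0)
			return -EINVAL;	/* regular device in a ZONED filesystem */
		if (zone_size == 0)
			zone_size = zone_sizes[i];
		else if (zone_sizes[i] != zone_size)
			return -EINVAL;	/* zone sizes differ between devices */
	}
	return 0;
}
```

      Two devices with 256MiB zones pass the check; pairing a 256MiB-zone
      device with a 128MiB-zone device or with a regular device does not.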
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b70f5097
    • Naohiro Aota's avatar
      btrfs: get zone information of zoned block devices · 5b316468
      Naohiro Aota authored
      If a zoned block device is found, get its zone information (number of
      zones and zone size).  To avoid costly run-time zone report
      commands to test the device zone types during block allocation, attach
      the seq_zones bitmap to the device structure to indicate if a zone is
      sequential or accepts random writes. Also attach the empty_zones
      bitmap to indicate if a zone is empty or not.
      
      This patch also introduces the helper function btrfs_dev_is_sequential()
      to test if the zone storing a block is a sequential write required zone
      and btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
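
      The bitmap lookups can be pictured with the following user-space
      sketch. The struct and function names are hypothetical stand-ins for
      the helpers named above, and a power-of-two zone size is assumed.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct zone_info_sketch {
	unsigned int zone_size_shift;		/* log2(zone size in bytes) */
	const unsigned long *seq_zones;		/* bit set: sequential zone */
	const unsigned long *empty_zones;	/* bit set: zone is empty */
};

/* Test one bit of a bitmap stored as an array of unsigned long words. */
static bool test_zone_bit(const unsigned long *bitmap, uint64_t zone)
{
	return (bitmap[zone / BITS_PER_WORD] >> (zone % BITS_PER_WORD)) & 1UL;
}

/* Is the zone containing byte offset `pos` sequential write required? */
static bool dev_is_sequential(const struct zone_info_sketch *zi, uint64_t pos)
{
	return test_zone_bit(zi->seq_zones, pos >> zi->zone_size_shift);
}

/* Is the zone containing byte offset `pos` empty? */
static bool dev_is_empty_zone(const struct zone_info_sketch *zi, uint64_t pos)
{
	return test_zone_bit(zi->empty_zones, pos >> zi->zone_size_shift);
}
```

      Each query is a shift and a bit test on cached data, so no zone report
      command has to be sent to the device at allocation time.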
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5b316468
  2. 08 Dec, 2020 8 commits
    • Naohiro Aota's avatar
      btrfs: introduce ZONED feature flag · 7b3d5a90
      Naohiro Aota authored
      This patch introduces the ZONED incompat flag. The flag indicates that
      the volume management will satisfy the constraints imposed by
      host-managed zoned block devices (aligned chunk allocation, append-only
      updates, reset zone after filled).
      
      As the zoned support will happen incrementally due to enhancing some
      core infrastructure like super block writes, tree-log, raid support, the
      feature will appear in sysfs only on debug builds. It will be enabled
      once the support is feature complete and applications can reliably check
      whether zoned support is present or not.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7b3d5a90
    • Nikolay Borisov's avatar
      btrfs: return bool from btrfs_should_end_transaction · a2633b6a
      Nikolay Borisov authored
      Results in slightly smaller code.
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11)
      Function                                     old     new   delta
      btrfs_should_end_transaction                  96      85     -11
      Total: Before=20070, After=20059, chg -0.05%
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a2633b6a
    • Nikolay Borisov's avatar
      8a8f4dea
    • Nikolay Borisov's avatar
      btrfs: remove err variable from do_relocation · 8df01fdd
      Nikolay Borisov authored
      It simply gets assigned to 'ret' in case of errors. The flow of the
      while loop is not changed by this commit since the few call sites
      that 'goto next' will simply break from the loop.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8df01fdd
    • Nikolay Borisov's avatar
      btrfs: eliminate err variable from merge_reloc_root · c6a592f2
      Nikolay Borisov authored
      In most cases when an error is returned from a function 'ret' is simply
      assigned to 'err'. There is only one case where walk_up_reloc_tree can
      return a positive value - in this case the code breaks from the loop and
      ret is going to get its return value from btrfs_cow_block - either 0 or
      negative. This retains the old logic of how 'err' used to be set at
      this call site.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c6a592f2
    • Nikolay Borisov's avatar
      btrfs: remove err variable from btrfs_delete_subvolume · ee0d904f
      Nikolay Borisov authored
      Use only a single 'ret' to control whether we should abort the
      transaction or not. That's fine, because if we abort a transaction then
      btrfs_end_transaction will return the same value as passed to
      btrfs_abort_transaction. No semantic changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ee0d904f
    • Filipe Manana's avatar
      btrfs: unlock path before checking if extent is shared during nocow writeback · c65ca98f
      Filipe Manana authored
      When we are attempting to start writeback for an existing extent in NOCOW
      mode, at run_delalloc_nocow(), we must check if the extent is shared, and
      if it is, fall back to a COW write. However, we do this check while still
      holding a read lock on the leaf that contains the file extent item, and
      that check, the call to btrfs_cross_ref_exist(), can take some time
      because:
      
      1) It needs to do a search on the extent tree, which obviously takes some
         time, especially if delayed references are being run at the moment, as
         we can block when trying to lock currently write locked btree nodes;
      
      2) It needs to check the delayed references for any existing reference
         for our data extent. This requires acquiring the delayed references'
         spinlock and may block on the mutex of a delayed reference head in the
         case where there is a delayed reference for our data extent. In the
         worst case it makes us release the path on the extent tree and retry
         the whole process again (going back to step 1).
      
      There are other operations we do while holding the leaf locked that can
      take some significant time as well (especially all together):
      
      * btrfs_extent_readonly() - to check if the block group containing the
        extent is currently in RO mode. This requires taking a spinlock and
        searching for the block group in a rbtree that can be big on large
        filesystems;
      
      * csum_exist_in_range() - to search if there are any checksums in the
        csum tree for the extent. Like before, this can take some time if we are
        in a filesystem that has both COW and NOCOW files, in which case the
        csum tree is not empty;
      
      * btrfs_inc_nocow_writers() - increment the number of nocow writers in the
        block group that contains the data extent. Needs to acquire a spinlock
        and search for the block group in a rbtree that can be big on large
        filesystems.
      
      So just unlock the leaf (release the path) before doing all those checks,
      since we do not need it anymore. In case we can not do a NOCOW write for
      the extent, due to any of those checks failing, and the writeback range
      goes beyond that extent's length, we will do another btree search for the
      next file extent item.
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16GB of RAM, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian):
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-m single -d single"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      Before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9326331     0.317   399.957
       Close        6851198     0.002     6.402
       Rename        394894     2.621   402.819
       Unlink       1883131     0.931   398.082
       Deltree          256    19.160   303.580
       Mkdir            128     0.003     0.016
       Qpathinfo    8452314     0.068   116.133
       Qfileinfo    1481921     0.001     5.081
       Qfsinfo      1549963     0.002     4.444
       Sfileinfo     759679     0.084    17.079
       Find         3268168     0.396   118.196
       WriteX       4653310     0.056   110.993
       ReadX        14618818     0.005    23.314
       LockX          30364     0.003     0.497
       UnlockX        30364     0.002     1.720
       Flush         653619    16.954   569.299
      
      Throughput 966.651 MB/sec  64 clients  64 procs  max_latency=569.377 ms
      
      After this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9710433     0.302   232.449
       Close        7132948     0.002    11.496
       Rename        411144     2.452   131.805
       Unlink       1960961     0.893   230.383
       Deltree          256    14.858   198.646
       Mkdir            128     0.002     0.005
       Qpathinfo    8800890     0.066   111.588
       Qfileinfo    1542556     0.001     3.852
       Qfsinfo      1613835     0.002     5.483
       Sfileinfo     790871     0.081    19.492
       Find         3402743     0.386   120.185
       WriteX       4842918     0.054   179.312
       ReadX        15220407     0.005    32.435
       LockX          31612     0.003     1.533
       UnlockX        31612     0.002     1.047
       Flush         680567    16.320   463.323
      
      Throughput 1016.59 MB/sec  64 clients  64 procs  max_latency=463.327 ms
      
      +5.0% throughput, -20.5% max latency
      
      Also, the following test using fio was run:
      
        $ cat test-fio.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 4 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo
        echo "Using fio config:"
        echo
        cat /tmp/fio-job.ini
        echo
        echo "mount options: $MOUNT_OPTIONS"
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo "Creating nodatacow files before fio runs..."
        for ((i = 0; i < $NUM_JOBS; i++)); do
            xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
        done
        sync
      
        fio /tmp/fio-job.ini
        umount $MNT
      
      Before this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
      
      After this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
      
      +9.7% throughput, -9.8% runtime
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c65ca98f
    • David Sterba's avatar
      btrfs: tree-checker: annotate all error branches as unlikely · c7c01a4a
      David Sterba authored
      The tree checker is called many times as it verifies metadata at
      read/write time. The checks follow a simple pattern:
      
        if (error_condition) {
                report_error();
                return -EUCLEAN;
        }
      
      All the error reporting functions are annotated as __cold, which is
      supposed to hint the compiler to move the statement block out of the hot
      path. This does not seem to happen that often.
      
      As the error condition is expected to be false almost always, we can
      annotate it with 'unlikely' as this satisfies one of the few use cases
      for the annotation. The expected outcome is a stronger hint to the
      compiler to reorder the checks into the sequence
      
        test
        jump to exit
        test
        jump to exit
        ...
      
      which can be observed in the asm of e.g. check_dir_item,
      btrfs_check_chunk_valid, check_root_item or check_leaf.
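
      Outside the kernel, the annotated pattern can be reproduced with the
      GCC/Clang builtin that the kernel's unlikely() macro is based on. The
      check function and the message below are hypothetical examples, and
      EUCLEAN is given a fallback definition for non-Linux systems.

```c
#include <errno.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value: "Structure needs cleaning" */
#endif

/* Tell the compiler that the condition is expected to be false. */
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Cold functions are optimized for size and kept off the hot path. */
__attribute__((cold))
static void report_error(const char *what)
{
	fprintf(stderr, "corrupt item: %s\n", what);
}

/* One tree-checker style test: verify an item has the expected size. */
static int check_item_size(unsigned int item_size, unsigned int expected)
{
	if (unlikely(item_size != expected)) {
		report_error("unexpected item size");
		return -EUCLEAN;
	}
	return 0;
}
```

      The hint does not change behaviour: a matching size returns 0 and a
      mismatch returns -EUCLEAN either way. It only influences how the
      compiler lays out the branches.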
      
      There's a measurable run time improvement reported by Josef, the testing
      workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
      
      There should be no functional changes, but some of the conditions have
      been rewritten to produce a more readable result, and some lines are
      longer than 80 characters for the sake of readability.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7c01a4a