1. 08 Feb, 2021 40 commits
    • Qu Wenruo's avatar
      btrfs: introduce helpers for subpage error status · 03a816b3
      Qu Wenruo authored
      Introduce the following functions to handle subpage error status:
      
      - btrfs_subpage_set_error()
      - btrfs_subpage_clear_error()
      - btrfs_subpage_test_error()
        These helpers can only be called when the page has subpage attached
        and the range is ensured to be inside the page.
      
      - btrfs_page_set_error()
      - btrfs_page_clear_error()
      - btrfs_page_test_error()
        These helpers can handle both regular sector size and subpage without
        problem.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03a816b3
    • Qu Wenruo's avatar
      btrfs: introduce helpers for subpage uptodate status · a1d767c1
      Qu Wenruo authored
      Introduce the following functions to handle subpage uptodate status:
      
      - btrfs_subpage_set_uptodate()
      - btrfs_subpage_clear_uptodate()
      - btrfs_subpage_test_uptodate()
        These helpers can only be called when the page has subpage attached
        and the range is ensured to be inside the page.
      
      - btrfs_page_set_uptodate()
      - btrfs_page_clear_uptodate()
      - btrfs_page_test_uptodate()
        These helpers can handle both regular sector size and subpage.
        Although caller should still ensure that the range is inside the page.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a1d767c1
    • Qu Wenruo's avatar
      btrfs: attach private to dummy extent buffer pages · 09bc1f0f
      Qu Wenruo authored
      There are locations where we allocate dummy extent buffers for temporary
      usage, like in tree_mod_log_rewind() or get_old_root().
      
      These dummy extent buffers will be handled by the same eb accessors, and
      if they don't have page::private subpage eb accessors could fail.
      
      To address such problems, make __alloc_dummy_extent_buffer() attach
      page private for dummy extent buffers too.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      09bc1f0f
    • Qu Wenruo's avatar
      btrfs: support subpage for extent buffer page release · 8ff8466d
      Qu Wenruo authored
      In btrfs_release_extent_buffer_pages(), we need to add extra handling
      for subpage.
      
      Introduce a helper, detach_extent_buffer_page(), to do different
      handling for regular and subpage cases.
      
      For subpage case, handle detaching page private.
      
      For unmapped (dummy or cloned) ebs, we can detach the page private
      immediately as the page can only be attached to one unmapped eb.
      
      For mapped ebs, we have to ensure there are no eb in the page range
      before we delete it, as page->private is shared between all ebs in the
      same page.
      
      But there is a subpage specific race, where we can race with extent
      buffer allocation, and clear the page private while new eb is still
      being utilized, like this:
      
        Extent buffer A is the new extent buffer which will be allocated,
        while extent buffer B is the last existing extent buffer of the page.
      
        		T1 (eb A) 	 |		T2 (eb B)
        -------------------------------+------------------------------
        alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
        |- p = find_or_create_page()   | |
        |- attach_extent_buffer_page() | |
        |				 | |- detach_extent_buffer_page()
        |				 |    |- if (!page_range_has_eb())
        |				 |    |  No new eb in the page range yet
        |				 |    |  As new eb A hasn't yet been
        |				 |    |  inserted into radix tree.
        |				 |    |- btrfs_detach_subpage()
        |				 |       |- detach_page_private();
        |- radix_tree_insert()	 |
      
        Then we have a metadata eb whose page has no private bit.
      
      To avoid such race, we introduce a subpage metadata-specific member,
      btrfs_subpage::eb_refs.
      
      In alloc_extent_buffer() we increase eb_refs in the critical section of
      private_lock.  Then page_range_has_eb() will return true for
      detach_extent_buffer_page(), and will not detach page private.
      
      The section is marked by:
      
      - btrfs_page_inc_eb_refs()
      - btrfs_page_dec_eb_refs()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ff8466d
    • Qu Wenruo's avatar
      btrfs: make grab_extent_buffer_from_page() handle subpage case · 81982210
      Qu Wenruo authored
      For subpage case, grab_extent_buffer() can't really get an extent buffer
      just from btrfs_subpage.
      
      We have radix tree lock protecting us from inserting the same eb into
      the tree.  Thus we don't really need to do the extra hassle, just let
      alloc_extent_buffer() handle the existing eb in radix tree.
      
      Now if two ebs are being allocated as the same time, one will fail with
      -EEIXST when inserting into the radix tree.
      
      So for grab_extent_buffer(), just always return NULL for subpage case.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81982210
    • Qu Wenruo's avatar
      btrfs: make attach_extent_buffer_page() handle subpage case · 760f991f
      Qu Wenruo authored
      For subpage case, we need to allocate additional memory for each
      metadata page.
      
      So we need to:
      
      - Allow attach_extent_buffer_page() to return int to indicate allocation
        failure
      
      - Allow manually pre-allocate subpage memory for alloc_extent_buffer()
        As we don't want to use GFP_ATOMIC under spinlock, we introduce
        btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
        purpose.
        (The simple wrap for btrfs_free_subpage() is for later convert to
         kmem_cache. Already internally tested without problem)
      
      - Preallocate btrfs_subpage structure for alloc_extent_buffer()
        We don't want to call memory allocation with spinlock held, so
        do preallocation before we acquire mapping->private_lock.
      
      - Handle subpage and regular case differently in
        attach_extent_buffer_page()
        For regular case, no change, just do the usual thing.
        For subpage case, allocate new memory or use the preallocated memory.
      
      For future subpage metadata, we will make use of radix tree to grab
      extent buffer.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      760f991f
    • Qu Wenruo's avatar
      btrfs: introduce the skeleton of btrfs_subpage structure · cac06d84
      Qu Wenruo authored
      For sectorsize < page size support, we need a structure to record extra
      status info for each sector of a page.
      
      Introduce the skeleton structure, all subpage related code would go to
      subpage.[ch].
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cac06d84
    • Qu Wenruo's avatar
      btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support · 62c053fb
      Qu Wenruo authored
      For the incoming subpage support, UNMAPPED extent buffer will have
      different behavior in btrfs_release_extent_buffer().
      
      This means we need to set UNMAPPED bit early before calling
      btrfs_release_extent_buffer().
      
      Currently there is only one caller which relies on
      btrfs_release_extent_buffer() in its error path while set UNMAPPED bit
      late:
      - btrfs_clone_extent_buffer()
      
      Make it subpage compatible by setting the UNMAPPED bit early, since
      we're here, also move the UPTODATE bit early.
      
      There is another caller, __alloc_dummy_extent_buffer(), setting
      UNMAPPED bit late, but that function clean up the allocated page
      manually, thus no need for any modification.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62c053fb
    • Qu Wenruo's avatar
      btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK · 6869b0a8
      Qu Wenruo authored
      PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
      __process_pages_contig(), to let the function know to clear page dirty
      bit and then set page writeback.
      
      However page writeback and dirty bits are conflicting (at least for
      sector size == PAGE_SIZE case), this means these two have to be always
      updated together.
      
      This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
      PAGE_START_WRITEBACK.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6869b0a8
    • Filipe Manana's avatar
      btrfs: make concurrent fsyncs wait less when waiting for a transaction commit · d0c2f4fa
      Filipe Manana authored
      Often an fsync needs to fallback to a transaction commit for several
      reasons (to ensure consistency after a power failure, a new block group
      was allocated or a temporary error such as ENOMEM or ENOSPC happened).
      
      In that case the log is marked as needing a full commit and any concurrent
      tasks attempting to log inodes or commit the log will also fallback to the
      transaction commit. When this happens they all wait for the task that first
      started the transaction commit to finish the transaction commit - however
      they wait until the full transaction commit happens, which is not needed,
      as they only need to wait for the superblocks to be persisted and not for
      unpinning all the extents pinned during the transaction's lifetime, which
      even for short lived transactions can be a few thousand and take some
      significant amount of time to complete - for dbench workloads I have
      observed up to 4~5 milliseconds of time spent unpinning extents in the
      worst cases, and the number of pinned extents was between 2 to 3 thousand.
      
      So allow fsync tasks to skip waiting for the unpinning of extents when
      they call btrfs_commit_transaction() and they were not the task that
      started the transaction commit (that one has to do it, the alternative
      would be to offload the transaction commit to another task so that it
      could avoid waiting for the extent unpinning or offload the extent
      unpinning to another task).
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      After applying the entire patchset, dbench shows improvements in respect
      to throughput and latency. The script used to measure it is the following:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      The test was run on a physical machine with 12 cores (Intel corei7), 64G
      of ram, using a NVMe device and a non-debug kernel configuration (Debian's
      default configuration).
      
      Before applying patchset, 32 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9627107     0.153    61.938
       Close        7072076     0.001     3.175
       Rename        407633     1.222    44.439
       Unlink       1943895     0.658    44.440
       Deltree          256    17.339   110.891
       Mkdir            128     0.003     0.009
       Qpathinfo    8725406     0.064    17.850
       Qfileinfo    1529516     0.001     2.188
       Qfsinfo      1599884     0.002     1.457
       Sfileinfo     784200     0.005     3.562
       Find         3373513     0.411    30.312
       WriteX       4802132     0.053    29.054
       ReadX       15089959     0.002     5.801
       LockX          31344     0.002     0.425
       UnlockX        31344     0.001     0.173
       Flush         674724     5.952   341.830
      
      Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms
      
      After applying patchset, 32 clients:
      
      After patchset, with 32 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9931568     0.111    25.597
       Close        7295730     0.001     2.171
       Rename        420549     0.982    49.714
       Unlink       2005366     0.497    39.015
       Deltree          256    11.149    89.242
       Mkdir            128     0.002     0.014
       Qpathinfo    9001863     0.049    20.761
       Qfileinfo    1577730     0.001     2.546
       Qfsinfo      1650508     0.002     3.531
       Sfileinfo     809031     0.005     5.846
       Find         3480259     0.309    23.977
       WriteX       4952505     0.043    41.283
       ReadX       15568127     0.002     5.476
       LockX          32338     0.002     0.978
       UnlockX        32338     0.001     2.032
       Flush         696017     7.485   228.835
      
      Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms
      
       --> +4.1% throughput, -39.6% max latency
      
      Before applying patchset, 64 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    8956748     0.342   108.312
       Close        6579660     0.001     3.823
       Rename        379209     2.396    81.897
       Unlink       1808625     1.108   131.148
       Deltree          256    25.632   172.176
       Mkdir            128     0.003     0.018
       Qpathinfo    8117615     0.131    55.916
       Qfileinfo    1423495     0.001     2.635
       Qfsinfo      1488496     0.002     5.412
       Sfileinfo     729472     0.007     8.643
       Find         3138598     0.855    78.321
       WriteX       4470783     0.102    79.442
       ReadX       14038139     0.002     7.578
       LockX          29158     0.002     0.844
       UnlockX        29158     0.001     0.567
       Flush         627746    14.168   506.151
      
      Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms
      
      After applying patchset, 64 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9069003     0.303    43.193
       Close        6662328     0.001     3.888
       Rename        383976     2.194    46.418
       Unlink       1831080     1.022    43.873
       Deltree          256    24.037   155.763
       Mkdir            128     0.002     0.005
       Qpathinfo    8219173     0.137    30.233
       Qfileinfo    1441203     0.001     3.204
       Qfsinfo      1507092     0.002     4.055
       Sfileinfo     738775     0.006     5.431
       Find         3177874     0.936    38.170
       WriteX       4526152     0.084    39.518
       ReadX       14213562     0.002    24.760
       LockX          29522     0.002     1.221
       UnlockX        29522     0.001     0.694
       Flush         635652    14.358   422.039
      
      Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms
      
       --> +6.8% throughput, -18.1% max latency
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0c2f4fa
    • Filipe Manana's avatar
      btrfs: remove unnecessary check_parent_dirs_for_sync() · 64d6b281
      Filipe Manana authored
      Whenever we fsync an inode, if it is a directory, a regular file that was
      created in the current transaction or has last_unlink_trans set to the
      generation of the current transaction, we check if any of its ancestor
      inodes (and the inode itself if it is a directory) can not be logged and
      need a fallback to a full transaction commit - if so, we return with a
      value of 1 in order to fallback to a transaction commit.
      
      However we often do not need to fallback to a transaction commit because:
      
      1) The ancestor inode is not an immediate parent, and therefore there is
         not an explicit request to log it and it is not needed neither to
         guarantee the consistency of the inode originally asked to be logged
         (fsynced) nor its immediate parent;
      
      2) The ancestor inode was already logged before, in which case any link,
         unlink or rename operation updates the log as needed.
      
      So for these two cases we can avoid an unnecessary transaction commit.
      Therefore remove check_parent_dirs_for_sync() and add a check at the top
      of btrfs_log_inode() to make us fallback immediately to a transaction
      commit when we are logging a directory inode that can not be logged and
      needs a full transaction commit. All we need to protect is the case where
      after renaming a file someone fsyncs only the old directory, which would
      result is losing the renamed file after a log replay.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64d6b281
    • Filipe Manana's avatar
      btrfs: skip logging inodes already logged when logging new entries · 0e44cb3f
      Filipe Manana authored
      When logging new directory entries of a directory, we log the inodes of
      new dentries and the inodes of dentries pointing to directories that
      may have been created in past transactions. For the case of directories
      we log in full mode, which can be particularly expensive for large
      directories.
      
      We do use btrfs_inode_in_log() to skip already logged inodes, however for
      that helper to return true, it requires that the log transaction used to
      log the inode to be already committed. This means that when we have more
      than one task using the same log transaction we can end up logging an
      inode multiple times, which is a waste of time and not necessary since
      the log will be committed by one of the tasks and the others will wait for
      the log transaction to be committed before returning to user space.
      
      So simply replace the use of btrfs_inode_in_log() with the new helper
      function need_log_inode(), introduced in a previous commit.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e44cb3f
    • Filipe Manana's avatar
      btrfs: skip logging directories already logged when logging all parents · 3e6a86a1
      Filipe Manana authored
      Some times when we fsync an inode we need to do a full log of all its
      ancestors (due to unlink, link or rename operations), which can be an
      expensive operation, specially if the directories are large.
      
      However if we find an ancestor directory inode that is already logged in
      the current transaction, and has no inserted/updated/deleted xattrs since
      it was last logged, we can skip logging the directory again. We are safe
      to skip that since we know that for logged directories, any link, unlink
      or rename operations that implicate the directory will update the log as
      necessary.
      
      So use the helper need_log_dir(), introduced in a previous commit, to
      detect already logged directories that can be skipped.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e6a86a1
    • Filipe Manana's avatar
      btrfs: avoid logging new ancestor inodes when logging new inode · ab12313a
      Filipe Manana authored
      When we fsync a new file, created in the current transaction, we check
      all its ancestor inodes and always log them if they were created in the
      current transaction - even if we have already logged them before, which
      is a waste of time.
      
      So avoid logging new ancestor inodes if they were already logged before
      and have no xattrs added/updated/removed since they were last logged.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab12313a
    • Filipe Manana's avatar
      btrfs: stop setting nbytes when filling inode item for logging · e593e54e
      Filipe Manana authored
      When we fill an inode item for logging we are setting its nbytes field
      with the value returned by inode_get_bytes() (a VFS API), however we do
      not need it because it is not used during log replay. In fact, for fast
      fsyncs, when we call inode_get_bytes() we may even get an outdated value
      for nbytes because the nbytes field of the inode is only updated when
      ordered extents complete, and a fast fsync only waits for writeback to
      complete, it does not wait for ordered extent completion.
      
      So just remove the setup of nbytes and add an explicit comment mentioning
      why we do not set it. This also avoids adding contention on the inode's
      i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
      inode_get_bytes() which is also called by our stat callback
      (btrfs_getattr()).
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e593e54e
    • Filipe Manana's avatar
      btrfs: remove unnecessary directory inode item update when deleting dir entry · ddffcf6f
      Filipe Manana authored
      When we remove a directory entry, as part of an unlink operation, if the
      directory was logged before we must remove the directory index items from
      the log. We are also updating the inode item of the directory to update
      its i_size, but that is not necessary because during log replay we do not
      need it and we correctly adjust the i_size in the inode item of the
      subvolume as we process directory index items and replay deletes.
      
      This is not needed since commit d555438b ("Btrfs: drop dir i_size
      when adding new names on replay"), where we explicitly ignore the i_size
      of directory inode items on log replay. Before that we used it but it
      was buggy as mentioned in that commit's change log (i_size got a larger
      value then it should have).
      
      So stop updating the i_size of the directory inode item in the log, as
      that is a waste of time, adds more log contention to the log tree and
      often results in COWing more extent buffers for the log tree.
      
      This code path is triggered often during dbench workloads for example.
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ddffcf6f
    • Michal Rostecki's avatar
      btrfs: let callers of btrfs_get_io_geometry pass the em · 42034313
      Michal Rostecki authored
      Before this change, the btrfs_get_io_geometry() function was calling
      btrfs_get_chunk_map() to get the extent mapping, necessary for
      calculating the I/O geometry. It was using that extent mapping only
      internally and freeing the pointer after its execution.
      
      That resulted in calling btrfs_get_chunk_map() de facto twice by the
      __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
      first and then calling btrfs_get_chunk_map() directly to get the extent
      mapping, used by the rest of the function.
      
      Change that to passing the extent mapping to the btrfs_get_io_geometry()
      function as an argument.
      
      This could improve performance in some cases.  For very large
      filesystems, i.e. several thousands of allocated chunks, not only this
      avoids searching two times the rbtree, saving time, it may also help
      reducing contention on the lock that protects the tree - thinking of
      writeback starting for multiple inodes, other tasks allocating or
      removing chunks, and anything else that requires access to the rbtree.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarMichal Rostecki <mrostecki@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add Filipe's analysis ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42034313
    • Qu Wenruo's avatar
      btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge · 951c80f8
      Qu Wenruo authored
      Commit dbfdb6d1 ("Btrfs: Search for all ordered extents that could
      span across a page") make btrfs_invalidapage() to search all ordered
      extents.
      
      The offending code looks like this:
      
        again:
      	  start = page_start;
      	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
      	  if (ordred) {
      		  end = min(page_end,
      			    ordered->file_offset + ordered->num_bytes - 1);
      
      		  /* Do the cleanup */
      
      		  start = end + 1;
      		  if (start < page_end)
      			  goto again;
      	  }
      
      The behavior is indeed necessary for the incoming subpage support, but
      when it iterates through all the ordered extents, it also resets the
      search range @start.
      
      This means, for the following cases, we can double account the ordered
      extents, causing its bytes_left underflow:
      
      	Page offset
      	0		16K		32K
      	|<--- OE 1  --->|<--- OE 2 ---->|
      
      As the first iteration will find ordered extent (OE) 1, which doesn't
      cover the full page, thus after cleanup code, we need to retry again.
      But again label will reset start to page_start, and we got OE 1 again,
      which causes double accounting on OE 1, and cause OE 1's byte_left to
      underflow.
      
      This problem can only happen for subpage case, as for regular sectorsize
      == PAGE_SIZE case, we will always find a OE ends at or after page end,
      thus no way to trigger the problem.
      
      Move the again label after start = page_start.  There will be more
      comprehensive rework to convert the open coded loop to a proper while
      loop for subpage support.
      
      Fixes: dbfdb6d1 ("Btrfs: Search for all ordered extents that could span across a page")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      951c80f8
    • Abaci Team's avatar
      btrfs: simplify condition in __btrfs_run_delayed_items · a4559e6f
      Abaci Team authored
      Fix the following coccicheck warnings:
      
      ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
      equivalent to !A || B.
      Reported-by: default avatarAbaci Robot <abaci@linux.alibaba.com>
      Suggested-by: default avatarJiapeng Zhong <oswb@linux.alibaba.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAbaci Team <abaci-bugfix@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4559e6f
    • Filipe Manana's avatar
      btrfs: remove wrong comment for can_nocow_extent() · 2965194b
      Filipe Manana authored
      The comment for can_nocow_extent() says that the function will flush
      ordered extents, however that never happens and was never true before the
      comment was added in commit e4ecaf90 ("btrfs: add comments for
      btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for
      the function btrfs_check_can_nocow(), which after that commit was renamed
      to check_can_nocow(). So just remove that part of the comment.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2965194b
    • Josef Bacik's avatar
      btrfs: add a trace class for dumping the current ENOSPC state · e5ad49e2
      Josef Bacik authored
      Often when I'm debugging ENOSPC related issues I have to resort to
      printing the entire ENOSPC state with trace_printk() in different spots.
      This gets pretty annoying, so add a trace state that does this for us.
      Then add a trace point at the end of preemptive flushing so you can see
      the state of the space_info when we decide to exit preemptive flushing.
      This helped me figure out we weren't kicking in the preemptive flushing
      soon enough.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5ad49e2
    • Josef Bacik's avatar
      btrfs: adjust the flush trace point to include the source · 4b02b00f
      Josef Bacik authored
      Since we have normal ticketed flushing and preemptive flushing, adjust
      the tracepoint so that we know the source of the flushing action to make
      it easier to debug problems.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4b02b00f
    • Josef Bacik's avatar
      btrfs: implement space clamping for preemptive flushing · 88a777a6
      Josef Bacik authored
      Starting preemptive flushing at 50% of available free space is a good
      start, but some workloads are particularly abusive and can quickly
      overwhelm the preemptive flushing code and drive us into using tickets.
      
      Handle this by clamping down on our threshold for starting and
      continuing to run preemptive flushing.  This is particularly important
      for our overcommit case, as we can really drive the file system into
      overages and then it's more difficult to pull it back as we start to
      actually fill up the file system.
      
      The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
      calculate for overcommit is the baseline.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88a777a6
    • Josef Bacik's avatar
      btrfs: simplify the logic in need_preemptive_flushing · 2e294c60
      Josef Bacik authored
      A lot of this was added all in one go with no explanation, and is a bit
      unwieldy and confusing.  Simplify the logic to start preemptive flushing
      if we've reserved more than half of our available free space.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e294c60
    • Josef Bacik's avatar
      btrfs: rework btrfs_calc_reclaim_metadata_size · 9f42d377
      Josef Bacik authored
      Currently btrfs_calc_reclaim_metadata_size does two things, it returns
      the space currently required for flushing by the tickets, and if there
      are no tickets it calculates a value for the preemptive flushing.
      
      However for the normal ticketed flushing we really only care about the
      space required for tickets.  We will accidentally come in and flush one
      time, but as soon as we see there are no tickets we bail out of our
      flushing.
      
      Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
      what is required for flushing if we have people waiting on space.  Then
      move the preemptive flushing logic into need_preemptive_reclaim().  We
      ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
      because if we are in this path then we made our reservation and there
      are not pending tickets currently, so we do not need to check it, simply
      do the fuzzy logic to check if we're getting low on space.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f42d377
    • Josef Bacik's avatar
      btrfs: check reclaim_size in need_preemptive_reclaim · f205edf7
      Josef Bacik authored
      If we're flushing space for tickets then we have
      space_info->reclaim_size set and we do not need to do background
      reclaim.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f205edf7
    • Josef Bacik's avatar
      btrfs: rename need_do_async_reclaim · ae7913ba
      Josef Bacik authored
      All of our normal flushing is asynchronous reclaim, so this helper is
      poorly named.  This is more checking if we need to preemptively flush
      space, so rename it to need_preemptive_reclaim.
      
      Also switch it to bool and make it plain static as followup patches will
      move more code here.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae7913ba
    • Josef Bacik's avatar
      btrfs: improve preemptive background space flushing · 576fa348
      Josef Bacik authored
      Currently if we ever have to flush space because we do not have enough
      we allocate a ticket and attach it to the space_info, and then
      systematically flush things in the filesystem that hold space
      reservations until our space is reclaimed.
      
      However this has a latency cost, we must go to sleep and wait for the
      flushing to make progress before we are woken up and allowed to continue
      doing our work.
      
      In order to address that we used to kick off the async worker to flush
      space preemptively, so that we could be reclaiming space hopefully
      before any tasks needed to stop and wait for space to reclaim.
      
      When I introduced the ticketed ENOSPC stuff this broke slightly in the
      fact that we were using tickets to indicate if we were done flushing.
      No tickets, no more flushing.  However this meant that we essentially
      never preemptively flushed.  This caused a write performance regression
      that Nikolay noticed in an unrelated patch that removed the committing
      of the transaction during btrfs_end_transaction.
      
      The behavior that happened pre that patch was btrfs_end_transaction()
      would see that we were low on space, and it would commit the
      transaction.  This was bad because in this particular case you could end
      up with thousands and thousands of transactions being committed during
      the 5 minute reproducer.  With the patch to remove this behavior we got
      much more sane transaction commits, but we ended up slower because we
      would write for a while, flush, write for a while, flush again.
      
      To address this we need to reinstate a preemptive flushing mechanism.
      However it is distinctly different from our ticketing flushing in that
      it doesn't have tickets to base it's decisions on.  Instead of bolting
      this logic into our existing flushing work, add another worker to handle
      this preemptive flushing.  Here we will attempt to be slightly
      intelligent about the things that we flushing, attempting to balance
      between whichever pool is taking up the most space.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      576fa348
    • Josef Bacik's avatar
      btrfs: introduce a FORCE_COMMIT_TRANS flush operation · f00c42dd
      Josef Bacik authored
      Solely for preemptive flushing, we want to be able to force the
      transaction commit without any of the ambiguity of
      may_commit_transaction().  This is because may_commit_transaction()
      checks tickets and such, and in preemptive flushing we already know
      it'll be helpful, so use this to keep the code nice and clean and
      straightforward.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f00c42dd
    • Josef Bacik's avatar
      btrfs: track ordered bytes instead of just dio ordered bytes · 5deb17e1
      Josef Bacik authored
      We track dio_bytes because the shrink delalloc code needs to know if we
      have more DIO in flight than we have normal buffered IO.  The reason for
      this is because we can't "flush" DIO, we have to just wait on the
      ordered extents to finish.
      
      However this is true of all ordered extents.  If we have more ordered
      space outstanding than dirty pages we should be waiting on ordered
      extents.  We already are ok on this front technically, because we always
      do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
      the preemptive flushing code as well, so change this to count all
      ordered bytes instead of just DIO ordered bytes.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5deb17e1
    • Josef Bacik's avatar
      btrfs: add a trace point for reserve tickets · ac1ea10e
      Josef Bacik authored
      While debugging a ENOSPC related performance problem I needed to see the
      time difference between start and end of a reserve ticket, so add a
      trace point to report when we handle a reserve ticket.
      
      I opted to spit out start_ns itself without calculating the difference
      because there could be a gap between enabling the tracepoint and setting
      start_ns.  Doing it this way allows us to filter on 0 start_ns so we
      don't get bogus entries, and we can easily calculate the time difference
      with bpftrace or something else.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac1ea10e
    • Josef Bacik's avatar
      btrfs: make flush_space take a enum btrfs_flush_state instead of int · 91e79a83
      Josef Bacik authored
      I got a automated message from somebody who runs clang against our
      kernels and it's because I used the wrong enum type for what I passed
      into flush_space, caught by -Wenum-conversion.  Change the argument to
      be explicitly the enum we're expecting to make everything consistent.
      Maybe eventually gcc will catch errors like this.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91e79a83
    • Roman Anasal's avatar
      btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb · 88980383
      Roman Anasal authored
      btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
      struct send_ctx *sctx but when used in changed_cb it is immediately
      cast to `struct send_ctx *sctx = ctx;`.
      
      changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
      - full_send_tree already passes a struct send_ctx *sctx
      - btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
      - void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
      
      So casting to/from void *ctx seems unnecessary and directly using
      struct send_ctx *sctx instead provides better type-safety.
      
      The original reason for using void *ctx in the first place seems to have
      been dropped with 1b51d6fc ("btrfs: send: remove indirect callback
      parameter for changed_cb").
      Signed-off-by: default avatarRoman Anasal <roman.anasal@bdsu.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88980383
    • Josef Bacik's avatar
      btrfs: run delayed refs less often in commit_cowonly_roots · 488bc2a2
      Josef Bacik authored
      We love running delayed refs in commit_cowonly_roots, but it is a bit
      excessive.  I was seeing cases of running 3 or 4 refs a few times in a
      row during this time.  Instead simply:
      
      - update all of the roots first
      - then run delayed refs
      - then handle the empty block groups case
      - and then if we have any more dirty roots do the whole thing again
      
      This allows us to be much more efficient with our delayed ref running,
      as we can batch a few more operations at once.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      488bc2a2
    • Josef Bacik's avatar
      btrfs: stop running all delayed refs during snapshot · dac348e9
      Josef Bacik authored
      This was added in commit 361048f5 ("Btrfs: fix full backref problem
      when inserting shared block reference") to address a problem where we
      hit the following BUG_ON() in alloc_reserved_tree_block
      
              if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                      BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
      
      However this BUG_ON() is bogus, and was removed by previous commit:
      
        btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
      
      We no longer need to run delayed refs because of this, and can remove
      this flushing here.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dac348e9
    • Josef Bacik's avatar
      btrfs: remove bogus BUG_ON in alloc_reserved_tree_block · b7774425
      Josef Bacik authored
      The fix 361048f5 ("Btrfs: fix full backref problem when inserting
      shared block reference") added a delayed ref flushing at subvolume
      creation time in order to avoid hitting this particular BUG_ON().
      
      Before this fix, we were tripping the BUG_ON() by
      
      1. Modify snapshot A, which creates blocks with a normal reference for
         snapshot A, as A is the owner of these blocks.  We now have delayed
         refs for these blocks.
      2. Create a snapshot of A named B, which pushes references for the
         children blocks of the root node for the new root B, thus creating
         more delayed refs for newly allocated blocks.
      3. A is modified, and because the metadata blocks can now be shared, it
         must push FULL_BACKREF references to the children of any block that A
         COWs down it's path to its target key.
      4. Delayed refs are run.  Because these are newly allocated blocks, we
         have ->must_insert_reserved reserved set on the delayed ref head, we
         call into alloc_reserved_tree_block() to add the extent item, and
         then add our ref.  At the time of this fix, we were ordering
         FULL_BACKREF delayed ref operations first, so we'd go to add this
         reference and then BUG_ON() because we didn't have the FULL_BACKREF
         flag set.
      
      The patch fixed this problem by making sure we ran the delayed refs
      before we had the chance to modify A.  This meant that any *new* blocks
      would have had their extent items created _before_ we would ever
      actually COW down and generate FULL_BACKREF entries.  Thus the problem
      went away.
      
      However this BUG_ON() is actually completely bogus.  The existence of a
      full backref doesn't necessarily mean that FULL_BACKREF must be set on
      that block, it must only be set on the actual parent itself.  Consider
      the example provided above.  If we COW down one path from A, any nodes
      are going to have a FULL_BACKREF ref pushed down to _all_ of their
      children, but not all of the children are going to have FULL_BACKREF
      set.  It is completely valid to have an extent item with normal and full
      backrefs without FULL_BACKREF actually set on the block itself.
      
      As a final note, I have been testing with the patch (applied after this
      one)
      
        btrfs: stop running all delayed refs during snapshot
      
      which removed this flushing.  My test was a torture test which did a lot
      of operations while snapshotting and deleting snapshots as well as
      relocation, and I never tripped this BUG_ON().  This is actually because
      at the time of 361048f5, we ordered SHARED keys _before_ normal
      references, and thus they would get run first.  However currently they
      are ordered _after_ normal references, so we'd do the initial creation
      without having a shared reference, and thus not hit this BUG_ON(), which
      explains why I didn't start hitting this problem during my testing with
      my other patch applied.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7774425
    • Josef Bacik's avatar
      btrfs: move delayed ref flushing for qgroup into qgroup helper · 2a4d84c1
      Josef Bacik authored
      The commit d6726335 ("btrfs: qgroup: Make snapshot accounting work
      with new extent-oriented qgroup.") added a flush of the delayed refs
      during snapshot creation in order to get the qgroup accounting properly.
      However this code has changed and been moved to it's own helper that is
      skipped if qgroups are turned off.  Move the flushing to the helper, as
      we do not need it when qgroups are turned off.
      
      Also add a comment explaining why it exists, and why it doesn't actually
      save us.  This will be helpful later when we try to fix qgroup
      accounting properly.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2a4d84c1
    • Josef Bacik's avatar
      btrfs: only run delayed refs once before committing · ad368f33
      Josef Bacik authored
      We try to pre-flush the delayed refs when committing, because we want to
      do as little work as possible in the critical section of the transaction
      commit.
      
      However doing this twice can lead to very long transaction commit delays
      as other threads are allowed to continue to generate more delayed refs,
      which potentially delays the commit by multiple minutes in very extreme
      cases.
      
      So simply stick to one pre-flush, and then continue the rest of the
      transaction commit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad368f33
    • Josef Bacik's avatar
      btrfs: delayed refs pre-flushing should only run the heads we have · 61a56a99
      Josef Bacik authored
      Previously our delayed ref running used the total number of items as the
      items to run.  However we changed that to number of heads to run with
      the delayed_refs_rsv, as generally we want to run all of the operations
      for one bytenr.
      
      But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
      number of items that we have.  This is generally fine, but if we have
      some operation generation loads of delayed refs while we're doing this
      pre-flushing in the transaction commit, we'll just spin forever doing
      delayed refs.
      
      Fix this to simply pick the number of delayed refs we currently have,
      that way we do not end up doing a lot of extra work that's being
      generated in other threads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61a56a99
    • Josef Bacik's avatar
      btrfs: only let one thread pre-flush delayed refs in commit · e19eb11f
      Josef Bacik authored
      I've been running a stress test that runs 20 workers in their own
      subvolume, which are running an fsstress instance with 4 threads per
      worker, which is 80 total fsstress threads.  In addition to this I'm
      running balance in the background as well as creating and deleting
      snapshots.  This test takes around 12 hours to run normally, going
      slower and slower as the test goes on.
      
      The reason for this is because fsstress is running fsync sometimes, and
      because we're messing with block groups we often fall through to
      btrfs_commit_transaction, so will often have 20-30 threads all calling
      btrfs_commit_transaction at the same time.
      
      These all get stuck contending on the extent tree while they try to run
      delayed refs during the initial part of the commit.
      
      This is suboptimal, really because the extent tree is a single point of
      failure we only want one thread acting on that tree at once to reduce
      lock contention.
      
      Fix this by making the flushing mechanism a bit operation, to make it
      easy to use test_and_set_bit() in order to make sure only one task does
      this initial flush.
      
      Once we're into the transaction commit we only have one thread doing
      delayed ref running, it's just this initial pre-flush that is
      problematic.  With this patch my stress test takes around 90 minutes to
      run, instead of 12 hours.
      
      The memory barrier is not necessary for the flushing bit as it's
      ordered, unlike plain int. The transaction state accessed in
      btrfs_should_end_transaction could be affected by that too as it's not
      always used under transaction lock. Upon Nikolay's analysis in [1]
      it's not necessary:
      
        In should_end_transaction it's read without holding any locks. (U)
      
        It's modified in btrfs_cleanup_transaction without holding the
        fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
      
        set in cleanup_transaction under fs_info->trans_lock (L)
        set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
        set in btrfs_commit_trans to COMMIT_UNBLOCK under
        fs_info->trans_lock.(L)
      
        set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
        point the transaction is finished and fs_info->running_trans is NULL (U
        but irrelevant).
      
        So by the looks of it we can have a concurrent READ race with a WRITE,
        due to reads not taking a lock. In this case what we want to ensure is
        we either see new or old state. I consulted with Will Deacon and he said
        that in such a case we'd want to annotate the accesses to ->state with
        (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
        don't think this could happen but I imagine at some point KCSAN would
        flag such an access as racy (which it is).
      
      [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.comReviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ add comments regarding memory barrier ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e19eb11f