1. 21 Jun, 2021 40 commits
    • Qu Wenruo's avatar
      btrfs: make free space cache size consistent across different PAGE_SIZE · 0044ae11
      Qu Wenruo authored
      Currently free space cache inode size is determined by two factors:
      
      - block group size
      - PAGE_SIZE
      
      This means, for the same sized block groups, with different PAGE_SIZE,
      it will result in different inode sizes.
      
      This will not be a good thing for subpage support, so change the
      requirement for PAGE_SIZE to sectorsize.
      
      Now for the same 4K sectorsize btrfs, it should result the same inode
      size no matter what the PAGE_SIZE is.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0044ae11
    • Qu Wenruo's avatar
      btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE · 8df507cb
      Qu Wenruo authored
      [BUG]
      For the following file layout, scrub will not be able to repair all
      these two repairable error, but in fact make one corruption even
      unrepairable:
      
      	  inode offset 0      4k     8K
      Mirror 1               |XXXXXX|      |
      Mirror 2               |      |XXXXXX|
      
      [CAUSE]
      The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
      go crazy for subpage.
      
      For above case, when reading the first sector, we use PAGE_SIZE other
      than sectorsize to read, which makes us to read the full range [0, 64K).
      In fact, after 8K there may be no data at all, we can just get some
      garbage.
      
      Then when doing the repair, we also writeback a full page from mirror 2,
      this means, we will also writeback the corrupted data in mirror 2 back
      to mirror 1, leaving the range [4K, 8K) unrepairable.
      
      [FIX]
      This patch will modify the following PAGE_SIZE use with sectorsize:
      
      - scrub_print_warning_inode()
        Remove the min() and replace PAGE_SIZE with sectorsize.
        The min() makes no sense, as csum is done for the full sector with
        padding.
      
        This fixes a bug that subpage report extra length like:
         checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
         physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
      
        Where the error is only 1 sector.
      
      - scrub_handle_errored_block()
        Comments with PAGE|page involved, all changed to sector.
      
      - scrub_setup_recheck_block()
      - scrub_repair_page_from_good_copy()
      - scrub_add_page_to_wr_bio()
      - scrub_wr_submit()
      - scrub_add_page_to_rd_bio()
      - scrub_block_complete()
        Replace PAGE_SIZE with sectorsize.
        This solves several problems where we read/write extra range for
        subpage case.
      
      RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
      usage, and is not really safe enough.
      Thus we will reject RAID56 for subpage in later commit.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8df507cb
    • Nikolay Borisov's avatar
      btrfs: use list_last_entry in add_falloc_range · ec87b42f
      Nikolay Borisov authored
      Instead of calling list_entry with head->prev simply call
      list_last_entry which makes it obvious which member of the list is
      being referred. This allows to remove the extra 'prev' pointer.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec87b42f
    • Anand Jain's avatar
      btrfs: fix comment about max_out in btrfs_compress_pages · 4183abf6
      Anand Jain authored
      Commit e5d74902 ("btrfs: derive maximum output size in the
      compression implementation") removed @max_out argument in
      btrfs_compress_pages() but its comment remained, remove it.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4183abf6
    • Anand Jain's avatar
      btrfs: optimize variables size in btrfs_submit_compressed_write · 65b5355f
      Anand Jain authored
      Patch "btrfs: reduce compressed_bio member's types" reduced some
      member's size. Function arguments @len, @compressed_len and @nr_pages
      can be declared as unsigned int.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65b5355f
    • Anand Jain's avatar
      btrfs: optimize variables size in btrfs_submit_compressed_read · 356b4a2d
      Anand Jain authored
      Patch "btrfs: reduce compressed_bio member's types" reduced some
      member's size. Declare the variables @compressed_len, @nr_pages and
      @pg_index size as an unsigned int in the function
      btrfs_submit_compressed_read.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      356b4a2d
    • Anand Jain's avatar
      btrfs: reduce the variable size to fit nr_pages · 1d08ce58
      Anand Jain authored
      Patch "btrfs: reduce compressed_bio member's types" reduced the
      @nr_pages size to unsigned int, its cascading effects are updated here.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d08ce58
    • Filipe Manana's avatar
      btrfs: avoid unnecessary logging of xattrs during fast fsyncs · b590b839
      Filipe Manana authored
      When logging an inode we always log all its xattrs, so that we are able
      to figure out which ones should be deleted during log replay. However this
      is unnecessary when we are doing a fast fsync and no xattrs were added,
      changed or deleted since the last time we logged the inode in the current
      transaction.
      
      So skip the logging of xattrs when the inode was previously logged in the
      current transaction and no xattrs were added, changed or deleted. If any
      changes to xattrs happened, than the inode has BTRFS_INODE_COPY_EVERYTHING
      set in its runtime flags and the xattrs get logged. This saves time on
      scanning for xattrs, allocating memory, COWing log tree extent buffers and
      adding more lock contention on the extent buffers when there are multiple
      tasks logging in parallel.
      
      The use of xattrs is common when using ACLs, some applications, or when
      using security modules like SELinux where every inode gets a security
      xattr added to it.
      
      The following test script, using fio, was used on a box with 12 cores, 64G
      of RAM, a NVMe device and the default non-debug kernel config from Debian.
      It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file,
      each file with a single xattr of 50 bytes (about the same size for an ACL
      or SELinux xattr), doing random buffered writes with an fsync after each
      write.
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/test
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-d single -m single"
      
         NUM_JOBS=8
         FILE_SIZE=4G
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=1
         fallocate=none
         group_reporting=1
         direct=0
         bs=64K
         ioengine=sync
         size=$FILE_SIZE
         directory=$MNT
         numjobs=$NUM_JOBS
         EOF
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         echo "Creating files before fio runs, each with 1 xattr of 50 bytes"
         for ((i = 0; i < $NUM_JOBS; i++)); do
             path="$MNT/writers.$i.0"
             truncate -s $FILE_SIZE $path
             setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path
         done
      
         fio /tmp/fio-job.ini
         umount $MNT
      
      fio output before this change:
      
      WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec
      
      fio output after this change:
      
      WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec
      
      +16.8% throughput, -16.6% runtime
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b590b839
    • David Sterba's avatar
      btrfs: add device delete cancel · 67ae34b6
      David Sterba authored
      Accept device name "cancel" as a request to cancel running device
      deletion operation. The string is literal, in case there's a real device
      named "cancel", pass it as full absolute path or as "./cancel"
      
      This works for v1 and v2 ioctls when the device is specified by name.
      Moving chunks from the device uses relocation, use the conditional
      exclusive operation start and cancellation helpers
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      67ae34b6
    • David Sterba's avatar
      btrfs: add cancellation to resize · bb059a37
      David Sterba authored
      Accept literal string "cancel" as resize operation and interpret that
      as a request to cancel the running operation. If it's running, wait
      until it finishes current work and return ECANCELED.
      
      Shrinking resize uses relocation to move the chunks away, use the
      conditional exclusive operation start and cancellation helpers.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb059a37
    • David Sterba's avatar
      btrfs: add wrapper for conditional start of exclusive operation · 17aaa434
      David Sterba authored
      To support optional cancellation of some operations, add helper that will
      wrap all the combinations. In normal mode it's same as
      btrfs_exclop_start, in cancellation mode it checks if it's already
      running and request cancellation and waits until completion.
      
      The error codes can be returned to to user space and semantics is not
      changed, adding ECANCELED. This should be evaluated as an error and that
      the operation has not completed and the operation should be restarted
      or the filesystem status reviewed.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      17aaa434
    • David Sterba's avatar
      btrfs: introduce try-lock semantics for exclusive op start · 578bda9e
      David Sterba authored
      Add try-lock for exclusive operation start to allow callers to do more
      checks. The same operation must already be running. The try-lock and
      unlock must pair and are a substitute for btrfs_exclop_start, thus it
      must also pair with btrfs_exclop_finish to release the exclop context.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      578bda9e
    • David Sterba's avatar
      btrfs: add cancellable chunk relocation support · 907d2710
      David Sterba authored
      Add support code that will allow canceling relocation on the chunk
      granularity. This is different and independent of balance, that also
      uses relocation but is a higher level operation and manages it's own
      state and pause/cancellation requests.
      
      Relocation is used for resize (shrink) and device deletion so this will
      be a common point to implement cancellation for both. The context is
      entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
      enclosing one chunk relocation. The status bit is set and unset between
      the chunks. As relocation can take long, the effects may not be
      immediate and the request and actual action can slightly race.
      
      The fs_info::reloc_cancel_req is only supposed to be increased and does
      not pair with decrement like fs_info::balance_cancel_req.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      907d2710
    • David Sterba's avatar
      btrfs: protect exclusive_operation by super_lock · 0d7ed32c
      David Sterba authored
      The exclusive operation is now atomically checked and set using bit
      operations. Switch it to protection by spinlock. The super block lock is
      not frequently used and adding a new lock seems like an overkill so it
      should be safe to reuse it.
      
      The reason to use spinlock is to enhance the locking context so more
      checks can be done, eg. allowing the same exclusive operation enter
      the exclop section and cancel the running one. This will be used for
      resize and device delete.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d7ed32c
    • David Sterba's avatar
      btrfs: clean up header members offsets in write helpers · 24880be5
      David Sterba authored
      Move header offsetof() to the expression that calculates the address so
      it's part of get_eb_offset_in_page where the 2nd parameter is the member
      offset.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24880be5
    • David Sterba's avatar
      btrfs: simplify eb checksum verification in btrfs_validate_metadata_buffer · dfd29eed
      David Sterba authored
      The verification copies the calculated checksum bytes to a temporary
      buffer but this is not necessary. We can map the eb header on the first
      page and use the checksum bytes directly.
      
      This saves at least one function call and boundary checks so it could
      lead to a minor performance improvement.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dfd29eed
    • David Sterba's avatar
      btrfs: remove extra sb::s_id from message in btrfs_validate_metadata_buffer · ff14aa79
      David Sterba authored
      The s_id is already printed by message helpers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff14aa79
    • David Sterba's avatar
      btrfs: reduce compressed_bio members' types · 282ab3ff
      David Sterba authored
      Several members of compressed_bio are of type that's unnecessarily big
      for the values that they'd hold:
      
      - the size of the uncompressed and compressed data is 128K now, we can
        keep is as int
      - same for number of pages
      - the compress type fits to a byte
      - the errors is 0/1
      
      The size of the unpatched structure is 80 bytes with several holes.
      Reordering nr_pages next to the pages the hole after pending_bios is
      filled and the resulting size is 56 bytes. This keeps the csums array
      aligned to 8 bytes, which is nice. Further size optimizations may be
      possible but right now it looks good to me:
      
      struct compressed_bio {
              refcount_t                 pending_bios;         /*     0     4 */
              unsigned int               nr_pages;             /*     4     4 */
              struct page * *            compressed_pages;     /*     8     8 */
              struct inode *             inode;                /*    16     8 */
              u64                        start;                /*    24     8 */
              unsigned int               len;                  /*    32     4 */
              unsigned int               compressed_len;       /*    36     4 */
              u8                         compress_type;        /*    40     1 */
              u8                         errors;               /*    41     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        mirror_num;           /*    44     4 */
              struct bio *               orig_bio;             /*    48     8 */
              u8                         sums[];               /*    56     0 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 54, holes: 1, sum holes: 2 */
              /* last cacheline: 56 bytes */
      };
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      282ab3ff
    • David Sterba's avatar
    • David Sterba's avatar
      btrfs: scrub: factor out common scrub_stripe constraints · 7735cd75
      David Sterba authored
      There are common values set for the stripe constraints, some of them
      are already factored out. Do that for increment and mirror_num as well.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7735cd75
    • David Sterba's avatar
      btrfs: clear log tree recovering status if starting transaction fails · 1aeb6b56
      David Sterba authored
      When a log recovery is in progress, lots of operations have to take that
      into account, so we keep this status per tree during the operation. Long
      time ago error handling revamp patch 79787eaa ("btrfs: replace many
      BUG_ONs with proper error handling") removed clearing of the status in
      an error branch. Add it back as was intended in e02119d5 ("Btrfs:
      Add a write ahead tree log to optimize synchronous operations").
      
      There are probably no visible effects, log replay is done only during
      mount and if it fails all structures are cleared so the stale status
      won't be kept.
      
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1aeb6b56
    • David Sterba's avatar
      btrfs: clear defrag status of a root if starting transaction fails · 6819703f
      David Sterba authored
      The defrag loop processes leaves in batches and starting transaction for
      each. The whole defragmentation on a given root is protected by a bit
      but in case the transaction fails, the bit is not cleared
      
      In case the transaction fails the bit would prevent starting
      defragmentation again, so make sure it's cleared.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6819703f
    • David Sterba's avatar
      btrfs: sysfs: fix format string for some discard stats · 8c5ec995
      David Sterba authored
      The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
      format should be %llu, though the actual values would hardly ever
      overflow to negative values.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c5ec995
    • Josef Bacik's avatar
      btrfs: always abort the transaction if we abort a trans handle · 5963ffca
      Josef Bacik authored
      While stress testing our error handling I noticed that sometimes we
      would still commit the transaction even though we had aborted the
      transaction.
      
      Currently we track if a trans handle has dirtied any metadata, and if it
      hasn't we mark the filesystem as having an error (so no new transactions
      can be started), but we will allow the current transaction to complete
      as we do not mark the transaction itself as having been aborted.
      
      This sounds good in theory, but we were not properly tracking IO errors
      in btrfs_finish_ordered_io, and thus committing the transaction with
      bogus free space data.  This isn't necessarily a problem per-se with the
      free space cache, as the other guards in place would have kept us from
      accepting the free space cache as valid, but highlights a real world
      case where we had a bug and could have corrupted the filesystem because
      of it.
      
      This "skip abort on empty trans handle" is nice in theory, but assumes
      we have perfect error handling everywhere, which we clearly do not.
      Also we do not allow further transactions to be started, so all this
      does is save the last transaction that was happening, which doesn't
      necessarily gain us anything other than the potential for real
      corruption.
      
      Remove this particular bit of code, if we decide we need to abort the
      transaction then abort the current one and keep us from doing real harm
      to the file system, regardless of whether this specific trans handle
      dirtied anything or not.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5963ffca
    • Filipe Manana's avatar
      btrfs: don't set the full sync flag when truncation does not touch extents · 0d7d3165
      Filipe Manana authored
      At btrfs_truncate() where we truncate the inode either to the same size
      or to a smaller size, we always set the full sync flag on the inode.
      
      This is needed in case the truncation drops or trims any file extent items
      that start beyond or cross the new inode size, so that the next fsync
      drops all inode items from the log and scans again the fs/subvolume tree
      to find all items that must be logged.
      
      However if the truncation does not drop or trims any file extent items, we
      do not need to set the full sync flag and force the next fsync to use the
      slow code path. So do not set the full sync flag in such cases.
      
      One use case where it is frequent to do truncations that do not change
      the inode size and do not drop any extents (no prealloc extents beyond
      i_size) is when running Microsoft's SQL Server inside a Docker container.
      One example workload is the one Philipp Fent reported recently, in the
      thread with a link below. In this workload a large number of fsyncs are
      preceded by such truncate operations.
      
      After this change I constantly get the runtime for that workload from
      Philipp to be reduced by about -12%, for example from 184 seconds down
      to 162 seconds.
      
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/Tested-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d7d3165
    • Filipe Manana's avatar
      btrfs: fix misleading and incomplete comment of btrfs_truncate() · 4f7e6737
      Filipe Manana authored
      The comment at the top of btrfs_truncate() mentions that csum items are
      dropped or truncated to the new i_size, but this is wrong and non sense,
      as they are unrelated to the i_size and are located in the csums tree and
      not on a tree with inode items (fs/subvolume tree or a log tree). Instead
      that claim applies to file extent items, so fix the comment to refer to
      them instead.
      
      While at it make the whole comment for the function more descriptive and
      follow the kernel doc style.
      Tested-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4f7e6737
    • Josef Bacik's avatar
      btrfs: abort transaction if we fail to update the delayed inode · 04587ad9
      Josef Bacik authored
      If we fail to update the delayed inode we need to abort the transaction,
      because we could leave an inode with the improper counts or some other
      such corruption behind.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04587ad9
    • Josef Bacik's avatar
      btrfs: fix error handling in __btrfs_update_delayed_inode · bb385bed
      Josef Bacik authored
      If we get an error while looking up the inode item we'll simply bail
      without cleaning up the delayed node.  This results in this style of
      warning happening on commit:
      
        WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
        CPU: 0 PID: 76403 Comm: fsstress Tainted: G        W         5.13.0-rc1+ #373
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
        RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
        RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
        RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
        R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
        R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
        FS:  00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
        Call Trace:
         btrfs_commit_transaction+0x43c/0xb00
         ? finish_wait+0x80/0x80
         ? vfs_fsync_range+0x90/0x90
         iterate_supers+0x8c/0x100
         ksys_sync+0x50/0x90
         __do_sys_sync+0xa/0x10
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Because the iref isn't dropped and this leaves an elevated node->count,
      so any release just re-queues it onto the delayed inodes list.  Fix this
      by going to the out label to handle the proper cleanup of the delayed
      node.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb385bed
    • Josef Bacik's avatar
      btrfs: make btrfs_release_delayed_iref handle the !iref case · a4cb90dc
      Josef Bacik authored
      Right now we only cleanup the delayed iref if we have
      BTRFS_DELAYED_NODE_DEL_IREF set on the node.  However we have some error
      conditions that need to cleanup the iref if it still exists, so to make
      this code cleaner move the test_bit into btrfs_release_delayed_iref
      itself and unconditionally call it in each of the cases instead.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4cb90dc
    • David Sterba's avatar
      btrfs: scrub: per-device bandwidth control · eb3b5053
      David Sterba authored
      Add sysfs interface to limit io during scrub. We relied on the ionice
      interface to do that, eg. the idle class let the system usable while
      scrub was running. This has changed when mq-deadline got widespread and
      did not implement the scheduling classes. That was a CFQ thing that got
      deleted. We've got numerous complaints from users about degraded
      performance.
      
      Currently only BFQ supports that but it's not a common scheduler and we
      can't ask everybody to switch to it.
      
      Alternatively the cgroup io limiting can be used but that also a
      non-trivial setup (v2 required, the controller must be enabled on the
      system). This can still be used if desired.
      
      Other ideas that have been explored: piggy-back on ionice (that is set
      per-process and is accessible) and interpret the class and classdata as
      bandwidth limits, but this does not have enough flexibility as there are
      only 8 allowed and we'd have to map fixed limits to each value. Also
      adjusting the value would need to lookup the process that currently runs
      scrub on the given device, and the value is not sticky so would have to
      be adjusted each time scrub runs.
      
      Running out of options, sysfs does not look that bad:
      
      - it's accessible from scripts, or udev rules
      - the name is similar to what MD-RAID has
        (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
      - the value is sticky at least for filesystem mount time
      - adjusting the value has immediate effect
      - sysfs is available in constrained environments (eg. system rescue)
      - the limit also applies to device replace
      
      Sysfs:
      
      - raw value is in bytes
      - values written to the file accept suffixes like K, M
      - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
      - 0 means use default priority of IO
      
      The scheduler is a simple deadline one and the accuracy is up to nearest
      128K.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb3b5053
    • Johannes Thumshirn's avatar
      btrfs: zoned: factor out zoned device lookup · e7ff9e6b
      Johannes Thumshirn authored
      To be able to construct a zone append bio we need to look up the
      btrfs_device. The code doing the chunk map lookup to get the device is
      present in btrfs_submit_compressed_write and submit_extent_page.
      
      Factor out the lookup calls into a helper and use it in the submission
      paths.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7ff9e6b
    • Tian Tao's avatar
      btrfs: return EAGAIN if defrag is canceled · 50535db8
      Tian Tao authored
      When inode defrag is canceled, the error is set to EAGAIN but then
      overwritten by number of defragmented bytes. As this would hide the
      error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
      will print the error and continue to next file (as it does in for any
      other error).
      Signed-off-by: default avatarTian Tao <tiantao6@hisilicon.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50535db8
    • Qu Wenruo's avatar
      btrfs: remove io_failure_record::in_validation · 1245835d
      Qu Wenruo authored
      The io_failure_record::in_validation was introduced to handle failed bio
      which cross several sectors.  In such case, we still need to verify
      which sectors are corrupted.
      
      But since we've changed the way how we handle corrupted sectors, by only
      submitting repair for each corrupted sector, there is no need for extra
      validation any more.
      
      This patch will cleanup all io_failure_record::in_validation related
      code.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1245835d
    • Qu Wenruo's avatar
      btrfs: submit read time repair only for each corrupted sector · 150e4b05
      Qu Wenruo authored
      Currently btrfs_submit_read_repair() has some extra check on whether the
      failed bio needs extra validation for repair.  But we can avoid all
      these extra mechanisms if we submit the repair for each sector.
      
      By this, each read repair can be easily handled without the need to
      verify which sector is corrupted.
      
      This will also benefit subpage, as one subpage bvec can contain several
      sectors, making the extra verification more complex.
      
      So this patch will:
      
      - Introduce repair_one_sector()
        The main code submitting repair, which is more or less the same as old
        btrfs_submit_read_repair().
        But this time, it only repairs one sector.
      
      - Make btrfs_submit_read_repair() to handle sectors differently
        There are 3 different cases:
      
        * Good sector
          We need to release the page and extent, set the range uptodate.
      
        * Bad sector and failed to submit repair bio
          We need to release the page and extent, but not set the range
          uptodate.
      
        * Bad sector but repair bio submitted
          The page and extent release will be handled by the submitted repair
          bio. Nothing needs to be done.
      
        Since btrfs_submit_read_repair() will handle the page and extent
        release now, we need to skip to next bvec even we hit some error.
      
      - Change the lifespan of @uptodate in end_bio_extent_readpage()
        Since now btrfs_submit_read_repair() will handle the full bvec
        which contains any corruption, we don't need to bother updating
        @uptodate bit anymore.
        Just let @uptodate to be local variable inside the main loop,
        so that any error from one bvec won't affect later bvec.
      
      - Only export btrfs_repair_one_sector(), unexport
        btrfs_submit_read_repair()
        The only outside caller for read repair is DIO, which already submits
        its repair for just one sector.
        Only export btrfs_repair_one_sector() for DIO.
      
      This patch will focus on the change on the repair path, the extra
      validation code is still kept as is, and will be cleaned up later.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      150e4b05
    • Qu Wenruo's avatar
      btrfs: make btrfs_verify_data_csum() to return a bitmap · 08508fea
      Qu Wenruo authored
      This will provide the basis for later per-sector repair for subpage,
      while still keeping the existing code happy.
      
      As if all csums match, the return value will be 0, same as now.
      Only when csum mismatches, the return value is different.
      
      The new return value will be a bitmap, for 4K sectorsize and 4K page
      size, it will be either 1, instead of the -EIO (which is not used
      directly by the callers, no effective change).
      
      But for 4K sectorsize and 64K page size, aka subpage case, since the
      bvec can contain multiple sectors, knowing which sectors are corrupted
      will allow us to submit repair only for corrupted sectors.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08508fea
    • Johannes Thumshirn's avatar
      btrfs: rename check_async_write and let it return bool · f4dcfb30
      Johannes Thumshirn authored
      The 'check_async_write' function is a helper used in
      'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
      used for metadata.
      
      Make the function return bool and get rid of the local variable async in
      btrfs_submit_metadata_bio storing the result of check_async_write's
      tests.
      
      As this is touching all function call sites, also rename it to
      should_async_write as this is more in line with the naming we use.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4dcfb30
    • Johannes Thumshirn's avatar
      btrfs: zoned: bail out if we can't read a reliable write pointer · 06e1e7f4
      Johannes Thumshirn authored
      If we can't read a reliable write pointer from a sequential zone fail
      creating the block group with an I/O error.
      
      Also if the read write pointer is beyond the end of the respective zone,
      fail the creation of the block group on this zone with an I/O error.
      
      While this could also happen in real world scenarios with misbehaving
      drives, this issue addresses a problem uncovered by fstests' test case
      generic/475.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06e1e7f4
    • Naohiro Aota's avatar
      btrfs: zoned: print message when zone sanity check type fails · 47cdfb5e
      Naohiro Aota authored
      This extends patch 784daf2b ("btrfs: zoned: sanity check zone
      type"), the message was supposed to be there but was lost during merge.
      We want to make the error noticeable so add it.
      
      Fixes: 784daf2b ("btrfs: zoned: sanity check zone type")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47cdfb5e
    • Josef Bacik's avatar
      btrfs: handle preemptive delalloc flushing slightly differently · 385f421f
      Josef Bacik authored
      If we decide to flush delalloc from the preemptive flusher, we really do
      not want to wait on ordered extents, as it gains us nothing.  However
      there was logic to go ahead and wait on ordered extents if there was
      more ordered bytes than delalloc bytes.  We do not want this behavior,
      so pass through whether this flushing is for preemption, and do not wait
      for ordered extents if that's the case.  Also break out of the shrink
      loop after the first flushing, as we just want to one shot shrink
      delalloc.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      385f421f
    • Josef Bacik's avatar
      btrfs: only ignore delalloc if delalloc is much smaller than ordered · 3e101569
      Josef Bacik authored
      While testing heavy delalloc workloads I noticed that sometimes we'd
      just stop preemptively flushing when we had loads of delalloc available
      to flush.  This is because we skip preemptive flushing if delalloc <=
      ordered.  However if we start with say 4gib of delalloc, and we flush
      2gib of that, we'll stop flushing there, when we still have 2gib of
      delalloc to flush.
      
      Instead adjust the ordered bytes down by half, this way if 2/3 of our
      outstanding delalloc reservations are tied up by ordered extents we
      don't bother preemptive flushing, as we're getting close to the state
      where we need to wait on ordered extents.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e101569