1. 12 Oct, 2023 40 commits
    • btrfs: tree-checker: add support for raid stripe tree · e0b4077f
      Johannes Thumshirn authored
      Add tree checker support for RAID stripe tree items. Verify:
      
      - alignment
      - presence of the incompat bit
      - supported encoding
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
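
      The three checks listed above can be pictured with a small user-space
      sketch. Everything named here (sector size, bit position, encoding values,
      the function itself) is an illustrative assumption, not the kernel's
      tree-checker code.

```python
# User-space sketch only: constants and names are illustrative assumptions,
# not copied from the kernel's tree-checker.
SECTORSIZE = 4096                        # assumed filesystem sector size
INCOMPAT_RAID_STRIPE_TREE = 1 << 14      # assumed incompat feature bit
SUPPORTED_ENCODINGS = frozenset({1, 2, 3})  # hypothetical encoding values


def check_raid_stripe_item(logical, incompat_flags, encoding):
    """Return a list of problems found for one stripe tree item."""
    problems = []
    # Alignment: the item's logical start must sit on a sector boundary.
    if logical % SECTORSIZE != 0:
        problems.append("unaligned")
    # The filesystem must advertise the feature before such items may appear.
    if not incompat_flags & INCOMPAT_RAID_STRIPE_TREE:
        problems.append("missing incompat bit")
    # Only known encodings are accepted.
    if encoding not in SUPPORTED_ENCODINGS:
        problems.append("unsupported encoding")
    return problems
```

      An empty list means the item passed all three checks.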
    • btrfs: tracepoints: add events for raid stripe tree · b5e2c2ff
      Johannes Thumshirn authored
      Add trace events for raid-stripe-tree operations.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: announce presence of raid-stripe-tree · 9f9918a8
      Johannes Thumshirn authored
      If a filesystem with a raid-stripe-tree is mounted, show the RST feature
      in sysfs, currently still under the CONFIG_BTRFS_DEBUG option.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree pretty printer · edde81f1
      Johannes Thumshirn authored
      Decode raid-stripe-tree entries on btrfs_print_tree().
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: support RAID0/1/10 on top of raid stripe tree · 568220fa
      Johannes Thumshirn authored
      When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices
      for data block groups. For metadata block groups, we don't actually
      need anything special, as all metadata I/O is protected by the
      btrfs_zoned_meta_io_lock() already.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: implement raid stripe tree support · 9acaa641
      Johannes Thumshirn authored
      A filesystem that uses the raid stripe tree for logical to physical
      address translation can't use the regular scrub path, which reads all
      stripes and only afterwards checks if a sector is unused.
      
      When using the raid stripe tree, this will result in lookup errors, as
      the stripe tree doesn't know the requested logical addresses.
      
      In case we're scrubbing a filesystem which uses the RAID stripe tree for
      multi-device logical to physical address translation, perform an extra
      block mapping step to get the real on-disk stripe length from the stripe
      tree when scrubbing the sectors.
      
      This prevents a double completion of the btrfs_bio caused by splitting the
      underlying bio and ultimately a use-after-free.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lookup physical address from stripe extent · 10e27980
      Johannes Thumshirn authored
      Look up the physical address in the raid stripe tree when a read is
      attempted on a RAID volume formatted with the raid stripe tree.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delete stripe extent on extent deletion · ca41504e
      Johannes Thumshirn authored
      As each stripe extent is tied to an extent item, delete the stripe extent
      once the corresponding extent item is deleted.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add support for inserting raid stripe extents · 02c372e1
      Johannes Thumshirn authored
      Add support for inserting stripe extents into the raid stripe tree on
      completion of every write that needs an extra logical-to-physical
      translation when using RAID.
      
      The stripe extents are inserted after the data I/O has completed. This is
      done to
      
        a) support zone-append and
        b) rule out the possibility of a RAID-write-hole.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: read raid stripe tree from disk · 51502090
      Johannes Thumshirn authored
      If we find the raid-stripe-tree on mount, read it from disk. This is
      a backward incompatible feature. The rescue=ignorebadroots mount option
      will skip this tree.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree definitions · ee129330
      Johannes Thumshirn authored
      Add definitions for the raid stripe tree. This tree will hold information
      about the on-disk layout of the stripes in a RAID set.
      
      Each stripe extent has a 1:1 relationship with an on-disk extent item and
      performs the logical to per-drive physical address translation for the
      extent item in question.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
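
      The raid stripe tree commits above describe a per-extent map from a
      logical address to one physical address per drive, filled in after a
      write completes, consulted on reads and dropped when the extent goes
      away. A minimal user-space sketch of that idea follows. It assumes a
      mirror-style layout where every drive holds a full copy, and the names
      Stride and StripeTree are hypothetical, not kernel structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stride:
    devid: int     # device holding this copy (hypothetical field name)
    physical: int  # start of the copy on that device


class StripeTree:
    """Toy logical -> per-device physical map, one entry per extent."""

    def __init__(self):
        self._extents = {}  # logical start -> (length, strides)

    def insert(self, logical, length, strides):
        # Called once the data I/O has completed and the final
        # physical locations are known.
        self._extents[logical] = (length, tuple(strides))

    def delete(self, logical):
        # Called when the matching extent item is deleted.
        del self._extents[logical]

    def lookup(self, logical):
        # Translate a logical address into (devid, physical) pairs.
        for start, (length, strides) in self._extents.items():
            if start <= logical < start + length:
                offset = logical - start
                return [(s.devid, s.physical + offset) for s in strides]
        raise LookupError(f"no stripe extent covers logical {logical}")
```

      A lookup for an address that was never written raises LookupError, which
      mirrors why a scrub that blindly reads every stripe runs into lookup
      errors on such a filesystem.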
    • btrfs: warn on tree blocks which are not nodesize aligned · 6d3a6194
      Qu Wenruo authored
      A long time ago, we had some metadata chunks which started at a sector
      boundary but were not aligned to a nodesize boundary.
      
      This led to some older filesystems which can have tree blocks only
      aligned to sectorsize, but not nodesize.
      
      Later 'btrfs check' gained the ability to detect and warn about such tree
      blocks, and the kernel fixed the chunk allocation behavior. Nowadays those
      tree blocks should be pretty rare.
      
      But in the future, if we want to migrate metadata to folios, we cannot
      have such tree blocks, as filemap_add_folio() requires the page index to
      be aligned to the number of pages in the folio.  Such unaligned tree
      blocks can lead to VM_BUG_ON().
      
      So this patch adds an extra warning for those unaligned tree blocks, as
      preparation for the future folio migration.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
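
      The alignment condition can be checked with plain arithmetic. The sketch
      below is a user-space illustration, assuming a 4K page size and one folio
      per tree block; the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


def is_nodesize_aligned(bytenr, nodesize):
    """True if a tree block starts on a nodesize boundary."""
    return bytenr % nodesize == 0


def folio_index_is_aligned(bytenr, nodesize, page_size=PAGE_SIZE):
    """True if the block's page index is a multiple of the folio's page count."""
    pages_per_folio = max(nodesize // page_size, 1)
    page_index = bytenr // page_size
    return page_index % pages_per_folio == 0
```

      With a 16K nodesize, a block at 69632 (17 pages in) is sector aligned but
      fails both checks, which is the case the new warning is meant to surface.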
    • btrfs: don't arbitrarily slow down delalloc if we're committing · 11aeb97b
      Josef Bacik authored
      We have a random schedule_timeout() if the current transaction is
      committing, which seems to be a holdover from the original delalloc
      reservation code.
      
      Remove this. We have the proper flushing stuff, so we shouldn't be hoping
      for random timing things to make everything work.  This just induces
      latency for no reason.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless comment from btrfs_pin_extent_for_log_replay() · c967c19e
      Filipe Manana authored
      The comment on top of btrfs_pin_extent_for_log_replay() mentioning that
      the function must be called within a transaction is pointless as of
      commit 9fce5704 ("btrfs: Make btrfs_pin_extent_for_log_replay take
      transaction handle"), since the function now takes a transaction handle
      as its first argument. So remove the comment because it's completely
      useless now.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale comment from btrfs_free_extent() · df423ee2
      Filipe Manana authored
      A comment at btrfs_free_extent() mentions that the call to
      btrfs_pin_extent() unlocks the pinned mutex. However, that mutex is long
      gone: it was removed in 2009 by commit 04018de5 ("Btrfs: kill the
      pinned_mutex"). So just delete the comment.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info · 87463f7e
      Christoph Hellwig authored
      Split the code handling a type DUP block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info · 9e0e3e74
      Christoph Hellwig authored
      Split the code handling a type single block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info · 09a46725
      Christoph Hellwig authored
      Split out a helper for the body of the per-zone loop in
      btrfs_load_block_group_zone_info to make the function easier to read and
      modify.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce a zone_info struct in btrfs_load_block_group_zone_info · 15c12fcc
      Christoph Hellwig authored
      Add a new zone_info structure to hold per-zone information in
      btrfs_load_block_group_zone_info and prepare for breaking out helpers
      from it.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless loop from btrfs_update_block_group() · 4d20c1de
      Filipe Manana authored
      When an extent is allocated or freed, we call btrfs_update_block_group()
      to update its block group and space info. An extent always belongs to a
      single block group, it can never span multiple block groups, so the loop
      we have at btrfs_update_block_group() is pointless, as it always has a
      single iteration. The loop was added in the very early days, 2007, when
      the block group code was added in commit 9078a3e1 ("Btrfs: start of
      block group code"), but even back then it seemed pointless.
      
      So remove the loop and assert the block group containing the start offset
      of the extent also contains the whole extent.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
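
      The loop-free shape described above, one lookup plus an assertion that
      the whole extent fits, can be sketched in user space as follows. The
      BlockGroup class and the function signature are hypothetical and only
      model the accounting, not the kernel's locking or space info updates.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    start: int
    length: int
    used: int = 0


def update_block_group(groups, bytenr, num_bytes, alloc):
    """Account one extent against the single block group that contains it."""
    # Find the block group holding the extent's start offset.
    bg = next(g for g in groups if g.start <= bytenr < g.start + g.length)
    # An extent never spans block groups, so the end must fit as well.
    assert bytenr + num_bytes <= bg.start + bg.length, "extent spans block groups"
    bg.used += num_bytes if alloc else -num_bytes
    return bg
```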
    • btrfs: mark transaction id check as unlikely at btrfs_mark_buffer_dirty() · 4ebe8d47
      Filipe Manana authored
      At btrfs_mark_buffer_dirty(), having a transaction id mismatch is never
      expected to happen and it usually means there's a bug or some memory
      corruption due to a bitflip for example. So mark the condition as unlikely
      to optimize code generation as well as to make it obvious for human
      readers that it is a very unexpected condition.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_crit at btrfs_mark_buffer_dirty() · 20cbe460
      Filipe Manana authored
      There's no need to use WARN() at btrfs_mark_buffer_dirty() to print an
      error message. As we have the fs_info pointer, we can use btrfs_crit(),
      which prints device information and gives the message a more uniform
      format. As we are already aborting the transaction, we already have a
      stack trace printed as well. So replace the use of WARN() with
      btrfs_crit().
      
      Also slightly reword the message to use 'logical' instead of 'block' as
      it's what is used in other error/warning messages.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction on generation mismatch when marking eb as dirty · 50564b65
      Filipe Manana authored
      When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
      we check if its generation matches the running transaction and if not we
      just print a warning. Such mismatch is an indicator that something really
      went wrong and only printing a warning message (and stack trace) is not
      enough to prevent a corruption. Allowing a transaction to commit with such
      an extent buffer will trigger an error if we ever try to read it from disk
      due to a generation mismatch with its parent generation.
      
      So abort the current transaction with -EUCLEAN if we notice a generation
      mismatch. For this we need to pass a transaction handle to
      btrfs_mark_buffer_dirty() which is always available except in test code,
      in which case we can pass NULL since it operates on dummy extent buffers
      and all test roots have a single node/leaf (root node at level 0).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
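
      The control flow described above, compare generations and abort with
      -EUCLEAN instead of only warning, can be modelled in a few lines. This is
      a user-space sketch with hypothetical classes. It simply stops after
      aborting; what the kernel does with the buffer afterwards is not stated
      in the commit message and is not modelled.

```python
EUCLEAN = 117  # Linux errno value for "Structure needs cleaning"


class Transaction:
    def __init__(self, transid):
        self.transid = transid
        self.aborted = 0  # 0 means running, otherwise a negative errno

    def abort(self, err):
        # Only the first abort reason is kept.
        if not self.aborted:
            self.aborted = err


class ExtentBuffer:
    def __init__(self, generation):
        self.generation = generation
        self.dirty = False


def mark_buffer_dirty(trans, eb):
    """Mark eb dirty, aborting the transaction on a generation mismatch."""
    # trans may be None for test code operating on dummy extent buffers.
    if trans is not None and eb.generation != trans.transid:
        trans.abort(-EUCLEAN)
        return False
    eb.dirty = True
    return True
```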
    • btrfs: scan but don't register device on single device filesystem · bc27d6f0
      Anand Jain authored
      After commit 5f58d783 ("btrfs: free device in btrfs_close_devices
      for a single device filesystem") we unregister the device from kernel
      memory when unmounting a single-device filesystem.
      
      So any device registration performed before mounting is no longer in
      kernel memory after the unmount.
      
      However, device registration is unnecessary for a single-device btrfs
      filesystem unless it's a seed device.
      
      So commands like 'btrfs device scan' or 'btrfs device ready' on a
      non-seed single-device btrfs filesystem can return success right
      after superblock verification, without the actual device scan.  When
      'device scan --forget' is called on such a device no error is returned.
      
      The seed device must remain in the kernel memory to allow the sprout
      device to mount without the need to specify the seed device explicitly.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename errno identifiers to error · ed164802
      David Sterba authored
      We sync the kernel files to userspace, where the 'errno' symbol is
      defined by the standard library. This does not matter in the kernel, but
      parameters or local variables with that name could clash in userspace.
      Rename them all.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always reserve space for delayed refs when starting transaction · 28270e25
      Filipe Manana authored
      When starting a transaction (or joining an existing one with
      btrfs_start_transaction()), we reserve space for the number of items we
      want to insert in a btree, but we don't do it for the delayed refs we
      will generate while using the transaction to modify (COW) extent buffers
      in a btree or allocate new extent buffers. Basically how it works:
      
      1) When we start a transaction we reserve space for the number of items
         the caller wants to be inserted/modified/deleted in a btree. This space
         goes to the transaction block reserve;
      
      2) If the delayed refs block reserve is not full, its size is greater
         than the amount of its reserved space, and the flush method is
         BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
         it corresponding to the number of items the caller wants to
         insert/modify/delete in a btree;
      
      3) The size of the delayed refs block reserve is increased when a task
         creates delayed refs after COWing an extent buffer, allocating a new
         one or deleting (freeing) an extent buffer. This happens after the
         task started or joined a transaction, whenever it calls
         btrfs_update_delayed_refs_rsv();
      
      4) The delayed refs block reserve is then refilled by anyone calling
         btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
         operations or when someone else calls btrfs_start_transaction() with
         a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;
      
      5) As a task COWs or allocates extent buffers, it consumes space from the
         transaction block reserve. When the task releases its transaction
         handle (btrfs_end_transaction()) or it attempts to commit the
         transaction, it releases any remaining space in the transaction block
         reserve that it did not use, as not all space may have been used (due
         to pessimistic space calculation) by calling btrfs_block_rsv_release()
         which will try to add that unused space to the delayed refs block
         reserve (if its current size is greater than its reserved space).
         That transferred space may not be enough to completely fulfill the
         delayed refs block reserve.
      
         Plus we have some tasks that will attempt to modify as many leaves
         as they can before getting -ENOSPC (and then reserving more space and
         retrying), such as hole punching and extent cloning which call
         btrfs_replace_file_extents(). Such tasks can therefore generate a
         high number of delayed refs, for both metadata and data (we can't
         know in advance how many file extent items we will find in a range
         and therefore how many delayed refs for dropping references on data
         extents we will generate);
      
      6) If a transaction starts its commit before the delayed refs block
         reserve is refilled, for example by the transaction kthread or by
         someone who called btrfs_join_transaction() before starting the
         commit, then when running delayed references if we don't have enough
         reserved space in the delayed refs block reserve, we will consume
         space from the global block reserve.
      
      Now this doesn't make a lot of sense because:
      
      1) We should reserve space for delayed references when starting the
         transaction, since we have no guarantees the delayed refs block
         reserve will be refilled;
      
      2) If no refill happens then we will consume from the global block reserve
         when running delayed refs during the transaction commit;
      
      3) If we have a bunch of tasks calling btrfs_start_transaction() with a
         number of items greater than zero and at the time the delayed refs
         reserve is full, then we don't reserve any space at
         btrfs_start_transaction() for the delayed refs that will be generated
         by a task, and we can therefore end up using a lot of space from the
         global reserve when running the delayed refs during a transaction
         commit;
      
      4) There are also other operations that result in bumping the size of the
         delayed refs reserve, such as creating and deleting block groups, as
         well as the need to update a block group item because we allocated or
         freed an extent from the respective block group;
      
      5) If we have a significant gap between the delayed refs reserve's size
         and its reserved space, two very bad things may happen:
      
         1) The reserved space of the global reserve may not be enough and we
            fail the transaction commit with -ENOSPC when running delayed refs;
      
         2) If the available space in the global reserve is enough it may result
            in nearly exhausting it. If the fs has no more unallocated device
            space for allocating a new block group and all the available space
            in existing metadata block groups is not far from the global
            reserve's size before we started the transaction commit, we may end
            up in a situation where after the transaction commit we have too
            little available metadata space, and any future transaction commit
            will fail with -ENOSPC, because although we were able to reserve
            space to start the transaction, we were not able to commit it, as
            running delayed refs generates some more delayed refs (to update the
            extent tree for example) - this includes not even being able to
            commit a transaction that was started with the goal of unlinking a
            file, removing an empty data block group or doing reclaim/balance,
            so there's no way to release metadata space.
      
            In the worst case the next time we mount the filesystem we may
            also fail with -ENOSPC due to failure to commit a transaction to
            cleanup orphan inodes. This latter case was reported by someone
            running a SLE (SUSE Linux Enterprise) distribution, where the
            fs had no more unallocated space that could be
            used to allocate a new metadata block group, and the available
            metadata space was about 1.5M, not enough to commit a transaction
            to cleanup an orphan inode (or do relocation of data block groups
            that were far from being full).
      
      So improve on this situation by always reserving space for delayed refs
      when calling start_transaction(), and if the flush method is
      BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
      reserve if it's not full. The space reserved for the delayed refs is added
      to a local block reserve that is part of the transaction handle, and when
      a task updates the delayed refs block reserve size, after creating a
      delayed ref, the space is transferred from that local reserve to the
      global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
      local reserve does not have enough space, which may happen for tasks
      that generate a variable and potentially large number of delayed refs
      (such as the hole punching and extent cloning cases mentioned before),
      we transfer any available space and then rely on the current behaviour
      of hoping some other task refills the delayed refs reserve, or fall back
      to the global block reserve.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
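
      The new flow, reserve for delayed refs up front in a local reserve and
      move that space into the delayed refs reserve as refs are created, can be
      modelled with simple counters. The sketch below is a user-space
      simplification: space is counted in abstract units, flushing and refill
      are left out, and all names are hypothetical.

```python
from dataclasses import dataclass

UNIT = 1  # abstract "space for one btree modification"


@dataclass
class BlockRsv:
    size: int = 0      # how much the reserve wants to hold
    reserved: int = 0  # how much it actually holds


@dataclass
class TransHandle:
    trans_rsv: int = 0          # space for the items the caller asked for
    local_delayed_rsv: int = 0  # space set aside for delayed refs


def start_transaction(num_items):
    # Reserve for the items and, additionally, for the delayed refs
    # those modifications are expected to generate.
    return TransHandle(trans_rsv=num_items * UNIT,
                       local_delayed_rsv=num_items * UNIT)


def update_delayed_refs_rsv(delayed_refs_rsv, trans, nr_refs):
    """Grow the delayed refs reserve and feed it from the local reserve."""
    needed = nr_refs * UNIT
    delayed_refs_rsv.size += needed
    moved = min(needed, trans.local_delayed_rsv)
    trans.local_delayed_rsv -= moved
    delayed_refs_rsv.reserved += moved
    # Any shortfall is left to a later refill or to the global reserve.
    return needed - moved
```

      A task that started with two items but created three delayed refs leaves
      a shortfall of one unit, which is the hole punching and extent cloning
      situation mentioned in the message.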
    • btrfs: stop doing excessive space reservation for csum deletion · adb86dbe
      Filipe Manana authored
      Currently when reserving space for deleting the csum items for a data
      extent, when adding or updating a delayed ref head, we determine how
      many leaves of csum items we can have and then pass that number to the
      helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
      space for all tree modifications we need when running delayed references,
      however the amount of space it computes is excessive for deleting csum
      items because:
      
      1) It uses btrfs_calc_insert_metadata_size() which is excessive because
         we only need to delete csum items from the csum tree, we don't need
         to insert any items, so btrfs_calc_metadata_size() is all we need (as
         it computes space needed to delete an item);
      
      2) If the free space tree is enabled, it doubles the amount of space,
         which is pointless for csum deletion since we don't need to touch the
         free space tree or any other tree other than the csum tree.
      
      So improve on this by tracking how many csum deletions we have and using
      a new helper to calculate space for csum deletions (just a wrapper around
      btrfs_calc_metadata_size() with a comment). This reduces the amount of
      space we need to reserve for csum deletions by a factor of 4, and it helps
      reduce the number of times we have to block space reservations and have
      the reclaim task enter the space flushing algorithm (flush delayed items,
      flush delayed refs, etc) in order to satisfy tickets.
      
      For example this results in a total time decrease when unlinking (or
      truncating) files with many extents, as we end up having to block on
      metadata space reservations less often. Example test:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/test
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $DEV
        # Use compression to quickly create files with a lot of extents
        # (each with a size of 128K).
        mount -o compress=lzo $DEV $MNT
      
        # 120G gives at least 983040 extents with a size of 128K.
        xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
      
        # Flush all delalloc and clear all metadata from memory.
        umount $MNT
        mount -o compress=lzo $DEV $MNT
      
        start=$(date +%s%N)
        rm -f $MNT/foobar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
        echo "rm took $dur milliseconds"
      
        umount $MNT
      
      Before this change rm took: 7504 milliseconds
      After this change rm took:  6574 milliseconds  (-12.4%)
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
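
      The "factor of 4" follows from the two reasons listed in the message:
      the insert calculation is twice the plain one, and the free space tree
      doubles it again. The sketch below reproduces that arithmetic in user
      space. The formulas (nodesize times maximum tree level times item count,
      with a maximum level of 8) are an assumption about how the helpers
      compute their results, not code taken from the kernel. The last helper
      checks the extent count used by the test script.

```python
BTRFS_MAX_LEVEL = 8   # assumed maximum btree height
NODESIZE = 16 * 1024  # default nodesize


def calc_metadata_size(num_items):
    # Assumed cost of modifying or deleting items: one full tree path each.
    return NODESIZE * BTRFS_MAX_LEVEL * num_items


def calc_insert_metadata_size(num_items):
    # Assumed cost of inserting: twice the plain cost, to allow for splits.
    return 2 * calc_metadata_size(num_items)


def calc_delayed_ref_bytes(num_items, free_space_tree=True):
    # Old reservation: insert cost, doubled if the free space tree is enabled.
    num_bytes = calc_insert_metadata_size(num_items)
    return 2 * num_bytes if free_space_tree else num_bytes


def calc_csum_deletion_bytes(num_csum_leaves):
    # New reservation: only deletions in the csum tree.
    return calc_metadata_size(num_csum_leaves)


def extent_count(total_bytes, extent_bytes):
    return total_bytes // extent_bytes
```

      Under these assumptions the ratio is 4 with the free space tree enabled
      and 2 without it.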
    • btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release() · b6ea3e6a
      Filipe Manana authored
      There's no point in initializing to 0 the local variable 'released' as
      we don't use it before the next assignment to it. So remove the
      initialization. This may help avoid some warnings with clang tools such
      as the one reported/fixed by commit 966de47f ("btrfs: remove redundant
      initialization of variables in log_new_ancestors").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reserve space for delayed refs on a per ref basis · 3ee56a58
      Filipe Manana authored
      Currently when reserving space for delayed refs we do it on a per ref head
      basis. This is generally enough because most back refs for an extent end
      up being inlined in the extent item - with the default leaf size of 16K we
      can have at most 33 inline back refs (this is calculated by the macro
      BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref
      head is given by btrfs_calc_delayed_ref_bytes(), which basically
      corresponds to a single path for insertion into the extent tree plus
      another path for insertion into the free space tree if it's enabled.
      
      However if we have reached the limit of inline refs or we have a mix of
      inline and non-inline refs, then we will need to insert a non-inline ref
      and update the existing extent item to update the total number of
      references for the extent. This implies we need reserved space for two
      insertion paths in the extent tree, but we only reserved for one path.
      The extent item and the non-inline ref item may be located in different
      leaves, or even if they are located in the same leaf, after updating the
      extent item and before inserting the non-inline ref item, the extent
      buffers in the btree path may have been written (due to memory pressure,
      for example), in which case we need to COW the entire path again. In this
      case since we have not reserved enough space for the delayed refs block
      reserve, we will use the global block reserve.
      
      If we are in a situation where the fs has no more unallocated space enough
      to allocate a new metadata block group and available space in the existing
      metadata block groups is close to the maximum size of the global block
      reserve (512M), we may end up consuming too much of the free metadata
      space to the point where we can't commit any future transaction because it
      will fail, with -ENOSPC, during its commit when trying to allocate an
      extent for some COW operation (running delayed refs generated by running
      delayed refs or COWing the root tree's root node at commit_cowonly_roots()
      for example). Such a dramatic scenario can happen if we have many delayed
      refs that require the insertion of non-inline ref items, due to too many
      reflinks or snapshots. We also have situations where we use the global
      block reserve because we could not know in advance that we would need
      space to update some trees (block group creation for example). This all
      adds up to increase the chances of exhausting the global block reserve,
      making any future transaction commit fail with -ENOSPC and turning the
      fs into RO mode, or failing the mount operation in case the mount needs
      to start and commit a transaction, such as when we have orphans to
      clean up. Such a case was reported by someone running a SLE (SUSE Linux
      Enterprise) distribution, where the fs had no more unallocated space
      that could be used to allocate a new metadata block group, and the
      available metadata space was about 1.5M, not enough to commit a
      transaction to clean up an orphan inode (or do relocation of data block
      groups that were far from being full).
      
      So reserve space for delayed refs by individual refs and not by ref heads,
      as we may need to COW multiple extent tree paths due to non-inline ref
      items.
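      
      As a rough illustration of the difference between the two accounting
      schemes, here is a minimal userspace sketch. It is not the kernel code:
      struct head_model, reserve_by_heads(), reserve_by_refs() and the byte
      value used are hypothetical, and the model assumes every individual ref
      may need its own insertion path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one ref head with some number of individual refs. */
struct head_model {
	unsigned int nr_refs;
};

/* Old scheme in this model: one reservation unit per ref head. */
static uint64_t reserve_by_heads(size_t nr_heads, uint64_t bytes_per_unit)
{
	return (uint64_t)nr_heads * bytes_per_unit;
}

/* New scheme in this model: one reservation unit per individual ref. */
static uint64_t reserve_by_refs(const struct head_model *heads,
				size_t nr_heads, uint64_t bytes_per_unit)
{
	uint64_t total = 0;

	assert(heads != NULL || nr_heads == 0);
	for (size_t i = 0; i < nr_heads; i++)
		total += (uint64_t)heads[i].nr_refs * bytes_per_unit;
	return total;
}
```

      With three heads holding 1, 3 and 2 refs, the per-head scheme reserves
      three units while the per-ref scheme reserves six, which is the extra
      space that non-inline ref insertions may end up needing.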
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ee56a58
    • btrfs: allow to run delayed refs by bytes to be released instead of count · 8a526c44
      Filipe Manana authored
      When running delayed references, through btrfs_run_delayed_refs(), we can
      specify how many to run, run all existing delayed references, or keep
      running delayed references while we can find any. This is controlled with
      the value of the 'count' argument, where a value of 0 means to run all
      delayed references that exist by the time btrfs_run_delayed_refs() is
      called, (unsigned long)-1 means to keep running delayed references while
      we are able to find any, and any other value means to run that exact
      number of delayed references.
      
      Typically a specific value other than 0 or -1 is used when flushing space
      to try to release a certain amount of bytes for a ticket. In this case
      we simply calculate how many delayed reference heads correspond to a
      specific amount of bytes, with calc_delayed_refs_nr(). However that only
      takes into account the space reserved for the reference heads themselves,
      and does not account for the space reserved for deleting checksums from
      the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
      in case we are going to delete a data extent. This means we may end up
      running more delayed references than necessary in case we process delayed
      references for deleting a data extent.
      
      So change the logic of btrfs_run_delayed_refs() to take a bytes argument
      to specify how many bytes of delayed references to run/release, using the
      special values of 0 to mean all existing delayed references and U64_MAX
      (or (u64)-1) to keep running delayed references while we can find any.
      
      This prevents running more delayed references than necessary, when we have
      delayed references for deleting data extents, but also makes the upcoming
      changes/patches simpler and it's preparatory work for them.
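      
      The semantics of the bytes argument can be modelled in userspace as
      follows. This is only a sketch under stated assumptions, not the kernel
      implementation: struct ref_model, queue_ref(), run_one() and
      run_refs_by_bytes() are hypothetical names, UINT64_MAX stands in for the
      kernel's U64_MAX, and 'spawn_budget' is an invented knob that lets running
      a ref queue a follow-up ref, so that the "all existing" and "keep running
      while we can find any" cases behave differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_REFS 16

/* Hypothetical queue of delayed refs, each with some reserved bytes. */
struct ref_model {
	uint64_t bytes[MODEL_MAX_REFS];
	size_t head;            /* next ref to run */
	size_t tail;            /* one past the last queued ref */
	unsigned int spawn_budget; /* follow-up refs that may still be queued */
};

static void queue_ref(struct ref_model *q, uint64_t bytes)
{
	assert(q->tail < MODEL_MAX_REFS);
	q->bytes[q->tail++] = bytes;
}

/* Run one ref and return the bytes it released; may queue a follow-up. */
static uint64_t run_one(struct ref_model *q)
{
	uint64_t released = q->bytes[q->head++];

	if (q->spawn_budget > 0) {
		q->spawn_budget--;
		queue_ref(q, released);
	}
	return released;
}

/*
 * min_bytes == 0:          run only the refs queued when we were called.
 * min_bytes == UINT64_MAX: keep running while any ref can be found.
 * anything else:           stop once at least min_bytes were released.
 */
static uint64_t run_refs_by_bytes(struct ref_model *q, uint64_t min_bytes)
{
	size_t stop = q->tail; /* snapshot used for the min_bytes == 0 case */
	uint64_t released = 0;

	while (q->head < q->tail) {
		if (min_bytes == 0 && q->head >= stop)
			break;
		if (min_bytes != 0 && min_bytes != UINT64_MAX &&
		    released >= min_bytes)
			break;
		released += run_one(q);
	}
	return released;
}
```

      Because the loop stops on released bytes rather than on a count of heads,
      a ref that carries a larger reservation (such as one that also covers
      checksum deletion) satisfies the request sooner in this model.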
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a526c44
    • btrfs: simplify check for extent item overrun at lookup_inline_extent_backref() · da8848ac
      Filipe Manana authored
      At lookup_inline_extent_backref() we can simplify the check for an overrun
      of the extent item by making the while loop's condition "ptr < end" and
      then checking after the loop whether an overrun happened ("ptr > end").
      This reduces indentation and makes the loop condition clearer. So move
      the check out of the loop and change the loop condition accordingly,
      while also adding the 'unlikely' tag to the check since it's not supposed
      to be triggered.
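      
      The loop shape described here can be sketched in userspace as below. The
      record format (one header byte, type in the high nibble, size minus one
      in the low nibble), the enum and find_record() are hypothetical and only
      serve to show the pattern; offsets are used instead of pointers so that
      stepping past 'end' stays well defined, and the unlikely() macro is a
      local stand-in for the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

enum lookup_result {
	LOOKUP_FOUND,
	LOOKUP_NOT_FOUND,
	LOOKUP_OVERRUN,
};

/* Walk variable-sized records in buf[0..len) looking for want_type. */
static enum lookup_result find_record(const uint8_t *buf, size_t len,
				      uint8_t want_type)
{
	size_t ptr = 0;
	size_t end = len;

	assert(buf != NULL || len == 0);

	/* The loop condition only says "there is still data to look at". */
	while (ptr < end) {
		uint8_t type = buf[ptr] >> 4;
		size_t size = (size_t)(buf[ptr] & 0x0f) + 1;

		if (type == want_type && ptr + size <= end)
			return LOOKUP_FOUND;
		ptr += size;
	}

	/* The overrun check runs once, after the loop, instead of per step. */
	if (unlikely(ptr > end))
		return LOOKUP_OVERRUN;

	return LOOKUP_NOT_FOUND;
}
```

      A record whose claimed size runs past the end of the buffer pushes 'ptr'
      beyond 'end', so the single check after the loop catches it.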
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      da8848ac
    • btrfs: return -EUCLEAN if extent item is missing when searching inline backref · eba444f1
      Filipe Manana authored
      At lookup_inline_extent_backref() when trying to insert an inline backref,
      if we don't find the extent item we log an error and then return -EIO.
      This error code is confusing because there was actually no IO error; a
      missing extent item means we have some corruption, caused either by a bug
      or by something like a memory bitflip. So change the error code from -EIO
      to -EUCLEAN.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eba444f1
    • btrfs: use a single variable for return value at lookup_inline_extent_backref() · cc925b96
      Filipe Manana authored
      At lookup_inline_extent_backref(), instead of using a 'ret' and an 'err'
      variable for tracking the return value, use a single one ('ret'). This
      simplifies the code, makes it consistent with most of the existing code
      and makes it less prone to logic errors, as time has proven over and over
      in the btrfs code.
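      
      The single-variable pattern looks roughly like the sketch below. The
      helpers step_search() and step_update() and the function do_lookup() are
      hypothetical stand-ins, not functions from the btrfs code; the point is
      only that every step stores its result in the same 'ret' and the function
      has one exit.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical steps: each returns 0 on success or a negative errno. */
static int step_search(int fail)
{
	return fail ? -ENOENT : 0;
}

static int step_update(int fail)
{
	return fail ? -ENOSPC : 0;
}

static int do_lookup(int fail_search, int fail_update)
{
	int ret;

	ret = step_search(fail_search);
	if (ret < 0)
		goto out;

	ret = step_update(fail_update);
out:
	/* One variable, one exit: the caller sees the last stored result. */
	assert(ret <= 0);
	return ret;
}
```

      With a separate 'err' variable it is easy to return one while having
      stored the failure in the other, which is the kind of logic error the
      single-variable form avoids.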
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc925b96
    • btrfs: use a single variable for return value at run_delayed_extent_op() · 20fb05a6
      Filipe Manana authored
      Instead of using a 'ret' and an 'err' variable at run_delayed_extent_op()
      for tracking the return value, use a single one ('ret'). This simplifies
      the code, makes it consistent with most of the existing code and makes it
      less prone to logic errors, as time has proven over and over in the btrfs
      code.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      20fb05a6
    • btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref() · e721043a
      Filipe Manana authored
      The 'ref_root' variable, at run_delayed_data_ref(), is not really needed
      as we can always use ref->root directly, plus its initialization to 0 is
      completely pointless as we assign it ref->root before its first use.
      So just drop that variable and use ref->root directly.
      
      This may help avoid some warnings with clang tools such as the one
      reported/fixed by commit 966de47f ("btrfs: remove redundant
      initialization of variables in log_new_ancestors").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e721043a
    • btrfs: initialize key where it's used when running delayed data ref · 7cce0d69
      Filipe Manana authored
      At run_delayed_data_ref() we are always initializing a key but the key
      is only needed and used if we are inserting a new extent. So move the
      declaration and initialization of the key to the 'if' branch where it's
      used. Also rename the key from 'ins' to 'key', as it's a clearer name.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7cce0d69
    • btrfs: remove refs_to_drop argument from __btrfs_free_extent() · 1df6b3c0
      Filipe Manana authored
      Currently the 'refs_to_drop' argument of __btrfs_free_extent() always
      matches the value of node->ref_mod, so remove the argument and use
      node->ref_mod at __btrfs_free_extent().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1df6b3c0
    • btrfs: remove refs_to_add argument from __btrfs_inc_extent_ref() · 88b2d088
      Filipe Manana authored
      Currently the 'refs_to_add' argument of __btrfs_inc_extent_ref() always
      matches the value of node->ref_mod, so remove the argument and use
      node->ref_mod at __btrfs_inc_extent_ref().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      88b2d088
    • btrfs: remove the refcount warning/check at btrfs_put_delayed_ref() · abff279e
      Filipe Manana authored
      At btrfs_put_delayed_ref(), it's pointless to have a WARN_ON() to check if
      the refcount of the delayed ref is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when we used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      
      This also slightly reduces the module's text size.
      Before this change:
      
         $ size fs/btrfs/btrfs.ko
            text	   data	    bss	    dec	    hex	filename
         1612483	 167145	  16864	1796492	 1b698c	fs/btrfs/btrfs.ko
      
      After this change:
      
         $ size fs/btrfs/btrfs.ko
            text	   data	    bss	    dec	    hex	filename
         1612371	 167073	  16864	1796308	 1b68d4	fs/btrfs/btrfs.ko
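      
      The redundancy of the removed check can be illustrated with a small
      userspace model. This is not the kernel's refcount_t implementation:
      struct model_refcount, model_dec_and_test(), struct model_ref and
      model_put_ref() are hypothetical, and the model only mimics the behaviour
      described above, where the decrement helper itself complains when the
      counter is already 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter that reports misuse on its own. */
struct model_refcount {
	unsigned int refs;
	unsigned int warnings;
};

/* Returns true when the count drops to 0; warns if it was already 0. */
static bool model_dec_and_test(struct model_refcount *r)
{
	assert(r != NULL);
	if (r->refs == 0) {
		r->warnings++;
		fprintf(stderr, "model_refcount: decrement on zero\n");
		return false;
	}
	return --r->refs == 0;
}

struct model_ref {
	struct model_refcount refs;
	bool freed;
};

static void model_put_ref(struct model_ref *ref)
{
	/* No extra "is it already zero?" check: the helper reports that. */
	if (model_dec_and_test(&ref->refs))
		ref->freed = true;
}
```

      A second put on an object whose count already reached 0 is reported by
      the helper, so a caller-side warning would only duplicate it.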
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      abff279e
    • btrfs: remove unnecessary logic when running new delayed references · 3cbb9f51
      Filipe Manana authored
      When running delayed references, at btrfs_run_delayed_refs(), we have
      logic to run any new delayed references that might have been added just
      after we ran all delayed references. This logic grabs the first delayed
      reference, then locks it to wait for any contention on it before running
      all new delayed references. This, however, is unnecessary because at
      __btrfs_run_delayed_refs(), when we start running delayed references, we
      pick the first reference with btrfs_obtain_ref_head() and then lock it
      (with btrfs_delayed_ref_lock()).
      
      So remove the duplicate and unnecessary logic at btrfs_run_delayed_refs().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3cbb9f51