1. 19 Jun, 2023 40 commits
    • Christoph Hellwig's avatar
      btrfs: open code need_full_stripe conditions · 8680e587
      Christoph Hellwig authored
      need_full_stripe is just a somewhat complicated way to say
      "op != BTRFS_MAP_READ".  Just spell that explicit check out, which makes
      a lot of the code currently using the helper easier to understand.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8680e587
    • Christoph Hellwig's avatar
      btrfs: open code btrfs_map_sblock · 723b8bb1
      Christoph Hellwig authored
      btrfs_map_sblock just hard codes three arguments and calls
      btrfs_map_sblock.  Remove it as it doesn't provide any real value, but
      makes following the btrfs_map_block call chains harder.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      723b8bb1
    • Christoph Hellwig's avatar
      btrfs: rename __btrfs_map_block to btrfs_map_block · cd4efd21
      Christoph Hellwig authored
      Now that the old btrfs_map_block is gone, drop the leading underscores
      from __btrfs_map_block.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd4efd21
    • Christoph Hellwig's avatar
      btrfs: remove unused btrfs_map_block · d69d7ffc
      Christoph Hellwig authored
      There are no users of btrfs_map_block left, so remove it.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d69d7ffc
    • Christoph Hellwig's avatar
      btrfs: optimize simple reads in btrfsic_map_block · 78a213a0
      Christoph Hellwig authored
      Pass a smap into __btrfs_map_block so that the usual case of a read that
      doesn't require parity raid recovery doesn't need an extra memory
      allocation for the btrfs_io_context.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78a213a0
    • Christoph Hellwig's avatar
      btrfs: remove unused BTRFS_MAP_DISCARD · 3965a4c7
      Christoph Hellwig authored
      BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
      btrfs_op() only only checked in two ASSERTS.
      
      Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
      REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06
      ("btrfs: split discard handling out of btrfs_map_block").
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3965a4c7
    • David Sterba's avatar
      btrfs: add xxhash to fast checksum implementations · efcfcbc6
      David Sterba authored
      The implementation of XXHASH is now CPU only but still fast enough to be
      considered for the synchronous checksumming, like non-generic crc32c.
      
      A userspace benchmark comparing it to various implementations (patched
      hash-speedtest from btrfs-progs):
      
        Block size:     4096
        Iterations:     1000000
        Implementation: builtin
        Units:          CPU cycles
      
      	NULL-NOP: cycles:     73384294, cycles/i       73
           NULL-MEMCPY: cycles:    228033868, cycles/i      228,    61664.320 MiB/s
            CRC32C-ref: cycles:  24758559416, cycles/i    24758,      567.950 MiB/s
             CRC32C-NI: cycles:   1194350470, cycles/i     1194,    11773.433 MiB/s
        CRC32C-ADLERSW: cycles:   6150186216, cycles/i     6150,     2286.372 MiB/s
        CRC32C-ADLERHW: cycles:    626979180, cycles/i      626,    22427.453 MiB/s
            CRC32C-PCL: cycles:    466746732, cycles/i      466,    30126.699 MiB/s
      	  XXHASH: cycles:    860656400, cycles/i      860,    16338.188 MiB/s
      
      Comparing purely software implementation (ref), current outdated
      accelerated using crc32q instruction (NI), optimized implementations by
      M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775)
      and the best one that was taken from kernel using the PCLMULQDQ
      instruction (PCL).
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      efcfcbc6
    • Christoph Hellwig's avatar
      btrfs: pass the new logical address to split_extent_map · f000bc6f
      Christoph Hellwig authored
      split_extent_map splits off the first chunk of an extent map into a new
      one.  One of the two users is the zoned I/O completion code that wants to
      rewrite the logical block start address right after this split.  Pass in
      the logical address to be set in the split off first extent_map as an
      argument to avoid an extra extent tree lookup for this case.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f000bc6f
    • Christoph Hellwig's avatar
      btrfs: defer splitting of ordered extents until I/O completion · 71df088c
      Christoph Hellwig authored
      The btrfs zoned completion code currently needs an ordered_extent and
      extent_map per bio so that it can account for the non-predictable
      write location from Zone Append.  To archive that it currently splits
      the ordered_extent and extent_map at I/O submission time, and then
      records the actual physical address in the ->physical field of the
      ordered_extent.
      
      This patch instead switches to record the "original" physical address
      that the btrfs allocator assigned in spare space in the btrfs_bio,
      and then rewrites the logical address in the btrfs_ordered_sum
      structure at I/O completion time.  This allows the ordered extent
      completion handler to simply walk the list of ordered csums and
      split the ordered extent as needed.  This removes an extra ordered
      extent and extent_map lookup and manipulation during the I/O
      submission path, and instead batches it in the I/O completion path
      where we need to touch these anyway.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71df088c
    • Christoph Hellwig's avatar
      btrfs: handle completed ordered extents in btrfs_split_ordered_extent · 52b1fdca
      Christoph Hellwig authored
      To delay splitting ordered_extents to I/O completion time we need to be
      able to handle fully completed ordered extents in
      btrfs_split_ordered_extent.  Besides a bit of accounting this primarily
      involved moving over the csums to the split bio for the range that it
      covers, which is simple enough because we always have one
      btrfs_ordered_sum per bio.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52b1fdca
    • Christoph Hellwig's avatar
      btrfs: atomically insert the new extent in btrfs_split_ordered_extent · 816f589b
      Christoph Hellwig authored
      Currently there is a small race window in btrfs_split_ordered_extent,
      where the reduced old extent can be looked up on the per-inode rbtree
      or the per-root list while the newly split out one isn't visible yet.
      
      Fix this by open coding btrfs_alloc_ordered_extent in
      btrfs_split_ordered_extent, and holding the tree lock and
      root->ordered_extent_lock over the entire tree and extent manipulation.
      
      Note that this introduces new lock ordering because previously
      ordered_extent_lock was never held over the tree lock.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      816f589b
    • Christoph Hellwig's avatar
      btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers · 53d9981c
      Christoph Hellwig authored
      Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate
      and insert the logic extent.  The pure alloc helper will be used to
      improve btrfs_split_ordered_extent.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53d9981c
    • Christoph Hellwig's avatar
      btrfs: return the new ordered_extent from btrfs_split_ordered_extent · b0307e28
      Christoph Hellwig authored
      Return the ordered_extent split from the passed in one.  This will be
      needed to be able to store an ordered_extent in the btrfs_bio.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0307e28
    • Christoph Hellwig's avatar
      btrfs: reorder conditions in btrfs_extract_ordered_extent · ebdb44a0
      Christoph Hellwig authored
      There is no good reason for doing one before the other in terms of
      failure implications, but doing the extent_map split first will
      simplify some upcoming refactoring.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ebdb44a0
    • Christoph Hellwig's avatar
      btrfs: move split_extent_map to extent_map.c · a6f3e205
      Christoph Hellwig authored
      split_extent_map doesn't have anything to do with the other code in
      inode.c, so move it to extent_map.c.
      
      This also allows marking replace_extent_mapping static.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6f3e205
    • Christoph Hellwig's avatar
      btrfs: record orig_physical only for the original bio · 3887653c
      Christoph Hellwig authored
      btrfs_submit_dev_bio is also called for clone bios that aren't embedded
      into a btrfs_bio structure, but previous commit "btrfs: optimize the
      logical to physical mapping for zoned writes" added code to assign
      btrfs_bio.orig_physical in it.
      
      This is harmless right now as only the single data profile can be used
      on zoned devices, but will blow up when the RAID stripe tree is added.
      Move it out into the single I/O specific branch in the caller.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3887653c
    • Christoph Hellwig's avatar
      btrfs: optimize the logical to physical mapping for zoned writes · cbfce4c7
      Christoph Hellwig authored
      The current code to store the final logical to physical mapping for a
      zone append write in the extent tree is rather inefficient.  It first has
      to split the ordered extent so that there is one ordered extent per bio,
      so that it can look up the ordered extent on I/O completion in
      btrfs_record_physical_zoned and store the physical LBA returned by the
      block driver in the ordered extent.
      
      btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
      see what physical address the logical address for this bio / ordered
      extent is mapped to, and then rewrite it in the extent tree.
      
      To optimize this process, we can store the physical address assigned in
      the chunk tree to the original logical address and a pointer to
      btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
      this information to rewrite the logical address in the btrfs_ordered_sum
      structure directly at I/O completion time in btrfs_record_physical_zoned.
      btrfs_rewrite_logical_zoned then simply updates the logical address in
      the extent tree and the ordered_extent itself.
      
      The code in btrfs_rewrite_logical_zoned now runs for all data I/O
      completions in zoned file systems, which is fine as there is no remapping
      to do for non-append writes to conventional zones or for relocation, and
      the overhead for quickly breaking out of the loop is very low.
      
      Because zoned file systems now need the ordered_sums structure to
      record the actual write location returned by zone append, allocate dummy
      structures without the csum array for them when the I/O doesn't use
      checksums, and free them when completing the ordered_extent.
      
      Note that the btrfs_bio doesn't grow as the new field are places into
      a union that is so far not used for data writes and has plenty of space
      left in it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbfce4c7
    • Christoph Hellwig's avatar
      btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical · 5cfe76f8
      Christoph Hellwig authored
      btrfs_ordered_sum::bytendr stores a logical address.  Make that clear by
      renaming it to ->logical.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5cfe76f8
    • Christoph Hellwig's avatar
      btrfs: mark the len field in struct btrfs_ordered_sum as unsigned · 6e4b2479
      Christoph Hellwig authored
      len can't ever be negative, so mark it as an u32 instead of int.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e4b2479
    • Christoph Hellwig's avatar
      btrfs: don't call btrfs_record_physical_zoned for failed append · e9cb93b9
      Christoph Hellwig authored
      When a zoned append command fails there is no written address reported,
      so don't try to record it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9cb93b9
    • Christoph Hellwig's avatar
      btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED · dd8b7b04
      Christoph Hellwig authored
      Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the
      run-time check for the zone size.  This will allow to make use of
      compiler dead code elimination for code guarded by btrfs_is_zoned, and
      for example provide just a dangling prototype for a function instead of
      adding a stub.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd8b7b04
    • Filipe Manana's avatar
      btrfs: make btrfs_destroy_delayed_refs() return void · 99f09ce3
      Filipe Manana authored
      btrfs_destroy_delayed_refs() always returns 0 and its single caller does
      not check its return value, as it also returns void, and so does the
      callers' caller and so on. This is because we are in the transaction abort
      path, where we have no way to deal with errors (we are in a critical
      situation) and all cleanup of resources works in a best effort fashion.
      So make btrfs_destroy_delayed_refs() return void.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99f09ce3
    • Filipe Manana's avatar
      btrfs: remove unnecessary prototype declarations at disk-io.c · 184533e3
      Filipe Manana authored
      We have a few static functions at disk-io.c for which we have a forward
      declaration of their prototype, but it's not needed because all those
      functions are defined before they are called, so remove them.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      184533e3
    • Filipe Manana's avatar
      btrfs: use a single switch statement when initializing delayed ref head · f1ed785a
      Filipe Manana authored
      At init_delayed_ref_head(), we are using two separate if statements to
      check the delayed ref head action, and initializing 'must_insert_reserved'
      to false twice, once when the variable is declared and once again in an
      else branch.
      
      Make this simpler and more straightforward by having a single switch
      statement, also moving the comment about a drop action to the
      corresponding switch case to make it more clear and eliminating the
      duplicated initialization of 'must_insert_reserved' to false.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f1ed785a
    • Filipe Manana's avatar
      btrfs: use bool type for delayed ref head fields that are used as booleans · 61c681fe
      Filipe Manana authored
      There's no point in have several fields defined as 1 bit unsigned int in
      struct btrfs_delayed_ref_head, we can instead use a bool type, it makes
      the code a bit more readable and it doesn't change the structure size.
      So switch them to proper booleans.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61c681fe
    • Filipe Manana's avatar
      btrfs: assert correct lock is held at btrfs_select_ref_head() · 1e6b71c3
      Filipe Manana authored
      The function btrfs_select_ref_head() iterates over the red black tree of
      delayed reference heads, which is protected by the spinlock in the delayed
      refs root. The function doesn't take the lock, it's taken by its single
      caller, btrfs_obtain_ref_head(), because it needs to call that function
      and btrfs_delayed_ref_lock() in the same critical section (delimited by
      that spinlock). So assert at btrfs_select_ref_head() that we are holding
      the expected lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e6b71c3
    • Filipe Manana's avatar
      btrfs: get rid of label and goto at insert_delayed_ref() · 798f4d95
      Filipe Manana authored
      At insert_delayed_ref() there's no point of having a label and goto in the
      case we were able to insert the delayed ref head. We can just add the code
      under label to the if statement's body and return immediately, and also
      there is no need to track the return value in a variable, we can just
      return a literal true or false value directly. So do those changes.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      798f4d95
    • Filipe Manana's avatar
      btrfs: make insert_delayed_ref() return a bool instead of an int · f38462c4
      Filipe Manana authored
      insert_delayed_ref() can only return 0 or 1, to indicate if the given
      delayed reference was added to the head reference or if it was merged
      into an existing delayed ref, respectively. So just make it return a
      boolean instead.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f38462c4
    • Filipe Manana's avatar
      btrfs: use a bool to track qgroup record insertion when adding ref head · 293f8197
      Filipe Manana authored
      We are using an integer as a boolean to track the qgroup record insertion
      status when adding a delayed reference head. Since all we need is a
      boolean, switch the type from int to bool to make it more obvious.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      293f8197
    • Filipe Manana's avatar
      btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node · 4d34ad34
      Filipe Manana authored
      The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node,
      as we can check whether a reference is in the tree or not simply by
      checking its red black tree node member with RB_EMPTY_NODE(), as when we
      remove it from the tree we always call RB_CLEAR_NODE(). So remove that
      field and use RB_EMPTY_NODE().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d34ad34
    • Filipe Manana's avatar
      btrfs: remove unused is_head field from struct btrfs_delayed_ref_node · 53499d5f
      Filipe Manana authored
      The 'is_head' field of struct btrfs_delayed_ref_node is no longer after
      commit d278850e ("btrfs: remove delayed_ref_node from ref_head"),
      so remove it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53499d5f
    • Filipe Manana's avatar
      btrfs: reorder some members of struct btrfs_delayed_ref_head · 315dd5cc
      Filipe Manana authored
      Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members
      in different cache lines (even on a release, non-debug, kernel). This is
      not optimal because when iterating the red black tree of delayed ref heads
      for inserting a new delayed ref head (htree_insert()) we have to pull in 2
      cache lines of delayed ref heads we find in a patch, one for the tree node
      (struct rb_node) and another one for the 'bytenr' field. The same applies
      when searching for an existing delayed ref head (find_ref_head()).
      On a release (non-debug) kernel, the structure also has two 4 bytes holes,
      which makes it 8 bytes longer than necessary. Its current layout is the
      following:
      
        struct btrfs_delayed_ref_head {
                u64                        bytenr;               /*     0     8 */
                u64                        num_bytes;            /*     8     8 */
                refcount_t                 refs;                 /*    16     4 */
      
                /* XXX 4 bytes hole, try to pack */
      
                struct mutex               mutex;                /*    24    32 */
                spinlock_t                 lock;                 /*    56     4 */
      
                /* XXX 4 bytes hole, try to pack */
      
                /* --- cacheline 1 boundary (64 bytes) --- */
                struct rb_root_cached      ref_tree;             /*    64    16 */
                struct list_head           ref_add_list;         /*    80    16 */
                struct rb_node             href_node __attribute__((__aligned__(8))); /*    96    24 */
                struct btrfs_delayed_extent_op * extent_op;      /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                int                        total_ref_mod;        /*   128     4 */
                int                        ref_mod;              /*   132     4 */
                unsigned int               must_insert_reserved:1; /*   136: 0  4 */
                unsigned int               is_data:1;            /*   136: 1  4 */
                unsigned int               is_system:1;          /*   136: 2  4 */
                unsigned int               processing:1;         /*   136: 3  4 */
      
                /* size: 144, cachelines: 3, members: 15 */
                /* sum members: 128, holes: 2, sum holes: 8 */
                /* sum bitfield members: 4 bits (0 bytes) */
                /* padding: 4 */
                /* bit_padding: 28 bits */
                /* forced alignments: 1 */
                /* last cacheline: 16 bytes */
        } __attribute__((__aligned__(8)));
      
      This change reorders the 'href_node' and 'refs' members so that we have
      the 'href_node' in the same cache line as the 'bytenr' field, while also
      eliminating the two holes and reducing the structure size from 144 bytes
      down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64)
      instead of 28. The new structure layout after this change is now:
      
        struct btrfs_delayed_ref_head {
                u64                        bytenr;               /*     0     8 */
                u64                        num_bytes;            /*     8     8 */
                struct rb_node             href_node __attribute__((__aligned__(8))); /*    16    24 */
                struct mutex               mutex;                /*    40    32 */
                /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
                refcount_t                 refs;                 /*    72     4 */
                spinlock_t                 lock;                 /*    76     4 */
                struct rb_root_cached      ref_tree;             /*    80    16 */
                struct list_head           ref_add_list;         /*    96    16 */
                struct btrfs_delayed_extent_op * extent_op;      /*   112     8 */
                int                        total_ref_mod;        /*   120     4 */
                int                        ref_mod;              /*   124     4 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                unsigned int               must_insert_reserved:1; /*   128: 0  4 */
                unsigned int               is_data:1;            /*   128: 1  4 */
                unsigned int               is_system:1;          /*   128: 2  4 */
                unsigned int               processing:1;         /*   128: 3  4 */
      
                /* size: 136, cachelines: 3, members: 15 */
                /* padding: 4 */
                /* bit_padding: 28 bits */
                /* forced alignments: 1 */
                /* last cacheline: 8 bytes */
        } __attribute__((__aligned__(8)));
      
      Running the following fs_mark test shows some significant improvement.
      
        $ cat test.sh
        #!/bin/bash
      
        # 15G null block device
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
        FILES=100000
        THREADS=$(nproc --all)
        FILE_SIZE=0
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
          10      1200000            0     112631.3         11928055
          16      2400000            0     189943.8         12140777
          23      3600000            0     150719.2         13178480
          50      4800000            0      99137.3         12504293
          53      6000000            0     111733.9         12670836
      
                          Total files/sec: 664165.5
      
      After this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
          10      1200000            0     148589.5         11565889
          16      2400000            0     227743.8         11561596
          23      3600000            0     191590.5         12550755
          30      4800000            0     179812.3         12629610
          53      6000000            0      92471.4         12352383
      
                          Total files/sec: 840207.5
      
      Measuring the execution times of htree_insert(), in nanoseconds, during
      those fs_mark runs:
      
      Before this change:
      
        Range:  0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231
        Percentiles:  90th: 980.000; 95th: 1208.000; 99th: 2090.000
           0.000 -    6.384:       257 |
           6.384 -   26.259:       977 |
          26.259 -   99.635:      4963 |
          99.635 -  370.526:    136800 #############
         370.526 - 1370.603:    566110 #####################################################
        1370.603 - 5062.704:     24945 ##
        5062.704 - 18693.248:      944 |
        18693.248 - 69014.670:     211 |
        69014.670 - 254791.959:     30 |
        254791.959 - 940647.000:     4 |
      
      After this change:
      
        Range:  0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422
        Percentiles:  90th: 918.000; 95th: 1113.000; 99th: 1987.000
           0.000 -    5.585:      163 |
           5.585 -   20.678:      452 |
          20.678 -   70.369:     1806 |
          70.369 -  233.965:    26268 ####
         233.965 -  772.564:   333519 #####################################################
         772.564 - 2545.771:    91820 ###############
        2545.771 - 8383.615:     2238 |
        8383.615 - 27603.280:     170 |
        27603.280 - 90879.297:     68 |
        90879.297 - 299200.000:    12 |
      
      Mean, percentiles, maximum times are all better, as well as a lower
      standard deviation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      315dd5cc
    • Qu Wenruo's avatar
      btrfs: use the same uptodate variable for end_bio_extent_readpage() · 31dd8c81
      Qu Wenruo authored
      In function end_bio_extent_readpage() we call
      endio_readpage_release_extent() to unlock the extent io tree.
      
      However we pass PageUptodate(page) as @uptodate parameter for it, while
      for previous end_page_read() call, we use a dedicated @uptodate local
      variable.
      
      This is not a big deal, as even for subpage cases, either the bio only
      covers part of the page, then the @uptodate is always false, and the
      subpage ranges can still be merged.
      
      But for the sake of consistency, always use @uptodate variable when
      possible.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      31dd8c81
    • Qu Wenruo's avatar
      btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently · 5a963419
      Qu Wenruo authored
      Currently alloc_extent_buffer() would make the extent buffer uptodate if
      the corresponding pages are also uptodate.
      
      But this check is only checking PageUptodate, which is fine for regular
      cases, but not for subpage cases, as we can have multiple extent buffers
      in the same page.
      
      So here we go btrfs_page_test_uptodate() instead.
      
      The old code doesn't cause any problem, but is not efficient, as it
      would cause extra metadata read even if the range is already uptodate.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a963419
    • David Sterba's avatar
      btrfs: print assertion failure report and stack trace from the same line · b831306b
      David Sterba authored
      Assertions reports are split into two parts, the exact file and location
      of the condition and then the stack trace printed from
      btrfs_assertfail(). This means all the stack traces report the same line
      and this is what's typically reported by various tools, making it harder
      to distinguish the reports.
      
        [403.2467] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259
        [403.2479] ------------[ cut here ]------------
        [403.2484] kernel BUG at fs/btrfs/messages.c:259!
        [403.2488] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        [403.2493] CPU: 2 PID: 23202 Comm: umount Not tainted 6.2.0-rc4-default+ #67
        [403.2499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
        [403.2509] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
        ...
        [403.2595] Call Trace:
        [403.2598]  <TASK>
        [403.2601]  btrfs_free_block_groups.cold+0x52/0xae [btrfs]
        [403.2608]  close_ctree+0x6c2/0x761 [btrfs]
        [403.2613]  ? __wait_for_common+0x2b8/0x360
        [403.2618]  ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs]
        [403.2626]  ? mark_held_locks+0x6b/0x90
        [403.2630]  ? lockdep_hardirqs_on_prepare+0x13d/0x200
        [403.2636]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
        [403.2642]  ? trace_hardirqs_on+0x2d/0x110
        [403.2646]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
        [403.2652]  generic_shutdown_super+0xb0/0x1c0
        [403.2657]  kill_anon_super+0x1e/0x40
        [403.2662]  btrfs_kill_super+0x25/0x30 [btrfs]
        [403.2668]  deactivate_locked_super+0x4c/0xc0
      
      By making btrfs_assertfail a macro we'll get the same line number for
      the BUG output:
      
        [63.5736] assertion failed: 0, in fs/btrfs/super.c:1572
        [63.5758] ------------[ cut here ]------------
        [63.5782] kernel BUG at fs/btrfs/super.c:1572!
        [63.5807] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
        [63.5831] CPU: 0 PID: 859 Comm: mount Tainted: G      D            6.3.0-rc7-default+ #2062
        [63.5868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
        [63.5905] RIP: 0010:btrfs_mount+0x24/0x30 [btrfs]
        [63.5964] RSP: 0018:ffff88800e69fcd8 EFLAGS: 00010246
        [63.5982] RAX: 000000000000002d RBX: ffff888008fc1400 RCX: 0000000000000000
        [63.6004] RDX: 0000000000000000 RSI: ffffffffb90fd868 RDI: ffffffffbcc3ff20
        [63.6026] RBP: ffffffffc081b200 R08: 0000000000000001 R09: ffff88800e69fa27
        [63.6046] R10: ffffed1001cd3f44 R11: 0000000000000001 R12: ffff888005a3c370
        [63.6062] R13: ffffffffc058e830 R14: 0000000000000000 R15: 00000000ffffffff
        [63.6081] FS:  00007f7b3561f800(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000
        [63.6105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [63.6120] CR2: 00007fff83726e10 CR3: 0000000002a9e000 CR4: 00000000000006b0
        [63.6137] Call Trace:
        [63.6143]  <TASK>
        [63.6148]  legacy_get_tree+0x80/0xd0
        [63.6158]  vfs_get_tree+0x43/0x120
        [63.6166]  do_new_mount+0x1f3/0x3d0
        [63.6176]  ? do_add_mount+0x140/0x140
        [63.6187]  ? cap_capable+0xa4/0xe0
        [63.6197]  path_mount+0x223/0xc10
      
      This comes at a cost of bloating the final btrfs.ko module due all the
      inlining, as long as assertions are compiled in. This is a must for
      debugging builds but this is often enabled on release builds too.
      
      Release build:
      
         text    data     bss     dec     hex filename
      1251676   20317   16088 1288081  13a791 pre/btrfs.ko
      1260612   29473   16088 1306173  13ee3d post/btrfs.ko
      
      DELTA: +8936
      
      CC: Josh Poimboeuf <jpoimboe@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b831306b
    • Qu Wenruo's avatar
      btrfs: subpage: dump extra subpage bitmaps for debug · 75258f20
      Qu Wenruo authored
      There is a bug report that assert_eb_page_uptodate() gets triggered for
      free space tree metadata.
      
      Without proper dump for the subpage bitmaps it's much harder to debug.
      
      Thus this patch would dump all the subpage bitmaps (split them into
      their own bitmaps) for a easier debugging.
      
      The output would look like this:
      (Dumped after a tree block got read from disk)
      
        page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9
        memcg:ffff0000d7d62000
        aops:btree_aops [btrfs] ino:1
        flags: 0x8000000000002002(referenced|private|zone=2)
        page_type: 0xffffffff()
        raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0
        raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000
        page dumped because: btrfs subpage dump
        BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked=
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75258f20
    • Tejun Heo's avatar
      btrfs: use alloc_ordered_workqueue() to create ordered workqueues · 58e814fc
      Tejun Heo authored
      BACKGROUND
      ==========
      
      When multiple work items are queued to a workqueue, their execution order
      doesn't match the queueing order. They may get executed in any order and
      simultaneously. When fully serialized execution - one by one in the queueing
      order - is needed, an ordered workqueue should be used which can be created
      with alloc_ordered_workqueue().
      
      However, alloc_ordered_workqueue() was a later addition. Before it, an
      ordered workqueue could be obtained by creating an UNBOUND workqueue with
      @max_active==1. This originally was an implementation side-effect which was
      broken by 4c16bd32 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
      ordered"). Because there were users that depended on the ordered execution,
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      made workqueue allocation path to implicitly promote UNBOUND workqueues w/
      @max_active==1 to ordered workqueues.
      
      While this has worked okay, overloading the UNBOUND allocation interface
      this way creates other issues. It's difficult to tell whether a given
      workqueue actually needs to be ordered and users that legitimately want a
      min concurrency level wq unexpectedly gets an ordered one instead. With
      planned UNBOUND workqueue updates to improve execution locality and more
      prevalence of chiplet designs which can benefit from such improvements, this
      isn't a state we wanna be in forever.
      
      This patch series audits all call sites that create an UNBOUND workqueue w/
      @max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
      
      BTRFS
      =====
      
      * fs_info->scrub_workers initialized in scrub_workers_get() was setting
        @max_active to 1 when @is_dev_replace is set and it seems that the
        workqueue actually needs to be ordered if @is_dev_replace. Update the code
        so that alloc_ordered_workqueue() is used if @is_dev_replace.
      
      * fs_info->discard_ctl.discard_workers initialized in
        btrfs_init_workqueues() was directly using alloc_workqueue() w/
        @max_active==1. Converted to alloc_ordered_workqueue().
      
      * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in
        btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue,
        which are allocated with btrfs_alloc_workqueue().
      
        btrfs_workqueue implements automatic @max_active adjustment which is
        disabled when the specified max limit is below a certain threshold, so
        calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered
        workqueue whose @max_active won't be changed as the auto-tuning is
        disabled.
      
        This is rather brittle in that nothing clearly indicates that the two
        workqueues should be ordered or btrfs_alloc_workqueue() must disable
        auto-tuning when @limit_active==1.
      
        This patch factors out the common btrfs_workqueue init code into
        btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue().
        The two workqueues are converted to use the new ordered allocation
        interface.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58e814fc
    • David Sterba's avatar
      btrfs: drop gfp from parameter extent state helpers · 1d126800
      David Sterba authored
      Now that all extent state bit helpers effectively take the GFP_NOFS mask
      (and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
      This reduces stack consumption in many functions and simplifies a lot of
      code.
      
      Net effect on module on a release build:
      
         text    data     bss     dec     hex filename
      1250432   20985   16088 1287505  13a551 pre/btrfs.ko
      1247074   20985   16088 1284147  139833 post/btrfs.ko
      
      DELTA: -3358
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d126800
    • David Sterba's avatar
      btrfs: pass NOWAIT for set/clear extent bits as another bit · 62bc6047
      David Sterba authored
      The only flags we now pass to set_extent_bit/__clear_extent_bit are
      GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This
      requires an extra parameter to be passed everywhere but is almost always
      the same.
      
      Encode the GFP_NOWAIT as an artificial extent bit and extract the
      real bits and gfp mask in the lowest level helpers. Now the passed
      gfp mask is not actually used and can be removed.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62bc6047
    • David Sterba's avatar
      btrfs: drop NOFAIL from set_extent_bit allocation masks · 7dde7a8a
      David Sterba authored
      The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
      (commit f0486c68 ("Btrfs: Introduce contexts for metadata
      reservation")), without any explanation why it would be needed.
      
      Meanwhile we've updated the semantics of set_extent_bit to handle failed
      allocations and do unlock, sleep and retry if needed.  The use of the
      NOFAIL flag is also an outlier, we never want any of the set/clear
      extent bit helpers to fail, they're used for many critical changes like
      extent locking, besides the extent state bit changes.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7dde7a8a