1. 12 Oct, 2023 40 commits
    • btrfs: remove redundant root argument from btrfs_update_inode_fallback() · 0a5d0dc5
      Filipe Manana authored
      The root argument for btrfs_update_inode_fallback() always matches the
      root of the given inode, so remove the root argument and get it from the
      inode argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove noinline from btrfs_update_inode() · cddaaacc
      Filipe Manana authored
      The noinline attribute of btrfs_update_inode() is pointless as the
      function is exported and widely used, so remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify error check condition at btrfs_dirty_inode() · 2199cb0f
      Filipe Manana authored
      The following condition at btrfs_dirty_inode() is redundant:
      
        if (ret && (ret == -ENOSPC || ret == -EDQUOT))
      
      The first check for a non-zero 'ret' value is pointless, so we can
      simplify this to:
      
        if (ret == -ENOSPC || ret == -EDQUOT)
      
      Not only does this make it easier to read, it also slightly reduces the
      text size of the btrfs kernel module:
      
        $ size fs/btrfs/btrfs.ko.before
           text	   data	    bss	    dec	    hex	filename
        1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before
      
        $ size fs/btrfs/btrfs.ko.after
           text	   data	    bss	    dec	    hex	filename
        1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
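      The redundancy claimed in the commit above can be checked outside the
      kernel. The sketch below is a user-space illustration only: cond_old(),
      cond_new() and forms_agree() are hypothetical names, and the scanned
      range of 'ret' values is an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Original form: explicit non-zero test before the comparisons. */
static bool cond_old(int ret)
{
	return ret && (ret == -ENOSPC || ret == -EDQUOT);
}

/* Simplified form quoted in the commit message. */
static bool cond_new(int ret)
{
	return ret == -ENOSPC || ret == -EDQUOT;
}

/* Returns true when both forms agree for every value in [-4096, 4096]. */
static bool forms_agree(void)
{
	for (int ret = -4096; ret <= 4096; ret++) {
		if (cond_old(ret) != cond_new(ret))
			return false;
	}
	return true;
}
```

      Both errno values are non-zero, so whenever either comparison is true
      the leading 'ret' test is already true, which is why the two forms
      cannot differ.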
    • btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroups · e0761451
      Boris Burkov authored
      In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
      quota_root, as opposed to after we are done setting up the qgroup
      structures. In the quota_enable path, we wait until after the structures
      are set up. Likewise, in disable, we clear the bit before tearing down
      the structures. I feel that this organization is less surprising for the
      open_ctree path.
      
      I don't believe this fixes any actual bug, but it avoids potential
      confusion when using btrfs_qgroup_mode in an intermediate state where we
      are enabled but haven't yet set up the qgroup status flags. It also
      avoids any risk of calling a qgroup function and attempting to use the
      qgroup rbtrees before they exist/are set up.
      
      This all occurs before we do rw setup, so I believe it should be mostly
      a no-op.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track data relocation with simple quota · 2672a051
      Boris Burkov authored
      Relocation data allocations are quite tricky for simple quotas. The
      basic data relocation sequence is (ignoring details that aren't relevant
      to this fix):
      
      - create a fake relocation data fs root
      - create a fake relocation inode in that root
      - for each data extent:
        - preallocate a data extent on behalf of the fake inode
        - copy over the data
      - for each extent:
        - swap the refs so that the original file extent now refers to the new
          extent item
      - drop the fake root, dropping its refs on the old extents, which lets
        us delete them.
      
      Done naively, this results in storing an extent item in the extent tree
      whose owner_ref points at the relocation data root and a no-op squota
      recording, since the reloc root is not a legit fstree. So far, that's
      OK. The problem comes when you do the swap, and leave an extent item
      owned by this bogus root as the real permanent extents of the file. If
      the file then drops that ref, we free it and no-op account that against
      the fake relocation root. Essentially, this means that relocation is
      simple quota "extent laundering", since we re-own the extents into a
      fake root.
      
      Simple quotas very intentionally doesn't have a mechanism for
      transferring ownership of extents, as that is exactly the complicated
      thing we are trying to avoid with the new design. Further, it cannot be
      correctly done in this case, since at the time you create the new
      "real" refs, there is no way to know which was the original owner before
      relocation unless we track it.
      
      Therefore, it makes more sense to trick the preallocation to handle
      relocation as a special case and note the proper owner ref from the
      beginning. That way, we never write out an extent item without the
      correct owner ref that it will eventually have.
      
      This could be done by wiring a special root parameter all the way
      through the allocation code path, but to avoid that special case
      touching all the code, take advantage of the serial nature of relocation
      to store the src root on the relocation root object. Then when we finish
      the prealloc, if it happens to be this case, prepare the delayed ref
      appropriately.
      
      We must also add logic to handle relocating adjacent extents with
      different owning roots. Those cannot be preallocated together in a
      cluster as it would lose the separate ownership information.
      
      This is obviously a smelly bit of code, but I think it is the best
      solution to the problem, given the relocation implementation.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: track metadata relocation COW with simple quota · 60ea105a
      Boris Burkov authored
      Relocation COWs metadata blocks in two cases for the reloc root:
      
      - copying the subvolume root item when creating the reloc root
      - copying a btree node when there is a COW during relocation
      
      In both cases, the resulting btree node hits an abnormal code path with
      respect to the owner field in its btrfs_header. It first creates the
      root item for the new objectid, which populates the reloc root id, and
      it is at this point that delayed refs are created.
      
      Later, it fully copies the old node into the new node (including the
      original owner field) which overwrites it. This results in a simple
      quotas mismatch where we run the delayed ref for the reloc root which
      has no simple quota effect (reloc root is not an fstree) but when we
      ultimately delete the node, the owner is the real original fstree and we
      do free the space.
      
      To work around this without tampering with the behavior of relocation,
      add a parameter to btrfs_add_tree_block that lets the relocation code
      path specify a different owning root than the "operating" root (in this
      case, owning root is the real root and the operating root is the reloc
      root). These can naturally be plumbed into delayed refs that have the
      same concept.
      
      Note that this is a double count in some sense, but a relatively natural
      one, as there are really two extents, and the old one will be deleted
      soon. This is consistent with how data relocation extents are accounted
      by simple quotas.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: check generation when recording simple quota delta · bd7c1ea3
      Boris Burkov authored
      Simple quotas count extents only from the moment the feature is enabled.
      Therefore, if we do something like:
      
      1. create subvol S
      2. write F in S
      3. enable quotas
      4. remove F
      5. write G in S
      
      then after 3. and 4. we would expect the simple quota usage of S to be 0
      (putting aside some metadata extents that might be written) and after
      5., it should be the size of G plus metadata. Therefore, we need to be
      able to determine whether a particular quota delta we are processing
      predates simple quota enablement.
      
      To do this, store the transaction id when quotas were enabled, both in
      fs_info for immediate use and in the quota status item to make it
      recoverable on mount. When we see a delta, check if the generation of
      the extent item is less than that of quota enablement. If so, we should
      ignore the delta from this extent.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
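      The generation check described in the commit above can be sketched in
      user space as follows. This is a simplified model, not the kernel code:
      struct squota_state, record_delta() and replay_example() are
      hypothetical names, and the generation numbers 5, 10 and 11 are made up
      for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state kept in fs_info and the status item. */
struct squota_state {
	uint64_t enable_gen;	/* transaction id when quotas were enabled */
	int64_t usage;		/* bytes accounted to the subvolume */
};

/* An extent created before enablement must not affect the accounting. */
static bool delta_predates_enable(const struct squota_state *s,
				  uint64_t extent_gen)
{
	return extent_gen < s->enable_gen;
}

/* Apply a signed delta unless the extent predates quota enablement. */
static void record_delta(struct squota_state *s, uint64_t extent_gen,
			 int64_t bytes)
{
	if (delta_predates_enable(s, extent_gen))
		return;
	s->usage += bytes;
}

/* Replays steps 1-5 from the commit message and returns the final usage. */
static int64_t replay_example(int64_t f_size, int64_t g_size)
{
	struct squota_state s = { .enable_gen = 0, .usage = 0 };

	/* Steps 1-2: F is written in generation 5, before quotas exist. */
	s.enable_gen = 10;		/* step 3: enable quotas */
	record_delta(&s, 5, -f_size);	/* step 4: remove F, ignored */
	record_delta(&s, 11, g_size);	/* step 5: write G, counted */
	return s.usage;
}
```

      Without the generation check, step 4 would subtract the size of F from
      a counter that never included it.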
    • btrfs: qgroup: simple quota auto hierarchy for nested subvolumes · 5343cd93
      Boris Burkov authored
      Consider the following sequence:
      
      - enable quotas
      - create subvol S id 256 at dir outer/
      - create a qgroup 1/100
      - add 0/256 (S's auto qgroup) to 1/100
      - create subvol T id 257 at dir outer/inner/
      
      With full qgroups, there is no relationship between 0/257 and either of
      0/256 or 1/100. There is an inherit feature that the creator of inner/
      can use to specify it ought to be in 1/100.
      
      Simple quotas are targeted at container isolation, where such automatic
      inheritance for not necessarily trusted/controlled nested subvol
      creation would be quite helpful. Therefore, add a new default behavior
      for simple quotas: when you create a nested subvol, automatically
      inherit as parents any parents of the qgroup of the subvol the new inode
      is going in.
      
      In our example, 0/257 would also be under 1/100, allowing easy control
      of a total quota over an arbitrary hierarchy of subvolumes.
      
      I think this _might_ be a generally useful behavior, so it could be
      interesting to put it behind a new inheritance flag that simple quotas
      always use while traditional quotas let the user specify, but this is a
      minimally intrusive change to start.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
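      The inheritance rule from the commit above can be modelled with a small
      user-space sketch. struct qgroup, MAX_PARENTS, inherit_parents() and
      example_parent_of_t() are hypothetical names for illustration; the only
      detail taken from btrfs is that a qgroup id written as level/id packs
      the level into the top 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 4

/* qgroup ids are written level/id; pack them as level << 48 | id. */
static uint64_t qgroupid(uint64_t level, uint64_t id)
{
	return (level << 48) | id;
}

/* Hypothetical in-memory qgroup: just an id and the ids of its parents. */
struct qgroup {
	uint64_t id;
	uint64_t parents[MAX_PARENTS];
	size_t nr_parents;
};

/* The nested subvolume's qgroup inherits the containing qgroup's parents. */
static void inherit_parents(struct qgroup *child,
			    const struct qgroup *containing)
{
	child->nr_parents = 0;
	for (size_t i = 0; i < containing->nr_parents && i < MAX_PARENTS; i++)
		child->parents[child->nr_parents++] = containing->parents[i];
}

/* Replays the example; returns the first parent of 0/257, or 0 if none. */
static uint64_t example_parent_of_t(void)
{
	struct qgroup s = { .id = qgroupid(0, 256), .nr_parents = 0 };
	struct qgroup t = { .id = qgroupid(0, 257), .nr_parents = 0 };

	s.parents[s.nr_parents++] = qgroupid(1, 100);	/* add 0/256 to 1/100 */
	inherit_parents(&t, &s);			/* create T inside S */
	return t.nr_parents ? t.parents[0] : 0;
}
```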
    • btrfs: record simple quota deltas in delayed refs · cecbb533
      Boris Burkov authored
      At the moment that we run delayed refs, we make the final ref-count
      based decision on creating/removing extent (and metadata) items.
      Therefore, it is exactly the spot to hook up simple quotas.
      
      There are a few important subtleties to the fields we must collect to
      accurately track simple quotas, particularly when removing an extent.
      When removing a data extent, the ref could be in any tree (due to
      reflink, for example) and so we need to recover the owning root id from
      the owner ref item. When removing a metadata extent, we know the owning
      root from the owner field in the header when we create the delayed ref,
      so we can recover it from there.
      
      We must also be careful to handle reservations properly so as not to
      leak reserved space. The happy path is freeing the reservation when the
      simple quota delta runs on a data extent. If that doesn't happen, due to
      refs canceling out or some error, the ref head already has the
      must_insert_reserved machinery to handle this, so we piggyback on that
      and use it to clean up the reserved data.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper for inline owner ref lookup · 8d299091
      Boris Burkov authored
      Inline ref parsing is a bit tricky and relies on a decent amount of
      implicit information, so I think it is beneficial to have a helper
      function for reading the owner ref, if only to "document" the format,
      along with the write path.
      
      The main subtlety of note which I was missing by open-coding this was
      that it is important to check whether or not inline refs are present
      *at all*. i.e., if we are writing out a new extent under squotas, we
      will always use a big enough item for the inline ref and have it.
      However, it is possible that some random item predating squotas will not
      have any inline refs. In that case, trying to read the "type" field of
      the first inline ref will just be reading garbage in the form of
      whatever is in the next item.
      
      This will be used by the extent freeing path, which looks up data
      extent owners, as well as a relocation path which needs to grab the
      owner before relocating an extent.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: new inline ref storing owning subvol of data extents · d9a620f7
      Boris Burkov authored
      In order to implement simple quota groups, we need to be able to
      associate a data extent with the subvolume that created it. Once you
      account for reflink, this information cannot be recovered without
      explicitly storing it. Options for storing it are:
      
      - a new key/item
      - a new extent inline ref item
      
      The former is backwards compatible, but wastes space; the latter is
      incompat, but is efficient in space and reuses the existing inline ref
      machinery, while only abusing it a tiny amount -- specifically, the new
      item is not a ref, per se.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track original extent owner in head_ref · cf79ac47
      Boris Burkov authored
      Simple quotas requires tracking the original creating root of any given
      extent. This gets complicated when multiple subvolumes create
      overlapping/contradictory refs in the same transaction, for example
      by modifying or deleting an extent while also snapshotting it.
      
      To resolve this in a general way, take advantage of the fact that we are
      essentially already tracking this for handling releasing reservations.
      The head ref coalesces the various refs and uses must_insert_reserved to
      check if it needs to create an extent/free reservation. Store the ref
      that set must_insert_reserved as the owning ref on the head ref.
      
      Note that this can result in writing an extent for the very first time
      with an owner different from its only ref, but it will look the same as
      if you first created it with the original owning ref, then added the
      other ref, then removed the owning ref.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track owning root in btrfs_ref · 457cb1dd
      Boris Burkov authored
      While data extents require us to store additional inline refs to track
      the original owner on free, this information is available implicitly for
      metadata. It is found in the owner field of the header of the tree
      block. Even if other trees refer to this block and the original ref goes
      away, we will not rewrite that header field, so it will reliably give the
      original owner.
      
      In addition, there is a relocation case where a new data extent needs to
      have an owning root separate from the referring root wired through
      delayed refs.
      
      To use it for recording simple quota deltas, we need to wire this root
      id through from when we create the delayed ref until we fully process
      it. Store it in the generic btrfs_ref struct of the delayed ref.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename tree_ref and data_ref owning_root · 610647d7
      Boris Burkov authored
      commit 113479d5 ("btrfs: rename root fields in delayed refs structs")
      changed these from ref_root to owning_root. However, there are many
      circumstances where that name is not really accurate and the root on the
      ref struct _is_ the referring root. In general, these are not the owning
      root, though it does happen in some ref merging cases involving
      overwrites during snapshots and similar.
      
      Simple quotas cares quite a bit about tracking the original owner of an
      extent through delayed refs, so rename these back to free up the name
      for the real owning root (which will live on the generic btrfs_ref and
      the head ref).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper for recording simple quota deltas · 1e0e9d57
      Boris Burkov authored
      Rather than re-computing shared/exclusive ownership based on backrefs
      and walking roots for implicit backrefs, simple quotas does an increment
      when creating an extent and a decrement when deleting it. Add the API
      for the extent item code to use to track those events.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
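      The increment-on-create, decrement-on-delete accounting from the commit
      above can be illustrated with a toy model. Everything here is
      hypothetical (struct extent, add_ref(), drop_ref(), the usage array);
      it only shows that an extra ref, such as one from a reflink, leaves the
      owner's usage untouched until the extent itself goes away.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ROOTS 2

/* Hypothetical extent: the root that created it, its size, its ref count. */
struct extent {
	int owner;
	uint64_t bytes;
	unsigned int refs;
};

static uint64_t usage[NR_ROOTS];	/* per-root simple quota usage */

/* Adding a ref changes usage only when it creates the extent. */
static void add_ref(struct extent *e)
{
	if (e->refs++ == 0)
		usage[e->owner] += e->bytes;
}

/* Dropping a ref changes usage only when it deletes the extent. */
static void drop_ref(struct extent *e)
{
	if (--e->refs == 0)
		usage[e->owner] -= e->bytes;
}

/*
 * Root 0 creates a 4096-byte extent and a second ref is added to it. Then
 * 'drops' refs (at most two) are dropped. Returns the usage of root 0.
 */
static uint64_t usage_after_drops(unsigned int drops)
{
	struct extent e = { .owner = 0, .bytes = 4096, .refs = 0 };

	usage[0] = usage[1] = 0;
	add_ref(&e);	/* creation: usage of root 0 goes up */
	add_ref(&e);	/* second ref: no change */
	for (unsigned int i = 0; i < drops; i++)
		drop_ref(&e);
	return usage[0];
}
```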
    • btrfs: create qgroup earlier in snapshot creation · 6ed05643
      Boris Burkov authored
      Pull creating the qgroup earlier in the snapshot. This allows simple
      quotas qgroups to see all the metadata writes related to the snapshot
      being created and to be born with the root node accounted.
      
      Note this has an impact on transaction commit where the qgroup creation
      can do a lot of work, allocate memory and take locks. The change is done
      for correctness; potential performance issues will be fixed in the
      future.
      Signed-off-by: Boris Burkov <boris@bur.io>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: flush reservations during quota disable · af0e2aab
      Boris Burkov authored
      The following sequence:
      
        enable simple quotas
        do some writes
            reserve space
            create ordered_extent
                release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
        disable quotas
        enable simple quotas
            set qgroup rsv to 0 on all subvolumes
        ordered_extent finishes
            create delayed ref with rsv_bytes from before
        run delayed ref
            record_simple_quota_delta
                free rsv_bytes (0 -> -rsv_delta)
      
      results in us reliably underflowing the subvolume's qgroup rsv counter,
      because disabling/re-enabling quotas toggles reservation counters down
      to 0, but does not remove other file system state which represents
      successful acquisition of qgroup rsv space. Specifically metadata rsv
      counters on the root object and rsv_bytes on ordered_extent objects that
      have released their reservation as well as the corresponding
      QGROUP_RESERVED extent bits.
      
      Normal qgroups gets away with this, I believe because it forces more
      work to happen on transaction commit, but I am not certain it is totally
      safe from the ordered_extent/leaked extent bit variant. Simple quotas
      hits this reliably.
      
      The intent of the fix is to make disable take the time to clear that
      external-to-qgroups state as well: after flipping off the quota bit on
      fs_info, flush delalloc and ordered extents, clearing the extent bits
      along the way. This makes it so there are no ordered extents or meta
      prealloc hanging around from the first enablement period during the second.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
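      The underflow sequence in the commit above can be reduced to a toy
      model with one counter and one pending ordered extent. This is a
      heavily simplified sketch: struct fs_state and all function names are
      hypothetical, and the real code involves delayed refs and extent bits
      that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct fs_state {
	uint64_t qgroup_rsv;	/* subvolume reservation counter */
	uint64_t oe_rsv_bytes;	/* rsv_bytes stashed on a pending ordered extent */
};

/* A write reserves space and leaves an ordered extent carrying rsv_bytes. */
static void reserve_and_create_oe(struct fs_state *fs, uint64_t bytes)
{
	fs->qgroup_rsv += bytes;
	fs->oe_rsv_bytes = bytes;
}

/*
 * Finishing the ordered extent frees the reservation it carried.
 * Returns false if the counter was too small, i.e. it underflowed.
 */
static bool finish_oe(struct fs_state *fs)
{
	bool ok = fs->qgroup_rsv >= fs->oe_rsv_bytes;

	fs->qgroup_rsv -= fs->oe_rsv_bytes;
	fs->oe_rsv_bytes = 0;
	return ok;
}

/* Disable then re-enable quotas; with 'flush', pending work finishes first. */
static void disable_enable(struct fs_state *fs, bool flush)
{
	if (flush)
		finish_oe(fs);
	fs->qgroup_rsv = 0;	/* re-enable resets the counter */
}

/* Returns true if the sequence completes without an underflow. */
static bool scenario_ok(bool flush)
{
	struct fs_state fs = { 0, 0 };

	reserve_and_create_oe(&fs, 4096);
	disable_enable(&fs, flush);
	return finish_oe(&fs);
}
```

      In this model scenario_ok(false) reports the underflow and
      scenario_ok(true) does not, because flushing leaves no stale rsv_bytes
      behind for the second enablement period.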
    • btrfs: sysfs: add simple_quota incompat feature entry · a744986a
      Boris Burkov authored
      Add an entry in the features directory for the new incompat flag.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: expose quota mode via sysfs · 0182764a
      Boris Burkov authored
      Add a new sysfs file /sys/fs/btrfs/<uuid>/qgroups/mode
      which prints out the mode qgroups is running in. The possible modes are
      qgroup and squota.
      
      If quotas are not enabled, then the qgroups directory will not exist,
      so don't handle that mode.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
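      The mapping from quota mode to the string shown in the sysfs file can
      be sketched as below. The enum and function names are hypothetical and
      chosen for this example; only the two strings come from the commit
      message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mode enum; the names are illustrative only. */
enum quota_mode {
	QUOTA_MODE_DISABLED,
	QUOTA_MODE_FULL,
	QUOTA_MODE_SIMPLE,
};

/* Returns the string printed by the mode file, or NULL if there is none. */
static const char *quota_mode_name(enum quota_mode mode)
{
	switch (mode) {
	case QUOTA_MODE_FULL:
		return "qgroup";
	case QUOTA_MODE_SIMPLE:
		return "squota";
	default:
		/* Disabled: the qgroups directory does not exist at all. */
		return NULL;
	}
}
```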
    • btrfs: qgroup: add new quota mode for simple quotas · 182940f4
      Boris Burkov authored
      Add a new quota mode called "simple quotas". It can be enabled by the
      existing quota enable ioctl via a new command, and sets an incompat
      bit, as the implementation of simple quotas will make backwards
      incompatible changes to the disk format of the extent tree.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: introduce quota mode · 6b0cd63b
      Boris Burkov authored
      In preparation for introducing simple quotas, change from a binary
      setting for quotas to an enum based mode. Initially, the possible modes
      are disabled/full. Full quotas is normal btrfs qgroups.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge ordered work callbacks in btrfs_work into one · 078b8b90
      David Sterba authored
      There are two callbacks defined in btrfs_work but only two users actually
      make use of them; all others pass NULL. We can get rid of the freeing
      callback making it a special case of the normal work. This reduces the
      size of btrfs_work by 8 bytes, final layout:
      
      struct btrfs_work {
              btrfs_func_t               func;                 /*     0     8 */
              btrfs_ordered_func_t       ordered_func;         /*     8     8 */
              struct work_struct         normal_work;          /*    16    32 */
              struct list_head           ordered_list;         /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct btrfs_workqueue *   wq;                   /*    64     8 */
              long unsigned int          flags;                /*    72     8 */
      
              /* size: 80, cachelines: 2, members: 6 */
              /* last cacheline: 16 bytes */
      };
      
      This in turn reduces size of other structures (on a release config):
      
      - async_chunk			 160 ->  152
      - async_submit_bio		 152 ->  144
      - btrfs_async_delayed_work	 104 ->   96
      - btrfs_caching_control		 176 ->  168
      - btrfs_delalloc_work		 144 ->  136
      - btrfs_fs_info			3608 -> 3600
      - btrfs_ordered_extent		 440 ->  424
      - btrfs_writepage_fixup		 104 ->   96
      Signed-off-by: David Sterba <dsterba@suse.com>
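      The 8-byte saving in the commit above is the size of one function
      pointer on a 64-bit build. The user-space sketch below mirrors the
      layout with placeholder types: fake_work_struct and fake_list_head
      stand in for the kernel types, and the name ordered_free for the
      removed member is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*work_fn)(void *);

/* Placeholders: 32 and 16 bytes on LP64, like the members shown above. */
struct fake_work_struct { void *slots[4]; };
struct fake_list_head { void *next, *prev; };

/* Layout before the change, with a separate freeing callback. */
struct work_before {
	work_fn func;
	work_fn ordered_func;
	work_fn ordered_free;	/* assumed name of the removed callback */
	struct fake_work_struct normal_work;
	struct fake_list_head ordered_list;
	void *wq;
	unsigned long flags;
};

/* Layout after the change, matching the six members listed above. */
struct work_after {
	work_fn func;
	work_fn ordered_func;
	struct fake_work_struct normal_work;
	struct fake_list_head ordered_list;
	void *wq;
	unsigned long flags;
};

/* Bytes saved per embedded work item. */
static size_t bytes_saved(void)
{
	return sizeof(struct work_before) - sizeof(struct work_after);
}
```

      A structure that embeds two such work items saves twice as much, which
      is consistent with btrfs_ordered_extent shrinking by 16 bytes in the
      list above.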
    • btrfs: add raid stripe tree to features enabled with debug config · e9b9b911
      Johannes Thumshirn authored
      Until the raid stripe tree code is well enough tested and feature
      complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
      want to use it are actually using it.
      
      The scrub support may still fail some tests (btrfs/060 and up) and will
      be fixed; RAID5/6 is not supported.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: add support for raid stripe tree · e0b4077f
      Johannes Thumshirn authored
      Add tree checker support for RAID stripe tree items, verify:
      
      - alignment
      - presence of the incompat bit
      - supported encoding
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
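      The three checks listed in the commit above could look roughly like
      this in user space. The sketch assumes the alignment in question is
      that of the item's logical start against the sector size; every name
      and parameter here is hypothetical and none is taken from the kernel
      source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum check_result {
	CHECK_OK,
	CHECK_UNALIGNED,
	CHECK_NO_INCOMPAT,
	CHECK_BAD_ENCODING,
};

/*
 * Validate one stripe tree item. 'sectorsize' must be a power of two and
 * 'max_encoding' is the highest encoding value the checker understands.
 */
static enum check_result check_stripe_item(uint64_t logical,
					   uint32_t sectorsize,
					   bool incompat_bit,
					   uint8_t encoding,
					   uint8_t max_encoding)
{
	if (logical & ((uint64_t)sectorsize - 1))
		return CHECK_UNALIGNED;
	if (!incompat_bit)
		return CHECK_NO_INCOMPAT;
	if (encoding > max_encoding)
		return CHECK_BAD_ENCODING;
	return CHECK_OK;
}
```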
    • btrfs: tracepoints: add events for raid stripe tree · b5e2c2ff
      Johannes Thumshirn authored
      Add trace events for raid-stripe-tree operations.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: announce presence of raid-stripe-tree · 9f9918a8
      Johannes Thumshirn authored
      If a filesystem with a raid-stripe-tree is mounted, show the RST feature
      in sysfs, currently still under the CONFIG_BTRFS_DEBUG option.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree pretty printer · edde81f1
      Johannes Thumshirn authored
      Decode raid-stripe-tree entries on btrfs_print_tree().
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: support RAID0/1/10 on top of raid stripe tree · 568220fa
      Johannes Thumshirn authored
      When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices
      for data block groups. For metadata block groups, we don't actually
      need anything special, as all metadata I/O is protected by the
      btrfs_zoned_meta_io_lock() already.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: implement raid stripe tree support · 9acaa641
      Johannes Thumshirn authored
      A filesystem that uses the raid stripe tree for logical to physical
      address translation can't use the regular scrub path, which reads all
      stripes and then checks if a sector is unused afterwards.
      
      When using the raid stripe tree, this will result in lookup errors, as
      the stripe tree doesn't know the requested logical addresses.
      
      In case we're scrubbing a filesystem which uses the RAID stripe tree for
      multi-device logical to physical address translation, perform an extra
      block mapping step to get the real on-disk stripe length from the stripe
      tree when scrubbing the sectors.
      
      This prevents a double completion of the btrfs_bio caused by splitting the
      underlying bio and ultimately a use-after-free.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lookup physical address from stripe extent · 10e27980
      Johannes Thumshirn authored
      Look up the physical address from the raid stripe tree when a read on a
      RAID volume formatted with the raid stripe tree is attempted.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delete stripe extent on extent deletion · ca41504e
      Johannes Thumshirn authored
      As each stripe extent is tied to an extent item, delete the stripe extent
      once the corresponding extent item is deleted.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add support for inserting raid stripe extents · 02c372e1
      Johannes Thumshirn authored
      Add support for inserting stripe extents into the raid stripe tree on
      completion of every write that needs an extra logical-to-physical
      translation when using RAID.
      
      Inserting the stripe extents happens after the data I/O has completed;
      this is done to
      
        a) support zone-append and
        b) rule out the possibility of a RAID-write-hole.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: read raid stripe tree from disk · 51502090
      Johannes Thumshirn authored
      If we find the raid-stripe-tree on mount, read it from disk. This is
      a backward incompatible feature. The rescue=ignorebadroots mount option
      will skip this tree.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add raid stripe tree definitions · ee129330
      Johannes Thumshirn authored
      Add definitions for the raid stripe tree. This tree will hold information
      about the on-disk layout of the stripes in a RAID set.
      
      Each stripe extent has a 1:1 relationship with an on-disk extent item and
      performs the logical to per-drive physical address translation for the
      extent item in question.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ee129330
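
The logical to per-drive physical translation described in the commit above can be illustrated with a small sketch. This is not the kernel's on-disk format or API: the struct names, fields, and the helper `stripe_extent_map()` are hypothetical, and the model assumes a mirror-style layout in which every stride covers the whole extent on its own device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified in-memory model; NOT the btrfs on-disk layout. */
struct stride {
	uint64_t devid;    /* device that holds this copy of the extent */
	uint64_t physical; /* byte offset of the copy on that device */
};

struct stripe_extent {
	uint64_t logical;  /* start of the extent in the logical address space */
	uint64_t length;   /* length of the extent in bytes */
	size_t nr_strides; /* number of entries in strides[] */
	const struct stride *strides;
};

/*
 * Translate a logical address inside the extent into a physical address on
 * the given device. Returns 0 on success, or -1 if the address lies outside
 * the extent or the extent has no stride on that device.
 */
static int stripe_extent_map(const struct stripe_extent *se, uint64_t logical,
			     uint64_t devid, uint64_t *physical)
{
	assert(se != NULL && physical != NULL);

	if (logical < se->logical || logical - se->logical >= se->length)
		return -1;

	for (size_t i = 0; i < se->nr_strides; i++) {
		if (se->strides[i].devid == devid) {
			/* Same offset into the extent, rebased per device. */
			*physical = se->strides[i].physical +
				    (logical - se->logical);
			return 0;
		}
	}
	return -1;
}
```

Keeping one record per extent, rather than per chunk, is what makes the 1:1 relationship with the extent item useful in this sketch: the lookup needs only the extent's own start and the per-device offsets.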
    • btrfs: warn on tree blocks which are not nodesize aligned · 6d3a6194
      Qu Wenruo authored
      A long time ago, we had some metadata chunks which started at a sector
      boundary but were not aligned to a nodesize boundary.
      
      This led to some older filesystems which can have tree blocks only
      aligned to sectorsize, but not nodesize.
      
      Later 'btrfs check' gained the ability to detect and warn about such tree
      blocks, and the kernel fixed the chunk allocation behavior, so nowadays
      those tree blocks should be pretty rare.
      
      But in the future, if we want to migrate metadata to folios, we cannot
      have such tree blocks, as filemap_add_folio() requires the page index to
      be aligned with the folio number of pages.  Such unaligned tree blocks
      can lead to VM_BUG_ON().
      
      So this patch adds an extra warning for those unaligned tree blocks, as
      preparation for the future folio migration.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d3a6194
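
The alignment condition that the commit above warns about can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper name `tree_block_is_nodesize_aligned()` is hypothetical, and it assumes nodesize is a non-zero power of two so that a mask test is equivalent to a modulo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a tree block starting at logical
 * address bytenr begins on a multiple of nodesize.
 */
static bool tree_block_is_nodesize_aligned(uint64_t bytenr, uint32_t nodesize)
{
	/* Assumption: nodesize is a non-zero power of two. */
	assert(nodesize != 0 && (nodesize & (nodesize - 1)) == 0);

	return (bytenr & ((uint64_t)nodesize - 1)) == 0;
}
```

For example, with a 4 KiB sectorsize and a 16 KiB nodesize, a tree block at byte 20480 (16384 + 4096) is sector aligned but not nodesize aligned, which is the case the warning targets.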
    • btrfs: don't arbitrarily slow down delalloc if we're committing · 11aeb97b
      Josef Bacik authored
      We have a random schedule_timeout() if the current transaction is
      committing, which seems to be a holdover from the original delalloc
      reservation code.
      
      Remove this. We have proper flushing, so we shouldn't be hoping for
      random timing to make everything work.  This just induces latency for
      no reason.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11aeb97b
    • btrfs: remove useless comment from btrfs_pin_extent_for_log_replay() · c967c19e
      Filipe Manana authored
      The comment on top of btrfs_pin_extent_for_log_replay() mentioning that
      the function must be called within a transaction is pointless as of
      commit 9fce5704 ("btrfs: Make btrfs_pin_extent_for_log_replay take
      transaction handle"), since the function now takes a transaction handle
      as its first argument. So remove the comment because it's completely
      useless now.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c967c19e
    • btrfs: remove stale comment from btrfs_free_extent() · df423ee2
      Filipe Manana authored
      A comment at btrfs_free_extent() mentions that the call to
      btrfs_pin_extent() unlocks the pinned mutex. However, that mutex is long
      gone: it was removed in 2009 by commit 04018de5 ("Btrfs: kill the
      pinned_mutex"). So just delete the comment.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      df423ee2
    • btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info · 87463f7e
      Christoph Hellwig authored
      Split the code handling a type DUP block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      87463f7e
    • btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info · 9e0e3e74
      Christoph Hellwig authored
      Split the code handling a type single block group from
      btrfs_load_block_group_zone_info to make the code more readable.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9e0e3e74