1. 12 Oct, 2023 40 commits
    • btrfs: add specific helper for range bit test exists · 99be1a66
      David Sterba authored
      The existing helper test_range_bit works in two ways: it checks if the whole
      range contains all the bits, or stops on the first occurrence.  By adding
      a specific helper for the latter case, the inner loop can be simplified
      and contains fewer conditionals, making it a bit faster.
      
      There's no caller that uses the cached state pointer so this reduces the
      argument count further.
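      The two modes described above can be pictured with a simplified model. This
      is only a sketch: it assumes a flat array of per-block bit masks instead of
      the kernel's extent state tree, and the names range_has_all_bits() and
      range_bit_exists() are hypothetical, not the helpers added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "All" mode: every element in [start, end] must carry all requested bits. */
static bool range_has_all_bits(const uint32_t *state, size_t start, size_t end,
			       uint32_t bits)
{
	assert(start <= end);
	for (size_t i = start; i <= end; i++) {
		if ((state[i] & bits) != bits)
			return false;
	}
	return true;
}

/* "Exists" mode: stop at the first element carrying any requested bit. */
static bool range_bit_exists(const uint32_t *state, size_t start, size_t end,
			     uint32_t bits)
{
	assert(start <= end);
	for (size_t i = start; i <= end; i++) {
		if (state[i] & bits)
			return true;
	}
	return false;
}
```

      Splitting the modes keeps each loop down to a single test per element, which
      is the kind of simplification the message refers to.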
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_realloc_node() from ctree.c into defrag.c · 6422b4cd
      Filipe Manana authored
      btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
      defrag.c file, so move it, and its helper close_blocks(), into defrag.c.
      
      During the move also do a few minor cosmetic changes:
      
      1) Change the return value of close_blocks() from int to bool;
      
      2) Use SZ_32K instead of 32768 at close_blocks();
      
      3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
         'end_slot';
      
      4) Get rid of 'parent_nritems' variable, in both places where it was
         used it could be replaced by calling btrfs_header_nritems(parent);
      
      5) Change the type of a couple variables from int to bool;
      
      6) Rename variable 'err' to 'ret', as that's the most common name we
         use to track the return value of a function;
      
      7) Move some variables from the top scope to the scope of the for loop
         where they are used.
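      Changes 1) and 2) can be sketched in userspace as follows. This is an
      illustration under assumptions, not the kernel source: SZ_32K is redefined
      locally with the value 32768 named above, and the parameter names are
      guesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_32K 0x8000u	/* local stand-in, same value as 32768 */

/*
 * Return true when the gap between two non-overlapping blocks is smaller
 * than 32K, i.e. the blocks are already close to each other on disk.
 */
static bool close_blocks(uint64_t blocknr, uint64_t other, uint32_t blocksize)
{
	assert(blocksize > 0);
	if (blocknr < other && other - (blocknr + blocksize) < SZ_32K)
		return true;
	if (blocknr > other && blocknr - (other + blocksize) < SZ_32K)
		return true;
	return false;
}
```

      Returning bool makes the predicate nature of the helper explicit to callers.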
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: export comp_keys() from ctree.c as btrfs_comp_keys() · 79d25df0
      Filipe Manana authored
      Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
      later patch we can move out defrag specific code from ctree.c into
      defrag.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename and export __btrfs_cow_block() · 95f93bc4
      Filipe Manana authored
      Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This
      allows moving defrag specific code out of ctree.c and into defrag.c in
      one of the next patches.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use round_down() to align block offset at btrfs_cow_block() · b8bf4e4d
      Filipe Manana authored
      At btrfs_cow_block() we can use round_down() to align the extent buffer's
      logical offset to the start offset of a metadata block group, instead of
      the less easy to read set of bitwise operations (two plus one subtraction).
      So replace the bitwise operations with a round_down() call.
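      The equivalence can be checked with a small userspace sketch. Two things
      here are assumptions rather than facts from the message: the 1GiB alignment
      (SZ_1G) and the function names. round_down_pow2() is a stand-in for the
      kernel's round_down() macro.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G 0x40000000ULL	/* assumed alignment, for illustration only */

/* Stand-in for round_down(); align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Open-coded form: one AND, one NOT and one subtraction. */
static uint64_t align_open_coded(uint64_t start)
{
	return start & ~(SZ_1G - 1);
}
```

      Both forms compute the same value; the named helper just states the intent.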
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove noinline attribute from btrfs_cow_block() · 7bff16e3
      Filipe Manana authored
      It's pointless to have the noinline attribute for btrfs_cow_block(), as the
      function is exported and widely used. So remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove incomplete metadata_uuid conversion fixup logic · 5966930d
      Anand Jain authored
      Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has
      stopped the assembly of devices with the CHANGING_FSID_V2 flag in the
      kernel. Such devices can be scanned but will not be registered and can't
      be mounted without a manual fix by btrfstune.  Remove the related logic
      and now unused code.
      
      The original motivation was to allow an interrupted partial conversion
      to fix itself on next mount, in case the system has to be rebooted. This
      is a convenience but brings a lot of complexity to the device scanning
      and the handling of partial states.  It's hard to estimate if this was
      ever needed in practice, given that the typical use case is a manual
      conversion of an unmounted filesystem where the user can verify the
      success and rerun it if needed.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add historical context ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reject devices with CHANGING_FSID_V2 · 197a9ece
      Anand Jain authored
      The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state
      where the device in the userspace btrfstune -m|-M operation failed to
      complete changing the fsid.
      
      This flag makes the kernel automatically determine the other
      partner devices to which a given device can be associated, based on the
      fsid, metadata_uuid and generation values.

      The btrfstune -m|-M feature is especially useful in virtual cloud setups, where
      compute instances (disk images) are quickly copied, fsid changed, and
      launched. Given numerous disk images with the same metadata_uuid but
      different fsid, there's no clear way a device can be correctly assembled
      with the proper partners when the CHANGING_FSID_V2 flag is set. So, the
      disk could be assembled incorrectly, as in the example below:
      
      Before this patch:
      
      Consider the following two filesystems:
         /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrfstune -m
      operation fails.

      In this scenario, /dev/loop0's fsid change is interrupted and the
      CHANGING_FSID_V2 flag is set, as shown below.
      
        $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags"
      
        $ btrfs inspect dump-super /dev/loop0 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop0
        flags			0x1000000001
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		9
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop1 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop1
        flags			0x1
        fsid			11d2af4d-1b71-45a9-83f6-f2100766939d
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		10
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
        $ btrfs inspect dump-super /dev/loop2 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop2
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop3 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop3
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
      It is normal that some devices aren't instantly discovered during
      system boot or iSCSI discovery. The controlled scan below demonstrates
      this.
      
        $ btrfs device scan --forget
        $ btrfs device scan /dev/loop0
        Scanning for btrfs filesystems on '/dev/loop0'
        $ mount /dev/loop3 /btrfs
        $ btrfs filesystem show -m
        Label: none  uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
      	Total devices 2 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 48.00MiB path /dev/loop0
      	devid    2 size 300.00MiB used 40.00MiB path /dev/loop3
      
      /dev/loop0 and /dev/loop3 are incorrectly partnered.
      
      This kernel patch removes functions and code connected to the
      CHANGING_FSID_V2 flag.
      
      With this patch, devices with the CHANGING_FSID_V2 flag are now rejected.
      And its partner will fail to mount with the extra -o degraded option.
      The check is removed from open_ctree(), devices are rejected during
      scanning which in turn fails the mount.
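      A scan-time rejection of this kind could look roughly like the sketch
      below. The bit position is taken from the dump above, where /dev/loop0
      shows flags 0x1000000001 against 0x1 on the other devices. The function
      name and the -EINVAL return value are assumptions made for illustration,
      not the kernel's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit 36: the extra bit in the 0x1000000001 flags value of /dev/loop0. */
#define SUPER_FLAG_CHANGING_FSID_V2	(1ULL << 36)

static_assert((SUPER_FLAG_CHANGING_FSID_V2 | 0x1ULL) == 0x1000000001ULL,
	      "flag value must match the dump-super output");

/* Hypothetical check: 0 accepts the device, -EINVAL rejects it. */
static int check_scanned_super_flags(uint64_t super_flags)
{
	if (super_flags & SUPER_FLAG_CHANGING_FSID_V2)
		return -EINVAL;
	return 0;
}
```

      With such a check in place, /dev/loop0 from the example would never be
      registered, so it could not be partnered with /dev/loop3.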
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: constify parameters where possible · ab7c8bbf
      David Sterba authored
      Lots of the functions in relocation.c don't change pointer parameters
      but lack the annotations. Add them and reformat according to current
      coding style if needed.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: return bool from btrfs_should_ignore_reloc_root · 32f2abca
      David Sterba authored
      btrfs_should_ignore_reloc_root() is a predicate so it should return
      bool.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch btrfs_backref_cache::is_reloc to bool · c71d3c69
      David Sterba authored
      The btrfs_backref_cache::is_reloc is an indicator variable and should
      use a bool type.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: open code mapping_tree_init · 733fa44d
      David Sterba authored
      There's only one user of mapping_tree_init, so we don't need a helper for
      the simple initialization.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: switch bitfields to bool in reloc_control · d23d42e3
      David Sterba authored
      Use bool types for the indicators instead of bitfields. The structure
      size slightly grows but the new types are placed within the padding.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: use enum for stages · 8daf07cf
      David Sterba authored
      Add an enum type for data relocation stages.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: use more natural types for tree_block bitfields · a3bb700f
      David Sterba authored
      We don't need to use bitfields for tree_block::level and
      tree_block::key_ready, there's enough padding in the structure for
      proper types.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_defrag_root() to defrag.{c,h} · 1723270f
      Filipe Manana authored
      The btrfs_defrag_root() function does not really belong in the
      transaction.{c,h} module and as we have a defrag.{c,h} nowadays,
      move it there instead. This also means btrfs_defrag_leaves() no longer
      needs to be exported, so we can make it static.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename info to fs_info for consistency ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from fixup_inode_link_count() · 8befc61c
      Filipe Manana authored
      The root argument for fixup_inode_link_count() always matches the root of
      the given inode, so remove the root argument and get it from the inode
      argument. This also applies to the helpers count_inode_extrefs() and
      count_inode_refs() used by fixup_inode_link_count() - they don't need the
      root argument, as it always matches the root of the inode passed to them.
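      The shape of this refactoring, which recurs in the following commits, can
      be sketched with simplified stand-in types. The struct layouts and the
      function names below are hypothetical and only illustrate why the extra
      argument carries no information.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's root and inode structures. */
struct demo_root {
	uint64_t objectid;
};

struct demo_inode {
	struct demo_root *root;
};

/* Before: the caller passes a root that must match inode->root anyway. */
static uint64_t inode_root_id_before(const struct demo_root *root,
				     const struct demo_inode *inode)
{
	assert(root == inode->root);
	return root->objectid;
}

/* After: the root is derived from the inode, so it cannot get out of sync. */
static uint64_t inode_root_id_after(const struct demo_inode *inode)
{
	const struct demo_root *root = inode->root;

	return root->objectid;
}
```

      Dropping the argument shortens every call site and removes the possibility
      of passing a mismatched root.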
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from maybe_insert_hole() · 0a325e62
      Filipe Manana authored
      The root argument for maybe_insert_hole() always matches the root of the
      given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_delayed_update_inode() · 04bd8e94
      Filipe Manana authored
      The root argument for btrfs_delayed_update_inode() always matches the root
      of the given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode_item() · 07a274a8
      Filipe Manana authored
      The root argument for btrfs_update_inode_item() always matches the root of
      the given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode() · 8b9d0322
      Filipe Manana authored
      The root argument for btrfs_update_inode() always matches the root of the
      given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode_fallback() · 0a5d0dc5
      Filipe Manana authored
      The root argument for btrfs_update_inode_fallback() always matches the
      root of the given inode, so remove the root argument and get it from the
      inode argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove noinline from btrfs_update_inode() · cddaaacc
      Filipe Manana authored
      The noinline attribute of btrfs_update_inode() is pointless as the
      function is exported and widely used, so remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify error check condition at btrfs_dirty_inode() · 2199cb0f
      Filipe Manana authored
      The following condition at btrfs_dirty_inode() is redundant:
      
        if (ret && (ret == -ENOSPC || ret == -EDQUOT))
      
      The first check for a non-zero 'ret' value is pointless; we can simplify
      this to:
      
        if (ret == -ENOSPC || ret == -EDQUOT)
      
      Not only does this make it easier to read, it also slightly reduces the
      text size of the btrfs kernel module:
      
        $ size fs/btrfs/btrfs.ko.before
           text	   data	    bss	    dec	    hex	filename
        1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before
      
        $ size fs/btrfs/btrfs.ko.after
           text	   data	    bss	    dec	    hex	filename
        1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after
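      The equivalence of the two conditions follows from both error codes being
      non-zero, which the userspace sketch below spells out. The function names
      are hypothetical wrappers around the two expressions quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Both codes are non-zero, so "ret &&" can never change the outcome. */
static_assert(ENOSPC != 0 && EDQUOT != 0, "errno codes must be non-zero");

/* Original condition, with the redundant non-zero check. */
static bool is_space_error_old(int ret)
{
	return ret && (ret == -ENOSPC || ret == -EDQUOT);
}

/* Simplified condition. */
static bool is_space_error_new(int ret)
{
	return ret == -ENOSPC || ret == -EDQUOT;
}
```

      For every value of ret, including 0, the two functions return the same
      result.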
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroups · e0761451
      Boris Burkov authored
      In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
      quota_root, as opposed to after we are done setting up the qgroup
      structures. In the quota_enable path, we wait until after the structures
      are set up. Likewise, in disable, we clear the bit before tearing down
      the structures. I feel that this organization is less surprising for the
      open_ctree path.
      
      I don't believe this fixes any actual bug, but it avoids potential
      confusion when using btrfs_qgroup_mode in an intermediate state where we
      are enabled but haven't yet set up the qgroup status flags. It also
      avoids any risk of calling a qgroup function and attempting to use the
      qgroup rbtrees before they exist and are set up.
      
      This all occurs before we do rw setup, so I believe it should be mostly
      a no-op.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track data relocation with simple quota · 2672a051
      Boris Burkov authored
      Relocation data allocations are quite tricky for simple quotas. The
      basic data relocation sequence is (ignoring details that aren't relevant
      to this fix):
      
      - create a fake relocation data fs root
      - create a fake relocation inode in that root
      - for each data extent:
        - preallocate a data extent on behalf of the fake inode
        - copy over the data
      - for each extent
        - swap the refs so that the original file extent now refers to the new
          extent item
      - drop the fake root, dropping its refs on the old extents, which lets
        us delete them.
      
      Done naively, this results in storing an extent item in the extent tree
      whose owner_ref points at the relocation data root and a no-op squota
      recording, since the reloc root is not a legit fstree. So far, that's
      OK. The problem comes when you do the swap, and leave an extent item
      owned by this bogus root as the real permanent extents of the file. If
      the file then drops that ref, we free it and no-op account that against
      the fake relocation root. Essentially, this means that relocation is
      simple quota "extent laundering", since we re-own the extents into a
      fake root.
      
      Simple quotas very intentionally don't have a mechanism for
      transferring ownership of extents, as that is exactly the complicated
      thing we are trying to avoid with the new design. Further, it cannot be
      correctly done in this case, since at the time you create the new
      "real" refs, there is no way to know which was the original owner before
      relocation unless we track it.
      
      Therefore, it makes more sense to trick the preallocation to handle
      relocation as a special case and note the proper owner ref from the
      beginning. That way, we never write out an extent item without the
      correct owner ref that it will eventually have.
      
      This could be done by wiring a special root parameter all the way
      through the allocation code path, but to avoid that special case
      touching all the code, take advantage of the serial nature of relocation
      to store the src root on the relocation root object. Then when we finish
      the prealloc, if it happens to be this case, prepare the delayed ref
      appropriately.
      
      We must also add logic to handle relocating adjacent extents with
      different owning roots. Those cannot be preallocated together in a
      cluster as it would lose the separate ownership information.
      
      This is obviously a smelly bit of code, but I think it is the best
      solution to the problem, given the relocation implementation.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: track metadata relocation COW with simple quota · 60ea105a
      Boris Burkov authored
      Relocation COWs metadata blocks in two cases for the reloc root:
      
      - copying the subvolume root item when creating the reloc root
      - copying a btree node when there is a COW during relocation
      
      In both cases, the resulting btree node hits an abnormal code path with
      respect to the owner field in its btrfs_header. It first creates the
      root item for the new objectid, which populates the reloc root id, and
      it is at this point that delayed refs are created.
      
      Later, it fully copies the old node into the new node (including the
      original owner field) which overwrites it. This results in a simple
      quotas mismatch where we run the delayed ref for the reloc root which
      has no simple quota effect (reloc root is not an fstree) but when we
      ultimately delete the node, the owner is the real original fstree and we
      do free the space.
      
      To work around this without tampering with the behavior of relocation,
      add a parameter to btrfs_add_tree_block that lets the relocation code
      path specify a different owning root than the "operating" root (in this
      case, owning root is the real root and the operating root is the reloc
      root). These can naturally be plumbed into delayed refs that have the
      same concept.
      
      Note that this is a double count in some sense, but a relatively natural
      one, as there are really two extents, and the old one will be deleted
      soon. This is consistent with how data relocation extents are accounted
      by simple quotas.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: check generation when recording simple quota delta · bd7c1ea3
      Boris Burkov authored
      Simple quotas count extents only from the moment the feature is enabled.
      Therefore, if we do something like:
      
      1. create subvol S
      2. write F in S
      3. enable quotas
      4. remove F
      5. write G in S
      
      then after 3. and 4. we would expect the simple quota usage of S to be 0
      (putting aside some metadata extents that might be written) and after
      5., it should be the size of G plus metadata. Therefore, we need to be
      able to determine whether a particular quota delta we are processing
      predates simple quota enablement.
      
      To do this, store the transaction id when quotas were enabled. In
      fs_info for immediate use and in the quota status item to make it
      recoverable on mount. When we see a delta, check if the generation of
      the extent item is less than that of quota enablement. If so, we should
      ignore the delta from this extent.
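      The check described above can be sketched as follows. The helper names and
      the signed byte counters are assumptions made for illustration; the
      transaction ids in the usage note are made up as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the extent was created before simple quotas were enabled. */
static bool delta_predates_enablement(uint64_t extent_gen, uint64_t enable_gen)
{
	return extent_gen < enable_gen;
}

/* Apply a usage delta unless the extent predates quota enablement. */
static int64_t apply_squota_delta(int64_t usage, int64_t delta,
				  uint64_t extent_gen, uint64_t enable_gen)
{
	if (delta_predates_enablement(extent_gen, enable_gen))
		return usage;
	/* Tracked usage is never expected to drop below zero. */
	assert(usage + delta >= 0);
	return usage + delta;
}
```

      In the example sequence, if F was written in transaction 5 and quotas were
      enabled in transaction 7, removing F is ignored and the usage of S stays
      at 0, while G written in transaction 9 is counted.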
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: simple quota auto hierarchy for nested subvolumes · 5343cd93
      Boris Burkov authored
      Consider the following sequence:
      
      - enable quotas
      - create subvol S id 256 at dir outer/
      - create a qgroup 1/100
      - add 0/256 (S's auto qgroup) to 1/100
      - create subvol T id 257 at dir outer/inner/
      
      With full qgroups, there is no relationship between 0/257 and either of
      0/256 or 1/100. There is an inherit feature that the creator of inner/
      can use to specify it ought to be in 1/100.
      
      Simple quotas are targeted at container isolation, where such automatic
      inheritance for not necessarily trusted/controlled nested subvol
      creation would be quite helpful. Therefore, add a new default behavior
      for simple quotas: when you create a nested subvol, automatically
      inherit as parents any parents of the qgroup of the subvol the new inode
      is going in.
      
      In our example, 0/257 would also be under 1/100, allowing easy control
      of a total quota over an arbitrary hierarchy of subvolumes.
      
      I think this _might_ be a generally useful behavior, so it could be
      interesting to put it behind a new inheritance flag that simple quotas
      always use while traditional quotas let the user specify, but this is a
      minimally intrusive change to start.
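      The inheritance rule can be modelled with a toy data structure. This is a
      sketch under assumptions: the fixed-size parent array, the struct layout
      and the function names are hypothetical, and the level/id packing (level in
      the upper 16 bits of a 64-bit qgroup id) is stated here as an assumption
      about the encoding rather than taken from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QGROUP_LEVEL_SHIFT	48
#define MAX_PARENTS		4

/* Pack "level/id" notation (e.g. 1/100) into one 64-bit qgroup id. */
static uint64_t make_qgroupid(uint16_t level, uint64_t id)
{
	assert(id < (1ULL << QGROUP_LEVEL_SHIFT));
	return ((uint64_t)level << QGROUP_LEVEL_SHIFT) | id;
}

struct demo_qgroup {
	uint64_t id;
	size_t nr_parents;
	uint64_t parents[MAX_PARENTS];
};

/* Give the new subvolume's qgroup every parent of the containing one. */
static void inherit_parents(struct demo_qgroup *child,
			    const struct demo_qgroup *containing)
{
	assert(containing->nr_parents <= MAX_PARENTS);
	child->nr_parents = containing->nr_parents;
	for (size_t i = 0; i < containing->nr_parents; i++)
		child->parents[i] = containing->parents[i];
}
```

      Applied to the example, the qgroup of T ends up with 1/100 as a parent
      because the qgroup of S, the subvolume T is created in, has 1/100 as a
      parent.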
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: record simple quota deltas in delayed refs · cecbb533
      Boris Burkov authored
      At the moment that we run delayed refs, we make the final ref-count
      based decision on creating/removing extent (and metadata) items.
      Therefore, it is exactly the spot to hook up simple quotas.
      
      There are a few important subtleties to the fields we must collect to
      accurately track simple quotas, particularly when removing an extent.
      When removing a data extent, the ref could be in any tree (due to
      reflink, for example) and so we need to recover the owning root id from
      the owner ref item. When removing a metadata extent, we know the owning
      root from the owner field in the header when we create the delayed ref,
      so we can recover it from there.
      
      We must also be careful to handle reservations properly so as not to leak
      reserved space. The happy path is freeing the reservation when the
      simple quota delta runs on a data extent. If that doesn't happen, due to
      refs canceling out or some error, the ref head already has the
      must_insert_reserved machinery to handle this, so we piggy back on that
      and use it to clean up the reserved data.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper for inline owner ref lookup · 8d299091
      Boris Burkov authored
      Inline ref parsing is a bit tricky and relies on a decent amount of
      implicit information, so I think it is beneficial to have a helper
      function for reading the owner ref, if only to "document" the format,
      along with the write path.
      
      The main subtlety of note which I was missing by open-coding this was
      that it is important to check whether or not inline refs are present
      *at all*. i.e., if we are writing out a new extent under squotas, we
      will always use a big enough item for the inline ref and have it.
      However, it is possible that some random item predating squotas will not
      have any inline refs. In that case, trying to read the "type" field of
      the first inline ref will just be reading garbage in the form of
      whatever is in the next item.
      
      This will be used by the extent free-ing path, which looks up data
      extent owners as well as a relocation path which needs to grab the owner
      before relocating an extent.
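      The subtlety about checking for the presence of inline refs can be shown
      on a toy item layout. Everything concrete here is an assumption made for
      illustration: a fixed 24-byte extent item header, 9-byte inline refs (one
      type byte plus an 8-byte value in host byte order), the type value 172 and
      the function name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXTENT_ITEM_HDR_SIZE	24u	/* assumed fixed header size */
#define INLINE_REF_SIZE		9u	/* 1 type byte + 8 value bytes */
#define OWNER_REF_TYPE		172u	/* illustrative type value */

/* Read the owner ref if, and only if, the item has room for an inline ref. */
static bool lookup_owner_ref(const uint8_t *item, size_t item_size,
			     uint64_t *owner)
{
	assert(item_size >= EXTENT_ITEM_HDR_SIZE);
	/* No inline refs at all: a "type" read here would hit the next item. */
	if (item_size < EXTENT_ITEM_HDR_SIZE + INLINE_REF_SIZE)
		return false;
	if (item[EXTENT_ITEM_HDR_SIZE] != OWNER_REF_TYPE)
		return false;
	memcpy(owner, item + EXTENT_ITEM_HDR_SIZE + 1, sizeof(*owner));
	return true;
}
```

      The size comparison comes first on purpose: an item predating squotas may
      end right after the header, and only the item size can reveal that.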
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: new inline ref storing owning subvol of data extents · d9a620f7
      Boris Burkov authored
      In order to implement simple quota groups, we need to be able to
      associate a data extent with the subvolume that created it. Once you
      account for reflink, this information cannot be recovered without
      explicitly storing it. Options for storing it are:
      
      - a new key/item
      - a new extent inline ref item
      
      The former is backwards compatible, but wastes space; the latter is
      incompat, but is efficient in space and reuses the existing inline ref
      machinery, while only abusing it a tiny amount -- specifically, the new
      item is not a ref, per-se.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track original extent owner in head_ref · cf79ac47
      Boris Burkov authored
      Simple quotas require tracking the original creating root of any given
      extent. This gets complicated when multiple subvolumes create
      overlapping/contradictory refs in the same transaction, for example
      when modifying or deleting an extent while also snapshotting it.
      
      To resolve this in a general way, take advantage of the fact that we are
      essentially already tracking this for handling releasing reservations.
      The head ref coalesces the various refs and uses must_insert_reserved to
      check if it needs to create an extent/free reservation. Store the ref
      that set must_insert_reserved as the owning ref on the head ref.
      
      Note that this can result in writing an extent for the very first time
      with an owner different from its only ref, but it will look the same as
      if you first created it with the original owning ref, then added the
      other ref, then removed the owning ref.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cf79ac47
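The coalescing rule described in "btrfs: track original extent owner in head_ref" can be illustrated with a toy model. This is only a sketch under stated assumptions: `DelayedRef`, `HeadRef`, `add` and `surviving_roots` are hypothetical names invented for the example and do not mirror the kernel's structures. The only idea taken from the commit message is that the ref which sets must_insert_reserved is remembered as the owner.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DelayedRef:
    root: int                           # referring root (hypothetical field)
    action: str                         # "add" or "drop"
    must_insert_reserved: bool = False  # set by the ref that allocates the extent


@dataclass
class HeadRef:
    refs: List[DelayedRef] = field(default_factory=list)
    must_insert_reserved: bool = False
    owning_root: Optional[int] = None

    def add(self, ref: DelayedRef) -> None:
        self.refs.append(ref)
        # The first ref that sets must_insert_reserved becomes the owner.
        if ref.must_insert_reserved and not self.must_insert_reserved:
            self.must_insert_reserved = True
            self.owning_root = ref.root

    def surviving_roots(self) -> List[int]:
        # Roots that still reference the extent once adds and drops cancel.
        live: List[int] = []
        for ref in self.refs:
            if ref.action == "add":
                live.append(ref.root)
            elif ref.root in live:
                live.remove(ref.root)
        return live


# Root 256 creates the extent, root 257 snapshots it, and root 256 then
# drops its ref, all within one (modelled) transaction.
head = HeadRef()
head.add(DelayedRef(root=256, action="add", must_insert_reserved=True))
head.add(DelayedRef(root=257, action="add"))
head.add(DelayedRef(root=256, action="drop"))
```

In this model the extent ends up with root 257 as its only ref while the recorded owner is still root 256, which is the situation the note at the end of the commit message describes.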
    • btrfs: track owning root in btrfs_ref · 457cb1dd
      Boris Burkov authored
      While data extents require us to store additional inline refs to track
      the original owner on free, this information is available implicitly for
      metadata. It is found in the owner field of the header of the tree
      block. Even if other trees refer to this block and the original ref goes
      away, we will not rewrite that header field, so it will reliably give the
      original owner.
      
      In addition, there is a relocation case where a new data extent needs to
      have an owning root separate from the referring root wired through
      delayed refs.
      
      To use it for recording simple quota deltas, we need to wire this root
      id through from when we create the delayed ref until we fully process
      it. Store it in the generic btrfs_ref struct of the delayed ref.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      457cb1dd
    • btrfs: rename tree_ref and data_ref owning_root · 610647d7
      Boris Burkov authored
      commit 113479d5 ("btrfs: rename root fields in delayed refs structs")
      changed these from ref_root to owning_root. However, there are many
      circumstances where that name is not really accurate and the root on the
      ref struct _is_ the referring root. In general, these are not the owning
      root, though it does happen in some ref merging cases involving
      overwrites during snapshots and similar.
      
      Simple quotas cares quite a bit about tracking the original owner of an
      extent through delayed refs, so rename these back to free up the name
      for the real owning root (which will live on the generic btrfs_ref and
      the head ref).
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      610647d7
    • btrfs: add helper for recording simple quota deltas · 1e0e9d57
      Boris Burkov authored
      Rather than re-computing shared/exclusive ownership based on backrefs
      and walking roots for implicit backrefs, simple quotas does an increment
      when creating an extent and a decrement when deleting it. Add the API
      for the extent item code to use to track those events.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e0e9d57
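The accounting rule from "btrfs: add helper for recording simple quota deltas" (increment on extent creation, decrement on deletion, nothing in between) can be sketched as a small model. Everything here is an assumption made for illustration: `SimpleQuotaModel` and its methods are hypothetical and are not the kernel API.

```python
from collections import defaultdict


class SimpleQuotaModel:
    """Toy model: usage changes only when an extent is created or deleted."""

    def __init__(self):
        self.usage = defaultdict(int)  # owning root id -> bytes charged
        self.owner = {}                # extent bytenr -> (owning root, num_bytes)

    def record_delta(self, root, num_bytes, is_inc):
        # Hypothetical stand-in for the delta-recording helper.
        self.usage[root] += num_bytes if is_inc else -num_bytes

    def create_extent(self, bytenr, num_bytes, root):
        self.owner[bytenr] = (root, num_bytes)
        self.record_delta(root, num_bytes, True)

    def add_ref(self, bytenr, root):
        # Sharing an extent (reflink, snapshot) changes no accounting here.
        pass

    def delete_extent(self, bytenr):
        # The decrement is charged to the original owner.
        root, num_bytes = self.owner.pop(bytenr)
        self.record_delta(root, num_bytes, False)


model = SimpleQuotaModel()
model.create_extent(bytenr=4096, num_bytes=8192, root=256)
model.add_ref(4096, root=257)
usage_after_share = dict(model.usage)
model.delete_extent(4096)
usage_after_delete = dict(model.usage)
```

Because the second root is never charged, no backref walk is needed to decide who pays for the extent; the cost is that the model has no notion of shared versus exclusive usage.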
    • btrfs: create qgroup earlier in snapshot creation · 6ed05643
      Boris Burkov authored
      Create the qgroup earlier during snapshot creation. This allows simple
      quotas qgroups to see all the metadata writes related to the snapshot
      being created and to be born with the root node accounted.
      
      Note that this has an impact on transaction commit, where the qgroup
      creation can do a lot of work, allocate memory and take locks. The
      change is done for correctness; potential performance issues will be
      fixed in the future.
      Signed-off-by: Boris Burkov <boris@bur.io>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ed05643
    • btrfs: qgroup: flush reservations during quota disable · af0e2aab
      Boris Burkov authored
      The following sequence:
      
        enable simple quotas
        do some writes
            reserve space
            create ordered_extent
                release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
        disable quotas
        enable simple quotas
            set qgroup rsv to 0 on all subvolumes
        ordered_extent finishes
            create delayed ref with rsv_bytes from before
        run delayed ref
            record_simple_quota_delta
                free rsv_bytes (0 -> -rsv_delta)
      
      results in us reliably underflowing the subvolume's qgroup rsv counter,
      because disabling/re-enabling quotas toggles reservation counters down
      to 0, but does not remove other file system state which represents
      successful acquisition of qgroup rsv space. Specifically, that state is
      the metadata rsv counters on the root object and rsv_bytes on
      ordered_extent objects that have released their reservation, as well as
      the corresponding QGROUP_RESERVED extent bits.
      
      Normal qgroups gets away with this, I believe because it forces more
      work to happen on transaction commit, but I am not certain it is totally
      safe from the ordered_extent/leaked extent bit variant. Simple quotas
      hits this reliably.
      
      The intent of the fix is to make disable take the time to clear that
      state external to qgroups as well: after flipping off the quota bit on
      fs_info, flush delalloc and ordered extents, clearing the extent bits
      along the way. This ensures that no ordered extents or meta prealloc
      from the first enablement period are still around during the second.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af0e2aab
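The underflow sequence from "btrfs: qgroup: flush reservations during quota disable" can be replayed in a toy model. This is a heavily simplified sketch: `Subvol`, `write`, `disable_quotas` and the other names are hypothetical, a single integer stands in for the qgroup rsv counter, and a plain list stands in for the rsv_bytes carried by ordered extents.

```python
class Subvol:
    def __init__(self):
        self.qgroup_rsv = 0        # reserved bytes tracked by the qgroup
        self.ordered_extents = []  # rsv_bytes remembered by in-flight ordered extents


def enable_quotas(subvol):
    # Enabling starts every subvolume at a reservation of 0.
    subvol.qgroup_rsv = 0


def write(subvol, nbytes):
    subvol.qgroup_rsv += nbytes            # reserve space
    subvol.ordered_extents.append(nbytes)  # ordered extent stores rsv_bytes


def disable_quotas(subvol, flush):
    if flush:
        # Modelled fix: flush ordered extents while quotas are off, so no
        # stale rsv_bytes survive into the next enablement period.
        subvol.ordered_extents.clear()


def finish_ordered_extents(subvol):
    # Running the delayed ref frees the rsv_bytes stored earlier.
    while subvol.ordered_extents:
        subvol.qgroup_rsv -= subvol.ordered_extents.pop()


def run_sequence(flush):
    subvol = Subvol()
    enable_quotas(subvol)
    write(subvol, 4096)
    disable_quotas(subvol, flush)
    enable_quotas(subvol)
    finish_ordered_extents(subvol)
    return subvol.qgroup_rsv


without_flush = run_sequence(flush=False)
with_flush = run_sequence(flush=True)
```

Without the flush, the counter is reset to 0 and then decremented by a reservation taken in the earlier enablement period, so it goes negative. With the flush, nothing is left over to free.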
    • btrfs: sysfs: add simple_quota incompat feature entry · a744986a
      Boris Burkov authored
      Add an entry in the features directory for the new incompat flag.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a744986a
    • btrfs: sysfs: expose quota mode via sysfs · 0182764a
      Boris Burkov authored
      Add a new sysfs file /sys/fs/btrfs/<uuid>/qgroups/mode
      which prints out the mode qgroups is running in. The possible modes are
      qgroup and squota.
      
      If quotas are not enabled, then the qgroups directory will not exist,
      so don't handle that mode.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0182764a
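A userspace reader for the file added in "btrfs: sysfs: expose quota mode via sysfs" might look like the sketch below. The path layout /sys/fs/btrfs/<uuid>/qgroups/mode and the two mode strings come from the commit message. The function name `read_quota_mode` and the `sysfs_root` parameter are hypothetical, and returning None for a missing qgroups directory is a choice made for this example.

```python
from pathlib import Path


def read_quota_mode(fs_uuid, sysfs_root="/sys/fs/btrfs"):
    """Return the quota mode string, or None when no mode file exists."""
    mode_file = Path(sysfs_root) / fs_uuid / "qgroups" / "mode"
    try:
        # Expected contents are "qgroup" or "squota" plus a trailing newline.
        return mode_file.read_text().strip()
    except (FileNotFoundError, NotADirectoryError):
        # With quotas disabled the qgroups directory is absent, so there is
        # no file to read.
        return None
```

Keeping `sysfs_root` overridable lets the function be exercised against a temporary directory instead of a mounted filesystem.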