1. 12 Oct, 2023 40 commits
    • btrfs: use extent_io_tree_release() to empty dirty log pages · 0f8ac74d
      Filipe Manana authored
      When freeing a log tree, during a transaction commit, we clear its dirty
      log pages io tree by calling clear_extent_bits() using a range from 0 to
      (u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
      each node before freeing it, which will often trigger rebalance operations
      on the rbtree. A better alternative is to use extent_io_tree_release(),
      which does not delete nodes one by one and so does not trigger rebalances.
      
      So use extent_io_tree_release() instead of clear_extent_bits().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make tree iteration in extent_io_tree_release() more efficient · 63ffc1f7
      Filipe Manana authored
      Currently extent_io_tree_release() is a loop that keeps getting the first
      node in the io tree, using rb_first() which is a loop that gets to the
      leftmost node of the rbtree, and then for each node it calls rb_erase(),
      which often requires rebalancing the rbtree.
      
      We can make this more efficient by using
      rbtree_postorder_for_each_entry_safe() to free each node without having
      to delete it from the rbtree and without looping to get the first node.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
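As a rough illustration of the post-order idea described in this commit, here is a user-space C sketch. It does not use the kernel's rbtree API: `struct node`, `tree_insert()`, `next_postorder()` and `tree_release()` are hypothetical stand-ins working on a plain unbalanced binary search tree. What it tries to show is that visiting children before their parent lets every node be freed directly, without erasing it from the tree first and without repeatedly searching for the first node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a record kept in a binary search tree. */
struct node {
	struct node *left;
	struct node *right;
	struct node *parent;
	unsigned long start;
};

/* Plain unbalanced insert, only used to build a tree for the example. */
static struct node *tree_insert(struct node **root, unsigned long start)
{
	struct node **link = root;
	struct node *parent = NULL;
	struct node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	while (*link) {
		parent = *link;
		link = (start < parent->start) ? &parent->left : &parent->right;
	}
	n->start = start;
	n->parent = parent;
	*link = n;
	return n;
}

/* Descend to a leaf, preferring the left child and then the right one. */
static struct node *leftmost_leaf(struct node *n)
{
	for (;;) {
		if (n->left)
			n = n->left;
		else if (n->right)
			n = n->right;
		else
			return n;
	}
}

/* Next node in post-order: children are always visited before their parent. */
static struct node *next_postorder(const struct node *n)
{
	struct node *parent = n->parent;

	if (parent && n != parent->right && parent->right)
		return leftmost_leaf(parent->right);
	return parent;
}

/* Free every node without unlinking any of them; returns the number freed. */
static size_t tree_release(struct node **root)
{
	struct node *n;
	size_t freed = 0;

	assert(root != NULL);
	n = *root ? leftmost_leaf(*root) : NULL;
	while (n) {
		/* Look up the successor before freeing the current node. */
		struct node *next = next_postorder(n);

		free(n);
		freed++;
		n = next;
	}
	*root = NULL;
	return freed;
}
```

Because a parent is only reached after both of its subtrees are gone, the successor lookup never dereferences freed memory, and the tree shape is never modified while it is being torn down.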
    • btrfs: collapse wait_on_state() to its caller wait_extent_bit() · df2a8e70
      Filipe Manana authored
      The wait_on_state() function is very short and has a single caller, which
      is wait_extent_bit(), so remove the function and put its code into the
      caller.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant memory barrier from extent_io_tree_release() · 28967c76
      Filipe Manana authored
      The memory barrier at extent_io_tree_release() is redundant. Holding
      spin_lock here is not enough to drop the barrier completely.  We only
      change the waitqueue of an extent state record while holding the tree
      lock - see wait_on_state().
      
      The update to the waitqueue state will not become stale because there
      is a spin_unlock()/spin_lock() sequence between the change and the
      wait, which implies a full memory barrier.
      
      So remove the explicit smp_mb() barrier.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reword reasoning ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make wait_extent_bit() static · a1c20d15
      Filipe Manana authored
      The function wait_extent_bit() is not used outside extent-io-tree.c so
      make it static. Furthermore the function doesn't have the 'btrfs_' prefix.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update stale comment at extent_io_tree_release() · bea22a58
      Filipe Manana authored
      There's this comment at extent_io_tree_release() that mentions io btrees,
      but this function is no longer used only for io btrees. Originally it was
      added as a static function named clear_btree_io_tree() at transaction.c,
      in commit 663dfbb0 ("Btrfs: deal with convert_extent_bit errors to
      avoid fs corruption"), as it was used only for cleaning one of the io
      trees that track dirty extent buffers, the dirty_log_pages io tree of a
      root and the dirty_pages io tree of a transaction. Later it was renamed
      and exported and now it's used to clean up other io trees such as the
      allocation state io tree of a device or the csums range io tree of a log
      root.
      
      So remove that comment and replace it with one at the top of the function
      that is more complete, mentioning what the function does and that it's
      expected to be called only when a task is sure no one else will need to
      use the tree anymore, and that there should be no locked ranges in the
      tree and therefore no waiters on its extent state records. Also add an
      assertion to check that there are no locked extent state records in the
      tree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make extent state merges more efficient during insertions · c91ea4bf
      Filipe Manana authored
      When inserting a new extent state record into an io tree that happens to
      be mergeable, we currently do the following:
      
      1) Insert the extent state record in the io tree's rbtree. This requires
         going down the tree to find where to insert it, and during the
         insertion we often need to balance the rbtree;
      
      2) We then check if the previous node is mergeable, so we call rb_prev()
         to find it, which requires some looping to find the previous node;
      
      3) If the previous node is mergeable, we adjust our node to include the
         range of the previous node and then delete the previous node from the
         rbtree, which again may need to balance the rbtree;
      
      4) Then we check if the next node is mergeable with the node we inserted,
         so we call rb_next(), which requires some looping too. If the next node
         is indeed mergeable, we expand the range of our node to include the
         next node's range and then delete the next node from the rbtree, which
         again may need to balance the tree.
      
      So that is quite a lot of iterating and looping over the rbtree, and
      some of the operations may need to rebalance the rbtree. This can be made
      a bit more efficient by:
      
      1) When iterating the rbtree, once we find a node that is mergeable with
         the node we want to insert, we can just adjust that node's range with
         the range of the node to insert - this avoids continuing iterating
         over the tree and deleting a node from the rbtree;
      
      2) If we expand the range of a mergeable node, then we find the next or
         the previous node, depending on whether we merged a range to the right
         or to the left of the node we are currently at during the iteration.
         This merging is as before: we find the next or previous node with
         rb_next() or rb_prev() and if that other node is mergeable with the
         current one, we adjust the range of the current node and remove the
         other node from the rbtree;
      
      3) Whenever we need to insert the new extent state record it's because
         we don't have any extent state record in the rbtree which can be
         merged, so we can remove the call to merge_state() after the insertion,
         saving rb_next() and rb_prev() calls, which require some looping.
      
      So update the insertion function insert_state() to have this behaviour.
      
      Running dbench for 120 seconds and capturing the execution times of
      set_extent_bit() at pin_down_extent(), resulted in the following data
      (time values are in nanoseconds):
      
      Before this change:
      
        Count: 2278299
        Range:  0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
        Percentiles:  90th: 1187.000; 95th: 1350.000; 99th: 1724.000
             0.000 -       7.534:       5 |
             7.534 -      35.418:      36 |
            35.418 -     154.403:     273 |
           154.403 -     662.138: 1244016 #####################################################
           662.138 -    2828.745: 1031335 ############################################
          2828.745 -   12074.102:    1395 |
         12074.102 -   51525.930:     806 |
         51525.930 -  219874.955:     162 |
        219874.955 -  938254.688:      22 |
        938254.688 - 4003728.000:       3 |
      
      After this change:
      
        Count: 2275862
        Range:  0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
        Percentiles:  90th: 1105.000; 95th: 1245.000; 99th: 1590.000
             0.000 -      10.219:      10 |
            10.219 -      40.957:      36 |
            40.957 -     155.907:     262 |
           155.907 -     585.789: 1127214 ####################################################
           585.789 -    2193.431: 1145134 #####################################################
          2193.431 -    8205.578:    1648 |
          8205.578 -   30689.378:    1039 |
         30689.378 -  114772.699:     362 |
        114772.699 -  429221.537:      52 |
        429221.537 - 1605175.000:      10 |
      
      Maximum duration (range), average duration, percentiles and standard
      deviation are all better.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
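The core of the new insertion strategy in this commit (merge into an existing node while descending, and only allocate a node when nothing is mergeable) can be sketched in user-space C. This is a simplified illustration and not the kernel's insert_state(): `struct range_node` and `insert_range()` are hypothetical, the tree is an unbalanced binary search tree rather than an rbtree, ranges are assumed not to overlap, and the follow-up merge with the next or previous node (point 2 of the second list in the message) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: an inclusive byte range plus a set of state bits. */
struct range_node {
	struct range_node *left;
	struct range_node *right;
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/*
 * Insert [start, end] with the given bits. Returns the node that now covers
 * the range, or NULL if an allocation was needed and failed.
 */
static struct range_node *insert_range(struct range_node **root,
				       uint64_t start, uint64_t end,
				       unsigned int bits)
{
	struct range_node **link = root;
	struct range_node *n;

	assert(start <= end);
	while (*link) {
		struct range_node *cur = *link;

		if (cur->bits == bits) {
			/* New range starts right after 'cur': grow it to the right. */
			if (start > 0 && cur->end == start - 1) {
				cur->end = end;
				return cur;
			}
			/* New range ends right before 'cur': grow it to the left. */
			if (cur->start > 0 && end == cur->start - 1) {
				cur->start = start;
				return cur;
			}
		}
		link = (start < cur->start) ? &cur->left : &cur->right;
	}

	/* Nothing mergeable was found on the search path, so insert a new node. */
	n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->start = start;
	n->end = end;
	n->bits = bits;
	*link = n;
	return n;
}
```

In a binary search tree both in-order neighbours of the insertion point lie on the search path, which is why a mergeable node can be found during the descent without a separate rb_prev() or rb_next() style lookup after the insertion.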
    • btrfs: change test_range_bit to scan the whole range · 893fe243
      David Sterba authored
      The semantics of test_range_bit() with filled == 0 are now in their own
      helper, so test_range_bit() will check the whole range unconditionally.
      The detection logic is flipped: it assumes success by default and
      catches exceptions.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add specific helper for range bit test exists · 99be1a66
      David Sterba authored
      The existing helper test_range_bit() works in two ways: it either checks
      if the whole range contains all the bits, or stops at the first
      occurrence.  By adding a specific helper for the latter case, the inner
      loop can be simplified and contains fewer conditionals, making it a bit
      faster.
      
      There's no caller that uses the cached state pointer so this reduces the
      argument count further.
      Signed-off-by: David Sterba <dsterba@suse.com>
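The split into a "whole range" test and an "exists" test can be pictured with the following user-space C sketch. It is only loosely modelled on the two commits above: `struct range_state`, `range_bit_exists()` and `range_bit_filled()` are hypothetical names, and the records live in a sorted array of non-overlapping inclusive ranges instead of an io tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record; the array passed in is sorted and non-overlapping. */
struct range_state {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/* True if any record overlapping [start, end] has one of 'bits' set. */
static bool range_bit_exists(const struct range_state *s, size_t n,
			     uint64_t start, uint64_t end, unsigned int bits)
{
	assert(start <= end);
	for (size_t i = 0; i < n; i++) {
		if (s[i].end < start)
			continue;		/* entirely before the range */
		if (s[i].start > end)
			break;			/* entirely after the range */
		if (s[i].bits & bits)
			return true;		/* first occurrence is enough */
	}
	return false;
}

/* True only if all of [start, end] is covered by records with 'bits' set. */
static bool range_bit_filled(const struct range_state *s, size_t n,
			     uint64_t start, uint64_t end, unsigned int bits)
{
	uint64_t next = start;	/* first byte not verified yet */

	assert(start <= end);
	for (size_t i = 0; i < n; i++) {
		if (s[i].end < start)
			continue;		/* entirely before the range */
		if (s[i].start > next)
			return false;		/* hole in the coverage */
		if (!(s[i].bits & bits))
			return false;		/* covered, but the bit is not set */
		if (s[i].end >= end)
			return true;		/* reached the end, no exception found */
		next = s[i].end + 1;
	}
	return false;	/* records ran out before the end of the range */
}
```

Each loop now has a single purpose, so neither needs a mode flag checked on every iteration, which is the simplification the message refers to.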
    • btrfs: move btrfs_realloc_node() from ctree.c into defrag.c · 6422b4cd
      Filipe Manana authored
      btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
      defrag.c file, so move it, and its helper close_blocks(), into defrag.c.
      
      During the move also do a few minor cosmetic changes:
      
      1) Change the return value of close_blocks() from int to bool;
      
      2) Use SZ_32K instead of 32768 at close_blocks();
      
      3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
         'end_slot';
      
      4) Get rid of 'parent_nritems' variable, in both places where it was
         used it could be replaced by calling btrfs_header_nritems(parent);
      
      5) Change the type of a couple variables from int to bool;
      
      6) Rename variable 'err' to 'ret', as that's the most common name we
         use to track the return value of a function;
      
      7) Move some variables from the top scope to the scope of the for loop
         where they are used.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: export comp_keys() from ctree.c as btrfs_comp_keys() · 79d25df0
      Filipe Manana authored
      Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
      later patch we can move out defrag specific code from ctree.c into
      defrag.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename and export __btrfs_cow_block() · 95f93bc4
      Filipe Manana authored
      Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This
      allows moving defrag specific code out of ctree.c and into defrag.c in
      one of the next patches.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use round_down() to align block offset at btrfs_cow_block() · b8bf4e4d
      Filipe Manana authored
      At btrfs_cow_block() we can use round_down() to align the extent buffer's
      logical offset to the start offset of a metadata block group, instead of
      the less easy to read set of bitwise operations (two plus one subtraction).
      So replace the bitwise operations with a round_down() call.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
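To make the equivalence concrete, the sketch below compares an open-coded mask (two bitwise operations plus one subtraction) with a named helper. It is user-space C written for illustration: `round_down_pow2()` is a hypothetical stand-in for the kernel's round_down() macro, and the 1 GiB alignment (`SKETCH_SZ_1G`) is an assumption made here, not a value stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alignment for the example only. */
#define SKETCH_SZ_1G (UINT64_C(1) << 30)

/* Open-coded form: a NOT, an AND and a subtraction. */
static uint64_t align_open_coded(uint64_t offset)
{
	return offset & ~(SKETCH_SZ_1G - 1);
}

/* Named helper; 'align' must be a non-zero power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}
```

Both forms compute the same value for a power-of-two alignment; the named helper simply states the intent at the call site, for example `round_down_pow2(offset, SKETCH_SZ_1G)`.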
    • btrfs: remove noinline attribute from btrfs_cow_block() · 7bff16e3
      Filipe Manana authored
      It's pointless to have the noinline attribute for btrfs_cow_block(), as
      the function is exported and widely used. So remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove incomplete metadata_uuid conversion fixup logic · 5966930d
      Anand Jain authored
      Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has
      stopped the assembly of devices with the CHANGING_FSID_V2 flag in the
      kernel. Such devices can be scanned but will not be registered and can't
      be mounted without a manual fix by btrfstune.  Remove the related logic
      and now unused code.
      
      The original motivation was to allow an interrupted partial conversion
      to fix itself on the next mount, in case the system has to be rebooted.
      This is a convenience but brings a lot of complexity to the device
      scanning and to handling the partial states.  It's hard to estimate if
      this was ever needed in practice, given that the typical use case is a
      manual conversion of an unmounted filesystem where the user can verify
      the success and rerun it if needed.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add historical context ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reject devices with CHANGING_FSID_V2 · 197a9ece
      Anand Jain authored
      The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state
      where the device in the userspace btrfstune -m|-M operation failed to
      complete changing the fsid.
      
      This flag makes the kernel automatically determine the other
      partner devices with which a given device can be associated, based on
      the fsid, metadata_uuid and generation values.
      
      The btrfstune -m|-M feature is especially useful in virtual cloud setups, where
      compute instances (disk images) are quickly copied, fsid changed, and
      launched. Given numerous disk images with the same metadata_uuid but
      different fsid, there's no clear way a device can be correctly assembled
      with the proper partners when the CHANGING_FSID_V2 flag is set. So, the
      disk could be assembled incorrectly, as in the example below:
      
      Before this patch:
      
      Consider the following two filesystems:
         /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrfstune -m
      operation fails.
      
      In this scenario, the fsid change of /dev/loop0 is interrupted and the
      CHANGING_FSID_V2 flag is set, as shown below.
      
        $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags"
      
        $ btrfs inspect dump-super /dev/loop0 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop0
        flags			0x1000000001
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		9
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop1 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop1
        flags			0x1
        fsid			11d2af4d-1b71-45a9-83f6-f2100766939d
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		10
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
        $ btrfs inspect dump-super /dev/loop2 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop2
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop3 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop3
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
      It is normal that some devices aren't instantly discovered during
      system boot or iSCSI discovery. The controlled scan below demonstrates
      this.
      
        $ btrfs device scan --forget
        $ btrfs device scan /dev/loop0
        Scanning for btrfs filesystems on '/dev/loop0'
        $ mount /dev/loop3 /btrfs
        $ btrfs filesystem show -m
        Label: none  uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
      	Total devices 2 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 48.00MiB path /dev/loop0
      	devid    2 size 300.00MiB used 40.00MiB path /dev/loop3
      
      /dev/loop0 and /dev/loop3 are incorrectly partnered.
      
      This kernel patch removes functions and code connected to the
      CHANGING_FSID_V2 flag.
      
      With this patch, devices with the CHANGING_FSID_V2 flag are now rejected.
      And its partner will fail to mount with the extra -o degraded option.
      The check is removed from open_ctree(); devices are rejected during
      scanning, which in turn fails the mount.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: constify parameters where possible · ab7c8bbf
      David Sterba authored
      Lots of the functions in relocation.c don't change pointer parameters
      but lack the annotations. Add them and reformat according to current
      coding style if needed.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: return bool from btrfs_should_ignore_reloc_root · 32f2abca
      David Sterba authored
      btrfs_should_ignore_reloc_root() is a predicate so it should return
      bool.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch btrfs_backref_cache::is_reloc to bool · c71d3c69
      David Sterba authored
      The btrfs_backref_cache::is_reloc is an indicator variable and should
      use a bool type.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: open code mapping_tree_init · 733fa44d
      David Sterba authored
      There's only one user of mapping_tree_init, we don't need a helper for
      the simple initialization.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: switch bitfields to bool in reloc_control · d23d42e3
      David Sterba authored
      Use bool types for the indicators instead of bitfields. The structure
      size slightly grows but the new types are placed within the padding.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: use enum for stages · 8daf07cf
      David Sterba authored
      Add an enum type for data relocation stages.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: use more natural types for tree_block bitfields · a3bb700f
      David Sterba authored
      We don't need to use bitfields for tree_block::level and
      tree_block::key_ready, there's enough padding in the structure for
      proper types.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_defrag_root() to defrag.{c,h} · 1723270f
      Filipe Manana authored
      The btrfs_defrag_root() function does not really belong in the
      transaction.{c,h} module and as we have a defrag.{c,h} nowadays,
      move it there instead. This also allows us to stop exporting
      btrfs_defrag_leaves(), so we can make it static.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename info to fs_info for consistency ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from fixup_inode_link_count() · 8befc61c
      Filipe Manana authored
      The root argument for fixup_inode_link_count() always matches the root of
      the given inode, so remove the root argument and get it from the inode
      argument. This also applies to the helpers count_inode_extrefs() and
      count_inode_refs() used by fixup_inode_link_count() - they don't need the
      root argument, as it always matches the root of the inode passed to them.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from maybe_insert_hole() · 0a325e62
      Filipe Manana authored
      The root argument for maybe_insert_hole() always matches the root of the
      given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_delayed_update_inode() · 04bd8e94
      Filipe Manana authored
      The root argument for btrfs_delayed_update_inode() always matches the root
      of the given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode_item() · 07a274a8
      Filipe Manana authored
      The root argument for btrfs_update_inode_item() always matches the root of
      the given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode() · 8b9d0322
      Filipe Manana authored
      The root argument for btrfs_update_inode() always matches the root of the
      given inode, so remove the root argument and get it from the inode
      argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant root argument from btrfs_update_inode_fallback() · 0a5d0dc5
      Filipe Manana authored
      The root argument for btrfs_update_inode_fallback() always matches the
      root of the given inode, so remove the root argument and get it from the
      inode argument.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove noinline from btrfs_update_inode() · cddaaacc
      Filipe Manana authored
      The noinline attribute of btrfs_update_inode() is pointless as the
      function is exported and widely used, so remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify error check condition at btrfs_dirty_inode() · 2199cb0f
      Filipe Manana authored
      The following condition at btrfs_dirty_inode() is redundant:
      
        if (ret && (ret == -ENOSPC || ret == -EDQUOT))
      
      The first check for a non-zero 'ret' value is pointless, so we can
      simplify this to:
      
        if (ret == -ENOSPC || ret == -EDQUOT)
      
      Not only does this make it easier to read, it also slightly reduces the
      text size of the btrfs kernel module:
      
        $ size fs/btrfs/btrfs.ko.before
           text	   data	    bss	    dec	    hex	filename
        1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before
      
        $ size fs/btrfs/btrfs.ko.after
           text	   data	    bss	    dec	    hex	filename
        1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
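The redundancy can be checked mechanically: -ENOSPC and -EDQUOT are both non-zero, so the leading `ret &&` can never change the result. The user-space C sketch below compares the two conditions over a range of values; `matches_old()`, `matches_new()` and `conditions_agree()` are hypothetical helper names introduced only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Condition as it was written before the change. */
static bool matches_old(int ret)
{
	return ret && (ret == -ENOSPC || ret == -EDQUOT);
}

/* Simplified condition. */
static bool matches_new(int ret)
{
	return ret == -ENOSPC || ret == -EDQUOT;
}

/* True if both conditions give the same answer for every value in [lo, hi]. */
static bool conditions_agree(int lo, int hi)
{
	assert(lo <= hi);
	for (int ret = lo; ; ret++) {
		if (matches_old(ret) != matches_new(ret))
			return false;
		if (ret == hi)
			break;
	}
	return true;
}
```

Calling `conditions_agree(-4095, 0)` covers the range the kernel conventionally uses for negative errno return values plus the success value 0.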
    • btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroups · e0761451
      Boris Burkov authored
      In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
      quota_root, as opposed to after we are done setting up the qgroup
      structures. In the quota_enable path, we wait until after the structures
      are set up. Likewise, in disable, we clear the bit before tearing down
      the structures. I feel that this organization is less surprising for the
      open_ctree path.
      
      I don't believe this fixes any actual bug, but it avoids potential
      confusion when using btrfs_qgroup_mode in an intermediate state where we
      are enabled but haven't yet set up the qgroup status flags. It also
      avoids any risk of calling a qgroup function and attempting to use the
      qgroup rbtrees before they exist or are set up.
      
      This all occurs before we do rw setup, so I believe it should be mostly
      a no-op.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track data relocation with simple quota · 2672a051
      Boris Burkov authored
      Relocation data allocations are quite tricky for simple quotas. The
      basic data relocation sequence is (ignoring details that aren't relevant
      to this fix):
      
      - create a fake relocation data fs root
      - create a fake relocation inode in that root
      - for each data extent:
        - preallocate a data extent on behalf of the fake inode
        - copy over the data
      - for each extent
        - swap the refs so that the original file extent now refers to the new
          extent item
      - drop the fake root, dropping its refs on the old extents, which lets
        us delete them.
      
      Done naively, this results in storing an extent item in the extent tree
      whose owner_ref points at the relocation data root and a no-op squota
      recording, since the reloc root is not a legit fstree. So far, that's
      OK. The problem comes when you do the swap, and leave an extent item
      owned by this bogus root as the real permanent extents of the file. If
      the file then drops that ref, we free it and no-op account that against
      the fake relocation root. Essentially, this means that relocation is
      simple quota "extent laundering", since we re-own the extents into a
      fake root.
      
      Simple quotas very intentionally don't have a mechanism for
      transferring ownership of extents, as that is exactly the complicated
      thing we are trying to avoid with the new design. Further, it cannot be
      correctly done in this case, since at the time you create the new
      "real" refs, there is no way to know which was the original owner before
      relocation unless we track it.
      
      Therefore, it makes more sense to trick the preallocation to handle
      relocation as a special case and note the proper owner ref from the
      beginning. That way, we never write out an extent item without the
      correct owner ref that it will eventually have.
      
      This could be done by wiring a special root parameter all the way
      through the allocation code path, but to avoid that special case
      touching all the code, take advantage of the serial nature of relocation
      to store the src root on the relocation root object. Then when we finish
      the prealloc, if it happens to be this case, prepare the delayed ref
      appropriately.
      
      We must also add logic to handle relocating adjacent extents with
      different owning roots. Those cannot be preallocated together in a
      cluster as it would lose the separate ownership information.
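
The clustering rule for adjacent extents can be sketched as follows. This is
only an illustration under assumed names: `struct sketch_extent` and
`sketch_can_cluster()` are hypothetical and are not the kernel's relocation
structures or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a data extent queued for relocation. */
struct sketch_extent {
	uint64_t start;      /* logical start of the extent */
	uint64_t len;        /* length in bytes */
	uint64_t owner_root; /* root id recorded as the simple quota owner */
};

/*
 * Decide whether 'next' may join the preallocation cluster that currently
 * ends with 'prev'. Adjacent extents are only clustered when they have the
 * same owning root, so that per-extent ownership is not lost.
 */
static bool sketch_can_cluster(const struct sketch_extent *prev,
			       const struct sketch_extent *next)
{
	assert(prev != NULL && next != NULL);

	bool adjacent = prev->start + prev->len == next->start;

	return adjacent && prev->owner_root == next->owner_root;
}
```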
      
      This is obviously a smelly bit of code, but I think it is the best
      solution to the problem, given the relocation implementation.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2672a051
    • btrfs: qgroup: track metadata relocation COW with simple quota · 60ea105a
      Boris Burkov authored
      Relocation COWs metadata blocks in two cases for the reloc root:
      
      - copying the subvolume root item when creating the reloc root
      - copying a btree node when there is a COW during relocation
      
      In both cases, the resulting btree node hits an abnormal code path with
      respect to the owner field in its btrfs_header. It first creates the
      root item for the new objectid, which populates the reloc root id, and
      it is at this point that delayed refs are created.
      
      Later, it fully copies the old node into the new node (including the
      original owner field) which overwrites it. This results in a simple
      quotas mismatch where we run the delayed ref for the reloc root which
      has no simple quota effect (reloc root is not an fstree) but when we
      ultimately delete the node, the owner is the real original fstree and we
      do free the space.
      
      To work around this without tampering with the behavior of relocation,
      add a parameter to btrfs_add_tree_block that lets the relocation code
      path specify a different owning root than the "operating" root (in this
      case, owning root is the real root and the operating root is the reloc
      root). These can naturally be plumbed into delayed refs that have the
      same concept.
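
A minimal sketch of the "owning" versus "operating" root distinction, assuming
a caller passes 0 when no separate owning root applies. The names
`sketch_tree_ref`, `sketch_make_tree_ref()` and `SKETCH_RELOC_ROOT` are
hypothetical and do not reflect the actual kernel signatures.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder id for the reloc root; the value is illustrative only. */
#define SKETCH_RELOC_ROOT 9999ULL

/* Hypothetical delayed-ref-like record carrying both notions of "root". */
struct sketch_tree_ref {
	uint64_t operating_root; /* root the COW is performed on behalf of */
	uint64_t owning_root;    /* root charged by simple quotas */
};

/*
 * Build the ref for a newly allocated tree block. When relocation supplies
 * a source root, that root is the owner; otherwise the operating root owns
 * the block, as in the normal case.
 */
static struct sketch_tree_ref sketch_make_tree_ref(uint64_t operating_root,
						   uint64_t reloc_src_root)
{
	struct sketch_tree_ref ref = {
		.operating_root = operating_root,
		.owning_root = reloc_src_root ? reloc_src_root : operating_root,
	};

	assert(ref.owning_root != 0);
	return ref;
}
```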
      
      Note that this is a double count in some sense, but a relatively natural
      one, as there are really two extents, and the old one will be deleted
      soon. This is consistent with how data relocation extents are accounted
      by simple quotas.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      60ea105a
    • btrfs: qgroup: check generation when recording simple quota delta · bd7c1ea3
      Boris Burkov authored
      Simple quotas count extents only from the moment the feature is enabled.
      Therefore, if we do something like:
      
      1. create subvol S
      2. write F in S
      3. enable quotas
      4. remove F
      5. write G in S
      
      then after 3. and 4. we would expect the simple quota usage of S to be 0
      (putting aside some metadata extents that might be written) and after
      5., it should be the size of G plus metadata. Therefore, we need to be
      able to determine whether a particular quota delta we are processing
      predates simple quota enablement.
      
      To do this, store the transaction id at which quotas were enabled, both
      in fs_info for immediate use and in the quota status item to make it
      recoverable on mount. When we see a delta, check if the generation of
      the extent item is less than that of quota enablement. If so, we should
      ignore the delta from this extent.
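
The generation check can be illustrated with a small sketch. The struct and
function names below are hypothetical stand-ins for the state kept in fs_info,
not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-subvolume simple quota state. */
struct sketch_squota {
	uint64_t enable_gen; /* transaction id at which quotas were enabled */
	int64_t usage;       /* bytes currently accounted */
};

/* A delta only counts if its extent was created at or after enablement. */
static bool sketch_delta_applies(const struct sketch_squota *sq,
				 uint64_t extent_gen)
{
	return extent_gen >= sq->enable_gen;
}

/* Apply an allocation (positive) or free (negative) delta if it qualifies. */
static void sketch_record_delta(struct sketch_squota *sq, uint64_t extent_gen,
				int64_t delta)
{
	if (!sketch_delta_applies(sq, extent_gen))
		return; /* extent predates enablement: ignore */

	sq->usage += delta;
	assert(sq->usage >= 0);
}
```

In the example sequence above, removing F produces a delta for an extent whose
generation predates enablement, so it is ignored, while writing G is counted.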
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd7c1ea3
    • btrfs: qgroup: simple quota auto hierarchy for nested subvolumes · 5343cd93
      Boris Burkov authored
      Consider the following sequence:
      
      - enable quotas
      - create subvol S id 256 at dir outer/
      - create a qgroup 1/100
      - add 0/256 (S's auto qgroup) to 1/100
      - create subvol T id 257 at dir outer/inner/
      
      With full qgroups, there is no relationship between 0/257 and either of
      0/256 or 1/100. There is an inherit feature that the creator of inner/
      can use to specify it ought to be in 1/100.
      
      Simple quotas are targeted at container isolation, where such automatic
      inheritance for not necessarily trusted/controlled nested subvol
      creation would be quite helpful. Therefore, add a new default behavior
      for simple quotas: when you create a nested subvol, automatically
      inherit as parents any parents of the qgroup of the subvol the new inode
      is going in.
      
      In our example, 0/257 would also be under 1/100, allowing easy control
      of a total quota over an arbitrary hierarchy of subvolumes.
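
A rough sketch of the automatic inheritance, using a hypothetical in-memory
qgroup with a fixed-size parent array. The real qgroup relations are kept in
kernel structures that this example does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PARENTS 8

/* Hypothetical qgroup: "level/id" plus the qgroups it is a member of. */
struct sketch_qgroup {
	uint16_t level;
	uint64_t id;
	struct sketch_qgroup *parents[SKETCH_MAX_PARENTS];
	size_t nr_parents;
};

static void sketch_add_parent(struct sketch_qgroup *child,
			      struct sketch_qgroup *parent)
{
	assert(child->nr_parents < SKETCH_MAX_PARENTS);
	child->parents[child->nr_parents++] = parent;
}

/*
 * On nested subvolume creation, the new subvolume's qgroup inherits every
 * parent of the qgroup of the subvolume that contains the new directory.
 */
static void sketch_auto_inherit(struct sketch_qgroup *new_qg,
				const struct sketch_qgroup *containing)
{
	for (size_t i = 0; i < containing->nr_parents; i++)
		sketch_add_parent(new_qg, containing->parents[i]);
}
```

With S (0/256) already a member of 1/100, calling `sketch_auto_inherit()` for
T (0/257) leaves T with 1/100 as its only parent.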
      
      I think this _might_ be a generally useful behavior, so it could be
      interesting to put it behind a new inheritance flag that simple quotas
      always use while traditional quotas let the user specify, but this is a
      minimally intrusive change to start.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5343cd93
    • btrfs: record simple quota deltas in delayed refs · cecbb533
      Boris Burkov authored
      At the moment that we run delayed refs, we make the final ref-count
      based decision on creating/removing extent (and metadata) items.
      Therefore, it is exactly the spot to hook up simple quotas.
      
      There are a few important subtleties to the fields we must collect to
      accurately track simple quotas, particularly when removing an extent.
      When removing a data extent, the ref could be in any tree (due to
      reflink, for example) and so we need to recover the owning root id from
      the owner ref item. When removing a metadata extent, we know the owning
      root from the owner field in the header when we create the delayed ref,
      so we can recover it from there.
      
      We must also be careful to handle reservations properly so as not to
      leak reserved space. The happy path is freeing the reservation when the
      simple quota delta runs on a data extent. If that doesn't happen, due to
      refs canceling out or some error, the ref head already has the
      must_insert_reserved machinery to handle this, so we piggyback on that
      and use it to clean up the reserved data.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cecbb533
    • btrfs: add helper for inline owner ref lookup · 8d299091
      Boris Burkov authored
      Inline ref parsing is a bit tricky and relies on a decent amount of
      implicit information, so I think it is beneficial to have a helper
      function for reading the owner ref, if only to "document" the format,
      along with the write path.
      
      The main subtlety of note which I was missing by open-coding this was
      that it is important to check whether or not inline refs are present
      *at all*. i.e., if we are writing out a new extent under squotas, we
      will always use a big enough item for the inline ref and have it.
      However, it is possible that some random item predating squotas will not
      have any inline refs. In that case, trying to read the "type" field of
      the first inline ref will just be reading garbage in the form of
      whatever is in the next item.
      
      This will be used by the extent freeing path, which looks up data
      extent owners, as well as a relocation path which needs to grab the
      owner before relocating an extent.
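
The "are inline refs present at all" check can be sketched on a flat byte
buffer. The layout, sizes and type value here are assumptions made for the
example (a fixed-size header followed by a one-byte type and an 8-byte root
id, read in host byte order); they are not the on-disk format definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_EXTENT_ITEM_SIZE 24  /* assumed fixed extent item header size */
#define SKETCH_OWNER_REF_TYPE 0xAC  /* hypothetical inline ref type value */

/*
 * Return true and store the owning root id if the item begins its inline
 * ref area with an owner ref. Items with no room for inline refs are
 * rejected before the type byte is read, since reading it would otherwise
 * pick up bytes belonging to whatever follows the item.
 */
static bool sketch_lookup_owner_ref(const uint8_t *item, size_t item_size,
				    uint64_t *root_out)
{
	assert(item != NULL && root_out != NULL);

	if (item_size < SKETCH_EXTENT_ITEM_SIZE + 1 + sizeof(uint64_t))
		return false; /* no inline owner ref can fit */

	if (item[SKETCH_EXTENT_ITEM_SIZE] != SKETCH_OWNER_REF_TYPE)
		return false; /* first inline ref is some other type */

	memcpy(root_out, item + SKETCH_EXTENT_ITEM_SIZE + 1, sizeof(*root_out));
	return true;
}
```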
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d299091
    • btrfs: new inline ref storing owning subvol of data extents · d9a620f7
      Boris Burkov authored
      In order to implement simple quota groups, we need to be able to
      associate a data extent with the subvolume that created it. Once you
      account for reflink, this information cannot be recovered without
      explicitly storing it. Options for storing it are:
      
      - a new key/item
      - a new extent inline ref item
      
      The former is backwards compatible but wastes space; the latter is
      incompat but is efficient in space and reuses the existing inline ref
      machinery, while only abusing it a tiny amount -- specifically, the new
      item is not a ref, per se.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d9a620f7