1. 12 Oct, 2023 40 commits
    • btrfs: sysfs: show temp_fsid feature · f3623740
      Anand Jain authored
      This adds sysfs objects to indicate temp_fsid feature support and
      its status.
      
        /sys/fs/btrfs/features/temp_fsid
        /sys/fs/btrfs/<UUID>/temp_fsid
      
      For example:
      
         Consider two cloned and mounted devices.
      
            $ blkid /dev/sdc[1-2]
            /dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
            /dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
      
         One gets the actual fsid, and the other gets a temp_fsid when
         mounted.
      
            $ btrfs filesystem show -m
            Label: none  uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5
      	      Total devices 1 FS bytes used 54.14MiB
      	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc1
      
            Label: none  uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a
      	      Total devices 1 FS bytes used 54.14MiB
      	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc2
      
         Their sysfs entries are as below.
      
            $ cat /sys/fs/btrfs/features/temp_fsid
            0
      
            $ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid
            0
      
            $ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid
            1
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable the device add feature for temp-fsid · ac6ea6a9
      Anand Jain authored
      The device addition operation will transform the cloned temp-fsid mounted
      device into a multi-device filesystem. Therefore, it is marked as
      unsupported.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable the seed feature for temp-fsid · c47b02c1
      Anand Jain authored
      A seed device is an integral component of the sprout device, which
      functions as a multi-device filesystem. Therefore, temp-fsid feature
      is not supported.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update comment for temp-fsid, fsid, and metadata_uuid · 000331bb
      Anand Jain authored
      Update the comment to explain the relationship between temp_fsid, fsid,
      and metadata_uuid.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless empty log context list check when syncing log · 3cf63ddf
      Filipe Manana authored
      When syncing the log, if we get an error when updating the log root, we
      first check if the log root tree context is in a log context list, and if
      so we delete the log root tree context from the list. This check
      however is pointless because at this moment the context is always in a
      list, as we have just added it to a context list. The check became pointless
      after commit a93e0168 ("btrfs: remove no longer needed use of
      log_writers for the log root tree"). So remove this now pointless empty
      list check.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update comment for struct btrfs_inode::lock · 68539bd0
      Filipe Manana authored
      Update the comment for the lock named "lock" in struct btrfs_inode because
      it does not mention that the fields "delalloc_bytes", "defrag_bytes",
      "csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
      by that lock.
      
      Also add a comment on top of each field protected by this lock to mention
      that the lock protects them.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless barrier from btrfs_sync_file() · 5ca1949b
      Filipe Manana authored
      The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
      now that fs_info->last_trans_committed is read using READ_ONCE(), with the
      helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
      with the helper btrfs_set_last_trans_committed().
      
      This barrier was introduced in 2011, by commit a4abeea4 ("Btrfs: kill
      trans_mutex"), but even back then it was not correct since the writer side
      (in btrfs_commit_transaction()), did not issue a pairing memory barrier
      after it updated fs_info->last_trans_committed.
      
      So remove this barrier.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helpers for reading and writing last_trans_committed · 0124855f
      Filipe Manana authored
      Currently the last_trans_committed field of struct btrfs_fs_info is
      modified and read without any locking or other protection. For example
      early in the fsync path, skip_inode_logging() is called which reads
      fs_info->last_trans_committed, but at the same time we can have a
      transaction commit completing and updating that field.
      
      In the case of an fsync this is harmless and any data race should be
      rare and at most cause an unnecessary logging of an inode.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the last_trans_committed field of struct btrfs_fs_info using
      READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
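
      As an illustration only, helpers of this kind could be sketched in
      userspace as below. The READ_ONCE()/WRITE_ONCE() macros here are
      simplified stand-ins built on volatile accesses, not the kernel's
      definitions, and struct fs_info_sketch plus the unprefixed helper names
      are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified userspace stand-ins for the kernel macros: a volatile access
 * makes the compiler emit a single load or store that it may not split,
 * fuse or repeat. (__typeof__ is a GCC/Clang extension.)
 */
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, cut-down stand-in for struct btrfs_fs_info. */
struct fs_info_sketch {
	uint64_t last_trans_committed;
};

/* Reader side: every lockless read goes through this one helper. */
static inline uint64_t get_last_trans_committed(const struct fs_info_sketch *fs_info)
{
	return READ_ONCE(fs_info->last_trans_committed);
}

/* Writer side: the transaction commit path would call this. */
static inline void set_last_trans_committed(struct fs_info_sketch *fs_info,
					    uint64_t gen)
{
	WRITE_ONCE(fs_info->last_trans_committed, gen);
}

/* Usage example: a value stored by the writer is seen whole by the reader. */
static void last_trans_committed_example(void)
{
	struct fs_info_sketch fs_info = { .last_trans_committed = 0 };

	set_last_trans_committed(&fs_info, 42);
	assert(get_last_trans_committed(&fs_info) == 42);
}
```

      Funnelling every access through one pair of helpers keeps the
      annotation in a single place, which is the point of the change.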
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helpers for reading and writing fs_info->generation · 4a4f8fe2
      Filipe Manana authored
      Currently the generation field of struct btrfs_fs_info is always modified
      while holding fs_info->trans_lock locked. Most readers will access this
      field without taking that lock but while holding a transaction handle,
      which is safe to do due to the transaction life cycle.
      
      However there are other readers that are neither holding the lock nor
      holding a transaction handle open:
      
      1) When reading an inode from disk, at btrfs_read_locked_inode();
      
      2) When reading the generation to expose it to sysfs, at
         btrfs_generation_show();
      
      3) Early in the fsync path, at skip_inode_logging();
      
      4) When creating a hole at btrfs_cont_expand(), during write paths,
         truncate and reflinking;
      
      5) In the fs_info ioctl (btrfs_ioctl_fs_info());
      
      6) While mounting the filesystem, in the open_ctree() path. In these
         cases it's safe to directly read fs_info->generation as no one
         can concurrently start a transaction and update fs_info->generation.
      
      In case of the fsync path, races here should be harmless, and in the worst
      case they may cause a fsync to log an inode when it's not really needed,
      so nothing bad from a functional perspective. In the other cases it's not
      so clear if functional problems may arise, though in case 1 rare things
      like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
      flag not being set on an inode and therefore result in incorrect logging
      later on in case a fsync call is made.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the generation field of struct btrfs_fs_info using READ_ONCE() and
      WRITE_ONCE(), and use these helpers where needed.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helpers for reading and writing log_transid · 6008859b
      Filipe Manana authored
      Currently the log_transid field of a root is always modified while holding
      the root's log_mutex locked. Most readers of a root's log_transid are also
      holding the root's log_mutex locked, however there is one exception which
      is btrfs_set_inode_last_trans() where we don't take the lock to avoid
      blocking several operations if log syncing is happening in parallel.
      
      Any races here should be harmless, and in the worst case they may cause a
      fsync to log an inode when it's not really needed, so nothing bad from a
      functional perspective.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
      and use these helpers where needed.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add and use helpers for reading and writing last_log_commit · f9850787
      Filipe Manana authored
      Currently, the last_log_commit of a root can be accessed concurrently
      without any lock protection. Readers can be calling btrfs_inode_in_log()
      early in a fsync call, which reads a root's last_log_commit, while a
      writer can change the last_log_commit while a log tree is being synced,
      at btrfs_sync_log(). Any races here should be harmless, and in the worst
      case they may cause a fsync to log an inode when it's not really needed,
      so nothing bad from a functional perspective.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the last_log_commit field of a root using READ_ONCE() and
      WRITE_ONCE(), and use these helpers everywhere.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support cloned-device mount capability · a5b8a5f9
      Anand Jain authored
      Guilherme's previous work [1] aimed at the mounting of cloned devices
      using a superblock flag SINGLE_DEV during mkfs.
       [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/
      
      Building upon this work, here is an in-memory only approach. As it mounts,
      we determine if the same fsid is already mounted, and if so we generate a
      random temp fsid which shall be used for the mount, in memory only and not
      written to the disk. We distinguish devices by devt.
      
      Example:
        $ fallocate -l 300m ./disk1.img
        $ mkfs.btrfs -f ./disk1.img
        $ cp ./disk1.img ./disk2.img
        $ cp ./disk1.img ./disk3.img
        $ mount -o loop ./disk1.img /btrfs
        $ mount -o loop ./disk2.img /btrfs1
        $ mount -o loop ./disk3.img /btrfs2
      
        $ btrfs fi show -m
        Label: none  uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop0
      
        Label: none  uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop1
      
        Label: none  uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop2
      Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper function find_fsid_by_disk · 69d427f3
      Anand Jain authored
      In preparation for adding support to mount multiple single-disk
      btrfs filesystems with the same FSID, wrap find_fsid() into
      find_fsid_by_disk().
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop reserving excessive space for block group item insertions · 9ef17228
      Filipe Manana authored
      Space for block group item insertions, necessary after allocating a new
      block group, is reserved in the delayed refs block reserve. Currently we
      do this by incrementing the transaction handle's delayed_ref_updates
      counter and then calling btrfs_update_delayed_refs_rsv(), which will
      increase the size of the delayed refs block reserve by an amount that
      corresponds to the same amount we use for delayed refs, given by
      btrfs_calc_delayed_ref_bytes().
      
      That is an excessive amount because it corresponds to the amount of space
      needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
      times 2 when the free space tree feature is enabled. All we need is an
      amount as given by btrfs_calc_insert_metadata_size(), since we only need to
      insert a block group item in the extent tree (or block group tree if this
      feature is enabled). By using btrfs_calc_insert_metadata_size() we will
      need to reserve 2 times less space when using the free space tree, putting
      less pressure on space reservation.
      
      So use helpers to reserve and release space for block group item
      insertions that use btrfs_calc_insert_metadata_size() for calculation of
      the space.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop reserving excessive space for block group item updates · f66e0209
      Filipe Manana authored
      Space for block group item updates, necessary after allocating or
      deallocating an extent from a block group, is reserved in the delayed
      refs block reserve. Currently we do this by incrementing the transaction
      handle's delayed_ref_updates counter and then calling
      btrfs_update_delayed_refs_rsv(), which will increase the size of the
      delayed refs block reserve by an amount that corresponds to the same
      amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
      
      That is an excessive amount because it corresponds to the amount of space
      needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
      times 2 when the free space tree feature is enabled. All we need is an
      amount as given by btrfs_calc_metadata_size(), since we only need to
      update an existing block group item in the extent tree (or block group
      tree if this feature is enabled). By using btrfs_calc_metadata_size() we
      will need to reserve 4 times less space when using the free space tree
      and 2 times less space when not using it, putting less pressure on space
      reservation.
      
      So use helpers to reserve and release space for block group item updates
      that use btrfs_calc_metadata_size() for calculation of the space.
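
      The ratios quoted above can be checked with a small worked example.
      This is a sketch under assumptions: the three functions below only
      mirror what btrfs_calc_metadata_size(),
      btrfs_calc_insert_metadata_size() and btrfs_calc_delayed_ref_bytes()
      are understood to compute (node size times the maximum tree height,
      doubled for insertions, doubled again with the free space tree), and
      the 16KiB node size is picked purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVEL 8 /* assumed to match BTRFS_MAX_LEVEL */

/* Assumed shape of btrfs_calc_metadata_size(): one root-to-leaf path per item. */
static uint64_t calc_metadata_size(uint32_t nodesize, unsigned int num_items)
{
	return (uint64_t)nodesize * MAX_LEVEL * num_items;
}

/* Assumed shape of btrfs_calc_insert_metadata_size(): twice the above. */
static uint64_t calc_insert_metadata_size(uint32_t nodesize, unsigned int num_items)
{
	return (uint64_t)nodesize * MAX_LEVEL * 2 * num_items;
}

/* Per the text: the insert size, times 2 when the free space tree is enabled. */
static uint64_t calc_delayed_ref_bytes(uint32_t nodesize, unsigned int num_items,
				       bool free_space_tree)
{
	uint64_t bytes = calc_insert_metadata_size(nodesize, num_items);

	return free_space_tree ? bytes * 2 : bytes;
}

static void reservation_example(void)
{
	const uint32_t nodesize = 16384; /* illustrative node size */

	/* Item update: 4 times less with the free space tree, 2 times less without. */
	assert(calc_delayed_ref_bytes(nodesize, 1, true) ==
	       4 * calc_metadata_size(nodesize, 1));
	assert(calc_delayed_ref_bytes(nodesize, 1, false) ==
	       2 * calc_metadata_size(nodesize, 1));

	/* Item insertion: 2 times less with the free space tree. */
	assert(calc_delayed_ref_bytes(nodesize, 1, true) ==
	       2 * calc_insert_metadata_size(nodesize, 1));
}
```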
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reorder btrfs_inode to fill gaps · 398fb913
      David Sterba authored
      The previous commit created a hole in struct btrfs_inode, so we can move
      outstanding_extents there. This reduces size by 8 bytes from 1120 to
      1112 on a release config.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open code btrfs_ordered_inode_tree in btrfs_inode · 54c65371
      David Sterba authored
      The structure btrfs_ordered_inode_tree is used only in one place, in
      btrfs_inode. The structure itself has a 4 byte hole which is wasted
      space.
      
      Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
      prefix 'ordered_tree_' where the hole can be utilized and shrink inode
      size.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: adjust overcommit logic when very close to full · cb6cbab7
      Josef Bacik authored
      A user reported some unpleasant behavior with very small file systems.
      The reproducer is this
      
        $ mkfs.btrfs -f -m single -b 8g /dev/vdb
        $ mount /dev/vdb /mnt/test
        $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
      
      This will result in usage that looks like this
      
        Overall:
            Device size:                   8.00GiB
            Device allocated:              8.00GiB
            Device unallocated:            1.00MiB
            Device missing:                  0.00B
            Device slack:                  2.00GiB
            Used:                          5.47GiB
            Free (estimated):              2.52GiB      (min: 2.52GiB)
            Free (statfs, df):               0.00B
            Data ratio:                       1.00
            Metadata ratio:                   1.00
            Global reserve:                5.50MiB      (used: 0.00B)
            Multiple profiles:                  no
      
        Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
           /dev/vdb        7.99GiB
      
        Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
           /dev/vdb        8.00MiB
      
        System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
           /dev/vdb        4.00MiB
      
        Unallocated:
           /dev/vdb        1.00MiB
      
      As you can see we've gotten ourselves quite full with metadata, with all
      of the disk being allocated for data.
      
      On smaller file systems there's not a lot of time before we get full, so
      our overcommit behavior bites us here.  Generally speaking data
      reservations result in chunk allocations as we assume reservation ==
      actual use for data.  This means at any point we could end up with a
      chunk allocation for data, and if we're very close to full we could do
      this before we have a chance to figure out that we need another metadata
      chunk.
      
      Address this by adjusting the overcommit logic.  Simply put we need to
      take away 1 chunk from the available chunk space in case of a data
      reservation.  This will allow us to stop overcommitting before we
      potentially lose this space to a data allocation.  With this fix in
      place we properly allocate a metadata chunk before we're completely
      full, allowing for enough slack space in metadata.
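
      The adjustment can be pictured with the following simplified sketch.
      It is not the kernel code: overcommit_avail() and its parameters are
      hypothetical names, and how the size of one data chunk is determined
      is left out.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)
#define MiB (UINT64_C(1) << 20)

/*
 * Hypothetical helper: how much unallocated space may be counted on when
 * deciding whether a metadata reservation is allowed to overcommit.
 */
static uint64_t overcommit_avail(uint64_t free_chunk_space, uint64_t data_chunk_size)
{
	/* A data reservation may claim one whole chunk at any moment. */
	if (free_chunk_space <= data_chunk_size)
		return 0;
	return free_chunk_space - data_chunk_size;
}

static void overcommit_example(void)
{
	/* Plenty of room: one data chunk is held back, the rest is usable. */
	assert(overcommit_avail(3 * GiB, 1 * GiB) == 2 * GiB);

	/* Nearly full, as in the report above: stop overcommitting. */
	assert(overcommit_avail(1 * MiB, 1 * GiB) == 0);
}
```

      Returning zero when less than one chunk is left is what would push
      the allocator to create a metadata chunk while space still exists.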
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: increase ->free_chunk_space in btrfs_grow_device · 6f2d3c01
      Josef Bacik authored
      My overcommit patch exposed a bug with btrfs/177 [1].  The problem here is
      that when we grow the device we're not adding to ->free_chunk_space, so
      subsequent allocations can cause ->free_chunk_space to wrap, which
      causes problems in can_overcommit because we add this to ->total_bytes,
      which causes the counter to wrap and gives us an unexpected ENOSPC.
      
      Fix this by properly updating ->free_chunk_space with the new available
      space in btrfs_grow_device.
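
      The wrap can be reproduced with plain unsigned arithmetic. The sketch
      below is illustrative only: struct space_sketch and the function names
      are hypothetical, and the sizes are made up.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)

/* Hypothetical, cut-down stand-in for the space counters involved. */
struct space_sketch {
	uint64_t total_bytes;
	uint64_t free_chunk_space;
};

/* Grow as in the buggy path: the free counter is left untouched. */
static void grow_without_accounting(struct space_sketch *s, uint64_t diff)
{
	s->total_bytes += diff;
}

/* Grow as in the fix: the new space is also counted as free. */
static void grow_with_accounting(struct space_sketch *s, uint64_t diff)
{
	s->total_bytes += diff;
	s->free_chunk_space += diff;
}

/* Allocating a chunk consumes free space; unsigned subtraction wraps silently. */
static void allocate_chunk(struct space_sketch *s, uint64_t size)
{
	s->free_chunk_space -= size;
}

static void grow_example(void)
{
	struct space_sketch buggy = { .total_bytes = 1024 * MiB, .free_chunk_space = 0 };
	struct space_sketch fixed = buggy;

	grow_without_accounting(&buggy, 1024 * MiB);
	allocate_chunk(&buggy, 256 * MiB);
	/* Wrapped: the "free" space now exceeds the size of the whole device. */
	assert(buggy.free_chunk_space > buggy.total_bytes);

	grow_with_accounting(&fixed, 1024 * MiB);
	allocate_chunk(&fixed, 256 * MiB);
	assert(fixed.free_chunk_space == 768 * MiB);
}
```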
      
      [1] First version of the fix:
          https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ->free_chunk_space math in btrfs_shrink_device · e9fd2c05
      Josef Bacik authored
      There are two bugs in how we adjust ->free_chunk_space in
      btrfs_shrink_device.  First we're removing the entire diff between
      new_size and old_size from ->free_chunk_space.  This only works if we're
      reducing the free area, which we could potentially not be.  So adjust
      the math to only subtract the diff in the free space from
      ->free_chunk_space.
      
      Additionally in the error case we're unconditionally adding the diff
      back into ->free_chunk_space, which we need to only do if this device is
      writeable.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make sure we cache next state in find_first_extent_bit() · efba1454
      Filipe Manana authored
      Currently, at find_first_extent_bit(), when we are given a cached extent
      state that happens to have its end offset match the desired range start,
      we find the next extent state using that cached state, with next_state()
      calls, and then return it.
      
      We then try to cache that next state by calling cache_state_if_flags(),
      but that will not cache the state because we haven't reset *cached_state
      to NULL, so we end up with the cached_state unchanged, and if the caller
      is iterating over extent states in the io tree, its next call to
      find_first_extent_bit() will not use the current cached state as its end
      offset does not match the minimum start range offset, therefore the cached
      state is reset and we have to search the rbtree to find the next suitable
      extent state record.
      
      So fix this by resetting the cached state to NULL (and dropping our ref
      on it) when we have a suitable cached state and we found a next state by
      using next_state() starting from the cached state. This makes use cases
      of calling find_first_extent_bit() to go over all ranges in the io tree
      to do a single rbtree full search, only on the first call, and the next
      calls will just do next_state() (rb_next() wrapper) calls, which is more
      efficient.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use extent_io_tree_release() to empty dirty log pages · 0f8ac74d
      Filipe Manana authored
      When freeing a log tree, during a transaction commit, we clear its dirty
      log pages io tree by calling clear_extent_bits() using a range from 0 to
      (u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
      each node before freeing it, which will often trigger rebalance operations
      on the rbtree. A better alternative is to use extent_io_tree_release(),
      which will not do deletions and trigger rebalances.
      
      So use extent_io_tree_release() instead of clear_extent_bits().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make tree iteration in extent_io_tree_release() more efficient · 63ffc1f7
      Filipe Manana authored
      Currently extent_io_tree_release() is a loop that keeps getting the first
      node in the io tree, using rb_first() which is a loop that gets to the
      leftmost node of the rbtree, and then for each node it calls rb_erase(),
      which often requires rebalancing the rbtree.
      
      We can make this more efficient by using
      rbtree_postorder_for_each_entry_safe() to free each node without having
      to delete it from the rbtree and without looping to get the first node.
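
      The idea of freeing in post-order can be shown on a plain binary
      search tree. This is a sketch, not the kernel rbtree API: struct node,
      bst_insert() and release_postorder() are hypothetical, the tree is
      unbalanced, and recursion stands in for the iterative walk that
      rbtree_postorder_for_each_entry_safe() performs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent state record linked into a tree. */
struct node {
	unsigned long start;
	struct node *left, *right;
};

/* Plain unbalanced binary search tree insert, enough to build a test tree. */
static struct node *bst_insert(struct node *root, unsigned long start)
{
	struct node **link = &root;

	while (*link)
		link = start < (*link)->start ? &(*link)->left : &(*link)->right;
	*link = calloc(1, sizeof(**link));
	assert(*link);
	(*link)->start = start;
	return root;
}

/*
 * Free both children before their parent. No node is ever unlinked and the
 * tree is never rebalanced: by the time a node is freed, nothing below it is
 * left to visit. Returns the number of nodes freed.
 */
static size_t release_postorder(struct node *n)
{
	size_t freed;

	if (!n)
		return 0;
	freed = release_postorder(n->left) + release_postorder(n->right);
	free(n);
	return freed + 1;
}
```

      After the walk the caller only has to reset its root pointer to an
      empty tree, since every node reachable from it is gone.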
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: collapse wait_on_state() to its caller wait_extent_bit() · df2a8e70
      Filipe Manana authored
      The wait_on_state() function is very short and has a single caller, which
      is wait_extent_bit(), so remove the function and put its code into the
      caller.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant memory barrier from extent_io_tree_release() · 28967c76
      Filipe Manana authored
      The memory barrier at extent_io_tree_release() is redundant. Holding
      spin_lock here is not enough to drop the barrier completely.  We only
      change the waitqueue of an extent state record while holding the tree
      lock - see wait_on_state().
      
      The update to waitqueue state will not become stale because there will
      be a spin_unlock/spin_lock sequence between the change and waiting,
      which implies a full memory barrier.
      
      So remove the explicit smp_mb() barrier.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reword reasoning ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make wait_extent_bit() static · a1c20d15
      Filipe Manana authored
      The function wait_extent_bit() is not used outside extent-io-tree.c so
      make it static. Furthermore the function doesn't have the 'btrfs_' prefix.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update stale comment at extent_io_tree_release() · bea22a58
      Filipe Manana authored
      There's this comment at extent_io_tree_release() that mentions io btrees,
      but this function is no longer used only for io btrees. Originally it was
      added as a static function named clear_btree_io_tree() at transaction.c,
      in commit 663dfbb0 ("Btrfs: deal with convert_extent_bit errors to
      avoid fs corruption"), as it was used only for cleaning one of the io
      trees that track dirty extent buffers, the dirty_log_pages io tree of a
      root and the dirty_pages io tree of a transaction. Later it was renamed
      and exported and now it's used to cleanup other io trees such as the
      allocation state io tree of a device or the csums range io tree of a log
      root.
      
      So remove that comment and replace it with one at the top of the function
      that is more complete, mentioning what the function does and that it's
      expected to be called only when a task is sure no one else will need to
      use the tree anymore, as well as there should be no locked ranges in the
      tree and therefore no waiters on its extent state records. Also add an
      assertion to check that there are no locked extent state records in the
      tree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make extent state merges more efficient during insertions · c91ea4bf
      Filipe Manana authored
      When inserting a new extent state record into an io tree that happens to
      be mergeable, we currently do the following:
      
      1) Insert the extent state record in the io tree's rbtree. This requires
         going down the tree to find where to insert it, and during the
         insertion we often need to balance the rbtree;
      
      2) We then check if the previous node is mergeable, so we call rb_prev()
         to find it, which requires some looping to find the previous node;
      
      3) If the previous node is mergeable, we adjust our node to include the
         range of the previous node and then delete the previous node from the
         rbtree, which again may need to balance the rbtree;
      
      4) Then we check if the next node is mergeable with the node we inserted,
         so we call rb_next(), which requires some looping too. If the next node
         is indeed mergeable, we expand the range of our node to include the
         next node's range and then delete the next node from the rbtree, which
         again may need to balance the tree.
      
      So these are quite a lot of iterations and looping over the rbtree, and
      some of the operations may need to rebalance the rb tree. This can be made
      a bit more efficient by:
      
      1) When iterating the rbtree, once we find a node that is mergeable with
         the node we want to insert, we can just adjust that node's range with
         the range of the node to insert - this avoids continuing iterating
         over the tree and deleting a node from the rbtree;
      
      2) If we expand the range of a mergeable node, then we find the next or
         the previous node, depending on whether we merged a range to the right or
         to the left of the node we are currently at during the iteration. This
         merging is as before, we find the next or previous node with rb_next()
         or rb_prev() and if that other node is mergeable with the current one,
         we adjust the range of the current node and remove the other node from
         the rbtree;
      
      3) Whenever we need to insert the new extent state record it's because
         we don't have any extent state record in the rbtree which can be
         merged, so we can remove the call to merge_state() after the insertion,
         saving rb_next() and rb_prev() calls, which require some looping.
      
      So update the insertion function insert_state() to have this behaviour.
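
      The "merge while searching, insert only if nothing merges" flow can be
      sketched as below. To keep it short, a sorted singly linked list
      stands in for the rbtree, so this shows the control flow only and not
      the actual insert_state() code. struct state and the helper names are
      hypothetical, ranges are inclusive, and overlapping ranges are assumed
      not to occur.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent state record; [start, end] is inclusive. */
struct state {
	uint64_t start, end;
	unsigned int bits;
	struct state *next;
};

static struct state *insert_state_sketch(struct state *head, uint64_t start,
					 uint64_t end, unsigned int bits)
{
	struct state *prev = NULL, *cur = head, *new_state;

	/* Walk to the insertion point: prev sorts before the range, cur after it. */
	while (cur && cur->start < start) {
		prev = cur;
		cur = cur->next;
	}

	if (prev && prev->bits == bits && prev->end + 1 == start) {
		/* Merge to the left: grow prev instead of inserting a record. */
		prev->end = end;
		if (cur && cur->bits == bits && end + 1 == cur->start) {
			/* The grown record now touches its right neighbour too. */
			prev->end = cur->end;
			prev->next = cur->next;
			free(cur);
		}
		return head;
	}
	if (cur && cur->bits == bits && end + 1 == cur->start) {
		/* Merge to the right: pull the start of cur back. */
		cur->start = start;
		return head;
	}

	/* Nothing is mergeable, so only now allocate and link a new record. */
	new_state = malloc(sizeof(*new_state));
	assert(new_state);
	new_state->start = start;
	new_state->end = end;
	new_state->bits = bits;
	new_state->next = cur;
	if (prev)
		prev->next = new_state;
	else
		head = new_state;
	return head;
}

/* Count the records in the list, used to observe the effect of merges. */
static size_t count_states(const struct state *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

      Inserting 0-4095 and 8192-12287 with the same bits and then 4096-8191
      leaves a single record covering 0-12287, with no allocation for the
      third range.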
      
      Running dbench for 120 seconds and capturing the execution times of
      set_extent_bit() at pin_down_extent(), resulted in the following data
      (time values are in nanoseconds):
      
      Before this change:
      
        Count: 2278299
        Range:  0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
        Percentiles:  90th: 1187.000; 95th: 1350.000; 99th: 1724.000
             0.000 -       7.534:       5 |
             7.534 -      35.418:      36 |
            35.418 -     154.403:     273 |
           154.403 -     662.138: 1244016 #####################################################
           662.138 -    2828.745: 1031335 ############################################
          2828.745 -   12074.102:    1395 |
         12074.102 -   51525.930:     806 |
         51525.930 -  219874.955:     162 |
        219874.955 -  938254.688:      22 |
        938254.688 - 4003728.000:       3 |
      
      After this change:
      
        Count: 2275862
        Range:  0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
        Percentiles:  90th: 1105.000; 95th: 1245.000; 99th: 1590.000
             0.000 -      10.219:      10 |
            10.219 -      40.957:      36 |
            40.957 -     155.907:     262 |
           155.907 -     585.789: 1127214 ####################################################
           585.789 -    2193.431: 1145134 #####################################################
          2193.431 -    8205.578:    1648 |
          8205.578 -   30689.378:    1039 |
         30689.378 -  114772.699:     362 |
        114772.699 -  429221.537:      52 |
        429221.537 - 1605175.000:      10 |
      
      Maximum duration (range), average duration, percentiles and standard
      deviation are all better.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c91ea4bf
    • David Sterba's avatar
      btrfs: change test_range_bit to scan the whole range · 893fe243
      David Sterba authored
      The semantics of test_range_bit() with filled == 0 are now in their own
      helper, so test_range_bit() checks the whole range unconditionally.
      The detection logic is flipped: it assumes success by default and
      catches exceptions.
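
      The "assume success, catch exceptions" logic can be sketched in
      userspace as follows. This is a minimal sketch under stated
      assumptions: a sorted array of non-overlapping records stands in for
      the extent state tree, a record matches only when it carries every
      requested bit, and the names (state, range_fully_set) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct state {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* inclusive */
	unsigned bits;
};

/* True only if every byte of [start, end] is covered by a record with all bits. */
static bool range_fully_set(const struct state *st, size_t n,
			    uint64_t start, uint64_t end, unsigned bits)
{
	bool covered = true;	/* assume success, look for exceptions */
	size_t i = 0;

	assert(start <= end);
	/* Skip records that end before the range begins. */
	while (i < n && st[i].end < start)
		i++;

	for (; i < n; i++) {
		if (st[i].start > start) {
			covered = false;	/* exception: hole in the range */
			break;
		}
		if ((st[i].bits & bits) != bits) {
			covered = false;	/* exception: record lacks a bit */
			break;
		}
		if (st[i].end >= end)
			break;			/* reached the end of the range */
		start = st[i].end + 1;
	}
	if (i == n)
		covered = false;	/* ran out of records inside the range */
	return covered;
}
```

      The only early exits are the exceptions (a hole, a record without the
      bits, or running out of records), so the loop body needs no mode flag.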
      Signed-off-by: David Sterba <dsterba@suse.com>
      893fe243
    • David Sterba's avatar
      btrfs: add specific helper for range bit test exists · 99be1a66
      David Sterba authored
      The existing helper test_range_bit() works in two ways: it either checks
      if the whole range contains all the bits, or it stops on the first
      occurrence.  By adding a specific helper for the latter case, the inner
      loop can be simplified and contains fewer conditionals, making it a bit
      faster.
      
      There's no caller that uses the cached state pointer so this reduces the
      argument count further.
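
      A dedicated "stop on the first occurrence" helper might look like the
      sketch below. It assumes a sorted array of non-overlapping records in
      place of the extent state tree; the names (state, range_bit_exists) are
      hypothetical and the code is not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct state {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* inclusive */
	unsigned bits;
};

/* True if any record overlapping [start, end] has at least one of the bits. */
static bool range_bit_exists(const struct state *st, size_t n,
			     uint64_t start, uint64_t end, unsigned bits)
{
	size_t i;

	assert(start <= end);
	for (i = 0; i < n; i++) {
		if (st[i].end < start)
			continue;	/* record lies before the range */
		if (st[i].start > end)
			break;		/* sorted input: nothing further overlaps */
		if (st[i].bits & bits)
			return true;	/* first occurrence is enough */
	}
	return false;
}
```

      With only one question to answer, the loop has a single success test
      and needs neither a "filled" mode flag nor hole detection.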
      Signed-off-by: David Sterba <dsterba@suse.com>
      99be1a66
    • Filipe Manana's avatar
      btrfs: move btrfs_realloc_node() from ctree.c into defrag.c · 6422b4cd
      Filipe Manana authored
      btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
      defrag.c file, so move it, and its helper close_blocks(), into defrag.c.
      
      During the move also do a few minor cosmetic changes:
      
      1) Change the return value of close_blocks() from int to bool;
      
      2) Use SZ_32K instead of 32768 at close_blocks();
      
      3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
         'end_slot';
      
      4) Get rid of 'parent_nritems' variable, in both places where it was
         used it could be replaced by calling btrfs_header_nritems(parent);
      
      5) Change the type of a couple variables from int to bool;
      
      6) Rename variable 'err' to 'ret', as that's the most common name we
         use to track the return value of a function;
      
      7) Move some variables from the top scope to the scope of the for loop
         where they are used.
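
      Changes 1) and 2) can be illustrated with a small userspace sketch. The
      function body here is an assumption about what a block proximity
      helper of this kind looks like, not a copy of the kernel source, and
      SZ_32K is defined locally rather than taken from a kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_32K 0x00008000u	/* 32768, spelled as a named size constant */

/*
 * Hypothetical sketch: report whether the gap between two blocks of size
 * 'blocksize' is smaller than 32K. Returns bool because it is a predicate.
 */
static bool close_blocks(uint64_t blocknr, uint64_t other, uint32_t blocksize)
{
	if (blocknr < other && other - (blocknr + blocksize) < SZ_32K)
		return true;
	if (blocknr > other && blocknr - (other + blocksize) < SZ_32K)
		return true;
	return false;
}
```

      A bool return type and a named constant leave the behaviour unchanged
      while making the intent readable at the call site.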
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6422b4cd
    • Filipe Manana's avatar
      btrfs: export comp_keys() from ctree.c as btrfs_comp_keys() · 79d25df0
      Filipe Manana authored
      Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
      later patch we can move out defrag specific code from ctree.c into
      defrag.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79d25df0
    • Filipe Manana's avatar
      btrfs: rename and export __btrfs_cow_block() · 95f93bc4
      Filipe Manana authored
      Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This
      allows moving defrag specific code out of ctree.c and into defrag.c in
      one of the next patches.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      95f93bc4
    • Filipe Manana's avatar
      btrfs: use round_down() to align block offset at btrfs_cow_block() · b8bf4e4d
      Filipe Manana authored
      At btrfs_cow_block() we can use round_down() to align the extent buffer's
      logical offset to the start offset of a metadata block group, instead of
      the less easy to read set of bitwise operations (two plus one subtraction).
      So replace the bitwise operations with a round_down() call.
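
      The equivalence between the two forms can be checked with a small
      userspace sketch. It assumes a power-of-two alignment; the 1 GiB value
      and the names round_down_pow2() and align_open_coded() are illustrative
      assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G 0x40000000ULL	/* illustrative alignment, assumed here */

/* Round x down to a multiple of align, which must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* The open-coded form: an AND, a NOT and a subtraction at the call site. */
static uint64_t align_open_coded(uint64_t x)
{
	return x & ~((uint64_t)SZ_1G - 1);
}
```

      Both functions return the same value for any input; the named helper
      only moves the mask arithmetic out of the caller.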
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8bf4e4d
    • Filipe Manana's avatar
      btrfs: remove noinline attribute from btrfs_cow_block() · 7bff16e3
      Filipe Manana authored
      It's pointless to have the noinline attribute for btrfs_cow_block(), as
      the function is exported and widely used. So remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7bff16e3
    • Anand Jain's avatar
      btrfs: remove incomplete metadata_uuid conversion fixup logic · 5966930d
      Anand Jain authored
      The previous commit ("btrfs: reject devices with CHANGING_FSID_V2")
      stopped the assembly of devices with the CHANGING_FSID_V2 flag in the
      kernel. Such devices can be scanned but will not be registered and can't
      be mounted without a manual fix by btrfstune.  Remove the related logic
      and the now unused code.
      
      The original motivation was to allow an interrupted partial conversion
      to fix itself on the next mount, in case the system has to be rebooted.
      This is a convenience but brings a lot of complexity to the device
      scanning and to handling the partial states.  It's hard to estimate if
      this was ever needed in practice, given that the typical use case is a
      manual conversion of an unmounted filesystem, where the user can verify
      the success and eventually rerun it.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add historical context ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      5966930d
    • Anand Jain's avatar
      btrfs: reject devices with CHANGING_FSID_V2 · 197a9ece
      Anand Jain authored
      The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state
      where the userspace btrfstune -m|-M operation failed to complete
      changing the fsid of the device.
      
      This flag makes the kernel automatically determine the other
      partner devices with which a given device can be associated, based on
      the fsid, metadata_uuid and generation values.
      
      The btrfstune -m|-M feature is especially useful in virtual cloud setups, where
      compute instances (disk images) are quickly copied, fsid changed, and
      launched. Given numerous disk images with the same metadata_uuid but
      different fsid, there's no clear way a device can be correctly assembled
      with the proper partners when the CHANGING_FSID_V2 flag is set. So, the
      disk could be assembled incorrectly, as in the example below:
      
      Before this patch:
      
      Consider the following two filesystems:
         /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrfstune -m
      operation fails.
      
      In this scenario, /dev/loop0's fsid change is interrupted and the
      CHANGING_FSID_V2 flag is set, as shown below.
      
        $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags"
      
        $ btrfs inspect dump-super /dev/loop0 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop0
        flags			0x1000000001
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		9
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop1 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop1
        flags			0x1
        fsid			11d2af4d-1b71-45a9-83f6-f2100766939d
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		10
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
        $ btrfs inspect dump-super /dev/loop2 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop2
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	1
      
        $ btrfs inspect dump-super /dev/loop3 | egrep "$p"
        superblock: bytenr=65536, device=/dev/loop3
        flags			0x1
        fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
        metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
        generation		8
        num_devices		2
        incompat_flags	0x741
        dev_item.devid	2
      
      It is normal that some devices aren't instantly discovered during
      system boot or iSCSI discovery. The controlled scan below demonstrates
      this.
      
        $ btrfs device scan --forget
        $ btrfs device scan /dev/loop0
        Scanning for btrfs filesystems on '/dev/loop0'
        $ mount /dev/loop3 /btrfs
        $ btrfs filesystem show -m
        Label: none  uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
      	Total devices 2 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 48.00MiB path /dev/loop0
      	devid    2 size 300.00MiB used 40.00MiB path /dev/loop3
      
      /dev/loop0 and /dev/loop3 are incorrectly partnered.
      
      This kernel patch removes functions and code connected to the
      CHANGING_FSID_V2 flag.
      
      With this patch, devices with the CHANGING_FSID_V2 flag are now rejected.
      And its partner will fail to mount with the extra -o degraded option.
      The check is removed from open_ctree(); devices are rejected during
      scanning, which in turn fails the mount.
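
      Rejecting a device at scan time amounts to a single flag test on the
      superblock flags. The sketch below is a userspace illustration only:
      the function name is hypothetical, the error code is an assumption,
      and the bit value is inferred from the dump-super output above, where
      the flags of /dev/loop0 (0x1000000001) differ from those of the other
      devices (0x1) by 0x1000000000.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bit position, inferred from the dump-super output. */
#define SUPER_FLAG_CHANGING_FSID_V2	(1ULL << 36)

static_assert(SUPER_FLAG_CHANGING_FSID_V2 == 0x1000000000ULL,
	      "flag must match the difference seen in the dump-super output");

/*
 * Hypothetical scan-time check: return 0 if the device may be registered,
 * or a negative errno (illustrative choice) if it must be rejected.
 */
static int check_scanned_super_flags(uint64_t super_flags)
{
	if (super_flags & SUPER_FLAG_CHANGING_FSID_V2)
		return -EINVAL;
	return 0;
}
```

      With the values from the example, /dev/loop0 would be rejected while
      /dev/loop1, /dev/loop2 and /dev/loop3 would pass this check.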
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      197a9ece
    • David Sterba's avatar
      btrfs: relocation: constify parameters where possible · ab7c8bbf
      David Sterba authored
      Many functions in relocation.c don't change their pointer parameters
      but lack the const annotations. Add them and reformat according to the
      current coding style where needed.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ab7c8bbf
    • David Sterba's avatar
      btrfs: relocation: return bool from btrfs_should_ignore_reloc_root · 32f2abca
      David Sterba authored
      btrfs_should_ignore_reloc_root() is a predicate so it should return
      bool.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      32f2abca
    • David Sterba's avatar
      btrfs: switch btrfs_backref_cache::is_reloc to bool · c71d3c69
      David Sterba authored
      The btrfs_backref_cache::is_reloc is an indicator variable and should
      use a bool type.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c71d3c69