1. 10 Sep, 2024 1 commit
  2. 11 Jul, 2024 14 commits
    • btrfs: pass a btrfs_inode to btrfs_set_prop() · 0d9b7e16
      David Sterba authored
      
      Pass a struct btrfs_inode to btrfs_set_prop() as it's an
      internal interface, which allows removing some uses of BTRFS_I().
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch btrfs_pending_snapshot::dir to btrfs_inode · c154a844
      David Sterba authored
      
      The structure is internal so we should use struct btrfs_inode for that.
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass a btrfs_inode to btrfs_ioctl_send() · 24e74598
      David Sterba authored
      
      Pass a struct btrfs_inode to btrfs_ioctl_send() and _btrfs_ioctl_send()
      as it's an internal interface, which allows removing some uses of BTRFS_I().
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: warn about inconsistent qgroups when relation update fails · a5b3abb1
      David Sterba authored
      
      Calling btrfs_handle_fs_error() after btrfs_run_qgroups() fails to
      update the qgroup status is probably not necessary, as this would turn
      the filesystem read-only. For the same reason aborting the transaction
      is also not a good option.
      
      The state is left inconsistent and can be fixed by a rescan, so printing
      a warning should be sufficient. The return code reflects the status of
      adding/deleting the relation and whether the transaction was ended
      properly.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: preallocate memory before adding a relation · 4addc1ff
      David Sterba authored
      
      There's a transaction joined in the qgroup relation add/remove ioctl and
      any error will lead to abort/error. We could lift the allocation from
      btrfs_add_qgroup_relation() and move it outside of the transaction
      context. The relation deletion does not need that.
      
      The ownership of the structure is moved to the add relation handler.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() when freeing tree block after error · bb386803
      Filipe Manana authored
      
      When freeing a tree block, at btrfs_free_tree_block(), if we fail to
      create a delayed reference we don't deal with the error and just do a
      BUG_ON(). The error most likely to happen is -ENOMEM, and we have a
      comment mentioning that only -ENOMEM can happen, but that is not true,
      because in case qgroups are enabled any error returned from
      btrfs_qgroup_trace_extent_post() (can be -EUCLEAN or anything returned
      from btrfs_search_slot() for example) can be propagated back to
      btrfs_free_tree_block().
      
      So stop doing a BUG_ON() and return the error to the callers and make
      them abort the transaction to prevent leaking space. Syzbot was
      triggering this, likely due to memory allocation failure injection.
      
      Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove super block argument from btrfs_iget() · d13240dd
      Filipe Manana authored
      
      It's pointless to pass a super block argument to btrfs_iget() because we
      always pass a root and from it we can get the super block through:
      
         root->fs_info->sb
      
      So remove the super block argument.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid transaction commit on any fsync after subvolume creation · 45c4102f
      Filipe Manana authored
      
      As of commit 1b53e51a ("btrfs: don't commit transaction for every
      subvol create") we started to make any fsync after creating a subvolume
      fall back to a transaction commit if the fsync is performed in the
      same transaction that was used to create the subvolume. This happens
      with the following at ioctl.c:create_subvol():
      
        $ cat fs/btrfs/ioctl.c
        (...)
            /* Tree log can't currently deal with an inode which is a new root. */
            btrfs_set_log_full_commit(trans);
        (...)
      
      Note that the comment is misleading as the problem is not that fsync can
      not deal with the root inode of a new root, but that we can not log any
      inode that belongs to a root that was not yet persisted because that would
      make log replay fail since the root doesn't exist at log replay time.
      
      The above simply makes any fsync fall back to a full transaction commit if
      it happens in the same transaction used to create the subvolume - even if
      it's an inode that belongs to any other subvolume. This is a brute force
      solution and it doesn't necessarily improve performance for every workload
      out there - it just moves a full transaction commit from one place, the
      subvolume creation, to another - an fsync for any inode.
      
      Just improve on this by making the fallback to a transaction commit only
      for an fsync against an inode of the new subvolume, or for the directory
      that contains the dentry that points to the new subvolume (in case anyone
      attempts to fsync the directory in the same transaction).
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
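      
      The narrowed fallback described in this commit can be sketched as a
      small decision function. This is only an illustration under stated
      assumptions: the struct and function names (sketch_new_subvol,
      sketch_inode, sketch_fsync_needs_commit) are hypothetical and do not
      correspond to how the kernel actually tracks this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of a subvolume created in some transaction. */
struct sketch_new_subvol {
	uint64_t root_id;        /* ID of the new subvolume's root */
	uint64_t parent_root_id; /* root that holds the parent directory */
	uint64_t parent_dir_ino; /* directory with the dentry for the subvolume */
	uint64_t transid;        /* transaction that created the subvolume */
};

/* Hypothetical description of the inode being fsynced. */
struct sketch_inode {
	uint64_t root_id;
	uint64_t ino;
};

/* Return true if this fsync must fall back to a full transaction commit. */
static inline bool sketch_fsync_needs_commit(const struct sketch_inode *inode,
					     const struct sketch_new_subvol *subvol,
					     uint64_t cur_transid)
{
	assert(inode && subvol);

	/* The subvolume was persisted by an earlier commit: log as usual. */
	if (subvol->transid != cur_transid)
		return false;
	/* An inode of the not yet persisted root cannot be logged. */
	if (inode->root_id == subvol->root_id)
		return true;
	/* Neither can the directory holding the dentry of the new subvolume. */
	return inode->root_id == subvol->parent_root_id &&
	       inode->ino == subvol->parent_dir_ino;
}
```

      With this shape, an fsync of an unrelated inode in the same
      transaction takes the normal logging path.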
    • btrfs: remove pointless code when creating and deleting a subvolume · ebc7c767
      Filipe Manana authored
      
      When creating and deleting a subvolume, after starting a transaction we
      are explicitly calling btrfs_record_root_in_trans() for the root which we
      passed to btrfs_start_transaction(). This is pointless because at
      transaction.c:start_transaction() we end up doing that call, regardless
      of whether we actually start a new transaction or join an existing one,
      and if we were not it would mean the root item of that root would not
      be updated in the root tree when committing the transaction, leading to
      problems easy to spot with fstests for example.
      
      Remove these redundant calls. They were introduced with commit
      74e97958 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume
      operations").
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: constify pointer parameters where applicable · 2917f741
      David Sterba authored
      
      We can add const to many parameters. This is for clarity and is a minor
      addition to safety. There are some minor effects in the assembly
      code and .ko, measured on a release config. This patch does not cover
      all possible conversions.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify range parameters of btrfs_wait_ordered_roots() · 42317ab4
      David Sterba authored
      
      The range is specified in only two ways, so we can simplify the case for
      the whole filesystem range by passing a NULL block group parameter.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
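      
      A minimal sketch of the "NULL means whole filesystem" convention this
      commit describes. The types and the helper name (sketch_block_group,
      sketch_range, sketch_wait_range) are hypothetical stand-ins, not the
      kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block group: a byte range of the filesystem. */
struct sketch_block_group {
	uint64_t start;
	uint64_t length;
};

struct sketch_range {
	uint64_t start;
	uint64_t len;
};

/* A NULL block group selects the whole filesystem range. */
static inline struct sketch_range
sketch_wait_range(const struct sketch_block_group *bg)
{
	struct sketch_range range = { .start = 0, .len = UINT64_MAX };

	if (bg) {
		assert(bg->length > 0); /* an empty block group makes no sense */
		range.start = bg->start;
		range.len = bg->length;
	}
	return range;
}
```

      Callers then pass either a block group or NULL instead of a
      start/length pair.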
    • btrfs: remove objectid from struct btrfs_inode on 64 bits platforms · 7a7bc214
      Filipe Manana authored
      
      On 64 bits platforms we don't really need to have a dedicated member (the
      objectid field) for the inode's number since we store it in the VFS
      inode's i_ino member, which is an unsigned long and this type is 64 bits
      wide on 64 bits platforms. We only need that field in case we are on a
      32 bits platform because the unsigned long type is 32 bits wide on such
      platforms. See commit 33345d01 ("Btrfs: Always use 64bit inode number")
      regarding this 64/32 bits detail.
      
      The objectid field of struct btrfs_inode is also used to store the ID of
      a root for directories that are stubs for unreferenced roots. In such
      cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
      flag set.
      
      So in order to reduce the size of btrfs_inode structure on 64 bits
      platforms we can remove the objectid member and use the VFS inode's i_ino
      member instead whenever we need to get the inode number. In case the inode
      is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
      last_reflink_trans to store the ID of the unreferenced root, since such
      inode is a directory and reflinks can't be done against directories.
      
      So remove the objectid field for 64 bits platforms and alias the
      last_reflink_trans field with a name of ref_root_id in a union.
      On a release kernel config, this reduces the size of struct btrfs_inode
      from 1040 bytes down to 1032 bytes.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
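      
      The union aliasing described in this commit can be illustrated with
      two heavily simplified structures. Everything here is an assumption
      made for the sketch: the struct layouts, the SKETCH_ROOT_STUB flag
      and the accessors are hypothetical and far smaller than the real
      struct btrfs_inode.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ROOT_STUB (1ULL << 0) /* stands in for BTRFS_INODE_ROOT_STUB */

/* Before: a dedicated objectid next to the VFS inode number. */
struct sketch_inode_old {
	uint64_t i_ino;              /* stands in for the VFS inode's i_ino */
	uint64_t runtime_flags;
	uint64_t objectid;           /* inode number, or root ID for a stub */
	uint64_t last_reflink_trans; /* only meaningful for regular files */
};

/* After: no objectid, and the root ID shares storage with a file-only field. */
struct sketch_inode_new {
	uint64_t i_ino;
	uint64_t runtime_flags;
	union {
		uint64_t last_reflink_trans; /* regular files */
		uint64_t ref_root_id;        /* directories that are root stubs */
	};
};

static inline uint64_t sketch_ino(const struct sketch_inode_new *inode)
{
	return inode->i_ino;
}

static inline uint64_t sketch_ref_root_id(const struct sketch_inode_new *inode)
{
	/* Only root stubs may interpret the union as a root ID. */
	assert(inode->runtime_flags & SKETCH_ROOT_STUB);
	return inode->ref_root_id;
}
```

      The sketch saves one 64-bit member, which matches the 8 byte
      reduction reported in the commit message.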
    • btrfs: remove location key from struct btrfs_inode · 068fc8f9
      Filipe Manana authored
      
      Currently struct btrfs_inode has a key member, named "location", that is
      either:
      
      1) The key of the inode's item. In this case the objectid is the number
         of the inode;
      
      2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
         the case where we have a root that is a snapshot of a subvolume that
         points to other subvolumes. In this case the objectid is the ID of
         a subvolume inside the snapshotted parent subvolume.
      
      The key is only used to lookup the inode item for the first case, while
      for the second it's never used since it corresponds to directory stubs
      created with new_simple_dir() and which are marked as dummy, so there's
      no actual inode item to ever update. In the second case we only check
      the key type at btrfs_ino() for 32 bits platforms and its objectid is
      only needed for unlink.
      
      Instead of using a key we can do fine with just the objectid, since we
      can generate the key whenever we need it having only the objectid, as
      in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
      is always 0.
      
      So use only an objectid instead of a full key. This reduces the size of
      struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
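      
      Generating the key on demand from just the objectid, as this commit
      describes, could look roughly like the following. The names
      sketch_key and sketch_inode_key are hypothetical, the struct is not
      the on-disk layout, and SKETCH_INODE_ITEM_KEY is a placeholder for
      BTRFS_INODE_ITEM_KEY.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INODE_ITEM_KEY 1 /* placeholder for BTRFS_INODE_ITEM_KEY */

/* Hypothetical in-memory key: (objectid, type, offset). */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Storing only the objectid is smaller than storing the full key. */
static_assert(sizeof(uint64_t) < sizeof(struct sketch_key),
	      "objectid alone must be smaller than a full key");

/* Rebuild the inode item key: the type and offset are always the same. */
static inline struct sketch_key sketch_inode_key(uint64_t objectid)
{
	struct sketch_key key = {
		.objectid = objectid,
		.type = SKETCH_INODE_ITEM_KEY,
		.offset = 0,
	};
	return key;
}
```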
    • btrfs: qgroup: do quick checks if quotas are enabled before starting ioctls · 34064fc3
      David Sterba authored
      
      The ioctls that add relations, create qgroups or set limits start or
      join a transaction. When quotas are not enabled this is not necessary.
      Errors will be reported back anyway, but they could be misleading, and
      we should really report that quotas are not enabled. For that use
      -ENOTCONN.
      
      The helper is meant to do a quick check before any other standard ioctl
      checks are done. If quota is disabled meanwhile we still rely on proper
      locking inside any active operation changing the qgroup structures.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
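      
      The quick check can be sketched as an early return placed before any
      transaction work. The sketch_fs_info structure and the function name
      are hypothetical; only the -ENOTCONN return value comes from the
      commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the filesystem state the ioctl looks at. */
struct sketch_fs_info {
	bool quota_enabled;
	int transactions_started; /* counts how often we reached the slow path */
};

static inline int sketch_qgroup_ioctl(struct sketch_fs_info *fs_info)
{
	assert(fs_info);

	/* Quick check first: do not start or join a transaction for nothing. */
	if (!fs_info->quota_enabled)
		return -ENOTCONN;

	/* Placeholder for starting a transaction and doing the real work. */
	fs_info->transactions_started++;
	return 0;
}
```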
  3. 20 Jun, 2024 1 commit
    • fs: Initial atomic write support · c34fc6f2
      Prasad Singamsetty authored
      
      An atomic write is a write issued with torn-write protection, meaning
      that for a power failure or any other hardware failure, all or none of the
      data from the write will be stored, but never a mix of old and new data.
      
      Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
      write is to be issued with torn-write prevention, according to special
      alignment and length rules.
      
      For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
      iocb->ki_flags field to indicate the same.
      
      A call to statx will give the relevant atomic write info for a file:
      - atomic_write_unit_min
      - atomic_write_unit_max
      - atomic_write_segments_max
      
      Both min and max values must be a power-of-2.
      
      Applications can take advantage of the atomic write feature by ensuring
      that the total length of a write is a power-of-2 in size and also sized
      between atomic_write_unit_min and atomic_write_unit_max, inclusive.
      Applications must ensure that the write is at a naturally-aligned offset
      in the file with respect to the total write length. The value in
      atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt.
      
      Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
      flag set will have RWF_ATOMIC rejected and not just ignored.
      
      Add a type argument to kiocb_set_rw_flags() to allow reads which have
      RWF_ATOMIC set to be rejected.
      
      Helper function generic_atomic_write_valid() can be used by FSes to verify
      compliant writes. There we check that the iov_iter type is ubuf, which
      implies iovcnt==1 for pwritev2(), which is an initial restriction for
      atomic_write_segments_max. Initially the only user will be bdev file
      operations write handler. We will rely on the block BIO submission path to
      ensure write sizes are compliant for the bdev, so we don't need to check
      atomic writes sizes yet.
      
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      jpg: merge into single patch and much rewrite
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
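      
      The length and alignment rules that applications must follow can be
      written down as a small userspace check. This is a sketch of the
      rules as stated in the commit message, not the kernel's
      generic_atomic_write_valid() helper; the function names are
      hypothetical, and unit_min/unit_max are assumed to have been read
      from statx beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static inline bool sketch_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Check whether a write of len bytes at offset pos follows the rules:
 * power-of-2 length, within [unit_min, unit_max], naturally aligned.
 */
static inline bool sketch_atomic_write_ok(uint64_t pos, uint64_t len,
					  uint64_t unit_min, uint64_t unit_max)
{
	/* Both limits are required to be powers of 2. */
	assert(sketch_is_power_of_2(unit_min) && sketch_is_power_of_2(unit_max));

	if (!sketch_is_power_of_2(len))
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	/* Natural alignment: the offset is a multiple of the write length. */
	return (pos & (len - 1)) == 0;
}
```

      A caller would run this check before issuing pwritev2() with
      RWF_ATOMIC, since a non-compliant write is rejected rather than
      silently downgraded.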
  4. 07 May, 2024 2 commits
  5. 25 Apr, 2024 1 commit
    • btrfs: take the cleaner_mutex earlier in qgroup disable · 0f2b8098
      Josef Bacik authored
      
      One of my CI runs popped the following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.9.0-rc4+ #1 Not tainted
      ------------------------------------------------------
      btrfs/471533 is trying to acquire lock:
      ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0
      
      but task is already holding lock:
      ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&fs_info->subvol_sem){++++}-{3:3}:
             down_read+0x42/0x170
             btrfs_rename+0x607/0xb00
             btrfs_rename2+0x2e/0x70
             vfs_rename+0xaf8/0xfc0
             do_renameat2+0x586/0x600
             __x64_sys_rename+0x43/0x50
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}:
             down_write+0x3f/0xc0
             btrfs_inode_lock+0x40/0x70
             prealloc_file_extent_cluster+0x1b0/0x370
             relocate_file_extent_cluster+0xb2/0x720
             relocate_data_extent+0x107/0x160
             relocate_block_group+0x442/0x550
             btrfs_relocate_block_group+0x2cb/0x4b0
             btrfs_relocate_chunk+0x50/0x1b0
             btrfs_balance+0x92f/0x13d0
             btrfs_ioctl+0x1abf/0x2600
             __x64_sys_ioctl+0x97/0xd0
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}:
             __lock_acquire+0x13e7/0x2180
             lock_acquire+0xcb/0x2e0
             __mutex_lock+0xbe/0xc00
             btrfs_quota_disable+0x54/0x4c0
             btrfs_ioctl+0x206b/0x2600
             __x64_sys_ioctl+0x97/0xd0
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&fs_info->subvol_sem);
                                     lock(&sb->s_type->i_mutex_key#16);
                                     lock(&fs_info->subvol_sem);
        lock(&fs_info->cleaner_mutex);
      
       *** DEADLOCK ***
      
      2 locks held by btrfs/471533:
       #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600
       #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
      
      stack backtrace:
      CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1
      Call Trace:
       <TASK>
       dump_stack_lvl+0x77/0xb0
       check_noncircular+0x148/0x160
       ? lock_acquire+0xcb/0x2e0
       __lock_acquire+0x13e7/0x2180
       lock_acquire+0xcb/0x2e0
       ? btrfs_quota_disable+0x54/0x4c0
       ? lock_is_held_type+0x9a/0x110
       __mutex_lock+0xbe/0xc00
       ? btrfs_quota_disable+0x54/0x4c0
       ? srso_return_thunk+0x5/0x5f
       ? lock_acquire+0xcb/0x2e0
       ? btrfs_quota_disable+0x54/0x4c0
       ? btrfs_quota_disable+0x54/0x4c0
       btrfs_quota_disable+0x54/0x4c0
       btrfs_ioctl+0x206b/0x2600
       ? srso_return_thunk+0x5/0x5f
       ? __do_sys_statfs+0x61/0x70
       __x64_sys_ioctl+0x97/0xd0
       do_syscall_64+0x95/0x180
       ? srso_return_thunk+0x5/0x5f
       ? reacquire_held_locks+0xd1/0x1f0
       ? do_user_addr_fault+0x307/0x8a0
       ? srso_return_thunk+0x5/0x5f
       ? lock_acquire+0xcb/0x2e0
       ? srso_return_thunk+0x5/0x5f
       ? srso_return_thunk+0x5/0x5f
       ? find_held_lock+0x2b/0x80
       ? srso_return_thunk+0x5/0x5f
       ? lock_release+0xca/0x2a0
       ? srso_return_thunk+0x5/0x5f
       ? do_user_addr_fault+0x35c/0x8a0
       ? srso_return_thunk+0x5/0x5f
       ? trace_hardirqs_off+0x4b/0xc0
       ? srso_return_thunk+0x5/0x5f
       ? lockdep_hardirqs_on_prepare+0xde/0x190
       ? srso_return_thunk+0x5/0x5f
      
      This happens because when we call rename we already have the inode mutex
      held, and then we acquire the subvol_sem if we are a subvolume.  This
      makes the dependency
      
      inode lock -> subvol sem
      
      When we're running data relocation we will preallocate space for the
      data relocation inode, and we always run the relocation under the
      ->cleaner_mutex.  This now creates the dependency of
      
      cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem
      
      Qgroup disable does this in the opposite order: it acquires the
      subvol_sem and then the cleaner_mutex, which results in this lockdep
      splat.  This deadlock can't happen in reality, because we won't ever
      rename the data reloc inode, nor is the data reloc inode a subvolume.
      
      However this is fairly easy to fix, simply take the cleaner mutex in the
      case where we are disabling qgroups before we take the subvol_sem.  This
      resolves the lockdep splat.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
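      
      The dependency chain from the splat can be modelled by giving each
      lock a rank and checking that locks are only taken in increasing
      rank. This is a toy model for illustration, not how lockdep works
      internally; the enum and the helper are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ranks follow the chain: cleaner_mutex -> inode lock -> subvol_sem. */
enum sketch_lock {
	SKETCH_CLEANER_MUTEX = 0,
	SKETCH_INODE_LOCK = 1,
	SKETCH_SUBVOL_SEM = 2,
};

/* True if the acquisition sequence never goes against the chain. */
static inline bool sketch_order_ok(const enum sketch_lock *seq, size_t n)
{
	assert(seq || n == 0);

	for (size_t i = 1; i < n; i++) {
		if (seq[i] <= seq[i - 1])
			return false;
	}
	return true;
}
```

      In this model the old quota disable path (subvol_sem, then
      cleaner_mutex) violates the order, while the fixed path
      (cleaner_mutex, then subvol_sem) does not.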
  6. 02 Apr, 2024 1 commit
    • btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations · 74e97958
      Boris Burkov authored
      
      Create subvolume, create snapshot and delete subvolume all use
      btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
      done to the parent subvolume's fs tree, which cannot be mediated in the
      normal way via start_transaction. When quota groups (squota or qgroups)
      are enabled, this reserves qgroup metadata of type PREALLOC. Once the
      operation is associated to a transaction, we convert PREALLOC to
      PERTRANS, which gets cleared in bulk at the end of the transaction.
      
      However, the error paths of these three operations were not implementing
      this lifecycle correctly. They unconditionally converted the PREALLOC to
      PERTRANS in a generic cleanup step regardless of errors or whether the
      operation was fully associated to a transaction or not. This resulted in
      error paths occasionally converting this rsv to PERTRANS without calling
      record_root_in_trans successfully, which meant that unless that root got
      recorded in the transaction by some other thread, the end of the
      transaction would not free that root's PERTRANS, leaking it. Ultimately,
      this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
      for the leaked reservation.
      
      The fix is to ensure that every qgroup PREALLOC reservation observes the
      following properties:
      
      1. any failure before record_root_in_trans is called successfully
         results in freeing the PREALLOC reservation.
      2. after record_root_in_trans, we convert to PERTRANS, and now the
         transaction owns freeing the reservation.
      
      This patch enforces those properties on the three operations. Without
      it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
      runs on my system. With this patch, it ran successfully 1000 times in a
      row.
      
      Fixes: e85fde51 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
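      
      The two properties listed in this commit amount to a small state
      machine for the reservation. The sketch below assumes hypothetical
      names (sketch_rsv, sketch_subvol_op) and collapses "record the root
      in the transaction" into a boolean outcome.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_rsv_state {
	SKETCH_RSV_PREALLOC, /* reserved, not yet tied to a transaction */
	SKETCH_RSV_PERTRANS, /* owned by the transaction, freed at its end */
	SKETCH_RSV_FREED,    /* released on an error path */
};

struct sketch_rsv {
	enum sketch_rsv_state state;
};

/* Returns 0 on success, -1 if recording the root in the transaction failed. */
static inline int sketch_subvol_op(struct sketch_rsv *rsv, bool record_root_ok)
{
	assert(rsv && rsv->state == SKETCH_RSV_PREALLOC);

	if (!record_root_ok) {
		/* Property 1: failure before the root is recorded frees PREALLOC. */
		rsv->state = SKETCH_RSV_FREED;
		return -1;
	}
	/* Property 2: after recording, the transaction owns the reservation. */
	rsv->state = SKETCH_RSV_PERTRANS;
	return 0;
}
```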
  7. 05 Mar, 2024 1 commit
    • btrfs: qgroup: validate btrfs_qgroup_inherit parameter · 86211eea
      Qu Wenruo authored
      
      [BUG]
      Currently btrfs can create subvolume with an invalid qgroup inherit
      without triggering any error:
      
        # mkfs.btrfs -O quota -f $dev
        # mount $dev $mnt
        # btrfs subvolume create -i 2/0 $mnt/subv1
        # btrfs qgroup show -prce --sync $mnt
        Qgroupid    Referenced    Exclusive   Path
        --------    ----------    ---------   ----
        0/5           16.00KiB     16.00KiB   <toplevel>
        0/256         16.00KiB     16.00KiB   subv1
      
      [CAUSE]
      We only do a very basic size check for btrfs_qgroup_inherit structure,
      but never really verify if the values are correct.
      
      Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
      qgroups, and never return any error.
      
      [FIX]
      Fix the behavior and introduce extra checks:
      
      - Introduce early check for btrfs_qgroup_inherit structure
        Not only the size, but also all the qgroup ids would be verified.
      
        And the timing is very early, so we can return error early.
        This early check is very important for snapshot creation, as snapshot
        is delayed to transaction commit.
      
      - Drop support for btrfs_qgroup_inherit::num_ref_copies and
        num_excl_copies
        Those two members are used to specify to copy refr/excl numbers from
        other qgroups.
        This would definitely mark qgroup inconsistent, and btrfs-progs has
        dropped the support for them for a long time.
        It's time to drop the support for kernel.
      
      - Verify the supported btrfs_qgroup_inherit::flags
        Just in case we want to add extra flags for btrfs_qgroup_inherit.
      
      Now the above subvolume creation fails with -ENOENT rather than silently
      ignoring the non-existing qgroup.
      
      CC: stable@vger.kernel.org # 6.7+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
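      
      The three checks from the [FIX] list can be sketched as one early
      validation function. The struct, the supported-flags mask and the
      helper names are hypothetical. Only the -ENOENT result for a missing
      qgroup is stated in the commit message; the -EOPNOTSUPP and -EINVAL
      return values are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INHERIT_FLAGS_SUPP 0x1ULL /* hypothetical supported-flags mask */

/* Hypothetical, simplified view of the inherit request. */
struct sketch_inherit {
	uint64_t flags;
	uint64_t num_ref_copies;
	uint64_t num_excl_copies;
	size_t num_qgroups;
	const uint64_t *qgroups;
};

/* A qgroup ID combines a level and an ID, e.g. 2/0 is level 2, ID 0. */
static inline uint64_t sketch_qgroupid(uint16_t level, uint64_t id)
{
	return ((uint64_t)level << 48) | id;
}

static inline int sketch_check_inherit(const struct sketch_inherit *inherit,
				       const uint64_t *known, size_t num_known)
{
	assert(inherit);

	if (inherit->flags & ~SKETCH_INHERIT_FLAGS_SUPP)
		return -EOPNOTSUPP; /* assumed error code */
	if (inherit->num_ref_copies || inherit->num_excl_copies)
		return -EINVAL;     /* assumed error code */

	/* Every requested qgroup must already exist. */
	for (size_t i = 0; i < inherit->num_qgroups; i++) {
		int found = 0;

		for (size_t j = 0; j < num_known; j++) {
			if (known[j] == inherit->qgroups[i])
				found = 1;
		}
		if (!found)
			return -ENOENT;
	}
	return 0;
}
```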
  8. 04 Mar, 2024 6 commits
  9. 29 Feb, 2024 1 commit
    • btrfs: fix double free of anonymous device after snapshot creation failure · e2b54eaf
      Filipe Manana authored
      
      When creating a snapshot we may do a double free of an anonymous device
      in case there's an error committing the transaction. The second free may
      result in freeing an anonymous device number that was allocated by some
      other subsystem in the kernel or another btrfs filesystem.
      
      The steps that lead to this:
      
      1) At ioctl.c:create_snapshot() we allocate an anonymous device number
         and assign it to pending_snapshot->anon_dev;
      
      2) Then we call btrfs_commit_transaction() and end up at
         transaction.c:create_pending_snapshot();
      
      3) There we call btrfs_get_new_fs_root() and pass it the anonymous device
         number stored in pending_snapshot->anon_dev;
      
      4) btrfs_get_new_fs_root() frees that anonymous device number because
         btrfs_lookup_fs_root() returned a root - someone else did a lookup
         of the new root already, which could be some task doing backref
         walking;
      
      5) After that some error happens in the transaction commit path, and at
         ioctl.c:create_snapshot() we jump to the 'fail' label, and after
         that we free again the same anonymous device number, which in the
         meanwhile may have been reallocated somewhere else, because
         pending_snapshot->anon_dev still has the same value as in step 1.
      
      Recently syzbot ran into this and reported the following trace:
      
        ------------[ cut here ]------------
        ida_free called for id=51 which is not allocated.
        WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525
        Modules linked in:
        CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
        RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525
        Code: 10 42 80 3c 28 (...)
        RSP: 0018:ffffc90015a67300 EFLAGS: 00010246
        RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000
        RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000
        RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4
        R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246
        R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246
        FS:  00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0
        Call Trace:
         <TASK>
         btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346
         create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837
         create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931
         btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404
         create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848
         btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998
         btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044
         __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306
         btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393
         btrfs_ioctl+0xa74/0xd40
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:871 [inline]
         __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857
         do_syscall_64+0xfb/0x240
         entry_SYSCALL_64_after_hwframe+0x6f/0x77
        RIP: 0033:0x7fca3e67dda9
        Code: 28 00 00 00 (...)
        RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9
        RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658
         </TASK>
      
      Here we get an explicit message that we attempted to free an anonymous
      device number that is not currently allocated. It happens in a different
      code path from the example above, at btrfs_get_root_ref(), so this change
      may not fix the case triggered by syzbot.
      
      To fix at least the code path from the example above, change
      btrfs_get_root_ref() and its callers to receive a dev_t pointer argument
      for the anonymous device number, so that in case it frees the number, it
      also resets it to 0, so that up in the call chain we don't attempt to do
      the double free.
      
      CC: stable@vger.kernel.org # 5.10+
      Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/
      Fixes: e03ee2fe ("btrfs: do not ASSERT() if the newly created subvolume already got read")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
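      
      The idea of the fix, passing the device number by pointer so the
      callee can reset it to 0 once it has freed it, can be sketched as
      follows. The tiny allocator and all names (sketch_ida,
      sketch_free_anon_dev, sketch_get_root) are hypothetical and only
      model the ownership handoff.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_DEV 8

/* Hypothetical allocator that also counts frees of unallocated numbers. */
struct sketch_ida {
	bool allocated[SKETCH_MAX_DEV];
	int bad_frees;
};

static inline void sketch_free_anon_dev(struct sketch_ida *ida, unsigned int dev)
{
	assert(ida && dev < SKETCH_MAX_DEV);

	if (dev == 0)
		return; /* 0 means there is nothing to free */
	if (!ida->allocated[dev])
		ida->bad_frees++; /* this would be the double free */
	ida->allocated[dev] = false;
}

/* If the root was already cached, free the number and tell the caller. */
static inline void sketch_get_root(struct sketch_ida *ida,
				   unsigned int *anon_dev, bool root_cached)
{
	assert(anon_dev);

	if (root_cached) {
		sketch_free_anon_dev(ida, *anon_dev);
		*anon_dev = 0; /* the caller's error path now frees nothing */
	}
}
```

      With a by-value argument the caller would still hold the stale
      number and free it a second time on its error path.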
  10. 25 Feb, 2024 1 commit
  11. 31 Jan, 2024 1 commit
  12. 12 Jan, 2024 2 commits
    • btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args · 173431b2
      Qu Wenruo authored
      
      Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
      
      This is not really to enhance fuzzing tests, but as a preparation for
      future expansion on btrfs_ioctl_defrag_range_args.
      
      In the future we're going to add new members, allowing more fine tuning
      for btrfs defrag.  Without the -EOPNOTSUPP error, there would be no way
      to detect if the kernel supports those new defrag features.
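      
      A minimal user-space sketch of such a check is shown below. The flag
      values and the BTRFS_DEFRAG_RANGE_FLAGS_SUPP mask are written from
      memory of the btrfs UAPI header and should be treated as assumptions;
      check_defrag_range_flags() is a hypothetical helper name, not a kernel
      function.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror the btrfs UAPI definitions; verify against the header. */
#define BTRFS_DEFRAG_RANGE_COMPRESS	1
#define BTRFS_DEFRAG_RANGE_START_IO	2
#define BTRFS_DEFRAG_RANGE_FLAGS_SUPP	(BTRFS_DEFRAG_RANGE_COMPRESS | \
					 BTRFS_DEFRAG_RANGE_START_IO)

/*
 * Hypothetical helper: reject any bit outside the supported mask so that
 * user space can probe for new features by setting a new flag and checking
 * for -EOPNOTSUPP.
 */
static int check_defrag_range_flags(uint64_t flags)
{
	if (flags & ~(uint64_t)BTRFS_DEFRAG_RANGE_FLAGS_SUPP)
		return -EOPNOTSUPP;
	return 0;
}
```

      If unknown bits were silently ignored, an old kernel would accept a
      request using a future flag and user space could not tell that the
      feature was not applied.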
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      173431b2
    • btrfs: don't abort filesystem when attempting to snapshot deleted subvolume · 7081929a
      Omar Sandoval authored
      
      If the source file descriptor to the snapshot ioctl refers to a deleted
      subvolume, we get the following abort:
      
        BTRFS: Transaction aborted (error -2)
        WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
        Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
        CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
        RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
        RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
        RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
        RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
        R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
        R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
        FS:  00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? __warn+0x81/0x130
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? report_bug+0x171/0x1a0
         ? handle_bug+0x3a/0x70
         ? exc_invalid_op+0x17/0x70
         ? asm_exc_invalid_op+0x1a/0x20
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         create_pending_snapshots+0x92/0xc0 [btrfs]
         btrfs_commit_transaction+0x66b/0xf40 [btrfs]
         btrfs_mksubvol+0x301/0x4d0 [btrfs]
         btrfs_mksnapshot+0x80/0xb0 [btrfs]
         __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
         btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
         btrfs_ioctl+0x8a6/0x2650 [btrfs]
         ? kmem_cache_free+0x22/0x340
         ? do_sys_openat2+0x97/0xe0
         __x64_sys_ioctl+0x97/0xd0
         do_syscall_64+0x46/0xf0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
        RIP: 0033:0x7fe20abe83af
        RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
        RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
         </TASK>
        ---[ end trace 0000000000000000 ]---
        BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
        BTRFS info (device vdc: state EA): forced readonly
        BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
        BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
      
      This happens because create_pending_snapshot() initializes the new root
      item as a copy of the source root item. This includes the refs field,
      which is 0 for a deleted subvolume. The call to btrfs_insert_root()
      therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
      finds the root and returns -ENOENT if refs == 0, which causes
      create_pending_snapshot() to abort.
      
      Fix it by checking the source root's refs before attempting the
      snapshot, but after locking subvol_sem to avoid racing with deletion.
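      
      The shape of the check can be sketched as below. This is a simplified
      user-space mock under stated assumptions: struct mock_root and
      mock_snapshot_precheck() are hypothetical, and the subvol_sem locking
      that the real fix relies on is only indicated in a comment.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a subvolume root and its reference count. */
struct mock_root {
	uint32_t refs;
};

/*
 * Hypothetical pre-check run before queueing a snapshot. In the kernel this
 * would be done with subvol_sem held so that a concurrent deletion cannot
 * drop refs to 0 between the check and the snapshot; locking is omitted
 * from this mock.
 */
static int mock_snapshot_precheck(const struct mock_root *src)
{
	if (src->refs == 0)
		return -ENOENT;	/* source subvolume was deleted */
	return 0;
}
```

      Failing early with -ENOENT returns the error to the ioctl caller
      instead of reaching the transaction commit, where the same error
      forces an abort.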
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7081929a
  13. 15 Dec, 2023 1 commit
  14. 24 Nov, 2023 2 commits
  15. 03 Nov, 2023 1 commit
  16. 28 Oct, 2023 1 commit
  17. 12 Oct, 2023 3 commits
    • btrfs: disable the device add feature for temp-fsid · ac6ea6a9
      Anand Jain authored
      
      The device addition operation will transform the cloned temp-fsid mounted
      device into a multi-device filesystem. Therefore, it is marked as
      unsupported.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ac6ea6a9
    • btrfs: add and use helpers for reading and writing last_trans_committed · 0124855f
      Filipe Manana authored
      
      Currently the last_trans_committed field of struct btrfs_fs_info is
      modified and read without any locking or other protection. For example
      early in the fsync path, skip_inode_logging() is called which reads
      fs_info->last_trans_committed, but at the same time we can have a
      transaction commit completing and updating that field.
      
      In the case of an fsync this is harmless and any data race should be
      rare and at most cause an unnecessary logging of an inode.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the last_trans_committed field of struct btrfs_fs_info using
      READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
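      
      A user-space approximation of such helpers might look like the sketch
      below. The READ_ONCE()/WRITE_ONCE() macros here are simplified volatile
      casts, not the kernel's definitions, and struct mock_fs_info and the
      mock_* helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel macros: a volatile access forces the
 * compiler to emit exactly one load or store for the expression. Uses the
 * GCC/Clang __typeof__ extension.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical reduced version of the filesystem info structure. */
struct mock_fs_info {
	uint64_t last_trans_committed;
};

/* Reader helper: every lockless read goes through READ_ONCE(). */
static inline uint64_t mock_get_last_trans_committed(const struct mock_fs_info *fs_info)
{
	return READ_ONCE(fs_info->last_trans_committed);
}

/* Writer helper: every update goes through WRITE_ONCE(). */
static inline void mock_set_last_trans_committed(struct mock_fs_info *fs_info,
						 uint64_t gen)
{
	WRITE_ONCE(fs_info->last_trans_committed, gen);
}
```

      Funnelling all accesses through one getter and one setter keeps the
      annotation in a single place, which makes it hard to add a new
      unannotated access by accident.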
      
      [1] https://lwn.net/Articles/793253/
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0124855f
    • btrfs: add and use helpers for reading and writing fs_info->generation · 4a4f8fe2
      Filipe Manana authored
      
      Currently the generation field of struct btrfs_fs_info is always modified
      while holding fs_info->trans_lock locked. Most readers will access this
      field without taking that lock but while holding a transaction handle,
      which is safe to do due to the transaction life cycle.
      
      However there are other readers that are neither holding the lock nor
      holding a transaction handle open:
      
      1) When reading an inode from disk, at btrfs_read_locked_inode();
      
      2) When reading the generation to expose it to sysfs, at
         btrfs_generation_show();
      
      3) Early in the fsync path, at skip_inode_logging();
      
      4) When creating a hole at btrfs_cont_expand(), during write paths,
         truncate and reflinking;
      
      5) In the fs_info ioctl (btrfs_ioctl_fs_info());
      
      6) While mounting the filesystem, in the open_ctree() path. In these
         cases it's safe to directly read fs_info->generation as no one
         can concurrently start a transaction and update fs_info->generation.
      
      In case of the fsync path, races here should be harmless, and in the worst
      case they may cause a fsync to log an inode when it's not really needed,
      so nothing bad from a functional perspective. In the other cases it's not
      so clear if functional problems may arise, though in case 1 rare events
      such as load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
      flag to not be set on an inode, resulting in incorrect logging later on
      if an fsync call is made.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the generation field of struct btrfs_fs_info using READ_ONCE() and
      WRITE_ONCE(), and use these helpers where needed.
      
      [1] https://lwn.net/Articles/793253/
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4a4f8fe2