- 10 Sep, 2024 1 commit
-
-
Filipe Manana authored
The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it does any pending work (subvolume deletion, delayed iputs, etc), however it is waking up the transaction kthread, which in turn wakes up the cleaner. Since we don't have any transaction to commit, as any ongoing transaction was already committed when it called btrfs_sync_fs() and the goal is just to wake up the cleaner thread, directly wake up the cleaner instead of the transaction kthread. Reviewed-by:
Boris Burkov <boris@bur.io> Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 11 Jul, 2024 14 commits
-
-
David Sterba authored
Pass a struct btrfs_inode to btrfs_set_prop() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The structure is internal so we should use struct btrfs_inode for that. Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
Pass a struct btrfs_inode to btrfs_ioctl_send() and _btrfs_ioctl_send() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
Calling btrfs_handle_fs_error() after btrfs_run_qgroups() fails to update the qgroup status is probably not necessary, this would turn the filesystem to read-only. For the same reason aborting the transaction is also not a good option. The state is left inconsistent and can be fixed by rescan, printing a warning should be sufficient. Return code reflects the status of adding/deleting the relation and if the transaction was ended properly. Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a transaction joined in the qgroup relation add/remove ioctl and any error will lead to abort/error. We could lift the allocation from btrfs_add_qgroup_relation() and move it outside of the transaction context. The relation deletion does not need that. The ownership of the structure is moved to the add relation handler. Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When freeing a tree block, at btrfs_free_tree_block(), if we fail to create a delayed reference we don't deal with the error and just do a BUG_ON(). The error most likely to happen is -ENOMEM, and we have a comment mentioning that only -ENOMEM can happen, but that is not true, because in case qgroups are enabled any error returned from btrfs_qgroup_trace_extent_post() (can be -EUCLEAN or anything returned from btrfs_search_slot() for example) can be propagated back to btrfs_free_tree_block(). So stop doing a BUG_ON() and return the error to the callers and make them abort the transaction to prevent leaking space. Syzbot was triggering this, likely due to memory allocation failure injection. Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/ Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
It's pointless to pass a super block argument to btrfs_iget() because we always pass a root and from it we can get the super block through: root->fs_info->sb So remove the super block argument. Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
As of commit 1b53e51a ("btrfs: don't commit transaction for every subvol create") we started to make any fsync after creating a subvolume to fallback to a transaction commit if the fsync is performed in the same transaction that was used to create the subvolume. This happens with the following at ioctl.c:create_subvol(): $ cat fs/btrfs/ioctl.c (...) /* Tree log can't currently deal with an inode which is a new root. */ btrfs_set_log_full_commit(trans); (...) Note that the comment is misleading as the problem is not that fsync can not deal with the root inode of a new root, but that we can not log any inode that belongs to a root that was not yet persisted because that would make log replay fail since the root doesn't exist at log replay time. The above simply makes any fsync fallback to a full transaction commit if it happens in the same transaction used to create the subvolume - even if it's an inode that belongs to any other subvolume. This is a brute force solution and it doesn't necessarily improve performance for every workload out there - it just moves a full transaction commit from one place, the subvolume creation, to another - an fsync for any inode. Just improve on this by making the fallback to a transaction commit only for an fsync against an inode of the new subvolume, or for the directory that contains the dentry that points to the new subvolume (in case anyone attempts to fsync the directory in the same transaction). Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When creating and deleting a subvolume, after starting a transaction we are explicitly calling btrfs_record_root_in_trans() for the root which we passed to btrfs_start_transaction(). This is pointless because at transaction.c:start_transaction() we end up doing that call, regardless of whether we actually start a new transaction or join an existing one, and if we were not it would mean the root item of that root would not be updated in the root tree when committing the transaction, leading to problems easy to spot with fstests for example. Remove these redundant calls. They were introduced with commit 74e97958 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations"). Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
We can add const to many parameters, this is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. This patch does not cover all possible conversions. Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter. Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
On 64 bits platforms we don't really need to have a dedicated member (the objectid field) for the inode's number since we store in the VFS inode's i_ino member, which is an unsigned long and this type is 64 bits wide on 64 bits platforms. We only need that field in case we are on a 32 bits platform because the unsigned long type is 32 bits wide on such platforms See commit 33345d01 ("Btrfs: Always use 64bit inode number") regarding this 64/32 bits detail. The objectid field of struct btrfs_inode is also used to store the ID of a root for directories that are stubs for unreferenced roots. In such cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime flag set. So in order to reduce the size of btrfs_inode structure on 64 bits platforms we can remove the objectid member and use the VFS inode's i_ino member instead whenever we need to get the inode number. In case the inode is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member last_reflink_trans to store the ID of the unreferenced root, since such inode is a directory and reflinks can't be done against directories. So remove the objectid fields for 64 bits platforms and alias the last_reflink_trans field with a name of ref_root_id in a union. On a release kernel config, this reduces the size of struct btrfs_inode from 1040 bytes down to 1032 bytes. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently struct btrfs_inode has a key member, named "location", that is either: 1) The key of the inode's item. In this case the objectid is the number of the inode; 2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for the case where we have a root that is a snapshot of a subvolume that points to other subvolumes. In this case the objectid is the ID of a subvolume inside the snapshotted parent subvolume. The key is only used to lookup the inode item for the first case, while for the second it's never used since it corresponds to directory stubs created with new_simple_dir() and which are marked as dummy, so there's no actual inode item to ever update. In the second case we only check the key type at btrfs_ino() for 32 bits platforms and its objectid is only needed for unlink. Instead of using a key we can do fine with just the objectid, since we can generate the key whenever we need it having only the objectid, as in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset is always 0. So use only an objectid instead of a full key. This reduces the size of struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The ioctls that add relations, create qgroups or set limits start/join transaction. When quotas are not enabled this is not necessary, there will be errors reported back anyway but this could be also misleading and we should really report that quotas are not enabled. For that use -ENOTCONN. The helper is meant to do a quick check before any other standard ioctl checks are done. If quota is disabled meanwhile we still rely on proper locking inside any active operation changing the qgroup structures. Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 20 Jun, 2024 1 commit
-
-
Prasad Singamsetty authored
An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by:
Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by:
Darrick J. Wong <djwong@kernel.org> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
John Garry <john.g.garry@oracle.com> Reviewed-by:
Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 07 May, 2024 2 commits
-
-
Josef Bacik authored
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by:
David Sterba <dsterba@suse.com>
-
Anand Jain authored
Unify naming of return value to the preferred way. Signed-off-by:
Anand Jain <anand.jain@oracle.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 25 Apr, 2024 1 commit
-
-
Josef Bacik authored
One of my CI runs popped the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.9.0-rc4+ #1 Not tainted ------------------------------------------------------ btrfs/471533 is trying to acquire lock: ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0 but task is already holding lock: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->subvol_sem){++++}-{3:3}: down_read+0x42/0x170 btrfs_rename+0x607/0xb00 btrfs_rename2+0x2e/0x70 vfs_rename+0xaf8/0xfc0 do_renameat2+0x586/0x600 __x64_sys_rename+0x43/0x50 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}: down_write+0x3f/0xc0 btrfs_inode_lock+0x40/0x70 prealloc_file_extent_cluster+0x1b0/0x370 relocate_file_extent_cluster+0xb2/0x720 relocate_data_extent+0x107/0x160 relocate_block_group+0x442/0x550 btrfs_relocate_block_group+0x2cb/0x4b0 btrfs_relocate_chunk+0x50/0x1b0 btrfs_balance+0x92f/0x13d0 btrfs_ioctl+0x1abf/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}: __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 __mutex_lock+0xbe/0xc00 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->subvol_sem); lock(&sb->s_type->i_mutex_key#16); lock(&fs_info->subvol_sem); lock(&fs_info->cleaner_mutex); *** DEADLOCK *** 2 locks held by btrfs/471533: #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600 #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 stack backtrace: CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1 Call Trace: <TASK> dump_stack_lvl+0x77/0xb0 check_noncircular+0x148/0x160 ? lock_acquire+0xcb/0x2e0 __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? lock_is_held_type+0x9a/0x110 __mutex_lock+0xbe/0xc00 ? btrfs_quota_disable+0x54/0x4c0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? btrfs_quota_disable+0x54/0x4c0 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 ? srso_return_thunk+0x5/0x5f ? __do_sys_statfs+0x61/0x70 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 ? srso_return_thunk+0x5/0x5f ? reacquire_held_locks+0xd1/0x1f0 ? do_user_addr_fault+0x307/0x8a0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? find_held_lock+0x2b/0x80 ? srso_return_thunk+0x5/0x5f ? lock_release+0xca/0x2a0 ? srso_return_thunk+0x5/0x5f ? do_user_addr_fault+0x35c/0x8a0 ? srso_return_thunk+0x5/0x5f ? trace_hardirqs_off+0x4b/0xc0 ? srso_return_thunk+0x5/0x5f ? lockdep_hardirqs_on_prepare+0xde/0x190 ? srso_return_thunk+0x5/0x5f This happens because when we call rename we already have the inode mutex held, and then we acquire the subvol_sem if we are a subvolume. This makes the dependency inode lock -> subvol sem When we're running data relocation we will preallocate space for the data relocation inode, and we always run the relocation under the ->cleaner_mutex. This now creates the dependency of cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem Qgroup delete is doing this in the opposite order, it is acquiring the subvol_sem and then it is acquiring the cleaner_mutex, which results in this lockdep splat. This deadlock can't happen in reality, because we won't ever rename the data reloc inode, nor is the data reloc inode a subvolume. However this is fairly easy to fix, simply take the cleaner mutex in the case where we are disabling qgroups before we take the subvol_sem. This resolves the lockdep splat. Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 02 Apr, 2024 1 commit
-
-
Boris Burkov authored
Create subvolume, create snapshot and delete subvolume all use btrfs_subvolume_reserve_metadata() to reserve metadata for the changes done to the parent subvolume's fs tree, which cannot be mediated in the normal way via start_transaction. When quota groups (squota or qgroups) are enabled, this reserves qgroup metadata of type PREALLOC. Once the operation is associated to a transaction, we convert PREALLOC to PERTRANS, which gets cleared in bulk at the end of the transaction. However, the error paths of these three operations were not implementing this lifecycle correctly. They unconditionally converted the PREALLOC to PERTRANS in a generic cleanup step regardless of errors or whether the operation was fully associated to a transaction or not. This resulted in error paths occasionally converting this rsv to PERTRANS without calling record_root_in_trans successfully, which meant that unless that root got recorded in the transaction by some other thread, the end of the transaction would not free that root's PERTRANS, leaking it. Ultimately, this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount for the leaked reservation. The fix is to ensure that every qgroup PREALLOC reservation observes the following properties: 1. any failure before record_root_in_trans is called successfully results in freeing the PREALLOC reservation. 2. after record_root_in_trans, we convert to PERTRANS, and now the transaction owns freeing the reservation. This patch enforces those properties on the three operations. Without it, generic/269 with squotas enabled at mkfs time would fail in ~5-10 runs on my system. With this patch, it ran successfully 1000 times in a row. Fixes: e85fde51 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 05 Mar, 2024 1 commit
-
-
Qu Wenruo authored
[BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 04 Mar, 2024 6 commits
-
-
David Sterba authored
The validation of vol args v2 name in snapshot and device remove ioctls is not done properly. A terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The validation of vol args name in several ioctls is not done properly. a terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete() to verify various conditions before deletion. There's a BUG_ON added before linux.git started, we can turn it to a proper error handling at least in our local helper. A mistmatch between directory and the deleted dentry is clearly invalid. This won't be probably ever hit due to the way how the parameters are set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper lookup_one(). Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
David Sterba authored
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 29 Feb, 2024 1 commit
-
-
Filipe Manana authored
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 25 Feb, 2024 1 commit
-
-
Christian Brauner authored
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-19-adbd023e19cc@kernel.org Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
- 31 Jan, 2024 1 commit
-
-
Boris Burkov authored
Creating a qgroup 0/subvolid leads to various races and it isn't helpful, because you can't specify a subvol id when creating a subvol, so you can't be sure it will be the right one. Any requirements on the automatic subvol can be gratified by using a higher level qgroup and the inheritance parameters of subvol creation. Fixes: cecbb533 ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 4.14+ Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Boris Burkov <boris@bur.io> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 12 Jan, 2024 2 commits
-
-
Qu Wenruo authored
Add extra sanity check for btrfs_ioctl_defrag_range_args::flags. This is not really to enhance fuzzing tests, but as a preparation for future expansion on btrfs_ioctl_defrag_range_args. In the future we're going to add new members, allowing more fine tuning for btrfs defrag. Without the -ENONOTSUPP error, there would be no way to detect if the kernel supports those new defrag features. CC: stable@vger.kernel.org # 4.14+ Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
If the source file descriptor to the snapshot ioctl refers to a deleted subvolume, we get the following abort: BTRFS: Transaction aborted (error -2) WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs] Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs] RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027 RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840 RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998 R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80 FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0 Call Trace: <TASK> ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? __warn+0x81/0x130 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? report_bug+0x171/0x1a0 ? handle_bug+0x3a/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? create_pending_snapshot+0x1040/0x1190 [btrfs] create_pending_snapshots+0x92/0xc0 [btrfs] btrfs_commit_transaction+0x66b/0xf40 [btrfs] btrfs_mksubvol+0x301/0x4d0 [btrfs] btrfs_mksnapshot+0x80/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs] btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs] btrfs_ioctl+0x8a6/0x2650 [btrfs] ? kmem_cache_free+0x22/0x340 ? do_sys_openat2+0x97/0xe0 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x46/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe20abe83af RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry BTRFS info (device vdc: state EA): forced readonly BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction. BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry This happens because create_pending_snapshot() initializes the new root item as a copy of the source root item. This includes the refs field, which is 0 for a deleted subvolume. The call to btrfs_insert_root() therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then finds the root and returns -ENOENT if refs == 0, which causes create_pending_snapshot() to abort. Fix it by checking the source root's refs before attempting the snapshot, but after locking subvol_sem to avoid racing with deletion. CC: stable@vger.kernel.org # 4.14+ Reviewed-by:
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 15 Dec, 2023 1 commit
-
-
Josef Bacik authored
Our btrfs subvolume snapshot <source> <destination> utility enforces that <source> is the root of the subvolume, however this isn't enforced in the kernel. Update the kernel to also enforce this limitation to avoid problems with other users of this ioctl that don't have the appropriate checks in place. Reported-by:
Martin Michaelis <code@mgjm.de> CC: stable@vger.kernel.org # 4.14+ Reviewed-by:
Neal Gompa <neal@gompa.dev> Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 24 Nov, 2023 2 commits
-
-
David Sterba authored
When the send protocol versioning was added in 5.16 e77fbf99 ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was not updated (added by 2351f431 ("btrfs: fix send ioctl on 32bit with 64bit kernel")), missing the version struct member. The compat code is probably rarely used, nobody reported any bugs. Found by tool https://github.com/jirislaby/clang-struct . Fixes: e77fbf99 ("btrfs: send: prepare for v2 protocol") CC: stable@vger.kernel.org # 6.1+ Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Amir Goldstein authored
In vfs code, file_start_write() is usually called after the permission hook in rw_verify_area(). btrfs_ioctl_encoded_write() in an exception to this rule. Move file_start_write() to after the rw_verify_area() check in encoded write to make the permission hook "start-write-safe". This is needed for fanotify "pre content" events. Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
- 03 Nov, 2023 1 commit
-
-
Filipe Manana authored
In the tree search v2 ioctl we use the type size_t, which is an unsigned long, to track the buffer size in the local variable 'buf_size'. An unsigned long is 32 bits wide on a 32 bits architecture. The buffer size defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later try to copy the local variable 'buf_size' to the argument struct, when the search returns -EOVERFLOW, we copy only 32 bits which will be a problem on big endian systems. Fix this by using a u64 type for the buffer sizes, not only at btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain so that we can use the u64 at btrfs_ioctl_tree_search_v2(). Fixes: cc68a8a5 ("btrfs: new ioctl TREE_SEARCH_V2") Reported-by:
Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- 28 Oct, 2023 1 commit
-
-
Jan Kara authored
Convert btrfs to use bdev_open_by_path() and pass the handle around. We also drop the holder from struct btrfs_device as it is now not needed anymore. CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Acked-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
- 12 Oct, 2023 3 commits
-
-
Anand Jain authored
The device addition operation will transform the cloned temp-fsid mounted device into a multi-device filesystem. Therefore, it is marked as unsupported. Signed-off-by:
Anand Jain <anand.jain@oracle.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-