- 25 Feb, 2019 40 commits
-
-
Qu Wenruo authored
Before this patch, qgroup code traces the whole subtree of subvolume and reloc trees unconditionally.

This makes qgroup numbers consistent, but it can cause a large amount of unnecessary extent tracing, which adds a lot of overhead.

However, a subtree swap during balance just swaps two subtrees that contain the same contents and tree structure, so qgroup numbers won't change.

Only the race window between subtree swap and transaction commit can cause qgroup numbers to change.

This patch delays the qgroup subtree scan until COW happens for the subtree root.

So if there are no other operations on the fs, balance won't cause extra qgroup overhead (best case scenario). Depending on the workload, most of the subtree scans can still be avoided.

Only in the worst case scenario does it fall back to the old subtree swap overhead (scanning all swapped subtrees).

[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	Each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

And after file system population, there is no other activity, so it should be the best case scenario.

                     | v4.20-rc1         | w/ patchset    | diff
-----------------------------------------------------------------------
relocated extents    | 22615             | 22457          | -0.1%
qgroup dirty extents | 163457            | 121606         | -25.6%
time (sys)           | 22.884s           | 18.842s        | -17.6%
time (real)          | 27.724s           | 22.884s        | -17.5%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure.

The designed workflow will be:

1) Record the subtree root block that gets swapped.

   During subtree swap:
   O = Old tree blocks
   N = New tree blocks
         reloc tree                         subvolume tree X
            Root                               Root
           /    \                             /    \
         NA     OB                          OA      OB
       /  |     |  \                      /  |      |  \
     NC  ND     OE  OF                   OC  OD     OE  OF

   In this case, NA and OA are going to be swapped, so record (NA, OA) into subvolume tree X.

2) After subtree swap.
         reloc tree                         subvolume tree X
            Root                               Root
           /    \                             /    \
         OA     OB                          NA      OB
       /  |     |  \                      /  |      |  \
     OC  OD     OE  OF                   NC  ND     OE  OF

3a) COW happens for OB
    If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure.
    If it doesn't match any record, nothing will happen.

3b) COW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit.
    Then we do a subtree scan on both subtrees OA and NA, resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).

    Then no matter what we do to subvolume tree X, qgroup numbers will still be correct.
    Then NA's record gets removed from X's swapped_blocks.

4) Transaction commit
   Any record in X's swapped_blocks gets removed. Since there is no modification to the swapped subtrees, there is no need to trigger a heavy qgroup subtree rescan for them.

This will introduce 128 bytes of overhead for each btrfs_root even if qgroup is not enabled. This is to reduce memory allocations and potential failures.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
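The record, check-on-COW and clear-on-commit steps above can be modeled in userspace. The following is only a minimal sketch under stated assumptions, not the kernel implementation: the type and function names (`struct swapped_block`, `record_swap`, `take_swap_on_cow`, `clear_on_commit`) and the fixed-size array are hypothetical and chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWAPPED 16 /* arbitrary capacity for this sketch */

/* One recorded swap (hypothetical layout): the pair (NA, OA) from the workflow. */
struct swapped_block {
	uint64_t subvol_bytenr; /* block now in the subvolume tree, e.g. NA */
	uint64_t reloc_bytenr;  /* block now in the reloc tree, e.g. OA */
};

/* Per-root record set; a flat array keeps the sketch short. */
struct swapped_blocks {
	struct swapped_block blocks[MAX_SWAPPED];
	size_t count;
};

/* Step 1: remember a swapped pair. Returns false when the table is full. */
bool record_swap(struct swapped_blocks *sb, uint64_t subvol_bytenr,
		 uint64_t reloc_bytenr)
{
	if (sb->count == MAX_SWAPPED)
		return false;
	sb->blocks[sb->count].subvol_bytenr = subvol_bytenr;
	sb->blocks[sb->count].reloc_bytenr = reloc_bytenr;
	sb->count++;
	return true;
}

/*
 * Step 3: called before a subvolume tree block is COWed.
 * On a hit (3b) the record is copied to *out and removed, and the caller
 * is expected to scan both subtrees. On a miss (3a) nothing changes.
 */
bool take_swap_on_cow(struct swapped_blocks *sb, uint64_t bytenr,
		      struct swapped_block *out)
{
	for (size_t i = 0; i < sb->count; i++) {
		if (sb->blocks[i].subvol_bytenr == bytenr) {
			*out = sb->blocks[i];
			/* Fill the gap with the last record. */
			sb->blocks[i] = sb->blocks[sb->count - 1];
			sb->count--;
			return true;
		}
	}
	return false;
}

/* Step 4: at transaction commit, untouched records are simply dropped. */
void clear_on_commit(struct swapped_blocks *sb)
{
	sb->count = 0;
}
```

In this model the expensive subtree scan is only requested when `take_swap_on_cow()` returns true, which mirrors the idea that untouched swapped subtrees cost nothing at commit time.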
-
Qu Wenruo authored
Refactor btrfs_qgroup_trace_subtree_swap() into qgroup_trace_subtree_swap(), which only needs two extent buffers and a few bools to control the behavior. This provides the basis for later delayed subtree scan work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Relocation code will drop btrfs_root::reloc_root as soon as merge_reloc_root() finishes.

However, later qgroup code will need to access btrfs_root::reloc_root after merge_reloc_root() for delayed subtree rescan.

So alter the timing of resetting btrfs_root::reloc_root and make it happen after transaction commit.

With this patch, we introduce a new btrfs_root::state, BTRFS_ROOT_DEAD_RELOC_TREE, to inform some users of btrfs_root::reloc_tree that although btrfs_root::reloc_tree is still non-NULL, it is not used any more.

The lifespan of btrfs_root::reloc tree will become:

          Old behavior            |              New
------------------------------------------------------------------------
btrfs_init_reloc_root()      ---  | btrfs_init_reloc_root()      ---
  set reloc_root              |   |   set reloc_root              |
                              |   |                               |
                              |   |                               |
merge_reloc_root()            |   | merge_reloc_root()            |
|- btrfs_update_reloc_root() ---  | |- btrfs_update_reloc_root() -+-
     clear btrfs_root::reloc_root |      set ROOT_DEAD_RELOC_TREE |
                                  |      record root into dirty   |
                                  |      roots rbtree             |
                                  |                               |
                                  | reloc_block_group() Or        |
                                  | btrfs_recover_relocation()    |
                                  | | After transaction commit    |
                                  | |- clean_dirty_subvols()     ---
                                  |     clear btrfs_root::reloc_root

While ROOT_DEAD_RELOC_TREE is set, the only user of btrfs_root::reloc_tree should be qgroup.

Since the reloc root needs a longer lifespan, this patch will also delay the btrfs_drop_snapshot() call. Now btrfs_drop_snapshot() is called in clean_dirty_subvols().

This patch will increase the size of btrfs_root by 16 bytes.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
The first thing we do is loop through the list, so this

	if (!list_empty())
		btrfs_create_pending_block_groups();

thing is just wasted space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
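As a small illustration of why such a guard is redundant, here is a sketch using a hypothetical singly linked list (`struct pending_group` and `create_pending_groups()` are invented for this example and are not the btrfs structures). A function that starts by iterating the list already does nothing when the list is empty.

```c
#include <stddef.h>

/* Hypothetical list node standing in for a pending block group. */
struct pending_group {
	struct pending_group *next;
	int created;
};

/*
 * Marks every entry as created and returns how many were handled.
 * With an empty list (head == NULL) the loop body never runs, so
 * callers gain nothing from checking for emptiness first.
 */
int create_pending_groups(struct pending_group *head)
{
	int handled = 0;

	for (struct pending_group *g = head; g != NULL; g = g->next) {
		g->created = 1;
		handled++;
	}
	return handled;
}
```

Calling `create_pending_groups(NULL)` is therefore safe and returns 0 without touching anything.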
-
Josef Bacik authored
Instead of open coding this, use the helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Kernel log messages help with debugging and auditing; add them for scrub. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The workqueue name is constructed from a format string but the prefix does not need to be set by %s. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
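The point about the constant prefix can be shown with plain `snprintf()`. This is a sketch only: the two helper names, the "btrfs" prefix and the "-" separator are assumptions made for illustration, not taken from the patch itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Before (hypothetical): the constant prefix is passed through %s. */
int name_with_prefix_arg(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "%s-%s", "btrfs", name);
}

/* After (hypothetical): the prefix is part of the format string itself. */
int name_with_literal_prefix(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "btrfs-%s", name);
}
```

Both helpers produce the same string; the second one just avoids an argument that never varies.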
-
Anand Jain authored
Both btrfs_find_device() and find_device() do the same thing, except that the latter does not take the seed device into account in the device scanning context. We can merge them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Preparatory patch to add an ioctl that allows forgetting a device (i.e. the reverse of scan). Refactor btrfs_free_stale_devices() to return a status, as this function can fail if it can't find the given path (returns -ENOENT) or when trying to delete a mounted device (returns -EBUSY). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info. Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
btrfs_find_device_by_devspec() finds the device by @devid or by @device_path. This patch makes code flow easy to read by open coding the else part and renames devpath to device_path. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
btrfs_find_device_missing_or_by_path() is a relatively small function, and its only parent btrfs_find_device_by_devspec() is small as well. Besides, there are a number of find_device functions. Merge btrfs_find_device_missing_or_by_path() into its parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
In order to avoid duplicating init code for em there is an additional label, not_found_em, which is used to only set ->block_start. The only case when it will be used is if the extent we are adding overlaps with an existing extent. Make that case more obvious by:

1. Adding a comment hinting at what's going on
2. Assigning EXTENT_MAP_HOLE and directly going to insert.

No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Core btree functions in btrfs generally return 0 when an item is found, 1 in case the sought item cannot be found and <0 when an error happens. Consolidate the checks for those conditions in one 'if () {} else if () {}' construct rather than 2 separate 'if () {}' statements. This emphasizes that the handling code pertains to a single function. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
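The 0 / 1 / <0 return convention and the single 'if () {} else if () {}' construct can be sketched as follows. This is an illustration under assumptions: `lookup_key()` is a hypothetical stand-in for a btree search, and `classify_lookup()` with its enum exists only for this example.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical lookup following the convention described above:
 * 0 when the key is found, 1 when it is not, <0 on error.
 */
int lookup_key(const int *keys, size_t n, int wanted)
{
	if (keys == NULL)
		return -EINVAL;
	for (size_t i = 0; i < n; i++) {
		if (keys[i] == wanted)
			return 0;
	}
	return 1;
}

enum lookup_outcome { LOOKUP_FOUND, LOOKUP_MISSING, LOOKUP_FAILED };

/* All checks on the one return value sit in a single if/else-if chain. */
enum lookup_outcome classify_lookup(const int *keys, size_t n, int wanted)
{
	int ret = lookup_key(keys, n, wanted);

	if (ret < 0) {
		return LOOKUP_FAILED;
	} else if (ret > 0) {
		return LOOKUP_MISSING;
	}
	return LOOKUP_FOUND;
}
```

Keeping both branches in one chain makes it visible at a glance that they handle the result of the same call.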
-
Nikolay Borisov authored
found_type really holds the type of extent and is guaranteed to have a value between [0, 2]. The only time it can contain anything different is if btrfs_lookup_file_extent returned a positive value and the previous item is different than an extent. Avoid this situation by simply checking found_key.type rather than assigning the item type to found_type intermittently. Also make the variable a u8 to reduce stack usage. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Move the check that verifies if both inodes have checksums disabled or both have them enabled, from the clone and deduplication functions into the new common helper btrfs_remap_file_range_prep(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
We can never have extents marked as EXTENT_MAP_DELALLOC since this value is only ever used by btrfs_get_extent_fiemap. In this case the extent map is created by btrfs_get_extent_fiemap and is never really published, this flag is used to return the corresponding userspace one. Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC in mergable_maps. Just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If the call to btrfs_balance() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_balance() returned success or was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
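The error-overwrite problem and its fix can be modeled in userspace. This is a minimal sketch, not the kernel code: `model_copy_to_user()` is a hypothetical stand-in for copy_to_user(), and `ioctl_old()` / `ioctl_fixed()` only mimic the control flow described above.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for copy_to_user(): returns 0 on success, nonzero on failure. */
unsigned long model_copy_to_user(bool copy_fails)
{
	return copy_fails ? 1 : 0;
}

/* Old pattern: always copy, so a failed copy hides the operation's error. */
int ioctl_old(int op_ret, bool copy_fails)
{
	int ret = op_ret;

	if (model_copy_to_user(copy_fails))
		ret = -EFAULT;
	return ret;
}

/* Fixed pattern: copy only if the operation succeeded or was canceled. */
int ioctl_fixed(int op_ret, bool copy_fails)
{
	int ret = op_ret;

	if (ret == 0 || ret == -ECANCELED) {
		if (model_copy_to_user(copy_fails))
			ret = -EFAULT;
	}
	return ret;
}
```

With an operation error such as -ENOSPC and a failing copy, `ioctl_old()` reports -EFAULT while `ioctl_fixed()` still reports -ENOSPC.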
-
Filipe Manana authored
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if no error happened before or a device replace operation was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Checking if either of the inodes corresponds to a swapfile is already performed by generic_remap_file_range_prep(), so we do not need to do it in the btrfs clone and deduplication functions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Add a couple of comments regarding the logic flow in shrink_delalloc. Then, cease using max_reclaim as a temporary variable when calculating nr_pages. Finally, give max_reclaim a more fitting name, which unequivocally shows what this variable really holds. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Add a comment explaining when ->inode could be NULL and why we always perform the ->async_delalloc_pages modification. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can never trigger since before calling alloc_delalloc_work we have called igrab in start_delalloc_inodes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
ihold is supposed to be used when the caller already has a reference to the inode. In the case of cow_file_range_async this invariant holds, since the 3 call chains leading to this function all take a reference:

btrfs_writepage  <--- does igrab
 extent_write_full_page
  __extent_writepage
   writepage_delalloc
    btrfs_run_delalloc_range
     cow_file_range_async

extent_write_cache_pages <--- does igrab
 __extent_writepage (same callchain as above)

and

submit_compressed_extents <-- already called from async CoW submit path,
                              which would have done ihold.
  extent_write_locked_range
    __extent_writepage

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It's used only once so just inline the call to i_size_read. The semantics regarding the inode size are not changed, the pages in the range are locked and i_size cannot change between the time it was set and used. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
We already pass the async_cow struct that holds a reference to the inode. Exploit this fact and remove the extra inode argument. No functional changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
YueHaibing authored
Fixes gcc '-Wunused-but-set-variable' warning:

fs/btrfs/ioctl.c: In function 'btrfs_extent_same':
fs/btrfs/ioctl.c:3260:6: warning:
 variable 'num_pages' set but not used [-Wunused-but-set-variable]

It is not used any more since commit 9ee8234e6220 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
hole_len is only used if the hole falls within the requested range. Make that explicitly clear by only assigning in the corresponding branch. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Make btrfs_get_extent_fiemap a bit more friendly. The first step is to rename the closely related, yet arbitrarily named range_start/found_end/found variables. They define the delalloc range that is found in case a real extent wasn't found. Subsequently remove an unnecessary check for hole_em since it's guaranteed to be set, i.e. the check is always true. Top it off by giving all comments a refresh. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformatted a few more comments ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This function is a simple wrapper over btrfs_get_extent that returns either: a) A real extent in the passed range or b) Adjusted extent based on whether delalloc bytes are found backing up a hole. To support these semantics it doesn't need the page/pg_offset/create arguments which are passed to btrfs_get_extent in case an extent is to be created. So simplify the function by removing the unused arguments. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We are holding a transaction handle when setting an acl, so we cannot allocate the xattr value buffer using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation. Set up a nofs context instead. Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
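The idea of a scoped nofs context can be modeled in userspace as a save/restore pair around the critical section. This is a sketch under assumptions and not the kernel API: `TASK_NOFS`, `task_flags`, `model_nofs_save()`, `model_nofs_restore()` and `alloc_may_enter_fs()` are hypothetical names that only imitate the save-old-state, set-flag, restore-old-state pattern.

```c
#include <stdbool.h>

#define TASK_NOFS 0x1u /* hypothetical flag bit marking a nofs scope */

/* Models the flags of the current task (a single task in this sketch). */
static unsigned int task_flags;

/* Enter a nofs scope and return the previous state of the flag. */
unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & TASK_NOFS;

	task_flags |= TASK_NOFS;
	return old;
}

/* Leave the scope by putting the flag back to its saved state. */
void model_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~TASK_NOFS) | old;
}

/* An allocation may recurse into the filesystem only outside a nofs scope. */
bool alloc_may_enter_fs(void)
{
	return (task_flags & TASK_NOFS) == 0;
}
```

Because the previous state is saved and restored rather than blindly cleared, nested scopes behave correctly: leaving an inner scope does not end the outer one.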
-
Filipe Manana authored
We are holding a transaction handle when creating a tree, so we cannot allocate the root using GFP_KERNEL, as we could deadlock if reclaim is triggered by the allocation. Set up a nofs context instead. Fixes: 74e4d827 ("btrfs: let callers of btrfs_alloc_root pass gfp flags") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If the call to btrfs_get_dev_stats() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_get_dev_stats() returned success. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If the call to btrfs_scrub_progress() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_scrub_progress() returned success. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If scrub returned an error and then the copy_to_user() call did not succeed, we would overwrite the error returned by scrub with -EFAULT. Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned success. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Since this function is no longer a callback there is no need to have its first argument obfuscated with a void *. Change it directly to a pointer to an inode. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Julia Lawall authored
Drop LIST_HEAD where the variable it declares is never used.

The uses were removed in 3fd0a558 ("Btrfs: Metadata ENOSPC handling for balance"), but not the declaration.

The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Linus Torvalds authored
-