- 14 Mar, 2022 40 commits
-
-
Qu Wenruo authored
When btrfs_get_extent() tries to get some file extent from disk, it never populates extent_map::generation, leaving the value to be 0. On the other hand, for extent map generated by IO, it will get its generation properly set at finish_ordered_io() finish_ordered_io() |- unpin_extent_cache(gen = trans->transid) |- em->generation = gen; [CAUSE] Since extent_map::generation is mostly used by fsync code, and for fsync they only care about modified extents, which all have their em::generation > 0. Thus it's fine to not populate em read from disk for fsync. [CORNER CASE] However autodefrag also relies on em::generation to determine if one extent needs to be defragged. This unpopulated extent_map::generation can prevent the following autodefrag case from working: mkfs.btrfs -f $dev mount $dev $mnt -o autodefrag # initial write to queue the inode for autodefrag xfs_io -f -c "pwrite 0 4k" $mnt/file sync # Real fragmented write xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file sync echo "=== before autodefrag ===" xfs_io -c "fiemap -v" $mnt/file # Drop cache to force em to be read from disk echo 3 > /proc/sys/vm/drop_caches mount -o remount,commit=1 $mnt sleep 3 sync echo "=== After autodefrag ===" xfs_io -c "fiemap -v" $mnt/file umount $mnt The result looks like this: === before autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 === After autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 This fragmented 32K will not be defragged by autodefrag. [FIX] To make things less weird, just populate extent_map::generation when reading file extents from disk. This would make above fragmented extents to be properly defragged: == before autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 === After autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..63]: 26688..26751 64 0x1 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Removing or replacing an extent map requires holding a write lock on the extent map's tree. We currently do that everywhere, except in one of the self tests, where it's harmless since there's no concurrency. In order to catch possible races in the future, assert that we are holding a write lock on the extent map tree before removing or replacing an extent map in the tree, and update the self test to obtain a write lock before removing extent maps. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
After commit 92082d40 ("btrfs: integrate page status update for data read path into begin/end_page_read"), the 'nr' counter at btrfs_do_readpage() is no longer used, we increment it but we never read from it. So just remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_do_readpage(), if we get an error when trying to lookup for an extent map, we end up marking the page with the error bit, clearing the uptodate bit on it, and doing everything else that should be done. However we return success (0) to the caller, when we should return the error encoded in the extent map pointer. So fix that by returning the error encoded in the pointer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At extent_io.c, in the read page and write page code paths, we are testing if the return value from btrfs_get_extent() can be NULL. However that is not possible, as btrfs_get_extent() always returns either an error pointer or a (non-NULL) pointer to an extent map structure. Everywhere else outside extent_io.c we never check for NULL, we always treat any returned value as a non-NULL pointer if it does not encode an error. So check only for the IS_ERR() case at extent_io.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When we want to log an extent, in the fast fsync path, we obtain a path to the leaf that will hold the file extent item either through a deletion search, via btrfs_drop_extents(), or through an insertion search using btrfs_insert_empty_item(). After that we fill the file extent item's fields one by one directly on the leaf. Instead of doing that, we could prepare the file extent item before obtaining a btree path, and then copy the prepared extent item with a single operation once we get the path. This helps avoid some contention on the log tree, since we are holding write locks for longer than necessary, especially in the case where the path is obtained via btrfs_drop_extents() through a deletion search, which always keeps a write lock on the nodes at levels 1 and 2 (besides the leaf). This change does that, we prepare the file extent item that is going to be inserted before acquiring a path, and then copy it into a leaf using a single copy operation once we get a path. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The following test was run to measure the impact of the whole patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=8 FILE_SIZE=128M RUN_TIME=200 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=1 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=sync filesize=$FILE_SIZE runtime=$RUN_TIME time_based directory=$MNT numjobs=$NUM_JOBS thread EOF echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test ran inside a VM (8 cores, 32G of RAM) with the target disk mapping to a raw NVMe device, and using a non-debug kernel config (Debian's default config). Before the patchset: WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec After the patchset: WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec A 7.8% gain on throughput and +7.0% more IO done in the same period of time (200 seconds). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
There's no point in calling btrfs_release_path() after finishing the loop that logs the modified extents, since log_one_extent() returns with the path released. In case the list of extents is empty, the path is already released, so there's no need for that case as well. So just remove that unnecessary btrfs_release_path() call. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_drop_extents(), we try to replace a range of file extent items with a new file extent in a single btree search, to avoid the need to do a search for deletion, followed by a path release and followed by yet another search for insertion. When I originally added that optimization, in commit 1acae57b ("Btrfs: faster file extent item replace operations"), I left a constraint to do the fast replace only if we visited a single leaf. That was because in the most common case we find all file extent items that need to be deleted (or trimmed) in a single leaf, however it can work for other common cases like when we need to delete a few file extent items located at the end of a leaf and a few more located at the beginning of the next leaf. The key for the new file extent item is greater than the key of any deleted or trimmed file extent item from previous leaves, so we are fine to use the last leaf that we found as long as we are holding a write lock on it - even if the new key ends up at slot 0, as if that's the case, the btree search has obtained a write lock on any upper nodes that need to have a key pointer updated. So removed the constraint that limits the optimization to the case where we visited only a single leaf. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When deleting items from a leaf, we always compute the sum of the data sizes of the items that are going to be deleted. However we only use that sum when the last item to delete is behind the last item in the leaf. This unnecessarily wastes CPU time when we are deleting either the whole leaf or from some slot > 0 up to the last item in the leaf, and both of these cases are common (e.g. truncation operation, either as a result of truncate(2) or when logging inodes, deleting checksums after removing a large enough extent, etc). So compute only the sum of the data sizes if the last item to be deleted does not match the last item in the leaf. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When we delete items from a leaf, if we end up with more than two thirds of unused leaf space, we try to delete the leaf by moving all its items into its left and right neighbour leaves. Sometimes that is not possible because there is not enough free space in the left and right leaves, and in that case we end up not deleting our leaf. The way we are doing this is not ideal and can be improved in the following ways: 1) When we call push_leaf_left(), we pass a value of 1 byte to the data size parameter of push_leaf_left(). This is not realistic value because no item can have a size less than 25 bytes, which is the size of struct btrfs_item. This means that means that if the left leaf has not enough free space to push any item, we end up COWing it even if we end up not changing its content at all. COWing that leaf means allocating a new metadata extent, marking it dirty and doing more IO when committing a transaction or when syncing a log tree. For a log tree case, it's particularly more important to avoid the useless COW operation, as more IO can imply a higher latency for an fsync operation. So instead of passing 1 as the minimum data size for push_leaf_left(), pass the size of the first item in our leaf, as we don't want to COW the left leaf if we can't at least push the first item of our leaf; 2) When we call push_leaf_right(), we also pass a value of 1 byte as the data size parameter of push_leaf_right(). Like the previous case, it will also result in COWing the right leaf even if we are not able to move any items into it, since there can't be any item with a size smaller than 25 bytes (the size of struct btrfs_item). So instead of passing 1 as the minimum data size to push_leaf_right(), pass a size that corresponds to the sum of the size of all the remaining items in our leaf. We are not interested in moving less than that, because if we do, we are not able to delete our leaf and we have COWed the right leaf for nothing. Plus, moving only some of the items of our leaf, it means an even less balanced tree. Just like the previous case, we want to avoid the useless COW of the right leaf, this way we don't have to spend time allocating one new metadata extent, and doing more IO when committing a transaction or syncing a log tree. For the log tree case it's specially more important because more IO can result in a higher latency for a fsync operation. So adjust the minimum data size passed to push_leaf_left() and push_leaf_right() as mentioned above. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path Not being able to delete a leaf that became less than 1/3 full after deleting items from it is actually common. For example, for the fio test mentioned in the changelog of patch 6/6, we are only able to delete a leaf at btrfs_del_items() about 5.3% of the time, due to its left and right neighbour leaves not having enough free space to push all the remaining items into them. The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When trying to push items from a leaf into its left and right neighbours, we lock the left or right leaf, check if it has the required minimum free space, COW the leaf and then check again if it has the minimum required free space. This second check is pointless: 1) Most and foremost because it's not needed. We have a write lock on the leaf and on its parent node, so no one can come in and change either the pre-COW or post-COW version of the leaf for the whole duration of the push_leaf_left() and push_leaf_right() calls; 2) The call to btrfs_leaf_free_space() is not trivial, it has a fair amount of arithmetic operations and access to fields in the leaf's header and items, so it's not very cheap. So remove the duplicated free space checks. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
In get_extent_skip_holes() we're checking the return of btrfs_get_extent_fiemap() for an error pointer or NULL, but btrfs_get_extent_fiemap() will never return NULL, only error pointers or a valid extent_map. The other caller of btrfs_get_extent_fiemap(), find_desired_extent(), correctly only checks for error-pointers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Pankaj Raghav authored
Remove the redundant assignment to zone_info variable in btrfs_check_zoned_mode function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper for _Static_assert") has been supported by compilers for a long time (gcc 4.6, clang 3.0) and can be used in header files. We don't need to put BUILD_BUG_ON to random functions but rather keep it next to the definition. The exception here is the UAPI header btrfs_tree.h that could be potentially included by userspace code and the static assert is not defined (nor used in any other header). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Allow creating or reading block-groups on a zoned device with DUP as a meta-data profile. This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE operations for meta-data on zoned btrfs, so all writes to meta-data zones are aligned to the zone's write-pointer. Upon loading of the block-group, it is ensured both zones do have the same zone capacity and write-pointer offsets, so no extra machinery is needed to keep the write-pointers in sync for the meta-data zones. If this prerequisite is not met, loading of the block-group is refused. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Allow for a block-group to be placed on more than one physical zone. This is a preparation for allowing DUP profiles for meta-data on a zoned file-system. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Currently finishing of a zone only works if the block group isn't spanning more than one zone. This limitation is purely artificial and can be easily expanded to block groups being places across multiple zones. This is a preparation for allowing DUP and later more complex block-group profiles on zoned btrfs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Currently activation of a zone only works if the block group isn't spanning more than one zone. This limitation is purely artificial and can be easily expanded to block groups being places across multiple zones. This is a preparation for allowing DUP and later more complex block-group profiles on zoned btrfs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
With extent tree v2 you will be able to create multiple csum, extent, and free space trees. They will be used based on the block group, which will now use the block_group_item->chunk_objectid to point to the set of global roots that it will use. When allocating new block groups we'll simply mod the gigabyte offset of the block group against the number of global roots we have and that will be the block groups global id. >From there we can take the bytenr that we're modifying in the respective tree, look up the block group and get that block groups corresponding global root id. From there we can get to the appropriate global root for that bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
This code adds the on disk structures for the block group root, which will hold the block group items for extent tree v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We're going to be adding more roots that need to be loaded from the super block, so abstract out the code to read the tree_root from the super block, and use this helper for the chunk root as well. This will make it simpler to load the new trees in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
For extent tree v2 we can definitely have empty extent roots, so skip this particular check if we have that set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We cannot fall back on the slow caching for extent tree v2, which means we can't just arbitrarily clear the free space trees at mount time. Furthermore we can't do v1 space cache with extent tree v2. Simply ignore these mount options for extent tree v2 as they aren't relevant. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
When we stop tracking metadata blocks all of snapshotting will break, so disable it until I add the snapshot root and drop tree support. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Scrub depends on extent references for every block, and with extent tree v2 we won't have that, so disable scrub until we can add back the proper code to handle extent-tree-v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Backref lookups are going to be drastically different with extent tree v2, disable qgroups until we do the work to add this support for extent tree v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Device add, remove, and replace all require balance, which doesn't work right now on extent tree v2, so disable these for now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
With global root id's it makes it problematic to do backref lookups for balance. This isn't hard to deal with, but future changes are going to make it impossible to lookup backrefs on any COWonly roots, so go ahead and disable balance for now on extent tree v2 until we can add balance support back in future patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
This adds the initial definition of the EXTENT_TREE_V2 incompat feature flag. This also hides the support behind CONFIG_BTRFS_DEBUG. THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A DEVELOPER OR A TESTER. The format is in flux and will be added in stages, any fs will need to be re-made between updates to the format. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_log_inode(), we have two variables to track errors and the return value of the function, named 'ret' and 'err'. In some places we use 'ret' and if gets a non-zero value we assign its value to 'err' and then jump to the 'out' label, while in other places we use 'err' directly without 'ret' as an intermediary. This is inconsistent, error prone and not necessary. So change that to use only the 'ret' variable, making this consistent with most functions in btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During a rename or link operation, we need to determine if an inode was previously logged or not, and if it was, do some update to the logged inode. We used to rely exclusively on the logged_trans field of struct btrfs_inode to determine that, but that was not reliable because the value of that field is not persisted in the inode item, so it's lost when an inode is evicted and loaded back again. That led to several issues in the past, such as not persisting deletions (such as the case fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions")), or resulting in losing a file after an inode eviction followed by a rename (commit ecc64fab ("btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction")), besides other issues. So the inode_logged() helper was introduced and used to determine if an inode was possibly logged before in the current transaction, with the caveat that it could return false positives, in the sense that even if an inode was not logged before in the current transaction, it could still return true, but never to return false in case the inode was logged. >From a functional point of view that is fine, but from a performance perspective it can introduce significant latencies to rename and link operations, as they will end up doing inode logging even when it is not necessary. Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete. With strace it could be observed that zypper was spending about 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently. Taking into account that installation/upgrade of some of the packages needed a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. While triggering the inode evictions if something outside btrfs' control, btrfs could still behave better by eliminating the false positives from the inode_logged() helper. So change inode_logged() to actually eliminate such false positives caused by inode eviction and when an inode was never logged since the filesystem was mounted, as both cases relate to when the logged_trans field of struct btrfs_inode has a value of zero. When it can not determine if the inode was logged based only on the logged_trans value, lookup for the existence of the inode item in the log tree - if it's there then we known the inode was logged, if it's not there then it can not have been logged in the current transaction. Once we determine if the inode was logged, update the logged_trans value to avoid future calls to have to search in the log tree again. Alternatively, we could start storing logged_trans in the on disk inode item structure (struct btrfs_inode_item) in the unused space it still has, but that would be a bit odd because: 1) We only care about logged_trans since the filesystem was mounted, we don't care about its value from a previous mount. Having it persisted in the inode item structure would not make the best use of the precious unused space; 2) In order to get logged_trans persisted before inode eviction, we would have to update the delayed inode when we finish logging the inode and update its logged_trans in struct btrfs_inode, which makes it a bit cumbersome since we need to check if the delayed inode exists, if not create it and populate it and deal with any errors (-ENOMEM mostly). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible The following test script mimics part of what the zypper tool does during package installations/upgrades. It does not triggers inode evictions, but it's similar because it triggers false positives from the inode_logged() helper, because the inodes have a logged_trans of 0, there's a log tree due to a fsync of an unrelated file and the directory inode has its last_trans field set to the current transaction: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Now do some change to an unrelated file and fsync it. # This is just to create a log tree to make sure that inode_logged() # does not return false when called against "testdir". xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo # Do some change to testdir. This is to make sure inode_logged() # will return true when called against "testdir", because its # logged_trans is 0, it was changed in the current transaction # and there's a log tree. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on a box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27837 ms NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%) NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%) NUM_FILES=5000, before patchset: 9127 ms NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%) NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%) NUM_FILES=2000, before patchset: 2528 ms NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%) NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%) NUM_FILES=1000, before patchset: 1085 ms NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%) NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%) Running dbench on the same physical machine with the following script: $ cat run-dbench.sh #!/bin/bash NUM_JOBS=$(nproc --all) DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 120 $NUM_JOBS umount $MNT Before patchset: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 3761352 0.032 143.843 Close 2762770 0.002 2.273 Rename 159304 0.291 67.037 Unlink 759784 0.207 143.998 Deltree 72 4.028 15.977 Mkdir 36 0.003 0.006 Qpathinfo 3409780 0.013 9.678 Qfileinfo 596772 0.001 0.878 Qfsinfo 625189 0.003 1.245 Sfileinfo 306443 0.006 1.840 Find 13181061 0.063 19.798 WriteX 1871137 0.021 8.532 ReadX 5897325 0.003 3.567 LockX 12252 0.003 0.258 UnlockX 12252 0.002 0.100 Flush 263666 3.327 155.632 Throughput 980.047 MB/sec 12 clients 12 procs max_latency=155.636 ms After whole patchset applied: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4195584 0.033 107.742 Close 3081932 0.002 1.935 Rename 177641 0.218 14.905 Unlink 847333 0.166 107.822 Deltree 118 5.315 15.247 Mkdir 59 0.004 0.048 Qpathinfo 3802612 0.014 10.302 Qfileinfo 666748 0.001 1.034 Qfsinfo 697329 0.003 0.944 Sfileinfo 341712 0.006 2.099 Find 1470365 0.065 9.359 WriteX 2093921 0.021 8.087 ReadX 6576234 0.003 3.407 LockX 13660 0.003 0.308 UnlockX 13660 0.002 0.114 Flush 294090 2.906 115.539 Throughput 1093.11 MB/sec 12 clients 12 procs max_latency=115.544 ms +11.5% throughput -25.8% max latency rename max latency -77.8% Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During a rename, we call __btrfs_unlink_inode(), which will call btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order to remove an inode reference and a directory entry from the log. These are necessary when __btrfs_unlink_inode() is called from the unlink path, but not necessary when it's called from a rename context, because: 1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the inode reference related to the old name, because later in the rename path we call btrfs_log_new_name(), which will drop all inode references from the log and copy all inode references from the subvolume tree to the log tree. So we are doing one unnecessary btree operation which adds additional latency and lock contention in case there are other tasks accessing the log tree; 2) For the btrfs_del_dir_entries_in_log() call, we are now doing the equivalent at btrfs_log_new_name() since the previous patch in the series, that has the subject "btrfs: avoid logging all directory changes during renames". In fact, having __btrfs_unlink_inode() call this function not only adds additional latency and lock contention due to the extra btree operation, but also can make btrfs_log_new_name() unnecessarily log a range item to track the deletion of the old name, since it has no way to known that the directory entry related to the old name was previously logged and already deleted by __btrfs_unlink_inode() through its call to btrfs_del_dir_entries_in_log(). So skip those calls at __btrfs_unlink_inode() when we are doing a rename. Skipping them also allows us now to reduce the duration of time we are pinning a log transaction during renames, which is always beneficial as it's not delaying so much other tasks trying to sync the log tree, in particular we end up not holding the log transaction pinned while adding the new name (adding inode ref, directory entry, etc). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible Just like the previous patch in the series, "btrfs: avoid logging all directory changes during renames", the following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box a using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27399 ms NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%) NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%) NUM_FILES=5000, before patchset: 9241 ms NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%) NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%) NUM_FILES=2000, before patchset: 2550 ms NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%) NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%) NUM_FILES=1000, before patchset: 1088 ms NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%) NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%) The next patch in the series (5/5), also contains dbench results after applying to whole patchset. Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing a rename of a file, if the file or its old parent directory were logged before, we log the new name of the file and then make sure we log the old parent directory, to ensure that after a log replay the old name of the file is deleted and the new name added. The logging of the old parent directory can take some time, because it will scan all leaves modified in the current transaction, check which directory entries were already logged, copy the ones that were not logged before, etc. In this rename context all we need to do is make sure that the old name of the file is deleted on log replay, so instead of triggering a directory log operation, we can just delete the old directory entry from the log if it's there, or in case it isn't there, just log a range item to signal log replay that the old name must be deleted. So change btrfs_log_new_name() to do that. This scenario is actually not uncommon to trigger, and recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete, much more than usual. With strace it could be observed that zypper was spending over 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently and causing high latencies for the rename operations. Taking into account that installation/upgrade of some of these packages needed about a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure, in an efficient way, if an inode was previously logged in the current transaction, so we are pessimistic and assume it was, because in case it was we need to update the logged inode. More details on that in one of the patches in the same series (subject "btrfs: avoid inode logging during rename and link when possible"). Either way, in case the parent directory was logged before, we currently do more work then necessary during a rename, and this change minimizes that amount of work. The following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before this patch: 27399 ms NUM_FILES=10000, after this patch: 9093 ms (-66.8%) NUM_FILES=5000, before this patch: 9241 ms NUM_FILES=5000, after this patch: 4642 ms (-49.8%) NUM_FILES=2000, before this patch: 2550 ms NUM_FILES=2000, after this patch: 1788 ms (-29.9%) NUM_FILES=1000, before this patch: 1088 ms NUM_FILES=1000, after this patch: 905 ms (-16.9%) Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
In the next patch in the series, there will be the need to access the old name, and its length, of an inode when logging the inode during a rename. So instead of passing the inode to btrfs_log_new_name() pass the dentry, because from the dentry we can get the inode, the name and its length. This will avoid passing 3 new parameters to btrfs_log_new_name() in the next patch - the name, its length and an index number. This way we end up passing only 1 new parameter, the index number. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Move the code that finds and deletes a logged dir entry out of btrfs_del_dir_entries_in_log() into a helper function. This new helper function will be used by another patch in the same series, and serves to avoid having duplicated logic. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Minghao Chi authored
Return value from fs_path_add_path() directly instead of taking this in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Instead of having 2 places that short circuit the qgroup leaf scan have everything in the qgroup_rescan_leaf function. In addition to that, also ensure that the inconsistent qgroup flag is set when rescan_should_stop returns true. This both retains the old behavior when -EINTR was set in the body of the loop and at the same time also extends this behavior when scanning is interrupted due to remount or unmount operations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Jiapeng Chong authored
increment is being initialized to map->stripe_len but this is never read as increment is overwritten later on. Remove the redundant initialization. Cleans up the following clang-analyzer warning: fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Jiapeng Chong authored
to_add is being initialized to len but this is never read as to_add is overwritten later on. Remove the redundant initialization. Cleans up the following clang-analyzer warning: fs/btrfs/extent-tree.c:2769:8: warning: Value stored to 'to_add' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
The pointer to struct request_queue is used only to get device type rotating or the non-rotating. So use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-