btrfs: try to unlock parent nodes earlier when inserting a key

When inserting a new key, we release the write lock on the leaf's parent only after doing the binary search on the leaf. This is because if the key ends up at slot 0, we will have to update the key at slot 0 of the parent node. The same reasoning applies to any other upper level nodes when their slot is 0. We also need to keep the parent locked in case the leaf does not have enough free space to insert the new key/item, because in that case we will split the leaf and we will need to add a new key to the parent due to a new leaf resulting from the split operation. However if the leaf has enough space for the new key and the key does not end up at slot 0 of the leaf we could release our write lock on the parent before doing the binary search on the leaf to figure out the destination slot. That leads to reducing the amount of time other tasks are blocked waiting to lock the parent, therefore increasing parallelism when there are other tasks that are trying to access other leaves accessible through the same parent. This also applies to other upper nodes besides the immediate parent, when their slot is 0, since we keep locks on them until we figure out if the leaf slot is slot 0 or not. In fact, having the key ending at up slot 0 when is rare. Typically it only happens when the key is less than or equals to the smallest, the "left most", key of the entire btree, during a split attempt when we try to push to the right sibling leaf or when the caller just wants to update the item of an existing key. It's also very common that a leaf has enough space to insert a new key, since after a split we move about half of the keys from one into the new leaf. So unlock the parent, and any other upper level nodes, when during a key insertion we notice the key is greater then the first key in the leaf and the leaf has enough free space. After unlocking the upper level nodes, do the binary search using a low boundary of slot 1 and not slot 0, to figure out the slot where the key will be inserted (or where the key already is in case it exists and the caller wants to modify its item data). This extra comparison, with the first key, is cheap and the key is very likely already in a cache line because it immediately follows the header of the extent buffer and we have recently read the level field of the header (which in fact is the last field of the header). The following fs_mark test was run on a non-debug kernel (debian's default kernel config), with a 12 cores intel CPU, and using a NVMe device: $ cat run-fsmark.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 165273.6 5958381 0 2400000 0 190938.3 6284477 0 3600000 0 181429.1 6044059 0 4800000 0 173979.2 6223418 0 6000000 0 139288.0 6384560 0 7200000 0 163000.4 6520083 1 8400000 0 57799.2 5388544 1 9600000 0 66461.6 5552969 2 10800000 0 49593.5 5163675 2 12000000 0 57672.1 4889398 After this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 167987.3 (+1.6%) 6272730 0 2400000 0 198563.9 (+4.0%) 6048847 0 3600000 0 197436.6 (+8.8%) 6163637 0 4800000 0 202880.7 (+16.6%) 6371771 1 6000000 0 167275.9 (+20.1%) 6556733 1 7200000 0 204051.2 (+25.2%) 6817091 1 8400000 0 69622.8 (+20.5%) 5525675 1 9600000 0 69384.5 (+4.4%) 5700723 1 10800000 0 61454.1 (+23.9%) 5363754 3 12000000 0 61908.7 (+7.3%) 5370196 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: try to unlock parent nodes earlier when inserting a key
When inserting a new key, we release the write lock on the leaf's parent only after doing the binary search on the leaf. This is because if the key ends up at slot 0, we will have to update the key at slot 0 of the parent node. The same reasoning applies to any other upper level nodes when their slot is 0. We also need to keep the parent locked in case the leaf does not have enough free space to insert the new key/item, because in that case we will split the leaf and we will need to add a new key to the parent due to a new leaf resulting from the split operation. However if the leaf has enough space for the new key and the key does not end up at slot 0 of the leaf we could release our write lock on the parent before doing the binary search on the leaf to figure out the destination slot. That leads to reducing the amount of time other tasks are blocked waiting to lock the parent, therefore increasing parallelism when there are other tasks that are trying to access other leaves accessible through the same parent. This also applies to other upper nodes besides the immediate parent, when their slot is 0, since we keep locks on them until we figure out if the leaf slot is slot 0 or not. In fact, having the key ending at up slot 0 when is rare. Typically it only happens when the key is less than or equals to the smallest, the "left most", key of the entire btree, during a split attempt when we try to push to the right sibling leaf or when the caller just wants to update the item of an existing key. It's also very common that a leaf has enough space to insert a new key, since after a split we move about half of the keys from one into the new leaf. So unlock the parent, and any other upper level nodes, when during a key insertion we notice the key is greater then the first key in the leaf and the leaf has enough free space. After unlocking the upper level nodes, do the binary search using a low boundary of slot 1 and not slot 0, to figure out the slot where the key will be inserted (or where the key already is in case it exists and the caller wants to modify its item data). This extra comparison, with the first key, is cheap and the key is very likely already in a cache line because it immediately follows the header of the extent buffer and we have recently read the level field of the header (which in fact is the last field of the header). The following fs_mark test was run on a non-debug kernel (debian's default kernel config), with a 12 cores intel CPU, and using a NVMe device: $ cat run-fsmark.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 165273.6 5958381 0 2400000 0 190938.3 6284477 0 3600000 0 181429.1 6044059 0 4800000 0 173979.2 6223418 0 6000000 0 139288.0 6384560 0 7200000 0 163000.4 6520083 1 8400000 0 57799.2 5388544 1 9600000 0 66461.6 5552969 2 10800000 0 49593.5 5163675 2 12000000 0 57672.1 4889398 After this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 167987.3 (+1.6%) 6272730 0 2400000 0 198563.9 (+4.0%) 6048847 0 3600000 0 197436.6 (+8.8%) 6163637 0 4800000 0 202880.7 (+16.6%) 6371771 1 6000000 0 167275.9 (+20.1%) 6556733 1 7200000 0 204051.2 (+25.2%) 6817091 1 8400000 0 69622.8 (+20.5%) 5525675 1 9600000 0 69384.5 (+4.4%) 5700723 1 10800000 0 61454.1 (+23.9%) 5363754 3 12000000 0 61908.7 (+7.3%) 5370196 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
e2e58d0f · Filipe Manana · David Sterba · fb81212c · e2e58d0f
Commit e2e58d0f authored Dec 02, 2021 by Filipe Manana Committed by David Sterba Jan 07, 2022
Show whitespace changes
Inline Side-by-side

Showing with 118 additions and 19 deletions

fs/btrfs/ctree.c fs/btrfs/ctree.c +118 -19

No files found.
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1680,6 +1680,27 @@ static int finish_need_commit_sem_search(struct btrfs_path *path)
 	return 0;
 }

+static inline int search_for_key_slot(struct extent_buffer *eb,
+				      int search_low_slot,
+				      const struct btrfs_key *key,
+				      int prev_cmp,
+				      int *slot)
+{
+	/*
+	 * If a previous call to btrfs_bin_search() on a parent node returned an
+	 * exact match (prev_cmp == 0), we can safely assume the target key will
+	 * always be at slot 0 on lower levels, since each key pointer
+	 * (struct btrfs_key_ptr) refers to the lowest key accessible from the
+	 * subtree it points to. Thus we can skip searching lower levels.
+	 */
+	if (prev_cmp == 0) {
+		*slot = 0;
+		return 0;
+	}
+
+	return generic_bin_search(eb, search_low_slot, key, slot);
+}
+
 /*
 * btrfs_search_slot - look for a key in a tree and perform necessary
 * modifications to preserve tree invariants.
@@ -1840,25 +1861,98 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			}
 		}

+		if (level == 0) {
+			int leaf_free_space = 0;
+			int search_low_slot = 0;
+
 			/*
-		 * If btrfs_bin_search returns an exact match (prev_cmp == 0)
-		 * we can safely assume the target key will always be in slot 0
-		 * on lower levels due to the invariants BTRFS' btree provides,
-		 * namely that a btrfs_key_ptr entry always points to the
-		 * lowest key in the child node, thus we can skip searching
-		 * lower levels
+			 * If we are doing an insertion, the leaf has enough free
+			 * space and the destination slot for the key is not slot
+			 * 0, then we can unlock our write lock on the parent, and
+			 * any other upper nodes, before doing the binary search
+			 * on the leaf (with search_for_key_slot()), allowing other
+			 * tasks to lock the parent and any other upper nodes.
 			 */
-		if (prev_cmp == 0) {
-			slot = 0;
-			ret = 0;
+			if (ins_len > 0) {
+				struct btrfs_disk_key first_key;
+
+				/*
+				 * Cache the leaf free space, since we will need it
+				 * later and it will not change until then.
+				 */
+				leaf_free_space = btrfs_leaf_free_space(b);
+
+				/*
+				 * !p->locks[1] means we have a single node tree,
+				 * the leaf is the root of the tree.
+				 */
+				if (!p->locks[1] || leaf_free_space < ins_len)
+					goto leaf_search;
+
+				ASSERT(btrfs_header_nritems(b) > 0);
+				btrfs_item_key(b, &first_key, 0);
+
+				/*
+				 * Doing the extra comparison with the first key
+				 * is cheap, taking into account that the first
+				 * key is very likely already in a cache line
+				 * because it immediately follows the extent
+				 * buffer's header and we have recently accessed
+				 * the header's level field.
+				 */
+				ret = comp_keys(&first_key, key);
+				if (ret < 0) {
+					/*
+					 * The first key is smaller than the key
+					 * we want to insert, so we are safe to
+					 * unlock all upper nodes and we have to
+					 * do the binary search.
+					 *
+					 * We do use btrfs_unlock_up_safe() and
+					 * not unlock_up() because the later does
+					 * not unlock nodes with a slot of 0.
+					 * We can safely unlock any node even if
+					 * its slot is 0 since in this case the
+					 * key does not end up at slot 0 of the
+					 * leaf and there's also no need to split
+					 * the leaf.
+					 */
+					btrfs_unlock_up_safe(p, 1);
+					search_low_slot = 1;
 				} else {
-			ret = btrfs_bin_search(b, key, &slot);
-			prev_cmp = ret;
+					/*
+					 * The first key is >= then the key we
+					 * want to insert, so we can skip the
+					 * binary search as the target key will
+					 * be at slot 0.
+					 *
+					 * We can not unlock upper nodes when
+					 * the key is less than the first key,
+					 * because we will need to update the key
+					 * at slot 0 of the parent node and
+					 * possibly of other upper nodes too.
+					 * If the key matches the first key, then
+					 * we can unlock all the upper nodes,
+					 * using btrfs_unlock_up_safe() instead
+					 * of unlock_up() as stated above.
+					 */
+					if (ret == 0)
+						btrfs_unlock_up_safe(p, 1);
+					slot = 0;
+					/*
+					 * ret is already 0 or 1, matching the
+					 * result of a btrfs_bin_search() call,
+					 * so there is no need to adjust it.
+					 */
+					goto skip_leaf_search;
+				}
+			}
+leaf_search:
+			ret = search_for_key_slot(b, search_low_slot, key,
+						  prev_cmp, &slot);
 			if (ret < 0)
 				goto done;
-		}
-
-		if (level == 0) {
+skip_leaf_search:
 			p->slots[level] = slot;
 			/*
 			 * Item key already exists. In this case, if we are
@@ -1874,8 +1968,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 				ASSERT(ins_len >= sizeof(struct btrfs_item));
 				ins_len -= sizeof(struct btrfs_item);
 			}
-			if (ins_len > 0 &&
-			    btrfs_leaf_free_space(b) < ins_len) {
+			if (ins_len > 0 && leaf_free_space < ins_len) {
 				if (write_lock_level < 1) {
 					write_lock_level = 1;
 					btrfs_release_path(p);
@@ -1896,6 +1989,12 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					  min_write_lock_level, NULL);
 			goto done;
 		}
+
+		ret = search_for_key_slot(b, 0, key, prev_cmp, &slot);
+		if (ret < 0)
+			goto done;
+		prev_cmp = ret;
+
 		if (ret && slot > 0) {
 			dec = 1;
 			slot--;