Commit d96b3424 authored by Filipe Manana, committed by David Sterba

btrfs: make send work with concurrent block group relocation

We don't allow send and balance/relocation to run in parallel in order
to prevent send from failing or silently producing a bad stream. This is
because while send is using an extent (especially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.

The restriction between balance and send was added in commit 9e967495
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non-zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved in the following way:

1) For all tree searches, send acquires a read lock on the commit root
   semaphore;

2) After each tree search, and before releasing the commit root semaphore,
   the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
   is invoked, which operates on the leaf and writes commands to the pipe
   (or file in case send/receive is not used with a pipe). It's important
   here to not hold a lock on the commit root semaphore, because if we did
   we could deadlock when sending and receiving to the same filesystem
   using a pipe - the send task blocks on the pipe because it's full, the
   receive task, which is the only consumer of the pipe, triggers a
   transaction commit when attempting to create a subvolume or reserve
   space for a write operation for example, but the transaction commit
   blocks trying to write lock the commit root semaphore, resulting in a
   deadlock;

4) Before moving to the next key, or advancing to the next change in case
   of an incremental send, check if a transaction used for relocation was
   committed (or is about to finish its commit). If so, release the search
   path(s) and restart the search from where we were before, so that we
   don't operate on stale extent buffers. The search restarts are always
   possible because both the send and parent roots are RO, and no one can
   add, remove or update keys (change their offset) in RO trees - the
   only exception is deduplication, but that is still not allowed to run
   in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
   which means there is a transaction commit trying to write lock it, and
   release the semaphore and reschedule if there is contention, so as to
   avoid causing any significant delays to transaction commits.
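
As a rough illustration of steps 1) to 4), here is a single-threaded
userspace model, not kernel code. Every name in it (toy_rwsem, toy_fs,
toy_send and so on) is hypothetical, the "leaf" is just an array of ints,
and the relocation commit is injected at a fixed point instead of running
concurrently. It only shows the shape of the idea: search under the read
lock, keep a private clone, process the clone with the lock released, and
re-search when last_reloc_trans has changed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS 4

/* Toy, single-threaded stand-in for a rw semaphore (hypothetical). */
struct toy_rwsem { int readers; int writer; };

static void toy_down_read(struct toy_rwsem *s) { assert(!s->writer); s->readers++; }
static void toy_up_read(struct toy_rwsem *s) { assert(s->readers > 0); s->readers--; }
static void toy_down_write(struct toy_rwsem *s) { assert(!s->readers && !s->writer); s->writer = 1; }
static void toy_up_write(struct toy_rwsem *s) { assert(s->writer); s->writer = 0; }

/* Hypothetical stand-in for the few btrfs_fs_info fields involved. */
struct toy_fs {
	struct toy_rwsem commit_root_sem;
	uint64_t last_reloc_trans;	/* written under the write lock */
	int *commit_leaf;		/* NKEYS keys, may be reallocated */
};

static int toy_fs_init(struct toy_fs *fs)
{
	memset(fs, 0, sizeof(*fs));
	fs->commit_leaf = malloc(NKEYS * sizeof(int));
	if (!fs->commit_leaf)
		return -1;
	for (int i = 0; i < NKEYS; i++)
		fs->commit_leaf[i] = 10 * (i + 1);
	return 0;
}

/* Steps 1) and 2): search under the read lock, keep only a private clone. */
static void toy_search_and_clone(struct toy_fs *fs, int *clone, uint64_t *seen)
{
	toy_down_read(&fs->commit_root_sem);
	memcpy(clone, fs->commit_leaf, NKEYS * sizeof(int));
	*seen = fs->last_reloc_trans;
	toy_up_read(&fs->commit_root_sem);
}

/* Toy commit of a relocation transaction: the leaf is moved elsewhere. */
static int toy_commit_reloc(struct toy_fs *fs, uint64_t transid)
{
	int *moved = malloc(NKEYS * sizeof(int));

	if (!moved)
		return -1;
	toy_down_write(&fs->commit_root_sem);	/* would trip if send held the lock */
	memcpy(moved, fs->commit_leaf, NKEYS * sizeof(int));
	free(fs->commit_leaf);			/* the old extent is gone */
	fs->commit_leaf = moved;
	fs->last_reloc_trans = transid;
	toy_up_write(&fs->commit_root_sem);
	return 0;
}

/* Step 4): did a relocation transaction commit since the last search? */
static int toy_reloc_committed_since(struct toy_fs *fs, uint64_t seen)
{
	int changed;

	toy_down_read(&fs->commit_root_sem);
	changed = fs->last_reloc_trans != seen;
	toy_up_read(&fs->commit_root_sem);
	return changed;
}

/*
 * Toy send loop: copies every key to out[] and returns the number of search
 * restarts (or -1 on allocation failure). A relocation commit is injected
 * after the key at index reloc_at is processed; pass -1 to inject none.
 */
static int toy_send(struct toy_fs *fs, int *out, int reloc_at)
{
	int clone[NKEYS];
	uint64_t seen;
	int restarts = 0;

	toy_search_and_clone(fs, clone, &seen);
	for (int slot = 0; slot < NKEYS; slot++) {
		/* Step 3): use the clone with the semaphore released. */
		out[slot] = clone[slot];
		if (slot == reloc_at && toy_commit_reloc(fs, 100) < 0)
			return -1;
		/* Step 4): before advancing, re-search if relocation committed. */
		if (toy_reloc_committed_since(fs, seen)) {
			toy_search_and_clone(fs, clone, &seen);
			restarts++;
		}
	}
	return restarts;
}
```

In this model, toy_send(&fs, out, 1) on a freshly initialized toy_fs does
one restart and still emits all four keys, and the write lock in
toy_commit_reloc() is only obtainable because the send loop never holds the
read lock while processing. Step 5) (yielding on semaphore contention) is
not modeled here.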

This leaves some room for optimizations so that send does fewer path
releases and tree re-searches when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
parent 364be842
@@ -1509,7 +1509,6 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
 	struct btrfs_block_group *bg;
 	struct btrfs_space_info *space_info;
-	LIST_HEAD(again_list);
 
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
 		return;
@@ -1586,18 +1585,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 				div64_u64(zone_unusable * 100, bg->length));
 		trace_btrfs_reclaim_block_group(bg);
 		ret = btrfs_relocate_chunk(fs_info, bg->start);
-		if (ret && ret != -EAGAIN)
+		if (ret)
 			btrfs_err(fs_info, "error relocating chunk %llu",
 				  bg->start);
 
 next:
-		spin_lock(&fs_info->unused_bgs_lock);
-		if (ret == -EAGAIN && list_empty(&bg->bg_list))
-			list_add_tail(&bg->bg_list, &again_list);
-		else
-			btrfs_put_block_group(bg);
+		btrfs_put_block_group(bg);
+		spin_lock(&fs_info->unused_bgs_lock);
 	}
-	list_splice_tail(&again_list, &fs_info->reclaim_bgs);
 	spin_unlock(&fs_info->unused_bgs_lock);
 	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	btrfs_exclop_finish(fs_info);
@@ -1568,7 +1568,6 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
 							struct btrfs_path *p,
 							int write_lock_level)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *b;
 	int root_lock;
 	int level = 0;
@@ -1577,26 +1576,8 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
 	root_lock = BTRFS_READ_LOCK;
 
 	if (p->search_commit_root) {
-		/*
-		 * The commit roots are read only so we always do read locks,
-		 * and we always must hold the commit_root_sem when doing
-		 * searches on them, the only exception is send where we don't
-		 * want to block transaction commits for a long time, so
-		 * we need to clone the commit root in order to avoid races
-		 * with transaction commits that create a snapshot of one of
-		 * the roots used by a send operation.
-		 */
-		if (p->need_commit_sem) {
-			down_read(&fs_info->commit_root_sem);
-			b = btrfs_clone_extent_buffer(root->commit_root);
-			up_read(&fs_info->commit_root_sem);
-			if (!b)
-				return ERR_PTR(-ENOMEM);
-
-		} else {
-			b = root->commit_root;
-			atomic_inc(&b->refs);
-		}
+		b = root->commit_root;
+		atomic_inc(&b->refs);
 		level = btrfs_header_level(b);
 		/*
 		 * Ensure that all callers have set skip_locking when
@@ -1648,6 +1629,42 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
 	return b;
 }
 
+/*
+ * Replace the extent buffer at the lowest level of the path with a cloned
+ * version. The purpose is to be able to use it safely, after releasing the
+ * commit root semaphore, even if relocation is happening in parallel, the
+ * transaction used for relocation is committed and the extent buffer is
+ * reallocated in the next transaction.
+ *
+ * This is used in a context where the caller does not prevent transaction
+ * commits from happening, either by holding a transaction handle or holding
+ * some lock, while it's doing searches through a commit root.
+ * At the moment it's only used for send operations.
+ */
+static int finish_need_commit_sem_search(struct btrfs_path *path)
+{
+	const int i = path->lowest_level;
+	const int slot = path->slots[i];
+	struct extent_buffer *lowest = path->nodes[i];
+	struct extent_buffer *clone;
+
+	ASSERT(path->need_commit_sem);
+
+	if (!lowest)
+		return 0;
+
+	lockdep_assert_held_read(&lowest->fs_info->commit_root_sem);
+
+	clone = btrfs_clone_extent_buffer(lowest);
+	if (!clone)
+		return -ENOMEM;
+
+	btrfs_release_path(path);
+	path->nodes[i] = clone;
+	path->slots[i] = slot;
+
+	return 0;
+}
+
 /*
  * btrfs_search_slot - look for a key in a tree and perform necessary
@@ -1684,6 +1701,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		      const struct btrfs_key *key, struct btrfs_path *p,
 		      int ins_len, int cow)
 {
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *b;
 	int slot;
 	int ret;
@@ -1725,6 +1743,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	min_write_lock_level = write_lock_level;
 
+	if (p->need_commit_sem) {
+		ASSERT(p->search_commit_root);
+		down_read(&fs_info->commit_root_sem);
+	}
+
 again:
 	prev_cmp = -1;
 	b = btrfs_search_slot_get_root(root, p, write_lock_level);
@@ -1919,6 +1942,16 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 done:
 	if (ret < 0 && !p->skip_release_on_error)
 		btrfs_release_path(p);
+
+	if (p->need_commit_sem) {
+		int ret2;
+
+		ret2 = finish_need_commit_sem_search(p);
+		up_read(&fs_info->commit_root_sem);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret;
 }
 ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
@@ -4373,7 +4406,9 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	int level;
 	struct extent_buffer *c;
 	struct extent_buffer *next;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
+	bool need_commit_sem = false;
 	u32 nritems;
 	int ret;
 	int i;
@@ -4390,14 +4425,20 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	path->keep_locks = 1;
 
-	if (time_seq)
+	if (time_seq) {
 		ret = btrfs_search_old_slot(root, &key, path, time_seq);
-	else
+	} else {
+		if (path->need_commit_sem) {
+			path->need_commit_sem = 0;
+			need_commit_sem = true;
+			down_read(&fs_info->commit_root_sem);
+		}
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	}
 	path->keep_locks = 0;
 
 	if (ret < 0)
-		return ret;
+		goto done;
 
 	nritems = btrfs_header_nritems(path->nodes[0]);
 	/*
@@ -4520,6 +4561,15 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	ret = 0;
 done:
 	unlock_up(path, 0, 1, 0, NULL);
+	if (need_commit_sem) {
+		int ret2;
+
+		path->need_commit_sem = 1;
+		ret2 = finish_need_commit_sem_search(path);
+		up_read(&fs_info->commit_root_sem);
+		if (ret2)
+			ret = ret2;
+	}
 
 	return ret;
 }
@@ -572,7 +572,6 @@ enum {
 	/*
 	 * Indicate that relocation of a chunk has started, it's set per chunk
 	 * and is toggled between chunks.
-	 * Set, tested and cleared while holding fs_info::send_reloc_lock.
 	 */
 	BTRFS_FS_RELOC_RUNNING,
@@ -673,6 +672,12 @@ struct btrfs_fs_info {
 	u64 generation;
 	u64 last_trans_committed;
+	/*
+	 * Generation of the last transaction used for block group relocation
+	 * since the filesystem was last mounted (or 0 if none happened yet).
+	 * Must be written and read while holding btrfs_fs_info::commit_root_sem.
+	 */
+	u64 last_reloc_trans;
 	u64 avg_delayed_ref_runtime;
 
 	/*
@@ -1003,13 +1008,6 @@ struct btrfs_fs_info {
 	struct crypto_shash *csum_shash;
 
-	spinlock_t send_reloc_lock;
-	/*
-	 * Number of send operations in progress.
-	 * Updated while holding fs_info::send_reloc_lock.
-	 */
-	int send_in_progress;
-
 	/* Type of exclusive operation running, protected by super_lock */
 	enum btrfs_exclusive_operation exclusive_operation;
@@ -3023,6 +3023,7 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 	/* All successful */
 	fs_info->generation = generation;
 	fs_info->last_trans_committed = generation;
+	fs_info->last_reloc_trans = 0;
 
 	/* Always begin writing backup roots after the one being used */
 	if (backup_index < 0) {
@@ -3159,9 +3160,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->swapfile_pins_lock);
 	fs_info->swapfile_pins = RB_ROOT;
 
-	spin_lock_init(&fs_info->send_reloc_lock);
-	fs_info->send_in_progress = 0;
-
 	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
 	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
 }
@@ -3859,25 +3859,14 @@ struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
  * 0 success
  * -EINPROGRESS operation is already in progress, that's probably a bug
  * -ECANCELED cancellation request was set before the operation started
- * -EAGAIN can not start because there are ongoing send operations
  */
 static int reloc_chunk_start(struct btrfs_fs_info *fs_info)
 {
-	spin_lock(&fs_info->send_reloc_lock);
-	if (fs_info->send_in_progress) {
-		btrfs_warn_rl(fs_info,
-"cannot run relocation while send operations are in progress (%d in progress)",
-			      fs_info->send_in_progress);
-		spin_unlock(&fs_info->send_reloc_lock);
-		return -EAGAIN;
-	}
 	if (test_and_set_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags)) {
 		/* This should not happen */
-		spin_unlock(&fs_info->send_reloc_lock);
 		btrfs_err(fs_info, "reloc already running, cannot start");
 		return -EINPROGRESS;
 	}
-	spin_unlock(&fs_info->send_reloc_lock);
 
 	if (atomic_read(&fs_info->reloc_cancel_req) > 0) {
 		btrfs_info(fs_info, "chunk relocation canceled on start");
@@ -3899,9 +3888,7 @@ static void reloc_chunk_end(struct btrfs_fs_info *fs_info)
 	/* Requested after start, clear bit first so any waiters can continue */
 	if (atomic_read(&fs_info->reloc_cancel_req) > 0)
 		btrfs_info(fs_info, "chunk relocation canceled during operation");
-	spin_lock(&fs_info->send_reloc_lock);
 	clear_and_wake_up_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags);
-	spin_unlock(&fs_info->send_reloc_lock);
 	atomic_set(&fs_info->reloc_cancel_req, 0);
 }
This diff is collapsed.
@@ -169,6 +169,10 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 	ASSERT(cur_trans->state == TRANS_STATE_COMMIT_DOING);
 
 	down_write(&fs_info->commit_root_sem);
+
+	if (test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
+		fs_info->last_reloc_trans = trans->transid;
+
 	list_for_each_entry_safe(root, tmp, &cur_trans->switch_commits,
 				 dirty_list) {
 		list_del_init(&root->dirty_list);