Commit e2da30c5 authored by Liu Bo, committed by Greg Kroah-Hartman

btrfs: fix reading stale metadata blocks after degraded raid1 mounts

commit 02a3307a upstream.

If a btree block, a.k.a. extent buffer, is not available in the extent
buffer cache, it is read from disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock is released before the child is read, so
commit 5bdd3536 ("Btrfs: Fix block generation verification race") used
0 as the parent transid when reading the child block.  This forces
read_tree_block() to skip checking whether the parent transid differs
from the generation of the child that it reads from disk.
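
To illustrate why a parent transid of 0 disables the check, here is a
minimal userspace sketch.  struct block_model and transid_check_model()
are hypothetical names made up for this illustration, not kernel
symbols, and the model assumes that 0 means "caller does not know the
expected generation".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a btree block header. */
struct block_model {
	uint64_t bytenr;     /* logical address of the block */
	uint64_t generation; /* transid that last wrote this block */
};

/*
 * Return true when the block passes the generation check.
 * Assumption: expected_transid == 0 means the caller does not know the
 * expected generation, so the comparison is skipped entirely.
 */
static bool transid_check_model(const struct block_model *blk,
				uint64_t expected_transid)
{
	if (expected_transid == 0)
		return true;
	return blk->generation == expected_transid;
}
```

In this model a block written in generation 5 passes when the caller
passes 0, even if the parent expects generation 9, and is rejected only
when the real expected value is supplied.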

A simple PoC is included in btrfs/124:

0. A two-disk raid1 btrfs.

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use.  After a while, device tree's
   root gets COWed, but block A hasn't been reallocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks).  Since raid1
   can pick a disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, it may read it from dev2, which does NOT have
   the latest metadata, and get a stale block A.

6. As parent transid is not checked, the stale block A is marked as
   uptodate and put into the extent buffer cache, so future searches
   won't read the disk again.  They will modify the stale block, mark
   it dirty and flush it to disk.
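
The steps above can be modelled in a few lines of userspace C.  This is
only a sketch under stated assumptions: struct mirror_copy and
read_block_model() are hypothetical names, the "pid modulo number of
mirrors" policy is a simplification of how raid1 picks a disk, the
fallback to the other mirror on a mismatch is assumed for illustration,
and the generation numbers used with it are made up.

```c
#include <stdint.h>

#define NUM_MIRRORS 2

/* Hypothetical model: one copy of "block A" per raid1 device. */
struct mirror_copy {
	uint64_t generation; /* transid that last wrote this copy */
	const char *owner;   /* which tree the copy belongs to */
};

/*
 * Pick a mirror from the task's pid (assumed policy: pid modulo the
 * number of mirrors; pid is assumed non-negative), then accept the copy
 * if expected_transid is 0 (check skipped) or matches the copy's
 * generation.  On a mismatch, try the remaining mirror.
 * Returns the index of the accepted mirror, or -1 if none matches.
 */
static int read_block_model(const struct mirror_copy copies[NUM_MIRRORS],
			    int pid, uint64_t expected_transid)
{
	for (int attempt = 0; attempt < NUM_MIRRORS; attempt++) {
		int mirror = (pid + attempt) % NUM_MIRRORS;

		if (expected_transid == 0 ||
		    copies[mirror].generation == expected_transid)
			return mirror;
	}
	return -1;
}
```

With dev1 (index 0) holding the checksum tree leaf at generation 20 and
dev2 (index 1) still holding the old device tree root at generation 1, a
task with an odd pid that passes 0 accepts the stale copy from dev2.
Passing the real generation makes the same task reject dev2 and use
dev1.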

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels: the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760).  For those kernels the fix is to replace 0 by 'gen'.

Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
parent 68dea4bd
@@ -2497,10 +2497,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	if (p->reada)
 		reada_for_search(root, p, level, slot, key->objectid);
-	btrfs_release_path(p);
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, 0);
+	tmp = read_tree_block(root, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2512,6 +2510,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 			ret = -EIO;
 		free_extent_buffer(tmp);
 	}
+	btrfs_release_path(p);
 	return ret;
 }