Commits · 0b5485391deff35c6633d7cbff5cf3a8229fbe9e · Kirill Smelkov / linux

17 Apr, 2023 40 commits

btrfs: locking: use atomic for DREW lock writers · 0b548539

David Sterba authored Mar 01, 2023

The DREW lock uses percpu variable to track lock counters and for that
it needs to allocate the structure. In btrfs_read_tree_root() or
btrfs_init_fs_root() this may add another error case or requires the
NOFS scope protection.

One way is to preallocate the structure as was suggested in
https://lore.kernel.org/linux-btrfs/20221214021125.28289-1-robbieko@synology.com/

We may avoid the allocation altogether if we don't use the percpu
variables but an atomic for the writer counter. This should not make any
difference, the DREW lock is used for truncate and NOCOW writes along
with other IO operations.

The percpu counter for writers has been there since the original commit
8257b2dc "Btrfs: introduce btrfs_{start, end}_nocow_write() for
each subvolume". The reason could be to avoid hammering the same
cacheline from all the readers but then the writers do that anyway.
Signed-off-by: David Sterba <dsterba@suse.com>

0b548539

btrfs: remove redundant clearing of NODISCARD · ce4cf379

Anand Jain authored Mar 02, 2023

If no discard mount option is specified, including the NODISCARD option,
we make the async discard the default option then we don't have to call
the clear_opt again to clear the NODISCARD flag. Though this makes no
difference, that the call is redundant has been pointed out several
times so we better remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ce4cf379

btrfs: avoid repetitive define BTRFS_FEATURE_INCOMPAT_SUPP · 0f202b25

Anand Jain authored Mar 06, 2023

BTRFS_FEATURE_INCOMPAT_SUPP is defined twice, once under
CONFIG_BTRFS_DEBUG and once without it, resulting in repetitive code. The
reason for this is to add experimental features under CONFIG_BTRFS_DEBUG.

To avoid repetitive code, add a common list BTRFS_FEATURE_INCOMPAT_SUPP_STABLE,
and append experimental features only under CONFIG_BTRFS_DEBUG.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0f202b25

btrfs: scrub: remove root and csum_root arguments from scrub_simple_mirror() · 6b4d375a

Qu Wenruo authored Jan 16, 2023

We don't need to pass the roots as arguments, reading them from the
rb-tree is cheap.  Thus there is really not much need to pre-fetch it
and pass it all the way from scrub_stripe().

And we already have more than enough arguments in scrub_simple_mirror()
and scrub_simple_stripe(), it's better to remove them and only grab
those roots in scrub_simple_mirror().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6b4d375a

btrfs: scrub: remove unused path inside scrub_stripe() · 1d403297

Qu Wenruo authored Jan 16, 2023

The variable @path is no longer passed into any call sites after commit
18d30ab9 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56
data stripe scrub"), thus we can remove the variable completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1d403297

btrfs: do not use replace target device as an extra mirror · 5f50fa91

Qu Wenruo authored Feb 22, 2023

[BUG]
Currently btrfs can use dev-replace device as an extra mirror for
read-repair.  But it can lead to NODATASUM corruption in the following
case:

 There is a RAID1 data chunk, and dev-replace is running from
 dev2 to dev0.

 |//| = Replaced data
          X       X+1MB     X+2MB
  Dev 2:  |       |         |           <- Source dev
  Dev 0:  |///////|         |           <- Target dev

Then a read on dev 2 X+2MB happens.
And something wrong happened inside devid 2, causing an -EIO.

In that case, read-repair would try the next mirror, and since we can
use target device as an extra mirror, we will use that mirror instead.

But unfortunately since the read is beyond the current replace cursor,
we should not trust it at all, what we get would be just uninitialized
garbage.

But if this read is for NODATASUM range, then we just trust them and
cause data corruption.

[CAUSE]
We used to have some checks to make sure we only return such extra
mirror when the range is before our left cursor.

The first commit introducing this behavior is ad6d620e ("Btrfs:
allow repair code to include target disk when searching mirrors").

But later a fix, 22ab04e8 ("Btrfs: fix race between device replace
and chunk allocation") changed the behavior, to always let
btrfs_map_block() include the extra mirror to address a race in
dev-replace which can cause missing writes to target device.

This means, we lose the tracking of cursor for the extra mirror, thus
can lead to above corruption.

[FIX]
The extra mirror is never a reliable one, at the beginning of
dev-replace, the reliability is zero, while only at the end of the
replace it's a fully reliable mirror.

We either do the complex tracking, or never trust it.

IMHO it's much easier to maintain if we don't trust it at all, and the
extra mirror can only benefit for a limited period of time (during
replace).

Thus this patch would completely remove the ability to use target device
as an extra mirror.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5f50fa91

btrfs: open_ctree() error handling cleanup · 4871c33b

Qu Wenruo authored Feb 28, 2023

Currently open_ctree() still uses two variables for error handling, err
and ret. This can be confusing and missing some errors and does not
conform to current coding style.

This patch will fix the problems by:

- Use only ret for error handling

- Add proper ret assignment
  Originally we rely on the default value (-EINVAL) of err to handle
  errors, but that doesn't really reflects the error.
  This will change it use the correct error number for the following
  call sites:

  * subpage_info allocation
  * btrfs_free_extra_devids()
  * btrfs_check_rw_degradable()
  * cleaner_kthread allocation
  * transaction_kthread allocation

- Add an extra ASSERT()
  To make sure we error out instead of returning 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4871c33b

btrfs: cleanup the main loop in btrfs_lookup_bio_sums · e2eb0248

Christoph Hellwig authored Feb 22, 2023

Introduce a bio_offset variable for the current offset into the bio
instead of recalculating it over and over.   Remove the now only used
once search_len and sector_offset variables, and reduce the scope for
count and cur_disk_bytenr.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

e2eb0248

btrfs: remove search_file_offset_in_bio · 65886d2b

Christoph Hellwig authored Feb 22, 2023

There is no need to search for a file offset in a bio, it is now always
provided in bbio->file_offset (set at bio allocation time since
0d495430 ("btrfs: set bbio->file_offset in alloc_new_bio")).  Just
use that with the offset into the bio.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

65886d2b

btrfs: sink calc_bio_boundaries into its only caller · 198bd49e

Johannes Thumshirn authored Feb 21, 2023

Nowadays calc_bio_boundaries() is a relatively simple function that only
guarantees the one bio equals to one ordered extent rule for uncompressed
Zone Append bios.

Sink it into it's only caller alloc_new_bio().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

198bd49e

btrfs: simplify main loop in submit_extent_page · 24e6c808

Christoph Hellwig authored Feb 27, 2023

bio_add_page always adds either the entire range passed to it or nothing.
Based on that btrfs_bio_add_page can only return a length smaller than
the passed in one when hitting the ordered extent limit, which can only
happen for writes.  Given that compressed writes never even use this code
path, this means that all the special cases for compressed extent offset
handling are dead code.

Reflow submit_extent_page to take advantage of this by inlining
btrfs_bio_add_page and handling the ordered extent limit by decrementing
it for each added range and thus significantly simplifying the loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

24e6c808

btrfs: check for contiguity in submit_extent_page · 78a2ef1b

Christoph Hellwig authored Feb 27, 2023

Different loop iterations in btrfs_bio_add_page not only have the same
contiguity parameters, but also any non-initial operation operates on a
fresh bio anyway.

Factor out the contiguity check into a new btrfs_bio_is_contig and only
call it once in submit_extent_page before descending into the
bio_add_page loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

78a2ef1b

btrfs: simplify the error handling in __extent_writepage_io · 5380311f

Christoph Hellwig authored Feb 27, 2023

Remove the has_error and saved_ret variables, and just jump to a goto
label for error handling from the only place returning an error from the
main loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5380311f

btrfs: remove the submit_extent_page return value · 55173337

Christoph Hellwig authored Feb 27, 2023

submit_extent_page always returns 0 since commit d5e4377d ("btrfs:
split zone append bios in btrfs_submit_bio").  Change it to a void return
type and remove all the unreachable error handling code in the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

55173337

btrfs: remove the compress_type argument to submit_extent_page · f8ed4852

Christoph Hellwig authored Feb 27, 2023

Update the compress_type in the btrfs_bio_ctrl after forcing out the
previous bio in btrfs_do_readpage, so that alloc_new_bio can just use
the compress_type member in struct btrfs_bio_ctrl instead of passing the
same information redundantly as a function argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f8ed4852

btrfs: rename the this_bio_flag variable in btrfs_do_readpage · a140453b

Christoph Hellwig authored Feb 27, 2023

Rename this_bio_flag to compress_type to match the surrounding code
and better document the intent.  Also use the proper enum type instead
of unsigned long.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a140453b

btrfs: move the compress_type check out of btrfs_bio_add_page · c9bc621f

Christoph Hellwig authored Feb 27, 2023

The compress_type can only change on a per-extent basis.  So instead of
checking it for every page in btrfs_bio_add_page, do the check once in
btrfs_do_readpage, which is the only caller of btrfs_bio_add_page and
submit_extent_page that deals with compressed extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c9bc621f

btrfs: add a wbc pointer to struct btrfs_bio_ctrl · 72b505dc

Christoph Hellwig authored Feb 27, 2023

Instead of passing down the wbc pointer the deep call chain, just
add it to the btrfs_bio_ctrl structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

72b505dc

btrfs: remove the sync_io flag in struct btrfs_bio_ctrl · 794c26e2

Christoph Hellwig authored Feb 27, 2023

The sync_io flag is equivalent to wbc->sync_mode == WB_SYNC_ALL, so
just check for that and remove the separate flag.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

794c26e2

btrfs: store the bio opf in struct btrfs_bio_ctrl · c000bc04

Christoph Hellwig authored Feb 27, 2023

The bio op and flags never change over the life time of a bio_ctrl,
so move it in there instead of passing it down the deep call chain
all the way down to alloc_new_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c000bc04

btrfs: remove the force_bio_submit to submit_extent_page · eb8d0c6d

Christoph Hellwig authored Feb 27, 2023

If force_bio_submit, submit_extent_page simply calls submit_one_bio as
the first thing.  This can just be moved to the only caller that sets
force_bio_submit to true.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

eb8d0c6d

btrfs: don't set force_bio_submit in read_extent_buffer_subpage · 67998cf4

Christoph Hellwig authored Feb 27, 2023

When read_extent_buffer_subpage calls submit_extent_page, it does
so on a freshly initialized btrfs_bio_ctrl structure that can't have
a valid bio to submit.  Clear the force_bio_submit parameter to false
as there is nothing to submit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

67998cf4

btrfs: open code btrfs_bin_search() · fdf8d595

Anand Jain authored Feb 24, 2023

btrfs_bin_search() is a simple wrapper that searches for the whole slots
by calling btrfs_generic_bin_search() with the starting slot/first_slot
preset to 0.

This simple wrapper can be open coded as btrfs_bin_search().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fdf8d595

btrfs: dev-replace: properly follow its read mode · 7b31e045

Qu Wenruo authored Feb 22, 2023

[BUG]
Although dev replace ioctl has a way to specify the mode on whether we
should read from the source device, it's not properly followed.

 # mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
 # mount $dev1 $mnt
 # xfs_io -f -c "pwrite 0 32M" $mnt/file
 # sync
 # btrfs replace start -r -f 1 $dev3 $mnt

And one extra trace is added to scrub_submit(), showing the detail about
the bio:

  btrfs-11569 [005] ...  37.0270: scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384
  btrfs-11569 [005] ...  37.0273: scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768
  btrfs-11569 [005] ...  37.0274: scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152
  btrfs-11569 [005] ...  37.0274: scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768
  btrfs-11569 [005] ...  37.0275: scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536
  btrfs-11569 [005] ...  37.0281: scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072
  ...
  btrfs-11569 [005] ...  37.0762: scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072
  btrfs-11569 [005] ...  37.0762: scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072

One can see that all the reads are submitted to devid 1, even if we have
specified "-r" option to avoid reading from the source device.

[CAUSE]
The dev-replace read mode is only set but not followed by scrub code at
all.  In fact, only common read path is properly following the read
mode, but scrub itself has its own read path, thus not following the
mode.

[FIX]
Here we enhance scrub_find_good_copy() to also follow the read mode.

The idea is pretty simple, in the first loop, we avoid the following
devices:

- Missing devices
  This is the existing condition

- The source device if the replace wants to avoid it.

And if above loop found no candidate (e.g. replace a single device),
then we discard the 2nd condition, and try again.

Since we're here, also enhance the function scrub_find_good_copy() by:

- Remove the forward declaration

- Makes it return int
  To indicates errors, e.g. no good mirror found.

- Add extra error messages

Now with the same trace, "btrfs replace start -r" works as expected:

  btrfs-1213 [000] ...  991.9059: scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384
  btrfs-1213 [000] ...  991.9062: scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768
  btrfs-1213 [000] ...  991.9063: scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152
  btrfs-1213 [000] ...  991.9064: scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768
  btrfs-1213 [000] ...  991.9065: scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536
  btrfs-1213 [000] ...  991.9073: scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072
  btrfs-1213 [000] ...  991.9075: scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072
  btrfs-1213 [000] ...  991.9078: scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072
  ...
  btrfs-1213 [000] ...  991.9474: scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072
  btrfs-1213 [000] ...  991.9476: scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072
  btrfs-1213 [000] ...  991.9479: scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7b31e045

btrfs: fold finish_compressed_bio_write into btrfs_finish_compressed_write_work · f9327a70

Christoph Hellwig authored Feb 10, 2023

Fold finish_compressed_bio_write into its only caller as there is no
reason to keep them separate.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f9327a70

btrfs: don't clear page->mapping in btrfs_free_compressed_pages · a959a174

Christoph Hellwig authored Feb 10, 2023

No one ever set ->mapping on these pages, so don't bother clearing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a959a174

btrfs: factor out a btrfs_free_compressed_pages helper · 32586c5b

Christoph Hellwig authored Feb 10, 2023

Share the code to free the compressed pages and the array to hold them
into a common helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

32586c5b

btrfs: factor out a btrfs_add_compressed_bio_pages helper · 10e924bc

Christoph Hellwig authored Feb 10, 2023

Factor out a common helper to add the compressed_bio pages to the
bio that is shared by the compressed read and write path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

10e924bc

btrfs: use the bbio file offset in add_ra_bio_pages · d7294e4d

Christoph Hellwig authored Feb 10, 2023

struct btrfs_bio now has a file_offset field set up by all submitters.
Use that value combined with the bio size in add_ra_bio_pages to
calculate the last offset in the bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d7294e4d

btrfs: use the bbio file offset in btrfs_submit_compressed_read · e7aff33e

Christoph Hellwig authored Feb 10, 2023

struct btrfs_bio now has a file_offset field set up by all submitters.
Use that in btrfs_submit_compressed_read instead of recalculating the
value.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e7aff33e

btrfs: remove redundant free_extent_map in btrfs_submit_compressed_read · 798c9fc7

Christoph Hellwig authored Feb 10, 2023

em can't be non-NULL after the free_extent_map label.  Also remove
the now pointless clearing of em to NULL after freeing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

798c9fc7

btrfs: embed a btrfs_bio into struct compressed_bio · 544fe4a9

Christoph Hellwig authored Feb 10, 2023

Embed a btrfs_bio into struct compressed_bio.  This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

544fe4a9

btrfs: replace btrfs_io_context::raid_map with a fixed u64 value · 18d758a2

Qu Wenruo authored Feb 17, 2023

In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.

But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.

So why we waste the space and time (for sorting) for raid_map?

This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:

- Replace btrfs_io_context::raid_map with full_stripe_start

- Replace call sites using raid_map[0] to use full_stripe_start

- Replace call sites using raid_map[i] to compare with nr_data_stripes.

The benefits are:

- Less memory wasted on raid_map
  It's sizeof(u64) * num_stripes vs sizeof(u64).
  It'll always save at least one u64, and the benefit grows larger with
  num_stripes.

- No more weird alloc_btrfs_io_context() behavior
  As there is only one fixed size + one variable length array.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

18d758a2

btrfs: use an efficient way to represent source of duplicated stripes · 1faf3885

Qu Wenruo authored Feb 07, 2023

For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.

For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).

But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.

Currently we use a complex way using tgtdev_map[] array, e.g:

 num_tgtdevs = 1
 tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
 tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
			 and it's duplicated to stripes[3].
 tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.

But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.

Thus we can change it to a fixed array to represent the mapping:

 replace_nr_stripes = 1
 replace_stripe_src = 1    <- Means stripes[1] is involved in replace.
			      thus the extra stripe is a copy of
			      stripes[1]

By this we can save some space for bioc on RAID56 chunks with many
devices.  And we get rid of one variable sized array from bioc.

Thus the patch involves the following changes:

- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
  and @replace_stripe_src.

  @num_tgtdevs is just renamed to @replace_nr_stripes.
  While the mapping is completely changed.

- Add extra ASSERT()s for RAID56 code

- Only add two more extra stripes for dev-replace cases.
  As we have an upper limit on how many dev-replace stripes we can have.

- Unify the behavior of handle_ops_on_dev_replace()
  Previously handle_ops_on_dev_replace() go two different paths for
  WRITE and GET_READ_MIRRORS.
  Now unify them by always going the WRITE path first (with at most 2
  replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
  extra stripes, just drop one stripe.

- Remove the @real_stripes argument from alloc_btrfs_io_context()
  As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1faf3885

btrfs: reduce type width of btrfs_io_contexts · 4ced85f8

Qu Wenruo authored Feb 07, 2023

That structure is our ultimate object for all __btrfs_map_block()
related functions.  We have some hard to understand members, like
tgtdev_map, but without any comments.

This patch will improve the situation:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
  tgtdev_map[]
  Especially for the last two members, add a dedicated (thus very long)
  comments for them, with example to explain it.

- Shrink those int members to u16.
  In fact our on-disk format is only using u16 for num_stripes, thus
  no need to use int at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4ced85f8

btrfs: simplify the bioc argument for handle_ops_on_dev_replace() · be5c7edb

Qu Wenruo authored Feb 07, 2023

There is no memory re-allocation for handle_ops_on_dev_replace(), thus
we don't need to pass a btrfs_io_context pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

be5c7edb

btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 · 6ded22c1

Qu Wenruo authored Feb 17, 2023

There are quite some div64 calls inside btrfs_map_block() and its
variants.

Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.

However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.

This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.

With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.

This patch would do:

- Add a new tree-checker chunk validation on chunk length
  Make sure no chunk can reach 256G, which can also act as a bitflip
  checker.

- Reduce the width from u64 to u32 for @stripe_nr variables

- Replace unnecessary div64 calls with regular modulo and division
  32bit division and modulo are much faster than 64bit operations, and
  we are finally free of the div64 fear at least in those involved
  functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6ded22c1

btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN · a97699d1

Qu Wenruo authored Feb 17, 2023

Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.

There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.

Furthermore we can use the fix stripe length to do the following
optimization:

- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
  Now we only need to do a right shift.

  And the value of BTRFS_STRIPE_LEN itself is already too large for bit
  shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
  a compiler warning would be triggered.

  Thus this bit shift optimization would be safe.

- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a97699d1

btrfs: move all btree inode initialization into btrfs_init_btree_inode · dcb2137c

Christoph Hellwig authored Feb 19, 2023

Move the remaining code that deals with initializing the btree
inode into btrfs_init_btree_inode instead of splitting it between
that helpers and its only caller.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dcb2137c

btrfs: switch search_file_offset_in_bio to return bool · 19337f8e

Anand Jain authored Feb 11, 2023

Function search_file_offset_in_bio() finds the file offset in the
file_offset_ret, and we use the return value to indicate if it is
successful, so use bool.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

19337f8e