Commits · e53d03fe39f1458065ddb5f7309ade066ba6fb95 · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: don't bump key cache journal seq on nojournal commits · e53d03fe

Brian Foster authored Mar 02, 2023

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journaled. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e53d03fe

bcachefs: When shutting down, flush btree node writes last · 83ec519a
Kent Overstreet authored Mar 07, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
83ec519a
bcachefs: Verbose on by default when CONFIG_BCACHEFS_DEBUG=y · adac06fa
Kent Overstreet authored Mar 07, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
adac06fa
fixup bcachefs: Use for_each_btree_key_upto() more consistently · db64a8e8
Kent Overstreet authored Mar 06, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
db64a8e8

six locks: be more careful about lost wakeups · 4b5b13da

Kent Overstreet authored Mar 06, 2023

This is a workaround for a lost wakeup bug we've been seeing - we still
need to discover the actual bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4b5b13da

bcachefs: Journal resize fixes · 2640faeb

Kent Overstreet authored Mar 06, 2023

 - Fix a sleeping-in-atomic bug due to calling
   bch2_journal_buckets_to_sb() under the journal lock.
 - Additionally, now we mark buckets as journal buckets before adding
   them to the journal in memory and the superblock. This ensures that
   if we crash part way through we'll never be writing to journal
   buckets that aren't marked correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2640faeb

bcachefs: bch2_btree_iter_peek_node_and_restart() · 511b629a
Kent Overstreet authored Mar 06, 2023
```
Minor refactoring for the Rust interface.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
511b629a

bcachefs: bch2_btree_node_ondisk_to_text() · b65499b7

Kent Overstreet authored Mar 06, 2023

Pulling out a helper from cmd_list.c, as the rest is being rewritten in
Rust but we're not ready to rewrite lower-level btree code in Rust.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b65499b7

bcachefs: bch2_btree_node_to_text() const correctness · a345b0f3

Kent Overstreet authored Mar 06, 2023

This is for the Rust interface - Rust cares more about const than C
does.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a345b0f3

bcachefs: Fix "btree node in stripe" error · 26bab33b
Kent Overstreet authored Mar 06, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
26bab33b
bcachefs: Kill bch2_ec_bucket_written() · 2a912a9a
Kent Overstreet authored Mar 05, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
2a912a9a
bcachefs: Improve bch2_new_stripes_to_text() · 81c771b2
Kent Overstreet authored Mar 08, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
81c771b2

bcachefs: Improved copygc pipelining · 8fcdf814

Kent Overstreet authored Feb 27, 2023

This improves copygc pipelining across multiple buckets: we now track
each in flight bucket we're evacuating, with separate moving_contexts.

This means that whereas previously we had to wait for outstanding moves
to complete to ensure we didn't try to evacuate the same bucket twice,
we can now just check buckets we want to evacuate against the pending
list.

This also means we can run the verify_bucket_evacuated() check without
killing pipelining - meaning it can now always be enabled, not just on
debug builds.

This is going to be important for the upcoming erasure coding work,
where moving IOs that are being erasure coded will now skip the initial
replication step; instead the IOs will wait on the stripe to complete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
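
A rough userspace sketch of the pending-list check, assuming a small fixed-size array in place of whatever list the real code keeps. struct bucket_id, struct in_flight_list, MAX_IN_FLIGHT and the evacuate_*() helpers are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IN_FLIGHT	8	/* arbitrary limit for the sketch */

struct bucket_id {
	uint32_t dev;
	uint64_t bucket;
};

/* Buckets whose evacuation has been started but not yet completed. */
struct in_flight_list {
	struct bucket_id b[MAX_IN_FLIGHT];
	size_t		 nr;
};

static bool bucket_in_flight(const struct in_flight_list *l, struct bucket_id id)
{
	assert(l != NULL);

	for (size_t i = 0; i < l->nr; i++)
		if (l->b[i].dev == id.dev && l->b[i].bucket == id.bucket)
			return true;
	return false;
}

/* Returns true if evacuation was started, false if the bucket was skipped. */
static bool evacuate_start(struct in_flight_list *l, struct bucket_id id)
{
	/* No need to wait for outstanding moves: just check the pending list. */
	if (bucket_in_flight(l, id) || l->nr == MAX_IN_FLIGHT)
		return false;

	l->b[l->nr++] = id;
	return true;
}

/* Called when all moves for a bucket have completed. */
static void evacuate_done(struct in_flight_list *l, struct bucket_id id)
{
	assert(l != NULL);

	for (size_t i = 0; i < l->nr; i++)
		if (l->b[i].dev == id.dev && l->b[i].bucket == id.bucket) {
			l->b[i] = l->b[--l->nr];
			return;
		}
}
```

A per-bucket check such as verify_bucket_evacuated() would naturally run at the evacuate_done() step, since that is the point where a single bucket's moves are known to be finished.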

8fcdf814

bcachefs: Free move buffers as early as possible · 0b943b97
Kent Overstreet authored Mar 05, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
0b943b97

bcachefs: Fix stripe reuse path · 5be6a274

Kent Overstreet authored Mar 05, 2023

It's possible that we reuse a stripe that doesn't have quite the same
configuration as the stripe_head we're allocating from. In that case, we
have to make sure that the new stripe uses the settings from the stripe
we reuse, not the stripe head, and make sure the buffer is allocated
correctly.

This fixes the ec_mixed_tiers test.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5be6a274

bcachefs: Drop some anonymous structs, unions · ac2ccddc

Kent Overstreet authored Mar 04, 2023

Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenience.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ac2ccddc

bcachefs: BKEY_PADDED_ONSTACK() · 45dd05b3

Kent Overstreet authored Mar 04, 2023

Rust bindgen doesn't do anonymous structs very nicely: BKEY_PADDED()
only needs the anonymous struct when it's used on the stack, to
guarantee layout, not when it's embedded in another struct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

45dd05b3

bcachefs: moving_context->stats is allowed to be NULL · 2f528663
Kent Overstreet authored Mar 04, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
2f528663

bcachefs: RESERVE_stripe · e84face6

Kent Overstreet authored Mar 02, 2023

Rework stripe creation path - new algorithm for deciding when to create
new stripes or reuse existing stripes.

We add a new allocation watermark, RESERVE_stripe, above RESERVE_none.
Then we always try to create a new stripe by doing RESERVE_stripe
allocations; if this fails, we reuse an existing stripe and allocate
buckets for it with the reserve watermark for the given write
(RESERVE_none or RESERVE_movinggc).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e84face6

bcachefs: Improve error message for stripe block sector counts wrong · d57c9add
Kent Overstreet authored Mar 03, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
d57c9add
bcachefs: More stripe create cleanup/fixes · 9d32097f
Kent Overstreet authored Mar 03, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
9d32097f
bcachefs: Plumb alloc_reserve through stripe create path · a1fb08f5
Kent Overstreet authored Mar 03, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
a1fb08f5

bcachefs: Mark stripe buckets with correct data type · 91065976

Kent Overstreet authored Mar 01, 2023

Currently, we don't use bucket data type for tracking whether buckets
are part of a stripe; parity buckets are BCH_DATA_parity, but data
buckets in a stripe are BCH_DATA_user. There's a separate counter,
buckets_ec, outside the BCH_DATA_TYPES system for tracking number of
buckets on a device that are part of a stripe.

The trouble with this approach is that it's too coarse grained, and we
need better information on fragmentation for debugging copygc.

With this patch, data buckets in a stripe are now tracked as
BCH_DATA_stripe buckets.

This doesn't yet differentiate between erasure coded and non-erasure
coded data in a stripe bucket, nor do we yet track empty data buckets in
stripes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

91065976

bcachefs: Centralize btree node lock initialization · 3329cf1b

Kent Overstreet authored Mar 03, 2023

This fixes some confusion in the lockdep code due to initializing btree
node/key cache locks with the same lockdep key, but different names.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3329cf1b

bcachefs: Plumb btree_trans through btree cache code · 1306f87d

Kent Overstreet authored Mar 02, 2023

Soon, __bch2_btree_node_write() is going to require a btree_trans: zoned
device support is going to require a new allocation for every btree node
write. This is a bit of prep work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

1306f87d

bcachefs: Improve dev_alloc_debug_to_text() · b1cfe5ed

Kent Overstreet authored Mar 02, 2023

Now we also print the number of buckets reserved for each watermark.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b1cfe5ed

bcachefs: bch2_copygc_wait_to_text() · c85d7796
Kent Overstreet authored Mar 01, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
c85d7796

bcachefs: bch2_mark_key() now takes btree_id & level · 2611a041

Kent Overstreet authored Mar 01, 2023

btree & level are passed to trans_mark - for backpointers -
bch2_mark_key() should take them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2611a041

bcachefs: bch2_write_queue() · e9020958
Kent Overstreet authored Feb 28, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
e9020958
bcachefs: ec: Improve error message for btree node in stripe · 8f2bbcdd
Kent Overstreet authored Feb 28, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
8f2bbcdd
bcachefs: bch2_open_bucket_to_text() · 2f4e9472
Kent Overstreet authored Feb 28, 2023
```
Factor out a common helper
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
2f4e9472
bcachefs: bch2_data_update_init() considers ptr durability · 11bb67a4
Kent Overstreet authored Feb 27, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
11bb67a4
bcachefs: ec: Ensure new stripe is closed in error path · a64adedb
Kent Overstreet authored Feb 27, 2023
```
This fixes a use-after-free bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
a64adedb

bcachefs: Convert constants to consts · f3a65bb9

Kent Overstreet authored Feb 27, 2023

Rust bindgen doesn't handle macros, but it does handle integer
constants: this conversion aids in implementing safe Rust wrapper
interfaces.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
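
For illustration, the shape of such a conversion might look like the following. The EXAMPLE_* names are hypothetical and are not the identifiers touched by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: the values exist only in the preprocessor. */
#define EXAMPLE_ITER_SLOTS_MACRO	(1U << 0)
#define EXAMPLE_ITER_INTENT_MACRO	(1U << 1)

/* After: typed integer constants with the same values. */
static const unsigned EXAMPLE_ITER_SLOTS	= 1U << 0;
static const unsigned EXAMPLE_ITER_INTENT	= 1U << 1;

/* Callers are unaffected by the conversion: flags are still plain integers. */
static bool example_flag_set(unsigned flags, unsigned flag)
{
	assert(flag != 0);
	return (flags & flag) != 0;
}
```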

f3a65bb9

bcachefs: bch2_btree_iter_peek_and_restart_outlined() · 0f2ea655

Kent Overstreet authored Feb 27, 2023

Needed for interfacing with Rust - bindgen can't handle inline
functions, alas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
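
The general pattern, sketched with hypothetical names (example_peek() is not a bcachefs function): an inline helper gets an out-of-line wrapper with external linkage, which gives a binding generator a real symbol to link against.

```c
#include <assert.h>
#include <stddef.h>

/* Inline helper: lives in a header, no externally visible symbol. */
static inline int example_peek(const int *keys, size_t nr, size_t pos)
{
	assert(keys != NULL);
	return pos < nr ? keys[pos] : -1;
}

/* Outlined wrapper: an ordinary function that foreign code can call. */
int example_peek_outlined(const int *keys, size_t nr, size_t pos)
{
	return example_peek(keys, nr, pos);
}
```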

0f2ea655

bcachefs: ec: zero_out_rest_of_ec_bucket() · 94bc95c4

Kent Overstreet authored Feb 26, 2023

Occasionally, we won't write to an entire bucket. This fixes the EC code
to handle this case, zeroing out the rest of the bucket as needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
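
As an in-memory sketch of the idea only: the real helper works on buckets on a device, whereas example_zero_out_rest() below is a hypothetical stand-in that operates on a plain buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero everything in the bucket past the part that was written.
 * Returns the number of bytes zeroed.
 */
static size_t example_zero_out_rest(unsigned char *bucket, size_t bucket_size,
				    size_t written)
{
	assert(bucket != NULL);

	if (written >= bucket_size)
		return 0;	/* bucket fully written, nothing to do */

	memset(bucket + written, 0, bucket_size - written);
	return bucket_size - written;
}
```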

94bc95c4

bcachefs: bch2_data_update_index_update() -> bch2_trans_run() · 039c45fe
Kent Overstreet authored Feb 26, 2023
```
Convert to use the standard helper
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
039c45fe
bcachefs: Flush write buffer as needed in backpointers repair · e07cb974
Kent Overstreet authored Feb 25, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
e07cb974

bcachefs: Fix for shared paths in write buffer flush · 747ded6d

Kent Overstreet authored Feb 26, 2023

It's possible for bch2_write_buffer_flush_one() to end up with a shared
path, if called from a context that already has a btree iterator
pointing to a key being flushed. We have to be careful when that
happens, since we can't clone a path that holds write locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

747ded6d

bcachefs: Single open_bucket_partial list · 39a1ea12
Kent Overstreet authored Feb 25, 2023
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
39a1ea12