Commits · 2a89a3e9682b127c1978ac31eb38ef73a39a416e · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: Fix a null ptr deref in check_xattr() · 2a89a3e9

Kent Overstreet authored Jul 20, 2023

We were attempting to initialize inode hash info when no inodes were
found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2a89a3e9

bcachefs: bch2_btree_bit_mod() · 8e992c6c

Kent Overstreet authored Jul 17, 2023

New helper for bitset btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8e992c6c

bcachefs: move inode triggers to inode.c · 4dc5bb9a

Kent Overstreet authored Jul 17, 2023

bit of reorg
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4dc5bb9a

bcachefs: fsck: delete dead code · 9d8a3c95

Kent Overstreet authored Jul 17, 2023

Delete the old, now reimplemented overlapping extent check/repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

9d8a3c95

bcachefs: Make topology repair a normal recovery pass · 922bc5a0

Kent Overstreet authored Jul 16, 2023

This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

922bc5a0

bcachefs: bch2_run_explicit_recovery_pass() · ae2e13d7

Kent Overstreet authored Jul 16, 2023

This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snapshots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ae2e13d7
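
The rewind behaviour described in the two bch2_run_explicit_recovery_pass() commits above can be pictured with a small userspace model. This is only a sketch under assumptions: the pass names, `struct recovery`, `run_explicit_pass()` and `run_recovery()` are hypothetical and are not the bcachefs implementation. It shows a later pass asking for an earlier pass to be re-run, and the main loop rewinding to it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pass list; the real passes are defined by bcachefs. */
enum pass {
	PASS_CHECK_SNAPSHOTS,
	PASS_DELETE_DEAD_SNAPSHOTS,
	PASS_CHECK_INODES,
	PASS_NR,
};

struct recovery {
	unsigned curr;            /* index of the pass currently running */
	bool     rewind;          /* an earlier pass was explicitly requested */
	unsigned rewind_to;
	unsigned runs[PASS_NR];   /* how many times each pass has run */
	bool     need_cleanup;    /* illustrative condition noticed by a later pass */
};

/* Ask for pass p to run; if recovery is already past it, schedule a rewind. */
static void run_explicit_pass(struct recovery *r, unsigned p)
{
	assert(p < PASS_NR);
	if (p < r->curr) {
		r->rewind = true;
		r->rewind_to = p;
	}
}

static void run_one_pass(struct recovery *r, unsigned p)
{
	r->runs[p]++;
	/* Illustrative: the inode check finds that dead snapshot cleanup must re-run. */
	if (p == PASS_CHECK_INODES && r->need_cleanup) {
		r->need_cleanup = false;
		run_explicit_pass(r, PASS_DELETE_DEAD_SNAPSHOTS);
	}
}

/* Run all passes in order, rewinding whenever an earlier pass was requested. */
static void run_recovery(struct recovery *r)
{
	r->curr = 0;
	while (r->curr < PASS_NR) {
		run_one_pass(r, r->curr);
		if (r->rewind) {
			r->rewind = false;
			r->curr = r->rewind_to;
		} else {
			r->curr++;
		}
	}
}
```

Starting from `struct recovery r = { .need_cleanup = true };`, a call to `run_recovery(&r)` runs the first pass once and the other two passes twice, because the third pass rewinds recovery to the second.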

bcachefs: Print version, options earlier in startup path · ef1634f0

Kent Overstreet authored Jul 20, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ef1634f0

bcachefs: use prejournaled key updates for write buffer flushes · 60a5b898

Brian Foster authored Jul 19, 2023

The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has been shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of the key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

60a5b898
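
The ordering problem in the example above can be reproduced with a toy journal replay. This is a sketch under assumptions, not bcachefs code: `struct jentry`, `replay_key_present()` and the two sample journals are hypothetical, and replay is modelled as "apply entries in seq order, last one wins".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct jentry {
	unsigned long seq;   /* journal sequence number */
	int           key;
	bool          deleted;
};

/* Replay a journal sorted by seq; report whether `key` exists afterwards. */
static bool replay_key_present(const struct jentry *j, size_t nr, int key)
{
	bool present = false;
	unsigned long last_seq = 0;

	for (size_t i = 0; i < nr; i++) {
		assert(j[i].seq >= last_seq);   /* replay assumes seq order */
		last_seq = j[i].seq;
		if (j[i].key == key)
			present = !j[i].deleted;
	}
	return present;
}

/* Old scheme: the slow-path flush of buffer 0 rejournals the stale insert. */
static const struct jentry rejournaled[] = {
	{ .seq = 1, .key = 42, .deleted = false },  /* write buffer insert, index 0 */
	{ .seq = 2, .key = 42, .deleted = true  },  /* delete, via index 1 */
	{ .seq = 3, .key = 42, .deleted = false },  /* flush of index 0, journaled again */
};

/* New scheme: the flush journals nothing; it only carries seq 1 to the leaf pin. */
static const struct jentry prejournaled[] = {
	{ .seq = 1, .key = 42, .deleted = false },
	{ .seq = 2, .key = 42, .deleted = true  },
};
```

Replaying `rejournaled` leaves key 42 present even though it was deleted, while replaying `prejournaled` leaves it deleted, which is the intended result.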

bcachefs: support btree updates of prejournaled keys · eabb10dc

Brian Foster authored Jul 19, 2023

Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.

Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.

Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

eabb10dc

bcachefs: fold bch2_trans_update_by_path_trace() into callers · 78623ee0

Brian Foster authored Jul 19, 2023

There is only one other caller so eliminate some boilerplate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

78623ee0

bcachefs: remove unnecessary btree_insert_key_leaf() wrapper · a2437bba

Brian Foster authored Jul 19, 2023

This is in preparation to support prejournaled keys. We want the
ability to optionally pass a seq stored in the btree update rather
than the seq of the committing transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a2437bba

bcachefs: remove duplicate code between backpointer update paths · 2110f21e

Brian Foster authored Jul 19, 2023

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2110f21e

MAINTAINERS: add Brian Foster as a reviewer for bcachefs · f7b3e651

Brian Foster authored Jul 20, 2023

Brian has been playing with bcachefs for several months now and has
offered to commit time to patch review.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

f7b3e651

bcachefs: Suppress various error messages in no_data_io mode · 970a5096

Kent Overstreet authored Jul 16, 2023

We commonly use no_data_io mode when debugging filesystem metadata
dumps, where data checksum/compression errors are expected and
unimportant - this patch suppresses these.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

970a5096

bcachefs: Fix lookup_inode_for_snapshot() · 20e6d9a8

Kent Overstreet authored Jul 16, 2023

This fixes a use-after-free.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

20e6d9a8

bcachefs: need_snapshot_cleanup shouldn't be a fsck error · 6b20d746

Kent Overstreet authored Jul 16, 2023

We currently don't track whether snapshot cleanup still needs to finish
(aside from running a full fsck), so it shouldn't be a fsck error yet -
fsck -n after fsck has successfully completed shouldn't error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6b20d746

bcachefs: Improve key_visible_in_snapshot() · 464ee192

Kent Overstreet authored Jul 16, 2023

Delete a redundant bch2_snapshot_is_ancestor() check, and convert some
assertions to debug assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

464ee192

bcachefs: Refactor overlapping extent checks · a397b8df

Kent Overstreet authored Jul 16, 2023

Make the overlapping extent check/repair code more self contained.

This is prep work for hopefully reducing key_visible_in_snapshot() usage
here as well, and also includes a nice performance optimization to not
check ref_visible2() unless the extents potentially overlap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a397b8df
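
The optimization mentioned above, skipping the visibility check unless the extents can overlap, amounts to ordering a cheap position test before an expensive one. The following is a rough sketch only: `struct extent`, `visible_to_each_other()` (a stand-in for the snapshot visibility check, here reduced to comparing snapshot IDs) and `extents_overlap()` are hypothetical names, not the fsck code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct extent {
	uint64_t start, end;   /* half-open sector range [start, end) */
	uint32_t snapshot;
};

static unsigned visibility_checks;   /* counts calls to the expensive check */

/* Stand-in for the snapshot visibility test; purely illustrative. */
static bool visible_to_each_other(const struct extent *a, const struct extent *b)
{
	visibility_checks++;
	return a->snapshot == b->snapshot;
}

/* prev must not start after cur, as when walking extents in position order. */
static bool extents_overlap(const struct extent *prev, const struct extent *cur)
{
	assert(prev->start <= cur->start);

	if (prev->end <= cur->start)
		return false;   /* cheap test: the ranges cannot overlap */

	return visible_to_each_other(prev, cur);
}
```

Adjacent extents such as [0, 8) and [8, 16) return early without touching the visibility check; only ranges that actually intersect pay for it.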

bcachefs: check_extent(): don't use key_visible_in_snapshot() · a0076086

Kent Overstreet authored Jul 16, 2023

This changes the main part of check_extents(), that checks the extent
against the corresponding inode, to not use key_visible_in_snapshot().

key_visible_in_snapshot() has to iterate over the list of ancestor
overwrites repeatedly calling bch2_snapshot_is_ancestor(), so this is a
significant performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a0076086

bcachefs: check_extent() refactoring · 650eb16b

Kent Overstreet authored Jul 16, 2023

More prep work for reducing key_visible_in_snapshot() usage - this
rearranges how KEY_TYPE_whiteout keys are handled, so that they can be
marked off in inode_walker->inode->seen_this_pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

650eb16b

bcachefs: fsck: walk_inode() now takes is_whiteout · a57f4d61

Kent Overstreet authored Jul 16, 2023

We only want to synthesize an inode for the current snapshot ID for non
whiteouts - this refactoring lets us call walk_inode() earlier and clean
up some control flow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a57f4d61

bcachefs: Simplify check_extent() · 0d8f320d

Kent Overstreet authored Jul 13, 2023

Minor refactoring/dead code deletion, prep work for reworking
check_extent() to avoid key_visible_in_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0d8f320d

bcachefs: overlapping_extents_found() · 43b81a4e

Kent Overstreet authored Jul 13, 2023

This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if it doesn't
match.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

43b81a4e

bcachefs: fsck: inode_walker: last_pos, seen_this_pos · f9f52bc4

Kent Overstreet authored Jul 16, 2023

Prep work for changing check_extent() to avoid
key_visible_in_snapshot() - this adds the state to track whether an
inode has seen an extent at this pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

f9f52bc4

bcachefs: check_extents(): make sure to check i_sectors for last inode · 5897505e

Kent Overstreet authored Jul 16, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5897505e

bcachefs: Inline bch2_snapshot_is_ancestor() fast path · 93de9e92

Kent Overstreet authored Jul 16, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

93de9e92

bcachefs: Upgrade path fixes · 813e0cec

Kent Overstreet authored Jul 15, 2023

Some minor fixes to not print errors that are actually due to a version
upgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

813e0cec

bcachefs: is_ancestor bitmap · 6132c84c

Kent Overstreet authored Jul 13, 2023

Further optimization for bch2_snapshot_is_ancestor(). We add a small
inline bitmap to snapshot_t, which indicates which of the next 128
snapshot IDs are ancestors of the current id - eliminating the last few
iterations of the loop in bch2_snapshot_is_ancestor().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6132c84c
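
The bitmap idea above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the bcachefs code: the flat `table[]`, `snapshot_add()` and `snapshot_is_ancestor()` are hypothetical, a parent is assumed to always have a higher ID than its child, parents must be added before their children, and an ID counts as its own ancestor. Bit `n` of a node's bitmap is set when ID `id + n + 1` is one of its ancestors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SNAPSHOTS   1024
#define ANCESTOR_BITS  128

struct snapshot {
	uint32_t parent;                            /* 0 means no parent */
	uint64_t is_ancestor[ANCESTOR_BITS / 64];   /* bit n: id + n + 1 is an ancestor */
};

static struct snapshot table[NR_SNAPSHOTS];

/* Add a node; its ancestors must already be in the table. */
static void snapshot_add(uint32_t id, uint32_t parent)
{
	assert(id && id < NR_SNAPSHOTS && parent < NR_SNAPSHOTS);
	assert(!parent || parent > id);

	table[id].parent = parent;
	table[id].is_ancestor[0] = table[id].is_ancestor[1] = 0;

	/* Record every ancestor that falls within the next 128 IDs. */
	for (uint32_t a = parent; a && a - id <= ANCESTOR_BITS; a = table[a].parent) {
		uint32_t bit = a - id - 1;

		table[id].is_ancestor[bit / 64] |= UINT64_C(1) << (bit % 64);
	}
}

static bool snapshot_is_ancestor(uint32_t id, uint32_t ancestor)
{
	assert(id < NR_SNAPSHOTS && ancestor && ancestor < NR_SNAPSHOTS);

	/* Walk parent pointers only while the target is out of bitmap range. */
	while (id && id < ancestor && ancestor - id > ANCESTOR_BITS)
		id = table[id].parent;

	if (id && id < ancestor) {
		uint32_t bit = ancestor - id - 1;

		/* Within range: a single bit test replaces the rest of the walk. */
		return (table[id].is_ancestor[bit / 64] >> (bit % 64)) & 1;
	}
	return id == ancestor;
}
```

With the chain 450 -> 500 -> 800 -> 900, a query for (450, 900) follows two parent pointers to reach 800 and then answers from 800's bitmap instead of walking the final step.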

bcachefs: mark bch_inode_info and bkey_cached as reclaimable · 5eaa76d8

Mikulas Patocka authored Jul 13, 2023

Mark these caches as reclaimable, so that available memory is correctly
reported when there are a lot of cached inodes.

Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5eaa76d8

bcachefs: Compression levels · 986e9842

Kent Overstreet authored Jul 12, 2023

This allows including a compression level when specifying a compression
type, e.g.
  compression=zstd:15

Values from 1 through 15 indicate compression levels, 0 or unspecified
indicates the default.

For LZ4, values 3-15 specify that the HC algorithm should be used.

Note that for compatibility, extents themselves only include the
compression type, not the compression level. This means that specifying
the same compression algorithm but different compression levels for the
compression and background_compression options will have no effect.

XXX: perhaps we could add a warning for this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

986e9842
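
As an illustration of the `type:level` syntax above, here is a minimal parser sketch. It is not the bcachefs option parser: `parse_compression()`, `lz4_uses_hc()` and the numbering in `enum comp_type` are assumptions made for the example. The packing (type in the low 4 bits, level in the upper 4 bits) follows the description in e86e9124.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative numbering; the real values are defined by bcachefs. */
enum comp_type { COMP_NONE, COMP_LZ4, COMP_GZIP, COMP_ZSTD, COMP_NR };

static const char * const comp_names[COMP_NR] = { "none", "lz4", "gzip", "zstd" };

/*
 * Parse "type" or "type:level" into one byte:
 * low 4 bits = type, upper 4 bits = level (0 = default).
 * Returns -1 if the string is not recognised.
 */
static int parse_compression(const char *s)
{
	assert(s);

	const char *colon = strchr(s, ':');
	size_t len = colon ? (size_t)(colon - s) : strlen(s);
	unsigned long level = 0;

	if (colon) {
		char *end;

		if (colon[1] < '0' || colon[1] > '9')
			return -1;   /* empty or non-numeric level */
		level = strtoul(colon + 1, &end, 10);
		if (*end || level > 15)
			return -1;   /* levels are limited to 4 bits */
	}

	for (int t = 0; t < COMP_NR; t++)
		if (strlen(comp_names[t]) == len && !strncmp(s, comp_names[t], len))
			return (int)(level << 4) | t;

	return -1;
}

/* Per the commit message, LZ4 levels 3-15 select the HC algorithm. */
static bool lz4_uses_hc(int v)
{
	return (v & 15) == COMP_LZ4 && (v >> 4) >= 3;
}
```

For instance, `parse_compression("zstd")` yields the bare type with level 0 (the default), and `lz4_uses_hc(parse_compression("lz4:3"))` is true while the same call with `"lz4:2"` is false.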

bcachefs: Extent sb compression type fields to 8 bits · e86e9124

Kent Overstreet authored Jul 12, 2023

The upper 4 bits are for compression level.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e86e9124

bcachefs: bcachefs_format.h should be using __u64 · a5cf5a4b

Kent Overstreet authored Jul 12, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a5cf5a4b

bcachefs: fix_errors option is now a proper enum · a0f8faea

Kent Overstreet authored Jul 11, 2023

Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.

But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a0f8faea
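
The special case described above (a bare `fix_errors` with no value meaning FSCK_FIX_yes) might look roughly like this. It is a sketch only: `parse_fix_errors()` is a hypothetical helper, and the enum members other than FSCK_FIX_yes, along with their order and spelling, are assumptions rather than something stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Only FSCK_FIX_yes is named in the commit message; the rest are assumed. */
enum fsck_fix {
	FSCK_FIX_exit,
	FSCK_FIX_yes,
	FSCK_FIX_no,
	FSCK_FIX_ask,
	FSCK_FIX_NR,
};

/*
 * val is the text after "fix_errors=", or NULL/"" when no value was given.
 * Returns an enum fsck_fix value, or -1 for an unknown string.
 */
static int parse_fix_errors(const char *val)
{
	static const char * const names[] = { "exit", "yes", "no", "ask" };

	assert(sizeof(names) / sizeof(names[0]) == FSCK_FIX_NR);

	if (!val || !*val)
		return FSCK_FIX_yes;   /* matches the old bool behaviour */

	for (int i = 0; i < FSCK_FIX_NR; i++)
		if (!strcmp(val, names[i]))
			return i;

	return -1;
}
```

Keeping the empty value mapped to FSCK_FIX_yes means existing mount lines that pass a bare `fix_errors` keep working, while the other enum values become reachable by name.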

bcachefs: bch_opt_fn · 9f343e24

Kent Overstreet authored Jul 12, 2023

Minor refactoring to get rid of some unneeded token pasting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

9f343e24

bcachefs: Convert snapshot table to RCU array · 8479938d

Kent Overstreet authored Jul 12, 2023

This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8479938d

bcachefs: Add a race_fault() for write buffer slowpath · d82978ca

Kent Overstreet authored Jul 12, 2023

We haven't hooked up dynamic fault injection quite yet, but we will soon.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d82978ca

bcachefs: Add buffered IO fallback for userspace · f39d1aca

Kent Overstreet authored Jul 10, 2023

In userspace, we want to be able to switch to buffered IO when we're
dealing with an image on a filesystem/device that doesn't support the
blocksize the filesystem was formatted with.

This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be
supported by the shim version of blkdev_get_by_path() in -tools, and it
adds a fallback to disable direct IO and retry for userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

f39d1aca

bcachefs: Fallocate now checks page cache · a09818c7

Kent Overstreet authored Jul 09, 2023

Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.

But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a09818c7

bcachefs: Don't start copygc until recovery is finished · ea28c867

Kent Overstreet authored Jul 10, 2023

With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.

Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ea28c867

bcachefs: Fix build error on weird gcc · b9129136

Kent Overstreet authored Jul 10, 2023

fixes
./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b9129136