Commits · f2bfe7e83765f3bd84382cc75d8ac3ca619de39a · Kirill Smelkov / linux

09 Sep, 2024 17 commits

bcachefs: Rip out freelists from btree key cache · f2bfe7e8
Kent Overstreet authored Jun 08, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
f2bfe7e8
bcachefs: rcu_pending now works in userspace · d2ed0f20
Kent Overstreet authored Aug 23, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
d2ed0f20

Kent Overstreet authored Jun 10, 2024

Generic data structure for explicitly tracking pending RCU items,
allowing items to be dequeued (i.e. allocate from items pending
freeing). Works with conventional RCU and SRCU, and possibly other RCU
flavors in the future, meaning this can serve as a more generic
replacement for SLAB_TYPESAFE_BY_RCU.

Pending items are tracked in radix trees; if memory allocation fails, we
fall back to linked lists.

A rcu_pending is initialized with a callback, which is invoked when
pending items' grace periods have expired. Two types of callback
processing are handled specially:

- RCU_PENDING_KVFREE_FN

  New backend for kvfree_rcu(). Slightly faster, and eliminates the
  synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an
  rcu_head is allocated if we don't have one and can't use the radix
  tree

  TODO:
  - add a shrinker (as in the existing kvfree_rcu implementation) so that
    memory reclaim can free expired objects if callback processing isn't
    keeping up, and to expedite a grace period if we're under memory
    pressure and too much memory is stranded by RCU

  - add a counter for amount of memory pending

- RCU_PENDING_CALL_RCU_FN

  Accelerated backend for call_rcu() - pending callbacks are tracked in
  a radix tree to eliminate linked list overhead.

These are intended to serve as replacement backends for kvfree_rcu() and
call_rcu(); they may also be of interest to other users (e.g.
SLAB_TYPESAFE_BY_RCU users).

Note:

Internally, we're using a single rearming call_rcu() callback for
notifications from the core RCU subsystem when objects are ready to be
processed.

Ideally we would be getting a callback every time a grace period
completes for which we have objects, but that would require multiple
rcu_heads in flight, and since the number of gp sequence numbers with
uncompleted callbacks is not bounded, we can't do that yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8e973a4f

lib/generic-radix-tree.c: add preallocation · b3f9da79
Kent Overstreet authored Aug 10, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
b3f9da79
lib/generic-radix-tree.c: genradix_ptr_inlined() · f6594633
Kent Overstreet authored Jun 17, 2024
```
Provide an inlined fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
f6594633

bcachefs: Fix deadlock in __wait_on_freeing_inode() · 54f77024

Kent Overstreet authored Aug 16, 2024

We can't call __wait_on_freeing_inode() with btree locks held; we're
waiting on another thread that's in evict(), and before it clears that
bit it needs to write that inode to flush timestamps - deadlock.

Fixing this involves a fair amount of re-jiggering to plumb a new
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

54f77024

bcachefs: switch to rhashtable for vfs inodes hash · 112d21fd

Kent Overstreet authored Jun 08, 2024

the standard vfs inode hash table suffers from painful lock contention -
this is long overdue
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

112d21fd

inode: make __iget() a static inline · 88d2ae0e

Kent Overstreet authored Aug 08, 2024

bcachefs is switching to an rhashtable for vfs inodes instead of the
standard inode.c hashtable, so we need this exported, or - a static
inline makes more sense for a single atomic_inc().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

88d2ae0e

bcachefs: Replace div_u64 with div64_u64 where second param is u64 · 27663d77

Reed Riley authored Sep 05, 2024

Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32.  For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

27663d77

bcachefs: Fix sysfs rebalance duration waited formatting · 36f0af4f

Feiko Nanninga authored Sep 01, 2024

cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
  io wait duration:  13.5 GiB
  io wait remaining: 627 MiB
  duration waited:   1392 m

duration waited was increasing at a rate of about 14 times the expected
rate.

div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.
Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

36f0af4f
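
The truncation described in the two commits above can be reproduced in userspace. The following is a minimal sketch, not kernel code: the `model_` helpers are hypothetical stand-ins that only mimic the parameter widths of div_u64() (u32 divisor) and div64_u64() (u64 divisor), and `NSEC_PER_MIN` is an illustrative constant.

```c
#include <stdint.h>

#define NSEC_PER_MIN 60000000000ULL /* 60 * 10^9, does not fit in 32 bits */

/* Hypothetical stand-in for div_u64(): the divisor parameter is 32 bits
 * wide, so a 64-bit argument loses its upper bits at the call site. */
static uint64_t model_div_u64(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Hypothetical stand-in for div64_u64(): the divisor keeps all 64 bits. */
static uint64_t model_div64_u64(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}

/* Minutes waited, computed with the truncating helper (the bug). */
static uint64_t minutes_buggy(uint64_t nsecs)
{
	/* The explicit cast makes visible what the implicit conversion does. */
	return model_div_u64(nsecs, (uint32_t)NSEC_PER_MIN);
}

/* Minutes waited, computed with the full-width helper (the fix). */
static uint64_t minutes_fixed(uint64_t nsecs)
{
	return model_div64_u64(nsecs, NSEC_PER_MIN);
}
```

Under these assumptions the divisor 60000000000 truncates to 4165425152, so 100 minutes' worth of nanoseconds comes out as 1440 instead of 100. That is a factor of roughly 14.4, which is consistent with the "about 14 times" rate reported above.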

bcachefs: Fix negative timespecs · a3ed1cc4

Alyssa Ross authored Sep 07, 2024

This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9b ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a3ed1cc4
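
Both problems in the commit above can be shown with a small userspace model. This is a sketch under stated assumptions, not the actual bcachefs code: `struct model_timespec` and both conversion functions are hypothetical, the filesystem time is assumed to be a signed count of time units, and `int` is assumed to be 32 bits wide so that s32 * u32 is evaluated as u32.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for a kernel timespec. */
struct model_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Buggy conversion: rem is s32 and nsec_per_unit is u32, so the product
 * is computed as u32 and a negative remainder wraps around. */
static struct model_timespec to_timespec_buggy(int64_t time,
					       uint32_t units_per_sec,
					       uint32_t nsec_per_unit)
{
	struct model_timespec ts;
	int32_t rem = (int32_t)(time % units_per_sec);

	ts.tv_sec  = time / units_per_sec; /* truncates toward zero */
	ts.tv_nsec = rem * nsec_per_unit;  /* s32 * u32 = u32 */
	return ts;
}

/* One possible fix: multiply in 64-bit signed arithmetic, then
 * normalize so that 0 <= tv_nsec < NSEC_PER_SEC. */
static struct model_timespec to_timespec_fixed(int64_t time,
					       uint32_t units_per_sec,
					       uint32_t nsec_per_unit)
{
	struct model_timespec ts;
	int64_t rem = time % units_per_sec;

	ts.tv_sec  = time / units_per_sec;
	ts.tv_nsec = rem * (int64_t)nsec_per_unit;
	if (ts.tv_nsec < 0) {
		ts.tv_sec--;
		ts.tv_nsec += NSEC_PER_SEC;
	}
	return ts;
}
```

With one time unit per nanosecond, the time -14245440250000000 (that is, -14245441 s plus 750000000 ns) comes back from the buggy version as { -14245440, 4044967296 }, matching the values quoted in the commit message. The normalizing version returns { -14245441, 750000000 }.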

bcachefs: Don't delete open files in online fsck · 16005147

Kent Overstreet authored Sep 08, 2024

If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.

https://github.com/koverstreet/bcachefs/issues/727
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

16005147

bcachefs: fix btree_key_cache sysfs knob · 2c377d8a
Kent Overstreet authored Sep 05, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
2c377d8a
bcachefs: More BCH_SB_MEMBER_INVALID support · 52df04f0
Kent Overstreet authored Sep 04, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
52df04f0

bcachefs: Simplify bch2_bkey_drop_ptrs() · df88febc

Kent Overstreet authored Sep 04, 2024

bch2_bkey_drop_ptrs() had some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

df88febc

bcachefs: Add a cond_resched() to __journal_keys_sort() · ec36573d

Kent Overstreet authored Sep 05, 2024

Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ec36573d

bcachefs: Fix ca->io_ref usage · 5a6e43af

Kent Overstreet authored Sep 04, 2024

ca->io_ref does not protect against the filesystem going away,
c->write_ref does. Much like

0b50b731 bcachefs: Fix refcounting in discard path

the other async paths need fixing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5a6e43af

04 Sep, 2024 1 commit

bcachefs: BCH_SB_MEMBER_INVALID · 53f66195

Kent Overstreet authored Sep 01, 2024

Create a sentinel value for "invalid device".

This is needed for removing devices that have stripes on them (force
removing, without evacuating); we need a sentinel value for the stripe
pointers to the device being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

53f66195

01 Sep, 2024 1 commit

bcachefs: fix rebalance accounting · 7f12a963

Kent Overstreet authored Sep 01, 2024

Fixes: 49aa7830 ("bcachefs: Fix rebalance_work accounting")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f12a963

31 Aug, 2024 2 commits

bcachefs: Mark more errors as autofix · 3d3020c4

Kent Overstreet authored Aug 22, 2024

errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.

note that errors are still logged in the superblock, so we'll still know
that they happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3d3020c4

bcachefs: Revert lockless buffered IO path · e3e69409

Kent Overstreet authored Aug 31, 2024

We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined whether it's write
calls or dirty folios that are being dropped.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisecting the kernel config initially pointed the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86c ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e3e69409

27 Aug, 2024 2 commits

bcachefs: Fix bch2_extents_match() false positive · d2693569

Kent Overstreet authored Aug 26, 2024

This was caught as a very rare nonce inconsistency, on systems with
encryption and replication (and tiering, or some form of rebalance
operation running):

[Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path
[Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] k:   u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0
[Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only

bch2_extents_match() was reporting true for extents that did not
actually point to the same data.

bch2_extents_match() iterates over pairs of pointers, looking for
pointers that point to the same location on disk (with matching
generation numbers). However one or both extents may have been trimmed
(or merged) and they might not have the same disk offset: it corrects
for this by subtracting the key offset and the checksum entry offset.

However, this failed when an extent was immediately partially
overwritten, and the new overwrite was allocated the next adjacent disk
space.

Normally, with compression off, this would never cause a bug, since the
new extent would have to be immediately after the old extent for the
pointer offsets to match, and the rebalance index update path is not
looking for an extent outside the range of the extent it moved.

However with compression enabled, extents take up less space on disk
than they do in the btree index space - and spuriously matching after
partial overwrite is possible.

To fix this, add a secondary check, that strictly checks that the
regions pointed to on disk overlap.

https://github.com/koverstreet/bcachefs/issues/717
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d2693569
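
The false positive and the secondary check described in the commit above can be sketched with a simplified model. Everything here is hypothetical and for illustration only: `struct model_ptr`, its field names, and both match functions are not the bcachefs definitions, and the model leaves out the checksum entry offset, which is 0 for both keys in the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one extent pointer plus the key
 * field the comparison needs. All values are in sectors. */
struct model_ptr {
	uint32_t dev;
	uint32_t gen;
	int64_t  disk_offset;  /* start of the region on disk */
	int64_t  disk_sectors; /* size of the region on disk (c_size) */
	int64_t  key_start;    /* start of the extent in btree index space */
};

/* Original check: same device and generation, and the disk offset
 * agrees once the key's start offset is subtracted. */
static bool match_offset_only(const struct model_ptr *a,
			      const struct model_ptr *b)
{
	return a->dev == b->dev && a->gen == b->gen &&
	       a->disk_offset - a->key_start ==
	       b->disk_offset - b->key_start;
}

/* Secondary check: the two half-open on-disk ranges must intersect. */
static bool regions_overlap(const struct model_ptr *a,
			    const struct model_ptr *b)
{
	return a->disk_offset < b->disk_offset + b->disk_sectors &&
	       b->disk_offset < a->disk_offset + a->disk_sectors;
}

/* Sketch of the stricter match: both conditions must hold. */
static bool match_with_overlap(const struct model_ptr *a,
			       const struct model_ptr *b)
{
	return match_offset_only(a, b) && regions_overlap(a, b);
}
```

As an example, take the first pointers of `old` and `k` from the log above, using the in-bucket offsets 104 and 112, a c_size of 8, and assuming the key start is the key position minus its length (6376 and 6384). The offset-only comparison then reports a match, because both differences equal -6272. The on-disk ranges [104, 112) and [112, 120) are adjacent but do not overlap, so the stricter check rejects the pair.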

bcachefs: Fix failure to return error in data_update_index_update() · 66927b89

Kent Overstreet authored Aug 26, 2024

This fixes an assertion pop in io_write.c - if we don't return an error
we're supposed to have completed all the btree updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

66927b89

24 Aug, 2024 2 commits

bcachefs: Fix rebalance_work accounting · 49aa7830

Kent Overstreet authored Aug 23, 2024

rebalance_work was keying off of the presence of rebalance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available

Fixes: 20ac515a ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

49aa7830

bcachefs: Fix failure to flush moves before sleeping in copygc · d3204616

Kent Overstreet authored Aug 23, 2024

This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d3204616

22 Aug, 2024 15 commits

bcachefs: don't use rht_bucket() in btree_key_cache_scan() · a592cdf5

Kent Overstreet authored Aug 19, 2024

rht_bucket() does strange complicated things when a rehash is in
progress.

Instead, just skip scanning when a rehash is in progress: scanning is
going to be more expensive (many more empty slots to cover), and some
sort of infinite loop is being observed
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a592cdf5

bcachefs: add missing inode_walker_exit() · 3e878fe5
Kent Overstreet authored Aug 22, 2024
```
fix a small leak
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
3e878fe5

bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() · 87313ac1

Kent Overstreet authored Aug 22, 2024

bch2_btree_key_cache_drop() evicts the key cache entry - it's used when
we're doing an update that bypasses the key cache, because for cache
coherency reasons a key can't be in the key cache unless it also exists
in the btree - i.e. creates have to bypass the cache.

After evicting, the path no longer points to a key cache key, and
relock() will always fail if should_be_locked is true.

Prep for improving path->should_be_locked assertions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

87313ac1

bcachefs: Fix double assignment in check_dirent_to_subvol() · dedb2fe3

Yuesong Li authored Aug 22, 2024

ret was assigned twice in check_dirent_to_subvol(). Reported by cocci.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

dedb2fe3

bcachefs: Fix refcounting in discard path · 0b50b731

Kent Overstreet authored Aug 21, 2024

bch_dev->io_ref does not protect against the filesystem going away;
bch_fs->writes does.

Thus the filesystem write ref needs to be the last ref we release.

Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0b50b731

bcachefs: Fix compat issue with old alloc_v4 keys · 8ed823b1

Kent Overstreet authored Aug 21, 2024

we allow new fields to be added to existing key types, and new versions
should treat them as being zeroed; this was not handled in
alloc_v4_validate.

Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8ed823b1

bcachefs: Fix warning in bch2_fs_journal_stop() · 7f2de694

Kent Overstreet authored Aug 21, 2024

j->last_empty_seq needs to match j->seq when the journal is empty

Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f2de694

fs/super.c: improve get_tree() error message · 06f67437

Kent Overstreet authored Aug 21, 2024

seeing an odd bug where we fail to correctly return an error from
.get_tree():

https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66

we need to be able to distinguish between accidentally returning a
positive error (as implied by the log) and no error.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

06f67437

bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() · bdbdd475

Kent Overstreet authored Aug 21, 2024

Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bdbdd475

bcachefs: Fix replay_now_at() assert · cab18be6

Kent Overstreet authored Aug 21, 2024

Journal replay, in the slowpath where we insert keys in journal order,
was inserting keys in the wrong order; keys from early repair come last.

Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

cab18be6

bcachefs: Fix locking in bch2_ioc_setlabel() · 6575b8c9

Kent Overstreet authored Aug 20, 2024

Fixes: 7a254053 ("bcachefs: support FS_IOC_SETFSLABEL")
Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6575b8c9

bcachefs: fix failure to relock in btree_node_fill() · 5dbfc4ef
Kent Overstreet authored Aug 20, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
5dbfc4ef

bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() · 3c5d0b72

Kent Overstreet authored Aug 19, 2024

We weren't always so strict about trans->locked state - but now we are,
and new assertions are shaking some bugs out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3c5d0b72

bcachefs: unlock_long() before resort in journal replay · 1dceae4c

Kent Overstreet authored Aug 20, 2024

Fix another SRCU splat - this one pretty harmless.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

1dceae4c

bcachefs: fix missing bch2_err_str() · cecc3282
Kent Overstreet authored Aug 20, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
cecc3282