Commits · 27663d7784b5dfd354a968e06b26452dc93f2a16 · Kirill Smelkov / linux

09 Sep, 2024 9 commits

bcachefs: Replace div_u64 with div64_u64 where second param is u64 · 27663d77

Reed Riley authored Sep 05, 2024

Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32.  For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

27663d77

bcachefs: Fix sysfs rebalance duration waited formatting · 36f0af4f

Feiko Nanninga authored Sep 01, 2024

cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
  io wait duration:  13.5 GiB
  io wait remaining: 627 MiB
  duration waited:   1392 m

duration waited was increasing at a rate of about 14 times the expected
rate.

div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.
Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

36f0af4f
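
The truncation described in the two commits above can be reproduced in userspace. This is a minimal sketch, not the kernel code: `sketch_div_u64()` and `sketch_div64_u64()` are hypothetical stand-ins that only mirror the parameter widths named in the commit messages (u32 versus u64 divisor), and the constant is the number of nanoseconds in a minute.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds in one minute: 60 * 10^9, which needs more than 32 bits. */
#define SKETCH_NSEC_PER_MINUTE UINT64_C(60000000000)

static_assert(SKETCH_NSEC_PER_MINUTE > UINT32_MAX,
              "the demo divisor must not fit in a u32");

/* Hypothetical stand-in for a helper whose divisor parameter is u32. */
static uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Hypothetical stand-in for a helper whose divisor parameter is u64. */
static uint64_t sketch_div64_u64(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}

/* Minutes computed through the narrow helper. The explicit cast spells out
 * the narrowing that otherwise happens silently at the call site:
 * 60000000000 mod 2^32 = 4165425152. */
uint64_t sketch_minutes_truncated(uint64_t nsecs)
{
    return sketch_div_u64(nsecs, (uint32_t)SKETCH_NSEC_PER_MINUTE);
}

/* Minutes computed with the full 64-bit divisor. */
uint64_t sketch_minutes_correct(uint64_t nsecs)
{
    return sketch_div64_u64(nsecs, SKETCH_NSEC_PER_MINUTE);
}
```

For 600000000000 ns (ten minutes), the truncated variant divides by 4165425152 instead of 60000000000, a divisor about 14.4 times too small, which is consistent with the "about 14 times" drift reported above.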

bcachefs: Fix negative timespecs · a3ed1cc4

Alyssa Ross authored Sep 07, 2024

This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9b ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a3ed1cc4
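
Both problems listed in this commit message can be shown with a small userspace sketch. It is not the kernel implementation: `struct sketch_timespec` and both functions are hypothetical, the time value is taken to be in units of `nsec_per_unit` nanoseconds, and the sketch assumes a 32-bit `int` so that `int32_t * uint32_t` is evaluated as `uint32_t`.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int) == 4, "the sketch assumes a 32-bit int");

#define SKETCH_NSEC_PER_SEC INT64_C(1000000000)

/* Hypothetical stand-in for a kernel timespec; both fields are 64-bit here. */
struct sketch_timespec {
    int64_t tv_sec;
    int64_t tv_nsec;
};

/* Buggy variant: the signed remainder is multiplied as unsigned, and the
 * result is never normalized into the range [0, 10^9). */
struct sketch_timespec sketch_to_timespec_buggy(int64_t time, uint32_t nsec_per_unit)
{
    int64_t units_per_sec = SKETCH_NSEC_PER_SEC / nsec_per_unit;
    int32_t rem = (int32_t)(time % units_per_sec); /* negative for negative times */
    struct sketch_timespec ts;

    ts.tv_sec = time / units_per_sec;  /* C division truncates toward zero */
    ts.tv_nsec = rem * nsec_per_unit;  /* s32 * u32 is computed as u32 */
    return ts;
}

/* Fixed variant: multiply in a signed 64-bit type, then normalize. */
struct sketch_timespec sketch_to_timespec_fixed(int64_t time, uint32_t nsec_per_unit)
{
    int64_t units_per_sec = SKETCH_NSEC_PER_SEC / nsec_per_unit;
    int64_t sec = time / units_per_sec;
    int64_t nsec = (time % units_per_sec) * (int64_t)nsec_per_unit;

    if (nsec < 0) {                    /* borrow one second */
        sec -= 1;
        nsec += SKETCH_NSEC_PER_SEC;
    }
    return (struct sketch_timespec){ .tv_sec = sec, .tv_nsec = nsec };
}
```

With a unit of 1 ns, the time -14245440250000000 corresponds to the example in the message: the buggy variant yields `{ -14245440, 4044967296 }`, while the fixed variant yields `{ -14245441, 750000000 }`.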

bcachefs: Don't delete open files in online fsck · 16005147

Kent Overstreet authored Sep 08, 2024

If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.

https://github.com/koverstreet/bcachefs/issues/727
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

16005147

bcachefs: fix btree_key_cache sysfs knob · 2c377d8a

Kent Overstreet authored Sep 05, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2c377d8a

bcachefs: More BCH_SB_MEMBER_INVALID support · 52df04f0

Kent Overstreet authored Sep 04, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

52df04f0

bcachefs: Simplify bch2_bkey_drop_ptrs() · df88febc

Kent Overstreet authored Sep 04, 2024

bch2_bkey_drop_ptrs() had some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

df88febc

bcachefs: Add a cond_resched() to __journal_keys_sort() · ec36573d

Kent Overstreet authored Sep 05, 2024

Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ec36573d

bcachefs: Fix ca->io_ref usage · 5a6e43af

Kent Overstreet authored Sep 04, 2024

ca->io_ref does not protect against the filesystem going away,
c->write_ref does. Much like

0b50b731 bcachefs: Fix refcounting in discard path

the other async paths need fixing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5a6e43af

04 Sep, 2024 1 commit

bcachefs: BCH_SB_MEMBER_INVALID · 53f66195

Kent Overstreet authored Sep 01, 2024

Create a sentinel value for "invalid device".

This is needed for removing devices that have stripes on them (force
removing, without evacuating); we need a sentinel value for the stripe
pointers to the device being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

53f66195

01 Sep, 2024 1 commit

bcachefs: fix rebalance accounting · 7f12a963

Kent Overstreet authored Sep 01, 2024

Fixes: 49aa7830 ("bcachefs: Fix rebalance_work accounting")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f12a963

31 Aug, 2024 2 commits

bcachefs: Mark more errors as autofix · 3d3020c4

Kent Overstreet authored Aug 22, 2024

errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.

note that errors are still logged in the superblock, so we'll still know
that they happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3d3020c4

bcachefs: Revert lockless buffered IO path · e3e69409

Kent Overstreet authored Aug 31, 2024

We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined whether it's write
calls or dirty folios that are being dropped.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisecting the kernel config initially pointed the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86c ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e3e69409

27 Aug, 2024 2 commits

bcachefs: Fix bch2_extents_match() false positive · d2693569

Kent Overstreet authored Aug 26, 2024

This was caught as a very rare nonce inconsistency, on systems with
encryption and replication (and tiering, or some form of rebalance
operation running):

[Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path
[Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] k:   u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0
[Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only

bch2_extents_match() was reporting true for extents that did not
actually point to the same data.

bch2_extents_match() iterates over pairs of pointers, looking for
pointers that point to the same location on disk (with matching
generation numbers). However one or both extents may have been trimmed
(or merged) and they might not have the same disk offset: it corrects
for this by subtracting the key offset and the checksum entry offset.

However, this failed when an extent was immediately partially
overwritten, and the new overwrite was allocated the next adjacent disk
space.

Normally, with compression off, this would never cause a bug, since the
new extent would have to be immediately after the old extent for the
pointer offsets to match, and the rebalance index update path is not
looking for an extent outside the range of the extent it moved.

However with compression enabled, extents take up less space on disk
than they do in the btree index space - and spuriously matching after
partial overwrite is possible.

To fix this, add a secondary check, that strictly checks that the
regions pointed to on disk overlap.

https://github.com/koverstreet/bcachefs/issues/717
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d2693569
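
The "secondary check" described above, that the regions pointed to on disk actually overlap, amounts to an interval-intersection test. The following is only a sketch of that idea and not the code added to bch2_extents_match(): `struct sketch_disk_region` and `sketch_regions_overlap()` are hypothetical names, and regions are modeled as half-open sector ranges on a single device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of where one extent pointer's data lives on disk. */
struct sketch_disk_region {
    uint32_t dev;      /* device index */
    uint64_t start;    /* first sector of the region */
    uint64_t sectors;  /* length of the region in sectors */
};

/* True only if both regions are on the same device and share at least one
 * sector. Regions that merely touch end to start do not overlap. */
bool sketch_regions_overlap(struct sketch_disk_region a,
                            struct sketch_disk_region b)
{
    return a.dev == b.dev &&
           a.start < b.start + b.sectors &&
           b.start < a.start + a.sectors;
}
```

Under this test, a new extent allocated in the disk space immediately following the old one is rejected, since adjacent ranges share no sectors.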

bcachefs: Fix failure to return error in data_update_index_update() · 66927b89

Kent Overstreet authored Aug 26, 2024

This fixes an assertion pop in io_write.c - if we don't return an error
we're supposed to have completed all the btree updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

66927b89

24 Aug, 2024 2 commits

bcachefs: Fix rebalance_work accounting · 49aa7830

Kent Overstreet authored Aug 23, 2024

rebalance_work was keying off of the presence of rebalance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available.

Fixes: 20ac515a ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

49aa7830

bcachefs: Fix failure to flush moves before sleeping in copygc · d3204616

Kent Overstreet authored Aug 23, 2024

This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d3204616

22 Aug, 2024 22 commits

bcachefs: don't use rht_bucket() in btree_key_cache_scan() · a592cdf5

Kent Overstreet authored Aug 19, 2024

rht_bucket() does strange complicated things when a rehash is in
progress.

Instead, just skip scanning when a rehash is in progress: scanning is
going to be more expensive (many more empty slots to cover), and some
sort of infinite loop is being observed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a592cdf5

bcachefs: add missing inode_walker_exit() · 3e878fe5

Kent Overstreet authored Aug 22, 2024

fix a small leak
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3e878fe5

bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() · 87313ac1

Kent Overstreet authored Aug 22, 2024

bch2_btree_key_cache_drop() evicts the key cache entry - it's used when
we're doing an update that bypasses the key cache, because for cache
coherency reasons a key can't be in the key cache unless it also exists
in the btree - i.e. creates have to bypass the cache.

After evicting, the path no longer points to a key cache key, and
relock() will always fail if should_be_locked is true.

Prep for improving path->should_be_locked assertions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

87313ac1

bcachefs: Fix double assignment in check_dirent_to_subvol() · dedb2fe3

Yuesong Li authored Aug 22, 2024

ret was assigned twice in check_dirent_to_subvol(). Reported by cocci.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

dedb2fe3

bcachefs: Fix refcounting in discard path · 0b50b731

Kent Overstreet authored Aug 21, 2024

bch_dev->io_ref does not protect against the filesystem going away;
bch_fs->writes does.

Thus the filesystem write ref needs to be the last ref we release.

Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0b50b731

bcachefs: Fix compat issue with old alloc_v4 keys · 8ed823b1

Kent Overstreet authored Aug 21, 2024

we allow new fields to be added to existing key types, and new versions
should treat them as being zeroed; this was not handled in
alloc_v4_validate.

Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8ed823b1
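
The compatibility rule stated above (fields added to a key type later must read as zero when the key was written by an older version) can be sketched as follows. This is an illustration under assumptions, not the alloc_v4 code: `struct sketch_alloc_key`, its fields, and both functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory key layout; the last field was "added later" and is
 * absent from keys written by older versions. */
struct sketch_alloc_key {
    uint64_t gen;
    uint64_t dirty_sectors;
    uint64_t field_added_later;
};

/* Unpack an on-disk key that may be shorter than the current layout: zero the
 * destination first, then copy only the bytes that are actually present. */
void sketch_unpack_key(struct sketch_alloc_key *dst, const void *src,
                       size_t src_bytes)
{
    size_t n = src_bytes < sizeof(*dst) ? src_bytes : sizeof(*dst);

    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, n);
}

/* Demo: unpack a key from an "old version" that only knew the first two
 * fields and return the value seen for the newer field. Returns UINT64_MAX
 * if the old fields were not preserved. */
uint64_t sketch_old_key_added_field(void)
{
    const uint64_t old_key[2] = { 7, 9 };
    struct sketch_alloc_key k;

    sketch_unpack_key(&k, old_key, sizeof(old_key));
    if (k.gen != 7 || k.dirty_sectors != 9)
        return UINT64_MAX;
    return k.field_added_later;
}
```

Any validation that runs afterwards then has to accept a zero in the newer field for such keys, which is the case the commit message says was not handled.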

bcachefs: Fix warning in bch2_fs_journal_stop() · 7f2de694

Kent Overstreet authored Aug 21, 2024

j->last_empty_seq needs to match j->seq when the journal is empty

Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f2de694

fs/super.c: improve get_tree() error message · 06f67437

Kent Overstreet authored Aug 21, 2024

seeing an odd bug where we fail to correctly return an error from
.get_tree():

https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66

we need to be able to distinguish between accidentally returning a
positive error (as implied by the log) and no error.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

06f67437

bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() · bdbdd475

Kent Overstreet authored Aug 21, 2024

Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bdbdd475

bcachefs: Fix replay_now_at() assert · cab18be6

Kent Overstreet authored Aug 21, 2024

Journal replay, in the slowpath where we insert keys in journal order,
was inserting keys in the wrong order; keys from early repair come last.

Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

cab18be6

bcachefs: Fix locking in bch2_ioc_setlabel() · 6575b8c9

Kent Overstreet authored Aug 20, 2024

Fixes: 7a254053 ("bcachefs: support FS_IOC_SETFSLABEL")
Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6575b8c9

bcachefs: fix failure to relock in btree_node_fill() · 5dbfc4ef

Kent Overstreet authored Aug 20, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5dbfc4ef

bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() · 3c5d0b72

Kent Overstreet authored Aug 19, 2024

We weren't always so strict about trans->locked state - but now we are,
and new assertions are shaking some bugs out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3c5d0b72

bcachefs: unlock_long() before resort in journal replay · 1dceae4c

Kent Overstreet authored Aug 20, 2024

Fix another SRCU splat - this one pretty harmless.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

1dceae4c

bcachefs: fix missing bch2_err_str() · cecc3282

Kent Overstreet authored Aug 20, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

cecc3282

bcachefs: fix time_stats_to_text() · b8db1bd8

Kent Overstreet authored Aug 19, 2024

Fixes: 7423330e ("bcachefs: prt_printf() now respects \r\n\t")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b8db1bd8

bcachefs: Fix bch2_bucket_gens_init() · c2a503f3

Kent Overstreet authored Aug 18, 2024

Comparing the wrong bpos - this was missed because normally
bucket_gens_init() runs on brand new filesystems, but this bug caused it
to overwrite bucket_gens keys with 0s when upgrading ancient
filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

c2a503f3

bcachefs: Fix bch2_trigger_alloc assert · e150a7e8

Kent Overstreet authored Aug 18, 2024

On testing on an old mangled filesystem, we missed a case.

Fixes: bd864bc2 ("bcachefs: Fix bch2_trigger_alloc when upgrading from old versions")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e150a7e8

bcachefs: Fix failure to relock in btree_node_get() · 49203a6b

Kent Overstreet authored Aug 18, 2024

discovered by new trans->locked asserts
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

49203a6b

bcachefs: setting bcachefs_effective.* xattrs is a noop · 548e7f51

Kent Overstreet authored Aug 18, 2024

bcachefs_effective.* xattrs show the options inherited from parent
directories (as well as explicitly set); this namespace is not for
setting bcachefs options.

Change the .set() handler to a noop so that if e.g. rsync is copying
xattrs it'll do the right thing, and only copy xattrs in the bcachefs.*
namespace. We don't want to return an error, because that will cause
rsync to bail out or get spammy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

548e7f51

bcachefs: Fix "trying to move an extent, but nr_replicas=0" · 8cc0e506

Kent Overstreet authored Aug 18, 2024

data_update_init() does a bunch of complicated stuff to decide how many
replicas to add, since we only want to increase an extent's durability
on an explicit rereplicate, but extent pointers may be on devices with
different durability settings.

There was a corner case when evacuating a device that had been set to
durability=0 after data had been written to it, and extents on that
device had already been rereplicated - then evacuate only needs to drop
pointers on that device, not move them.

So the assert for !m->op.nr_replicas was spurious; this was a perfectly
legitimate case that needed to be handled.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8cc0e506

bcachefs: bch2_data_update_init() cleanup · 3f53d050

Kent Overstreet authored Aug 18, 2024

Factor out some helpers - this function has gotten much too big.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3f53d050

20 Aug, 2024 1 commit

bcachefs: Extra debug for data move path · 2102bdac

Kent Overstreet authored Aug 17, 2024

We don't have sufficient information to debug:

https://github.com/koverstreet/bcachefs/issues/726

- print out durability of extent ptrs, when non default
- print the number of replicas we need in data_update_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2102bdac