1. 22 Oct, 2023 40 commits
    • Kent Overstreet's avatar
      bcachefs: Clear btree_node_just_written() when node reused or evicted · 0b438c5b
      Kent Overstreet authored
      This fixes the following bug:
      
      Journal reclaim attempts to flush a node, but races with the node being
      evicted from the btree node cache; when we lock the node, the data
      buffers have already been freed.
      
      We don't evict a node that's dirty, so calling btree_node_write() is
      fine - it's a noop - except that the btree_node_just_written bit causes
      bch2_btree_post_write_cleanup() to run (resorting the node), which then
      causes a null ptr deref.
      
      00078 Unable to handle kernel NULL pointer dereference at virtual address 000000000000009e
      00078 Mem abort info:
      00078   ESR = 0x0000000096000005
      00078   EC = 0x25: DABT (current EL), IL = 32 bits
      00078   SET = 0, FnV = 0
      00078   EA = 0, S1PTW = 0
      00078   FSC = 0x05: level 1 translation fault
      00078 Data abort info:
      00078   ISV = 0, ISS = 0x00000005
      00078   CM = 0, WnR = 0
      00078 user pgtable: 4k pages, 39-bit VAs, pgdp=000000007ed64000
      00078 [000000000000009e] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
      00078 Internal error: Oops: 0000000096000005 [#1] SMP
      00078 Modules linked in:
      00078 CPU: 75 PID: 1170 Comm: stress-ng-utime Not tainted 6.3.0-ktest-g5ef5b466e77e #2078
      00078 Hardware name: linux,dummy-virt (DT)
      00078 pstate: 80001005 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
      00078 pc : btree_node_sort+0xc4/0x568
      00078 lr : bch2_btree_post_write_cleanup+0x6c/0x1c0
      00078 sp : ffffff803e30b350
      00078 x29: ffffff803e30b350 x28: 0000000000000001 x27: ffffff80076e52a8
      00078 x26: 0000000000000002 x25: 0000000000000000 x24: ffffffc00912e000
      00078 x23: ffffff80076e52a8 x22: 0000000000000000 x21: ffffff80076e52bc
      00078 x20: ffffff80076e5200 x19: 0000000000000000 x18: 0000000000000000
      00078 x17: fffffffff8000000 x16: 0000000008000000 x15: 0000000008000000
      00078 x14: 0000000000000002 x13: 0000000000000000 x12: 00000000000000a0
      00078 x11: ffffff803e30b400 x10: ffffff803e30b408 x9 : 0000000000000001
      00078 x8 : 0000000000000000 x7 : ffffff803e480000 x6 : 00000000000000a0
      00078 x5 : 0000000000000088 x4 : 0000000000000000 x3 : 0000000000000010
      00078 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80076e52a8
      00078 Call trace:
      00078  btree_node_sort+0xc4/0x568
      00078  bch2_btree_post_write_cleanup+0x6c/0x1c0
      00078  bch2_btree_node_write+0x108/0x148
      00078  __btree_node_flush+0x104/0x160
      00078  bch2_btree_node_flush0+0x1c/0x30
      00078  journal_flush_pins.constprop.0+0x184/0x2d0
      00078  __bch2_journal_reclaim+0x4d4/0x508
      00078  bch2_journal_reclaim+0x1c/0x30
      00078  __bch2_journal_preres_get+0x244/0x268
      00078  bch2_trans_journal_preres_get_cold+0xa4/0x180
      00078  __bch2_trans_commit+0x61c/0x1bb0
      00078  bch2_setattr_nonsize+0x254/0x318
      00078  bch2_setattr+0x5c/0x78
      00078  notify_change+0x2bc/0x408
      00078  vfs_utimes+0x11c/0x218
      00078  do_utimes+0x84/0x140
      00078  __arm64_sys_utimensat+0x68/0xa8
      00078  invoke_syscall.constprop.0+0x54/0xf0
      00078  do_el0_svc+0x48/0xd8
      00078  el0_svc+0x14/0x48
      00078  el0t_64_sync_handler+0xb0/0xb8
      00078  el0t_64_sync+0x14c/0x150
      00078 Code: 8b050265 910020c6 8b060266 910060ac (79402cad)
      00078 ---[ end trace 0000000000000000 ]---
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      0b438c5b
    • Kent Overstreet's avatar
      bcachefs: alloc_v4_u64s() fix · faa62a20
      Kent Overstreet authored
      With the recent bkey_ops.min_val_size addition, bkey values are
      automatically extended to the size of the current version.
      
      The check in bch2_alloc_v4_invalid() needs to be updated to take this
      into account.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      faa62a20
    • Kent Overstreet's avatar
      bcachefs: Delete an incorrect bch2_trans_unlock() · a49bd8c0
      Kent Overstreet authored
      These deletes a bch2_trans_unlock() call from __bch2_move_data(). It was
      redundant; bch2_move_extent() has the correct unlock call, and it was
      buggy because when move_extent calls bch2_extent_drop_ptrs() we don't
      want the transaction to be unlocked yet - this fixes a btree_iter.c
      assertion.
      
      Fixes https://github.com/koverstreet/bcachefs/issues/511.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      a49bd8c0
    • Kent Overstreet's avatar
      bcachefs: Use memcpy_u64s_small() for copying keys · d598a9b7
      Kent Overstreet authored
      Small performance optimization; an open coded loop is better than rep ;
      movsq for small copies.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      d598a9b7
    • Kent Overstreet's avatar
      bcachefs: Fix check_overlapping_extents() · 73da30e8
      Kent Overstreet authored
      A error check had a flipped conditional - whoops.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      73da30e8
    • Kent Overstreet's avatar
      bcachefs: Replace a BUG_ON() with fatal error · 4a2e5d7b
      Kent Overstreet authored
      A user hit this BUG_ON() - it's unclear how it happened, so replace it
      with a fatal error that will cause us to go read only, and print out
      more information.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      4a2e5d7b
    • Kent Overstreet's avatar
      bcachefs: Delete some dead code in bch2_replicas_gc_end() · 92e637ce
      Kent Overstreet authored
      bch2_replicas_gc_(start|end) is now only used for journal replicas
      entries, which don't have bucket sector counts - so this code is
      entirely dead and can be deleted.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      92e637ce
    • Brian Foster's avatar
      bcachefs: mark journal replicas before journal write submission · a7b29b8d
      Brian Foster authored
      The journal write submission path marks the associated replica
      entries for journal data in journal_write_done(), which is just
      after journal write bio submission. This creates a small window
      where journal entries might have been written out, but the
      associated replica is not marked such that recovery does not know
      that the associated device contains journal data.
      
      Move the replica marking a bit earlier in the write path such that
      recovery is guaranteed to recognize that the device contains journal
      data in the event of a crash.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      a7b29b8d
    • Kent Overstreet's avatar
    • Kent Overstreet's avatar
      bcachefs: Fix quotas + snapshots · cb1b479d
      Kent Overstreet authored
      Now that we can reliably designate and find the master subvolume out of
      a tree of snapshots, we can finally make quotas work with snapshots:
      
      That is - quotas will now _ignore_ snapshot subvolumes, and only be in
      effect for the master (non snapshot) subvolume.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      cb1b479d
    • Kent Overstreet's avatar
      bcachefs: Add otime, parent to bch_subvolume · 653693be
      Kent Overstreet authored
      Add two new fields to bch_subvolume:
       - otime: creation time
       - parent: For snapshots, this is the id of the subvolume the snapshot
         was created from
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      653693be
    • Kent Overstreet's avatar
      bcachefs: BTREE_ID_snapshot_tree · 1c59b483
      Kent Overstreet authored
      This adds a new btree which gets us a persistent per-snapshot-tree
      identifier.
      
       - BTREE_ID_snapshot_trees
       - KEY_TYPE_snapshot_tree
       - bch_snapshot now has a field that points to a snapshot_tree
      
      This is going to be used to designate one snapshot ID/subvolume out of a
      given tree of snapshots as the "main" subvolume, so that we can do quota
      accounting in that subvolume and not the rest.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      1c59b483
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_get_empty_slot() · 51e84d3b
      Kent Overstreet authored
      Add a new helper for allocating a new slot in a btree.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      51e84d3b
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_make_mut() now calls bch2_trans_update() · dbda63bb
      Kent Overstreet authored
      It's safe to call bch2_trans_update with a k/v pair where the value
      hasn't been filled out, as long as the key part has been and the value
      is filled out by transaction commit time.
      
      This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(),
      eliminating a bit of boilerplate.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      dbda63bb
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_get_mut() now calls bch2_trans_update() · f12a798a
      Kent Overstreet authored
      It's safe to call bch2_trans_update with a k/v pair where the value
      hasn't been filled out, as long as the key part has been and the value
      is filled out by transaction commit time.
      
      This patch folds the bch2_trans_update() call into bch2_bkey_get_mut(),
      eliminating a bit of boilerplate.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      f12a798a
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_alloc() now calls bch2_trans_update() · f8cb35fd
      Kent Overstreet authored
      It's safe to call bch2_trans_update with a k/v pair where the value
      hasn't been filled out, as long as the key part has been and the value
      is filled out by transaction commit time.
      
      This patch folds the bch2_trans_update() call into bch2_bkey_alloc(),
      eliminating a bit of boilerplate.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      f8cb35fd
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_get_mut() improvements · 34dfa5db
      Kent Overstreet authored
       - bch2_bkey_get_mut() now handles types increasing in size, allocating
         a buffer for the type's current size when necessary
       - bch2_bkey_make_mut_typed()
       - bch2_bkey_get_mut() now initializes the iterator, like
         bch2_bkey_get_iter()
      
      Also, refactor so that most of the code is in functions - now macros are
      only used for wrappers.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      34dfa5db
    • Kent Overstreet's avatar
      bcachefs: Move bch2_bkey_make_mut() to btree_update.h · d67a16df
      Kent Overstreet authored
      It's for doing updates - this is where it belongs, and next pathes will
      be changing these helpers to use items from btree_update.h.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      d67a16df
    • Kent Overstreet's avatar
      bcachefs: bch2_bkey_get_iter() helpers · bcb79a51
      Kent Overstreet authored
      Introduce new helpers for a common pattern:
      
        bch2_trans_iter_init();
        bch2_btree_iter_peek_slot();
      
       - bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
         the correct type
       - bch2_bkey_get_val_typed() copies the val out of the btree to a
         (typically stack allocated) variable; it handles the case where the
         value in the btree is smaller than the current version of the type,
         zeroing out the remainder.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      bcb79a51
    • Kent Overstreet's avatar
      bcachefs: bkey_ops.min_val_size · 174f930b
      Kent Overstreet authored
      This adds a new field to bkey_ops for the minimum size of the value,
      which standardizes that check and also enforces the new rule (previously
      done somewhat ad-hoc) that we can extend value types by adding new
      fields on to the end.
      
      To make that work we do _not_ initialize min_val_size with sizeof,
      instead we initialize it to the size of the first version of those
      values.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      174f930b
    • Kent Overstreet's avatar
    • Kent Overstreet's avatar
      bcachefs: Btree iterator, update flags no longer conflict · 95b595a5
      Kent Overstreet authored
      Change btree_update_flags to start after the last btree iterator flag,
      so that we can pass both in the same flags argument.
      
      This is needed for the upcoming bch2_bkey_get_mut() helper.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      95b595a5
    • Brian Foster's avatar
    • Brian Foster's avatar
      bcachefs: fix accounting corruption race between reclaim and dev add · 3c434cdf
      Brian Foster authored
      When a device is removed from a bcachefs volume, the associated
      content is removed from the various btrees. The alloc tree uses the
      key cache, so when keys are removed the deletes exist in cache for a
      period of time until reclaim comes along and flushes outstanding
      updates.
      
      When a device is re-added to the bcachefs volume, the add process
      re-adds some of these previously deleted keys. When marking device
      superblock locations on device add, the keys will likely refer to
      some of the same alloc keys that were just removed. The memory
      triggers for these key updates are responsible for further updates,
      such as bch2_mark_alloc() calling into bch2_dev_usage_update() to
      update per-device usage accounting.
      
      When a new key is added to key cache, the trans update path also
      flushes the key to the backing btree for coherency reasons for tree
      walks.
      
      With all of this context, if a device is removed and re-added
      quickly enough such that some key deletes from the remove are still
      pending a key cache flush, the trans update path can view this as
      addition of a new key because the old key in the insert entry refers
      to a deleted key. However the deleted cached key has not been filled
      by absence of a btree key, but rather refers to an explicit deletion
      of an existing key that occurred during device removal.
      
      The trans update path adds a new update to flush the key and tags
      the original (cached) update to skip running the memory triggers.
      This results in running triggers on the non-cached update instead,
      which in turn will perform accounting updates based on incoherent
      values. For example, bch2_dev_usage_update() subtracts the the old
      alloc key dirty sector count in the non-cached btree key from the
      newly initialized (i.e. zeroed) per device counters, leading to
      underflow and accounting corruption.
      
      There are at least a few ways to avoid this problem, the simplest of
      which may be to run triggers against the cached update rather than
      the non-cached update. If the key only needs to be flushed when the
      key is not present in the tree, however, then this still performs an
      unnecessary update. We could potentially use the cached key dirty
      state to determine whether the delete is a dirty, cached update vs.
      a clean cache fill, but this may require transmitting key cache
      dirty state across layers, which adds complexity and seems to be of
      limited value. Instead, update flush_new_cached_update() to handle
      this by simply checking for the key in the btree and only perform
      the flush when a backing key is not present.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      3c434cdf
    • Kent Overstreet's avatar
      bcachefs: Mark bch2_copygc() noinline · 958c347b
      Kent Overstreet authored
      This works around a "stack from too large" error.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      958c347b
    • Kent Overstreet's avatar
      bcachefs: Delete obsolete btree ptr check · 3140a3d0
      Kent Overstreet authored
      This patch deletes a .key_invalid check for btree pointers that only
      applies to _very_ old on disk format versions, and potentially
      complicates the upgrade process.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      3140a3d0
    • Kent Overstreet's avatar
      bcachefs: Always run topology error when CONFIG_BCACHEFS_DEBUG=y · 6b52bcde
      Kent Overstreet authored
      Improved test coverage.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      6b52bcde
    • Kent Overstreet's avatar
      a0668d77
    • Kent Overstreet's avatar
      bcachefs: Make sure hash info gets initialized in fsck · c8d5b714
      Kent Overstreet authored
      We had some bugs with setting/using first_this_inode in the inode
      walker in the dirents/xattr code.
      
      This patch changes to not clear first_this_inode until after
      initializing the new hash info.
      
      Also, we fix an error message to not print on transaction restart, and
      add a comment to related fsck error code.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      c8d5b714
    • Kent Overstreet's avatar
      bcachefs: Kill bch2_verify_bucket_evacuated() · 1af5227c
      Kent Overstreet authored
      With backpointers, it's now impossible for bch2_evacuate_bucket() to be
      completely reliable: it can race with an extent being partially
      overwritten or split, which needs a new write buffer flush for the
      backpointer to be seen.
      
      This shouldn't be a real issue in practice; the previous patch added a
      new tracepoint so we'll be able to see more easily if it is.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      1af5227c
    • Kent Overstreet's avatar
      bcachefs: Improve move path tracepoints · 5a21764d
      Kent Overstreet authored
      Move path tracepoints now include the key being moved. Also, add new
      tracepoints for the start of move_extent, and evacuate_bucket.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      5a21764d
    • Kent Overstreet's avatar
      bcachefs: Drop a redundant error message · 09ebfa61
      Kent Overstreet authored
      When we're already read-only, we don't need to print out errors from
      writing btree nodes.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      09ebfa61
    • Brian Foster's avatar
      bcachefs: remove bucket_gens btree keys on device removal · 02d51bb9
      Brian Foster authored
      If a device has keys in the bucket_gens btree associated with its
      buckets and is removed from a bcachefs volume, fsck will complain
      about the presence of keys associated with an invalid device index.
      A repair removes the associated keys and restores correctness.
      Update bch2_dev_remove_alloc() to remove device related keys at
      device removal time to avoid the problem.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      02d51bb9
    • Brian Foster's avatar
      bcachefs: fix NULL bch_dev deref when checking bucket_gens keys · 251babb5
      Brian Foster authored
      fsck removes bucket_gens keys for devices that do not exist in the
      volume (i.e., if the device was removed). In 'fsck -n' mode, the
      associated fsck_err_on() wrapper returns false to skip the key
      removal. This proceeds on to the rest of the function, which
      eventually segfaults on a NULL bch_dev because the device does not
      exist.
      
      Update bch2_check_bucket_gens_key() to skip out of the rest of the
      function when the associated device does not exist, regardless of
      running fsck in check or repair mode.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      251babb5
    • Brian Foster's avatar
      bcachefs: folio pos to bch_folio_sector index helper · bf98ee10
      Brian Foster authored
      Create a small helper to translate from file offset to the
      associated bch_folio_sector index in the underlying bch_folio. The
      helper assumes the file offset is covered by the passed folio.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      bf98ee10
    • Kent Overstreet's avatar
      bcachefs: Fix a null ptr deref in fsck check_extents() · e3dc75eb
      Kent Overstreet authored
      It turns out, in rare situations we need to be passing in a disk
      reservation, which will be used internally by the transaction commit
      path when needed. Pass one in...
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      e3dc75eb
    • Kent Overstreet's avatar
      bcachefs: Fix a slab-out-of-bounds · 615fccad
      Kent Overstreet authored
      In __bch2_alloc_to_v4_mut(), we overrun the buffer we allocate if the
      alloc key had backpointers stored in it (which we no longer support).
      
      Fix this with a max() call.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      615fccad
    • Kent Overstreet's avatar
      bcachefs: Allow answering y or n to all fsck errors of given type · 853b7393
      Kent Overstreet authored
      This changes the ask_yn() function used by fsck to accept Y or N,
      meaning yes or no for all errors of a given type.
      
      With this, the user can be prompted only for distinct error types -
      useful when a filesystem has lots of errors.
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      853b7393
    • Brian Foster's avatar
      bcachefs: use u64 for folio end pos to avoid overflows · 6b9857b2
      Brian Foster authored
      Some of the folio_end_*() helpers are prone to overflow of signed
      64-bit types because the mapping is only limited by the max value of
      loff_t and the associated helpers return the start offset of the
      next folio. Therefore, a folio_end_pos() of the max allowable folio in a
      mapping returns a value that overflows loff_t.
      
      This makes it hard to rely on such values when doing folio
      processing across a range of a file, as bcachefs attempts to do with
      the recent folio changes. For example, generic/564 causes problems
      in the buffered write path when testing writes at max boundary
      conditions.
      
      The current understanding is that the pagecache historically limited
      the mapping to one less page to avoid this problem and this was
      dropped with some of the folio conversions, but may be reinstated to
      properly address the problem. In the meantime, update the internal
      folio_end_*() helpers in bcachefs to return a u64, and all of the
      associated code to use or cast to u64 to avoid overflow problems.
      This allows generic/564 to pass and can be reverted back to using
      loff_t if at any point the pagecache subsystem can guarantee these
      boundary conditions will not overflow.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      6b9857b2
    • Brian Foster's avatar
      bcachefs: clean up post-eof folios on -ENOSPC · 335f7d4f
      Brian Foster authored
      The buffered write path batches folio creations in the file mapping
      based on the requested size of the write. Under low free space
      conditions, it is possible to add a bunch of folios to the mapping
      and then return a short write or -ENOSPC due to lack of space. If
      this occurs on an extending write, the file size is updated based on
      the amount of data successfully written to the file. If folios were
      added beyond the final i_size, they may hang around until reclaimed,
      truncated or encountered unexpectedly by another operation.
      
      For example, generic/083 reproduces a sequence of events where a
      short write leaves around one or more post-EOF folios on an inode, a
      subsequent zero range request extends beyond i_size and overlaps
      with an aforementioned folio, and __bch2_truncate_folio() happens
      across it and complains.
      
      Update __bch2_buffered_write() to keep track of the start offset of
      the last folio added to the mapping for a prospective write. After
      i_size is updated, check whether this offset starts beyond EOF. If
      so, truncate pagecache beyond the latest EOF to clean up any folios
      that don't reside at least partially within EOF upon completion of
      the write.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
      335f7d4f