  1. 09 May, 2024 1 commit
    • kbuild: use $(src) instead of $(srctree)/$(src) for source directory · b1992c37
      Masahiro Yamada authored
      Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
      checked-in source files. It is merely a convention without any functional
      difference. In fact, $(obj) and $(src) are exactly the same, as defined
      in scripts/Makefile.build:
      
          src := $(obj)
      
      When the kernel is built in a separate output directory, $(src) does
      not accurately reflect the source directory location. While Kbuild
      resolves this discrepancy by specifying VPATH=$(srctree) to search for
      source files, it does not cover all cases. For example, when adding a
      header search path for local headers, -I$(srctree)/$(src) is typically
      passed to the compiler.
      
      This introduces inconsistency between upstream and downstream Makefiles
      because $(src) is used instead of $(srctree)/$(src) for the latter.
      
      To address this inconsistency, this commit changes the semantics of
      $(src) so that it always points to the directory in the source tree.
      
      Going forward, the variables used in Makefiles will have the following
      meanings:
      
        $(obj)     - directory in the object tree
        $(src)     - directory in the source tree  (changed by this commit)
        $(objtree) - the top of the kernel object tree
        $(srctree) - the top of the kernel source tree
      
      Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
      with $(src).
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
  2. 22 Feb, 2024 12 commits
    • xfs: create refcount bag structure for btree repairs · 7a2192ac
      Darrick J. Wong authored
      Create a bag structure for refcount information that uses the refcount
      bag btree defined in the previous patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: define an in-memory btree for storing refcount bag info during repairs · 18a1e644
      Darrick J. Wong authored
      Create a new in-memory btree type so that we can store refcount bag info
      in a much more memory-efficient and performant format.  Recall that the
      refcount recordset regenerator computes the new recordset from browsing
      the rmap records.  Let's say that the rmap records are:
      
      {agbno: 10, length: 40, ...}
      {agbno: 11, length: 3, ...}
      {agbno: 12, length: 20, ...}
      {agbno: 15, length: 1, ...}
      
      It is convenient to have a data structure that could quickly tell us the
      refcount for an arbitrary agbno without wasting memory.  An array or a
      list could do that pretty easily.  Lists suck because of the pointer
      overhead.  xfarrays are a lot more compact, but we want to minimize
      sparse holes in the xfarray to constrain memory usage.  Maintaining any
      kind of record order isn't needed for correctness, so I created the
      "rcbag", which is shorthand for an unordered list of (excerpted) reverse
      mappings.
      
      So we add the first rmap to the rcbag, and it looks like:
      
      0: {agbno: 10, length: 40}
      
      The refcount for agbno 10 is 1.  Then we move on to block 11, so we add
      the second rmap:
      
      0: {agbno: 10, length: 40}
      1: {agbno: 11, length: 3}
      
      The refcount for agbno 11 is 2.  We move on to block 12, so we add the
      third:
      
      0: {agbno: 10, length: 40}
      1: {agbno: 11, length: 3}
      2: {agbno: 12, length: 20}
      
      The refcount for agbno 12 and 13 is 3.  We move on to block 14, and
      remove the second rmap:
      
      0: {agbno: 10, length: 40}
      1: NULL
      2: {agbno: 12, length: 20}
      
      The refcount for agbno 14 is 2.  We move on to block 15, and add the
      last rmap.  But we don't care where it is and we don't want to expand
      the array so we put it in slot 1:
      
      0: {agbno: 10, length: 40}
      1: {agbno: 15, length: 1}
      2: {agbno: 12, length: 20}
      
      The refcount for block 15 is 3.  Notice how order doesn't matter in this
      list?  That's why repair uses an unordered list, or "bag".  The data
      structure is not a set because it does not guarantee uniqueness.
      
      That said, adding and removing specific items is now an O(n) operation
      because we have no idea where that item might be in the list.  Overall,
      the runtime is O(n^2) which is bad.
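
The walk above can be sketched in user-space C. This is an illustration only, not the kernel code: `struct rcbag`, `rcbag_add`, `rcbag_remove_ending_at`, and `rcbag_count` are hypothetical names, and the fixed slot count is an assumption made to keep the sketch short. An empty slot is marked by a zero length, and both add and remove scan the whole array, which is the O(n) cost described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RCBAG_SLOTS 8	/* hypothetical fixed capacity for this sketch */

struct rcbag_rec {
	uint32_t agbno;		/* first block of the mapping */
	uint32_t length;	/* number of blocks; 0 marks an empty slot */
};

struct rcbag {
	struct rcbag_rec slots[RCBAG_SLOTS];
};

/* Put a mapping into the first empty slot; order is irrelevant. O(n). */
static void rcbag_add(struct rcbag *bag, uint32_t agbno, uint32_t length)
{
	assert(length > 0);
	for (size_t i = 0; i < RCBAG_SLOTS; i++) {
		if (bag->slots[i].length == 0) {
			bag->slots[i].agbno = agbno;
			bag->slots[i].length = length;
			return;
		}
	}
	assert(!"rcbag sketch is full");
}

/* Drop every mapping whose last block is agbno - 1. O(n). */
static void rcbag_remove_ending_at(struct rcbag *bag, uint32_t agbno)
{
	for (size_t i = 0; i < RCBAG_SLOTS; i++) {
		struct rcbag_rec *rec = &bag->slots[i];

		if (rec->length != 0 && rec->agbno + rec->length == agbno)
			rec->length = 0;
	}
}

/* The refcount at the current block is the number of live mappings. */
static unsigned int rcbag_count(const struct rcbag *bag)
{
	unsigned int count = 0;

	for (size_t i = 0; i < RCBAG_SLOTS; i++) {
		if (bag->slots[i].length != 0)
			count++;
	}
	return count;
}
```

Replaying the four rmap records from the example against this sketch gives counts of 1, 2, 3, 2 and 3 at blocks 10, 11, 12, 14 and 15, and the record for block 15 lands in the slot vacated by the second rmap.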
      
      I realized that I could easily refactor the btree code and reimplement
      the refcount bag with an xfbtree.  Adding and removing is now O(log2 n),
      so the runtime is at least O(n log2 n), which is much faster.  In the
      end, the rcbag becomes a sorted list, but that's merely a detail of the
      implementation.  The repair code doesn't care.
      
      (Note: That horrible xfs_db bmap_inflate command can be used to exercise
      this sort of rcbag insanity by cranking up refcounts quickly.)
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair the rmapbt · 32080a9b
      Darrick J. Wong authored
      Rebuild the reverse mapping btree from all primary metadata.  This first
      patch establishes the bare mechanics of finding records and putting
      together a new ondisk tree; more complex pieces are needed to make it
      work properly.
      
      Link: Documentation/filesystems/xfs-online-fsck-design.rst
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: support in-memory btrees · a095686a
      Darrick J. Wong authored
      Adapt the generic btree cursor code to be able to create a btree whose
      buffers come from a (presumably in-memory) buftarg with a header block
      that's specific to in-memory btrees.  We'll connect this to other parts
      of online scrub in the next patches.
      
      Note that in-memory btrees always have a block size matching the system
      memory page size for efficiency reasons.  There are also a few things we
      need to do to finalize a btree update; that's covered in the next patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: support in-memory buffer cache targets · 5076a604
      Darrick J. Wong authored
      Allow the buffer cache to target in-memory files by making it possible
      to have a buftarg that maps pages from private shmem files.  As the
      previous patch alludes, the in-memory buftarg contains its own cache,
      points to a shmem file, and does not point to a block_device.
      
      The next few patches will make it possible to construct an xfs_btree in
      pageable memory by using this buftarg.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair summary counters · 4ed080cd
      Darrick J. Wong authored
      Use the same summary counter calculation infrastructure to generate new
      values for the in-core summary counters.   The difference between the
      scrubber and the repairer is that the repairer will freeze the fs during
      setup, which means that the values should match exactly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: teach repair to fix file nlinks · 6b631c60
      Darrick J. Wong authored
      Fix the file link counts since we just computed the correct ones.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: teach scrub to check file nlinks · f1184081
      Darrick J. Wong authored
      Create the necessary scrub code to walk the filesystem's directory tree
      so that we can compute file link counts.  Similar to quotacheck, we
      create an incore shadow array of link count information and then we walk
      the filesystem a second time to compare the link counts.  We need live
      updates to keep the information up to date during the lengthy scan, so
      this scrubber remains disabled until the next patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair dquots based on live quotacheck results · 96ed2ae4
      Darrick J. Wong authored
      Use the shadow quota counters that live quotacheck creates to reset the
      incore dquot counters.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: implement live quotacheck inode scan · 48dd9117
      Darrick J. Wong authored
      Create a new trio of scrub functions to check quota counters.  While the
      dquots themselves are filesystem metadata and should be checked early,
      the dquot counter values are computed from other metadata and are
      therefore summary counters.  We don't plug these into the scrub dispatch
      just yet, because we still need to be able to watch quota updates while
      doing our scan.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: allow scrub to hook metadata updates in other writers · 4e98cc90
      Darrick J. Wong authored
      Certain types of filesystem metadata can only be checked by scanning
      every file in the entire filesystem.  Specific examples of this include
      quota counts, file link counts, and reverse mappings of file extents.
      Directory and parent pointer reconstruction may also fall into this
      category.  File scanning is much trickier than scanning AG metadata
      because we have to take inode locks in the same order as the rest of
      [VX]FS, we can't be holding buffer locks when we do that, and scanning
      the whole filesystem takes time.
      
      Earlier versions of the online repair patchset relied heavily on
      fsfreeze as a means to quiesce the filesystem so that we could take
      locks in the proper order without worrying about concurrent updates from
      other writers.  Reviewers of those patches opined that freezing the
      entire fs to check and repair something was not sufficiently better than
      unmounting to run fsck offline.  I don't agree with that 100%, but the
      message was clear: find a way to repair things that minimizes the
      quiet period where nobody can write to the filesystem.
      
      Generally, building btree indexes online can be split into two phases: a
      collection phase where we compute the records that will be put into the
      new btree; and a construction phase, where we construct the physical
      btree blocks and persist them.  While it's simple to hold resource locks
      for the entirety of the two phases to ensure that the new index is
      consistent with the rest of the system, we don't need to hold resource
      locks during the collection phase if we have a means to receive live
      updates of other work going on elsewhere in the system.
      
      The goal of this patch, then, is to enable online fsck to learn about
      metadata updates going on in other threads while it constructs a shadow
      copy of the metadata records to verify or correct the real metadata.  To
      minimize the overhead when online fsck isn't running, we use srcu
      notifiers because they prioritize fast access to the notifier call chain
      (particularly when the chain is empty) at a cost to configuring
      notifiers.  Online fsck should be relatively infrequent, so this is
      acceptable.
      
      The intended usage model is fairly simple.  Code that modifies a
      metadata structure of interest should declare a xfs_hook_chain structure
      in some well defined place, and call xfs_hook_call whenever an update
      happens.  Online fsck code should define a struct notifier_block and use
      xfs_hook_add to attach the block to the chain, along with a function to
      be called.  This function should synchronize with the fsck scanner to
      update whatever in-memory data the scanner is collecting.  When
      finished, xfs_hook_del removes the notifier from the list and waits for
      them all to complete.
      
      Originally, I selected srcu notifiers over blocking notifiers to
      implement live hooks because they seemed to have fewer impacts to
      scalability.  The per-call cost of srcu_notifier_call_chain is higher
      (19ns) than blocking_notifier_ (4ns) in the single threaded case, but
      blocking notifiers use an rwsem to stabilize the list.  Cacheline
      bouncing for that rwsem is costly to runtime code when there are a lot
      of CPUs running regular filesystem operations.  If there are no hooks
      installed, this is a total waste of CPU time.
      
      Therefore, I stuck with srcu notifiers, despite trading off single
      threaded performance for multithreaded performance.  I also wasn't
      thrilled with the very high teardown time for srcu notifiers, since the
      caller has to wait for the next rcu grace period.  This can take a long
      time if there are a lot of CPUs.
      
      Then I discovered the jump label implementation of static keys.
      
      Jump labels use kernel code patching to replace a branch with a nop sled
      when the key is disabled.  IOWs, they can eliminate the overhead of
      _call_chain when there are no hooks enabled.  This makes blocking
      notifiers competitive again -- scrub runs faster because teardown of the
      chain is a lot cheaper, and runtime code only pays the rwsem locking
      overhead when scrub is actually running.
      
      With jump labels enabled, calls to empty notifier chains are elided from
      the call sites when there are no hooks registered, which means that the
      overhead is 0.36ns when fsck is not running.  This is perfect for most
      of the architectures that XFS is expected to run on (e.g. x86, powerpc,
      arm64, s390x, riscv).
      
      For architectures that don't support jump labels (e.g. m68k) the runtime
      overhead of checking the static key is an atomic counter read.  This
      isn't great, but it's still cheaper than taking a shared rwsem.
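
The hook pattern and its cheap empty-chain check can be sketched in user-space C. This is a single-threaded illustration under stated assumptions, not the kernel's xfs_hook implementation: `struct hook_chain`, `hook_add`, `hook_del`, `hook_call`, and `count_updates` are hypothetical names, the atomic counter stands in for the static key (it mirrors the atomic counter read mentioned above for architectures without jump labels), and the locking that a real notifier chain needs is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_HOOKS 4	/* hypothetical limit for this sketch */

typedef void (*hook_fn)(void *priv, unsigned long action);

struct hook {
	hook_fn fn;	/* NULL marks an unused entry */
	void *priv;
};

struct hook_chain {
	atomic_int nr_hooks;	/* stand-in for the static key */
	struct hook hooks[MAX_HOOKS];
};

/* Attach a callback; returns 0 on success, -1 if the chain is full. */
static int hook_add(struct hook_chain *chain, hook_fn fn, void *priv)
{
	assert(fn != NULL);
	for (size_t i = 0; i < MAX_HOOKS; i++) {
		if (chain->hooks[i].fn == NULL) {
			chain->hooks[i].fn = fn;
			chain->hooks[i].priv = priv;
			atomic_fetch_add(&chain->nr_hooks, 1);
			return 0;
		}
	}
	return -1;
}

/* Detach a previously attached callback, if present. */
static void hook_del(struct hook_chain *chain, hook_fn fn, void *priv)
{
	for (size_t i = 0; i < MAX_HOOKS; i++) {
		if (chain->hooks[i].fn == fn && chain->hooks[i].priv == priv) {
			chain->hooks[i].fn = NULL;
			chain->hooks[i].priv = NULL;
			atomic_fetch_sub(&chain->nr_hooks, 1);
			return;
		}
	}
}

/* Called by the code that updates metadata; returns hooks invoked. */
static int hook_call(struct hook_chain *chain, unsigned long action)
{
	int called = 0;

	/* Fast path: a single counter read when nobody is listening. */
	if (atomic_load(&chain->nr_hooks) == 0)
		return 0;

	for (size_t i = 0; i < MAX_HOOKS; i++) {
		if (chain->hooks[i].fn != NULL) {
			chain->hooks[i].fn(chain->hooks[i].priv, action);
			called++;
		}
	}
	return called;
}

/* Example callback: accumulate the update into the scanner's shadow data. */
static void count_updates(void *priv, unsigned long action)
{
	*(unsigned long *)priv += action;
}
```

With no hook attached, `hook_call` returns immediately after the counter read; once `count_updates` is attached, each call reaches the callback until `hook_del` removes it.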
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: implement live inode scan for scrub · 8660c7b7
      Darrick J. Wong authored
      This patch implements a live file scanner for online fsck functions that
      require the ability to walk a filesystem to gather metadata records and
      stay informed about metadata changes to files that have already been
      visited.
      
      The iscan structure consists of two inode number cursors: one to track
      which inode we want to visit next, and a second one to track which
      inodes have already been visited.  This second cursor is key to
      capturing live updates to files previously scanned while the main thread
      continues scanning -- any inode greater than this value hasn't been
      scanned and can go on its way; any other update must be incorporated
      into the collected data.  It is critical for the scanning thread to hold
      exclusive access on the inode until after marking the inode visited.
      
      This new code is a separate patch from the patchsets adding callers for
      the sake of enabling the author to move patches around his tree with
      ease.  The intended usage model for this code is roughly:
      
      	xchk_iscan_start(iscan, 0, 0);
      	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
      		xfs_ilock(ip, ...);
      		/* capture inode metadata */
      		xchk_iscan_mark_visited(iscan, ip);
      		xfs_iunlock(ip, ...);
      
      		xfs_irele(ip);
      	}
      	xchk_iscan_stop(iscan);
      	if (error)
      		return error;
      
      Hook functions for live updates can then do:
      
      	if (xchk_iscan_want_live_update(...))
      		/* update the captured inode metadata */
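
The role of the second cursor can be sketched in user-space C. This is a simplified model, assuming the scan visits inodes in strictly ascending order and never wraps around; `struct iscan_sketch`, `iscan_mark_visited`, and `iscan_want_live_update` are hypothetical names and do not reproduce the xchk_iscan functions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct iscan_sketch {
	uint64_t cursor_ino;	/* next inode the scanner wants to visit */
	uint64_t visited_ino;	/* highest inode already marked visited */
	bool any_visited;	/* false until the first inode is marked */
};

/* Scanner side: call while still holding exclusive access to the inode. */
static void iscan_mark_visited(struct iscan_sketch *is, uint64_t ino)
{
	/* This sketch assumes a strictly ascending scan. */
	assert(!is->any_visited || ino > is->visited_ino);
	is->visited_ino = ino;
	is->any_visited = true;
	is->cursor_ino = ino + 1;
}

/*
 * Hook side: an update to an inode beyond the visited cursor can be
 * ignored because the scanner will see the new state when it gets there.
 * Any other update must be folded into the collected data.
 */
static bool iscan_want_live_update(const struct iscan_sketch *is, uint64_t ino)
{
	return is->any_visited && ino <= is->visited_ino;
}
```

After inode 128 is marked visited, updates to inodes 64 and 128 are wanted while an update to inode 129 is not.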
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  3. 13 Feb, 2024 1 commit
    • xfs: convert kmem_alloc() to kmalloc() · f078d4ea
      Dave Chinner authored
      kmem_alloc() is just a thin wrapper around kmalloc() these days.
      Convert everything to use kmalloc() so we can get rid of the
      wrapper.
      
      Note: the transaction region allocation in xlog_add_to_transaction()
      can be a high order allocation. Converting it to use
      kmalloc(__GFP_NOFAIL) results in warnings in the page allocation
      code being triggered because the mm subsystem does not want us to
      use __GFP_NOFAIL with high order allocations like we've been doing
      with the kmem_alloc() wrapper for a couple of decades. Hence this
      specific case gets converted to xlog_kvmalloc() rather than
      kmalloc() to avoid this issue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
  4. 15 Dec, 2023 10 commits
  5. 07 Dec, 2023 1 commit
    • xfs: implement block reservation accounting for btrees we're staging · be408417
      Darrick J. Wong authored
      Create a new xrep_newbt structure to encapsulate a fake root for
      creating a staged btree cursor as well as to track all the blocks that
      we need to reserve in order to build that btree.
      
      As for the particular choice of lowspace thresholds and btree block
      slack factors -- at this point one could say that the thresholds in
      online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
      But that's not the entire story, since the offline btree rebuilding
      code in xfs_repair was merged as a retroport of the online btree code
      in this patchset!
      
      Before xfs_btree_staging.[ch] came along, xfs_repair determined the
      slack factor (aka the number of slots to leave unfilled in each new
      btree block) via open-coded logic in repair/phase5.c[2].  At that point
      the slack factors were arbitrary quantities per btree.  The rmapbt
      automatically left 10 slots free; everything else left zero.
      
      That had a noticeable effect on performance straight after mounting
      because adding records to /any/ btree would result in splits.  A few
      years ago when this patch was first written, Dave and I decided that
      repair should generate btree blocks that were 75% full unless space was
      tight, in which case it should try to fill the blocks to nearly full.
      We defined tight as ~10% free to avoid repair failures but settled on
      3/32 (~9%) to avoid div64.
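
The threshold arithmetic can be sketched in user-space C. This is an illustration under stated assumptions: `space_is_tight` and `slack_slots` are hypothetical names, and the value of 2 slots used for the "nearly full" case is an assumption of this sketch, not a figure taken from the text above. The multiply-and-shift form shows how 3/32 is tested without a 64-bit division.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "Tight" means fewer than 3/32 (~9%) of the blocks are free. */
static bool space_is_tight(uint64_t free_blocks, uint64_t total_blocks)
{
	/* (total * 3) >> 5 equals total * 3 / 32 without a division. */
	return free_blocks < ((total_blocks * 3) >> 5);
}

/* Number of record slots to leave unfilled in each new btree block. */
static unsigned int slack_slots(uint64_t free_blocks, uint64_t total_blocks,
				unsigned int maxrecs)
{
	assert(maxrecs >= 4);

	if (space_is_tight(free_blocks, total_blocks))
		return 2;	/* assumed "nearly full" value for this sketch */

	return maxrecs / 4;	/* leave 25% empty, so blocks are 75% full */
}
```

For a 3200-block group the threshold works out to 300 free blocks, so 299 free blocks counts as tight and 300 does not.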
      
      IOWs, we mostly pulled the thresholds out of thin air.  We've been
      QAing with those geometry numbers ever since. ;)
      
      Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
      Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  6. 10 Aug, 2023 4 commits
    • xfs: move the realtime summary file scrubber to a separate source file · b7d47a77
      Darrick J. Wong authored
      Move the realtime summary file checking code to a separate file in
      preparation to actually implement it.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: track usage statistics of online fsck · d7a74cad
      Darrick J. Wong authored
      Track the usage, outcomes, and run times of the online fsck code, and
      report these values via debugfs.  The columns in the file are:
      
       * scrubber name
      
       * number of scrub invocations
       * clean objects found
       * corruptions found
       * optimizations found
       * cross referencing failures
       * inconsistencies found during cross referencing
       * incomplete scrubs
       * warnings
       * number of times scrub had to retry
       * cumulative amount of time spent scrubbing (microseconds)
      
       * number of repair invocations
       * successfully repaired objects
       * cumulative amount of time spent repairing (microseconds)
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: create a big array data structure · 3934e8eb
      Darrick J. Wong authored
      Create a simple 'big array' data structure for storage of fixed-size
      metadata records that will be used to reconstruct a btree index.  For
      repair operations, the most important operations are append, iterate,
      and sort.
      
      Earlier implementations of the big array used linked lists and suffered
      from severe problems -- pinning all records in kernel memory was not a
      good idea and frequently led to OOM situations; random access was very
      inefficient; and record overhead for the lists was unacceptably high at
      40-60%.
      
      Therefore, the big memory array relies on the 'xfile' abstraction, which
      creates a memfd file and stores the records in page cache pages.  Since
      the memfd is created in tmpfs, the memory pages can be pushed out to
      disk if necessary and we have a built-in usage limit of 50% of physical
      memory.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: move the post-repair block reaping code to a separate file · e06ef14b
      Darrick J. Wong authored
      Reaping blocks after a repair is a complicated affair involving a lot of
      rmap btree lookups and figuring out if we're going to unmap or free old
      metadata blocks that might be crosslinked.  Eventually, we will need to
      be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
      garbage CoW staging extents, and (even later) blocks from btrees rooted
      in inodes.  This results in a lot of reaping code, so we might as well
      split that off while it's easy.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  7. 12 Apr, 2023 3 commits
    • xfs: cross-reference rmap records with ag btrees · fed050f3
      Darrick J. Wong authored
      Strengthen the rmap btree record checker a little more by comparing
      OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
      logs, respectively.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: streamline the directory iteration code for scrub · 4c233b5c
      Darrick J. Wong authored
      Currently, online scrub reuses the xfs_readdir code to walk every entry
      in a directory.  This isn't awesome for performance, since we end up
      cycling the directory ILOCK needlessly and coding around the particular
      quirks of the VFS dir_context interface.
      
      Create a streamlined version of readdir that keeps the ILOCK (since the
      walk function isn't going to copy stuff to userspace), skips a whole lot
      of directory walk cursor checks (since we start at 0 and walk to the
      end) and has a sane way to return error codes.
      
      Note: Porting the dotdot checking code is left for a subsequent patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: allow queued AG intents to drain before scrubbing · d5c88131
      Darrick J. Wong authored
      When a writer thread executes a chain of log intent items, the AG header
      buffer locks will cycle during a transaction roll to get from one intent
      item to the next in a chain.  Although scrub takes all AG header buffer
      locks, this isn't sufficient to guard against scrub checking an AG while
      that writer thread is in the middle of finishing a chain because there's
      no higher level locking primitive guarding allocation groups.
      
      When there's a collision, cross-referencing between data structures
      (e.g. rmapbt and refcountbt) yields false corruption events; if repair
      is running, this results in incorrect repairs, which is catastrophic.
      
      Fix this by adding to the perag structure the count of active intents
      and make scrub wait until it has both AG header buffer locks and the
      intent counter reaches zero.
      
      One quirk of the drain code is that deferred bmap updates also bump and
      drop the intent counter.  A fundamental decision made during the design
      phase of the reverse mapping feature is that updates to the rmapbt
      records are always made by the same code that updates the primary
      metadata.  In other words, callers of bmapi functions expect that the
      bmapi functions will queue deferred rmap updates.
      
      Some parts of the reflink code queue deferred refcount (CUI) and bmap
      (BUI) updates in the same head transaction, but the deferred work
      manager completely finishes the CUI before the BUI work is started.  As
      a result, the CUI drops the intent count long before the deferred rmap
      (RUI) update even has a chance to bump the intent count.  The only way
      to keep the intent count elevated between the CUI and RUI is for the BUI
      to bump the counter until the RUI has been created.
      
      A second quirk of the intent drain code is that deferred work items must
      increment the intent counter as soon as the work item is added to the
      transaction.  When a BUI completes and queues an RUI, the RUI must
      increment the counter before the BUI decrements it.  The only way to
      accomplish this is to require that the counter be bumped as soon as the
      deferred work item is created in memory.
      
      In the next patches we'll improve on this facility, but this patch
      provides the basic functionality.
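
The ordering rule for the intent counter can be sketched in user-space C. This is a single-threaded illustration, not the kernel drain code: `struct perag_sketch`, `intent_bump`, `intent_drop`, `intents_busy`, and `finish_bui_queue_rui` are hypothetical names. The point it shows is that the follow-on item bumps the counter before the finished item drops it, so the count never touches zero in the middle of a chain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct perag_sketch {
	atomic_int intent_count;	/* intents pending against this AG */
};

/* Called as soon as a deferred work item is created in memory. */
static void intent_bump(struct perag_sketch *pag)
{
	atomic_fetch_add(&pag->intent_count, 1);
}

/* Called when a deferred work item has been finished. */
static void intent_drop(struct perag_sketch *pag)
{
	int old = atomic_fetch_sub(&pag->intent_count, 1);

	assert(old > 0);
}

/* Scrub may only proceed once this returns false. */
static bool intents_busy(struct perag_sketch *pag)
{
	return atomic_load(&pag->intent_count) != 0;
}

/* A BUI that queues an RUI: bump for the new item, then drop the old. */
static void finish_bui_queue_rui(struct perag_sketch *pag)
{
	intent_bump(pag);	/* RUI now exists in memory */
	intent_drop(pag);	/* BUI is complete */
}
```

Reversing the two calls in `finish_bui_queue_rui` would let the count reach zero between the BUI and the RUI, which is the window that scrub must never observe.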
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  8. 19 Mar, 2023 1 commit
    • xfs: test dir/attr hash when loading module · 3cfb9290
      Darrick J. Wong authored
      Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in
      ext4 against generic/454.  The cause of this test failure was the
      unfortunate combination of setting an xattr name containing UTF8 encoded
      emoji, an xattr hash function that accepted a char pointer with no
      explicit signedness, signed type extension of those chars to an int, and
      the 6.2 build tools maintainers deciding to mandate -funsigned-char
      across the board.  As a result, the ondisk extended attribute structure
      written out by 6.1 and 6.2 were not the same.
      
      This discrepancy, in fact, had been noticeable if a filesystem with such
      an xattr were moved between any two architectures that don't employ the
      same signedness of a raw "char" declaration.  The only reason anyone
      noticed is that x86 gcc defaults to signed, and no such -funsigned-char
      update was made to e2fsprogs, so e2fsck immediately started reporting
      data corruption.
      
      After a day and a half of discussing how to handle this use case (xattrs
      with bit 7 set anywhere in the name) without breaking existing users,
      Linus merged his own patch and didn't tell the maintainer.  None of the
      ext4 developers realized this until AUTOSEL announced that the commit
      had been backported to stable.
      
      In the end, this problem could have been detected much earlier if there
      had been any useful tests of hash function(s) in use inside ext4 to make
      sure that they always produce the same outputs given the same inputs.
      
      The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's
      vulnerable to this problem.  However, let's avoid all this drama by
      adding our own self test to check that the da hash produces the same
      outputs for a static pile of inputs on various platforms.  This enables
      us to fix any breakage that may result in a controlled fashion.  The
      buffer and test data are identical to the patches submitted to xfsprogs.
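
The signedness hazard and the shape of such a self test can be sketched in user-space C. The hash below is a toy rotate-and-xor function invented for this illustration; it is not the XFS da hash or the ext4 hash, and `toy_hash_unsigned`, `toy_hash_signed`, `toy_hash_selftest`, and the test vectors are hypothetical. The two variants agree on ASCII names and diverge as soon as a byte has bit 7 set, because the signed variant sign-extends that byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned int n)
{
	assert(n > 0 && n < 32);
	return (x << n) | (x >> (32 - n));
}

/* Bytes are treated as 0..255 on every platform. */
static uint32_t toy_hash_unsigned(const uint8_t *name, size_t len)
{
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = rol32(hash, 7) ^ (uint32_t)name[i];
	return hash;
}

/* Bytes >= 0x80 are sign-extended, as with a signed plain char. */
static uint32_t toy_hash_signed(const signed char *name, size_t len)
{
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = rol32(hash, 7) ^ (uint32_t)(int32_t)name[i];
	return hash;
}

static const uint8_t toy_ascii[] = { 'a', 'b' };
static const uint8_t toy_emoji[] = { 0xF0, 0x9F, 0x98, 0x80 }; /* U+1F600 */

struct toy_vector {
	const uint8_t *name;
	size_t len;
	uint32_t expected;
};

/* Returns 0 if every static input still hashes to its recorded output. */
static int toy_hash_selftest(void)
{
	static const struct toy_vector vectors[] = {
		{ toy_ascii, sizeof(toy_ascii), 0x000030E2 },
		{ toy_emoji, sizeof(toy_emoji), 0x1E278C80 },
	};

	for (size_t i = 0; i < sizeof(vectors) / sizeof(vectors[0]); i++) {
		if (toy_hash_unsigned(vectors[i].name, vectors[i].len) !=
		    vectors[i].expected)
			return -1;
	}
	return 0;
}
```

Running a check of this kind at module load means a change in compiler defaults shows up as a self-test failure rather than as a silently different on-disk structure.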
      
      Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/
      Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  9. 18 Jul, 2022 1 commit
    • xfs: implement ->notify_failure() for XFS · 6f643c57
      Shiyang Ruan authored
      Introduce xfs_notify_failure.c to handle failure-related work, such as
      implementing ->notify_failure() and registering/unregistering the dax
      holder in xfs.
      
      If the rmap feature of XFS is enabled, we can query it to find files and
      metadata which are associated with the corrupt data.  For now all we do is
      kill processes with that file mapped into their address spaces, but future
      patches could actually do something about corrupt metadata.
      
      After that, the memory failure code needs to notify the processes that
      are using those files.
      
      Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.wiliams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
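
The reverse-mapping query mentioned in this commit message, going from a failed storage range back to its owners, can be sketched as a simple overlap search. This is a toy model under stated assumptions: `struct toy_rmap_rec` and `toy_rmap_query` are hypothetical names, and a flat array stands in for the real rmap btree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-mapping record: a storage extent and its owner. */
struct toy_rmap_rec {
	uint64_t start;		/* first block of the extent */
	uint64_t len;		/* number of blocks in the extent */
	uint64_t owner;		/* e.g. an inode number */
};

/*
 * Collect the owners of every record overlapping [start, start + len).
 * Stores at most cap owners and returns how many were stored.
 */
static size_t toy_rmap_query(const struct toy_rmap_rec *recs, size_t nrecs,
			     uint64_t start, uint64_t len,
			     uint64_t *owners, size_t cap)
{
	uint64_t end = start + len;
	size_t found = 0;

	for (size_t i = 0; i < nrecs; i++) {
		uint64_t rec_end = recs[i].start + recs[i].len;

		/* Two half-open ranges overlap when each starts before the other ends. */
		if (recs[i].start < end && start < rec_end && found < cap)
			owners[found++] = recs[i].owner;
	}
	return found;
}
```

A failure handler in this model would run the query on the failed range and then act on each returned owner, for example by signalling the processes that have that file mapped.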
  10. 14 Jul, 2022 1 commit
      xfs: add in-memory iunlink log item · 784eb7d8
      Dave Chinner authored
      Now that we have a clean operation to update the di_next_unlinked
      field of inode cluster buffers, we can easily defer this operation
      to transaction commit time so we can order the inode cluster buffer
      locking consistently.
      
      To do this, we introduce a new in-memory log item to track the
      unlinked list item modification that we are going to make. This
      follows the same observations as the in-memory double linked list
      used to track unlinked inodes in that the inodes on the list are
      pinned in memory and cannot go away, and hence we can simply
      reference them for the duration of the transaction without needing
      to take active references or pin them or look them up.
      
      This allows us to pass the xfs_inode to the transaction commit code
      along with the modification to be made, and then order the logged
      modifications via the ->iop_sort and ->iop_precommit operations
      for the new log item type. As this is an in-memory log item, it
      doesn't have formatting, CIL or AIL operational hooks - it exists
      purely to run the inode unlink modifications and is then removed
      from the transaction item list and freed once the precommit
      operation has run.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
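
The sort-then-precommit ordering described in this commit message can be sketched as follows. This is a simplified model, not the kernel code: `struct toy_item`, `struct toy_item_ops`, and `toy_run_precommit` are hypothetical names, and the items are plain structs rather than real log items.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_item;

/* Hypothetical per-item operations, loosely modelled on sort/precommit hooks. */
struct toy_item_ops {
	uint64_t (*sort)(const struct toy_item *item);	/* ordering key */
	int (*precommit)(struct toy_item *item);	/* nonzero: drop item */
};

struct toy_item {
	const struct toy_item_ops *ops;
	uint64_t key;		/* e.g. a cluster buffer address */
	int ran_at;		/* position at which precommit ran */
};

static uint64_t toy_unlink_sort(const struct toy_item *item)
{
	return item->key;
}

/* In-memory-only item: does its work, then asks to be removed. */
static int toy_unlink_precommit(struct toy_item *item)
{
	(void)item;
	return 1;
}

static const struct toy_item_ops toy_unlink_ops = {
	.sort = toy_unlink_sort,
	.precommit = toy_unlink_precommit,
};

static int toy_cmp(const void *a, const void *b)
{
	const struct toy_item *ia = *(const struct toy_item *const *)a;
	const struct toy_item *ib = *(const struct toy_item *const *)b;
	uint64_t ka = ia->ops->sort(ia), kb = ib->ops->sort(ib);

	return (ka > kb) - (ka < kb);
}

/*
 * Sort the item list by key, run each precommit hook in that order, and
 * compact away items that asked to be dropped.  Returns the number kept.
 */
static size_t toy_run_precommit(struct toy_item **items, size_t n)
{
	size_t kept = 0;
	int seq = 0;

	qsort(items, n, sizeof(items[0]), toy_cmp);
	for (size_t i = 0; i < n; i++) {
		struct toy_item *item = items[i];

		item->ran_at = seq++;
		if (!item->ops->precommit(item))
			items[kept++] = item;
	}
	return kept;
}
```

Sorting before running the hooks is what gives a consistent locking order in this model; the sketch only drops the items from the list, whereas the commit message says the real item is also freed at that point.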
  11. 04 May, 2022 1 commit
      xfs: Set up infrastructure for log attribute replay · fd920008
      Allison Henderson authored
      Currently attributes are modified directly across one or more
      transactions. But they are not logged or replayed in the event of an
      error. The goal of log attr replay is to enable logging and replaying
      of attribute operations using the existing delayed operations
      infrastructure.  This will later enable the attributes to become part of
      larger multi-part operations that also must first be recorded to the
      log.  This is mostly of interest in the scheme of parent pointers which
      would need to maintain an attribute containing parent inode information
      any time an inode is moved, created, or removed.  Parent pointers would
      then be of interest to any feature that would need to quickly derive an
      inode path from the mount point. Online scrub, nfs lookups and fs grow
      or shrink operations are all features that could take advantage of this.
      
      This patch adds two new log item types for setting or removing
      attributes as deferred operations.  The xfs_attri_log_item will log an
      intent to set or remove an attribute.  The corresponding
      xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
      freed once the transaction is done.  Both log items use a generic
      xfs_attr_log_format structure that contains the attribute name, value,
      flags, inode, and an op_flag that indicates whether the operation is a
      set or remove.
      
      [dchinner: added extra little bits needed for intent whiteouts]
      Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
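
The intent/done pairing described in this commit message can be sketched as a scan over a toy log that finds intents with no matching done record, which are the operations a replay would need to finish. The names `struct toy_rec`, `TOY_ATTRI`, `TOY_ATTRD`, and `toy_pending_intents` are hypothetical, and matching records by a shared `id` field is an assumption of this sketch rather than something the commit message states.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_rec_type { TOY_ATTRI, TOY_ATTRD };	/* intent, done */
enum toy_op { TOY_OP_SET, TOY_OP_REMOVE };

/* Hypothetical log record: either an attr intent or its matching done. */
struct toy_rec {
	enum toy_rec_type type;
	uint64_t id;		/* assumed: ties a done record to its intent */
	enum toy_op op;		/* set or remove */
	const char *name;	/* attribute name */
};

/*
 * Find intents that have no later done record with the same id.
 * Stores up to cap indices into pending[] and returns how many were stored.
 */
static size_t toy_pending_intents(const struct toy_rec *recs, size_t n,
				  size_t *pending, size_t cap)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		int done = 0;

		if (recs[i].type != TOY_ATTRI)
			continue;
		for (size_t j = i + 1; j < n; j++) {
			if (recs[j].type == TOY_ATTRD &&
			    recs[j].id == recs[i].id) {
				done = 1;
				break;
			}
		}
		if (!done && found < cap)
			pending[found++] = i;
	}
	return found;
}
```

Each index returned points at an intent whose set or remove would have to be carried out again during recovery in this model.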
  12. 08 May, 2020 1 commit
  13. 04 May, 2020 1 commit
  14. 18 Mar, 2020 1 commit
  15. 11 Nov, 2019 1 commit