1. 10 Aug, 2023 32 commits
    • xfs: hide xfs_inode_is_allocated in scrub common code · 0d296634
      Darrick J. Wong authored
      This function is only used by online fsck, so let's move it there.
      In the next patch, we'll fix it to work properly and to require that the
      caller hold the AGI buffer locked.  No major changes aside from
      adjusting the signature a bit.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix agf_fllast when repairing an empty AGFL · a634c0a6
      Darrick J. Wong authored
      xfs/139 with parent pointers enabled occasionally pops up a corruption
      message when online fsck force-rebuild repairs an AGFL:
      
       XFS (sde): Metadata corruption detected at xfs_agf_verify+0x11e/0x220 [xfs], xfs_agf block 0x9e0001
       XFS (sde): Unmount and run xfs_repair
       XFS (sde): First 128 bytes of corrupted metadata buffer:
       00000000: 58 41 47 46 00 00 00 01 00 00 00 4f 00 00 40 00  XAGF.......O..@.
       00000010: 00 00 00 01 00 00 00 02 00 00 00 05 00 00 00 01  ................
       00000020: 00 00 00 01 00 00 00 01 00 00 00 00 ff ff ff ff  ................
       00000030: 00 00 00 00 00 00 00 05 00 00 00 05 00 00 00 00  ................
       00000040: 91 2e 6f b1 ed 61 4b 4d 8c 9b 6e 87 08 bb f6 36  ..o..aKM..n....6
       00000050: 00 00 00 01 00 00 00 01 00 00 00 06 00 00 00 01  ................
       00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      
      The root cause of this failure is that prior to the repair, there were
      zero blocks in the AGFL.  This scenario is set up by the test case, since
      it formats with 64MB AGs and tries to ENOSPC the whole filesystem.  In
      this case of flcount==0, we reset fllast to -1U, which then trips the
      write verifier's check that fllast is less than xfs_agfl_size().
      
      Correct this code to set fllast to the last possible slot in the AGFL
      when flcount is zero, which mirrors the behavior of xfs_repair phase5
      when it has to create a totally empty AGFL.
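
The arithmetic behind this failure can be modelled in a few lines. The following is a userspace sketch in Python, not the kernel code: the function names are hypothetical, the AGFL size of 118 slots is an arbitrary illustrative value, and it assumes the rebuilt AGFL starts at slot 0 so that fllast is normally flcount - 1.

```python
U32_MASK = 0xFFFFFFFF  # fllast is stored in a 32-bit unsigned field


def fllast_before_fix(flcount: int) -> int:
    """Old behaviour (modelled): flcount - 1 in 32-bit unsigned arithmetic."""
    return (flcount - 1) & U32_MASK


def fllast_after_fix(flcount: int, agfl_size: int) -> int:
    """New behaviour (modelled): an empty AGFL points at the last slot."""
    if flcount == 0:
        return agfl_size - 1
    return flcount - 1


def verifier_accepts(fllast: int, agfl_size: int) -> bool:
    """The write verifier check named in the commit: fllast < AGFL size."""
    return fllast < agfl_size
```

With flcount == 0 the old computation wraps around to 0xffffffff, which matches the `ff ff ff ff` word visible in the hex dump above, and the verifier model rejects it.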
      
      Fixes: 0e93d3f4 ("xfs: repair the AGFL")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: clear pagf_agflreset when repairing the AGFL · 9ce7f9b2
      Darrick J. Wong authored
      Clear the pagf_agflreset flag when we're repairing the AGFL because we
      fix all the same padding problems that xfs_agfl_reset does.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: allow userspace to rebuild metadata structures · 5c83df2e
      Darrick J. Wong authored
      Add a new (superuser-only) flag to the online metadata repair ioctl to
      force it to rebuild structures, even if they're not broken.  We will use
      this to move metadata structures out of the way during a free space
      defragmentation operation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't complain about unfixed metadata when repairs were injected · 8336a64e
      Darrick J. Wong authored
      While debugging other parts of online repair, I noticed that if someone
      injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of
      metadata, and the metadata repair fails, we'll log a message about
      uncorrected errors in the filesystem.
      
      This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT
      and we're only doing the repair because the error injection knob is set.
      Repair functions are allowed to abort the entire operation at any point
      before committing new metadata, in which case the piece of metadata is
      in the same state as it was before.  Therefore, the log message should
      be gated on the results of the scrub.  Refactor the predicate and
      rearrange the code flow to make this happen.
      
      Note: If the repair function errors out after it commits the new
      metadata, the transaction cancellation will shut down the filesystem,
      which is an obvious sign of corrupt metadata.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: allow the user to cancel repairs before we start writing · d728f4e3
      Darrick J. Wong authored
      All online repair functions have the same structure: walk filesystem
      metadata structures gathering enough data to rebuild the structure,
      stage a new copy, and then commit the new copy.
      
      The gathering steps do not write anything to disk, so they are peppered
      with xchk_should_terminate calls to avoid softlockup warnings and to
      provide an opportunity to abort the repair (by killing xfs_scrub).
      However, the code base does not make clear where the last chance to
      abort cleanly, without having to undo a bunch of structure, lies.
      
      Therefore, add one more call to xchk_should_terminate (along with a
      comment) providing the sysadmin with the ability to abort before it's
      too late and to make it clear in the source code when it's no longer
      convenient or safe to abort a repair.   As there are only four repair
      functions right now, this patch exists more to establish a precedent for
      subsequent additions than to deliver practical functionality.
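
The gather, stage, and commit structure described above, with one final termination check before the commit, might be modelled roughly as follows. This is a Python sketch of the control flow only: `run_repair`, `RepairAborted`, and the callback parameters are hypothetical names, and `sorted()` merely stands in for staging a new copy.

```python
class RepairAborted(Exception):
    """Raised when the caller asks to stop before anything is written."""


def run_repair(records_source, should_terminate, commit):
    # Gathering: read-only, so aborting here is always clean.
    gathered = []
    for rec in records_source:
        if should_terminate():
            raise RepairAborted("aborted while gathering")
        gathered.append(rec)

    # Staging: build the new copy without touching the live structure.
    new_copy = sorted(gathered)

    # Last chance to abort cleanly; past this point the repair writes.
    if should_terminate():
        raise RepairAborted("aborted before commit")

    commit(new_copy)
    return new_copy
```

The point of the sketch is the placement of the final check: it marks in the source where aborting stops being free.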
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: always rescan allegedly healthy per-ag metadata after repair · d65eb8a6
      Darrick J. Wong authored
      After an online repair function runs for a per-AG metadata structure,
      sc->sick_mask is supposed to reflect the per-AG metadata that the repair
      function fixed.  Our next move is to re-check the metadata to assess
      the completeness of our repair, so we don't want the rebuilt structure
      to be excluded from the rescan just because the health system previously
      logged a problem with the data structure.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: implement online scrubbing of rtsummary info · 526aab5f
      Darrick J. Wong authored
      Finish the realtime summary scrubber by adding the functions we need to
      compute a fresh copy of the rtsummary info and comparing it to the copy
      on disk.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: move the realtime summary file scrubber to a separate source file · b7d47a77
      Darrick J. Wong authored
      Move the realtime summary file checking code to a separate file in
      preparation to actually implement it.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: wrap ilock/iunlock operations on sc->ip · 294012fb
      Darrick J. Wong authored
      Scrub tracks the resources that it's holding onto in the xfs_scrub
      structure.  This includes the inode being checked (if applicable) and
      the inode lock state of that inode.  Replace the open-coded structure
      manipulation with a trivial helper to eliminate sources of error.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: get our own reference to inodes that we want to scrub · 17308539
      Darrick J. Wong authored
      When we want to scrub a file, get our own reference to the inode
      unconditionally.  This will make disposal rules simpler in the long run.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: track usage statistics of online fsck · d7a74cad
      Darrick J. Wong authored
      Track the usage, outcomes, and run times of the online fsck code, and
      report these values via debugfs.  The columns in the file are:
      
       * scrubber name
      
       * number of scrub invocations
       * clean objects found
       * corruptions found
       * optimizations found
       * cross referencing failures
       * inconsistencies found during cross referencing
       * incomplete scrubs
       * warnings
       * number of times scrub had to retry
       * cumulative amount of time spent scrubbing (microseconds)

       * number of repair invocations
       * successfully repaired objects
       * cumulative amount of time spent repairing (microseconds)
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: create scaffolding for creating debugfs entries · a76dba3b
      Darrick J. Wong authored
      Set up debugfs directories for xfs as a whole, and a subdirectory for
      each mounted filesystem.  This will enable the creation of debugfs files
      in the next patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: improve xfarray quicksort pivot · 764018ca
      Darrick J. Wong authored
      Now that we have the means to do insertion sorts of small in-memory
      subsets of an xfarray, use it to improve the quicksort pivot algorithm
      by reading 7 records into memory and finding the median of that.  This
      should prevent bad partitioning when a[lo] and a[hi] end up next to each
      other in the final sort, which can happen when sorting for cntbt repair
      when the free space is extremely fragmented (e.g. generic/176).
      
      This doesn't speed up the average quicksort run by much, but it will
      (hopefully) avoid the quadratic time collapse for which quicksort is
      famous.
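
A pivot chosen as the median of a handful of sampled records can be sketched as below. This is a Python model, not the xfarray code: `choose_pivot_index` and `insertion_sort` are hypothetical names, the even spacing of the samples across the subset is an assumption, and the default of seven samples simply follows the figure quoted in this message.

```python
def insertion_sort(items):
    """Sort a small list in place; cheap for a handful of samples."""
    for i in range(1, len(items)):
        cur = items[i]
        j = i - 1
        while j >= 0 and items[j] > cur:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = cur
    return items


def choose_pivot_index(a, lo, hi, nr_samples=7):
    """Return the index of the median of samples taken from a[lo..hi]."""
    span = hi - lo
    if span + 1 <= nr_samples:
        idxs = list(range(lo, hi + 1))
    else:
        # Evenly spaced sample positions, always including lo and hi.
        idxs = [lo + (span * i) // (nr_samples - 1) for i in range(nr_samples)]
    samples = insertion_sort([(a[i], i) for i in idxs])
    return samples[len(samples) // 2][1]
```

Because the samples span the whole subset, already-sorted or reverse-sorted input still yields a pivot near the middle, which is the case that defeats simpler pivot choices.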
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: cache pages used for xfarray quicksort convergence · cf36f4f6
      Darrick J. Wong authored
      After quicksort picks a pivot item for a particular subsort, it walks
      the records in that subset from the outside in, rearranging them so that
      every record less than the pivot comes before it, and every record
      greater than the pivot comes after it.  This scan has a lot of locality,
      so we can speed it up quite a bit by grabbing the xfile backing page and
      holding onto it as long as we possibly can.  Doing so reduces the
      runtime by another 5% on the author's computer.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: speed up xfarray sort by sorting xfile page contents directly · e5b46c75
      Darrick J. Wong authored
      If all the records in an xfarray subset live within the same memory
      page, we can short-circuit even more quicksort recursion by mapping that
      page into the local CPU and using the kernel's heapsort function to sort
      the subset.  On the author's computer, this reduces the runtime by
      another 15% on a 500,000 element array.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: teach xfile to pass back direct-map pages to caller · 137db333
      Darrick J. Wong authored
      Certain xfile array operations (such as sorting) can be sped up quite a
      bit by allowing xfile users to grab a page to bulk-read the records
      contained within it.  Create helper methods to facilitate this.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: convert xfarray insertion sort to heapsort using scratchpad memory · c390c645
      Darrick J. Wong authored
      In the previous patch, we created a very basic quicksort implementation
      for xfile arrays.  While the use of an alternate sorting algorithm to
      avoid quicksort recursion on very small subsets reduces the runtime
      modestly, we could do better than a load and store-heavy insertion sort,
      particularly since each load and store requires a page mapping lookup in
      the xfile.
      
      For a small increase in kernel memory requirements, we could instead
      bulk load the xfarray records into memory, use the kernel's existing
      heapsort implementation to sort the records, and bulk store the memory
      buffer back into the xfile.  On the author's computer, this reduces the
      runtime by about 5% on a 500,000 element array.
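
The bulk-load, sort, bulk-store idea can be illustrated with a small hybrid sort. The sketch below is a Python model under stated assumptions: a plain list stands in for the xfile, `heapq` stands in for the kernel's heapsort, and the `small` threshold of 16 records is an arbitrary illustrative value.

```python
import heapq


def heapsort(buf):
    """Return a sorted copy of buf using a binary heap."""
    heap = list(buf)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def hybrid_sort(store, lo=0, hi=None, small=16):
    """Quicksort store[lo..hi], sorting small subsets in a memory buffer."""
    if hi is None:
        hi = len(store) - 1
    while lo < hi:
        if hi - lo + 1 <= small:
            buf = store[lo:hi + 1]            # bulk load into scratch memory
            store[lo:hi + 1] = heapsort(buf)  # sort, then bulk store back
            return
        pivot = store[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                         # scan from the outside in
            while store[i] < pivot:
                i += 1
            while store[j] > pivot:
                j -= 1
            if i <= j:
                store[i], store[j] = store[j], store[i]
                i += 1
                j -= 1
        # Recurse into the smaller half and loop on the larger one.
        if j - lo < hi - i:
            hybrid_sort(store, lo, j, small)
            lo = i
        else:
            hybrid_sort(store, i, hi, small)
            hi = j
```

In this model each small subset costs one slice read and one slice write instead of a load and store per comparison, which is the saving the message describes.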
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: enable sorting of xfile-backed arrays · 232ea052
      Darrick J. Wong authored
      The btree bulk loading code requires that records be provided in the
      correct record sort order for the given btree type.  In general, repair
      code cannot be required to collect records in order, and it is not
      feasible to insert new records in the middle of an array to maintain
      sort order.
      
      Implement a sorting algorithm so that we can sort the records just prior
      to bulk loading.  In principle, an xfarray could consume many gigabytes
      of memory and its backing pages can be sent out to disk at any time.
      This means that we cannot map the entire array into memory at once, so
      we must find a way to divide the work into smaller portions (e.g. a
      page) that /can/ be mapped into memory.
      
      Quicksort seems like a reasonable fit for this purpose, since it uses a
      divide and conquer strategy to keep its average runtime at O(n log n).
      The solution presented here is a port of the glibc implementation, which
      itself is derived from the median-of-three and tail call recursion
      strategies outlined by Sedgewick.
      
      Subsequent patches will optimize the implementation further by utilizing
      the kernel's heapsort on directly-mapped memory whenever possible, and
      improving the quicksort pivot selection algorithm to try to avoid O(n^2)
      collapses.
      
      Note: The sorting functionality gets its own patch because the basic big
      array mechanisms were plenty for a single code patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: create a big array data structure · 3934e8eb
      Darrick J. Wong authored
      Create a simple 'big array' data structure for storage of fixed-size
      metadata records that will be used to reconstruct a btree index.  For
      repair operations, the most important operations are append, iterate,
      and sort.
      
      Earlier implementations of the big array used linked lists and suffered
      from severe problems -- pinning all records in kernel memory was not a
      good idea and frequently led to OOM situations; random access was very
      inefficient; and record overhead for the lists was unacceptably high at
      40-60%.
      
      Therefore, the big memory array relies on the 'xfile' abstraction, which
      creates a memfd file and stores the records in page cache pages.  Since
      the memfd is created in tmpfs, the memory pages can be pushed out to
      disk if necessary and we have a built-in usage limit of 50% of physical
      memory.
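
A file-backed array of fixed-size records supporting append and iterate can be sketched in userspace as follows. This is a Python model only: a temporary file stands in for the memfd-backed xfile, and the `BigArray` class, its method names, and the two-field record layout in the usage note are hypothetical.

```python
import struct
import tempfile


class BigArray:
    """Fixed-size records stored in a file rather than pinned in memory."""

    def __init__(self, fmt):
        self._rec = struct.Struct(fmt)         # record layout, e.g. "<II"
        self._file = tempfile.TemporaryFile()  # stand-in for the xfile
        self._count = 0

    def __len__(self):
        return self._count

    def append(self, *fields):
        """Store one record at the end of the array."""
        self._file.seek(self._count * self._rec.size)
        self._file.write(self._rec.pack(*fields))
        self._count += 1

    def load(self, idx):
        """Random access: the offset is just idx * record size."""
        if not 0 <= idx < self._count:
            raise IndexError(idx)
        self._file.seek(idx * self._rec.size)
        return self._rec.unpack(self._file.read(self._rec.size))

    def __iter__(self):
        for i in range(self._count):
            yield self.load(i)
```

For example, `BigArray("<II")` could hold (start block, length) pairs. Because records are fixed-size, there is no per-record list overhead and random access is a single seek.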
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair · 014ad537
      Darrick J. Wong authored
      The AGFL repair code uses a series of bitmaps to figure out where there
      are OWN_AG blocks that are not claimed by the free space and rmap
      btrees.  These blocks become the new AGFL, and any overflow is reaped.
      The bitmaps currently track xfs_fsblock_t even though we already know the
      AG number.
      
      In the last patch, we introduced a new bitmap "type" for tracking
      xfs_agblock_t extents.  Port the reaping code and the AGFL repair to use
      this new type, which makes it very obvious what we're tracking.  This
      also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: reap large AG metadata extents when possible · 1c7ce115
      Darrick J. Wong authored
      When we're freeing extents that have been set in a bitmap, break the
      bitmap extent into multiple sub-extents organized by fate, and reap the
      extents.  This enables us to dispose of old resources more efficiently
      than doing them block by block.
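
Splitting one bitmap extent into sub-extents by fate amounts to grouping consecutive blocks that share a disposition. The sketch below is a Python illustration only: `split_by_fate` is a hypothetical name, and the "free" and "unmap" fate labels are illustrative stand-ins for whatever the reaping code decides per block.

```python
from itertools import groupby


def split_by_fate(start, fates):
    """Turn per-block fates into (first_block, length, fate) sub-extents.

    start is the first block of the extent; fates[i] is the fate decided
    for block start + i.
    """
    extents = []
    pos = start
    for fate, run in groupby(fates):
        length = len(list(run))
        extents.append((pos, length, fate))
        pos += length
    return extents
```

For an extent starting at block 100 with fates free, free, unmap, free, this yields three sub-extents, so two adjacent blocks are disposed of in one step instead of two.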
      
      While we're at it, rename the reaping functions to make it clear that
      they're reaping per-AG extents.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: allow scanning ranges of the buffer cache for live buffers · 9ed851f6
      Darrick J. Wong authored
      After an online repair, we need to invalidate buffers representing the
      blocks from the old metadata that we're replacing.  It's possible that
      parts of a tree that were previously cached in memory are no longer
      accessible due to media failure or other corruption on interior nodes,
      so repair figures out the old blocks from the reverse mapping data and
      scans the buffer cache directly.
      
      In other words, online fsck needs to find all the live (i.e. non-stale)
      buffers for a range of fsblocks so that it can invalidate them.
      
      Unfortunately, the current buffer cache code triggers asserts if the
      rhashtable lookup finds a non-stale buffer of a different length than
      the key we searched for.  For regular operation this is desirable, but
      for this repair procedure, we don't care since we're going to forcibly
      stale the buffer anyway.  Add an internal lookup flag to avoid the
      assert.  Skip buffers that are already XBF_STALE.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: rearrange xrep_reap_block to make future code flow easier · 77a1396f
      Darrick J. Wong authored
      Rearrange the logic inside xrep_reap_block to make it more obvious that
      crosslinked metadata blocks are handled differently.  Add a couple of
      tracepoints so that we can tell what's going on at the end of a btree
      rebuild operation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: use deferred frees to reap old btree blocks · 5fee784e
      Darrick J. Wong authored
      Use deferred frees (EFIs) to reap the blocks of a btree that we just
      replaced.  This helps us to shrink the window in which those old blocks
      could be lost due to a system crash, though we try to flush the EFIs
      every few hundred blocks so that we don't also overflow the transaction
      reservations during and after we commit the new btree.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: only allow reaping of per-AG blocks in xrep_reap_extents · a55e0730
      Darrick J. Wong authored
      Now that we've refactored btree cursors to require the caller to pass in
      a perag structure, there are numerous problems in xrep_reap_extents if
      it's being called to reap extents for an inode metadata repair.  We
      don't have any repair functions that can do that, so drop the support
      for now.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: only invalidate blocks if we're going to free them · 8e54e06b
      Darrick J. Wong authored
      When we're discarding old btree blocks after a repair, only invalidate
      the buffers for the ones that we're freeing -- if the metadata was
      crosslinked with another data structure, we don't want to touch it.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: move the post-repair block reaping code to a separate file · e06ef14b
      Darrick J. Wong authored
      Reaping blocks after a repair is a complicated affair involving a lot of
      rmap btree lookups and figuring out if we're going to unmap or free old
      metadata blocks that might be crosslinked.  Eventually, we will need to
      be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
      garbage CoW staging extents, and (even later) blocks from btrees rooted
      in inodes.  This results in a lot of reaping code, so we might as well
      split that off while it's easy.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: cull repair code that will never get used · 86a46417
      Darrick J. Wong authored
      These two functions date from the era when I thought that we could
      rebuild btrees by creating an alternate root and adding records one by
      one.  In other words, they predate the btree bulk loader.  They're not
      necessary now, so remove them.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • MAINTAINERS: add Chandan Babu as XFS release manager · d6532904
      Darrick J. Wong authored
      I nominate Chandan Babu to take over release management for the upstream
      kernel's XFS code.  He has had sufficient experience merging backports
      to the 5.4 LTS tree, testing them, and sending them on to the LTS leads.
      
      NOTE: I am /not/ nominating Chandan to take on any of the other roles I
      have just dropped.  Bug triager, testing lead, and community manager are
      open positions that need to be filled.  There's also maintainer for
      supported LTS releases (4.14, 4.19, 5.10...).
      
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Acked-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Carlos Maiolino <cem@kernel.org>
    • MAINTAINERS: drop me as XFS maintainer · d554046e
      Darrick J. Wong authored
      I burned out years ago trying to juggle the roles of senior developer,
      reviewer, tester, triager (crappily), release manager, and (at times)
      manager liaison.  There's enough work here in this one subsystem for a
      team of 20 FT, but instead we're squeezed to half that.  I thought if I
      could hold on just a bit longer I could help to maintain the focus on
      long term development to improve the experience for users.  I was wrong.
      
      Nowadays, people working on XFS seem to spend most of their time on
      distro kernel backports and dealing with AI-generated corner case bug
      reports that aren't user reports.  Reviewing has become a nightmare of
      sifting through under-documented kernel code trying to decide if this
      new feature won't break all the other features.  Getting reviews is an
      unpleasant process of negotiating with demands for further cleanups,
      trying to figure out if a review comment is based in experience or
      unfamiliarity, and wondering if the silence means anything.
      
      For now, I will continue to review patches and will try to get online
      fsck, parent pointers, and realtime volume modernisation merged.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • docs: add maintainer entry profile for XFS · 19e13b0a
      Darrick J. Wong authored
      Create a new document to list what I think are (within the scope of XFS)
      our shared goals and community roles.  Since I will be stepping down
      shortly, I feel it's important to write down somewhere all the hats that
      I have been wearing for the past six years.
      
      Also, document important extra details about how to contribute to XFS.
      
      Cc: corbet@lwn.net
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
  2. 06 Aug, 2023 8 commits
    • Linux 6.5-rc5 · 52a93d39
      Linus Torvalds authored
    • Merge tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 0108963f
      Linus Torvalds authored
      Pull vfs fixes from Christian Brauner:
      
       - Fix a wrong check for O_TMPFILE during RESOLVE_CACHED lookup
      
       - Clean up directory iterators and clarify file_needs_f_pos_lock()
      
      * tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: rely on ->iterate_shared to determine f_pos locking
        vfs: get rid of old '->iterate' directory operation
        proc: fix missing conversion to 'iterate_shared'
        open: make RESOLVE_CACHED correctly test for O_TMPFILE
    • fs: rely on ->iterate_shared to determine f_pos locking · 7d84d1b9
      Christian Brauner authored
      Now that we removed ->iterate we don't need to check for either
      ->iterate or ->iterate_shared in file_needs_f_pos_lock(). Simply check
      for ->iterate_shared instead. This will tell us whether we need to
      unconditionally take the lock. Not only does this let us avoid
      checking f_inode's mode, it also clearly shows that we're
      locking because of readdir.
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • vfs: get rid of old '->iterate' directory operation · 3e327154
      Linus Torvalds authored
      All users now just use '->iterate_shared()', which only takes the
      directory inode lock for reading.
      
      Filesystems that never got converted to shared mode now instead use a
      wrapper that drops the lock, re-takes it in write mode, calls the old
      function, and then downgrades the lock back to read mode.
      
      This way the VFS layer and other callers no longer need to care about
      filesystems that never got converted to the modern era.
      
      The filesystems that use the new wrapper are ceph, coda, exfat, jfs,
      ntfs, ocfs2, overlayfs, and vboxsf.
      
      Honestly, several of them look like they really could just iterate their
      directories in shared mode and skip the wrapper entirely, but the point
      of this change is to not change semantics or fix filesystems that
      haven't been fixed in the last 7+ years, but to finally get rid of the
      dual iterators.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • proc: fix missing conversion to 'iterate_shared' · 0a2c2baa
      Linus Torvalds authored
      I'm looking at the directory handling due to the discussion about f_pos
      locking (see commit 79796425: "file: reinstate f_pos locking
      optimization for regular files"), and wanting to clean that up.
      
      And one source of ugliness is how we were supposed to move filesystems
      over to the '->iterate_shared()' function that only takes the inode lock
      for reading many many years ago, but several filesystems still use the
      bad old '->iterate()' that takes the inode lock for exclusive access.
      
      See commit 61922694 ("introduce a parallel variant of ->iterate()")
      that also added some documentation stating
      
            Old method is only used if the new one is absent; eventually it will
            be removed.  Switch while you still can; the old one won't stay.
      
      and that was back in April 2016.  Here we are, many years later, and the
      old version is still clearly sadly alive and well.
      
      Now, some of those old style iterators are probably just because the
      filesystem may end up having per-inode mutable data that it uses for
      iterating a directory, but at least one case is just a mistake.
      
      Al switched over most filesystems to use '->iterate_shared()' back when
      it was introduced.  In particular, the /proc filesystem was converted as
      one of the first ones in commit f50752ea ("switch all procfs
      directories ->iterate_shared()").
      
      But later, one new user of '->iterate()' was re-introduced by
      commit 6d9c939d ("procfs: add smack subdir to attrs").
      
      And that's clearly not what we wanted, since that new case just uses the
      same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions
      that other /proc pident directories use, and they are most definitely
      safe to use with the inode lock held shared.
      
      So just fix it.
      
      This still leaves a fair number of oddball filesystems using the
      old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2,
      overlayfs, and vboxsf), but at least we don't have any remaining in the
      core filesystems.
      
      I'm going to add a wrapper function that just drops the read-lock and
      takes it as a write lock, so that we can clean up the core vfs layer and
      make all the ugly 'this filesystem needs exclusive inode locking' be
      just filesystem-internal warts.
      
      I just didn't want to make that conversion when we still had a core user
      left.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • open: make RESOLVE_CACHED correctly test for O_TMPFILE · a0fc452a
      Aleksa Sarai authored
      O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old
      fast-path check for RESOLVE_CACHED would reject all users passing
      O_DIRECTORY with -EAGAIN, when in fact the intended test was to check
      for __O_TMPFILE.
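
The flag overlap can be reproduced with plain bit arithmetic. The sketch below is a Python model of the two tests, not the kernel source: the octal values are those of the generic Linux headers (some architectures define O_DIRECTORY differently), `RAW_O_TMPFILE` is a stand-in name for `__O_TMPFILE`, and the two function names are hypothetical.

```python
# Values from the generic Linux fcntl definitions (assumed; arch-specific
# headers may differ).
O_DIRECTORY = 0o200000
RAW_O_TMPFILE = 0o20000000                # the kernel's __O_TMPFILE bit
O_TMPFILE = RAW_O_TMPFILE | O_DIRECTORY   # what userspace actually passes


def old_check_rejects(flags: int) -> bool:
    """Old test (modelled): any overlap with the composite O_TMPFILE."""
    return bool(flags & O_TMPFILE)


def new_check_rejects(flags: int) -> bool:
    """Fixed test (modelled): only the dedicated __O_TMPFILE bit."""
    return bool(flags & RAW_O_TMPFILE)
```

Because the composite mask contains the O_DIRECTORY bit, `old_check_rejects(O_DIRECTORY)` is true even though no temporary file was requested, while the fixed test still catches a real `O_TMPFILE` open.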
      
      Cc: stable@vger.kernel.org # v5.12+
      Fixes: 99668f61 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED")
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • Merge tag 'rust-fixes-6.5-rc5' of https://github.com/Rust-for-Linux/linux · f0ab9f34
      Linus Torvalds authored
      Pull rust fixes from Miguel Ojeda:
      
       - Allocator: prevent mis-aligned allocation
      
       - Types: delete 'ForeignOwnable::borrow_mut'. A sound replacement is
         planned for the merge window
      
       - Build: fix bindgen error with UBSAN_BOUNDS_STRICT
      
      * tag 'rust-fixes-6.5-rc5' of https://github.com/Rust-for-Linux/linux:
        rust: fix bindgen build error with UBSAN_BOUNDS_STRICT
        rust: delete `ForeignOwnable::borrow_mut`
        rust: allocator: Prevent mis-aligned allocation
    • Merge tag 'ata-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata · fb0d9199
      Linus Torvalds authored
      Pull ata fix from Damien Le Moal:
      
       - Prevent the scsi disk driver from issuing a START STOP UNIT command
         for ATA devices during system resume as this causes various issues
         reported by multiple users.
      
      * tag 'ata-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
        ata,scsi: do not issue START STOP UNIT on resume