  1. 11 Aug, 2023 1 commit
  2. 10 Aug, 2023 6 commits
  3. 13 Jun, 2023 1 commit
  4. 12 Jun, 2023 2 commits
  5. 05 Jun, 2023 4 commits
  6. 01 May, 2023 1 commit
  7. 12 Apr, 2023 1 commit
    • xfs: deprecate the ascii-ci feature · 7ba83850
      Darrick J. Wong authored
      This feature is a mess -- the hash function has been broken for the
      entire 15 years of its existence if you create names with extended ascii
      bytes; metadump name obfuscation has silently failed for just as long;
      and the feature clashes horribly with the UTF8 encodings that most
      systems use today.  There is exactly one fstest for this feature.
      
      In other words, this feature is crap.  Let's deprecate it now so we can
      remove it from the codebase in 2030.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7ba83850
  8. 19 Mar, 2023 1 commit
    • xfs: test dir/attr hash when loading module · 3cfb9290
      Darrick J. Wong authored
      Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in
      ext4 against generic/454.  The cause of this test failure was the
      unfortunate combination of setting an xattr name containing UTF8 encoded
      emoji, an xattr hash function that accepted a char pointer with no
      explicit signedness, signed type extension of those chars to an int, and
      the 6.2 build tools maintainers deciding to mandate -funsigned-char
      across the board.  As a result, the ondisk extended attribute structures
      written out by 6.1 and 6.2 were not the same.
      
      This discrepancy, in fact, had been noticeable if a filesystem with such
      an xattr were moved between any two architectures that don't employ the
      same signedness of a raw "char" declaration.  The only reason anyone
      noticed is that x86 gcc defaults to signed, and no such -funsigned-char
      update was made to e2fsprogs, so e2fsck immediately started reporting
      data corruption.
      
      After a day and a half of discussing how to handle this use case (xattrs
      with bit 7 set anywhere in the name) without breaking existing users,
      Linus merged his own patch and didn't tell the maintainer.  None of the
      ext4 developers realized this until AUTOSEL announced that the commit
      had been backported to stable.
      
      In the end, this problem could have been detected much earlier if there
      had been any useful tests of hash function(s) in use inside ext4 to make
      sure that they always produce the same outputs given the same inputs.
      
      The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's
      vulnerable to this problem.  However, let's avoid all this drama by
      adding our own self test to check that the da hash produces the same
      outputs for a static pile of inputs on various platforms.  This enables
      us to fix any breakage that may result in a controlled fashion.  The
      buffer and test data are identical to the patches submitted to xfsprogs.
      
      Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/
      Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      3cfb9290
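      The signedness hazard described in this commit, and the kind of fixed-input self-test it adds, can be illustrated with a small userspace sketch. The rotate-and-xor hash below is a toy invented for this example; it is not the XFS da hash or the ext4 xattr hash, and `toy_hash_signed`, `toy_hash_unsigned`, and `toy_hash_selftest` are hypothetical names. The expected values in the table were computed by hand for the unsigned variant only.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left by s bits (0 < s < 32). */
static uint32_t rol32(uint32_t v, unsigned int s)
{
	return (v << s) | (v >> (32 - s));
}

/*
 * Toy hash, for illustration only.  Each byte is widened as a signed
 * char, so bytes with bit 7 set are sign-extended before mixing.
 */
uint32_t toy_hash_signed(const void *name, size_t len)
{
	const signed char *p = name;
	uint32_t hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = rol32(hash, 7) ^ (uint32_t)(int32_t)p[i];
	return hash;
}

/* Same toy hash, but each byte is widened as an unsigned value. */
uint32_t toy_hash_unsigned(const void *name, size_t len)
{
	const uint8_t *p = name;
	uint32_t hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = rol32(hash, 7) ^ p[i];
	return hash;
}

/* Static inputs with hand-computed outputs of the unsigned variant. */
struct toy_hash_case {
	const char *name;
	size_t len;
	uint32_t expected;
};

static const struct toy_hash_case toy_hash_cases[] = {
	{ "a",    1, 0x00000061 },
	{ "ab",   2, 0x000030e2 },
	{ "\xff", 1, 0x000000ff },
};

/* Returns 0 if every case matches, or -1 on the first mismatch. */
int toy_hash_selftest(void)
{
	size_t n = sizeof(toy_hash_cases) / sizeof(toy_hash_cases[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		const struct toy_hash_case *c = &toy_hash_cases[i];

		if (toy_hash_unsigned(c->name, c->len) != c->expected)
			return -1;
	}
	return 0;
}
```

      For pure ASCII names the two variants agree; for a name containing any byte with bit 7 set (for example the UTF-8 bytes of an emoji) they diverge, which is the class of mismatch a fixed-input self-test is meant to catch on every platform.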
  9. 12 Feb, 2023 2 commits
  10. 17 Nov, 2022 1 commit
  11. 31 Oct, 2022 2 commits
  12. 30 Sep, 2022 1 commit
    • fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE · cbfecb92
      Lukas Czerner authored
      Currently I_DIRTY_TIME will never get set if the inode already has
      I_DIRTY_INODE, on the assumption that it supersedes I_DIRTY_TIME.  That's
      true, however ext4 will only update the on-disk inode in
      ->dirty_inode(), not on actual writeback. As a result if the inode
      already has I_DIRTY_INODE state by the time we get to
      __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
      into on-disk inode and will not get updated until the next I_DIRTY_INODE
      update, which might never come if we crash or get a power failure.
      
      The problem can be reproduced on ext4 by running xfstest generic/622
      with -o iversion mount option.
      
      Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
      I_DIRTY_INODE. Also make sure that the case is properly handled in
      writeback_single_inode() as well. Additionally, changes were made to
      xfs_fs_dirty_inode() to accommodate I_DIRTY_TIME in the flags.
      
      Thanks Jan Kara for suggestions on how to make this work properly.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      cbfecb92
  13. 30 Jul, 2022 1 commit
  14. 20 Jul, 2022 1 commit
    • xfs: xfs_buf cache destroy isn't RCU safe · 231f91ab
      Dave Chinner authored
      Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
      report a non-empty xfs_buf slab on module remove. This isn't easy
      to reproduce, but is clearly a side effect of converting the buffer
      cache to RCU freeing and lockless lookups. Sachin bisected and
      Darrick hit it when testing the patchset directly.
      
      Turns out that the xfs_buf slab is not destroyed when all the other
      XFS slab caches are destroyed. Instead, it's got its own little
      wrapper function that gets called separately, and so it doesn't have
      an rcu_barrier() call in it that is needed to drain all the rcu
      callbacks before the slab is destroyed.
      
      Fix it by removing the xfs_buf_init/terminate wrappers that just
      allocate and destroy the xfs_buf slab, and move them to the same
      place that all the other slab caches are set up and destroyed.
      
      Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com>
      Fixes: 298f3422 ("xfs: lockless buffer lookup")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      231f91ab
  15. 14 Jul, 2022 1 commit
    • xfs: add in-memory iunlink log item · 784eb7d8
      Dave Chinner authored
      Now that we have a clean operation to update the di_next_unlinked
      field of inode cluster buffers, we can easily defer this operation
      to transaction commit time so we can order the inode cluster buffer
      locking consistently.
      
      To do this, we introduce a new in-memory log item to track the
      unlinked list item modification that we are going to make. This
      follows the same observations as the in-memory double linked list
      used to track unlinked inodes in that the inodes on the list are
      pinned in memory and cannot go away, and hence we can simply
      reference them for the duration of the transaction without needing
      to take active references or pin them or look them up.
      
      This allows us to pass the xfs_inode to the transaction commit code
      along with the modification to be made, and then order the logged
      modifications via the ->iop_sort and ->iop_precommit operations
      for the new log item type. As this is an in-memory log item, it
      doesn't have formatting, CIL or AIL operational hooks - it exists
      purely to run the inode unlink modifications and is then removed
      from the transaction item list and freed once the precommit
      operation has run.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      784eb7d8
  16. 01 Jul, 2022 1 commit
    • xfs: introduce per-cpu CIL tracking structure · af1c2146
      Dave Chinner authored
      The CIL push lock is highly contended on larger machines, becoming a
      hard bottleneck at about 700,000 transaction commits/s on >16p
      machines. To address this, start moving the CIL tracking
      infrastructure to utilise per-CPU structures.
      
      We need to track the space used, the amount of log reservation space
      reserved to write the CIL, the log items in the CIL and the busy
      extents that need to be completed by the CIL commit.  This requires
      a couple of per-cpu counters, an unordered per-cpu list and a
      globally ordered per-cpu list.
      
      Create a per-cpu structure to hold these and all the management
      interfaces needed, as well as the hooks to handle hotplug CPUs.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      
      af1c2146
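      The idea of replacing one contended shared counter with per-CPU counters that are only summed when a total is needed can be sketched in userspace as follows. This is a minimal illustration, not the kernel's per-cpu API or the actual XFS structures: `sketch_cil`, `sketch_cil_pcp`, the fixed `SKETCH_NR_CPUS`, and the explicit `cpu` argument are all assumptions made for the example, and hotplug handling and the per-cpu lists are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical fixed CPU count */

/* Per-CPU slice of the tracking state (hypothetical layout). */
struct sketch_cil_pcp {
	int64_t space_used;	/* bytes of CIL space used via this CPU */
	int64_t space_reserved;	/* log reservation held via this CPU */
};

struct sketch_cil {
	struct sketch_cil_pcp pcp[SKETCH_NR_CPUS];
};

/* Fast path: touch only the slice that belongs to the committing CPU. */
void sketch_cil_account(struct sketch_cil *cil, unsigned int cpu,
			int64_t used, int64_t reserved)
{
	assert(cpu < SKETCH_NR_CPUS);
	cil->pcp[cpu].space_used += used;
	cil->pcp[cpu].space_reserved += reserved;
}

/* Slow path: fold all slices together when a global total is needed. */
int64_t sketch_cil_space_used(const struct sketch_cil *cil)
{
	int64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += cil->pcp[cpu].space_used;
	return sum;
}

int64_t sketch_cil_space_reserved(const struct sketch_cil *cil)
{
	int64_t sum = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += cil->pcp[cpu].space_reserved;
	return sum;
}
```

      The trade-off shown here is that the frequent operation (accounting a commit) no longer contends on shared state, while the rare operation (computing a total at push time) pays for a walk over every CPU's slice.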
  17. 23 Jun, 2022 2 commits
    • xfs: introduce xfs_inodegc_push() · 5e672cd6
      Dave Chinner authored
      The current blocking mechanism for pushing the inodegc queue out to
      disk can result in systems becoming unusable when there is a long
      running inodegc operation. This is because the statfs()
      implementation currently issues a blocking flush of the inodegc
      queue and a significant number of common system utilities will call
      statfs() to discover something about the underlying filesystem.
      
      This can result in userspace operations getting stuck on inodegc
      progress, and when trying to remove a heavily reflinked file on slow
      storage with a full journal, this can result in delays measuring in
      hours.
      
      Avoid this problem by adding a "push" function that expedites the
      flushing of the inodegc queue, but doesn't wait for it to complete.
      
      Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
      mechanism so they don't block but still ensure that queued
      operations are expedited.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Reported-by: Chris Dunlop <chris@onthe.net.au>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: fix _getquota_next to use _inodegc_push too]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      5e672cd6
    • xfs: bound maximum wait time for inodegc work · 7cf2b0f9
      Dave Chinner authored
      Currently inodegc work can sit queued on the per-cpu queue until
      the workqueue is either flushed or the queue reaches a depth that
      triggers work queuing (and later throttling). This means that we
      could queue work that waits for a long time for some other event to
      trigger flushing.
      
      Hence instead of just queueing work at a specific depth, use a
      delayed work that queues the work at a bound time. We can still
      schedule the work immediately at a given depth, but we no longer need
      to worry about leaving a number of items on the list that won't get
      processed until external events prevail.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      7cf2b0f9
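      The scheduling policy described in this commit (run immediately once the queue is deep, otherwise run no later than a fixed bound after the first item arrives) can be modelled with a few lines of userspace C. This is only a sketch under stated assumptions: it does not use the kernel's delayed-work API, time is an abstract tick counter passed in by the caller, and `sketch_gc_queue`, `SKETCH_GC_DEPTH`, and `SKETCH_GC_DELAY` are hypothetical names and values.

```c
#include <stdint.h>

#define SKETCH_GC_DEPTH	32u	/* hypothetical depth that triggers immediate work */
#define SKETCH_GC_DELAY	5u	/* hypothetical wait bound, in abstract ticks */

struct sketch_gc_queue {
	unsigned int items;	/* items waiting on the queue */
	int armed;		/* nonzero once a run time has been set */
	uint64_t run_at;	/* tick at which the worker should run */
};

/* Add one item at time 'now' and decide when the worker must run. */
void sketch_gc_enqueue(struct sketch_gc_queue *q, uint64_t now)
{
	q->items++;

	if (q->items >= SKETCH_GC_DEPTH) {
		/* Deep queue: pull the run time forward to "now". */
		if (!q->armed || now < q->run_at)
			q->run_at = now;
		q->armed = 1;
	} else if (!q->armed) {
		/* First item: arm the timer so the wait is bounded. */
		q->run_at = now + SKETCH_GC_DELAY;
		q->armed = 1;
	}
	/* Otherwise the existing deadline already covers this item. */
}
```

      With depth-only triggering, a queue holding a handful of items would have no deadline at all; in this model every queued item is covered by a `run_at` value that is at most `SKETCH_GC_DELAY` ticks after the first enqueue.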
  18. 27 May, 2022 1 commit
  19. 22 May, 2022 1 commit
  20. 18 Apr, 2022 1 commit
  21. 13 Apr, 2022 1 commit
  22. 11 Apr, 2022 1 commit
    • xfs: use a separate frextents counter for rt extent reservations · 2229276c
      Darrick J. Wong authored
      As mentioned in the previous commit, the kernel misuses sb_frextents in
      the incore mount to reflect both incore reservations made by running
      transactions as well as the actual count of free rt extents on disk.
      This results in the superblock being written to the log with an
      underestimate of the number of rt extents that are marked free in the
      rtbitmap.
      
      Teaching XFS to recompute frextents after log recovery avoids
      operational problems in the current mount, but it doesn't solve the
      problem of us writing undercounted frextents which are then recovered by
      an older kernel that doesn't have that fix.
      
      Create an incore percpu counter to mirror the ondisk frextents.  This
      new counter will track transaction reservations and the only time we
      will touch the incore super counter (i.e. the one that gets logged) is
      when those transactions commit updates to the rt bitmap.  This is in
      contrast to the lazysbcount counters (e.g. fdblocks), where we know that
      log recovery will always fix any incorrect counter that we log.
      As a bonus, we only take m_sb_lock at transaction commit time.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2229276c
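      The split between a logged counter and a reservation-tracking counter can be illustrated with a small model. This is a sketch of the bookkeeping idea only, under stated assumptions: the real code uses a percpu counter and the superblock fields, whereas `sketch_rt_counters`, `sketch_rt_reserve`, and `sketch_rt_commit` are hypothetical names operating on plain integers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct sketch_rt_counters {
	int64_t ondisk_frextents;	/* logged copy: extents free in the rt bitmap */
	int64_t incore_frextents;	/* free extents minus outstanding reservations */
};

/* A transaction reserves extents: only the incore counter moves. */
int sketch_rt_reserve(struct sketch_rt_counters *c, int64_t nr)
{
	assert(nr >= 0);
	if (c->incore_frextents < nr)
		return -ENOSPC;
	c->incore_frextents -= nr;
	return 0;
}

/*
 * The transaction commits having allocated 'used' of its 'reserved'
 * extents in the rt bitmap.  Only now does the logged counter change;
 * the unused part of the reservation goes back to the incore counter.
 */
void sketch_rt_commit(struct sketch_rt_counters *c, int64_t reserved,
		      int64_t used)
{
	assert(used >= 0 && used <= reserved);
	c->ondisk_frextents -= used;
	c->incore_frextents += reserved - used;
}
```

      In this model the logged value never reflects a reservation that has not yet turned into a bitmap update, so a kernel that recovers the log without recomputing the counter would not see an undercount.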
  23. 28 Mar, 2022 1 commit
    • xfs: don't report reserved bnobt space as available · 85bcfa26
      Darrick J. Wong authored
      On a modern filesystem, we don't allow userspace to allocate blocks for
      data storage from the per-AG space reservations, the user-controlled
      reservation pool that prevents ENOSPC in the middle of internal
      operations, or the internal per-AG set-aside that prevents unwanted
      filesystem shutdowns due to ENOSPC during a bmap btree split.
      
      Since we now consider freespace btree blocks as unavailable for
      allocation for data storage, we shouldn't report those blocks via statfs
      either.  This makes the numbers that we return via the statfs f_bavail
      and f_bfree fields a more conservative estimate of actual free space.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      85bcfa26
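      The arithmetic behind the more conservative statfs numbers can be written out as a short sketch. The function name and parameters are hypothetical, and clamping the result at zero is an assumption of this example rather than something stated in the commit message.

```c
#include <stdint.h>

/*
 * Blocks reported as free: the raw free block count minus the space
 * that userspace data allocations are not allowed to consume.
 *
 *   fdblocks       - free data blocks counted by the filesystem
 *   set_aside      - blocks held back for internal operations
 *   allocbt_blocks - blocks in use by the free space btrees
 */
int64_t sketch_statfs_bfree(int64_t fdblocks, int64_t set_aside,
			    int64_t allocbt_blocks)
{
	int64_t avail = fdblocks - set_aside - allocbt_blocks;

	/* Assumption: never report a negative amount of free space. */
	return avail > 0 ? avail : 0;
}
```

      For example, with 1000 free blocks, 16 set aside, and 24 free space btree blocks, the sketch reports 960 blocks rather than 1000.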
  24. 10 Feb, 2022 1 commit
  25. 30 Jan, 2022 1 commit
  26. 21 Dec, 2021 1 commit
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Darrick J. Wong authored
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) are now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging extents that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7993f1a4
  27. 07 Dec, 2021 1 commit
    • xfs: remove all COW fork extents when remounting readonly · 089558bc
      Darrick J. Wong authored
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) are now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
      synchronously, which causes xfs_icwalk to return to inodes that were
      skipped because the blockgc code couldn't take the IOLOCK.  This is safe
      to do here because the VFS has already prohibited new writer threads.
      
      Fixes: 10ddf64e ("xfs: remove leftover CoW reservations when remounting ro")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      089558bc
  28. 04 Dec, 2021 1 commit