1. 21 Dec, 2021 8 commits
    • xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() · 6ed6356b
      Dan Carpenter authored
      The "bufsize" comes from the root user.  If "bufsize" is negative then,
      because of type promotion, neither of the validation checks at the start
      of the function are able to catch it:
      
      	if (bufsize < sizeof(struct xfs_attrlist) ||
      	    bufsize > XFS_XATTR_LIST_MAX)
      		return -EINVAL;
      
      This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
      kvmalloc_node().  Fix this by changing the type from int to size_t.
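      
      A minimal userspace sketch of the promotion hazard and why size_t fixes
      it; the struct layout and the XFS_XATTR_LIST_MAX value below are
      illustrative stand-ins, not the kernel definitions:
      
      	#include <stdio.h>
      	#include <stddef.h>
      
      	#define XFS_XATTR_LIST_MAX	65536	/* assumed value for illustration */
      
      	struct xfs_attrlist { int al_count; };	/* stand-in layout */
      
      	/* Buggy shape: "bufsize < sizeof(...)" promotes bufsize to size_t, so
      	 * -1 becomes huge and that check passes; "-1 > 65536" is also false,
      	 * so a negative size slips through to the allocator. */
      	static int check_int(int bufsize)
      	{
      		if (bufsize < sizeof(struct xfs_attrlist) ||
      		    bufsize > XFS_XATTR_LIST_MAX)
      			return -1;
      		return 0;
      	}
      
      	/* Fixed shape: with a size_t parameter, -1 arrives as SIZE_MAX and the
      	 * upper-bound check rejects it. */
      	static int check_size_t(size_t bufsize)
      	{
      		if (bufsize < sizeof(struct xfs_attrlist) ||
      		    bufsize > XFS_XATTR_LIST_MAX)
      			return -1;
      		return 0;
      	}
      
      	int main(void)
      	{
      		printf("int:    %d\n", check_int(-1));		/* 0: accepted (bug) */
      		printf("size_t: %d\n", check_size_t(-1));	/* -1: rejected      */
      		return 0;
      	}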
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix comments mentioning xfs_ialloc · 132c460e
      Yang Xu authored
      Since kernel commit 1abcf261 ("xfs: move on-disk inode allocation out of
      xfs_ialloc()"), xfs_ialloc() has been renamed to xfs_init_new_inode().
      Update the comments that still mention the old name.
      Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: check sb_meta_uuid for dabuf buffer recovery · 09654ed8
      Dave Chinner authored
      Got a report that a repeated crash test of a container host would
      eventually fail with a log recovery error preventing the system from
      mounting the root filesystem. It manifested as a directory leaf node
      corruption on writeback like so:
      
       XFS (loop0): Mounting V5 Filesystem
       XFS (loop0): Starting recovery (logdev: internal)
       XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
       XFS (loop0): Unmount and run xfs_repair
       XFS (loop0): First 128 bytes of corrupted metadata buffer:
       00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
       00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
       00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
       00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
       00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
       00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
       00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
       00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
       XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
       XFS (loop0): Please unmount the filesystem and rectify the problem(s)
       XFS (loop0): log mount/recovery failed: error -117
       XFS (loop0): log mount failed
      
      Tracing indicated that we were recovering changes from a transaction
      at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
      That is, log recovery was overwriting a buffer with newer changes on
      disk than was in the transaction. Tracing indicated that we were
      hitting the "recovery immediately" case in
      xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
      buffer.
      
      The code was extracting the LSN correctly, then ignoring it because
      the UUID in the buffer did not match the superblock UUID.  The
      problem arises because the UUID check uses the wrong UUID - it
      should be checking sb_meta_uuid, not sb_uuid.  This filesystem has
      sb_uuid != sb_meta_uuid (which is fine), and the buffer contains the
      correct matching sb_meta_uuid; the code simply compared it against
      the wrong superblock UUID.
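      
      A sketch of the corrected comparison in the magicda case of the
      log-recovery LSN lookup described above; the surrounding code follows the
      description in this message, and the exact placement in the patch may
      differ:
      
      	/* dabuf (magicda) case: extract the LSN and stamped UUID. */
      	lsn = be64_to_cpu(((struct xfs_da3_blkinfo *)blk)->lsn);
      	uuid = &((struct xfs_da3_blkinfo *)blk)->uuid;
      
      	/* Trust the buffer LSN only if the buffer was stamped with the
      	 * metadata UUID, i.e. compare against sb_meta_uuid, not sb_uuid. */
      	if (lsn != (xfs_lsn_t)-1) {
      		if (!uuid_equal(&mp->m_sb.sb_meta_uuid, uuid))
      			goto recover_immediately;
      		return lsn;
      	}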
      
      There is no corruption in the filesystem, and failing to recover the
      buffer due to a write verifier failure means the recovery bug did
      not propagate the corruption to disk.  Hence there is no corruption
      before or after this bug has manifested; the impact is limited
      to an unmountable filesystem.
      
      This was missed back in 2015 during an audit of incorrect sb_uuid
      usage that resulted in commit fcfbe2c4 ("xfs: log recovery needs
      to validate against sb_meta_uuid"), which fixed the magic32 buffers
      to validate against sb_meta_uuid instead of sb_uuid but missed the
      magicda buffers.
      
      Fixes: ce748eaa ("xfs: create new metadata UUID field and incompat flag")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix a bug in the online fsck directory leaf1 bestcount check · e5d1802c
      Darrick J. Wong authored
      When xfs_scrub encounters a directory with a leaf1 block, it tries to
      validate that the leaf1 block's bestcount (i.e. the count of bestfree
      entries, one for each directory data block) is correct.  Previously,
      this author believed that comparing bestcount to the directory isize
      (since directory data blocks are under isize, and leaf/bestfree blocks
      are above it) was sufficient.
      
      Unfortunately during testing of online repair, it was discovered that it
      is possible to create a directory with a hole between the last directory
      block and isize.  The directory code seems to handle this situation just
      fine and xfs_repair doesn't complain, which effectively makes this quirk
      part of the disk format.
      
      Fix the check to work properly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Darrick J. Wong authored
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), A's dirty data will be written into (B) and, once
      that's done, those blocks will be transferred to (A)'s data fork
      without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
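      
      A rough sketch of where the recovery call ends up, per the reasoning
      above; the surrounding steps are paraphrased in comments and the exact
      placement inside log mount completion is an assumption, not a quote of
      the patch:
      
      	int
      	xfs_log_mount_finish(
      		struct xfs_mount	*mp)
      	{
      		int			error;
      
      		/* ... finish recovering logged intents first ... */
      
      		/* Free leftover COW staging extents now, while no live inodes
      		 * can hold incore COW reservations and a per-AG reservation is
      		 * available to feed any refcount btree expansion. */
      		error = xfs_reflink_recover_cow(mp);
      		if (error)
      			return error;
      
      		/* ... process unlinked inodes, mark the log clean ... */
      		return 0;
      	}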
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't expose internal symlink metadata buffers to the vfs · 7b7820b8
      Darrick J. Wong authored
      Ian Kent reported that for inline symlinks, it's possible for
      vfs_readlink to hang on to the target buffer returned by
      _vn_get_link_inline long after it's been freed by xfs inode reclaim.
      This is a layering violation -- we should never expose XFS internals to
      the VFS.
      
      When the symlink has a remote target, we allocate a separate buffer,
      copy the internal information, and let the VFS manage the new buffer's
      lifetime.  Let's adapt the inline code paths to do this too.  It's
      less efficient, but fixes the layering violation and avoids the need to
      adapt the if_data lifetime to rcu rules.  Clearly I don't care about
      readlink benchmarks.
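      
      A sketch of the pattern described above: make a private copy of the
      target and hand that to the VFS via the delayed-call mechanism so the
      VFS controls its lifetime.  Buffer sizing and error handling here are
      assumptions for illustration, not lifted from the patch:
      
      	static const char *
      	xfs_vn_get_link(
      		struct dentry		*dentry,
      		struct inode		*inode,
      		struct delayed_call	*done)
      	{
      		char			*link;
      		int			error;
      
      		/* RCU-walk lookups must drop into ref-walk mode first. */
      		if (!dentry)
      			return ERR_PTR(-ECHILD);
      
      		link = kmalloc(MAXPATHLEN + 1, GFP_KERNEL);
      		if (!link)
      			return ERR_PTR(-ENOMEM);
      
      		/* Copy the target out of the inode (inline or remote) into the
      		 * new buffer under the proper inode locks. */
      		error = xfs_readlink(XFS_I(d_inode(dentry)), link);
      		if (error) {
      			kfree(link);
      			return ERR_PTR(error);
      		}
      
      		/* The VFS now owns the copy and frees it when it's done. */
      		set_delayed_call(done, kfree_link, link);
      		return link;
      	}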
      
      As a side note, this fixes the minor locking violation where we can
      access the inode data fork without taking any locks; proper locking (and
      eliminating the possibility of having to switch inode_operations on a
      live inode) is essential to online repair coordinating repairs
      correctly.
      Reported-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix quotaoff mutex usage now that we don't support disabling it · 59d7fab2
      Darrick J. Wong authored
      Prior to commit 40b52225 ("xfs: remove support for disabling quota
      accounting on a mounted file system"), we used the quotaoff mutex to
      protect dquot operations against quotaoff trying to pull down dquots as
      part of disabling quota.
      
      Now that we only support turning off quota enforcement, the quotaoff
      mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
      protect dquots, which means we can remove it from setqlimits and the
      dquot scrub code.  While we're at it, fix the function that forces
      quotacheck, since it should have been taking the quotaoff mutex.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: shut down filesystem if we xfs_trans_cancel with deferred work items · 47a6df7c
      Darrick J. Wong authored
      While debugging some very strange rmap corruption reports in connection
      with the online directory repair code, I root-caused the error to the
      following incorrect sequence:
      
      <start repair transaction>
      <expand directory, causing a deferred rmap to be queued>
      <roll transaction>
      <cancel transaction>
      
      Obviously, we should have committed the transaction instead of
      cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
      have warned us that we were throwing away work items that we had
      already committed to performing.  Cancelling such a transaction is not
      correct, and we need to shut down the filesystem.
      
      Change xfs_trans_cancel to complain in the loudest manner if we're
      cancelling any transaction with deferred work items attached.
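      
      A sketch of the kind of check this adds to xfs_trans_cancel(); the exact
      condition and its placement within the function are assumptions:
      
      	/* Cancelling a transaction that still has deferred work items
      	 * queued means throwing away work we already committed to doing,
      	 * so complain as loudly as possible and shut the filesystem down. */
      	if (!list_empty(&tp->t_dfops)) {
      		ASSERT(list_empty(&tp->t_dfops));
      		xfs_force_shutdown(tp->t_mountp, SHUTDOWN_CORRUPT_INCORE);
      	}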
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  2. 12 Dec, 2021 14 commits
  3. 11 Dec, 2021 18 commits