  1. 01 May, 2023 9 commits
    • xfs: fix xfs_inodegc_stop racing with mod_delayed_work · 2254a739
      Darrick J. Wong authored
      syzbot reported this warning from the faux inodegc shrinker that tries
      to kick off inodegc work:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
      RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
      Call Trace:
       __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
       mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
       xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
       xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
       do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
       shrink_slab+0x175/0x660 mm/vmscan.c:1013
       shrink_one+0x502/0x810 mm/vmscan.c:5343
       shrink_many mm/vmscan.c:5394 [inline]
       lru_gen_shrink_node mm/vmscan.c:5511 [inline]
       shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
       kswapd_shrink_node mm/vmscan.c:7262 [inline]
       balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
       kswapd+0x677/0xd60 mm/vmscan.c:7712
       kthread+0x2e8/0x3a0 kernel/kthread.c:376
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
      
      This warning corresponds to this code in __queue_work:
      
      	/*
      	 * For a draining wq, only works from the same workqueue are
      	 * allowed. The __WQ_DESTROYING helps to spot the issue that
      	 * queues a new work item to a wq after destroy_workqueue(wq).
      	 */
      	if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
      		     WARN_ON_ONCE(!is_chained_work(wq))))
      		return;
      
      For this to trip, we must have a thread draining the inodegc workqueue
      and a second thread trying to queue inodegc work to that workqueue.
      This can happen if freezing or a ro remount race with reclaim poking our
      faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
      file:
      
      Thread 0	Thread 1	Thread 2
      
      xfs_inodegc_stop
      
      				xfs_inodegc_shrinker_scan
      				xfs_is_inodegc_enabled
      				<yes, will continue>
      
      xfs_clear_inodegc_enabled
      xfs_inodegc_queue_all
      <list empty, do not queue inodegc worker>
      
      		xfs_inodegc_queue
      		<add to list>
      		xfs_is_inodegc_enabled
      		<no, returns>
      
      drain_workqueue
      <set WQ_DRAINING>
      
      				llist_empty
      				<no, will queue list>
      				mod_delayed_work_on(..., 0)
      				__queue_work
      				<sees WQ_DRAINING, kaboom>
      
      In other words, everything between the access to inodegc_enabled state
      and the decision to poke the inodegc workqueue requires some kind of
      coordination to avoid the WQ_DRAINING state.  We could perhaps introduce
      a lock here, but we could also try to eliminate WQ_DRAINING from the
      picture.
      
      We could replace the drain_workqueue call with a loop that flushes the
      workqueue and queues workers as long as there is at least one inode
      present in the per-cpu inodegc llists.  We've disabled inodegc at this
      point, so we know that the number of queued inodes will eventually hit
      zero as long as xfs_inodegc_start cannot reactivate the workers.
      
      There are four callers of xfs_inodegc_start.  Three of them come from the
      VFS with s_umount held: filesystem thawing, failed filesystem freezing,
      and the rw remount transition.  The fourth caller is mounting rw (no
      remount or freezing possible).
      
      There are three callers of xfs_inodegc_stop.  One is unmounting (no
      remount or thaw possible).  Two of them come from the VFS with s_umount
      held: fs freezing and ro remount transition.
      
      Hence, it is correct to replace the drain_workqueue call with a loop
      that drains the inodegc llists.
      
      Fixes: 6191cf3a ("xfs: flush inodegc workqueue tasks before cancel")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
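The flush-and-requeue loop that this commit message proposes in place of drain_workqueue can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel patch: every model_* name and MODEL_NR_CPUS are hypothetical, and "flushing" is simulated by letting each pending worker empty its own list.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4		/* hypothetical cpu count for this model */

/* Hypothetical userspace stand-in for the per-cpu inodegc state. */
struct model_inodegc {
	int queued[MODEL_NR_CPUS];		/* inodes on each per-cpu list */
	bool work_pending[MODEL_NR_CPUS];	/* worker queued on that cpu? */
	bool enabled;
};

/* Queue a worker for every non-empty list; return true if any was queued. */
bool model_queue_all(struct model_inodegc *gc)
{
	bool queued = false;

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (gc->queued[cpu] > 0) {
			gc->work_pending[cpu] = true;
			queued = true;
		}
	}
	return queued;
}

/* Stand-in for flushing the workqueue: each pending worker empties its list. */
void model_flush(struct model_inodegc *gc)
{
	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (gc->work_pending[cpu]) {
			gc->queued[cpu] = 0;
			gc->work_pending[cpu] = false;
		}
	}
}

/* Disable inodegc, then flush and requeue until every list is empty. */
void model_inodegc_stop(struct model_inodegc *gc)
{
	bool rerun;

	gc->enabled = false;
	model_queue_all(gc);
	do {
		model_flush(gc);
		rerun = model_queue_all(gc);
	} while (rerun);
}
```

Because the enabled flag is cleared before the loop starts, nothing in this model can refill the lists, which mirrors the argument above that the queued-inode count must eventually reach zero.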
    • xfs: disable reaping in fscounters scrub · 2d5f38a3
      Darrick J. Wong authored
      The fscounters scrub code doesn't work properly because it cannot
      quiesce updates to the percpu counters in the filesystem, hence it
      returns false corruption reports.  This has been fixed properly in
      one of the online repair patchsets that are under review by replacing
      the xchk_disable_reaping calls with an exclusive filesystem freeze.
      Disabling background gc isn't sufficient to fix the problem.
      
      In other words, scrub doesn't need to call xfs_inodegc_stop, which is
      just as well since it wasn't correct to allow scrub to call
      xfs_inodegc_start when something else could be calling xfs_inodegc_stop
      (e.g. trying to freeze the filesystem).
      
      Neuter the scrubber for now, and remove the xchk_*_reaping functions.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: check that per-cpu inodegc workers actually run on that cpu · b37c4c83
      Darrick J. Wong authored
      Now that we've allegedly worked out the problem of the per-cpu inodegc
      workers being scheduled on the wrong cpu, let's put in a debugging knob
      to let us know if a worker ever gets mis-scheduled again.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately · 03e0add8
      Darrick J. Wong authored
      I've been noticing odd racing behavior in the inodegc code that could
      only be explained by one cpu adding an inode to its inactivation llist
      at the same time that another cpu is processing that cpu's llist.
      Preemption is disabled between get/put_cpu_ptr, so the only explanation
      is scheduler mayhem.  I inserted the following debug code into
      xfs_inodegc_worker (see the next patch):
      
      	ASSERT(gc->cpu == smp_processor_id());
      
      This assertion tripped during overnight tests on the arm64 machines, but
      curiously not on x86_64.  I think we haven't observed any resource leaks
      here because the lockfree list code can handle simultaneous llist_add
      and llist_del_all functions operating on the same list.  However, the
      whole point of having percpu inodegc lists is to take advantage of warm
      memory caches by inactivating inodes on the last processor to touch the
      inode.
      
      The incorrect scheduling seems to occur after an inodegc worker is
      subjected to mod_delayed_work().  This wraps mod_delayed_work_on with
      WORK_CPU_UNBOUND specified as the cpu number.  Unbound allows for
      scheduling on any cpu, not necessarily the same one that scheduled the
      work.
      
      Because preemption is disabled for as long as we have the gc pointer, I
      think it's safe to use current_cpu() (aka smp_processor_id) to queue the
      delayed work item on the correct cpu.
      
      Fixes: 7cf2b0f9 ("xfs: bound maximum wait time for inodegc work")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix negative array access in xfs_getbmap · 1bba82fe
      Darrick J. Wong authored
      In commit 8ee81ed5, Ye Bin complained about an ASSERT in the bmapx
      code that trips if we encounter a delalloc extent after flushing the
      pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
      entirely possible that a racing write page fault can create a delalloc
      extent after the file has been flushed.  The proposed solution was to
      replace the assertion with an early return that avoids filling out the
      bmap recordset with a delalloc entry if the caller didn't ask for it.
      
      At the time, I recall thinking that the forward logic sounded ok, but
      felt hesitant because I suspected that changing this code would cause
      something /else/ to burst loose due to some other subtlety.
      
      syzbot of course found that subtlety.  If all the extent mappings found
      after the flush are delalloc mappings, we'll reach the end of the data
      fork without ever incrementing bmv->bmv_entries.  This is new, since
      before we'd have emitted the delalloc mappings even though the caller
      didn't ask for them.  Once we reach the end, we'll try to set
      BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
      corrupt something else in memory.  Yay.
      
      I really dislike all these stupid patches that fiddle around with debug
      code and break things that otherwise worked well enough.  Nobody was
      complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
      return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
      that nobody cared about" to "bad behavior that must be addressed
      immediately".
      
      Maybe I'll just ignore anything from Huawei from now on for my own sake.
      
      Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
      Fixes: 8ee81ed5 ("xfs: fix BUG_ON in xfs_getbmap()")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
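The out-of-bounds write described in this commit message comes from flagging entry "count - 1" when the count is zero. One way to guard against that is sketched below; it is an illustration under stated assumptions, not the actual xfs_getbmap change. The struct, the function name, and the MODEL_OF_LAST value are all hypothetical.

```c
#include <stddef.h>

#define MODEL_OF_LAST 0x4	/* hypothetical flag value for this sketch */

/* Hypothetical, stripped-down output record. */
struct model_bmap_rec {
	unsigned int flags;
};

/*
 * Flag the final emitted record as the last one.  Returns 1 if a record
 * was flagged and 0 if no records were emitted, in which case nothing is
 * written because out[nr_entries - 1] would lie before the array.
 */
int model_mark_last(struct model_bmap_rec *out, size_t nr_entries)
{
	if (nr_entries == 0)
		return 0;
	out[nr_entries - 1].flags |= MODEL_OF_LAST;
	return 1;
}
```

The zero-entry case is exactly the one that arises above when every mapping found after the flush is a delalloc mapping that the caller did not ask for.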
    • xfs: don't allocate into the data fork for an unshare request · 1f1397b7
      Darrick J. Wong authored
      For an unshare request, we only have to take action if the data fork has
      a shared mapping.  We don't care if someone else set up a cow operation.
      If we find nothing in the data fork, return a hole to avoid allocating
      space.
      
      Note that fallocate will replace the delalloc reservation with an
      unwritten extent anyway, so this has no user-visible effects outside of
      avoiding unnecessary updates.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: flush dirty data and drain directios before scrubbing cow fork · 397b2d7e
      Darrick J. Wong authored
      When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to
      prevent page_mkwrite from modifying any inode state.  The ILOCK should
      suffice to avoid confusing online fsck, but let's take the same locks
      that we do everywhere else.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: set bnobt/cntbt numrecs correctly when formatting new AGs · 8e698ee7
      Darrick J. Wong authored
      Through generic/300, I discovered that mkfs.xfs creates corrupt
      filesystems when given these parameters:
      
      # mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
      Filesystems formatted with --unsupported are not supported!!
      meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
               =                       sectsz=512   attr=2, projid32bit=1
               =                       crc=1        finobt=1, sparse=1, rmapbt=1
               =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
      data     =                       bsize=4096   blocks=130816, imaxpct=25
               =                       sunit=32     swidth=128 blks
      naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
      log      =internal log           bsize=4096   blocks=8192, version=2
               =                       sectsz=512   sunit=32 blks, lazy-count=1
      realtime =none                   extsz=4096   blocks=0, rtextents=0
               =                       rgcount=0    rgsize=0 blks
      Discarding blocks...Done.
      # xfs_repair -n /dev/sda
      Phase 1 - find and verify superblock...
              - reporting progress in intervals of 15 minutes
      Phase 2 - using internal log
              - zero log...
              - 16:30:50: zeroing log - 16320 of 16320 blocks done
              - scan filesystem freespace and inode maps...
      agf_freeblks 25, counted 0 in ag 4
      sb_fdblocks 8823, counted 8798
      
      The root cause of this problem is the numrecs handling in
      xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
      calling the function, we set up the new bnobt block with numrecs == 1
      and rely on _freesp_init_recs to format that new record.  If the last
      record created has a blockcount of zero, then it sets numrecs = 0.
      
      That last bit isn't correct if the AG contains the log, the start of the
      log is not immediately after the initial blocks due to stripe alignment,
      and the end of the log is perfectly aligned with the end of the AG.  For
      this case, we actually formatted a single bnobt record to handle the
      free space before the start of the (stripe aligned) log, and incremented
      arec to try to format a second record.  That second record turned out to
      be unnecessary, so what we really want is to leave numrecs at 1.
      
      The numrecs handling itself is overly complicated because a different
      function sets numrecs == 1.  Change the bnobt creation code to start
      with numrecs set to zero and only increment it after successfully
      formatting a free space extent into the btree block.
      
      Fixes: f327a007 ("xfs: account for log space when formatting new AGs")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
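The "start numrecs at zero and only increment after formatting a record" approach from this commit message can be illustrated with a small userspace model. This is a sketch under simplifying assumptions (the log, if present, lies entirely between first_free and the end of the AG); the struct and function names are hypothetical and do not correspond to the real xfs_freesp_init_recs signature.

```c
#include <stdint.h>

/* Hypothetical free-space record: a run of free blocks within one AG. */
struct model_free_rec {
	uint32_t startblock;
	uint32_t blockcount;
};

/*
 * Format free-space records for an AG of ag_blocks blocks whose first
 * first_free blocks hold metadata.  If log_len is non-zero, the blocks
 * [log_start, log_start + log_len) hold the log and are not free.
 * recs must have room for two records.  Returns the number of records
 * actually formatted, which is the value numrecs should take.
 */
unsigned int model_init_free_recs(struct model_free_rec *recs,
				  uint32_t first_free, uint32_t ag_blocks,
				  uint32_t log_start, uint32_t log_len)
{
	unsigned int numrecs = 0;	/* nothing formatted yet */
	uint32_t start = first_free;

	if (log_len > 0) {
		/* Free space between the AG headers and an aligned log. */
		if (log_start > start) {
			recs[numrecs].startblock = start;
			recs[numrecs].blockcount = log_start - start;
			numrecs++;
		}
		start = log_start + log_len;
	}

	/* Free space after the log, or the whole AG if there is no log. */
	if (ag_blocks > start) {
		recs[numrecs].startblock = start;
		recs[numrecs].blockcount = ag_blocks - start;
		numrecs++;
	}
	return numrecs;
}
```

With a log that ends exactly at the end of the AG, the second branch formats nothing and the count stays at 1, which is the outcome the commit message says the old code got wrong.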
    • xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof · b82a5c42
      Darrick J. Wong authored
      xfs/170 on a filesystem with su=128k,sw=4 produces this splat:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      #PF: supervisor write access in kernel mode
      #PF: error_code(0x0002) - not-present page
      PGD 0 P4D 0
      Oops: 0002 [#1] PREEMPT SMP
      CPU: 1 PID: 4022907 Comm: dd Tainted: G        W          6.3.0-xfsx #2 6ebeeffbe9577d32
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
      RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
      RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
      RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
      RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
      R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
      R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
      FS:  00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
      Call Trace:
       <TASK>
       xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       iomap_iter+0x132/0x2f0
       __iomap_dio_rw+0x2f8/0x840
       iomap_dio_rw+0xe/0x30
       xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
       vfs_write+0x2eb/0x410
       ksys_write+0x65/0xe0
       do_syscall_64+0x2b/0x80
      
      This crash occurs under the "out_low_space" label.  We grabbed a perag
      reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
      afterwards args->pag is NULL.  Fix the second function not to clobber
      args->pag if the caller had passed one in.
      
      Fixes: 85843327 ("xfs: factor xfs_bmap_btalloc()")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
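The ownership rule behind this fix (a helper should only drop and clear a perag reference that it took itself) can be sketched as below. This is an illustration, not the kernel code: the structs, the scratch parameter, and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the perag and allocation argument structs. */
struct model_perag {
	int refcount;
};

struct model_alloc_args {
	struct model_perag *pag;
};

/*
 * Helper that may be called with or without a caller-supplied perag.
 * It only releases and clears args->pag when it grabbed the reference
 * itself; a reference passed in by the caller is left untouched.
 */
void model_alloc_at_eof(struct model_alloc_args *args,
			struct model_perag *scratch)
{
	bool caller_pag = (args->pag != NULL);

	if (!caller_pag) {
		scratch->refcount++;		/* take our own reference */
		args->pag = scratch;
	}

	/* ... the allocation attempt would happen here ... */

	if (!caller_pag) {
		args->pag->refcount--;		/* drop only what we took */
		args->pag = NULL;
	}
}
```

A caller that passed in a perag can therefore still release it afterwards, which is what the code under the "out_low_space" label needs.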
  2. 26 Apr, 2023 1 commit
    • xfs: fix livelock in delayed allocation at ENOSPC · 9419092f
      Dave Chinner authored
      On a filesystem with a non-zero stripe unit and a large sequential
      write, delayed allocation will set a minimum allocation length of
      the stripe unit. If allocation fails because there are no extents
      long enough for an aligned minlen allocation, it is supposed to
      fall back to unaligned allocation which allows single block extents
      to be allocated.
      
      When the allocator code was rewritten in the 6.3 cycle, this
      fallback was broken - the old code used args->fsbno as both the
      allocation target and the allocation result, whereas the new code passes the
      target as a separate parameter. The conversion didn't handle the
      aligned->unaligned fallback path correctly - it reset args->fsbno to
      the target fsbno on failure which broke allocation failure detection
      in the high level code and so it never fell back to unaligned
      allocations.
      
      This resulted in a loop in writeback trying to allocate an aligned
      block, getting a false positive success, trying to insert the result
      in the BMBT. This did nothing because the extent already was in the
      BMBT (merge results in an unchanged extent) and so it returned the
      prior extent to the conversion code as the current iomap.
      
      Because the iomap returned didn't cover the offset we tried to map,
      xfs_convert_blocks() then retries the allocation, which fails in the
      same way and now we have a livelock.

      Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
      Fixes: 85843327 ("xfs: factor xfs_bmap_btalloc()")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
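The aligned-to-unaligned fallback described in this commit message depends on the failure sentinel surviving in the result field. A minimal userspace sketch of that control flow follows; the struct, the sentinel name, and the "longest free extent" shortcut are assumptions made for illustration and do not reflect the real allocator interfaces.

```c
#include <stdint.h>

#define MODEL_NULLFSBLOCK UINT64_MAX	/* hypothetical "allocation failed" value */

/* Hypothetical, heavily reduced allocation arguments. */
struct model_args {
	uint64_t fsbno;		/* result; MODEL_NULLFSBLOCK on failure */
	uint32_t minlen;	/* minimum acceptable extent length */
	uint32_t longest_free;	/* longest free extent available (model input) */
};

/* Try to allocate at least minlen blocks; target is a separate parameter. */
void model_alloc(struct model_args *args, uint64_t target)
{
	if (args->longest_free >= args->minlen)
		args->fsbno = target;
	else
		args->fsbno = MODEL_NULLFSBLOCK;	/* must not be reset to target */
}

/*
 * Attempt a stripe-aligned allocation first and fall back to a single
 * block minimum.  Returns the minlen that succeeded, or 0 on ENOSPC.
 */
uint32_t model_alloc_with_fallback(struct model_args *args, uint64_t target,
				   uint32_t stripe_unit)
{
	args->minlen = stripe_unit;
	model_alloc(args, target);
	if (args->fsbno != MODEL_NULLFSBLOCK)
		return args->minlen;

	args->minlen = 1;
	model_alloc(args, target);
	if (args->fsbno != MODEL_NULLFSBLOCK)
		return args->minlen;
	return 0;
}
```

If model_alloc wrote the target back into fsbno on failure, the first check would always look like success and the fallback would never run, which is the false positive the message describes.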
  3. 19 Apr, 2023 2 commits
  4. 13 Apr, 2023 21 commits
    • Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of... · 798352cb
      Dave Chinner authored
      Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix ascii-ci problems, then kill it [v2]
      
      Last week, I was fiddling around with the metadump name obfuscation code
      while writing a debugger command to generate directories full of names
      that all have the same hash name.  I had a few questions about how well
      all that worked with ascii-ci mode, and discovered a nasty discrepancy
      between the kernel and glibc's implementations of the tolower()
      function.
      
      I discovered that I could create a directory that is large enough to
      require separate leaf index blocks.  The hashes stored in the dabtree
      use the ascii-ci specific hash function, which uses a library function
      to convert the name to lowercase before hashing.  If the kernel and C
      library's versions of tolower do not behave exactly identically,
      xfs_ascii_ci_hashname will not produce the same results for the same
      inputs.  xfs_repair will deem the leaf information corrupt and rebuild
      the directory.  After that, lookups in the kernel will fail because the
      hash index doesn't work.
      
      The kernel's tolower function will convert extended ascii uppercase
      letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g.
      a-with-umlaut), whereas glibc's will only do that if you force LANG to
      ascii.  Tiny embedded libc implementations just plain won't do it at
      all, and the result is a mess.  Stabilize the behavior of the hash
      function by encoding the name transformation function in libxfs, add it
      to the selftest, and fix all the userspace tools, none of which handle
      this transformation correctly.
      
      The v1 series generated a /lot/ of discussion, in which several things
      became very clear: (1) Linus is not enamored of case folding of any
      kind; (2) Dave and Christoph don't seem to agree on whether the feature
      is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8
      encoded names if those happen to show up; and (4) I don't want to
      maintain this mess any longer than I have to.  Kill it in 2030.
      
      v2: rename the functions to make it clear we're moving away from the
      letters t, o, l, o, w, e, and r; and deprecate the whole feature once
      we've fixed the bugs and added tests.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
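A name transformation that behaves identically everywhere has to avoid the C library's locale-dependent tolower(). The sketch below shows one fixed, byte-wise mapping of that kind. It follows the behaviour described above (plain and accented uppercase letters are lowered), but the exact byte ranges and both function names are assumptions for illustration, not a copy of the libxfs code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Locale-independent lowercase transform on single bytes.  Hex constants
 * are used so the result does not depend on the execution character set.
 */
unsigned char model_ci_xfrm(unsigned char c)
{
	if (c >= 0x41 && c <= 0x5a)	/* A-Z */
		return c + 0x20;
	if (c == 0xd7)			/* multiplication sign: no lowercase form */
		return c;
	if (c >= 0xc0 && c <= 0xde)	/* accented uppercase letters (assumed range) */
		return c + 0x20;
	return c;
}

/* Compare two names of equal length after applying the transform. */
bool model_ci_names_equal(const unsigned char *a, const unsigned char *b,
			  size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (model_ci_xfrm(a[i]) != model_ci_xfrm(b[i]))
			return false;
	}
	return true;
}
```

Because the mapping is spelled out in the code itself, the kernel and every userspace tool that shares it would hash a given name the same way regardless of LANG or libc.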
    • Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of... · b89116c2
      Dave Chinner authored
      Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: strengthen rmapbt scrubbing [v24.5]
      
      This series strengthens space allocation record cross referencing by
      using AG block bitmaps to compute the difference between space used
      according to the rmap records and the primary metadata, and reports
      cross-referencing errors for any discrepancies.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of... · 43223ef7
      Dave Chinner authored
      Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: rework online fsck incore bitmap [v24.5]
      
      In this series, we make some changes to the incore bitmap code: First,
      we shorten the prefix to 'xbitmap'.  Then, we rework some utility
      functions for later use by online repair and clarify how the walk
      functions are supposed to be used.
      
      Finally, we use all these new pieces to convert the incore bitmap to use
      an interval tree instead of linked lists.  This lifts the limitation
      that callers had to be careful not to set a range that was already set;
      and gets us ready for the btree rebuilder functions needing to be able
      to set bits in a bitmap and generate maximal contiguous extents for the
      set ranges.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of... · bb09d765
      Dave Chinner authored
      Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: clean up memory management in xattr scrub [v24.5]
      
      Currently, the extended attribute scrubber uses a single VLA to store
      all the context information needed in various parts of the scrubber
      code.  This includes xattr leaf block space usage bitmaps, and the value
      buffer used to check the correctness of remote xattr value block
      headers.  We try to minimize the insanity through the use of helper
      functions, but this is a memory management nightmare.  Clean this up by
      making the bitmap and value pointers explicit members of struct
      xchk_xattr_buf.
      
      Second, strengthen the xattr checking by teaching it to look for overlapping
      data structures in the shortform attr data.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of... · b9fcf89f
      Dave Chinner authored
      Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: detect mergeable and overlapping btree records [v24.5]
      
      While I was doing differential fuzz analysis between xfs_scrub and
      xfs_repair, I noticed that xfs_repair was only partially effective at
      detecting btree records that can be merged, and xfs_scrub totally didn't
      notice at all.
      
      For every interval btree type except for the bmbt, there should never
      exist two adjacent records with adjacent keyspaces because the
      blockcount field is always large enough to span the entire keyspace of
      the domain.  This is because the free space, rmap, and refcount btrees
      have a blockcount field large enough to store the maximum AG length, and
      there can never be an allocation larger than an AG.
      
      The bmbt is a different story due to its ondisk encoding where the
      blockcount is only 21 bits wide.  Because AGs can span up to 2^31 blocks
      and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
      blocks will be expressed as two records of 2^21 length.  We don't
      opportunistically combine records when doing bmbt operations, which is
      why the fsck tools have never complained about this scenario.
      
      Offline repair is partially effective at detecting mergeable records
      because I taught it to do that for the rmap and refcount btrees.  This
      series enhances the free space, rmap, and refcount scrubbers to detect
      mergeable records.  For the bmbt, it will flag the file as being
      eligible for an optimization to shrink the size of the data structure.
      
      The last patch in this set also enhances the rmap scrubber to detect
      records that overlap incorrectly.  This check is done automatically for
      non-overlapping btree types, but we have to do it separately for the
      rmapbt because there are constraints on which allocation types are
      allowed to overlap.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
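The "mergeable record" condition for a btree keyed only by start block, such as the free space btrees, reduces to checking whether one record ends exactly where the next begins. A minimal sketch follows, with hypothetical struct and function names. The rmap and refcount btrees would additionally need matching owner or refcount fields, which this model leaves out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record, sorted by startblock within an AG. */
struct model_extent {
	uint32_t startblock;
	uint32_t blockcount;
};

/* Two neighbouring records are mergeable if b starts exactly where a ends. */
bool model_mergeable(const struct model_extent *a,
		     const struct model_extent *b)
{
	/* Widen before adding so the sum cannot wrap around. */
	return (uint64_t)a->startblock + a->blockcount == b->startblock;
}

/* Count adjacent record pairs that should have been a single record. */
size_t model_count_mergeable(const struct model_extent *recs, size_t nr)
{
	size_t found = 0;

	for (size_t i = 1; i < nr; i++) {
		if (model_mergeable(&recs[i - 1], &recs[i]))
			found++;
	}
	return found;
}
```

A scrubber built along these lines would treat any non-zero count as a finding, since the blockcount field of these btrees is wide enough that such a pair never needs to exist.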
    • Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of... · d808a8e6
      Dave Chinner authored
      Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: merge bmap records for faster scrubs [v24.5]
      
      I started looking into performance problems with the data fork scrubber
      in generic/333, and noticed a few things that needed improving.  First,
      due to design reasons, it's possible for file fork btrees to contain
      multiple contiguous mappings to the same physical space.  Instead of
      checking each ondisk mapping individually, it's much faster to combine
      them when possible and check the combined mapping because that's fewer
      trips through the rmap btree, and we can drop this check-around
      behavior that it does when an rmapbt lookup produces a record that
      starts before or ends after a particular bmbt mapping.
      
      Second, I noticed that the bmbt scrubber decides to walk every reverse
      mapping in the filesystem if the file fork is in btree format.  This is
      very costly, and only necessary if the inode repair code had to zap a
      fork to convince iget to work.  Constraining the full-rmap scan to this
      one case means we can skip it for normal files, which drives the runtime
      of this test from 8 hours down to 45 minutes (observed with realtime
      reflink and rebuild-all mode.)

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
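Combining contiguous mappings before cross-referencing them, as the first part of this series describes, can be sketched as an in-place merge over a sorted array. The struct and function names are hypothetical, and the model ignores mapping state (for example written versus unwritten), which a real implementation would presumably also have to compare.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file mapping: file offset -> physical blocks. */
struct model_bmap {
	uint64_t fileoff;
	uint64_t startblock;
	uint64_t blockcount;
};

/*
 * Merge mappings that are contiguous in both file offset and physical
 * space.  The array must be sorted by fileoff.  Merging happens in place;
 * the return value is the number of mappings left to check.
 */
size_t model_merge_mappings(struct model_bmap *maps, size_t nr)
{
	size_t out = 0;

	if (nr == 0)
		return 0;

	for (size_t i = 1; i < nr; i++) {
		struct model_bmap *prev = &maps[out];
		const struct model_bmap *cur = &maps[i];

		if (prev->fileoff + prev->blockcount == cur->fileoff &&
		    prev->startblock + prev->blockcount == cur->startblock) {
			prev->blockcount += cur->blockcount;	/* extend */
		} else {
			maps[++out] = *cur;			/* start a new run */
		}
	}
	return out + 1;
}
```

Each merged mapping then needs only one rmap lookup instead of one per ondisk record, which is the saving the message attributes to fewer trips through the rmap btree.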
    • Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of... · 1e791234
      Dave Chinner authored
      Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix iget/irele usage in online fsck [v24.5]
      
      This patchset fixes a handful of problems relating to how we get and
      release incore inodes in the online scrub code.  The first patch fixes
      how we handle DONTCACHE -- our reasons for setting (or clearing) it
      depend entirely on the runtime environment at irele time.  Hence we can
      refactor iget and irele to use our own wrappers that set that context
      appropriately.
      
      The second patch fixes a race between the iget call in the inode core
      scrubber and other writer threads that are allocating or freeing inodes
      in the same AG by changing the behavior of xchk_iget (and the inode core
      scrub setup function) to return either an incore inode or the AGI buffer
      so that we can be sure that the inode cannot disappear on us.
      
      The final patch elides MMAPLOCK from scrub paths when possible.  It did
      not fit anywhere else.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of... · a4466722
      Dave Chinner authored
      Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix bugs in parent pointer checking [v24.5]
      
      Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
      subdirectory that is being moved from one parent to another.  Upon
      deeper analysis, I realized that this was the source of a very hard to
      trigger false corruption report in the parent pointer checking code.
      
      Now that we've refactored how directory walks work in scrub, we can also
      get rid of all the unnecessary and broken locking to make parent pointer
      scrubbing work properly.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of... · f697c2cc
      Dave Chinner authored
      Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix iget usage in directory scrub [v24.5]
      
      In this series, we fix some problems with how the directory scrubber
      grabs child inodes.  First, we want to reduce EDEADLOCK returns by
      replacing fixed-iteration loops with interruptible trylock loops.
      Second, we add UNTRUSTED to the child iget call so that we can detect a
      dirent that points to an unallocated inode.  Third, we fix a bug where
      we weren't checking the inode pointed to by dotdot entries at all.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f697c2cc
    • Dave Chinner's avatar
      Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of... · b1bdab25
      Dave Chinner authored
      Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: detect incorrect gaps in rmap btree [v24.5]
      
      Following in the theme of the last two patchsets, this one strengthens
      the rmap btree record checking so that scrub can count the number of
      space records that map to a given owner and that do not map to a given
      owner.  This enables us to determine exclusive ownership of space that
      can't be shared.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b1bdab25
    • Dave Chinner's avatar
      Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of... · f1121b99
      Dave Chinner authored
      Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: detect incorrect gaps in inode btree [v24.5]
      
      This series continues the corrections for a couple of problems I found
      in the inode btree scrubber.  The first problem is that we don't
      directly check that the inobt records correspond to the
      finobt records, and vice versa.  The second problem occurs on
      filesystems with sparse inode chunks -- the cross-referencing we do
      detects sparseness, but it doesn't actually check the consistency
      between the inobt hole records and the rmap data.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f1121b99
    • Dave Chinner's avatar
      Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of... · e7cef2fe
      Dave Chinner authored
      Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: detect incorrect gaps in refcount btree [v24.5]
      
      The next few patchsets address a deficiency in scrub that I found while
      QAing the refcount btree scrubber.  If there's a gap between refcount
      records, we need to cross-reference that gap with the reverse mappings
      to ensure that there are no overlapping records in the rmap btree.  If
      we find any, then the refcount btree is not consistent.  This is not a
      property that is specific to the refcount btree; they all need to have
      this sort of keyspace scanning logic to detect inconsistencies.
      
      To do this accurately, we need to be able to scan the keyspace of a
      btree (which we already do) to be able to tell the caller if the
      keyspace is empty, sparse, or fully covered by records.  The first few
      patches add the keyspace scanner to the generic btree code, along with
      the ability to mask off parts of btree keys because when we scan the
      rmapbt, we only care about space usage, not the owners.
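
      As a rough illustration of the empty/sparse/full classification
      described above, here is a standalone sketch.  It is not the generic
      btree code: demo_rec, demo_packing and demo_scan_keyspace are
      hypothetical names, the records live in a sorted array instead of a
      btree, and "masking" is modelled simply by never looking at the owner
      field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_packing { DEMO_EMPTY, DEMO_SPARSE, DEMO_FULL };

/* Hypothetical space record; the array passed in is sorted by start. */
struct demo_rec {
	uint64_t start;
	uint64_t len;	/* assumed nonzero, start + len must not overflow */
	uint64_t owner;	/* deliberately ignored: only space usage matters */
};

/* Classify how the records cover the inclusive key range [lo, hi]. */
static enum demo_packing
demo_scan_keyspace(const struct demo_rec *recs, size_t nr,
		   uint64_t lo, uint64_t hi)
{
	uint64_t next = lo;	/* first key not yet known to be covered */
	bool seen = false, gap = false, reached_end = false;
	size_t i;

	assert(lo <= hi);

	for (i = 0; i < nr; i++) {
		uint64_t rs = recs[i].start;
		uint64_t re = rs + recs[i].len - 1;

		if (re < lo)
			continue;	/* entirely before the range */
		if (rs > hi)
			break;		/* entirely after the range */

		seen = true;
		if (rs > next)
			gap = true;	/* hole between coverage and record */
		if (re >= hi) {
			reached_end = true;
			break;
		}
		if (re + 1 > next)
			next = re + 1;	/* overlapping records are fine */
	}

	if (!seen)
		return DEMO_EMPTY;
	if (gap || !reached_end)
		return DEMO_SPARSE;
	return DEMO_FULL;
}
```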
      
      The final patch closes the scanning gap in the refcountbt scanner.
      
      v23.1: create helpers for the key extraction and comparison functions,
             improve documentation, and eliminate the ->mask_key indirect
             calls
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e7cef2fe
    • Dave Chinner's avatar
      Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of... · 6858c887
      Dave Chinner authored
      Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: enhance btree key scrubbing [v24.5]
      
      This series fixes the scrub btree block checker to ensure that the keys
      in the parent block accurately represent the block, and check the
      ordering of all interior key records.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      6858c887
    • Dave Chinner's avatar
      Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of... · 1ee75505
      Dave Chinner authored
      Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix rmap btree key flag handling [v24.5]
      
      This series fixes numerous flag handling bugs in the rmapbt key code.
      The most serious transgression is that key comparisons completely strip
      out all flag bits from rm_offset, including the ones that participate in
      record lookups.  The second problem is that for years we've been letting
      the unwritten flag (which is an attribute of a specific record and not
      part of the record key) escape from leaf records into key records.
      
      The solution to the second problem is to filter attribute flags when
      creating keys from records, and the solution to the first problem is to
      preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
      are a part of the lookup key, and the UNWRITTEN flag is a record
      attribute.
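
      The two fixes can be sketched in a few lines of standalone C.  This
      is an illustration only: the DEMO_OFF_* names and their bit positions
      are assumptions made for the example, and demo_key_offset and
      demo_cmp_key_offsets are hypothetical helpers, not the libxfs
      functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits packed into the top of a 64-bit offset field. */
#define DEMO_OFF_ATTR_FORK	((uint64_t)1 << 63)	/* part of the key */
#define DEMO_OFF_BMBT_BLOCK	((uint64_t)1 << 62)	/* part of the key */
#define DEMO_OFF_UNWRITTEN	((uint64_t)1 << 61)	/* record attribute */

/* Fix for the second problem: drop attribute flags when making a key. */
static uint64_t demo_key_offset(uint64_t rec_offset)
{
	return rec_offset & ~DEMO_OFF_UNWRITTEN;
}

/*
 * Fix for the first problem: compare keys with the lookup flags intact,
 * ignoring only the attribute flag.  Returns <0, 0 or >0.
 */
static int demo_cmp_key_offsets(uint64_t a, uint64_t b)
{
	a = demo_key_offset(a);
	b = demo_key_offset(b);

	/* Sanity check: the attribute bit never takes part in ordering. */
	assert(!((a | b) & DEMO_OFF_UNWRITTEN));

	return (a > b) - (a < b);
}
```

      With these helpers, an attr fork mapping and a data fork mapping at
      the same file offset no longer compare as equal, while an unwritten
      and a written mapping at the same offset still do.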
      
      This has worked for years without generating user complaints because
      ATTR and BMBT extents cannot be shared, so key comparisons succeed
      solely on rm_startblock.  Only file data fork extents can be shared, and
      those records never set any of the three flag bits, so comparisons that
      dig into rm_owner and rm_offset work just fine.
      
      A filesystem written with an unpatched kernel and mounted on a patched
      kernel will work correctly because the ATTR/BMBT flags have been
      conveyed into keys correctly all along, and we still ignore the
      UNWRITTEN flag in any key record.  This was what doomed my previous
      attempt to correct this problem in 2019.
      
      A filesystem written with a patched kernel and mounted on an unpatched
      kernel will also work correctly because unpatched kernels ignore all
      flags.
      
      With this patchset applied, the scrub code gains the ability to detect
      rmap btrees with incorrectly set attr and bmbt flags in the key records.
      After three years of testing, I haven't encountered any problems.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      1ee75505
    • Dave Chinner's avatar
      Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of... · b764ea20
      Dave Chinner authored
      Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: hoist scrub record checks into libxfs [v24.5]
      
      There are a few things about btree records that scrub checked but the
      libxfs _get_rec functions didn't.  Move these bits into libxfs so that
      everyone can benefit.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b764ea20
    • Dave Chinner's avatar
      Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of... · 01822a74
      Dave Chinner authored
      Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: standardize btree record checking code [v24.5]
      
      While I was cleaning things up for 6.1, I noticed that the btree
      _query_range and _query_all functions don't perform the same checking
      that the _get_rec functions perform.  In fact, they don't perform /any/
      sanity checking, which means that callers aren't warned about impossible
      records.
      
      Therefore, hoist the record validation and complaint logging code into
      separate functions, and call them from any place where we convert an
      ondisk record into an incore record.  For online scrub, we can replace
      checking code with a call to the record checking functions in libxfs,
      thereby reducing the size of the codebase.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      01822a74
    • Dave Chinner's avatar
      Merge tag 'scrub-drain-intents-6.4_2023-04-11' of... · b634abac
      Dave Chinner authored
      Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: drain deferred work items when scrubbing [v24.5]
      
      The design doc for XFS online fsck contains a long discussion of the
      eventual consistency models in use for XFS metadata.  In that chapter,
      we note that it is possible for scrub to collide with a chain of
      deferred space metadata updates, and propose a lightweight solution:
      the use of a pending-intents counter so that scrub can wait for the
      system to drain all chains.
      
      This patchset implements that scrub drain.  The first patch implements
      the basic mechanism, and the subsequent patches reduce the runtime
      overhead by converting the implementation to use sloppy counters and
      introducing jump labels to avoid walking into scrub hooks when it isn't
      running.  This last paradigm repeats elsewhere in this megaseries.
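
      The basic mechanism can be pictured with a small sketch.  It is a
      simplification under stated assumptions: demo_group and the demo_*
      helpers are hypothetical, a plain C11 atomic stands in for the
      counter, and instead of sleeping until the count reaches zero the
      "scrub" side just reports -EAGAIN so that its caller can back off and
      retry.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-group state: number of intents not yet finished. */
struct demo_group {
	atomic_int pending_intents;
};

static void demo_group_init(struct demo_group *g)
{
	atomic_init(&g->pending_intents, 0);
}

/* Called when a deferred work item targeting this group is created. */
static void demo_intent_bump(struct demo_group *g)
{
	atomic_fetch_add(&g->pending_intents, 1);
}

/* Called when that deferred work item has been completely processed. */
static void demo_intent_drop(struct demo_group *g)
{
	int old = atomic_fetch_sub(&g->pending_intents, 1);

	assert(old > 0);	/* drops must pair with earlier bumps */
	(void)old;
}

static bool demo_drain_busy(struct demo_group *g)
{
	return atomic_load(&g->pending_intents) > 0;
}

/*
 * Scrub side: only proceed when no chain of deferred updates is in
 * flight; otherwise ask the caller to wait and try again.
 */
static int demo_scrub_try_lock(struct demo_group *g)
{
	if (demo_drain_busy(g))
		return -EAGAIN;
	return 0;
}
```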
      
      v23.1: make intent items take an active ref to the perag structure and
             document why we bump and drop the intent counts when we do
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b634abac
    • Dave Chinner's avatar
      Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of... · 793f5c2c
      Dave Chinner authored
      Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs_scrub: fix licensing and copyright notices [v24.5]
      
      Fix various attribution problems in the xfs_scrub source code, such as
      the author's contact information, out of date SPDX tags, and a rough
      estimate of when the feature was under heavy development.  The most
      egregious parts are the files that are missing license information
      completely.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      793f5c2c
    • Dave Chinner's avatar
      Merge tag 'pass-perag-refs-6.4_2023-04-11' of... · 1e5ffdc5
      Dave Chinner authored
      Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: pass perag references around when possible [v24.5]
      
      Avoid the cost of perag radix tree lookups by passing around active perag
      references when possible.
      
      v24.2: rework some of the naming and whatnot so there's less opencoding
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      1e5ffdc5
    • Dave Chinner's avatar
      Merge tag 'intents-perag-refs-6.4_2023-04-11' of... · 826053db
      Dave Chinner authored
      Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: make intent items take a perag reference [v24.5]
      
      Now that we've cleaned up some code warts in the deferred work item
      processing code, let's make intent items take an active perag reference
      from their creation until they are finally freed by the defer ops
      machinery.  This change facilitates the scrub drain in the next patchset
      and will make it easier for the future AG removal code to detect a busy
      AG in need of quiescing.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      826053db
    • Dave Chinner's avatar
      Merge tag 'online-fsck-design-6.4_2023-04-11' of... · bed25d80
      Dave Chinner authored
      Merge tag 'online-fsck-design-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: design documentation for online fsck [v24.5]
      
      After six years of development and a nearly two year hiatus from
      patchbombing, I think it is time to resume the process of merging the
      online fsck feature into XFS.  The full patchset comprises 105 separate
      patchsets that capture 470 patches across the kernel, xfsprogs, and
      fstests projects.
      
      I would like to merge this feature into upstream in time for the 2023
      LTS kernel.  As of 5.15 (aka last year's LTS), we have merged all
      generally useful infrastructure improvements into the regular
      filesystem.  The only changes to the core filesystem that remain are the
      ones that are only useful to online fsck itself.  In other words, the
      vast majority of the new code in the patchsets comprising the online
      fsck feature is self-contained and can be turned off via
      Kconfig.
      
      Many of you readers might be wondering -- why have I chosen to make one
      large submission with 100+ patchsets comprising ~500 patches?  Why
      didn't I merge small pieces of functionality bit by bit and revise
      common code as necessary?  Well, the simple answer is that in the past
      six years, the fundamental algorithms have been revised repeatedly as
      I've built out the functionality.  In other words, the codebase as it is
      now has the benefit that I now know every piece that's necessary to get
      the job done in a reasonable manner and within the constraints laid out
      by community reviews.  I believe this has reduced code churn in mainline
      and freed up my time so that I can iterate faster.
      
      As a concession to the mail servers, I'm breaking up the submission into
      smaller pieces; I'm only pushing the design document and the revisions
      to the existing scrub code, which is the first 20% of the patches.
      Also, I'm arbitrarily restarting the version numbering by reversioning
      all patchsets from version 22 to epoch 23, version 1.
      
      The big question to everyone reading this is: How might I convince you
      that there is more merit in merging the whole feature and dealing with
      the consequences than continuing to maintain it out of tree?
      
      ---------
      
      To prepare the XFS community and potential patch reviewers for the
      upstream submission of the online fsck feature, I decided to write a
      document capturing the broader picture behind the online repair
      development effort.  The document begins by defining the problems that
      online fsck aims to solve and outlining specific use cases for the
      functionality.
      
      Using that as a base, the rest of the design document presents the high
      level algorithms that fulfill the goals set out at the start and the
      interactions between the large pieces of the system.  Case studies round
      out the design documentation by adding the details of exactly how
      specific parts of the online fsck code integrate the algorithms with the
      filesystem.
      
      The goal of this effort is to help the XFS community understand how the
      gigantic online repair patchset works.  The questions I submit to the
      community reviewers are:
      
      1. As you read the design doc (and later the code), do you feel that you
         understand what's going on well enough to try to fix a bug if you
         found one?
      
      2. What sorts of interactions between systems (or between scrub and the
         rest of the kernel) am I missing?
      
      3. Do you feel confident enough in the implementation as it is now that
         the benefits of merging the feature (as EXPERIMENTAL) outweigh any
         potential disruptions to XFS at large?
      
      4. Are there problematic interactions between subsystems that ought to
         be cleared up before merging?
      
      5. Can I just merge all of this?
      
      I intend to commit this document to the kernel's documentation directory
      when we start merging the patchset, albeit without the links to
      git.kernel.org.  A much more readable version of this is posted at:
      https://djwong.org/docs/xfs-online-fsck-design/
      
      v2: add missing sections about: all the in-kernel data structures and
          new apis that the scrub and repair functions use; how xattrs and
          directories are checked; how space btree records are checked; and
          add more details to the parts where all these bits tie together.
          Proofread for verb tense inconsistencies and eliminate vague 'we'
          usage.  Move all the discussion of what we can do with pageable
          kernel memory into a single source file and section.  Document where
          log incompat feature locks fit into the locking model.
      
      v3: resync with 6.0, fix a few typos, begin discussion of the merging
          plan for this megapatchset.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      bed25d80
  5. 12 Apr, 2023 7 commits
    • Ye Bin's avatar
      xfs: fix BUG_ON in xfs_getbmap() · 8ee81ed5
      Ye Bin authored
      There's an issue as follows:
      XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
      ------------[ cut here ]------------
      kernel BUG at fs/xfs/xfs_message.c:102!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
      RIP: 0010:assfail+0x96/0xa0
      RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
      RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
      RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
      R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
      R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
      FS:  00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       xfs_getbmap+0x1a5b/0x1e40
       xfs_ioc_getbmap+0x1fd/0x5b0
       xfs_file_ioctl+0x2cb/0x1d50
       __x64_sys_ioctl+0x197/0x210
       do_syscall_64+0x39/0xb0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The above issue may happen as follows:
               ThreadA                       ThreadB
      do_shared_fault
       __do_fault
        xfs_filemap_fault
         __xfs_filemap_fault
          filemap_fault
                                   xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
                                    xfs_getbmap
                                     xfs_ilock(ip, XFS_IOLOCK_SHARED);
                                     filemap_write_and_wait
       do_page_mkwrite
        xfs_filemap_page_mkwrite
         __xfs_filemap_fault
          xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
          iomap_page_mkwrite
           ...
           xfs_buffered_write_iomap_begin
            xfs_bmapi_reserve_delalloc -> Allocate delay extent
                                     xfs_ilock_data_map_shared(ip)
                                     xfs_getbmap_report_one
                                      ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
                                       -> trigger BUG_ON
      
      As xfs_filemap_page_mkwrite() only holds the XFS_MMAPLOCK_SHARED lock, there's a
      small window in which mkwrite can produce a delalloc extent after xfs_getbmap()
      has written back the file.  To solve the above issue, just skip delalloc extents.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8ee81ed5
    • Darrick J. Wong's avatar
      xfs: verify buffer contents when we skip log replay · 22ed903e
      Darrick J. Wong authored
      syzbot detected a crash during log recovery:
      
      XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
      XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
      XFS (loop0): Starting recovery (logdev: internal)
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
      Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074
      
      CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
       print_address_description+0x74/0x340 mm/kasan/report.c:306
       print_report+0x107/0x1f0 mm/kasan/report.c:417
       kasan_report+0xcd/0x100 mm/kasan/report.c:517
       xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
       xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
       xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
       xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
       xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
       xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
       xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
       xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
       xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
       xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
       get_tree_bdev+0x400/0x620 fs/super.c:1282
       vfs_get_tree+0x88/0x270 fs/super.c:1489
       do_new_mount+0x289/0xad0 fs/namespace.c:3145
       do_mount fs/namespace.c:3488 [inline]
       __do_sys_mount fs/namespace.c:3697 [inline]
       __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f89fa3f4aca
      Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
      RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
      RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
      RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
      R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
      R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
       </TASK>
      
      The fuzzed image contains an AGF with an obviously garbage
      agf_refcount_level value of 32, and a dirty log with a buffer log item
      for that AGF.  The ondisk AGF has a higher LSN than the recovered log
      item.  xlog_recover_buf_commit_pass2 reads the buffer, compares the
      LSNs, and decides to skip replay because the ondisk buffer appears to be
      newer.
      
      Unfortunately, the ondisk buffer is corrupt, but recovery just read the
      buffer with no buffer ops specified:
      
      	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
      			buf_f->blf_len, buf_flags, &bp, NULL);
      
      Skipping the buffer leaves its contents in memory unverified.  This sets
      us up for a kernel crash because xfs_refcount_recover_cow_leftovers
      reads the buffer (which is still around in XBF_DONE state, so no read
      verification) and creates a refcountbt cursor of height 32.  This is
      impossible so we run off the end of the cursor object and crash.
      
      Fix this by invoking the verifier on all skipped buffers and aborting
      log recovery if the ondisk buffer is corrupt.  It might be smarter to
      force replay the log item atop the buffer and then see if it'll pass the
      write verifier (like ext4 does) but for now let's go with the
      conservative option where we stop immediately.
      
      Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      22ed903e
    • Darrick J. Wong's avatar
      xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done · c95356ca
      Darrick J. Wong authored
      While fuzzing the data fork extent count on a btree-format directory
      with xfs/375, I observed the following (excerpted) splat:
      
      XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      The cause of this is a race condition in xfs_ilock_data_map_shared,
      which performs an unlocked access to the data fork to guess which lock
      mode it needs:
      
      Thread 0                          Thread 1
      
      xfs_need_iread_extents
      <observe no iext tree>
      xfs_ilock(..., ILOCK_EXCL)
      xfs_iread_extents
      <observe no iext tree>
      <check ILOCK_EXCL>
      <load bmbt extents into iext>
      <notice iext size doesn't
       match nextents>
                                        xfs_need_iread_extents
                                        <observe iext tree>
                                        xfs_ilock(..., ILOCK_SHARED)
      <tear down iext tree>
      xfs_iunlock(..., ILOCK_EXCL)
                                        xfs_iread_extents
                                        <observe no iext tree>
                                        <check ILOCK_EXCL>
                                        *BOOM*
      
      Fix this race by adding a flag to the xfs_ifork structure to indicate
      that we have not yet read in the extent records and changing the
      predicate to look at the flag state, not if_height.  The memory barrier
      ensures that the flag will not be set until the very end of the
      function.
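
      A userspace sketch of the flag-plus-barrier idea, using C11
      acquire/release atomics in place of kernel barriers.  The names
      (demo_fork, need_extents, demo_read_extents, demo_pick_lock_mode) and
      the flag polarity are assumptions for illustration; the sketch only
      shows that the flag changes state as the very last step of a
      successful load, so a reader never picks the shared lock mode based
      on a half-built or torn-down extent tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_EFSCORRUPTED	117	/* hypothetical error code */

enum demo_lock_mode { DEMO_ILOCK_SHARED, DEMO_ILOCK_EXCL };

/* Toy fork: need_extents is true until extents are fully loaded. */
struct demo_fork {
	atomic_bool need_extents;
	int nr_loaded;
};

static void demo_fork_init(struct demo_fork *f)
{
	f->nr_loaded = 0;
	atomic_init(&f->need_extents, true);
}

/* Predicate looks at the flag, not at the state of the tree itself. */
static bool demo_need_read(struct demo_fork *f)
{
	return atomic_load_explicit(&f->need_extents, memory_order_acquire);
}

static enum demo_lock_mode demo_pick_lock_mode(struct demo_fork *f)
{
	return demo_need_read(f) ? DEMO_ILOCK_EXCL : DEMO_ILOCK_SHARED;
}

/*
 * Simulated extent load; 'found' is how many records the walk produced.
 * On a mismatch the tree is torn down and the flag is left untouched.
 */
static int demo_read_extents(struct demo_fork *f, int expected, int found)
{
	assert(expected >= 0 && found >= 0);

	f->nr_loaded = found;
	if (found != expected) {
		f->nr_loaded = 0;		/* tear down */
		return -DEMO_EFSCORRUPTED;
	}

	/* Publish only after everything above is complete. */
	atomic_store_explicit(&f->need_extents, false, memory_order_release);
	return 0;
}
```

      In this model a failed load leaves need_extents true, so the next
      caller still asks for the exclusive lock mode.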
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      c95356ca
    • Dave Chinner's avatar
      xfs: remove WARN when dquot cache insertion fails · 4b827b3f
      Dave Chinner authored
      It just creates unnecessary bot noise these days.
      
      Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      4b827b3f
    • Dave Chinner's avatar
      xfs: don't consider future format versions valid · aa880198
      Dave Chinner authored
      In commit fe08cc50 we reworked the valid superblock version
      checks. If it is a V5 filesystem, it is always valid, then we
      checked if the version was less than V4 (reject) and then checked
      feature fields in the V4 flags to determine if it was valid.
      
      What we missed was that if the version is not V4 at this point,
      we should reject the fs, i.e. the check currently treats V6+
      filesystems as if they were V4 filesystems. Fix this.
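
      The corrected control flow, as a hedged standalone sketch.  The
      function and constant names are hypothetical and the V4 feature check
      is a placeholder; only the ordering of the version tests follows the
      description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SB_VERSION_4	4
#define DEMO_SB_VERSION_5	5

/* Placeholder feature check: pretend one bit is mandatory on V4. */
#define DEMO_V4_REQUIRED_BIT	0x0010

static bool demo_v4_features_ok(uint16_t flags)
{
	return (flags & DEMO_V4_REQUIRED_BIT) != 0;
}

static bool demo_sb_good_version(unsigned int version, uint16_t flags)
{
	/* V5 is accepted up front, as described in the commit message. */
	if (version == DEMO_SB_VERSION_5)
		return true;

	/*
	 * The fix: anything that is not V4 at this point is rejected.
	 * A "version < 4" test alone would let V6+ fall through and be
	 * treated as V4.
	 */
	if (version != DEMO_SB_VERSION_4)
		return false;

	return demo_v4_features_ok(flags);
}
```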
      
      cc: stable@vger.kernel.org
      Fixes: fe08cc50 ("xfs: open code sb verifier feature checks")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      aa880198
    • Darrick J. Wong's avatar
      xfs: deprecate the ascii-ci feature · 7ba83850
      Darrick J. Wong authored
      This feature is a mess -- the hash function has been broken for the
      entire 15 years of its existence if you create names with extended ascii
      bytes; metadump name obfuscation has silently failed for just as long;
      and the feature clashes horribly with the UTF8 encodings that most
      systems use today.  There is exactly one fstest for this feature.
      
      In other words, this feature is crap.  Let's deprecate it now so we can
      remove it from the codebase in 2030.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7ba83850
    • Darrick J. Wong's avatar
      xfs: test the ascii case-insensitive hash · 6db09a8d
      Darrick J. Wong authored
      Now that we've made kernel and userspace use the same tolower code for
      computing directory index hashes, add that to the selftest code.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      6db09a8d