1. 13 Apr, 2023 10 commits
    • Dave Chinner's avatar
      Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of... · e7cef2fe
      Dave Chinner authored
      Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: detect incorrect gaps in refcount btree [v24.5]
      
      The next few patchsets address a deficiency in scrub that I found while
      QAing the refcount btree scrubber.  If there's a gap between refcount
      records, we need to cross-reference that gap with the reverse mappings
      to ensure that there are no overlapping records in the rmap btree.  If
      we find any, then the refcount btree is not consistent.  This is not a
      property that is specific to the refcount btree; they all need to have
      this sort of keyspace scanning logic to detect inconsistencies.
      
      To do this accurately, we need to be able to scan the keyspace of a
      btree (which we already do) to be able to tell the caller if the
      keyspace is empty, sparse, or fully covered by records.  The first few
      patches add the keyspace scanner to the generic btree code, along with
      the ability to mask off parts of btree keys because when we scan the
      rmapbt, we only care about space usage, not the owners.
      
      The final patch closes the scanning gap in the refcountbt scanner.
      
      v23.1: create helpers for the key extraction and comparison functions,
             improve documentation, and eliminate the ->mask_key indirect
             calls
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e7cef2fe
    • Dave Chinner's avatar
      Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of... · 6858c887
      Dave Chinner authored
      Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: enhance btree key scrubbing [v24.5]
      
      This series fixes the scrub btree block checker to ensure that the keys
      in the parent block accurately represent the block, and check the
      ordering of all interior key records.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      6858c887
    • Dave Chinner's avatar
      Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of... · 1ee75505
      Dave Chinner authored
      Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: fix rmap btree key flag handling [v24.5]
      
      This series fixes numerous flag handling bugs in the rmapbt key code.
      The most serious transgression is that key comparisons completely strip
      out all flag bits from rm_offset, including the ones that participate in
      record lookups.  The second problem is that for years we've been letting
      the unwritten flag (which is an attribute of a specific record and not
      part of the record key) escape from leaf records into key records.
      
      The solution to the second problem is to filter attribute flags when
      creating keys from records, and the solution to the first problem is to
      preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
      are a part of the lookup key, and the UNWRITTEN flag is a record
      attribute.
      
      This has worked for years without generating user complaints because
      ATTR and BMBT extents cannot be shared, so key comparisons succeed
      solely on rm_startblock.  Only file data fork extents can be shared, and
      those records never set any of the three flag bits, so comparisons that
      dig into rm_owner and rm_offset work just fine.
      
      A filesystem written with an unpatched kernel and mounted on a patched
      kernel will work correctly because the ATTR/BMBT flags have been
      conveyed into keys correctly all along, and we still ignore the
      UNWRITTEN flag in any key record.  This was what doomed my previous
      attempt to correct this problem in 2019.
      
      A filesystem written with a patched kernel and mounted on an unpatched
      kernel will also work correctly because unpatched kernels ignore all
      flags.
      
      With this patchset applied, the scrub code gains the ability to detect
      rmap btrees with incorrectly set attr and bmbt flags in the key records.
      After three years of testing, I haven't encountered any problems.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1ee75505
    • Dave Chinner's avatar
      Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of... · b764ea20
      Dave Chinner authored
      Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: hoist scrub record checks into libxfs [v24.5]
      
      There are a few things about btree records that scrub checked but the
      libxfs _get_rec functions didn't.  Move these bits into libxfs so that
      everyone can benefit.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b764ea20
    • Dave Chinner's avatar
      Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of... · 01822a74
      Dave Chinner authored
      Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: standardize btree record checking code [v24.5]
      
      While I was cleaning things up for 6.1, I noticed that the btree
      _query_range and _query_all functions don't perform the same checking
      that the _get_rec functions perform.  In fact, they don't perform /any/
      sanity checking, which means that callers aren't warned about impossible
      records.
      
      Therefore, hoist the record validation and complaint logging code into
      separate functions, and call them from any place where we convert an
      ondisk record into an incore record.  For online scrub, we can replace
      checking code with a call to the record checking functions in libxfs,
      thereby reducing the size of the codebase.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      01822a74
    • Dave Chinner's avatar
      Merge tag 'scrub-drain-intents-6.4_2023-04-11' of... · b634abac
      Dave Chinner authored
      Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: drain deferred work items when scrubbing [v24.5]
      
      The design doc for XFS online fsck contains a long discussion of the
      eventual consistency models in use for XFS metadata.  In that chapter,
      we note that it is possible for scrub to collide with a chain of
      deferred space metadata updates, and proposes a lightweight solution:
      The use of a pending-intents counter so that scrub can wait for the
      system to drain all chains.
      
      This patchset implements that scrub drain.  The first patch implements
      the basic mechanism, and the subsequent patches reduce the runtime
      overhead by converting the implementation to use sloppy counters and
      introducing jump labels to avoid walking into scrub hooks when it isn't
      running.  This last paradigm repeats elsewhere in this megaseries.
      
      v23.1: make intent items take an active ref to the perag structure and
             document why we bump and drop the intent counts when we do
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b634abac
    • Dave Chinner's avatar
      Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of... · 793f5c2c
      Dave Chinner authored
      Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs_scrub: fix licensing and copyright notices [v24.5]
      
      Fix various attribution problems in the xfs_scrub source code, such as
      the author's contact information, out of date SPDX tags, and a rough
      estimate of when the feature was under heavy development.  The most
      egregious parts are the files that are missing license information
      completely.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      793f5c2c
    • Dave Chinner's avatar
      Merge tag 'pass-perag-refs-6.4_2023-04-11' of... · 1e5ffdc5
      Dave Chinner authored
      Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: pass perag references around when possible [v24.5]
      
      Avoid the cost of perag radix tree lookups by passing around active perag
      references when possible.
      
      v24.2: rework some of the naming and whatnot so there's less opencoding
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1e5ffdc5
    • Dave Chinner's avatar
      Merge tag 'intents-perag-refs-6.4_2023-04-11' of... · 826053db
      Dave Chinner authored
      Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: make intent items take a perag reference [v24.5]
      
      Now that we've cleaned up some code warts in the deferred work item
      processing code, let's make intent items take an active perag reference
      from their creation until they are finally freed by the defer ops
      machinery.  This change facilitates the scrub drain in the next patchset
      and will make it easier for the future AG removal code to detect a busy
      AG in need of quiescing.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      826053db
    • Dave Chinner's avatar
      Merge tag 'online-fsck-design-6.4_2023-04-11' of... · bed25d80
      Dave Chinner authored
      Merge tag 'online-fsck-design-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
      
      xfs: design documentation for online fsck [v24.5]
      
      After six years of development and a nearly two year hiatus from
      patchbombing, I think it is time to resume the process of merging the
      online fsck feature into XFS.  The full patchset comprises 105 separate
      patchsets that capture 470 patches across the kernel, xfsprogs, and
      fstests projects.
      
      I would like to merge this feature into upstream in time for the 2023
      LTS kernel.  As of 5.15 (aka last year's LTS), we have merged all
      generally useful infrastructure improvements into the regular
      filesystem.  The only changes to the core filesystem that remain are the
      ones that are only useful to online fsck itself.  In other words, the
      vast majority of the new code in the patchsets comprising the online
      fsck feature are is mostly self contained and can be turned off via
      Kconfig.
      
      Many of you readers might be wondering -- why have I chosen to make one
      large submission with 100+ patchsets comprising ~500 patches?  Why
      didn't I merge small pieces of functionality bit by bit and revise
      common code as necessary?  Well, the simple answer is that in the past
      six years, the fundamental algorithms have been revised repeatedly as
      I've built out the functionality.  In other words, the codebase as it is
      now has the benefit that I now know every piece that's necessary to get
      the job done in a reasonable manner and within the constraints laid out
      by community reviews.  I believe this has reduced code churn in mainline
      and freed up my time so that I can iterate faster.
      
      As a concession to the mail servers, I'm breaking up the submission into
      smaller pieces; I'm only pushing the design document and the revisions
      to the existing scrub code, which is the first 20%% of the patches.
      Also, I'm arbitrarily restarting the version numbering by reversioning
      all patchsets from version 22 to epoch 23, version 1.
      
      The big question to everyone reading this is: How might I convince you
      that there is more merit in merging the whole feature and dealing with
      the consequences than continuing to maintain it out of tree?
      
      ---------
      
      To prepare the XFS community and potential patch reviewers for the
      upstream submission of the online fsck feature, I decided to write a
      document capturing the broader picture behind the online repair
      development effort.  The document begins by defining the problems that
      online fsck aims to solve and outlining specific use cases for the
      functionality.
      
      Using that as a base, the rest of the design document presents the high
      level algorithms that fulfill the goals set out at the start and the
      interactions between the large pieces of the system.  Case studies round
      out the design documentation by adding the details of exactly how
      specific parts of the online fsck code integrate the algorithms with the
      filesystem.
      
      The goal of this effort is to help the XFS community understand how the
      gigantic online repair patchset works.  The questions I submit to the
      community reviewers are:
      
      1. As you read the design doc (and later the code), do you feel that you
         understand what's going on well enough to try to fix a bug if you
         found one?
      
      2. What sorts of interactions between systems (or between scrub and the
         rest of the kernel) am I missing?
      
      3. Do you feel confident enough in the implementation as it is now that
         the benefits of merging the feature (as EXPERIMENTAL) outweigh any
         potential disruptions to XFS at large?
      
      4. Are there problematic interactions between subsystems that ought to
         be cleared up before merging?
      
      5. Can I just merge all of this?
      
      I intend to commit this document to the kernel's documentation directory
      when we start merging the patchset, albeit without the links to
      git.kernel.org.  A much more readable version of this is posted at:
      https://djwong.org/docs/xfs-online-fsck-design/
      
      v2: add missing sections about: all the in-kernel data structures and
          new apis that the scrub and repair functions use; how xattrs and
          directories are checked; how space btree records are checked; and
          add more details to the parts where all these bits tie together.
          Proofread for verb tense inconsistencies and eliminate vague 'we'
          usage.  Move all the discussion of what we can do with pageable
          kernel memory into a single source file and section.  Document where
          log incompat feature locks fit into the locking model.
      
      v3: resync with 6.0, fix a few typos, begin discussion of the merging
          plan for this megapatchset.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      bed25d80
  2. 12 Apr, 2023 30 commits
    • Ye Bin's avatar
      xfs: fix BUG_ON in xfs_getbmap() · 8ee81ed5
      Ye Bin authored
      There's issue as follows:
      XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
      ------------[ cut here ]------------
      kernel BUG at fs/xfs/xfs_message.c:102!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
      RIP: 0010:assfail+0x96/0xa0
      RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
      RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
      RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
      R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
      R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
      FS:  00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       xfs_getbmap+0x1a5b/0x1e40
       xfs_ioc_getbmap+0x1fd/0x5b0
       xfs_file_ioctl+0x2cb/0x1d50
       __x64_sys_ioctl+0x197/0x210
       do_syscall_64+0x39/0xb0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Above issue may happen as follows:
               ThreadA                       ThreadB
      do_shared_fault
       __do_fault
        xfs_filemap_fault
         __xfs_filemap_fault
          filemap_fault
                                   xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
      			      xfs_getbmap
      			       xfs_ilock(ip, XFS_IOLOCK_SHARED);
      			       filemap_write_and_wait
       do_page_mkwrite
        xfs_filemap_page_mkwrite
         __xfs_filemap_fault
          xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
          iomap_page_mkwrite
           ...
           xfs_buffered_write_iomap_begin
            xfs_bmapi_reserve_delalloc -> Allocate delay extent
                                    xfs_ilock_data_map_shared(ip)
      	                      xfs_getbmap_report_one
      			       ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
      	                        -> trigger BUG_ON
      
      As xfs_filemap_page_mkwrite() only hold XFS_MMAPLOCK_SHARED lock, there's
      small window mkwrite can produce delay extent after file write in xfs_getbmap().
      To solve above issue, just skip delalloc extents.
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8ee81ed5
    • Darrick J. Wong's avatar
      xfs: verify buffer contents when we skip log replay · 22ed903e
      Darrick J. Wong authored
      syzbot detected a crash during log recovery:
      
      XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
      XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
      XFS (loop0): Starting recovery (logdev: internal)
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
      Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074
      
      CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
       print_address_description+0x74/0x340 mm/kasan/report.c:306
       print_report+0x107/0x1f0 mm/kasan/report.c:417
       kasan_report+0xcd/0x100 mm/kasan/report.c:517
       xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
       xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
       xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
       xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
       xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
       xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
       xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
       xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
       xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
       xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
       get_tree_bdev+0x400/0x620 fs/super.c:1282
       vfs_get_tree+0x88/0x270 fs/super.c:1489
       do_new_mount+0x289/0xad0 fs/namespace.c:3145
       do_mount fs/namespace.c:3488 [inline]
       __do_sys_mount fs/namespace.c:3697 [inline]
       __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f89fa3f4aca
      Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
      RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
      RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
      RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
      R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
      R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
       </TASK>
      
      The fuzzed image contains an AGF with an obviously garbage
      agf_refcount_level value of 32, and a dirty log with a buffer log item
      for that AGF.  The ondisk AGF has a higher LSN than the recovered log
      item.  xlog_recover_buf_commit_pass2 reads the buffer, compares the
      LSNs, and decides to skip replay because the ondisk buffer appears to be
      newer.
      
      Unfortunately, the ondisk buffer is corrupt, but recovery just read the
      buffer with no buffer ops specified:
      
      	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
      			buf_f->blf_len, buf_flags, &bp, NULL);
      
      Skipping the buffer leaves its contents in memory unverified.  This sets
      us up for a kernel crash because xfs_refcount_recover_cow_leftovers
      reads the buffer (which is still around in XBF_DONE state, so no read
      verification) and creates a refcountbt cursor of height 32.  This is
      impossible so we run off the end of the cursor object and crash.
      
      Fix this by invoking the verifier on all skipped buffers and aborting
      log recovery if the ondisk buffer is corrupt.  It might be smarter to
      force replay the log item atop the buffer and then see if it'll pass the
      write verifier (like ext4 does) but for now let's go with the
      conservative option where we stop immediately.
      
      Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994eSigned-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      22ed903e
    • Darrick J. Wong's avatar
      xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done · c95356ca
      Darrick J. Wong authored
      While fuzzing the data fork extent count on a btree-format directory
      with xfs/375, I observed the following (excerpted) splat:
      
      XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      The cause of this is a race condition in xfs_ilock_data_map_shared,
      which performs an unlocked access to the data fork to guess which lock
      mode it needs:
      
      Thread 0                          Thread 1
      
      xfs_need_iread_extents
      <observe no iext tree>
      xfs_ilock(..., ILOCK_EXCL)
      xfs_iread_extents
      <observe no iext tree>
      <check ILOCK_EXCL>
      <load bmbt extents into iext>
      <notice iext size doesn't
       match nextents>
                                        xfs_need_iread_extents
                                        <observe iext tree>
                                        xfs_ilock(..., ILOCK_SHARED)
      <tear down iext tree>
      xfs_iunlock(..., ILOCK_EXCL)
                                        xfs_iread_extents
                                        <observe no iext tree>
                                        <check ILOCK_EXCL>
                                        *BOOM*
      
      Fix this race by adding a flag to the xfs_ifork structure to indicate
      that we have not yet read in the extent records and changing the
      predicate to look at the flag state, not if_height.  The memory barrier
      ensures that the flag will not be set until the very end of the
      function.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      c95356ca
    • Dave Chinner's avatar
      xfs: remove WARN when dquot cache insertion fails · 4b827b3f
      Dave Chinner authored
      It just creates unnecessary bot noise these days.
      
      Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4b827b3f
    • Dave Chinner's avatar
      xfs: don't consider future format versions valid · aa880198
      Dave Chinner authored
      In commit fe08cc50 we reworked the valid superblock version
      checks. If it is a V5 filesystem, it is always valid, then we
      checked if the version was less than V4 (reject) and then checked
      feature fields in the V4 flags to determine if it was valid.
      
      What we missed was that if the version is not V4 at this point,
      we shoudl reject the fs. i.e. the check current treats V6+
      filesystems as if it was a v4 filesystem. Fix this.
      
      cc: stable@vger.kernel.org
      Fixes: fe08cc50 ("xfs: open code sb verifier feature checks")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      aa880198
    • Darrick J. Wong's avatar
      xfs: ensure that all metadata and data blocks are not cow staging extents · 7ac14fa2
      Darrick J. Wong authored
      Make sure that all filesystem metadata blocks and file data blocks are
      not also marked as CoW staging extents.  The extra checking added here
      was inspired by an actual VM host filesystem corruption incident due to
      bugs in the CoW handling of 4.x kernels.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7ac14fa2
    • Darrick J. Wong's avatar
      xfs: check the reference counts of gaps in the refcount btree · 7ad9ea63
      Darrick J. Wong authored
      Gaps in the reference count btree are also significant -- for these
      regions, there must not be any overlapping reverse mappings.  We don't
      currently check this, so make the refcount scrubber more complete.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7ad9ea63
    • Darrick J. Wong's avatar
      xfs: implement masked btree key comparisons for _has_records scans · 4a200a09
      Darrick J. Wong authored
      For keyspace fullness scans, we want to be able to mask off the parts of
      the key that we don't care about.  For most btree types we /do/ want the
      full keyspace, but for checking that a given space usage also has a full
      complement of rmapbt records (even if different/multiple owners) we need
      this masking so that we only track sparseness of rm_startblock, not the
      whole keyspace (which is extremely sparse).
      
      Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
      third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
      pass this through.  This third "mask" argument should contain a nonzero
      value in each structure field that should be used in the key comparisons
      done during the scan.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      4a200a09
    • Darrick J. Wong's avatar
      xfs: replace xfs_btree_has_record with a general keyspace scanner · 6abc7aef
      Darrick J. Wong authored
      The current implementation of xfs_btree_has_record returns true if it
      finds /any/ record within the given range.  Unfortunately, that's not
      sufficient for scrub.  We want to be able to tell if a range of keyspace
      for a btree is devoid of records, is totally mapped to records, or is
      somewhere in between.  By forcing this to be a boolean, we conflated
      sparseness and fullness, which caused scrub to return incorrect results.
      Fix the API so that we can tell the caller which of those three is the
      current state.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      6abc7aef
    • Darrick J. Wong's avatar
      xfs: refactor ->diff_two_keys callsites · bd7e7951
      Darrick J. Wong authored
      Create wrapper functions around ->diff_two_keys so that we don't have to
      remember what the return values mean, and adjust some of the code
      comments to reflect the longtime code behavior.  We're going to
      introduce more uses of ->diff_two_keys in the next patch, so reduce the
      cognitive load for readers by doing this refactoring now.
      Suggested-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      bd7e7951
    • Darrick J. Wong's avatar
      xfs: always scrub record/key order of interior records · 2bea8df0
      Darrick J. Wong authored
      In commit d47fef93, we removed the firstrec and firstkey fields of
      struct xchk_btree because Christoph thought they were unnecessary
      because we could use the record index in the btree cursor.  This is
      incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
      position within a specific btree block, not within the entire level.
      
      The end result is that scrub no longer detects situations where the
      rightmost record of a block is identical to the leftmost record of that
      block's right sibling.  Fix this regression by reintroducing record
      validity booleans so that order checking skips *only* the leftmost
      record/key in each level.
      
      Fixes: d47fef93 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      2bea8df0
    • Darrick J. Wong's avatar
      xfs: refactor converting btree irec to btree key · ee5fe8ff
      Darrick J. Wong authored
      We keep doing these conversions to support btree queries, so refactor
      this into a helper.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      ee5fe8ff
    • Darrick J. Wong's avatar
      xfs: check btree keys reflect the child block · c99f99fa
      Darrick J. Wong authored
      When scrub is checking a non-root btree block, it should make sure that
      the keys in the parent btree block accurately capture the keyspace that
      the child block stores.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c99f99fa
    • Darrick J. Wong's avatar
      xfs: detect unwritten bit set in rmapbt node block keys · 38384569
      Darrick J. Wong authored
      In the last patch, we changed the rmapbt code to remove the UNWRITTEN
      bit when creating an rmapbt key from an rmapbt record, and we changed
      the rmapbt key comparison code to start considering the ATTR and BMBT
      flags during lookup.  This brought the behavior of the rmapbt
      implementation in line with its specification.
      
      However, there may exist filesystems that have the unwritten bit still
      set in the rmapbt keys.  We should detect these situations and flag the
      rmapbt as one that would benefit from optimization.  Eventually, online
      repair will be able to do something in response to this.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      38384569
    • Darrick J. Wong's avatar
      xfs: fix rm_offset flag handling in rmap keys · 08c987de
      Darrick J. Wong authored
      Keys for extent interval records in the reverse mapping btree are
      supposed to be computed as follows:
      
      (physical block, owner, fork, is_btree, offset)
      
      This provides users the ability to look up a reverse mapping from a file
      block mapping record -- start with the physical block; then if there are
      multiple records for the same block, move on to the owner; then the
      inode fork type; and so on to the file offset.
      
      Unfortunately, the code that creates rmap lookup keys from rmap records
      forgot to mask off the record attribute flags, leading to ondisk keys
      that look like this:
      
      (physical block, owner, fork, is_btree, unwritten state, offset)
      
      Fortunately, this has all worked ok for the past six years because the
      key comparison functions incorrectly ignore the fork/bmbt/unwritten
      information that's encoded in the on-disk offset.  This means that
      lookup comparisons are only done with:
      
      (physical block, owner, offset)
      
      Queries can (theoretically) return incorrect results because of this
      omission.  On consistent filesystems this isn't an issue because xattr
      and bmbt blocks cannot be shared and hence the comparisons succeed
      purely on the contents of the rm_startblock field.  For the one case
      where we support sharing (written data fork blocks) all flag bits are
      zero, so the omission in the comparison has no ill effects.
      
      Unfortunately, this bug prevents scrub from detecting incorrect fork and
      bmbt flag bits in the rmap btree, so we really do need to fix the
      compare code.  Old filesystems with the unwritten bit erroneously set in
      the rmap key struct will work fine on new kernels since we still ignore
      the unwritten bit.  New filesystems on older kernels will work fine
      since the old kernels never paid attention to the unwritten bit.
      
      A previous version of this patch forgot to keep the (un)written state
      flag masked during the comparison and caused a major regression in
      5.9.x since unwritten extent conversion can update an rmap record
      without requiring key updates.
      
      Note that blocks cannot go directly from data fork to attr fork without
      being deallocated and reallocated, nor can they be added to or removed
      from a bmbt without a free/alloc cycle, so this should not cause any
      regressions.
      
      Found by fuzzing keys[1].attrfork = ones on xfs/371.
      
      Fixes: 4b8ed677 ("xfs: add rmap btree operations")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      08c987de
    • Darrick J. Wong's avatar
      xfs: hoist inode record alignment checks from scrub · de1a9ce2
      Darrick J. Wong authored
      Move the inobt record alignment checks from xchk_iallocbt_rec into
      xfs_inobt_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      de1a9ce2
    • Darrick J. Wong's avatar
      xfs: hoist rmap record flag checks from scrub · e774b2ea
      Darrick J. Wong authored
      Move the rmap record flag checks from xchk_rmapbt_rec into
      xfs_rmap_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      e774b2ea
    • Darrick J. Wong's avatar
      xfs: complain about bad file mapping records in the ondisk bmbt · 6a3bd8fc
      Darrick J. Wong authored
      Similar to what we've just done for the other btrees, create a function
      to log corrupt bmbt records and call it whenever we encounter a bad
      record in the ondisk btree.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      6a3bd8fc
    • Darrick J. Wong's avatar
      xfs: hoist rmap record flag checks from scrub · 7d7d6d2f
      Darrick J. Wong authored
      Move the rmap record flag checks from xchk_rmapbt_rec into
      xfs_rmap_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7d7d6d2f
    • Darrick J. Wong's avatar
      xfs: complain about bad records in query_range helpers · ee12eaaa
      Darrick J. Wong authored
      For every btree type except for the bmbt, refactor the code that
      complains about bad records into a helper and make the ->query_range
      helpers call it so that corruptions found via that avenue are logged.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      ee12eaaa
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for bmap btrees · 69010fe3
      Darrick J. Wong authored
      Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
      and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      69010fe3
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for rmap btrees · c4e34172
      Darrick J. Wong authored
      Create a xfs_rmap_check_irec function to detect corruption in btree
      records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c4e34172
    • Darrick J. Wong's avatar
      xfs: return a failure address from xfs_rmap_irec_offset_unpack · 39ab26d5
      Darrick J. Wong authored
      Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
      Change this function to return the code address of a failed conversion
      in preparation for the next patch, which standardizes localized record
      checking and reporting code.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      39ab26d5
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for refcount btrees · 2b30cc0b
      Darrick J. Wong authored
      Create a xfs_refcount_check_irec function to detect corruption in btree
      records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      2b30cc0b
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for inode btrees · 366a0b8d
      Darrick J. Wong authored
      Create a xfs_inobt_check_irec function to detect corruption in btree
      records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      366a0b8d
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for free space btrees · 35e3b9a1
      Darrick J. Wong authored
      Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
      an incore record, and a xfs_alloc_check_irec function to detect
      corruption.  Replace all the open-coded logic with calls to the new
      helpers and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      35e3b9a1
    • Darrick J. Wong's avatar
      xfs: scrub should use ECHRNG to signal that the drain is needed · 88accf17
      Darrick J. Wong authored
      In the previous patch, we added jump labels to the intent drain code so
      that regular filesystem operations need not pay the price of checking
      for someone (scrub) waiting on intents to drain from some part of the
      filesystem when that someone isn't running.
      
      However, I observed that xfs/285 now spends a lot more time pushing the
      AIL from the inode btree scrubber than it used to.  This is because the
      inobt scrubber will try push the AIL to try to get logged inode cores
      written to the filesystem when it sees a weird discrepancy between the
      ondisk inode and the inobt records.  This AIL push is triggered when the
      setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
      return is initiated when the discrepancy is seen.
      
      The solution to this performance slow down is to use a different result
      code (ECHRNG) for scrub code to signal that it needs to wait for
      deferred intent work items to drain out of some part of the filesystem.
      When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
      setup functions will activate the jump label.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      88accf17
    • Darrick J. Wong's avatar
      xfs: minimize overhead of drain wakeups by using jump labels · 466c525d
      Darrick J. Wong authored
      To reduce the runtime overhead even further when online fsck isn't
      running, use a static branch key to decide if we call wake_up on the
      drain.  For compilers that support jump labels, the call to wake_up is
      replaced by a nop sled when nobody is waiting for intents to drain.
      
      From my initial microbenchmarking, every transition of the static key
      between the on and off states takes about 22000ns to complete; this is
      paid entirely by the xfs_scrub process.  When the static key is off
      (which it should be when fsck isn't running), the nop sled adds an
      overhead of approximately 0.36ns to runtime code.  The post-atomic
      lockless waiter check adds about 0.03ns, which is basically free.
      
      For the few compilers that don't support jump labels, runtime code pays
      the cost of calling wake_up on an empty waitqueue, which was observed to
      be about 30ns.  However, most architectures that have sufficient memory
      and CPU capacity to run XFS also support jump labels, so this is not
      much of a worry.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      466c525d
    • Darrick J. Wong's avatar
      xfs: clean up scrub context if scrub setup returns -EDEADLOCK · 3f64c718
      Darrick J. Wong authored
      It has been a longstanding convention that online scrub and repair
      functions can return -EDEADLOCK to signal that they weren't able to
      obtain some necessary resource.  When this happens, the scrub framework
      is supposed to release all resources attached to the scrub context, set
      the TRY_HARDER flag in the scrub context flags, and try again.  In this
      context, individual scrub functions are supposed to take all the
      resources they (incorrectly) speculated were not necessary.
      
      We're about to make it so that the functions that lock and wait for a
      filesystem AG can also return EDEADLOCK to signal that we need to try
      again with the drain waiters enabled.  Therefore, refactor
      xfs_scrub_metadata to support this behavior for ->setup() functions.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      3f64c718
    • Darrick J. Wong's avatar
      xfs: allow queued AG intents to drain before scrubbing · d5c88131
      Darrick J. Wong authored
      When a writer thread executes a chain of log intent items, the AG header
      buffer locks will cycle during a transaction roll to get from one intent
      item to the next in a chain.  Although scrub takes all AG header buffer
      locks, this isn't sufficient to guard against scrub checking an AG while
      that writer thread is in the middle of finishing a chain because there's
      no higher level locking primitive guarding allocation groups.
      
      When there's a collision, cross-referencing between data structures
      (e.g. rmapbt and refcountbt) yields false corruption events; if repair
      is running, this results in incorrect repairs, which is catastrophic.
      
      Fix this by adding to the perag structure the count of active intents
      and make scrub wait until it has both AG header buffer locks and the
      intent counter reaches zero.
      
      One quirk of the drain code is that deferred bmap updates also bump and
      drop the intent counter.  A fundamental decision made during the design
      phase of the reverse mapping feature is that updates to the rmapbt
      records are always made by the same code that updates the primary
      metadata.  In other words, callers of bmapi functions expect that the
      bmapi functions will queue deferred rmap updates.
      
      Some parts of the reflink code queue deferred refcount (CUI) and bmap
      (BUI) updates in the same head transaction, but the deferred work
      manager completely finishes the CUI before the BUI work is started.  As
      a result, the CUI drops the intent count long before the deferred rmap
      (RUI) update even has a chance to bump the intent count.  The only way
      to keep the intent count elevated between the CUI and RUI is for the BUI
      to bump the counter until the RUI has been created.
      
      A second quirk of the intent drain code is that deferred work items must
      increment the intent counter as soon as the work item is added to the
      transaction.  When a BUI completes and queues an RUI, the RUI must
      increment the counter before the BUI decrements it.  The only way to
      accomplish this is to require that the counter be bumped as soon as the
      deferred work item is created in memory.
      
      In the next patches we'll improve on this facility, but this patch
      provides the basic functionality.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      d5c88131