1. 15 Apr, 2024 26 commits
    • docs: update swapext -> exchmaps language · f783529b
      Darrick J. Wong authored
      Start reworking the atomic swapext design documentation to refer to its
      new file contents/mapping exchange name.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: capture inode generation numbers in the ondisk exchmaps log item · 14f19991
      Darrick J. Wong authored
      Per some very late review comments, capture the generation numbers of
      both inodes involved in a file content exchange operation so that we
      don't accidentally target files that have been reallocated.
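A minimal sketch of why the generation number matters, assuming a hypothetical, simplified intent record (`model_xmi`) and a hypothetical recovery-time check; the real ondisk log item layout is not shown in this log. An inode number can be reused once the file is deleted and the inode reallocated, so the check compares both the number and the generation before acting on either file:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a logged exchange intent. */
struct model_xmi {
	uint64_t ino1, ino2;	/* inode numbers of the two files */
	uint32_t gen1, gen2;	/* generation numbers captured when logged */
};

/* Hypothetical snapshot of an inode as found during recovery. */
struct model_inode {
	uint64_t ino;
	uint32_t gen;
};

/*
 * Only act on the intent if both inodes are still the files that were
 * logged: same inode number AND same generation number.
 */
static bool xmi_targets_same_files(const struct model_xmi *xmi,
				   const struct model_inode *ip1,
				   const struct model_inode *ip2)
{
	assert(xmi != NULL && ip1 != NULL && ip2 != NULL);
	return ip1->ino == xmi->ino1 && ip1->gen == xmi->gen1 &&
	       ip2->ino == xmi->ino2 && ip2->gen == xmi->gen2;
}
```

With only inode numbers logged, a reallocated inode 200 would wrongly match; the generation mismatch (9 vs. 10 in the test below) is what rejects it.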
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: support non-power-of-two rtextsize with exchange-range · b3e60f84
      Darrick J. Wong authored
      The generic exchange-range alignment checks use (fast) bitmasking
      operations to perform block alignment checks on the exchange parameters.
      Unfortunately, bitmasks require that the alignment size be a power of
      two.  This isn't true for realtime devices with a non-power-of-two
      extent size, so we have to copy-pasta the generic checks using long
      division for this to work properly.
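The difference between the two kinds of check can be shown in a few lines. This is an illustrative sketch with hypothetical function names, not the kernel's code: the mask test is only valid when the alignment is a power of two, while the division test works for any nonzero alignment, such as a 37k realtime extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fast check: only correct when @align is a power of two. */
static bool aligned_pow2(uint64_t x, uint64_t align)
{
	return (x & (align - 1)) == 0;
}

/* Division-based check: correct for any nonzero @align. */
static bool aligned_any(uint64_t x, uint64_t align)
{
	assert(align != 0);
	return x % align == 0;
}
```

For a 37k extent, 37 * 1024 = 37888 = 0x9400, and the "mask" 0x93ff leaves 0x9000 set, so the bitmask version wrongly reports that 37888 is not aligned to itself.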
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: make file range exchange support realtime files · e6294110
      Darrick J. Wong authored
      Now that bmap items support the realtime device, we can add the
      necessary pieces to the file range exchange code to support exchanging
      mappings.  All we really need to do here is adjust the blockcount
      upwards to the end of the rt extent and remove the inode checks.
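One way to express "adjust the blockcount upwards to the end of the rt extent" is shown below. This is a sketch that assumes block-granular arithmetic and uses a hypothetical helper name; the log does not show the kernel's exact computation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a block count, starting at @start, that is at least @count and
 * ends on a multiple of @rextsize blocks (the realtime extent size).
 */
static uint64_t roundup_to_rtext_end(uint64_t start, uint64_t count,
				     uint32_t rextsize)
{
	uint64_t end = start + count;
	uint64_t rem;

	assert(rextsize != 0);
	rem = end % rextsize;	/* division, since rextsize may not be 2^n */
	if (rem != 0)
		end += rextsize - rem;
	return end - start;
}
```

For example, with a 3-block rt extent, a 1-block mapping at block 4 is stretched to 2 blocks so that it ends at block 6.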
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: condense symbolic links after a mapping exchange operation · 33a9be2b
      Darrick J. Wong authored
      The previous commit added a new file mapping exchange flag that enables
      us to perform post-exchange processing on file2 once we're done
      exchanging the extent mappings.  Now add this ability for symlinks.
      
      This isn't used anywhere right now, but we need to have the basic ondisk
      flags in place so that a future online symlink repair feature can
      salvage the remote target in a temporary link and exchange the data fork
      mappings when ready.  If one file is in extents format and the other is
      inline, we will have to promote both to extents format to perform the
      exchange.  After the exchange, we can try to condense the fixed symlink
      down to inline format if possible.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: condense directories after a mapping exchange operation · da165fbd
      Darrick J. Wong authored
      The previous commit added a new file mapping exchange flag that enables
      us to perform post-swap processing on file2 once we're done exchanging
      extent mappings.  Now add this ability for directories.
      
      This isn't used anywhere right now, but we need to have the basic ondisk
      flags in place so that a future online directory repair feature can
      create salvaged dirents in a temporary directory and exchange the data
      fork mappings when ready.  If one file is in extents format and the
      other is inline, we will have to promote both to extents format to
      perform the exchange.  After the exchange, we can try to condense the
      fixed directory down to inline format if possible.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: condense extended attributes after a mapping exchange operation · 497d7a26
      Darrick J. Wong authored
      Add a new file mapping exchange flag that enables us to perform
      post-exchange processing on file2 once we're done exchanging the extent
      mappings.  If we were swapping mappings between extended attribute
      forks, we want to be able to convert file2's attr fork from block to
      inline format.
      
      (This implies that all fork contents are exchanged.)
      
      This isn't used anywhere right now, but we need to have the basic ondisk
      flags in place so that a future online xattr repair feature can create
      salvaged attrs in a temporary file and exchange the attr fork mappings
      when ready.  If one file is in extents format and the other is inline,
      we will have to promote both to extents format to perform the exchange.
      After the exchange, we can try to condense the fixed file's attr fork
      back down to inline format if possible.
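The promote, exchange, and condense sequence used by the three "condense" commits can be modeled as a small state change. This is a sketch with hypothetical types, and `inline_limit` is an assumed stand-in for however much fits inline in the inode. For simplicity it promotes any inline fork before exchanging; the log only describes the mixed inline/extents case. Following the commit text, file1 is the temporary file holding salvaged contents and file2 is the file being repaired.

```c
#include <assert.h>
#include <stddef.h>

enum model_fmt { FMT_INLINE, FMT_EXTENTS };

struct model_fork {
	enum model_fmt fmt;
	size_t bytes;	/* size of the fork's contents */
};

/* Mapping exchange works on extent mappings, so promote inline forks. */
static void promote_for_exchange(struct model_fork *f)
{
	if (f->fmt == FMT_INLINE)
		f->fmt = FMT_EXTENTS;
}

/* Post-exchange step on file2: return to inline format if it now fits. */
static void condense_after_exchange(struct model_fork *f2, size_t inline_limit)
{
	assert(f2->fmt == FMT_EXTENTS);
	if (f2->bytes <= inline_limit)
		f2->fmt = FMT_INLINE;
}

/* Exchange the entire contents of both forks, then condense file2. */
static void exchange_whole_forks(struct model_fork *f1, struct model_fork *f2,
				 size_t inline_limit)
{
	struct model_fork tmp;

	promote_for_exchange(f1);
	promote_for_exchange(f2);
	tmp = *f1;
	*f1 = *f2;
	*f2 = tmp;
	condense_after_exchange(f2, inline_limit);
}
```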
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: add error injection to test file mapping exchange recovery · 5fd022ec
      Darrick J. Wong authored
      Add an errortag so that we can test recovery of exchmaps log items.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: bind together the front and back ends of the file range exchange code · 42672471
      Darrick J. Wong authored
      So far, we've constructed the front end of the file range exchange code
      that does all the checking; and the back end of the file mapping
      exchange code that actually does the work.  Glue these two pieces
      together so that we can turn on the functionality.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create deferred log items for file mapping exchanges · 966ceafc
      Darrick J. Wong authored
      Now that we've created the skeleton of a log intent item to track and
      restart file mapping exchange operations, add the upper level logic to
      commit intent items and turn them into concrete work recorded in the
      log.  This builds on the existing bmap update intent items that have
      been around for a while now.
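Intent items are paired with "done" items: if the log contains an intent with no matching done item, recovery must finish that work. The sketch below models only that pairing rule, using hypothetical types. It does not reproduce the real exchmaps item formats or the way recovery replays and relogs the remaining work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_item_type { ITEM_INTENT, ITEM_DONE };

struct model_log_item {
	enum model_item_type type;
	uint64_t id;	/* a done item carries the id of the intent it completes */
};

/* Count logged intents with no later matching done item (need recovery). */
static size_t count_unfinished_intents(const struct model_log_item *log,
				       size_t n)
{
	size_t unfinished = 0;

	assert(log != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		bool done = false;

		if (log[i].type != ITEM_INTENT)
			continue;
		for (size_t j = i + 1; j < n; j++) {
			if (log[j].type == ITEM_DONE && log[j].id == log[i].id) {
				done = true;
				break;
			}
		}
		if (!done)
			unfinished++;
	}
	return unfinished;
}
```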
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: introduce a file mapping exchange log intent item · 6c08f434
      Darrick J. Wong authored
      Introduce a new intent log item to handle exchanging mappings between
      the forks of two files.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a incompat flag for atomic file mapping exchanges · 1518646e
      Darrick J. Wong authored
      Create an incompat flag so that we only attempt to process file mapping
      exchange log items if the filesystem supports it, and a geometry flag to
      advertise support if it's present.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: introduce new file range exchange ioctl · 9a64d9b3
      Darrick J. Wong authored
      Introduce a new ioctl to handle exchanging ranges of bytes
      between files.  The goal here is to perform the exchange atomically with
      respect to applications -- either they see the file contents before the
      exchange or they see that A-B is now B-A, even if the kernel crashes.
      
      My original goal with all this code was to make it so that online repair
      can build a replacement directory or xattr structure in a temporary file
      and commit the repair by atomically exchanging all the data blocks
      between the two files.  However, I needed a way to test this mechanism
      thoroughly, so I've been evolving an ioctl interface since then.
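The "A-B is now B-A" semantics described above can be illustrated with a toy in-memory model. This sketch assumes each file is just an array that maps logical blocks to physical blocks, and the function name is hypothetical. The exchange swaps mappings rather than copying data. The real code's atomicity across crashes, provided by logged exchmaps intent items, is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Swap @len block mappings: file1 blocks [off1, off1 + len) and file2
 * blocks [off2, off2 + len) trade the physical blocks they point at.
 */
static void exchange_mappings(uint64_t *map1, size_t off1,
			      uint64_t *map2, size_t off2, size_t len)
{
	assert(map1 != NULL && map2 != NULL);
	for (size_t i = 0; i < len; i++) {
		uint64_t tmp = map1[off1 + i];

		map1[off1 + i] = map2[off2 + i];
		map2[off2 + i] = tmp;
	}
}
```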
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • vfs: export remap and write check helpers · 5b9932f6
      Darrick J. Wong authored
      Export these functions so that the next patch can use them to check the
      file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.
      
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: constify xfs_bmap_is_written_extent · 15f78aa3
      Darrick J. Wong authored
      This predicate doesn't modify the structure that's being passed in, so
      we can mark it const.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: refactor non-power-of-two alignment checks · ac5cebee
      Darrick J. Wong authored
      Create a helper function that can compute if a 64-bit number is an
      integer multiple of a 32-bit number, where the 32-bit number is not
      required to be an even power of two.  This is needed for some new code
      for the realtime device, where we can set 37k allocation units and then
      have to remap them.
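A user-space sketch of such a helper, with a hypothetical name, is shown below. It is applied to the position and length parameters of an exchange. In kernel code, dividing a 64-bit value on 32-bit architectures goes through helpers such as `div_u64_rem()` or `do_div()` rather than a plain `%`. This sketch also ignores any end-of-file special case that the real range checks may apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Is @x an integer multiple of @y?  @y need not be a power of two. */
static bool is_multiple_u64(uint64_t x, uint32_t y)
{
	assert(y != 0);
	return x % y == 0;
}

/* Hypothetical use: both offsets and the length must be unit-aligned. */
static bool exchange_params_aligned(uint64_t pos1, uint64_t pos2,
				    uint64_t len, uint32_t unit)
{
	return is_multiple_u64(pos1, unit) && is_multiple_u64(pos2, unit) &&
	       is_multiple_u64(len, unit);
}
```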
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: hoist multi-fsb allocation unit detection to a helper · 6b700a5b
      Darrick J. Wong authored
      Replace the open-coded logic that decides whether a file has a multi-fsb
      allocation unit with a helper to make the code easier to read.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a new helper to return a file's allocation unit · ee20808d
      Darrick J. Wong authored
      Create a new helper function to calculate the fundamental allocation
      unit (i.e. the smallest unit of space we can allocate) of a file.
      Things are going to get hairy with range-exchange on the realtime
      device, so prepare for this now.
      
      Remove the static attribute from xfs_is_falloc_aligned since the next
      patch will need it.
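The allocation-unit helpers from this and the previous commit can be sketched as follows. This sketch assumes that, as of this series, only realtime files have a unit larger than one filesystem block (one rt extent). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the fields the helpers need. */
struct model_file {
	bool realtime;		/* file data lives on the realtime device */
	uint32_t blocksize;	/* filesystem block size in bytes */
	uint32_t rextsize;	/* realtime extent size in fs blocks */
};

/* Smallest allocatable unit for this file, in filesystem blocks. */
static uint32_t alloc_unit_fsb(const struct model_file *f)
{
	assert(f->blocksize != 0);
	return f->realtime ? f->rextsize : 1;
}

/* The same unit in bytes. */
static uint64_t alloc_unit_bytes(const struct model_file *f)
{
	return (uint64_t)alloc_unit_fsb(f) * f->blocksize;
}

/* True if allocations for this file span more than one fs block. */
static bool has_multi_fsb_alloc_unit(const struct model_file *f)
{
	return alloc_unit_fsb(f) > 1;
}
```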
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: declare xfs_file.c symbols in xfs_file.h · 00acb28d
      Darrick J. Wong authored
      Move the two public symbols in xfs_file.c to xfs_file.h.  We're about to
      add more public symbols in that source file, so let's finally create the
      header file.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: move xfs_iops.c declarations out of xfs_inode.h · 3fc48445
      Darrick J. Wong authored
      Similarly, move declarations of public symbols of xfs_iops.c from
      xfs_inode.h to xfs_iops.h.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: move inode lease breaking functions to xfs_inode.c · a4db266a
      Darrick J. Wong authored
      The lease breaking functions operate at the scope of the entire VFS
      inode, not subranges of a file.  Move them to xfs_inode.c since they're
      already declared in xfs_inode.h.  This cleanup moves us closer to
      having xfs_FOO.h declare only the symbols in xfs_FOO.c.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: only clear log incompat flags at clean unmount · 5302a5c8
      Darrick J. Wong authored
      While reviewing the online fsck patchset, someone spied the
      xfs_swapext_can_use_without_log_assistance function and wondered why we
      go through this inverted-bitmask dance to avoid setting the
      XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature.
      
      (The same principles apply to the logged extended attribute update
      feature bit in the since-merged LARP series.)
      
      The reason for this dance is that xfs_add_incompat_log_feature is an
      expensive operation -- it forces the log, pushes the AIL, and then if
      nobody's beaten us to it, sets the feature bit and issues a synchronous
      write of the primary superblock.  That could be a one-time cost
      amortized over the life of the filesystem, but the log quiesce and cover
      operations call xfs_clear_incompat_log_features to remove feature bits
      opportunistically.  On a moderately loaded filesystem this leads to us
      cycling those bits on and off over and over, which hurts performance.
      
      Why do we clear the log incompat bits?  Back in ~2020 I think Dave and I
      had a conversation on IRC[2] about what the log incompat bits represent.
      IIRC in that conversation we decided that the log incompat bits protect
      unrecovered log items so that old kernels won't try to recover them and
      barf.  Since a clean log has no protected log items, we could clear the
      bits at cover/quiesce time.
      
      As Dave Chinner pointed out in the thread, clearing log incompat bits at
      unmount time has positive effects for golden root disk image generator
      setups, since the generator could be running a newer kernel than what
      gets written to the golden image -- if there are log incompat fields set
      in the golden image that was generated by a newer kernel/OS image
      builder then the provisioning host cannot mount the filesystem even
      though the log is clean and recovery is unnecessary to mount the
      filesystem.
      
      Given that it's expensive to set log incompat bits, we really only want
      to do that once per bit per mount.  Therefore, I propose that we only
      clear log incompat bits as part of writing a clean unmount record.  Do
      this by adding an operational state flag to the xfs mount that guards
      whether or not the feature bit clearing can actually take place.
      
      This eliminates the l_incompat_users rwsem that we use to protect a log
      cleaning operation from clearing a feature bit that a frontend thread is
      trying to set -- this lock adds another way to fail w.r.t. locking.  For
      the swapext series, I shard that into multiple locks just to work around
      the lockdep complaints, and that's fugly.
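The proposed policy can be modeled in a few lines. This sketch uses hypothetical names, and `unmount_in_progress` stands in for the new operational state flag. A bit costs the expensive set path at most once per mount, quiesce/cover no longer clears it, and only the clean-unmount path may clear it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_mount {
	uint32_t log_incompat;		/* log incompat bits in the superblock */
	bool unmount_in_progress;	/* hypothetical opstate: clearing allowed */
	unsigned int expensive_sets;	/* log force + AIL push + sync sb write */
};

/* Set a log incompat bit; the expensive path runs only if it is unset. */
static void model_add_log_incompat(struct model_mount *mp, uint32_t bit)
{
	assert(bit != 0);
	if (mp->log_incompat & bit)
		return;
	mp->expensive_sets++;
	mp->log_incompat |= bit;
}

/* Shared clearing helper: does nothing unless the unmount flag is set. */
static void model_clear_log_incompat(struct model_mount *mp)
{
	if (!mp->unmount_in_progress)
		return;
	mp->log_incompat = 0;
}

static void model_quiesce(struct model_mount *mp)
{
	model_clear_log_incompat(mp);	/* no-op while mounted */
}

static void model_clean_unmount(struct model_mount *mp)
{
	mp->unmount_in_progress = true;	/* writing the clean unmount record */
	model_clear_log_incompat(mp);
}
```

Under the old behavior, each quiesce would have cleared the bit, and the next user would pay the expensive set path again.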
      
      Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix error bailout in xrep_abt_build_new_trees · 98a778b4
      Darrick J. Wong authored
      Dan Carpenter reports:
      
      "Commit 4bdfd7d1 ("xfs: repair free space btrees") from Dec 15,
      2023 (linux-next), leads to the following Smatch static checker
      warning:
      
              fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees()
              warn: missing unwind goto?"
      
      That's a bug, so let's fix it.
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Fixes: 4bdfd7d1 ("xfs: repair free space btrees")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory · 21ad2d03
      Darrick J. Wong authored
      xfs/399 found the following deadlock when fuzzing core.mode = ones:
      
      /proc/20506/task/20558/stack :
      [<0>] xfs_ilock+0xa0/0x240 [xfs]
      [<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs]
      [<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs]
      [<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs]
      [<0>] xrep_dinode_mode+0x7c/0x120 [xfs]
      [<0>] xrep_dinode_core+0xed/0x2b0 [xfs]
      [<0>] xrep_dinode_problems+0x10/0x80 [xfs]
      [<0>] xrep_inode+0x6c/0xc0 [xfs]
      [<0>] xrep_attempt+0x64/0x1d0 [xfs]
      [<0>] xfs_scrub_metadata+0x365/0x840 [xfs]
      [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
      [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
      [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]
      /proc/20506/task/20559/stack :
      [<0>] xfs_buf_lock+0x3b/0x110 [xfs]
      [<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs]
      [<0>] xfs_buf_get_map+0x208/0xc00 [xfs]
      [<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs]
      [<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs]
      [<0>] xfs_read_agi+0xbd/0x190 [xfs]
      [<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs]
      [<0>] xfs_imap_lookup+0x69/0x1f0 [xfs]
      [<0>] xfs_imap+0x1fc/0x3d0 [xfs]
      [<0>] xfs_iget+0x357/0xd50 [xfs]
      [<0>] xchk_dir_actor+0x16e/0x330 [xfs]
      [<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs]
      [<0>] xchk_dir_walk+0x13a/0x190 [xfs]
      [<0>] xchk_directory+0x1a2/0x2b0 [xfs]
      [<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs]
      [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
      [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
      [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]
      
      Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
      root directory.  Thread 20559 holds the root directory ILOCK and is
      trying to grab the AGI of an inode that is one of the root directory's
      children.  The AGI held by 20558 is the same buffer that 20559 is trying
      to acquire.  In other words, this is an ABBA deadlock.
      
      In general, the lock order is ILOCK and then AGI -- rename does this
      while preparing for an operation involving whiteouts or renaming files
      out of existence; and unlink does this when moving an inode to the
      unlinked list.  The only place where we do it in the opposite order is
      on the child during an icreate, but at that point the child is marked
      INEW and is not visible to other threads.
      
      Work around this deadlock by replacing the blocking ilock attempt with a
      nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
      a failed lock attempt.
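The workaround pattern (try the lock, back off briefly, give up at a deadline) is sketched here in user space. A C11 `atomic_flag` stands in for a lock that supports a non-blocking attempt, similar in spirit to `xfs_ilock_nowait()`. A 1 ms sleep stands in for "relax for a jiffy". The function names are hypothetical. A caller that mirrors the commit would pass a 30000 ms timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Non-blocking lock attempt: true if we now hold the lock. */
static bool model_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set(lock);
}

static int64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Never block on the lock (blocking is what allows the ABBA deadlock);
 * poll with trylock, pausing after each failure, until @timeout_ms passes.
 * Returns 0 with the lock held, or -ETIMEDOUT.
 */
static int lock_with_deadline(atomic_flag *lock, int64_t timeout_ms)
{
	const struct timespec relax = { .tv_sec = 0, .tv_nsec = 1000000 };
	int64_t deadline;

	assert(timeout_ms >= 0);
	deadline = now_ms() + timeout_ms;
	for (;;) {
		if (model_trylock(lock))
			return 0;
		if (now_ms() >= deadline)
			return -ETIMEDOUT;
		nanosleep(&relax, NULL);
	}
}
```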
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode · 2afd5276
      Darrick J. Wong authored
      While reviewing the next patch which fixes an ABBA deadlock between the
      AGI and a directory ILOCK, someone asked a question about why we're
      holding the AGI in the first place.  The reason for that is to quiesce
      the inode structures for that AG while we do a repair.
      
      I then realized that xrep_dinode_findmode invokes xchk_iscan_iter,
      which walks the inobts (and hence the AGIs) to find all the inodes.
      This itself is also an ABBA vector, since the damaged inode could be in
      AG 5, which we hold while we scan AG 0 for directories.  5 -> 0 is not
      allowed.
      
      To address this, modify the iscan to allow trylock of the AGI buffer
      using the flags argument to xfs_ialloc_read_agi that the previous patch
      added.
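The ordering rule and the trylock escape hatch can be sketched as follows. AGI buffers are locked in ascending AG order, so a caller that already holds AG 5 may only *try* AG 0. `MODEL_TRYLOCK` is a hypothetical stand-in for whatever buffer lookup flag the real code passes. Returning `-EAGAIN` on a failed trylock is also an assumption of this model. How the real iscan backs off after such a failure is not described here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NO_AG	UINT32_MAX	/* no AGI held yet */
#define MODEL_TRYLOCK	(1u << 0)	/* hypothetical "do not wait" flag */

struct model_agi {
	uint32_t agno;
	bool locked;
};

/* Holding a higher-numbered AGI means we may only trylock a lower one. */
static unsigned int agi_lock_flags(uint32_t highest_held, uint32_t target)
{
	if (highest_held != MODEL_NO_AG && target < highest_held)
		return MODEL_TRYLOCK;
	return 0;
}

/* Returns 0 with the AGI locked, or -EAGAIN if a trylock failed. */
static int model_read_agi(struct model_agi *agi, unsigned int flags)
{
	if (agi->locked && (flags & MODEL_TRYLOCK))
		return -EAGAIN;
	/* A blocking lock would wait here; this single-threaded model
	 * assumes there is no contention on that path. */
	assert(!agi->locked);
	agi->locked = true;
	return 0;
}
```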
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: pass xfs_buf lookup flags to xfs_*read_agi · 549d3c9a
      Darrick J. Wong authored
      Allow callers to pass buffer lookup flags to xfs_read_agi and
      xfs_ialloc_read_agi.  This will be used in the next patch to fix a
      deadlock in the online fsck inode scanner.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  2. 14 Apr, 2024 10 commits
  3. 13 Apr, 2024 4 commits
    • Merge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux · 7efd0a74
      Linus Torvalds authored
      Pull ata fixes from Damien Le Moal:
      
       - Add the mask_port_map parameter to the ahci driver. This is a
         follow-up to the recent snafu with the ASMedia controller and its
         virtual ports hiding port-multiplier devices. As ASMedia confirmed
         that there is no way to determine if these slow-to-probe virtual
         ports actually represent the ports of port-multiplier devices,
         this new parameter allows masking ports to significantly speed up
         probing during system boot, resulting in shorter boot times.
      
       - A fix for an incorrect handling of a port unlock in
         ata_scsi_dev_rescan().
      
       - Allow command duration limits to be detected for ACS-4 devices, as
         there are such devices out in the field.
      
      * tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
        ata: libata-core: Allow command duration limits detection for ACS-4 drives
        ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
        ata: ahci: Add mask_port_map module parameter
    • Merge tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · 76b0e9c4
      Linus Torvalds authored
      Pull zonefs fix from Damien Le Moal:
      
       - Suppress a coccicheck warning using str_plural()
      
      * tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
        zonefs: Use str_plural() to fix Coccinelle warning
    • Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · fa4022cb
      Linus Torvalds authored
      Pull smb client fixes from Steve French:
      
       - fix for oops in cifs_get_fattr of deleted files
      
       - fix for the remote open counter going negative in some directory
         lease cases
      
       - fix for mkfifo to instantiate dentry to avoid possible crash
      
       - important fix to allow handling key rotation for mount and remount
         (i.e. cases, becoming more common, where the password that was used
         for the mount will expire soon but will be replaced by a new password)
      
      * tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: fix broken reconnect when password changing on the server by allowing password rotation
        smb: client: instantiate when creating SFU files
        smb3: fix Open files on server counter going negative
        smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
    • ata: libata-core: Allow command duration limits detection for ACS-4 drives · c0297e7d
      Igor Pylypiv authored
      Even though the command duration limits (CDL) feature was first added
      in ACS-5 (major version 12), there are some ACS-4 (major version 11)
      drives that implement CDL as well.
      
      IDENTIFY_DEVICE, SUPPORTED_CAPABILITIES, and CURRENT_SETTINGS log pages
      are mandatory in the ACS-4 standard so it should be safe to read these
      log pages on older drives implementing the ACS-4 standard.
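The effect of the fix can be sketched as a version gate. This sketch assumes the CDL probe is gated on the highest ATA/ACS major version that the drive advertises in IDENTIFY DEVICE word 80, and that 0x0000 and 0xffff mean "not reported". Per the commit text, ACS-5 is major version 12 and ACS-4 is 11. The function names are hypothetical, not libata's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest major version advertised in word 80 (bit N set => version N). */
static int model_major_version(uint16_t word80)
{
	if (word80 == 0x0000 || word80 == 0xffff)
		return 0;	/* version not reported */
	for (int v = 14; v >= 1; v--)
		if (word80 & (1u << v))
			return v;
	return 0;
}

/* Previously ACS-5 (12) was required; with the fix, ACS-4 (11) suffices. */
static bool model_cdl_probe_allowed(uint16_t word80)
{
	return model_major_version(word80) >= 11;
}
```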
      
      Fixes: 62e4a60e ("scsi: ata: libata: Detect support for command duration limits")
      Cc: stable@vger.kernel.org
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>