1. 12 May, 2022 9 commits
    • Dave Chinner's avatar
      xfs: introduce attr remove initial states into xfs_attr_set_iter · e5d5596a
      Dave Chinner authored
      We need to merge the add and remove code paths to enable safe
      recovery of replace operations. Hoist the initial remove states from
      xfs_attr_remove_iter into xfs_attr_set_iter. We will make use of
      them in the next patches.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e5d5596a
    • Dave Chinner's avatar
      xfs: xfs_attr_set_iter() does not need to return EAGAIN · 4e3d96a5
      Dave Chinner authored
      Now that the full xfs_attr_set_iter() state machine always
      terminates with either the state being XFS_DAS_DONE on success or
      an error on failure, we can get rid of the need for it to return
      -EAGAIN whenever it needs to roll the transaction before running
      the next state.
      
      That is, we don't need to spray -EAGAIN return states everywhere,
      the caller just check the state machine state for completion to
      determine what action should be taken next. This greatly simplifies
      the code within the state machine implementation as it now only has
      to handle 0 for success or -errno for error and it doesn't need to
      tell the caller to retry.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4e3d96a5
    • Dave Chinner's avatar
      xfs: clean up final attr removal in xfs_attr_set_iter · b11fa61b
      Dave Chinner authored
      Clean up the final leaf/node states in xfs_attr_set_iter() to
      further simplify the high level state machine and to set the
      completion state correctly. As we are adding a separate state
      for node format removal, we need to ensure that node formats
      are collapsed back to shortform or empty correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b11fa61b
    • Dave Chinner's avatar
      xfs: remote xattr removal in xfs_attr_set_iter() is conditional · 2e7ef218
      Dave Chinner authored
      We may not have a remote value for the old xattr we have to remove,
      so skip over the remote value removal states and go straight to
      the xattr name removal in the leaf/node block.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2e7ef218
    • Dave Chinner's avatar
      xfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP · 411b434a
      Dave Chinner authored
      We can skip the REPLACE state when LARP is enabled, but that means
      the XFS_DAS_FLIP_LFLAG state is now poorly named - it indicates
      something that has been done rather than what the state is going to
      do. Rename it to "REMOVE_OLD" to indicate that we are now going to
      perform removal of the old attr.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      411b434a
    • Dave Chinner's avatar
      xfs: split remote attr setting out from replace path · 7d035336
      Dave Chinner authored
      When we set a new xattr, we have three exit paths:
      
      	1. nothing else to do
      	2. allocate and set the remote xattr value
      	3. perform the rest of a replace operation
      
      Currently we push both 2 and 3 into the same state, regardless of
      whether we just set a remote attribute or not. Once we've set the
      remote xattr, we have two exit states:
      
      	1. nothing else to do
      	2. perform the rest of a replace operation
      
      Hence we can split the remote xattr allocation and setting into
      their own states and factor it out of xfs_attr_set_iter() to further
      clean up the state machine and the implementation of the state
      machine.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      7d035336
    • Dave Chinner's avatar
      xfs: consolidate leaf/node states in xfs_attr_set_iter · 251b29c8
      Dave Chinner authored
      The operations performed from XFS_DAS_FOUND_LBLK through to
      XFS_DAS_RM_LBLK are now identical to XFS_DAS_FOUND_NBLK through to
      XFS_DAS_RM_NBLK. We can collapse these down into a single set of
      code.
      
      To do this, define the states that leaf and node run through as
      separate sets of sequential states. Then as we move to the next
      state, we can use increments rather than specific state assignments
      to move through the states. This means the state progression is set
      by the initial state that enters the series and we don't need to
      duplicate the code anymore.
      
      At the exit point of the series we need to select the correct leaf
      or node state, but that can also be done by state increment rather
      than assignment.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      251b29c8
    • Dave Chinner's avatar
      xfs: kill XFS_DAC_LEAF_ADDNAME_INIT · 2157d169
      Dave Chinner authored
      We re-enter the XFS_DAS_FOUND_LBLK state when we have to allocate
      multiple extents for a remote xattr. We currently have a flag
      called XFS_DAC_LEAF_ADDNAME_INIT to avoid running the remote attr
      hole finding code more than once.
      
      However, for the node format tree, we have a separate state for this
      so we never reenter the state machine at XFS_DAS_FOUND_NBLK and so
      it does not need a special flag to skip over the remote attr hold
      finding code.
      
      Convert the leaf block code to use the same state machine as the
      node blocks and kill the  XFS_DAC_LEAF_ADDNAME_INIT flag.
      
      This further points out that this "ALLOC" state is only traversed
      if we have remote xattrs or we are doing a rename operation. Rename
      both the leaf and node alloc states to _ALLOC_RMT to indicate they
      are iterating to do allocation of remote xattr blocks.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2157d169
    • Dave Chinner's avatar
      xfs: separate out initial attr_set states · e0c41089
      Dave Chinner authored
      We current use XFS_DAS_UNINIT for several steps in the attr_set
      state machine. We use it for setting shortform xattrs, converting
      from shortform to leaf, leaf add, leaf-to-node and leaf add. All of
      these things are essentially known before we start the state machine
      iterating, so we really should separate them out:
      
      XFS_DAS_SF_ADD:
      	- tries to do a shortform add
      	- on success -> done
      	- on ENOSPC converts to leaf, -> XFS_DAS_LEAF_ADD
      	- on error, dies.
      
      XFS_DAS_LEAF_ADD:
      	- tries to do leaf add
      	- on success:
      		- inline attr -> done
      		- remote xattr || REPLACE -> XFS_DAS_FOUND_LBLK
      	- on ENOSPC converts to node, -> XFS_DAS_NODE_ADD
      	- on error, dies
      
      XFS_DAS_NODE_ADD:
      	- tries to do node add
      	- on success:
      		- inline attr -> done
      		- remote xattr || REPLACE -> XFS_DAS_FOUND_NBLK
      	- on error, dies
      
      This makes it easier to understand how the state machine starts
      up and sets us up on the path to further state machine
      simplifications.
      
      This also converts the DAS state tracepoints to use strings rather
      than numbers, as converting between enums and numbers requires
      manual counting rather than just reading the name.
      
      This also introduces a XFS_DAS_DONE state so that we can trace
      successful operation completions easily.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e0c41089
  2. 11 May, 2022 13 commits
  3. 09 May, 2022 2 commits
  4. 04 May, 2022 16 commits
    • Allison Henderson's avatar
      xfs: Set up infrastructure for log attribute replay · fd920008
      Allison Henderson authored
      Currently attributes are modified directly across one or more
      transactions. But they are not logged or replayed in the event of an
      error. The goal of log attr replay is to enable logging and replaying
      of attribute operations using the existing delayed operations
      infrastructure.  This will later enable the attributes to become part of
      larger multi part operations that also must first be recorded to the
      log.  This is mostly of interest in the scheme of parent pointers which
      would need to maintain an attribute containing parent inode information
      any time an inode is moved, created, or removed.  Parent pointers would
      then be of interest to any feature that would need to quickly derive an
      inode path from the mount point. Online scrub, nfs lookups and fs grow
      or shrink operations are all features that could take advantage of this.
      
      This patch adds two new log item types for setting or removing
      attributes as deferred operations.  The xfs_attri_log_item will log an
      intent to set or remove an attribute.  The corresponding
      xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
      freed once the transaction is done.  Both log items use a generic
      xfs_attr_log_format structure that contains the attribute name, value,
      flags, inode, and an op_flag that indicates if the operations is a set
      or remove.
      
      [dchinner: added extra little bits needed for intent whiteouts]
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      fd920008
    • Allison Henderson's avatar
      xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process · 9a39cdab
      Allison Henderson authored
      During an attr rename operation, blocks are saved for later removal
      as rmtblkno2. The rmtblkno is used in the case of needing to alloc
      more blocks if not enough were available.  However, in the case
      that no further blocks need to be added or removed, we can return as soon
      as xfs_attr_node_addname completes, rather than rolling the transaction
      with an -EAGAIN return.  This extra loop does not hurt anything right
      now, but it will be a problem later when we get into log items because
      we end up with an empty log transaction.  So, add a simple check to
      cut out the unneeded iteration.
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      9a39cdab
    • Allison Henderson's avatar
      xfs: Fix double unlock in defer capture code · 7b3ec2b2
      Allison Henderson authored
      The new deferred attr patch set uncovered a double unlock in the
      recent port of the defer ops capture and continue code.  During log
      recovery, we're allowed to hold buffers to a transaction that's being
      used to replay an intent item.  When we capture the resources as part
      of scheduling a continuation of an intent chain, we call xfs_buf_hold
      to retain our reference to the buffer beyond the transaction commit,
      but we do /not/ call xfs_trans_bhold to maintain the buffer lock.
      This means that xfs_defer_ops_continue needs to relock the buffers
      before xfs_defer_restore_resources joins then tothe new transaction.
      
      Additionally, the buffers should not be passed back via the dres
      structure since they need to remain locked unlike the inodes.  So
      simply set dr_bufs to zero after populating the dres structure.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      7b3ec2b2
    • Dave Chinner's avatar
    • Dave Chinner's avatar
      Merge tag 'reflink-speedups-5.19_2022-04-28' of... · 166afc45
      Dave Chinner authored
      Merge tag 'reflink-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next
      
      xfs: fix reflink inefficiencies
      
      As Dave Chinner has complained about on IRC, there are a couple of
      things about reflink that are very inefficient.  First of all, we
      limited the size of all bunmapi operations to avoid flooding the log
      with defer ops in the worst case, but recent changes to the defer
      ops code have solved that problem, so get rid of the bunmapi length
      clamp.
      
      Second, the log reservations for reflink operations are far far
      larger than they need to be.  Shrink them to exactly what we need to
      handle each deferred RUI and CUI log item, and no more.  Also reduce
      logcount because we don't need 8 rolls per operation.  Introduce a
      transaction reservation compatibility layer to avoid changing the
      minimum log size calculations.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      166afc45
    • Dave Chinner's avatar
      Merge tag 'rmap-speedups-5.19_2022-04-28' of... · 956f1b8f
      Dave Chinner authored
      Merge tag 'rmap-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next
      
      xfs: fix rmap inefficiencies
      
      Reduce the performance impact of the reverse mapping btree when
      reflink is enabled by using the much faster non-overlapped btree
      lookup functions when we're searching the rmap index with a fully
      specified key.  If we find the exact record we're looking for,
      great!  We don't have to perform the full overlapped scan.  For
      filesystems with high sharing factors this reduces the xfs_scrub
      runtime by a good 15%%.
      
      This has been shown to reduce the fstests runtime for realtime rmap
      configurations by 30%%, since the lack of AGs severely limits
      scalability.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      956f1b8f
    • Dave Chinner's avatar
    • Dave Chinner's avatar
    • Dave Chinner's avatar
      xfs: validate v5 feature fields · f0f5f658
      Dave Chinner authored
      We don't check that the v4 feature flags taht v5 requires to be set
      are actually set anywhere. Do this check when we see that the
      filesystem is a v5 filesystem.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f0f5f658
    • Dave Chinner's avatar
      xfs: set XFS_FEAT_NLINK correctly · dd0d2f97
      Dave Chinner authored
      While xfs_has_nlink() is not used in kernel, it is used in userspace
      (e.g. by xfs_db) so we need to set the XFS_FEAT_NLINK flag correctly
      in xfs_sb_version_to_features().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dd0d2f97
    • Dave Chinner's avatar
      xfs: validate inode fork size against fork format · 1eb70f54
      Dave Chinner authored
      xfs_repair catches fork size/format mismatches, but the in-kernel
      verifier doesn't, leading to null pointer failures when attempting
      to perform operations on the fork. This can occur in the
      xfs_dir_is_empty() where the in-memory fork format does not match
      the size and so the fork data pointer is accessed incorrectly.
      
      Note: this causes new failures in xfs/348 which is testing mode vs
      ftype mismatches. We now detect a regular file that has been changed
      to a directory or symlink mode as being corrupt because the data
      fork is for a symlink or directory should be in local form when
      there are only 3 bytes of data in the data fork. Hence the inode
      verify for the regular file now fires w/ -EFSCORRUPTED because
      the inode fork format does not match the format the corrupted mode
      says it should be in.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1eb70f54
    • Dave Chinner's avatar
      xfs: detect self referencing btree sibling pointers · dc04db2a
      Dave Chinner authored
      To catch the obvious graph cycle problem and hence potential endless
      looping.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dc04db2a
    • Dave Chinner's avatar
      xfs: intent item whiteouts · 0d227466
      Dave Chinner authored
      When we log modifications based on intents, we add both intent
      and intent done items to the modification being made. These get
      written to the log to ensure that the operation is re-run if the
      intent done is not found in the log.
      
      However, for operations that complete wholly within a single
      checkpoint, the change in the checkpoint is atomic and will never
      need replay. In this case, we don't need to actually write the
      intent and intent done items to the journal because log recovery
      will never need to manually restart this modification.
      
      Log recovery currently handles intent/intent done matching by
      inserting the intent into the AIL, then removing it when a matching
      intent done item is found. Hence for all the intent-based operations
      that complete within a checkpoint, we spend all that time parsing
      the intent/intent done items just to cancel them and do nothing with
      them.
      
      Hence it follows that the only time we actually need intents in the
      log is when the modification crosses checkpoint boundaries in the
      log and so may only be partially complete in the journal. Hence if
      we commit and intent done item to the CIL and the intent item is in
      the same checkpoint, we don't actually have to write them to the
      journal because log recovery will always cancel the intents.
      
      We've never really worried about the overhead of logging intents
      unnecessarily like this because the intents we log are generally
      very much smaller than the change being made. e.g. freeing an extent
      involves modifying at lease two freespace btree blocks and the AGF,
      so the EFI/EFD overhead is only a small increase in space and
      processing time compared to the overall cost of freeing an extent.
      
      However, delayed attributes change this cost equation dramatically,
      especially for inline attributes. In the case of adding an inline
      attribute, we only log the inode core and attribute fork at present.
      With delayed attributes, we now log the attr intent which includes
      the name and value, the inode core adn attr fork, and finally the
      attr intent done item. We increase the number of items we log from 1
      to 3, and the number of log vectors (regions) goes up from 3 to 7.
      Hence we tripple the number of objects that the CIL has to process,
      and more than double the number of log vectors that need to be
      written to the journal.
      
      At scale, this means delayed attributes cause a non-pipelined CIL to
      become CPU bound processing all the extra items, resulting in a > 40%
      performance degradation on 16-way file+xattr create worklaods.
      Pipelining the CIL (as per 5.15) reduces the performance degradation
      to 20%, but now the limitation is the rate at which the log items
      can be written to the iclogs and iclogs be dispatched for IO and
      completed.
      
      Even log IO completion is slowed down by these intents, because it
      now has to process 3x the number of items in the checkpoint.
      Processing completed intents is especially inefficient here, because
      we first insert the intent into the AIL, then remove it from the AIL
      when the intent done is processed. IOWs, we are also doing expensive
      operations in log IO completion we could completely avoid if we
      didn't log completed intent/intent done pairs.
      
      Enter log item whiteouts.
      
      When an intent done is committed, we can check to see if the
      associated intent is in the same checkpoint as we are currently
      committing the intent done to. If so, we can mark the intent log
      item with a whiteout and immediately free the intent done item
      rather than committing it to the CIL. We can basically skip the
      entire formatting and CIL insertion steps for the intent done item.
      
      However, we cannot remove the intent item from the CIL at this point
      because the unlocked per-cpu CIL item lists do not permit removal
      without holding the CIL context lock exclusively. Transaction commit
      only holds the context lock shared, hence the best we can do is mark
      the intent item with a whiteout so that the CIL push can release it
      rather than writing it to the log.
      
      This means we never write the intent to the log if the intent done
      has also been committed to the same checkpoint, but we'll always
      write the intent if the intent done has not been committed or has
      been committed to a different checkpoint. This will result in
      correct log recovery behaviour in all cases, without the overhead of
      logging unnecessary intents.
      
      This intent whiteout concept is generic - we can apply it to all
      intent/intent done pairs that have a direct 1:1 relationship. The
      way deferred ops iterate and relog intents mean that all intents
      currently have a 1:1 relationship with their done intent, and hence
      we can apply this cancellation to all existing intent/intent done
      implementations.
      
      For delayed attributes with a 16-way 64kB xattr create workload,
      whiteouts reduce the amount of journalled metadata from ~2.5GB/s
      down to ~600MB/s and improve the creation rate from 9000/s to
      14000/s.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      0d227466
    • Dave Chinner's avatar
      xfs: whiteouts release intents that are not in the AIL · 3512fc1e
      Dave Chinner authored
      When we release an intent that a whiteout applies to, it will not
      have been committed to the journal and so won't be in the AIL. Hence
      when we drop the last reference to the intent, we do not want to try
      to remove it from the AIL as that will trigger a filesystem
      shutdown. Hence make the removal of intents from the AIL conditional
      on them actually being in the AIL so we do the correct thing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      3512fc1e
    • Dave Chinner's avatar
      xfs: add log item method to return related intents · c23ab603
      Dave Chinner authored
      To apply a whiteout to an intent item when an intent done item is
      committed, we need to be able to retrieve the intent item from the
      the intent done item. Add a log item op method for doing this, and
      wire all the intent done items up to it.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      c23ab603
    • Dave Chinner's avatar
      xfs: factor and move some code in xfs_log_cil.c · 22b1afc5
      Dave Chinner authored
      In preparation for adding support for intent item whiteouts.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      22b1afc5