1. 07 Jul, 2020 18 commits
    • Dave Chinner's avatar
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner authored
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster skip inodes marked with
      XFS_IRECLAIM. This we avoid races with inodes that reclaim is
      actively referencing or are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      5717ea4d
    • Dave Chinner's avatar
      xfs: rename xfs_iflush_int() · e6187b34
      Dave Chinner authored
      with xfs_iflush() gone, we can rename xfs_iflush_int() back to
      xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
      need the forward definition any more.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e6187b34
    • Dave Chinner's avatar
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner authored
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      90c60e16
    • Dave Chinner's avatar
      xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Dave Chinner authored
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. The means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      48d55e2a
    • Dave Chinner's avatar
      xfs: rework stale inodes in xfs_ifree_cluster · 71e3e356
      Dave Chinner authored
      Once we have inodes pinning the cluster buffer and attached whenever
      they are dirty, we no longer have a guarantee that the items are
      flush locked when we lock the cluster buffer. Hence we cannot just
      walk the buffer log item list and modify the attached inodes.
      
      If the inode is not flush locked, we have to ILOCK it first and then
      flush lock it to do all the prerequisite checks needed to avoid
      races with other code. This is already handled by
      xfs_ifree_get_one_inode(), so rework the inode iteration loop and
      function to update all inodes in cache whether they are attached to
      the buffer or not.
      
      Note: we also remove the copying of the log item lsn to the
      ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
      trigger aborts and so flush lsn matching is not needed in IO
      completion for processing freed inodes.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      71e3e356
    • Dave Chinner's avatar
      xfs: clean up inode reclaim comments · 02511a5a
      Dave Chinner authored
      Inode reclaim is quite different now to the way described in various
      comments, so update all the comments explaining what it does and how
      it works.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      02511a5a
    • Dave Chinner's avatar
      xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Dave Chinner authored
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
    • Dave Chinner's avatar
      xfs: remove SYNC_TRYLOCK from inode reclaim · 50718b8d
      Dave Chinner authored
      All background reclaim is SYNC_TRYLOCK already, and even blocking
      reclaim (SYNC_WAIT) can use trylock mechanisms as
      xfs_reclaim_inodes_ag() will keep cycling until there are no more
      reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
      reclaim and make everything unconditionally non-blocking.
      
      We remove all the optimistic "avoid blocking on locks" checks done
      in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
      Further, checking XFS_IFLOCK optimistically can result in detecting
      inodes in the process of being cleaned (i.e. between being removed
      from the AIL and having the flush lock dropped), so for
      xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
      these checks anyway.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      50718b8d
    • Dave Chinner's avatar
      xfs: don't block inode reclaim on the ILOCK · 9552e14d
      Dave Chinner authored
      When we attempt to reclaim an inode, the first thing we do is take
      the inode lock. This is blocking right now, so if the inode being
      accessed by something else (e.g. being flushed to the cluster
      buffer) we will block here.
      
      Change this to a trylock so that we do not block inode reclaim
      unnecessarily here.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      9552e14d
    • Dave Chinner's avatar
      xfs: allow multiple reclaimers per AG · 0e8e2c63
      Dave Chinner authored
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
    • Dave Chinner's avatar
      xfs: remove IO submission from xfs_reclaim_inode() · 617825fe
      Dave Chinner authored
      We no longer need to issue IO from shrinker based inode reclaim to
      prevent spurious OOM killer invocation. This leaves only the global
      filesystem management operations such as unmount needing to
      writeback dirty inodes and reclaim them.
      
      Instead of using the reclaim pass to write dirty inodes before
      reclaiming them, use the AIL to push all the dirty inodes before we
      try to reclaim them. This allows us to remove all the conditional
      SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
      and greatly simplify the checks we need to do to reclaim an inode.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      617825fe
    • Dave Chinner's avatar
      xfs: make inode reclaim almost non-blocking · 993f951f
      Dave Chinner authored
      Now that dirty inode writeback doesn't cause read-modify-write
      cycles on the inode cluster buffer under memory pressure, the need
      to throttle memory reclaim to the rate at which we can clean dirty
      inodes goes away. That is due to the fact that we no longer thrash
      inode cluster buffers under memory pressure to clean dirty inodes.
      
      This means inode writeback no longer stalls on memory allocation
      or read IO, and hence can be done asynchronously without generating
      memory pressure. As a result, blocking inode writeback in reclaim is
      no longer necessary to prevent reclaim priority windup as cleaning
      dirty inodes is no longer dependent on having memory reserves
      available for the filesystem to make progress reclaiming inodes.
      
      Hence we can convert inode reclaim to be non-blocking for shrinker
      callouts, both for direct reclaim and kswapd.
      
      On a vanilla kernel, running a 16-way fsmark create workload on a
      4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
      userspace mlock(). The OOM killer gets invoked at 15GB of
      pinned RAM.
      
      Without the inode cluster pinning, this non-blocking reclaim patch
      triggers premature OOM killer invocation with the same memory
      pinning, sometimes with as much as 45% of RAM being free.  It's
      trivially easy to trigger the OOM killer when reclaim does not
      block.
      
      With pinning inode clusters in RAM and then adding this patch, I can
      reliably pin 14.5GB of RAM and still have the fsmark workload run to
      completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
      is only a small amount of memory less than the vanilla kernel. It is
      much more reliable than just with async reclaim alone.
      
      simoops shows that allocation stalls go away when async reclaim is
      used. Vanilla kernel:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
      Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
      Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
      work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
      alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
      
      With inode cluster pinning and async reclaim:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
      Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
      Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
      work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
      alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
      
      Latencies don't really change much, nor does the work rate. However,
      allocation almost never stalls with these changes, whilst the
      vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
      period. This difference is due to inode reclaim being largely
      non-blocking now.
      
      IOWs, once we have pinned inode cluster buffers, we can make inode
      reclaim non-blocking without a major risk of premature and/or
      spurious OOM killer invocation, and without any changes to memory
      reclaim infrastructure.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      993f951f
    • Dave Chinner's avatar
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner authored
      When we dirty an inode, we are going to have to write it disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers to be reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      298f7bec
    • Dave Chinner's avatar
      xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() · e98084b8
      Dave Chinner authored
      xfs_ail_delete_one() is called directly from dquot and inode IO
      completion, as well as from the generic xfs_trans_ail_delete()
      function. Inodes are about to have their own failure handling, and
      dquots will in future, too. Pull the clearing of the LI_FAILED flag
      up into the callers so we can customise the code appropriately.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e98084b8
    • Dave Chinner's avatar
      xfs: unwind log item error flagging · 3536b61e
      Dave Chinner authored
      When an buffer IO error occurs, we want to mark all
      the log items attached to the buffer as failed. Open code
      the error handling loop so that we can modify the flagging for the
      different types of objects directly and independently of each other.
      
      This also allows us to remove the ->iop_error method from the log
      item operations.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      3536b61e
    • Dave Chinner's avatar
      xfs: handle buffer log item IO errors directly · 428947e9
      Dave Chinner authored
      Currently when a buffer with attached log items has an IO error
      it called ->iop_error for each attched log item. These all call
      xfs_set_li_failed() to handle the error, but we are about to change
      the way log items manage buffers. hence we first need to remove the
      per-item dependency on buffer handling done by xfs_set_li_failed().
      
      We already have specific buffer type IO completion routines, so move
      the log item error handling out of the generic error handling and
      into the log item specific functions so we can implement per-type
      error handling easily.
      
      This requires a more complex return value from the error handling
      code so that we can take the correct action the failure handling
      requires.  This results in some repeated boilerplate in the
      functions, but that can be cleaned up later once all the changes
      cascade through this code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      428947e9
    • Dave Chinner's avatar
      xfs: get rid of log item callbacks · 2ef3f7f5
      Dave Chinner authored
      They are not used anymore, so remove them from the log item and the
      buffer iodone attachment interfaces.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2ef3f7f5
    • Dave Chinner's avatar
      xfs: clean up the buffer iodone callback functions · fec671cd
      Dave Chinner authored
      Now that we've sorted inode and dquot buffers, we can apply the same
      cleanups to dirty buffers with buffer log items. They only have one
      callback, too, so we don't need the log item callback. Collapse the
      iodone functions and remove all the now unnecessary infrastructure
      around callback processing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      fec671cd
  2. 06 Jul, 2020 22 commits