1. 29 Jul, 2020 5 commits
    • Darrick J. Wong's avatar
      xfs: validate ondisk/incore dquot flags · afeda600
      Darrick J. Wong authored
      While loading dquot records off disk, make sure that the quota type
      flags are the same between the incore dquot and the ondisk dquot.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      afeda600
    • Darrick J. Wong's avatar
      xfs: fix inode quota reservation checks · f959b5d0
      Darrick J. Wong authored
      xfs_trans_dqresv is the function that we use to make reservations
      against resource quotas.  Each resource contains two counters: the
      q_core counter, which tracks resources allocated on disk; and the dquot
      reservation counter, which tracks how much of that resource has either
      been allocated or reserved by threads that are working on metadata
      updates.
      
      For disk blocks, we compare the proposed reservation counter against the
      hard and soft limits to decide if we're going to fail the operation.
      However, for inodes we inexplicably compare against the q_core counter,
      not the incore reservation count.
      
      Since the q_core counter is always lower than the reservation count and
      we unlock the dquot between reservation and transaction commit, this
      means that multiple threads can reserve the last inode count before we
      hit the hard limit, and when they commit, we'll be well over the hard
      limit.
      
      Fix this by checking against the incore inode reservation counter, since
      we would appear to maintain that correctly (and that's what we report in
      GETQUOTA).
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarAllison Collins <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      f959b5d0
    • Darrick J. Wong's avatar
      xfs: clear XFS_DQ_FREEING if we can't lock the dquot buffer to flush · c97738a9
      Darrick J. Wong authored
      In commit 8d3d7e2b, we changed xfs_qm_dqpurge to bail out if we
      can't lock the dquot buf to flush the dquot.  This prevents the AIL from
      blocking on the dquot, but it also forgets to clear the FREEING flag on
      its way out.  A subsequent purge attempt will see the FREEING flag is
      set and bail out, which leads to dqpurge_all failing to purge all the
      dquots.
      
      (copy-pasting from Dave Chinner's identical patch)
      
      This was found by inspection after having xfs/305 hang 1 in ~50
      iterations in a quotaoff operation:
      
      [ 8872.301115] xfs_quota       D13888 92262  91813 0x00004002
      [ 8872.302538] Call Trace:
      [ 8872.303193]  __schedule+0x2d2/0x780
      [ 8872.304108]  ? do_raw_spin_unlock+0x57/0xd0
      [ 8872.305198]  schedule+0x6e/0xe0
      [ 8872.306021]  schedule_timeout+0x14d/0x300
      [ 8872.307060]  ? __next_timer_interrupt+0xe0/0xe0
      [ 8872.308231]  ? xfs_qm_dqusage_adjust+0x200/0x200
      [ 8872.309422]  schedule_timeout_uninterruptible+0x2a/0x30
      [ 8872.310759]  xfs_qm_dquot_walk.isra.0+0x15a/0x1b0
      [ 8872.311971]  xfs_qm_dqpurge_all+0x7f/0x90
      [ 8872.313022]  xfs_qm_scall_quotaoff+0x18d/0x2b0
      [ 8872.314163]  xfs_quota_disable+0x3a/0x60
      [ 8872.315179]  kernel_quotactl+0x7e2/0x8d0
      [ 8872.316196]  ? __do_sys_newstat+0x51/0x80
      [ 8872.317238]  __x64_sys_quotactl+0x1e/0x30
      [ 8872.318266]  do_syscall_64+0x46/0x90
      [ 8872.319193]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 8872.320490] RIP: 0033:0x7f46b5490f2a
      [ 8872.321414] Code: Bad RIP value.
      
      Returning -EAGAIN from xfs_qm_dqpurge() without clearing the
      XFS_DQ_FREEING flag means the xfs_qm_dqpurge_all() code can never
      free the dquot, and we loop forever waiting for the XFS_DQ_FREEING
      flag to go away on the dquot that leaked it via -EAGAIN.
      
      Fixes: 8d3d7e2b ("xfs: trylock underlying buffer on dquot flush")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarAllison Collins <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c97738a9
    • Brian Foster's avatar
      xfs: fix inode allocation block res calculation precedence · b2a88647
      Brian Foster authored
      The block reservation calculation for inode allocation is supposed
      to consist of the blocks required for the inode chunk plus
      (maxlevels-1) of the inode btree multiplied by the number of inode
      btrees in the fs (2 when finobt is enabled, 1 otherwise).
      
      Instead, the macro returns (ialloc_blocks + 2) due to a precedence
      error in the calculation logic. This leads to block reservation
      overruns via generic/531 on small block filesystems with finobt
      enabled. Add braces to fix the calculation and reserve the
      appropriate number of blocks.
      
      Fixes: 9d43b180 ("xfs: update inode allocation/free transaction reservations for finobt")
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      b2a88647
    • Brian Foster's avatar
      xfs: drain the buf delwri queue before xfsaild idles · f376b45e
      Brian Foster authored
      xfsaild is racy with respect to transaction abort and shutdown in
      that the task can idle or exit with an empty AIL but buffers still
      on the delwri queue. This was partly addressed by cancelling the
      delwri queue before the task exits to prevent memory leaks, but it's
      also possible for xfsaild to empty and idle with buffers on the
      delwri queue. For example, a transaction that pins a buffer that
      also happens to sit on the AIL delwri queue will explicitly remove
      the associated log item from the AIL if the transaction aborts. The
      side effect of this is an unmount hang in xfs_wait_buftarg() as the
      associated buffers remain held by the delwri queue indefinitely.
      This is reproduced on repeated runs of generic/531 with an fs format
      (-mrmapbt=1 -bsize=1k) that happens to also reproduce transaction
      aborts.
      
      Update xfsaild to not idle until both the AIL and associated delwri
      queue are empty and update the push code to continue delwri queue
      submission attempts even when the AIL is empty. This allows the AIL
      to eventually release aborted buffers stranded on the delwri queue
      when they are unlocked by the associated transaction. This should
      have no significant effect on normal runtime behavior because the
      xfsaild currently idles only when the AIL is empty and in practice
      the AIL is rarely empty with a populated delwri queue. The items
      must be AIL resident to land in the queue in the first place and
      generally aren't removed until writeback completes.
      
      Note that the pre-existing delwri queue cancel logic in the exit
      path is retained because task stop is external, could technically
      come at any point, and xfsaild is still responsible to release its
      buffer references before it exits.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      f376b45e
  2. 17 Jul, 2020 1 commit
    • Eric Sandeen's avatar
      xfs: preserve inode versioning across remounts · 4750a171
      Eric Sandeen authored
      The MS_I_VERSION mount flag is exposed via the VFS, as documented
      in the mount manpages etc; see the iversion and noiversion mount
      options in mount(8).
      
      As a result, mount -o remount looks for this option in /proc/mounts
      and will only send the I_VERSION flag back in during remount it it
      is present.  Since it's not there, a remount will /remove/ the
      I_VERSION flag at the vfs level, and iversion functionality is lost.
      
      xfs v5 superblocks intend to always have i_version enabled; it is
      set as a default at mount time, but is lost during remount for the
      reasons above.
      
      The generic fix would be to expose this documented option in
      /proc/mounts, but since that was rejected, fix it up again in the
      xfs remount path instead, so that at least xfs won't suffer from
      this misbehavior.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4750a171
  3. 14 Jul, 2020 3 commits
  4. 09 Jul, 2020 1 commit
    • Waiman Long's avatar
      xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim · c3f2375b
      Waiman Long authored
      Depending on the workloads, the following circular locking dependency
      warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
      lock) may show up:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.0.0-rc1+ #60 Tainted: G        W
      ------------------------------------------------------
      fsfreeze/4346 is trying to acquire lock:
      0000000026f1d784 (fs_reclaim){+.+.}, at:
      fs_reclaim_acquire.part.19+0x5/0x30
      
      but task is already holding lock:
      0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      which lock already depends on the new lock.
        :
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_internal);
                                     lock(fs_reclaim);
                                     lock(sb_internal);
        lock(fs_reclaim);
      
       *** DEADLOCK ***
      
      4 locks held by fsfreeze/4346:
       #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
       #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
       #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
       #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      stack backtrace:
      Call Trace:
       dump_stack+0xe0/0x19a
       print_circular_bug.isra.10.cold.34+0x2f4/0x435
       check_prev_add.constprop.19+0xca1/0x15f0
       validate_chain.isra.14+0x11af/0x3b50
       __lock_acquire+0x728/0x1200
       lock_acquire+0x269/0x5a0
       fs_reclaim_acquire.part.19+0x29/0x30
       fs_reclaim_acquire+0x19/0x20
       kmem_cache_alloc+0x3e/0x3f0
       kmem_zone_alloc+0x79/0x150
       xfs_trans_alloc+0xfa/0x9d0
       xfs_sync_sb+0x86/0x170
       xfs_log_sbcount+0x10f/0x140
       xfs_quiesce_attr+0x134/0x270
       xfs_fs_freeze+0x4a/0x70
       freeze_super+0x1af/0x290
       do_vfs_ioctl+0xedc/0x16c0
       ksys_ioctl+0x41/0x80
       __x64_sys_ioctl+0x73/0xa9
       do_syscall_64+0x18f/0xd23
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is a false positive as all the dirty pages are flushed out before
      the filesystem can be frozen.
      
      One way to avoid this splat is to add GFP_NOFS to the affected allocation
      calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
      This shouldn't matter unless the system is really running out of memory.
      In that particular case, the filesystem freeze operation may fail while
      it was succeeding previously.
      
      Without this patch, the command sequence below will show that the lock
      dependency chain sb_internal -> fs_reclaim exists.
      
       # fsfreeze -f /home
       # fsfreeze --unfreeze /home
       # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
      
      After applying the patch, such sb_internal -> fs_reclaim lock dependency
      chain can no longer be found. Because of that, the locking dependency
      warning will not be shown.
      Suggested-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      c3f2375b
  5. 07 Jul, 2020 22 commits
    • Darrick J. Wong's avatar
      xfs: rtbitmap scrubber should check inode size · 2fb94e36
      Darrick J. Wong authored
      Make sure the rtbitmap is large enough to store the entire bitmap.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarAllison Collins <allison.henderson@oracle.com>
      2fb94e36
    • Darrick J. Wong's avatar
      xfs: rtbitmap scrubber should verify written extents · f866560b
      Darrick J. Wong authored
      Ensure that the realtime bitmap file is backed entirely by written
      extents.  No holes, no unwritten blocks, etc.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarAllison Collins <allison.henderson@oracle.com>
      f866560b
    • Dave Chinner's avatar
      xfs: remove xfs_inobp_check() · e2705b03
      Dave Chinner authored
      This debug code is called on every xfs_iflush() call, which then
      checks every inode in the buffer for non-zero unlinked list field.
      Hence it checks every inode in the cluster buffer every time a
      single inode on that cluster it flushed. This is resulting in:
      
      -   38.91%     5.33%  [kernel]  [k] xfs_iflush
         - 17.70% xfs_iflush
            - 9.93% xfs_inobp_check
                 4.36% xfs_buf_offset
      
      10% of the CPU time spent flushing inodes is repeatedly checking
      unlinked fields in the buffer. We don't need to do this.
      
      The other place we call xfs_inobp_check() is
      xfs_iunlink_update_dinode(), and this is after we've done this
      assert for the agino we are about to write into that inode:
      
      	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
      
      which means we've already checked that the agino we are about to
      write is not 0 on debug kernels. The inode buffer verifiers do
      everything else we need, so let's just remove this debug code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e2705b03
    • Dave Chinner's avatar
      xfs: factor xfs_iflush_done · a69a1dc2
      Dave Chinner authored
      xfs_iflush_done() does 3 distinct operations to the inodes attached
      to the buffer. Separate these operations out into functions so that
      it is easier to modify these operations independently in future.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      a69a1dc2
    • Dave Chinner's avatar
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner authored
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster skip inodes marked with
      XFS_IRECLAIM. This we avoid races with inodes that reclaim is
      actively referencing or are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      5717ea4d
    • Dave Chinner's avatar
      xfs: rename xfs_iflush_int() · e6187b34
      Dave Chinner authored
      with xfs_iflush() gone, we can rename xfs_iflush_int() back to
      xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
      need the forward definition any more.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e6187b34
    • Dave Chinner's avatar
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner authored
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      90c60e16
    • Dave Chinner's avatar
      xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Dave Chinner authored
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. The means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      48d55e2a
    • Dave Chinner's avatar
      xfs: rework stale inodes in xfs_ifree_cluster · 71e3e356
      Dave Chinner authored
      Once we have inodes pinning the cluster buffer and attached whenever
      they are dirty, we no longer have a guarantee that the items are
      flush locked when we lock the cluster buffer. Hence we cannot just
      walk the buffer log item list and modify the attached inodes.
      
      If the inode is not flush locked, we have to ILOCK it first and then
      flush lock it to do all the prerequisite checks needed to avoid
      races with other code. This is already handled by
      xfs_ifree_get_one_inode(), so rework the inode iteration loop and
      function to update all inodes in cache whether they are attached to
      the buffer or not.
      
      Note: we also remove the copying of the log item lsn to the
      ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
      trigger aborts and so flush lsn matching is not needed in IO
      completion for processing freed inodes.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      71e3e356
    • Dave Chinner's avatar
      xfs: clean up inode reclaim comments · 02511a5a
      Dave Chinner authored
      Inode reclaim is quite different now to the way described in various
      comments, so update all the comments explaining what it does and how
      it works.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      02511a5a
    • Dave Chinner's avatar
      xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Dave Chinner authored
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
    • Dave Chinner's avatar
      xfs: remove SYNC_TRYLOCK from inode reclaim · 50718b8d
      Dave Chinner authored
      All background reclaim is SYNC_TRYLOCK already, and even blocking
      reclaim (SYNC_WAIT) can use trylock mechanisms as
      xfs_reclaim_inodes_ag() will keep cycling until there are no more
      reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
      reclaim and make everything unconditionally non-blocking.
      
      We remove all the optimistic "avoid blocking on locks" checks done
      in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
      Further, checking XFS_IFLOCK optimistically can result in detecting
      inodes in the process of being cleaned (i.e. between being removed
      from the AIL and having the flush lock dropped), so for
      xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
      these checks anyway.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      50718b8d
    • Dave Chinner's avatar
      xfs: don't block inode reclaim on the ILOCK · 9552e14d
      Dave Chinner authored
      When we attempt to reclaim an inode, the first thing we do is take
      the inode lock. This is blocking right now, so if the inode being
      accessed by something else (e.g. being flushed to the cluster
      buffer) we will block here.
      
      Change this to a trylock so that we do not block inode reclaim
      unnecessarily here.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      9552e14d
    • Dave Chinner's avatar
      xfs: allow multiple reclaimers per AG · 0e8e2c63
      Dave Chinner authored
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
    • Dave Chinner's avatar
      xfs: remove IO submission from xfs_reclaim_inode() · 617825fe
      Dave Chinner authored
      We no longer need to issue IO from shrinker based inode reclaim to
      prevent spurious OOM killer invocation. This leaves only the global
      filesystem management operations such as unmount needing to
      writeback dirty inodes and reclaim them.
      
      Instead of using the reclaim pass to write dirty inodes before
      reclaiming them, use the AIL to push all the dirty inodes before we
      try to reclaim them. This allows us to remove all the conditional
      SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
      and greatly simplify the checks we need to do to reclaim an inode.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      617825fe
    • Dave Chinner's avatar
      xfs: make inode reclaim almost non-blocking · 993f951f
      Dave Chinner authored
      Now that dirty inode writeback doesn't cause read-modify-write
      cycles on the inode cluster buffer under memory pressure, the need
      to throttle memory reclaim to the rate at which we can clean dirty
      inodes goes away. That is due to the fact that we no longer thrash
      inode cluster buffers under memory pressure to clean dirty inodes.
      
      This means inode writeback no longer stalls on memory allocation
      or read IO, and hence can be done asynchronously without generating
      memory pressure. As a result, blocking inode writeback in reclaim is
      no longer necessary to prevent reclaim priority windup as cleaning
      dirty inodes is no longer dependent on having memory reserves
      available for the filesystem to make progress reclaiming inodes.
      
      Hence we can convert inode reclaim to be non-blocking for shrinker
      callouts, both for direct reclaim and kswapd.
      
      On a vanilla kernel, running a 16-way fsmark create workload on a
      4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
      userspace mlock(). The OOM killer gets invoked at 15GB of
      pinned RAM.
      
      Without the inode cluster pinning, this non-blocking reclaim patch
      triggers premature OOM killer invocation with the same memory
      pinning, sometimes with as much as 45% of RAM being free.  It's
      trivially easy to trigger the OOM killer when reclaim does not
      block.
      
      With pinning inode clusters in RAM and then adding this patch, I can
      reliably pin 14.5GB of RAM and still have the fsmark workload run to
      completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
      is only a small amount of memory less than the vanilla kernel. It is
      much more reliable than just with async reclaim alone.
      
      simoops shows that allocation stalls go away when async reclaim is
      used. Vanilla kernel:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
      Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
      Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
      work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
      alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
      
      With inode cluster pinning and async reclaim:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
      Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
      Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
      work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
      alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
      
      Latencies don't really change much, nor does the work rate. However,
      allocation almost never stalls with these changes, whilst the
      vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
      period. This difference is due to inode reclaim being largely
      non-blocking now.
      
      IOWs, once we have pinned inode cluster buffers, we can make inode
      reclaim non-blocking without a major risk of premature and/or
      spurious OOM killer invocation, and without any changes to memory
      reclaim infrastructure.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      993f951f
    • Dave Chinner's avatar
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner authored
      When we dirty an inode, we are going to have to write it disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers to be reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      298f7bec
    • Dave Chinner's avatar
      xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() · e98084b8
      Dave Chinner authored
      xfs_ail_delete_one() is called directly from dquot and inode IO
      completion, as well as from the generic xfs_trans_ail_delete()
      function. Inodes are about to have their own failure handling, and
      dquots will in future, too. Pull the clearing of the LI_FAILED flag
      up into the callers so we can customise the code appropriately.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e98084b8
    • Dave Chinner's avatar
      xfs: unwind log item error flagging · 3536b61e
      Dave Chinner authored
      When an buffer IO error occurs, we want to mark all
      the log items attached to the buffer as failed. Open code
      the error handling loop so that we can modify the flagging for the
      different types of objects directly and independently of each other.
      
      This also allows us to remove the ->iop_error method from the log
      item operations.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      3536b61e
    • Dave Chinner's avatar
      xfs: handle buffer log item IO errors directly · 428947e9
      Dave Chinner authored
      Currently when a buffer with attached log items has an IO error
      it called ->iop_error for each attched log item. These all call
      xfs_set_li_failed() to handle the error, but we are about to change
      the way log items manage buffers. hence we first need to remove the
      per-item dependency on buffer handling done by xfs_set_li_failed().
      
      We already have specific buffer type IO completion routines, so move
      the log item error handling out of the generic error handling and
      into the log item specific functions so we can implement per-type
      error handling easily.
      
      This requires a more complex return value from the error handling
      code so that we can take the correct action the failure handling
      requires.  This results in some repeated boilerplate in the
      functions, but that can be cleaned up later once all the changes
      cascade through this code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      428947e9
    • Dave Chinner's avatar
      xfs: get rid of log item callbacks · 2ef3f7f5
      Dave Chinner authored
      They are not used anymore, so remove them from the log item and the
      buffer iodone attachment interfaces.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2ef3f7f5
    • Dave Chinner's avatar
      xfs: clean up the buffer iodone callback functions · fec671cd
      Dave Chinner authored
      Now that we've sorted inode and dquot buffers, we can apply the same
      cleanups to dirty buffers with buffer log items. They only have one
      callback, too, so we don't need the log item callback. Collapse the
      iodone functions and remove all the now unnecessary infrastructure
      around callback processing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      fec671cd
  6. 06 Jul, 2020 8 commits