1. 27 May, 2020 6 commits
    • xfs: always return -ENOSPC on project quota reservation failure · dcf1ccc9
      Eric Sandeen authored
      XFS project quota treats project hierarchies as "mini filesystems", so
      the intent is to return -ENOSPC rather than -EDQUOT when a project quota
      reservation fails, but this behavior is not applied consistently.
      
      The only place where we decide between returning -EDQUOT and -ENOSPC
      based on quota type is in xfs_trans_dqresv().
      
      This decision is currently controlled by whether or not the
      XFS_QMOPT_ENOSPC flag gets passed into the quota reservation.  However,
      the flag is not used consistently: paths such as xfs_create() and
      xfs_symlink() don't set it, so a project quota reservation failure on
      these sorts of operations returns -EDQUOT rather than -ENOSPC:
      
      # mkdir mnt/project
      # xfs_quota -x -c "project -s -p mnt/project 42" mnt
      # xfs_quota -x -c 'limit -p isoft=2 ihard=3 42' mnt
      # touch mnt/project/file{1,2,3}
      touch: cannot touch ‘mnt/project/file3’: Disk quota exceeded
      
      We can make this consistent by not requiring the flag to be set at the
      top of the callchain; instead we can simply test whether we are
      reserving a project quota with XFS_QM_ISPDQ in xfs_trans_dqresv() and,
      if so, return -ENOSPC for that failure.  This removes the need for
      XFS_QMOPT_ENOSPC altogether and simplifies the code a fair bit.
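      As a minimal, self-contained sketch of the decision being moved into
      xfs_trans_dqresv() (the enum and helper below are hypothetical
      stand-ins, not the kernel's types):
      
      #include <errno.h>
      #include <stdio.h>
      
      /* Hypothetical stand-in for the dquot type; not a kernel type. */
      enum quota_type { QUOTA_USER, QUOTA_GROUP, QUOTA_PROJECT };
      
      /* Pick the error from the quota type itself, with no caller-supplied flag. */
      static int quota_reservation_error(enum quota_type type)
      {
              /* Project quotas behave like mini filesystems: report "no space". */
              if (type == QUOTA_PROJECT)
                      return -ENOSPC;
              /* User and group quotas report a quota limit error. */
              return -EDQUOT;
      }
      
      int main(void)
      {
              printf("project: %d, user: %d\n",
                     quota_reservation_error(QUOTA_PROJECT),
                     quota_reservation_error(QUOTA_USER));
              return 0;
      }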
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: group quota should return EDQUOT when prj quota enabled · c8d329f3
      Eric Sandeen authored
      Long ago, group & project quota were mutually exclusive, so when we
      turned on XFS_QMOPT_ENOSPC ("return ENOSPC if project quota is
      exceeded") because project quota was enabled, we only needed to
      disable it again for user quota.
      
      When group & project quota were separated, this was missed, and as a
      result, if project quota is enabled and group quota is exceeded, the
      error is incorrectly returned as ENOSPC instead of EDQUOT.
      
      Fix this by stripping XFS_QMOPT_ENOSPC out of flags for group
      quota when we try to reserve the space.
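      As a minimal illustration of the fix (the flag values below are
      hypothetical, not the kernel definitions), the group quota reservation
      simply has the project-quota-only bit masked off:
      
      #include <assert.h>
      #include <stdint.h>
      
      /* Hypothetical flag values for illustration only. */
      #define QMOPT_ENOSPC  (1u << 0)  /* "return ENOSPC, not EDQUOT" */
      #define QMOPT_OTHER   (1u << 1)  /* some other reservation flag */
      
      int main(void)
      {
              uint32_t flags = QMOPT_ENOSPC | QMOPT_OTHER;
      
              /*
               * When reserving against the group dquot, strip the
               * project-only behaviour so an over-limit group quota
               * reports EDQUOT as expected.
               */
              uint32_t group_flags = flags & ~QMOPT_ENOSPC;
      
              assert(!(group_flags & QMOPT_ENOSPC));
              assert(group_flags & QMOPT_OTHER);
              return 0;
      }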
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: remove the m_active_trans counter · b41b46c2
      Dave Chinner authored
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesystem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
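      A rough sketch of the reordering (not the actual diff; the rest of the
      quiesce work is elided):
      
      /*
       * Sketch only: cancel the log worker via xfs_log_quiesce() before the
       * rest of the quiesce work, so nothing can issue an xfs_sync_sb()
       * transaction while we drain and the m_active_trans guard becomes
       * unnecessary.
       */
      static void xfs_quiesce_attr(struct xfs_mount *mp)
      {
              xfs_log_quiesce(mp);    /* previously the last step */
      
              /* ... remaining quiesce work, now with no racing transactions ... */
      }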
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: separate read-only variables in struct xfs_mount · b0dff466
      Dave Chinner authored
      Seeing massive CPU usage from xfs_agino_range() on one machine;
      instruction-level profiles look similar to another machine running
      the same workload, but one machine is consuming 10x as much CPU as
      the other and going much slower. The only real difference between
      the two machines is core count per socket. Both are running
      identical 16p/16GB virtual machine configurations.
      
      Machine A:
      
        25.83%  [k] xfs_agino_range
        12.68%  [k] __xfs_dir3_data_check
         6.95%  [k] xfs_verify_ino
         6.78%  [k] xfs_dir2_data_entry_tag_p
         3.56%  [k] xfs_buf_find
         2.31%  [k] xfs_verify_dir_ino
         2.02%  [k] xfs_dabuf_map.constprop.0
         1.65%  [k] xfs_ag_block_count
      
      And takes around 13 minutes to remove 50 million inodes.
      
      Machine B:
      
        13.90%  [k] __pv_queued_spin_lock_slowpath
         3.76%  [k] do_raw_spin_lock
         2.83%  [k] xfs_dir3_leaf_check_int
         2.75%  [k] xfs_agino_range
         2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
         2.18%  [k] __xfs_dir3_data_check
         2.02%  [k] xfs_log_commit_cil
      
      And takes around 5m30s to remove 50 million inodes.
      
      The suspect is cacheline contention on m_sectbb_log, which is used in
      one of the macros in xfs_agino_range(). This is a read-only variable,
      but it shares a cacheline with m_active_trans, a global atomic that
      gets bounced all around the machine.
      
      The workload is trying to run hundreds of thousands of transactions
      per second, so cacheline contention on this atomic counter is
      inevitable. Hence xfs_agino_range() is likely just an innocent
      bystander as the cache coherency protocol fights over the cacheline
      between CPU cores and sockets.
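      A stand-alone sketch of the technique (field names and the explicit
      alignment are illustrative; the actual patch reorders the real
      struct xfs_mount fields rather than defining a new struct):
      
      #include <stdatomic.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>
      
      #define CACHELINE_SIZE 64
      
      struct mount_sketch {
              /* Read-mostly geometry: written once at mount time, read everywhere. */
              uint8_t  sectbb_log;
              uint8_t  blkbb_log;
              uint32_t ag_count;
      
              /*
               * Hot, frequently-written counters start on their own cacheline
               * so writes to them cannot invalidate the read-mostly fields above.
               */
              _Alignas(CACHELINE_SIZE) atomic_long active_trans;
      };
      
      int main(void)
      {
              /* The hot counter must not share a cacheline with the geometry. */
              printf("active_trans offset: %zu\n",
                     offsetof(struct mount_sketch, active_trans));
              return 0;
      }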
      
      On machine A, this rearrangement of the struct xfs_mount
      results in the profile changing to:
      
         9.77%  [kernel]  [k] xfs_agino_range
         6.27%  [kernel]  [k] __xfs_dir3_data_check
         5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         4.54%  [kernel]  [k] xfs_buf_find
         3.79%  [kernel]  [k] do_raw_spin_lock
         3.39%  [kernel]  [k] xfs_verify_ino
         2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
      
      Vastly less CPU usage in xfs_agino_range(), but still 3x that of
      machine B, and the workload still runs substantially slower than it should.
      
      Current rm -rf of 50 million files:
      
      		vanilla		patched
      machine A	13m20s		6m42s
      machine B	5m30s		5m02s
      
      It's an improvement, indicating that separation and further
      optimisation of read-only global filesystem data is worthwhile, but
      it clearly isn't the underlying issue causing this specific
      performance degradation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: reduce free inode accounting overhead · f18c9a90
      Dave Chinner authored
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single-threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low free inode count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
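      A back-of-the-envelope model of that threshold (the batch size and the
      compare heuristic below are simplified assumptions, not the exact
      kernel logic):
      
      #include <stdio.h>
      #include <stdlib.h>
      
      int main(void)
      {
              long nr_cpus = 128;
              long batch   = 256;               /* assumed per-CPU batch size */
              long slack   = nr_cpus * batch;   /* 32768: max drift of the cached global count */
      
              long approx_ifree = 10;           /* nearly-empty fs: very few free inodes */
              long rhs          = 0;            /* comparing against zero */
      
              if (labs(approx_ifree - rhs) > slack)
                      puts("fast path: approximate count is decisive");
              else
                      puts("slow path: sum all 128 per-CPU deltas");
              return 0;
      }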
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
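      The shape of the change in kernel context is roughly the following
      fragment (a hedged sketch, not the exact patch, which also tunes the
      batch size):
      
      /*
       * Apply the delta unconditionally and keep the expensive comparison
       * only inside ASSERT(), which compiles away unless CONFIG_XFS_DEBUG
       * or CONFIG_XFS_WARN is set.
       */
      if (ifreedelta) {
              percpu_counter_add(&mp->m_ifree, ifreedelta);
              ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
      }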
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() · dc3ffbb1
      Dave Chinner authored
      The error handling in xfs_trans_unreserve_and_mod_sb() is largely
      incorrect - rolling back the changes in the transaction if only one
      counter underruns makes all the other counters incorrect. We still
      allow the change to proceed and the transaction to commit, except
      that now we have multiple incorrect counters instead of a single
      underflow.
      
      Further, we don't actually report the error to the caller, so this
      is completely silent except on debug kernels that will assert on
      failure before we even get to the rollback code.  Hence this error
      handling is broken, untested, and largely unnecessary complexity.
      
      Just remove it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  2. 19 May, 2020 22 commits
  3. 13 May, 2020 3 commits
  4. 08 May, 2020 9 commits