    xfs: separate read-only variables in struct xfs_mount · b0dff466
    Dave Chinner authored
Seeing massive CPU usage from xfs_agino_range() on one machine;
instruction-level profiles look similar to another machine running
the same workload, yet one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations.
    
    Machine A:
    
      25.83%  [k] xfs_agino_range
      12.68%  [k] __xfs_dir3_data_check
       6.95%  [k] xfs_verify_ino
       6.78%  [k] xfs_dir2_data_entry_tag_p
       3.56%  [k] xfs_buf_find
       2.31%  [k] xfs_verify_dir_ino
       2.02%  [k] xfs_dabuf_map.constprop.0
       1.65%  [k] xfs_ag_block_count
    
    And takes around 13 minutes to remove 50 million inodes.
    
    Machine B:
    
      13.90%  [k] __pv_queued_spin_lock_slowpath
       3.76%  [k] do_raw_spin_lock
       2.83%  [k] xfs_dir3_leaf_check_int
       2.75%  [k] xfs_agino_range
       2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
       2.18%  [k] __xfs_dir3_data_check
       2.02%  [k] xfs_log_commit_cil
    
    And takes around 5m30s to remove 50 million inodes.
    
The suspect is cacheline contention on m_sectbb_log, which is used
in one of the macros in xfs_agino_range(). This is a read-only
variable, but it shares a cacheline with m_active_trans, a global
atomic counter that gets bounced all around the machine.
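
For readers unfamiliar with the pattern, here is a minimal sketch of
that kind of false sharing. It is a userspace approximation with
hypothetical field names (mount_like, sectbb_log, active_trans), not
the actual struct xfs_mount definition:

    #include <stdatomic.h>
    #include <stdint.h>

    /*
     * Illustrative only: a read-only field that happens to sit on the
     * same 64-byte cacheline as a hot atomic counter has its cacheline
     * invalidated on every counter update.
     */
    struct mount_like {
            /* ... other fields ... */
            uint8_t     sectbb_log;     /* read-only after mount */
            atomic_long active_trans;   /* bumped on every transaction */
            /*
             * Both fields can land on the same cacheline, so every
             * update to active_trans forces readers of sectbb_log on
             * other CPUs to re-fetch the line: classic false sharing.
             */
    };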
    
The workload is trying to run hundreds of thousands of transactions
per second, so cacheline contention will be occurring on this
atomic counter. Hence xfs_agino_range() is likely just an
innocent bystander as the cache coherency protocol fights over the
cacheline between CPU cores and sockets.
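
A minimal sketch of the kind of rearrangement being tested follows,
again a userspace approximation with hypothetical field names rather
than the actual patch: group the read-only fields together and push
the frequently modified counter onto its own cacheline.

    #include <stdatomic.h>
    #include <stdint.h>

    #define CACHELINE_SIZE 64   /* assumed cacheline size */

    struct mount_like {
            /*
             * Read-only after mount: grouped so they only share
             * cachelines with other read-only data and stay clean in
             * every CPU's cache.
             */
            uint8_t  sectbb_log;
            uint8_t  blkbb_log;
            uint32_t agcount;

            /*
             * Frequently modified: starts on its own cacheline so
             * updates never invalidate the read-only fields above.
             * The kernel idiom for this is ____cacheline_aligned_in_smp.
             */
            _Alignas(CACHELINE_SIZE) atomic_long active_trans;
    };

A tool like pahole can then be used to confirm which fields end up
on which cacheline after the rearrangement.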
    
On machine A, this rearrangement of struct xfs_mount results in the
profile changing to:
    
       9.77%  [kernel]  [k] xfs_agino_range
       6.27%  [kernel]  [k] __xfs_dir3_data_check
       5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
       4.54%  [kernel]  [k] xfs_buf_find
       3.79%  [kernel]  [k] do_raw_spin_lock
       3.39%  [kernel]  [k] xfs_verify_ino
       2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
    
Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
seen on machine B, and the workload still runs substantially slower
than it should.
    
    Current rm -rf of 50 million files:
    
    		vanilla		patched
    machine A	13m20s		6m42s
    machine B	5m30s		5m02s
    
It's an improvement, indicating that separating and further
optimising read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>