- 14 Jul, 2022 17 commits
-
-
Darrick J. Wong authored
Merge tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB xfs: lockless buffer cache lookups Current work to merge the XFS inode life cycle with the VFS inode life cycle is finding some interesting issues. If we have a path that hits buffer trylocks fairly hard (e.g. a non-blocking background inode freeing function), we end up hitting massive contention on the buffer cache hash locks: - 92.71% 0.05% [kernel] [k] xfs_inodegc_worker - 92.67% xfs_inodegc_worker - 92.13% xfs_inode_unlink - 91.52% xfs_inactive_ifree - 85.63% xfs_read_agi - 85.61% xfs_trans_read_buf_map - 85.59% xfs_buf_read_map - xfs_buf_get_map - 85.55% xfs_buf_find - 72.87% _raw_spin_lock - do_raw_spin_lock 71.86% __pv_queued_spin_lock_slowpath - 8.74% xfs_buf_rele - 7.88% _raw_spin_lock - 7.88% do_raw_spin_lock 7.63% __pv_queued_spin_lock_slowpath - 1.70% xfs_buf_trylock - 1.68% down_trylock - 1.41% _raw_spin_lock_irqsave - 1.39% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.76% _raw_spin_unlock 0.75% do_raw_spin_unlock This is basically hammering the pag->pag_buf_lock from lots of CPUs doing trylocks at the same time. Most of the buffer trylock operations ultimately fail after we've done the lookup, so we're really hammering the buf hash lock whilst making no progress. We can also see significant spinlock traffic on the same lock just under normal operation when lots of tasks are accessing metadata from the same AG, so let's avoid all this by creating a lookup fast path which leverages the rhashtable's ability to do RCU protected lookups. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: lockless buffer lookup xfs: remove a superflous hash lookup when inserting new buffers xfs: reduce the number of atomic when locking a buffer after lookup xfs: merge xfs_buf_find() and xfs_buf_get_map() xfs: break up xfs_buf_find() into individual pieces xfs: rework xfs_buf_incore() API
-
Darrick J. Wong authored
Merge tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB xfs: introduce in-memory inode unlink log items To facilitate future improvements in inode logging and improving inode cluster buffer locking order consistency, we need a new mechanism for defering inode cluster buffer modifications during unlinked list modifications. The unlinked inode list buffer locking is complex. The unlinked list is unordered - we add to the tail, remove from where-ever the inode is in the list. Hence we might need to lock two inode buffers here (previous inode in list and the one being removed). While we can order the locking of these buffers correctly within the confines of the unlinked list, there may be other inodes that need buffer locking in the same transaction. e.g. O_TMPFILE being linked into a directory also modifies the directory inode. Hence we need a mechanism for defering unlinked inode list updates until a point where we know that all modifications have been made and all that remains is to lock and modify the cluster buffers. We can do this by first observing that we serialise unlinked list modifications by holding the AGI buffer lock. IOWs, the AGI is going to be locked until the transaction commits any time we modify the unlinked list. Hence it doesn't matter when in the unlink transactions that we actually load, lock and modify the inode cluster buffer. We add an in-memory unlinked inode log item to defer the inode cluster buffer update to transaction commit time where it can be ordered with all the other inode cluster operations that need to be done. Essentially all we need to do is record the inodes that need to have their unlinked list pointer updated in a new log item that we attached to the transaction. This log item exists purely for the purpose of delaying the update of the unlinked list pointer until the inode cluster buffer can be locked in the correct order around the other inode cluster buffers. It plays no part in the actual commit, and there's no change to anything that is written to the log. i.e. the inode cluster buffers still have to be fully logged here (not just ordered) as log recovery depedends on this to replay mods to the unlinked inode list. Hence if we add a "precommit" hook into xfs_trans_commit() to run a "precommit" operation on these iunlink log items, we can delay the locking, modification and logging of the inode cluster buffer until after all other modifications have been made. The precommit hook reuires us to sort the items that are going to be run so that we can lock precommit items in the correct order as we perform the modifications they describe. To make this unlinked inode list processing simpler and easier to implement as a log item, we need to change the way we track the unlinked list in memory. Starting from the observation that an inode on the unlinked list is pinned in memory by the VFS, we can use the xfs_inode itself to track the unlinked list. To do this efficiently, we want the unlinked list to be a double linked list. The problem here is that we need a list per AGI unlinked list, and there are 64 of these per AGI. The approach taken in this patchset is to shadow the AGI unlinked list heads in the perag, and link inodes by agino, hence requiring only 8 extra bytes per inode to track this state. We can then use the agino pointers for lockless inode cache lookups to retreive the inode. The aginos in the inode are modified only under the AGI lock, just like the cluster buffer pointers, so we don't need any extra locking here. The i_next_unlinked field tracks the on-disk value of the unlinked list, and the i_prev_unlinked is a purely in-memory pointer that enables us to efficiently remove inodes from the middle of the list. This results in moving a lot of the unlink modification work into the precommit operations on the unlink log item. Tracking all the unlinked inodes in the inodes themselves also gets rid of the unlinked list reference hash table that is used to track this back pointer relationship. This greatly simplifies the the unlinked list modification code, and removes memory allocations in this hot path to track back pointers. This, overall, slightly reduces the CPU overhead of the unlink path. The result of this log item means that we move all the actual manipulation of objects to be logged out of the iunlink path and into the iunlink item. This allows for future optimisation of this mechanism without needing changes to high level unlink path, as well as making the unlink lock ordering predictable and synchronised with other operations that may require inode cluster locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: add in-memory iunlink log item xfs: add log item precommit operation xfs: combine iunlink inode update functions xfs: clean up xfs_iunlink_update_inode() xfs: double link the unlinked inode list xfs: introduce xfs_iunlink_lookup xfs: refactor xlog_recover_process_iunlinks() xfs: track the iunlink list pointer in the xfs_inode xfs: factor the xfs_iunlink functions xfs: flush inode gc workqueue before clearing agi bucket
-
Dave Chinner authored
Now that we have a standalone fast path for buffer lookup, we can easily convert it to use rcu lookups. When we continually hammer the buffer cache with trylock lookups, we end up with a huge amount of lock contention on the per-ag buffer hash locks: - 92.71% 0.05% [kernel] [k] xfs_inodegc_worker - 92.67% xfs_inodegc_worker - 92.13% xfs_inode_unlink - 91.52% xfs_inactive_ifree - 85.63% xfs_read_agi - 85.61% xfs_trans_read_buf_map - 85.59% xfs_buf_read_map - xfs_buf_get_map - 85.55% xfs_buf_find - 72.87% _raw_spin_lock - do_raw_spin_lock 71.86% __pv_queued_spin_lock_slowpath - 8.74% xfs_buf_rele - 7.88% _raw_spin_lock - 7.88% do_raw_spin_lock 7.63% __pv_queued_spin_lock_slowpath - 1.70% xfs_buf_trylock - 1.68% down_trylock - 1.41% _raw_spin_lock_irqsave - 1.39% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.76% _raw_spin_unlock 0.75% do_raw_spin_unlock This is basically hammering the pag->pag_buf_lock from lots of CPUs doing trylocks at the same time. Most of the buffer trylock operations ultimately fail after we've done the lookup, so we're really hammering the buf hash lock whilst making no progress. We can also see significant spinlock traffic on the same lock just under normal operation when lots of tasks are accessing metadata from the same AG, so let's avoid all this by converting the lookup fast path to leverages the rhashtable's ability to do rcu protected lookups. We avoid races with the buffer release path by using atomic_inc_not_zero() on the buffer hold count. Any buffer that is in the LRU will have a non-zero count, thereby allowing the lockless fast path to be taken in most cache hit situations. If the buffer hold count is zero, then it is likely going through the release path so in that case we fall back to the existing lookup miss slow path. The slow path will then do an atomic lookup and insert under the buffer hash lock and hence serialise correctly against buffer release freeing the buffer. The use of rcu protected lookups means that buffer handles now need to be freed by RCU callbacks (same as inodes). We still free the buffer pages before the RCU callback - we won't be trying to access them at all on a buffer that has zero references - but we need the buffer handle itself to be present for the entire rcu protected read side to detect a zero hold count correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Currently on the slow path insert we repeat the initial hash table lookup before we attempt the insert, resulting in a two traversals of the hash table to ensure the insert is valid. The rhashtable API provides a method for an atomic lookup and insert operation, so we can avoid one of the hash table traversals by using this method. Adapted from a large patch containing this optimisation by Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Avoid an extra atomic operation in the non-trylock case by only doing a trylock if the XBF_TRYLOCK flag is set. This follows the pattern in the IO path with NOWAIT semantics where the "trylock-fail-lock" path showed 5-10% reduced throughput compared to just using single lock call when not under NOWAIT conditions. So make that same change here, too. See commit 942491c9 ("xfs: fix AIM7 regression") for details. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Now that we factored xfs_buf_find(), we can start separating into distinct fast and slow paths from xfs_buf_get_map(). We start by moving the lookup map and perag setup to _get_map(), and then move all the specifics of the fast path lookup into xfs_buf_lookup() and call it directly from _get_map(). We the move all the slow path code to xfs_buf_find_insert(), which is now also called directly from _get_map(). As such, xfs_buf_find() now goes away. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
xfs_buf_find() is made up of three main parts: lookup, insert and locking. The interactions with xfs_buf_get_map() require it to be called twice - once for a pure lookup, and again on lookup failure so the insert path can be run. We want to simplify this down a lot, so split it into a fast path lookup, a slow path insert and a "lock the found buffer" helper. This will then let us integrate these operations more effectively into xfs_buf_get_map() in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Now that we have a clean operation to update the di_next_unlinked field of inode cluster buffers, we can easily defer this operation to transaction commit time so we can order the inode cluster buffer locking consistently. To do this, we introduce a new in-memory log item to track the unlinked list item modification that we are going to make. This follows the same observations as the in-memory double linked list used to track unlinked inodes in that the inodes on the list are pinned in memory and cannot go away, and hence we can simply reference them for the duration of the transaction without needing to take active references or pin them or look them up. This allows us to pass the xfs_inode to the transaction commit code along with the modification to be made, and then order the logged modifications via the ->iop_sort and ->iop_precommit operations for the new log item type. As this is an in-memory log item, it doesn't have formatting, CIL or AIL operational hooks - it exists purely to run the inode unlink modifications and is then removed from the transaction item list and freed once the precommit operation has run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Dave Chinner authored
For inodes that are dirty, we have an attached cluster buffer that we want to use to track the dirty inode through the AIL. Unfortunately, locking the cluster buffer and adding it to the transaction when the inode is first logged in a transaction leads to buffer lock ordering inversions. The specific problem is ordering against the AGI buffer. When modifying unlinked lists, the buffer lock order is AGI -> inode cluster buffer as the AGI buffer lock serialises all access to the unlinked lists. Unfortunately, functionality like xfs_droplink() logs the inode before calling xfs_iunlink(), as do various directory manipulation functions. The inode can be logged way down in the stack as far as the bmapi routines and hence, without a major rewrite of lots of APIs there's no way we can avoid the inode being logged by something until after the AGI has been logged. As we are going to be using ordered buffers for inode AIL tracking, there isn't a need to actually lock that buffer against modification as all the modifications are captured by logging the inode item itself. Hence we don't actually need to join the cluster buffer into the transaction until just before it is committed. This means we do not perturb any of the existing buffer lock orders in transactions, and the inode cluster buffer is always locked last in a transaction that doesn't otherwise touch inode cluster buffers. We do this by introducing a precommit log item method. This commit just introduces the mechanism; the inode item implementation is in followup commits. The precommit items need to be sorted into consistent order as we may be locking multiple items here. Hence if we have two dirty inodes in cluster buffers A and B, and some other transaction has two separate dirty inodes in the same cluster buffers, locking them in different orders opens us up to ABBA deadlocks. Hence we sort the items on the transaction based on the presence of a sort log item method. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Dave Chinner authored
Combine the logging of the inode unlink list update into the calling function that looks up the buffer we end up logging. These do not need to be separate functions as they are both short, simple operations and there's only a single call path through them. This new function will end up being the core of the iunlink log item processing... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
We no longer need to have this function return the previous next agino value from the on-disk inode as we have it in the in-core inode now. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Now we have forwards traversal via the incore inode in place, we now need to add back pointers to the incore inode to entirely replace the back reference cache. We use the same lookup semantics and constraints as for the forwards pointer lookups during unlinks, and so we can look up any inode in the unlinked list directly and update the list pointers, forwards or backwards, at any time. The only wrinkle in converting the unlinked list manipulations to use in-core previous pointers is that log recovery doesn't have the incore inode state built up so it can't just read in an inode and release it to finish off the unlink. Hence we need to modify the traversal in recovery to read one inode ahead before we release the inode at the head of the list. This populates the next->prev relationship sufficient to be able to replay the unlinked list and hence greatly simplify the runtime code. This recovery algorithm also requires that we actually remove inodes from the unlinked list one at a time as background inode inactivation will result in unlinked list removal racing with the building of the in-memory unlinked list state. We could serialise this by holding the AGI buffer lock when constructing the in memory state, but all that does is lockstep background processing with list building. It is much simpler to flush the inodegc immediately after releasing the inode so that it is unlinked immediately and there is no races present at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Dave Chinner authored
When an inode is on an unlinked list during normal operation, it is guaranteed to be pinned in memory as it is either referenced by the current unlink operation or it has a open file descriptor that references it and has it pinned in memory. Hence to look up an inode on the unlinked list, we can do a direct inode cache lookup and always expect the lookup to succeed. Add a function to do this lookup based on the agino that we use to link the chain of unlinked inodes together so we can begin the conversion the unlinked list manipulations to use in-memory inodes rather than inode cluster buffers and remove the backref cache. Use this lookup function to replace the on-disk inode buffer walk when removing inodes from the unlinked list with an in-core inode unlinked list walk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
For upcoming changes to the way inode unlinked list processing is done, the structure of recovery needs to change slightly. We also really need to untangle the messy error handling in list recovery so that actions like emptying the bucket on inode lookup failure are associated with the bucket list walk failing, not failing to look up the inode. Refactor the recovery code now to keep the re-organisation seperate to the algorithm changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Having direct access to the i_next_unlinked pointer in unlinked inodes greatly simplifies the processing of inodes on the unlinked list. We no longer need to look up the inode buffer just to find next inode in the list if the xfs_inode is in memory. These improvements will be realised over upcoming patches as other dependencies on the inode buffer for unlinked list processing are removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Dave Chinner authored
Prep work that separates the locking that protects the unlinked list from the actual operations being performed. This also helps document the fact they are performing list insert and remove operations. No functional code change. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Zhang Yi authored
In the procedure of recover AGI unlinked lists, if something bad happenes on one of the unlinked inode in the bucket list, we would call xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list, not the unlinked inodes after the bad one. If we have already added some inodes to the gc workqueue before the bad inode in the list, we could get below error when freeing those inodes, and finaly fail to complete the log recover procedure. XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file fs/xfs/xfs_inode.c. Caller xfs_ifree+0xb0/0x360 [xfs] The problem is xlog_recover_clear_agi_bucket() clear the bucket list, so the gc worker fail to check the agino in xfs_verify_agino(). Fix this by flush workqueue before clearing the bucket. Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 09 Jul, 2022 4 commits
-
-
Andrey Strachuk authored
At line 1561, variable "state" is being compared with NULL every loop iteration. ------------------------------------------------------------------- 1561 for (i = 0; state != NULL && i < state->path.active; i++) { 1562 xfs_trans_brelse(args->trans, state->path.blk[i].bp); 1563 state->path.blk[i].bp = NULL; 1564 } ------------------------------------------------------------------- However, it cannot be NULL. ---------------------------------------- 1546 state = xfs_da_state_alloc(args); ---------------------------------------- xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is called with __GFP_NOFAIL flag and, therefore, it cannot return NULL. -------------------------------------------------------------------------- struct xfs_da_state * xfs_da_state_alloc( struct xfs_da_args *args) { struct xfs_da_state *state; state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL); state->args = args; state->mp = args->dp->i_mount; return state; } -------------------------------------------------------------------------- Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Andrey Strachuk <strochuk@ispras.ru> Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Eric Sandeen authored
We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't apply an SELinux label on xfs" as it does on other filesystems (for example, ext4 and tmpfs.) While I'm not quite sure how labels may interact w/ whiteout files, leaving them as unlabeled seems inconsistent at best. Now that xfs_init_security is not static, rename it to xfs_inode_init_security per dchinner's suggestion. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Darrick J. Wong authored
Merge tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA xfs: per-ag conversions for 5.20 This series drives the perag down into the AGI, AGF and AGFL access routines and unifies the perag structure initialisation with the high level AG header read functions. This largely replaces the xfs_mount/agno pair that is passed to all these functions with a perag, and in most places we already have a perag ready to pass in. There are a few places where perags need to be grabbed before reading the AG header buffers - some of these will need to be driven to higher layers to ensure we can run operations on AGs without getting stuck part way through waiting on a perag reference. The latter section of this patchset moves some of the AG geometry information from the xfs_mount to the xfs_perag, and starts converting code that requires geometry validation to use a perag instead of a mount and having to extract the AGNO from the object location. This also allows us to store the AG size in the perag and then we can stop having to compare the agno against sb_agcount to determine if the AG is the last AG and so has a runt size. This greatly simplifies some of the type validity checking we do and substantially reduces the CPU overhead of type validity checking. It also cuts over 1.2kB out of the binary size. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: make is_log_ag() a first class helper xfs: replace xfs_ag_block_count() with perag accesses xfs: Pre-calculate per-AG agino geometry xfs: Pre-calculate per-AG agbno geometry xfs: pass perag to xfs_alloc_read_agfl xfs: pass perag to xfs_alloc_put_freelist xfs: pass perag to xfs_alloc_get_freelist xfs: pass perag to xfs_read_agf xfs: pass perag to xfs_read_agi xfs: pass perag to xfs_alloc_read_agf() xfs: kill xfs_alloc_pagf_init() xfs: pass perag to xfs_ialloc_read_agi() xfs: kill xfs_ialloc_pagi_init() xfs: make last AG grow/shrink perag centric
-
Darrick J. Wong authored
Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA xfs: improve CIL scalability This series aims to improve the scalability of XFS transaction commits on large CPU count machines. My 32p machine hits contention limits in xlog_cil_commit() at about 700,000 transaction commits a section. It hits this at 16 thread workloads, and 32 thread workloads go no faster and just burn CPU on the CIL spinlocks. This patchset gets rid of spinlocks and global serialisation points in the xlog_cil_commit() path. It does this by moving to a combination of per-cpu counters, unordered per-cpu lists and post-ordered per-cpu lists. This results in transaction commit rates exceeding 1.4 million commits/s under unlink certain workloads, and while the log lock contention is largely gone there is still significant lock contention in the VFS (dentry cache, inode cache and security layers) at >600,000 transactions/s that still limit scalability. The changes to the CIL accounting and behaviour, combined with the structural changes to xlog_write() in prior patchsets make the per-cpu restructuring possible and sane. This allows us to move to precalculated reservation requirements that allow for reservation stealing to be accounted across multiple CPUs accurately. That is, instead of trying to account for continuation log opheaders on a "growth" basis, we pre-calculate how many iclogs we'll need to write out a maximally sized CIL checkpoint and steal that reserveD that space one commit at a time until the CIL has a full reservation. If we ever run a commit when we are already at the hard limit (because post-throttling) we simply take an extra reservation from each commit that is run when over the limit. Hence we don't need to do space usage math in the fast path and so never need to sum the per-cpu counters in this fast path. Similarly, per-cpu lists have the problem of ordering - we can't remove an item from a per-cpu list if we want to move it forward in the CIL. We solve this problem by using an atomic counter to give every commit a sequence number that is copied into the log items in that transaction. Hence relogging items just overwrites the sequence number in the log item, and does not move it in the per-cpu lists. Once we reaggregate the per-cpu lists back into a single list in the CIL push work, we can run it through list-sort() and reorder it back into a globally ordered list. This costs a bit of CPU time, but now that the CIL can run multiple works and pipelines properly, this is not a limiting factor for performance. It does increase fsync latency when the CIL is full, but workloads issuing large numbers of fsync()s or sync transactions end up with very small CILs and so the latency impact or sorting is not measurable for such workloads. OVerall, this pushes the transaction commit bottleneck out to the lockless reservation grant head updates. These atomic updates don't start to be a limiting fact until > 1.5 million transactions/s are being run, at which point the accounting functions start to show up in profiles as the highest CPU users. Still, this series doubles transaction throughput without increasing CPU usage before we get to that cacheline contention breakdown point... ` Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: expanding delayed logging design with background material xfs: xlog_sync() manually adjusts grant head space xfs: avoid cil push lock if possible xfs: move CIL ordering to the logvec chain xfs: convert log vector chain to use list heads xfs: convert CIL to unordered per cpu lists xfs: Add order IDs to log items in CIL xfs: convert CIL busy extents to per-cpu xfs: track CIL ticket reservation in percpu structure xfs: implement percpu cil space used calculation xfs: introduce per-cpu CIL tracking structure xfs: rework per-iclog header CIL reservation xfs: lift init CIL reservation out of xc_cil_lock xfs: use the CIL space used counter for emptiness checks
-
- 07 Jul, 2022 19 commits
-
-
Dave Chinner authored
Make it consistent with the other buffer APIs to return a error and the buffer is placed in a parameter. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
We check if an ag contains the log in many places, so make this a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and renaming it xfs_ag_contains_log(). The convert all the places that check if the AG contains the log to use this helper. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Many of the places that call xfs_ag_block_count() have a perag available. These places can just read pag->block_count directly instead of calculating the AG block count from first principles. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
There is a lot of overhead in functions like xfs_verify_agino() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agino(), we now always have a perag context handy, so we can store the minimum and maximum agino values in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. xfs_verify_agino_or_null() gets the same perag treatment. xfs_agino_range() is moved to xfs_ag.c as it's not really a type function, and it's use is largely restricted as the first and last aginos can be grabbed straight from the perag in most cases. Note that we leave the original xfs_verify_agino in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agino() to indicate it takes both an agno and an agino to differentiate it from new function. $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1482185 329588 572 1812345 1ba779 (TOTALS) after 1481937 329588 572 1812097 1ba681 (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
There is a lot of overhead in functions like xfs_verify_agbno() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agbno(), we now always have a perag context handy, so we can store the AG length and the minimum valid block in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. Move xfs_ag_block_count() to xfs_ag.c because it's really a per-ag function and not an XFS type function. We need a little bit of rework that is specific to xfs_initialise_perag() to allow growfs to calculate the new perag sizes before we've updated the primary superblock during the grow (chicken/egg situation). Note that we leave the original xfs_verify_agbno in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agbno() to indicate it takes both an agno and an agbno to differentiate it from new function. Future commits will make similar changes for other per-ag geometry validation functions. Further: $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1483006 329588 572 1813166 1baaae (TOTALS) after 1482185 329588 572 1812345 1ba779 (TOTALS) This rework reduces the binary size by ~820 bytes, indicating that much less work is being done to bounds check the agbno values against on per-ag geometry information. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
We have the perag in most places we call xfs_alloc_read_agfl, so pass the perag instead of a mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
It's available in all callers, so pass it in so that the perag can be passed further down the stack. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
It's available in all callers, so pass it in so that the perag can be passed further down the stack. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
We have the perag in most places we call xfs_read_agf, so pass the perag instead of a mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
We have the perag in most palces we call xfs_read_agi, so pass the perag instead of a mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
xfs_alloc_read_agf() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Whilst modifying the xfs_reflink_find_shared() function definition, declare it static and remove the extern declaration as it is an internal function only these days. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced by passing a NULL agfbp to xfs_alloc_read_agf(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
xfs_ialloc_read_agi() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
This is just a basic wrapper around xfs_ialloc_read_agi(), which can be entirely handled by xfs_ialloc_read_agi() by passing a NULL agibpp.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Because the perag must exist for these operations, look it up as part of the common shrink operations and pass it instead of the mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
I wrote up a description of how transactions, space reservations and relogging work together in response to a question for background material on the delayed logging design. Add this to the existing document for ease of future reference. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket. The grant heads are hot, so we really should be accounting this to the ticket is we can, rather than adding thousands of extra grant head updates every CIL commit. Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice. That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Because now it hurts when the CIL fills up. - 37.20% __xfs_trans_commit - 35.84% xfs_log_commit_cil - 19.34% _raw_spin_lock - do_raw_spin_lock 19.01% __pv_queued_spin_lock_slowpath - 4.20% xfs_log_ticket_ungrant 0.90% xfs_log_space_wake Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Adding a list_sort() call to the CIL push work while the xc_ctx_lock is held exclusively has resulted in fairly long lock hold times and that stops all front end transaction commits from making progress. We can move the sorting out of the xc_ctx_lock if we can transfer the ordering information to the log vectors as they are detached from the log items and then we can sort the log vectors. With these changes, we can move the list_sort() call to just before we call xlog_write() when we aren't holding any locks at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-