Commit ed3b4d6c authored by Dave Chinner's avatar Dave Chinner Committed by Alex Elder

xfs: Improve scalability of busy extent tracking

When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.
Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
Signed-off-by: default avatarAlex Elder <aelder@sgi.com>
parent 955833cf
......@@ -37,6 +37,7 @@
#include "xfs_sb.h"
#include "xfs_inum.h"
#include "xfs_log.h"
#include "xfs_ag.h"
#include "xfs_dmapi.h"
#include "xfs_mount.h"
......@@ -850,6 +851,12 @@ xfs_buf_lock_value(
* Note that this in no way locks the underlying pages, so it is only
* useful for synchronizing concurrent use of buffer objects, not for
* synchronizing independent access to the underlying pages.
*
* If we come across a stale, pinned, locked buffer, we know that we
* are being asked to lock a buffer that has been reallocated. Because
* it is pinned, we know that the log has not been pushed to disk and
* hence it will still be locked. Rather than sleeping until someone
* else pushes the log, push it ourselves before trying to get the lock.
*/
void
xfs_buf_lock(
......@@ -857,6 +864,8 @@ xfs_buf_lock(
{
trace_xfs_buf_lock(bp, _RET_IP_);
if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
xfs_log_force(bp->b_mount, 0);
if (atomic_read(&bp->b_io_remaining))
blk_run_address_space(bp->b_target->bt_mapping);
down(&bp->b_sema);
......
......@@ -19,6 +19,7 @@
#include "xfs_dmapi.h"
#include "xfs_sb.h"
#include "xfs_inum.h"
#include "xfs_log.h"
#include "xfs_ag.h"
#include "xfs_mount.h"
#include "xfs_quota.h"
......
......@@ -1059,83 +1059,112 @@ TRACE_EVENT(xfs_bunmap,
);
#define XFS_BUSY_SYNC \
{ 0, "async" }, \
{ 1, "sync" }
TRACE_EVENT(xfs_alloc_busy,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
xfs_extlen_t len, int slot),
TP_ARGS(mp, agno, agbno, len, slot),
TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno,
xfs_agblock_t agbno, xfs_extlen_t len, int sync),
TP_ARGS(trans, agno, agbno, len, sync),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(struct xfs_trans *, tp)
__field(int, tid)
__field(xfs_agnumber_t, agno)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
__field(int, slot)
__field(int, sync)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->dev = trans->t_mountp->m_super->s_dev;
__entry->tp = trans;
__entry->tid = trans->t_ticket->t_tid;
__entry->agno = agno;
__entry->agbno = agbno;
__entry->len = len;
__entry->slot = slot;
__entry->sync = sync;
),
TP_printk("dev %d:%d agno %u agbno %u len %u slot %d",
TP_printk("dev %d:%d trans 0x%p tid 0x%x agno %u agbno %u len %u %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->tp,
__entry->tid,
__entry->agno,
__entry->agbno,
__entry->len,
__entry->slot)
__print_symbolic(__entry->sync, XFS_BUSY_SYNC))
);
#define XFS_BUSY_STATES \
{ 0, "found" }, \
{ 1, "missing" }
TRACE_EVENT(xfs_alloc_unbusy,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
int slot, int found),
TP_ARGS(mp, agno, slot, found),
xfs_agblock_t agbno, xfs_extlen_t len),
TP_ARGS(mp, agno, agbno, len),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_agnumber_t, agno)
__field(int, slot)
__field(int, found)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->agno = agno;
__entry->slot = slot;
__entry->found = found;
__entry->agbno = agbno;
__entry->len = len;
),
TP_printk("dev %d:%d agno %u slot %d %s",
TP_printk("dev %d:%d agno %u agbno %u len %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->agno,
__entry->slot,
__print_symbolic(__entry->found, XFS_BUSY_STATES))
__entry->agbno,
__entry->len)
);
#define XFS_BUSY_STATES \
{ 0, "missing" }, \
{ 1, "found" }
TRACE_EVENT(xfs_alloc_busysearch,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
xfs_extlen_t len, xfs_lsn_t lsn),
TP_ARGS(mp, agno, agbno, len, lsn),
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
xfs_agblock_t agbno, xfs_extlen_t len, int found),
TP_ARGS(mp, agno, agbno, len, found),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_agnumber_t, agno)
__field(xfs_agblock_t, agbno)
__field(xfs_extlen_t, len)
__field(xfs_lsn_t, lsn)
__field(int, found)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->agno = agno;
__entry->agbno = agbno;
__entry->len = len;
__entry->lsn = lsn;
__entry->found = found;
),
TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx",
TP_printk("dev %d:%d agno %u agbno %u len %u %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->agno,
__entry->agbno,
__entry->len,
__print_symbolic(__entry->found, XFS_BUSY_STATES))
);
TRACE_EVENT(xfs_trans_commit_lsn,
TP_PROTO(struct xfs_trans *trans),
TP_ARGS(trans),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(struct xfs_trans *, tp)
__field(xfs_lsn_t, lsn)
),
TP_fast_assign(
__entry->dev = trans->t_mountp->m_super->s_dev;
__entry->tp = trans;
__entry->lsn = trans->t_commit_lsn;
),
TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->tp,
__entry->lsn)
);
......
......@@ -175,14 +175,20 @@ typedef struct xfs_agfl {
} xfs_agfl_t;
/*
* Busy block/extent entry. Used in perag to mark blocks that have been freed
* but whose transactions aren't committed to disk yet.
* Busy block/extent entry. Indexed by a rbtree in perag to mark blocks that
* have been freed but whose transactions aren't committed to disk yet.
*
* Note that we use the transaction ID to record the transaction, not the
* transaction structure itself. See xfs_alloc_busy_insert() for details.
*/
typedef struct xfs_perag_busy {
xfs_agblock_t busy_start;
xfs_extlen_t busy_length;
struct xfs_trans *busy_tp; /* transaction that did the free */
} xfs_perag_busy_t;
struct xfs_busy_extent {
struct rb_node rb_node; /* ag by-bno indexed search tree */
struct list_head list; /* transaction busy extent list */
xfs_agnumber_t agno;
xfs_agblock_t bno;
xfs_extlen_t length;
xlog_tid_t tid; /* transaction that created this */
};
/*
* Per-ag incore structure, copies of information in agf and agi,
......@@ -216,7 +222,8 @@ typedef struct xfs_perag {
xfs_agino_t pagl_leftrec;
xfs_agino_t pagl_rightrec;
#ifdef __KERNEL__
spinlock_t pagb_lock; /* lock for pagb_list */
spinlock_t pagb_lock; /* lock for pagb_tree */
struct rb_root pagb_tree; /* ordered tree of busy extents */
atomic_t pagf_fstrms; /* # of filestreams active in this AG */
......@@ -226,7 +233,6 @@ typedef struct xfs_perag {
int pag_ici_reclaimable; /* reclaimable inodes */
#endif
int pagb_count; /* pagb slots in use */
xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS]; /* unstable blocks */
} xfs_perag_t;
/*
......
This diff is collapsed.
......@@ -22,6 +22,7 @@ struct xfs_buf;
struct xfs_mount;
struct xfs_perag;
struct xfs_trans;
struct xfs_busy_extent;
/*
* Freespace allocation types. Argument to xfs_alloc_[v]extent.
......@@ -119,15 +120,13 @@ xfs_alloc_longest_free_extent(struct xfs_mount *mp,
#ifdef __KERNEL__
void
xfs_alloc_mark_busy(xfs_trans_t *tp,
xfs_alloc_busy_insert(xfs_trans_t *tp,
xfs_agnumber_t agno,
xfs_agblock_t bno,
xfs_extlen_t len);
void
xfs_alloc_clear_busy(xfs_trans_t *tp,
xfs_agnumber_t ag,
int idx);
xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp);
#endif /* __KERNEL__ */
......
......@@ -134,7 +134,7 @@ xfs_allocbt_free_block(
* disk. If a busy block is allocated, the iclog is pushed up to the
* LSN that freed the block.
*/
xfs_alloc_mark_busy(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1);
xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1);
xfs_trans_agbtree_delta(cur->bc_tp, -1);
return 0;
}
......
......@@ -44,6 +44,7 @@
#include "xfs_trans_priv.h"
#include "xfs_trans_space.h"
#include "xfs_inode_item.h"
#include "xfs_trace.h"
kmem_zone_t *xfs_trans_zone;
......@@ -243,9 +244,8 @@ _xfs_trans_alloc(
tp->t_type = type;
tp->t_mountp = mp;
tp->t_items_free = XFS_LIC_NUM_SLOTS;
tp->t_busy_free = XFS_LBC_NUM_SLOTS;
xfs_lic_init(&(tp->t_items));
XFS_LBC_INIT(&(tp->t_busy));
INIT_LIST_HEAD(&tp->t_busy);
return tp;
}
......@@ -255,8 +255,13 @@ _xfs_trans_alloc(
*/
STATIC void
xfs_trans_free(
xfs_trans_t *tp)
struct xfs_trans *tp)
{
struct xfs_busy_extent *busyp, *n;
list_for_each_entry_safe(busyp, n, &tp->t_busy, list)
xfs_alloc_busy_clear(tp->t_mountp, busyp);
atomic_dec(&tp->t_mountp->m_active_trans);
xfs_trans_free_dqinfo(tp);
kmem_zone_free(xfs_trans_zone, tp);
......@@ -285,9 +290,8 @@ xfs_trans_dup(
ntp->t_type = tp->t_type;
ntp->t_mountp = tp->t_mountp;
ntp->t_items_free = XFS_LIC_NUM_SLOTS;
ntp->t_busy_free = XFS_LBC_NUM_SLOTS;
xfs_lic_init(&(ntp->t_items));
XFS_LBC_INIT(&(ntp->t_busy));
INIT_LIST_HEAD(&ntp->t_busy);
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
ASSERT(tp->t_ticket != NULL);
......@@ -423,7 +427,6 @@ xfs_trans_reserve(
return error;
}
/*
* Record the indicated change to the given field for application
* to the file system's superblock when the transaction commits.
......@@ -930,26 +933,6 @@ xfs_trans_item_committed(
IOP_UNPIN(lip);
}
/* Clear all the per-AG busy list items listed in this transaction */
static void
xfs_trans_clear_busy_extents(
struct xfs_trans *tp)
{
xfs_log_busy_chunk_t *lbcp;
xfs_log_busy_slot_t *lbsp;
int i;
for (lbcp = &tp->t_busy; lbcp != NULL; lbcp = lbcp->lbc_next) {
i = 0;
for (lbsp = lbcp->lbc_busy; i < lbcp->lbc_unused; i++, lbsp++) {
if (XFS_LBC_ISFREE(lbcp, i))
continue;
xfs_alloc_clear_busy(tp, lbsp->lbc_ag, lbsp->lbc_idx);
}
}
xfs_trans_free_busy(tp);
}
/*
* This is typically called by the LM when a transaction has been fully
* committed to disk. It needs to unpin the items which have
......@@ -984,7 +967,6 @@ xfs_trans_committed(
kmem_free(licp);
}
xfs_trans_clear_busy_extents(tp);
xfs_trans_free(tp);
}
......@@ -1013,7 +995,6 @@ xfs_trans_uncommit(
xfs_trans_unreserve_and_mod_dquots(tp);
xfs_trans_free_items(tp, flags);
xfs_trans_free_busy(tp);
xfs_trans_free(tp);
}
......@@ -1075,6 +1056,8 @@ xfs_trans_commit_iclog(
*commit_lsn = xfs_log_done(mp, tp->t_ticket, &commit_iclog, log_flags);
tp->t_commit_lsn = *commit_lsn;
trace_xfs_trans_commit_lsn(tp);
if (nvec > XFS_TRANS_LOGVEC_COUNT)
kmem_free(log_vector);
......@@ -1260,7 +1243,6 @@ _xfs_trans_commit(
}
current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0);
xfs_trans_free_busy(tp);
xfs_trans_free(tp);
XFS_STATS_INC(xs_trans_empty);
......@@ -1339,7 +1321,6 @@ xfs_trans_cancel(
current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
xfs_trans_free_items(tp, flags);
xfs_trans_free_busy(tp);
xfs_trans_free(tp);
}
......
......@@ -813,6 +813,7 @@ struct xfs_log_item_desc;
struct xfs_mount;
struct xfs_trans;
struct xfs_dquot_acct;
struct xfs_busy_extent;
typedef struct xfs_log_item {
struct list_head li_ail; /* AIL pointers */
......@@ -871,34 +872,6 @@ typedef struct xfs_item_ops {
#define XFS_ITEM_LOCKED 2
#define XFS_ITEM_PUSHBUF 3
/*
* This structure is used to maintain a list of block ranges that have been
* freed in the transaction. The ranges are listed in the perag[] busy list
* between when they're freed and the transaction is committed to disk.
*/
typedef struct xfs_log_busy_slot {
xfs_agnumber_t lbc_ag;
ushort lbc_idx; /* index in perag.busy[] */
} xfs_log_busy_slot_t;
#define XFS_LBC_NUM_SLOTS 31
typedef struct xfs_log_busy_chunk {
struct xfs_log_busy_chunk *lbc_next;
uint lbc_free; /* free slots bitmask */
ushort lbc_unused; /* first unused */
xfs_log_busy_slot_t lbc_busy[XFS_LBC_NUM_SLOTS];
} xfs_log_busy_chunk_t;
#define XFS_LBC_MAX_SLOT (XFS_LBC_NUM_SLOTS - 1)
#define XFS_LBC_FREEMASK ((1U << XFS_LBC_NUM_SLOTS) - 1)
#define XFS_LBC_INIT(cp) ((cp)->lbc_free = XFS_LBC_FREEMASK)
#define XFS_LBC_CLAIM(cp, slot) ((cp)->lbc_free &= ~(1 << (slot)))
#define XFS_LBC_SLOT(cp, slot) (&((cp)->lbc_busy[(slot)]))
#define XFS_LBC_VACANCY(cp) (((cp)->lbc_free) & XFS_LBC_FREEMASK)
#define XFS_LBC_ISFREE(cp, slot) ((cp)->lbc_free & (1 << (slot)))
/*
* This is the type of function which can be given to xfs_trans_callback()
* to be called upon the transaction's commit to disk.
......@@ -950,8 +923,7 @@ typedef struct xfs_trans {
unsigned int t_items_free; /* log item descs free */
xfs_log_item_chunk_t t_items; /* first log item desc chunk */
xfs_trans_header_t t_header; /* header for in-log trans */
unsigned int t_busy_free; /* busy descs free */
xfs_log_busy_chunk_t t_busy; /* busy/async free blocks */
struct list_head t_busy; /* list of busy extents */
unsigned long t_pflags; /* saved process flags state */
} xfs_trans_t;
......@@ -1025,9 +997,6 @@ int _xfs_trans_commit(xfs_trans_t *,
void xfs_trans_cancel(xfs_trans_t *, int);
int xfs_trans_ail_init(struct xfs_mount *);
void xfs_trans_ail_destroy(struct xfs_mount *);
xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp,
xfs_agnumber_t ag,
xfs_extlen_t idx);
extern kmem_zone_t *xfs_trans_zone;
......
......@@ -438,112 +438,3 @@ xfs_trans_unlock_chunk(
return freed;
}
/*
* This is called to add the given busy item to the transaction's
* list of busy items. It must find a free busy item descriptor
* or allocate a new one and add the item to that descriptor.
* The function returns a pointer to busy descriptor used to point
* to the new busy entry. The log busy entry will now point to its new
* descriptor with its ???? field.
*/
xfs_log_busy_slot_t *
xfs_trans_add_busy(xfs_trans_t *tp, xfs_agnumber_t ag, xfs_extlen_t idx)
{
xfs_log_busy_chunk_t *lbcp;
xfs_log_busy_slot_t *lbsp;
int i=0;
/*
* If there are no free descriptors, allocate a new chunk
* of them and put it at the front of the chunk list.
*/
if (tp->t_busy_free == 0) {
lbcp = (xfs_log_busy_chunk_t*)
kmem_alloc(sizeof(xfs_log_busy_chunk_t), KM_SLEEP);
ASSERT(lbcp != NULL);
/*
* Initialize the chunk, and then
* claim the first slot in the newly allocated chunk.
*/
XFS_LBC_INIT(lbcp);
XFS_LBC_CLAIM(lbcp, 0);
lbcp->lbc_unused = 1;
lbsp = XFS_LBC_SLOT(lbcp, 0);
/*
* Link in the new chunk and update the free count.
*/
lbcp->lbc_next = tp->t_busy.lbc_next;
tp->t_busy.lbc_next = lbcp;
tp->t_busy_free = XFS_LIC_NUM_SLOTS - 1;
/*
* Initialize the descriptor and the generic portion
* of the log item.
*
* Point the new slot at this item and return it.
* Also point the log item at its currently active
* descriptor and set the item's mount pointer.
*/
lbsp->lbc_ag = ag;
lbsp->lbc_idx = idx;
return lbsp;
}
/*
* Find the free descriptor. It is somewhere in the chunklist
* of descriptors.
*/
lbcp = &tp->t_busy;
while (lbcp != NULL) {
if (XFS_LBC_VACANCY(lbcp)) {
if (lbcp->lbc_unused <= XFS_LBC_MAX_SLOT) {
i = lbcp->lbc_unused;
break;
} else {
/* out-of-order vacancy */
cmn_err(CE_DEBUG, "OOO vacancy lbcp 0x%p\n", lbcp);
ASSERT(0);
}
}
lbcp = lbcp->lbc_next;
}
ASSERT(lbcp != NULL);
/*
* If we find a free descriptor, claim it,
* initialize it, and return it.
*/
XFS_LBC_CLAIM(lbcp, i);
if (lbcp->lbc_unused <= i) {
lbcp->lbc_unused = i + 1;
}
lbsp = XFS_LBC_SLOT(lbcp, i);
tp->t_busy_free--;
lbsp->lbc_ag = ag;
lbsp->lbc_idx = idx;
return lbsp;
}
/*
* xfs_trans_free_busy
* Free all of the busy lists from a transaction
*/
void
xfs_trans_free_busy(xfs_trans_t *tp)
{
xfs_log_busy_chunk_t *lbcp;
xfs_log_busy_chunk_t *lbcq;
lbcp = tp->t_busy.lbc_next;
while (lbcp != NULL) {
lbcq = lbcp->lbc_next;
kmem_free(lbcp);
lbcp = lbcq;
}
XFS_LBC_INIT(&tp->t_busy);
tp->t_busy.lbc_unused = 0;
}
......@@ -38,10 +38,6 @@ struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *,
void xfs_trans_free_items(struct xfs_trans *, int);
void xfs_trans_unlock_items(struct xfs_trans *,
xfs_lsn_t);
void xfs_trans_free_busy(xfs_trans_t *tp);
xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp,
xfs_agnumber_t ag,
xfs_extlen_t idx);
/*
* AIL traversal cursor.
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment