Commit e339dd8d authored by Brian Foster's avatar Brian Foster Committed by Darrick J. Wong

xfs: use sync buffer I/O for sync delwri queue submission

If a delwri queue occurs of a buffer that sits on a delwri queue
wait list, the queue sets _XBF_DELWRI_Q without changing the state
of ->b_list. This occurs, for example, if another thread beats the
current delwri waiter thread to the buffer lock after I/O
completion. Once the waiter acquires the lock, it removes the buffer
from the wait list and leaves a buffer with _XBF_DELWRI_Q set but
not populated on a list. This results in a lost buffer submission
and in turn can result in assert failures due to _XBF_DELWRI_Q being
set on buffer reclaim or filesystem lockups if the buffer happens to
cover an item in the AIL.

This problem has been reproduced by repeated iterations of xfs/305
on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
dquot reclaim races with an xfsaild push of a separate dquot backed
by the same buffer such that the buffer sits on the reclaim wait
list at the time xfsaild attempts to queue it. Since the latter
dquot has been flush locked but the underlying buffer not submitted
for I/O, the dquot pins the AIL and causes the filesystem to
livelock.

This race is essentially made possible by the buffer lock cycle
involved with waiting on a synchronous delwri queue submission.
Close the race by using synchronous buffer I/O for respective delwri
queue submission. This means the buffer remains locked across the
I/O and so is inaccessible from other contexts while in the
intermediate wait list state. The sync buffer I/O wait mechanism is
factored into a helper such that sync delwri buffer submission and
serialization are batched operations.
Designed-by: default avatarDave Chinner <dchinner@redhat.com>
Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
parent eaebb515
...@@ -1531,6 +1531,20 @@ xfs_buf_submit( ...@@ -1531,6 +1531,20 @@ xfs_buf_submit(
xfs_buf_rele(bp); xfs_buf_rele(bp);
} }
/*
* Wait for I/O completion of a sync buffer and return the I/O error code.
*/
static int
xfs_buf_iowait(
struct xfs_buf *bp)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
/* /*
* Synchronous buffer IO submission path, read or write. * Synchronous buffer IO submission path, read or write.
*/ */
...@@ -1553,12 +1567,7 @@ xfs_buf_submit_wait( ...@@ -1553,12 +1567,7 @@ xfs_buf_submit_wait(
error = __xfs_buf_submit(bp); error = __xfs_buf_submit(bp);
if (error) if (error)
goto out; goto out;
error = xfs_buf_iowait(bp);
/* wait for completion before gathering the error from the buffer */
trace_xfs_buf_iowait(bp, _RET_IP_);
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
error = bp->b_error;
out: out:
/* /*
...@@ -1961,16 +1970,11 @@ xfs_buf_cmp( ...@@ -1961,16 +1970,11 @@ xfs_buf_cmp(
} }
/* /*
* submit buffers for write. * Submit buffers for write. If wait_list is specified, the buffers are
* * submitted using sync I/O and placed on the wait list such that the caller can
* When we have a large buffer list, we do not want to hold all the buffers * iowait each buffer. Otherwise async I/O is used and the buffers are released
* locked while we block on the request queue waiting for IO dispatch. To avoid * at I/O completion time. In either case, buffers remain locked until I/O
* this problem, we lock and submit buffers in groups of 50, thereby minimising * completes and the buffer is released from the queue.
* the lock hold times for lists which may contain thousands of objects.
*
* To do this, we sort the buffer list before we walk the list to lock and
* submit buffers, and we plug and unplug around each group of buffers we
* submit.
*/ */
static int static int
xfs_buf_delwri_submit_buffers( xfs_buf_delwri_submit_buffers(
...@@ -2012,21 +2016,22 @@ xfs_buf_delwri_submit_buffers( ...@@ -2012,21 +2016,22 @@ xfs_buf_delwri_submit_buffers(
trace_xfs_buf_delwri_split(bp, _RET_IP_); trace_xfs_buf_delwri_split(bp, _RET_IP_);
/* /*
* We do all IO submission async. This means if we need * If we have a wait list, each buffer (and associated delwri
* to wait for IO completion we need to take an extra * queue reference) transfers to it and is submitted
* reference so the buffer is still valid on the other * synchronously. Otherwise, drop the buffer from the delwri
* side. We need to move the buffer onto the io_list * queue and submit async.
* at this point so the caller can still access it.
*/ */
bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL); bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
bp->b_flags |= XBF_WRITE | XBF_ASYNC; bp->b_flags |= XBF_WRITE;
if (wait_list) { if (wait_list) {
xfs_buf_hold(bp); bp->b_flags &= ~XBF_ASYNC;
list_move_tail(&bp->b_list, wait_list); list_move_tail(&bp->b_list, wait_list);
} else __xfs_buf_submit(bp);
} else {
bp->b_flags |= XBF_ASYNC;
list_del_init(&bp->b_list); list_del_init(&bp->b_list);
xfs_buf_submit(bp);
xfs_buf_submit(bp); }
} }
blk_finish_plug(&plug); blk_finish_plug(&plug);
...@@ -2073,9 +2078,11 @@ xfs_buf_delwri_submit( ...@@ -2073,9 +2078,11 @@ xfs_buf_delwri_submit(
list_del_init(&bp->b_list); list_del_init(&bp->b_list);
/* locking the buffer will wait for async IO completion. */ /*
xfs_buf_lock(bp); * Wait on the locked buffer, check for errors and unlock and
error2 = bp->b_error; * release the delwri queue reference.
*/
error2 = xfs_buf_iowait(bp);
xfs_buf_relse(bp); xfs_buf_relse(bp);
if (!error) if (!error)
error = error2; error = error2;
...@@ -2121,23 +2128,18 @@ xfs_buf_delwri_pushbuf( ...@@ -2121,23 +2128,18 @@ xfs_buf_delwri_pushbuf(
/* /*
* Delwri submission clears the DELWRI_Q buffer flag and returns with * Delwri submission clears the DELWRI_Q buffer flag and returns with
* the buffer on the wait list with an associated reference. Rather than * the buffer on the wait list with the original reference. Rather than
* bounce the buffer from a local wait list back to the original list * bounce the buffer from a local wait list back to the original list
* after I/O completion, reuse the original list as the wait list. * after I/O completion, reuse the original list as the wait list.
*/ */
xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
/* /*
* The buffer is now under I/O and wait listed as during typical delwri * The buffer is now locked, under I/O and wait listed on the original
* submission. Lock the buffer to wait for I/O completion. Rather than * delwri queue. Wait for I/O completion, restore the DELWRI_Q flag and
* remove the buffer from the wait list and release the reference, we * return with the buffer unlocked and on the original queue.
* want to return with the buffer queued to the original list. The
* buffer already sits on the original list with a wait list reference,
* however. If we let the queue inherit that wait list reference, all we
* need to do is reset the DELWRI_Q flag.
*/ */
xfs_buf_lock(bp); error = xfs_buf_iowait(bp);
error = bp->b_error;
bp->b_flags |= _XBF_DELWRI_Q; bp->b_flags |= _XBF_DELWRI_Q;
xfs_buf_unlock(bp); xfs_buf_unlock(bp);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment