• Dave Chinner's avatar
    xfs: log mount failures don't wait for buffers to be released · 85bec546
    Dave Chinner authored
    Recently I've been seeing xfs/051 fail on 1k block size filesystems.
    Trying to trace the events during the test lead to the problem going
    away, indicating that it was a race condition that lead to this
    ASSERT failure:
    
    XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
    .....
    [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
    [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
    [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
    [<ffffffff811d8800>] mount_bdev+0x180/0x1b0
    [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
    [<ffffffff811d90a8>] mount_fs+0x38/0x170
    [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
    [<ffffffff811f7018>] do_mount+0x218/0xd60
    [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
    
    When I finally caught it with tracing enabled, I saw that AG 2 had
    an elevated reference count and a buffer was responsible for it. I
    tracked down the specific buffer, and found that it was missing the
    final reference count release that would put it back on the LRU and
    hence be found by xfs_wait_buftarg() calls in the log mount failure
    handling.
    
    The last four traces for the buffer before the assert were (trimmed
    for relevance)
    
    kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
    kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
    mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
    mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC
    
    This is an async write that is completing, so there's nobody waiting
    for it directly.  Hence we call xfs_buf_relse() once all the
    processing is complete. That does:
    
    static inline void xfs_buf_relse(xfs_buf_t *bp)
    {
    	xfs_buf_unlock(bp);
    	xfs_buf_rele(bp);
    }
    
    Now, it's clear that mount is waiting on the buffer lock, and that
    it has been released by xfs_buf_relse() and gained by mount. This is
    expected, because at this point the mount process is in
    xfs_buf_delwri_submit() waiting for all the IO it submitted to
    complete.
    
    The mount process, however, is waiting on the lock for the buffer
    because it is in xfs_buf_delwri_submit(). This waits for IO
    completion, but it doesn't wait for the buffer reference owned by
    the IO to go away. The mount process collects all the completions,
    fails the log recovery, and the higher level code then calls
    xfs_wait_buftarg() to free all the remaining buffers in the
    filesystem.
    
    The issue is that on unlocking the buffer, the scheduler has decided
    that the mount process has higher priority than the the kworker
    thread that is running the IO completion, and so immediately
    switched contexts to the mount process from the semaphore unlock
    code, hence preventing the kworker thread from finishing the IO
    completion and releasing the IO reference to the buffer.
    
    Hence by the time that xfs_wait_buftarg() is run, the buffer still
    has an active reference and so isn't on the LRU list that the
    function walks to free the remaining buffers. Hence we miss that
    buffer and continue onwards to tear down the mount structures,
    at which time we get find a stray reference count on the perag
    structure. On a non-debug kernel, this will be ignored and the
    structure torn down and freed. Hence when the kworker thread is then
    rescheduled and the buffer released and freed, it will access a
    freed perag structure.
    
    The problem here is that when the log mount fails, we still need to
    quiesce the log to ensure that the IO workqueues have returned to
    idle before we run xfs_wait_buftarg(). By synchronising the
    workqueues, we ensure that all IO completions are fully processed,
    not just to the point where buffers have been unlocked. This ensures
    we don't end up in the situation above.
    
    cc: <stable@vger.kernel.org> # 3.18
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
    85bec546
xfs_buf.c 44 KB