    [PATCH] writeback correctness and efficiency changes · ec12ac49
    Andrew Morton authored
    This is a performance and correctness fix against the writeback paths.
    
    The writeback code has competing requirements.  Sometimes it is used
    for "memory cleansing": kupdate, bdflush, writer throttling, page
    allocator writeback, etc.  And sometimes this same code is used for
    data integrity purposes: fsync, msync, fdatasync, sync, umount, various
    other kernel-internal uses.
    
    The problem is: how to handle a dirty buffer or page which is currently
    under writeback.
    
    For memory cleansing, we just want to skip that buffer/page and go on to
    the next one.  But for sync, we must wait on the old writeback and then
    start new writeback.
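
The two policies above can be sketched as a small decision function. This is a minimal userspace model, not kernel code: `toy_page`, `wb_mode`, `wb_action` and `wb_decide()` are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; none of this is kernel API. */
enum wb_mode   { WB_CLEANSING, WB_SYNC };
enum wb_action { WB_SKIP, WB_START_IO, WB_WAIT_THEN_START_IO };

/* Toy stand-in for a dirty buffer or page. */
struct toy_page {
    bool dirty;
    bool under_writeback;
};

/* Decide what a writeback pass should do with one page. */
static enum wb_action wb_decide(const struct toy_page *page, enum wb_mode mode)
{
    if (!page->dirty)
        return WB_SKIP;                 /* nothing to write */
    if (!page->under_writeback)
        return WB_START_IO;             /* no collision: just write it */
    /* Dirty and already under writeback: the two modes disagree here. */
    if (mode == WB_SYNC)
        return WB_WAIT_THEN_START_IO;   /* integrity: wait, then write again */
    return WB_SKIP;                     /* cleansing: move to the next page */
}
```

Only the last branch depends on the mode, which is why the change described here is small.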
    
    mpage_writepages() is currently correct for cleansing, but incorrect for
    sync.  block_write_full_page() is currently correct for sync, but
    inefficient for cleansing.
    
    The fix is fairly simple.
    
    - In mpage_writepages(), don't skip the page if it's a sync
    operation.
    
    - In block_write_full_page(), skip the buffer if it is not a sync
    operation.  And return -EAGAIN to tell the caller that the writeout
    didn't work out.  The caller must then set the page dirty again and
    move it onto mapping->dirty_pages.
    
    This is an extension of the writepage API: writepage can now return
    -EAGAIN.  There are only three callers, and they have been updated.
    
    fail_writepage() and ext3_writepage() were actually doing this by
    hand.  They have been changed to return -EAGAIN.  NTFS will want to
    be able to return -EAGAIN from its writepage as well.
    
    - A sticky question is: how to tell the writeout code which mode it
    is operating in?  Cleansing or sync?
    
    It's such a tiny code change that I didn't have the heart to go and
    propagate a `mode' argument down every instance of writepages() and
    writepage() in the kernel.  So I passed it in via current->flags.
    
    Incidentally, the occurrence of a locked-and-dirty buffer in
    block_write_full_page() is fairly rare: normally the collision avoidance
    happens at the address_space level, via PageWriteback.  But some
    mappings (blockdevs, ext3 files, etc) have their dirty buffers written
    out via submit_bh().  It is these buffers which can stall
    block_write_full_page().
    
    This wart will be pretty intrusive to fix.  ext3 needs to become fully
    page-based (ugh.  It's a block-based journalling filesystem, and pages
    are unnatural).  blockdev mappings are still written out by buffers
    because that's how filesystems use them.  Putting _all_ metadata
    (indirects, inodes, superblocks, etc) into standalone address_spaces
    would fix that up.
    
    - filemap_fdatawrite() sets PF_SYNC.  So filemap_fdatawrite() is the
    kernel function which will start writeback against a mapping for
    "data integrity" purposes, whereas the unexported, internal-only
    do_writepages() is the writeback function which is used for memory
    cleansing.  This difference is the reason why I didn't consolidate
    those functions ages ago...
    
    - Lots of code paths had a bogus extra call to filemap_fdatawait(),
    which I previously added in a moment of weak-headedness.  They have
    all been removed.
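
The -EAGAIN contract and the flag-based mode passing described above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the patch itself: every `toy_*` name and `TOY_PF_SYNC` are hypothetical stand-ins (for `current`, PF_SYNC, writepage, a writepage caller and filemap_fdatawrite()), and "waiting" on a locked buffer is modelled by simply clearing a boolean.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PF_SYNC bit in current->flags. */
#define TOY_PF_SYNC 0x1u

struct toy_task { unsigned int flags; };

struct toy_page {
    bool dirty;
    bool on_dirty_list;   /* models membership of mapping->dirty_pages */
    bool buffer_locked;   /* models a buffer already under writeout */
};

/* Hypothetical stand-in for `current`. */
static struct toy_task toy_current;

/* Models a writepage implementation that may meet a busy buffer. */
static int toy_writepage(struct toy_page *page)
{
    if (page->buffer_locked) {
        if (!(toy_current.flags & TOY_PF_SYNC))
            return -EAGAIN;           /* cleansing: don't stall, report it */
        page->buffer_locked = false;  /* sync: models waiting for old I/O */
    }
    return 0;                         /* models new writeout being started */
}

/* Models a writepage caller: the page is cleaned before the call and
 * must be redirtied and requeued if writepage returns -EAGAIN. */
static int toy_write_one(struct toy_page *page)
{
    int ret;

    page->dirty = false;
    page->on_dirty_list = false;
    ret = toy_writepage(page);
    if (ret == -EAGAIN) {
        page->dirty = true;
        page->on_dirty_list = true;
    }
    return ret;
}

/* Models the data-integrity entry point: set the flag for the duration
 * of the writeout, then restore the previous flags. */
static int toy_fdatawrite(struct toy_page *page)
{
    unsigned int saved = toy_current.flags;
    int ret;

    toy_current.flags |= TOY_PF_SYNC;
    ret = toy_write_one(page);
    toy_current.flags = saved;
    return ret;
}
```

Calling toy_write_one() directly plays the role of a cleansing pass: a locked buffer leaves the page dirty and back on the list. Calling toy_fdatawrite() on the same page writes it out, because the flag tells the lower layer to wait instead of skipping.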