    mm, fs, dax: handle layout changes to pinned dax mappings · 5fac7408
    Dan Williams authored
    Background:
    
    get_user_pages() in the filesystem pins file-backed memory pages for
    access by devices performing dma. However, it only pins the memory pages,
    not the page-to-file offset association. If a file is truncated, the
    pages are mapped out of the file and dma may continue indefinitely into
    a page that is owned by a device driver. This breaks coherency of the
    file vs dma, but the assumption is that if userspace wants the
    file-space truncated it does not matter what data is inbound from the
    device, it is not relevant anymore. The only expectation is that dma can
    safely continue while the filesystem reallocates the block(s).
    
    Problem:
    
    This expectation that dma can safely continue while the filesystem
    changes the block map is broken by dax. With dax the target dma page
    *is* the filesystem block. The model of leaving the page pinned for dma,
    but truncating the file block out of the file, means that the filesystem
    is free to reallocate a block under active dma to another file and now
    the expected data-incoherency situation has turned into active
    data-corruption.
    
    Solution:
    
    Defer all filesystem operations (fallocate(), truncate()) on a dax mode
    file while any page/block in the file is under active dma. This solution
    assumes that dma is transient. Cases where dma operations are known to
    not be transient, like RDMA, have been explicitly disabled via
    commits like 5f1d43de "IB/core: disable memory registration of
    filesystem-dax vmas".
    
    The dax_layout_busy_page() routine is called by filesystems with a lock
    held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
    The process of looking up a busy page invalidates all mappings
    to trigger any subsequent get_user_pages() to block on i_mmap_lock.
    The filesystem continues to call dax_layout_busy_page() until it finally
    returns no more active pages. This approach assumes that the page
    pinning is transient; if that assumption is violated, the system would
    have likely hung from the uncompleted I/O.
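
    The retry loop described above can be sketched as a small userspace
    model. This is only an illustration, not the kernel code: every
    model_* name is hypothetical, pin_count stands in for the elevated
    page reference held during dma, and model_wait_for_unpin() stands in
    for sleeping until the device finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct model_page {
	int pin_count;		/* > 0 while a device may still dma into the page */
};

struct model_mapping {
	struct model_page *pages;
	size_t nr_pages;
	bool mappings_valid;	/* false once user mappings are torn down */
};

/*
 * Models the role of dax_layout_busy_page(): invalidate the mappings so
 * that new pinners must fault (and block on the lock the caller holds),
 * then report the first page that is still pinned, or NULL if none is.
 */
static struct model_page *model_layout_busy_page(struct model_mapping *m)
{
	m->mappings_valid = false;
	for (size_t i = 0; i < m->nr_pages; i++)
		if (m->pages[i].pin_count > 0)
			return &m->pages[i];
	return NULL;
}

/* Assumption: the real wait sleeps; the model just drops one pin. */
static void model_wait_for_unpin(struct model_page *page)
{
	assert(page->pin_count > 0);
	page->pin_count--;
}

/*
 * Filesystem side: keep asking for a busy page until none remains.
 * Only then is it safe to change the block map. Returns the number of
 * waits performed. The loop terminates only if pins are transient.
 */
static int model_break_layouts(struct model_mapping *m)
{
	struct model_page *busy;
	int waits = 0;

	while ((busy = model_layout_busy_page(m)) != NULL) {
		model_wait_for_unpin(busy);
		waits++;
	}
	return waits;
}
```

    A caller would run model_break_layouts() before truncating and go on
    to free blocks only after it returns. A pin that is never dropped
    makes the loop spin forever, which mirrors the hang described above.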
    
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
dax.c 47.1 KB