    iomap: buffered write failure should not truncate the page cache · f43dc4dc
    iomap_file_buffered_write_punch_delalloc() currently invalidates the
    page cache over the unused range of the delalloc extent that was
    allocated. While the write allocated the delalloc extent, it does
    not own it exclusively as the write does not hold any locks that
    prevent either writeback or mmap page faults from changing the state
    of either the page cache or the extent state backing this range.
    
    Whilst xfs_bmap_punch_delalloc_range() already handles races in
    extent conversion - it will only punch out delalloc extents and it
    ignores any other type of extent - the page cache truncate does not
    discriminate between data written by this write and data written by
    some other task.
    As a result, truncating the page cache can result in data corruption
    if the write races with mmap modifications to the file over the same
    range.
    
    generic/346 exercises this workload, and if we randomly fail writes
    (as will happen when iomap gets stale iomap detection later in the
    patchset), it will randomly corrupt the file data because it removes
    data written by mmap() in the same page as the write() that failed.
    
    Hence we do not want to punch out the page cache over the range of
    the extent we failed to write to - what we actually need to do is
    detect the ranges that have dirty data in cache over them and *not
    punch them out*.
    
    To do this, we have to walk the page cache over the range of the
    delalloc extent we want to remove. This is made complex by the fact
    that we have to handle partially up-to-date folios correctly; this
    can happen even when the filesystem block (FSB) size == PAGE_SIZE,
    because we now support multi-page folios in the page cache.
    
    Because we are only interested in discovering the edges of data
    ranges in the page cache (i.e. hole-data boundaries) we can make use
    of mapping_seek_hole_data() to find those transitions in the page
    cache. As we hold the invalidate_lock, we know that the boundaries
    are not going to change while we walk the range. This interface is
    also byte-based and is sub-page block aware, so we can find the data
    ranges in the cache based on byte offsets rather than page, folio or
    fs block sized chunks. This greatly simplifies the logic of finding
    dirty cached ranges in the page cache.
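
    As an illustration of this walk (a simplified sketch, not the exact
    code that landed; walk_delalloc_range(), scan_data_range() and the
    punch() callback are hypothetical stand-ins for the real helpers
    and the filesystem hook that removes delalloc blocks), the seek
    loop might look like this, with the caller assumed to hold the
    mapping's invalidate_lock:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Per-folio dirty scan over a cached data range, sketched below. */
    static int scan_data_range(struct inode *inode, loff_t *punch_start_byte,
                    loff_t start_byte, loff_t end_byte,
                    int (*punch)(struct inode *inode, loff_t offset,
                                    loff_t len));

    static int walk_delalloc_range(struct inode *inode, loff_t start_byte,
                    loff_t end_byte,
                    int (*punch)(struct inode *inode, loff_t offset,
                                    loff_t len))
    {
            loff_t punch_start_byte = start_byte;
            int error = 0;

            while (start_byte < end_byte) {
                    loff_t data_end;

                    /* Seek to the start of the next cached data range. */
                    start_byte = mapping_seek_hole_data(inode->i_mapping,
                                    start_byte, end_byte, SEEK_DATA);
                    /* No more data - the rest of the range is a hole. */
                    if (start_byte == -ENXIO || start_byte == end_byte)
                            break;
                    if (start_byte < 0)
                            return start_byte;

                    /* Seek to the hole that terminates this data range. */
                    data_end = mapping_seek_hole_data(inode->i_mapping,
                                    start_byte, end_byte, SEEK_HOLE);

                    /* Keep dirty sub-ranges, punch clean ones in between. */
                    error = scan_data_range(inode, &punch_start_byte,
                                    start_byte, data_end, punch);
                    if (error)
                            return error;
                    start_byte = data_end;
            }

            /* Punch out whatever is left after the last dirty range. */
            if (punch_start_byte < end_byte)
                    error = punch(inode, punch_start_byte,
                                    end_byte - punch_start_byte);
            return error;
    }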
    
    Once we've identified a range that contains cached data, we can then
    iterate the range folio by folio. This allows us to determine if the
    data is dirty and hence perform the correct delalloc extent punching
    operations. The seek interface we use to iterate data ranges will
    give us sub-folio start/end granularity, so we may end up looking up
    the same folio multiple times as the seek interface iterates across
    each discontiguous data region in the folio.
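
    A matching sketch of that per-folio scan (again illustrative, and
    assuming the filemap_lock_folio() behaviour of the time, where a
    cache miss returns NULL) might be:

    static int scan_data_range(struct inode *inode, loff_t *punch_start_byte,
                    loff_t start_byte, loff_t end_byte,
                    int (*punch)(struct inode *inode, loff_t offset,
                                    loff_t len))
    {
            while (start_byte < end_byte) {
                    struct folio *folio;

                    /* Grab and lock the folio backing this offset, if any. */
                    folio = filemap_lock_folio(inode->i_mapping,
                                    start_byte >> PAGE_SHIFT);
                    if (!folio) {
                            /* Nothing cached here - advance a page. */
                            start_byte = ALIGN_DOWN(start_byte, PAGE_SIZE) +
                                            PAGE_SIZE;
                            continue;
                    }

                    if (folio_test_dirty(folio)) {
                            /* Punch the clean range ending at this folio... */
                            if (start_byte > *punch_start_byte) {
                                    int error = punch(inode,
                                                    *punch_start_byte,
                                                    start_byte -
                                                            *punch_start_byte);
                                    if (error) {
                                            folio_unlock(folio);
                                            folio_put(folio);
                                            return error;
                                    }
                            }
                            /* ...and start the next punch past the folio. */
                            *punch_start_byte = min_t(loff_t, end_byte,
                                            (loff_t)folio_next_index(folio) <<
                                                            PAGE_SHIFT);
                    }

                    /* Step to the first byte beyond this folio. */
                    start_byte = (loff_t)folio_next_index(folio) << PAGE_SHIFT;
                    folio_unlock(folio);
                    folio_put(folio);
            }
            return 0;
    }

    Because the seek loop hands this scan sub-folio byte ranges, the
    same folio may be visited more than once; the dirty check is
    idempotent, so revisiting a folio is harmless.
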
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>