-
Andrew Morton authored
From: Badari Pulavarty, Suparna Bhattacharya, Andrew Morton Forward port of Stephen Tweedie's DIO fixes from 2.4, to fix various DIO vs buffered IO exposures involving races causing: (a) stale data from uninstantiated blocks to be read, e.g. - O_DIRECT reads against buffered writes to a sparse region - O_DIRECT writes to a sparse region against buffered reads (b) potential data corruption with - O_DIRECT IOs against truncate due to writes to truncated blocks (which may have been reallocated to another file). Summary of fixes: 1) All the changes affect only regular files. RAW/O_DIRECT on block are unaffected. 2) The DIO code will not fill in sparse regions on a write. Instead -ENOTBLK is returned and the generic file write code would fallthrough to buffered IO in this case followed by writing through the pages to disk using filemap_fdatawrite/wait. 3) i_sem is held during both DIO reads and writes. For reads, and writes to already allocated blocks, it is released right after IO is issued, while for writes to newly allocated blocks (e.g file extending writes and hole overwrites) it is held all the way through until IO completes (and data is committed to disk). 4) filemap_fdatawrite/wait are called under i_sem to synchronize buffered pages to disk blocks before issuing DIO. 5) A new rwsem (i_alloc_sem) is held in shared mode all the while a DIO (read or write) is in progress, and in exclusive mode by truncate to guard against deallocation of data blocks during DIO. 6) All this new locking has been pushed down into blockdev_direct_IO to avoid interfering with NFS direct IO. The locks are taken in the order i_sem followed by i_alloc_sem. While i_sem may be released after IO submission in some cases, i_alloc_sem is held through until dio_complete (in the case of AIO-DIO this happens through the IO completion callback). 7) i_sem and i_alloc_sem are not held for the _nolock versions of write routines, as used by blockdev and XFS. Filesystems can specify the needs_special_locking parameter to __blockdev_direct_IO from their direct IO address space op accordingly. Note from Badari: Here is the locking (when needs_special_locking is true): (1) generic_file_*_write() holds i_sem (as before) and calls ->direct_IO(). blockdev_direct_IO gets i_alloc_sem and call direct_io_worker(). (2) generic_file_*_read() does not hold any locks. blockdev_direct_IO() gets i_sem and then i_alloc_sem and calls direct_io_worker() to do the work (3) direct_io_worker() does the work and drops i_sem after submitting IOs if appropriate and drops i_alloc_sem after completing IOs.
bc0e2bbf