• Damien Le Moal's avatar
    zonefs: Improve error handling · 14db5f64
    Damien Le Moal authored
    Write error handling is racy and can sometime lead to the error recovery
    path wrongly changing the inode size of a sequential zone file to an
    incorrect value  which results in garbage data being readable at the end
    of a file. There are 2 problems:
    
    1) zonefs_file_dio_write() updates a zone file write pointer offset
       after issuing a direct IO with iomap_dio_rw(). This update is done
       only if the IO succeed for synchronous direct writes. However, for
       asynchronous direct writes, the update is done without waiting for
       the IO completion so that the next asynchronous IO can be
       immediately issued. However, if an asynchronous IO completes with a
       failure right before the i_truncate_mutex lock protecting the update,
       the update may change the value of the inode write pointer offset
       that was corrected by the error path (zonefs_io_error() function).
    
    2) zonefs_io_error() is called when a read or write error occurs. This
       function executes a report zone operation using the callback function
       zonefs_io_error_cb(), which does all the error recovery handling
       based on the current zone condition, write pointer position and
       according to the mount options being used. However, depending on the
       zoned device being used, a report zone callback may be executed in a
       context that is different from the context of __zonefs_io_error(). As
       a result, zonefs_io_error_cb() may be executed without the inode
       truncate mutex lock held, which can lead to invalid error processing.
    
    Fix both problems as follows:
    - Problem 1: Perform the inode write pointer offset update before a
      direct write is issued with iomap_dio_rw(). This is safe to do as
      partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
      set) and any failed IO will trigger the execution of zonefs_io_error()
      which will correct the inode write pointer offset to reflect the
      current state of the one on the device.
    - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
      and call this function directly from __zonefs_io_error() after
      obtaining the zone information using blkdev_report_zones() with a
      simple callback function that copies to a local stack variable the
      struct blk_zone obtained from the device. This ensures that error
      handling is performed holding the inode truncate mutex.
      This change also simplifies error handling for conventional zone files
      by bypassing the execution of report zones entirely. This is safe to
      do because the condition of conventional zones cannot be read-only or
      offline and conventional zone files are always fully mapped with a
      constant file size.
    Reported-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Fixes: 8dcc1a9d ("fs: New zonefs file system")
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarDamien Le Moal <dlemoal@kernel.org>
    Tested-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: default avatarHimanshu Madhani <himanshu.madhani@oracle.com>
    14db5f64
super.c 37.2 KB