1. 28 Jun, 2024 9 commits
  2. 27 Jun, 2024 4 commits
  3. 26 Jun, 2024 9 commits
  4. 24 Jun, 2024 2 commits
  5. 23 Jun, 2024 1 commit
  6. 21 Jun, 2024 4 commits
  7. 20 Jun, 2024 11 commits
    • nvme: Atomic write support · 5f9bbea0
      Alan Adamson authored
      Add support to set block layer request_queue atomic write limits. The
      limits will be derived from either the namespace or controller atomic
      parameters.
      
      NVMe atomic-related parameters are grouped into "normal" and "power-fail"
      (or PF) classes of parameter. For atomic write support, only PF parameters
      are of interest. The "normal" parameters are concerned with racing reads
      and writes (which also applies to PF). See NVM Command Set Specification
      Revision 1.0d section 2.1.4 for reference.
      
      Whether to use per namespace or controller atomic parameters is decided by
      NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
      Structure, NVM Command Set.
      
      NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
      are provided for a write which straddles this per-lba space boundary. The
      block layer merging policy is such that no merges may occur in which the
      resultant request would straddle such a boundary.
      
      Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
      the atomic boundary rule. In addition, again unlike SCSI, there is no
      dedicated atomic write command - a write which adheres to the atomic size
      limit and boundary is implicitly atomic.
      
      If NSFEAT bit 1 is set, the following parameters are of interest:
      - NAWUPF (Namespace Atomic Write Unit Power Fail)
      - NABSPF (Namespace Atomic Boundary Size Power Fail)
      - NABO (Namespace Atomic Boundary Offset)
      
      and we set request_queue limits as follows:
      - atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
      - atomic_write_max_bytes = NAWUPF
      - atomic_write_boundary = NABSPF
      
      In the unlikely scenario that NABO is non-zero, atomic writes will not be
      supported at all, as dealing with this adds extra complexity. This policy
      may change in future.
      
      In all cases, atomic_write_unit_min is set to the logical block size.
      
      If NSFEAT bit 1 is unset, the following parameter is of interest:
      - AWUPF (Atomic Write Unit Power Fail)
      
      and we set request_queue limits as follows:
      - atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
      - atomic_write_max_bytes = AWUPF
      - atomic_write_boundary = 0
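
      The two lists of limits above can be sketched as follows. This is a
      minimal illustration and not the driver code: the function name
      nvme_atomic_limits(), its parameters and the returned dict are
      hypothetical, all values are assumed to be already converted to bytes,
      and treating a zero unit as "no support" is an assumption.

```python
def rounddown_pow_of_two(n: int) -> int:
    """Largest power of two that is <= n (n must be positive)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1 << (n.bit_length() - 1)


def nvme_atomic_limits(lbs, nsfeat_bit1, awupf=0, nawupf=0, nabspf=0, nabo=0):
    """Hypothetical helper: derive atomic write limits (in bytes).

    Returns None when atomic writes are not supported.
    """
    if nsfeat_bit1:
        # Per-namespace parameters apply; a non-zero NABO disables support.
        if nabo != 0:
            return None
        unit, boundary = nawupf, nabspf
    else:
        # Controller-wide parameter applies; there is no boundary.
        unit, boundary = awupf, 0
    if unit <= 0:
        # Assumption of this sketch: a zero unit means no atomic support.
        return None
    return {
        "atomic_write_unit_min": lbs,  # always the logical block size
        "atomic_write_unit_max": rounddown_pow_of_two(unit),
        "atomic_write_max_bytes": unit,
        "atomic_write_boundary": boundary,
    }
```

      For example, a 24 KiB NAWUPF yields a 16 KiB atomic_write_unit_max while
      atomic_write_max_bytes stays at 24 KiB.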
      
      A new function, nvme_valid_atomic_write(), is also called from the
      submission path to verify that a request which has been submitted to the
      driver will actually be executed atomically. As mentioned, there is no dedicated NVMe
      atomic write command (which may error for a command which exceeds the
      controller atomic write limits).
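
      The kind of check described here can be sketched as below. This is an
      illustration only: valid_atomic_write() and its byte-based arguments are
      hypothetical and are not the signature of nvme_valid_atomic_write().

```python
def valid_atomic_write(start_byte, length, max_bytes, boundary_bytes):
    """Hypothetical check: would this write be executed atomically?

    The write must not exceed the atomic size limit and, when a boundary
    is defined, must not straddle a multiple of that boundary.
    """
    if length <= 0 or length > max_bytes:
        return False
    if boundary_bytes:
        end_byte = start_byte + length - 1
        # First and last byte must fall in the same boundary-sized window.
        if start_byte // boundary_bytes != end_byte // boundary_bytes:
            return False
    return True
```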
      
      Note on NABSPF:
      There seems to be some vagueness in the spec as to whether NABSPF applies
      when NSFEAT bit 1 is unset. Figure 97 does not explicitly mention NABSPF
      and how it is affected by bit 1. However, Figure 4 does say to check Figure
      97 for info about per-namespace parameters, which NABSPF is, so it is
      implied. However, nvme_update_disk_info() currently checks the namespace
      parameter NABO regardless of this bit.
      Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      jpg: total rewrite
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • scsi: scsi_debug: Atomic write support · 84f3a3c0
      John Garry authored
      Add initial support for atomic writes.
      
      As is the standard method, feed device properties via module params, those
      being:
      - atomic_max_size_blks
      - atomic_alignment_blks
      - atomic_granularity_blks
      - atomic_max_size_with_boundary_blks
      - atomic_max_boundary_blks
      
      These just match sbc4r22 section 6.6.4 - Block limits VPD page.
      
      We just support ATOMIC WRITE (16).
      
      The major change in the driver is how we lock the device for RW accesses.
      
      Currently the driver uses a per-device lock for accessing device metadata
      and "media" data (calls to do_device_access()) atomically for the duration
      of the whole read/write command.
      
      This does not suit verifying atomic writes, the reason being that currently
      all reads/writes are atomic, so using atomic writes does not prove
      anything.
      
      Change the device access model so that regular writes are only atomic on a
      per-sector basis, while reads and atomic writes are fully atomic.
      
      As mentioned, since accessing metadata and device media is atomic,
      continue to have regular writes involving metadata - like discard or PI -
      as atomic. We can improve this later.
      
      Currently we only support the model where overlapping ongoing reads or
      writes wait for the current access to complete before commencing an atomic
      write. This is described in section 4.29.3.2 of the SBC. However, we
      simplify things and wait for all accesses to complete (when issuing an
      atomic write).
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-10-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • scsi: sd: Atomic write support · bf4ae8f2
      John Garry authored
      Support is divided into two main areas:
      - reading VPD pages and setting sdev request_queue limits
      - support WRITE ATOMIC (16) command and tracing
      
      The relevant block limits VPD page needs to be read to allow the block layer
      request_queue atomic write limits to be set. These VPD page limits are
      described in sbc4r22 section 6.6.4 - Block limits VPD page.
      
      There are five limits of interest:
      - MAXIMUM ATOMIC TRANSFER LENGTH
      - ATOMIC ALIGNMENT
      - ATOMIC TRANSFER LENGTH GRANULARITY
      - MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
      - MAXIMUM ATOMIC BOUNDARY SIZE
      
      MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
      (16) command. It will not be greater than the device MAXIMUM TRANSFER
      LENGTH.
      
      ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
      alignment and length values for an atomic write in terms of logical blocks.
      
      Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
      a per-IO boundary granularity. The maximum boundary size is specified in
      MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
      WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
      command can be found in sbc4r22 section 5.48. This boundary value is the
      granularity size at which the device may atomically write the data. A value
      of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
      be atomically written together.
      
      MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
      length if a non-zero boundary value is set.
      
      For atomic write support, the WRITE ATOMIC (16) boundary is not of much
      interest, as the block layer expects each request submitted to be executed
      atomically. However, the SCSI spec does leave itself open to a quirky
      scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
      TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
      non-zero. This case will be supported.
      
      To set the block layer request_queue atomic write capabilities, sanitize
      the VPD page limits and set limits as follows:
      - atomic_write_unit_min is derived from granularity and alignment values.
        If no granularity value is set, use physical block size
      - atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
        the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
        limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
        atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
        set for this scenario.
      - atomic_write_boundary_bytes is set to zero always
      
      SCSI also supports a WRITE ATOMIC (32) command, which is for type 2
      protection enabled. This is not going to be supported now, so check for
      T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
      
      To handle an atomic write request, add support for WRITE ATOMIC (16)
      command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
      is checked here for encoding ATOMIC BOUNDARY field.
      
      Trace info is also added for WRITE_ATOMIC_16 command.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-9-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add fops atomic write support · caf336f8
      John Garry authored
      Support atomic writes by submitting a single BIO with the REQ_ATOMIC flag set.
      
      It must be ensured that the atomic write adheres to its rules, like
      naturally aligned offset, so call blkdev_dio_invalid() ->
      blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to
      blkdev_dio_invalid()] for this purpose. The BIO submission path currently
      checks for atomic writes which are too large, so no need to check here.
      
      In blkdev_direct_IO(), if the nr_pages exceeds BIO_MAX_VECS, then we cannot
      produce a single BIO, so error in this case.
      
      Finally set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes
      and the associated file flag is for O_DIRECT.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-8-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add atomic write support for statx · 9abcfbd2
      Prasad Singamsetty authored
      Extend the statx system call to return additional info for atomic write
      support if the specified file is a block device.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-7-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add core atomic write support · 9da3d1e9
      John Garry authored
      Add atomic write support, as follows:
      - add helper functions to get request_queue atomic write limits
      - report request_queue atomic write support limits to sysfs and update Doc
      - support to safely merge atomic writes
      - deal with splitting atomic writes
      - misc helper functions
      - add a per-request atomic write flag
      
      New request_queue limits are added, as follows:
      - atomic_write_hw_max is set by the block driver and is the maximum length
        of an atomic write which the device may support. It is not
        necessarily a power-of-2.
      - atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
        max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
        and atomic_write_max_sectors would be the limit on a merged atomic write
        request size. This value is not capped at max_sectors, as the value in
        max_sectors can be controlled from userspace, and it would only cause
        trouble if userspace could limit atomic_write_unit_max_bytes and the
        other atomic write limits.
      - atomic_write_hw_unit_{min,max} are set by the block driver and are the
        min/max length of an atomic write unit which the device may support. They
        both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
        the same value as atomic_write_hw_max.
      - atomic_write_unit_{min,max} are derived from
        atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
        Both min and max values must be a power-of-2.
      - atomic_write_hw_boundary is set by the block driver. If non-zero, it
        indicates an LBA space boundary; an atomic write which straddles it is
        no longer executed atomically by the disk. The value must be a
        power-of-2. Note that it would be acceptable to enforce a rule that
        atomic_write_hw_boundary_sectors is a multiple of
        atomic_write_hw_unit_max, but the resultant code would be more
        complicated.
      
      All atomic write limits are by default set to 0 to indicate no atomic write
      support. Even though it is assumed by Linux that a logical block can always
      be atomically written, we ignore this as it is not of particular interest.
      Stacked devices are not supported for now either.
      
      An atomic write must always be submitted to the block driver as part of a
      single request. As such, only a single BIO must be submitted to the block
      layer for an atomic write. When a single atomic write BIO is submitted, it
      cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
      by the maximum guaranteed BIO size which will not be required to be split.
      This max size is calculated by request_queue max segments and the number
      of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
      issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
      segment containing PAGE_SIZE of data, apart from the first+last, which each
      can fit logical block size of data. The first+last will be LBS
      length/aligned as we rely on direct IO alignment rules also.
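
      The size calculation described in this paragraph can be sketched as
      below. It is an illustration under stated assumptions: the function
      names are hypothetical, BIO_MAX_VECS is taken to be 256 and PAGE_SIZE
      4096, and capping the unit max by rounding down to a power-of-2 is
      this sketch's reading of the rules above.

```python
BIO_MAX_VECS = 256  # assumed value of the kernel constant
PAGE_SIZE = 4096    # assumed page size


def max_guaranteed_bio_bytes(max_segments, logical_block_size):
    """Largest BIO size (bytes) that is guaranteed not to need splitting.

    The first and last segments are only guaranteed to hold one logical
    block each; every segment in between holds a full page.
    """
    segs = min(BIO_MAX_VECS, max_segments)
    length = min(segs, 2) * logical_block_size
    if segs > 2:
        length += (segs - 2) * PAGE_SIZE
    return length


def cap_unit_max(hw_unit_max, bio_limit):
    """Hypothetical: limit a power-of-2 unit max by the BIO size limit."""
    capped = min(hw_unit_max, bio_limit)
    return 1 << (capped.bit_length() - 1) if capped > 0 else 0
```

      With 128 segments and 512-byte logical blocks the guaranteed size is
      2 * 512 + 126 * 4096 = 517120 bytes, so a 1 MiB hardware unit max would
      be capped to 256 KiB in this sketch.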
      
      New sysfs files are added to report the following atomic write limits:
      - atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
      				bytes
      - atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
      				bytes
      - atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
      				bytes
      - atomic_write_max_bytes      - same as atomic_write_max_sectors in bytes
      
      Atomic writes may only be merged with other atomic writes and only under
      the following conditions:
      - total resultant request length <= atomic_write_max_bytes
      - the merged write does not straddle a boundary
      
      Helper function bdev_can_atomic_write() is added to indicate whether
      atomic writes may be issued to a bdev. If a bdev is a partition, the
      partition start must be aligned with both atomic_write_unit_min_sectors
      and atomic_write_hw_boundary_sectors.
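
      The partition alignment rule can be sketched as below.
      partition_can_atomic_write() is a hypothetical name and is not the
      signature of bdev_can_atomic_write(); all arguments are in sectors.

```python
def partition_can_atomic_write(part_start, unit_min_sectors, boundary_sectors):
    """Hypothetical check of the partition alignment rule."""
    if unit_min_sectors == 0:
        # A zero limit indicates no atomic write support at all.
        return False
    if part_start % unit_min_sectors:
        return False
    if boundary_sectors and part_start % boundary_sectors:
        return False
    return True
```

      A partition starting at sector 2048 passes with an 8-sector unit min and
      a 128-sector boundary, while one starting at sector 2056 does not.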
      
      FSes will rely on the block layer to validate that an atomic write BIO
      submitted will be of valid size, so add blk_validate_atomic_write_op_size()
      for this purpose. Userspace expects an atomic write which is of invalid
      size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
      BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
      invalid size BIO.
      
      Flag REQ_ATOMIC is used for indicating an atomic write.
      Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: Add initial atomic write support info to statx · 0f9ca80f
      Prasad Singamsetty authored
      Extend the statx system call to return additional info for atomic write
      support for a file.
      
      Helper function generic_fill_statx_atomic_writes() can be used by FSes to
      fill in the relevant statx fields. For now atomic_write_segments_max will
      always be 1, otherwise some rules would need to be imposed on iovec length
      and alignment, which we don't want now.
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      jpg: relocate bdev support to another patch
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: Initial atomic write support · c34fc6f2
      Prasad Singamsetty authored
      An atomic write is a write issued with torn-write protection, meaning
      that for a power failure or any other hardware failure, all or none of the
      data from the write will be stored, but never a mix of old and new data.
      
      Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
      write is to be issued with torn-write prevention, according to special
      alignment and length rules.
      
      For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
      iocb->ki_flags field to indicate the same.
      
      A call to statx will give the relevant atomic write info for a file:
      - atomic_write_unit_min
      - atomic_write_unit_max
      - atomic_write_segments_max
      
      Both min and max values must be a power-of-2.
      
      Applications can avail of the atomic write feature by ensuring that the total
      length of a write is a power-of-2 in size and also sized between
      atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
      must ensure that the write is at a naturally-aligned offset in the file
      wrt the total write length. The value in atomic_write_segments_max
      indicates the upper limit for IOV_ITER iovcnt.
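
      The rules an application must follow can be sketched as a simple
      pre-flight check. atomic_write_ok() is a hypothetical userspace helper,
      not a kernel or libc function; unit_min, unit_max and segments_max
      stand for the values reported by statx.

```python
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def atomic_write_ok(offset, length, unit_min, unit_max, iovcnt=1, segments_max=1):
    """Hypothetical check of the RWF_ATOMIC rules described above."""
    if iovcnt > segments_max:
        return False
    # Total length must be a power-of-2 within [unit_min, unit_max].
    if not is_power_of_two(length):
        return False
    if not (unit_min <= length <= unit_max):
        return False
    # The file offset must be naturally aligned to the write length.
    return offset % length == 0
```

      For instance, with unit_min 4096 and unit_max 65536, an 8 KiB write at
      offset 8192 passes, while the same write at offset 4096 fails the
      natural alignment rule.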
      
      Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
      flag set will have RWF_ATOMIC rejected and not just ignored.
      
      Add a type argument to kiocb_set_rw_flags() to allow reads which have
      RWF_ATOMIC set to be rejected.
      
      Helper function generic_atomic_write_valid() can be used by FSes to verify
      compliant writes. There we check that the iov_iter type is ubuf, which
      implies iovcnt==1 for pwritev2(), which is an initial restriction for
      atomic_write_segments_max. Initially the only user will be the bdev file
      operations write handler. We will rely on the block BIO submission path to
      ensure write sizes are compliant for the bdev, so we don't need to check
      atomic write sizes yet.
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      jpg: merge into single patch and much rewrite
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Generalize chunk_sectors support as boundary support · f70167a7
      John Garry authored
      The purpose of the chunk_sectors limit is to ensure that a mergeable request
      fits within the boundary of the chunk_sectors value.
      
      Such a feature will be useful for other request_queue boundary limits, so
      generalize the chunk_sectors merge code.
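
      A generalized boundary calculation of this kind can be sketched as
      below. Both function names are hypothetical and the sketch works in
      sectors; it shows the idea of limiting a request to the space left
      before the next boundary, not the actual merge code.

```python
def sectors_left_in_boundary(offset_sector, boundary_sectors):
    """Sectors from offset_sector up to the next boundary multiple."""
    return boundary_sectors - (offset_sector % boundary_sectors)


def max_request_sectors(offset_sector, max_sectors, boundary_sectors):
    """Hypothetical: cap a request so it never crosses a boundary.

    A boundary of 0 means that no boundary limit applies.
    """
    if boundary_sectors == 0:
        return max_sectors
    return min(max_sectors,
               sectors_left_in_boundary(offset_sector, boundary_sectors))
```

      With a 128-sector boundary, a request starting at sector 100 may grow to
      at most 28 sectors, whatever boundary limit the value came from.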
      
      This idea was proposed by Hannes Reinecke.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Pass blk_queue_get_max_sectors() a request pointer · 8d1dfd51
      John Garry authored
      Currently blk_queue_get_max_sectors() is passed an enum req_op. In future
      the value returned from blk_queue_get_max_sectors() may depend on certain
      request flags, so pass a request pointer.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-2-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'for-6.11/block-limits' into for-6.11/block · e821bcec
      Jens Axboe authored
      Merge in queue limits cleanups.
      
      * for-6.11/block-limits:
        block: move the raid_partial_stripes_expensive flag into the features field
        block: remove the discard_alignment flag
        block: move the misaligned flag into the features field
        block: renumber and rename the cache disabled flag
        block: fix spelling and grammar for in writeback_cache_control.rst
        block: remove the unused blk_bounce enum