1. 21 Nov, 2019 1 commit
    • Konstantin Khlebnikov's avatar
      block: add iostat counters for flush requests · b6866318
      Konstantin Khlebnikov authored
      Requests that triggers flushing volatile writeback cache to disk (barriers)
      have significant effect to overall performance.
      
      Block layer has sophisticated engine for combining several flush requests
      into one. But there is no statistics for actual flushes executed by disk.
      Requests which trigger flushes usually are barriers - zero-size writes.
      
      This patch adds two iostat counters into /sys/class/block/$dev/stat and
      /proc/diskstats - count of completed flush requests and their total time.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b6866318
  2. 20 Nov, 2019 1 commit
  3. 18 Nov, 2019 3 commits
  4. 13 Nov, 2019 2 commits
  5. 08 Nov, 2019 2 commits
  6. 07 Nov, 2019 12 commits
  7. 06 Nov, 2019 1 commit
  8. 05 Nov, 2019 5 commits
    • Jens Axboe's avatar
      Merge branch 'nvme-5.4-rc7' of git://git.infradead.org/nvme into for-linus · 0473976c
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "We have a few late nvme fixes for a couple device removal kernel
       crashes, and a compat fix for a new ioctl introduced during this merge
       window."
      
      * 'nvme-5.4-rc7' of git://git.infradead.org/nvme:
        nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
        nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
        nvme-rdma: fix a segmentation fault during module unload
      0473976c
    • Charles Machalow's avatar
      nvme: change nvme_passthru_cmd64 to explicitly mark rsvd · 0d6eeb1f
      Charles Machalow authored
      Changing nvme_passthru_cmd64 to add a field: rsvd2. This field is an explicit
      marker for the padding space added on certain platforms as a result of the
      enlargement of the result field from 32 bit to 64 bits in size, and
      fixes differences in struct size when using compat ioctl for 32-bit
      binaries on 64-bit architecture.
      
      Fixes: 65e68edc ("nvme: allow 64-bit results in passthru commands")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarCharles Machalow <csm10495@gmail.com>
      [changelog]
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      0d6eeb1f
    • Anton Eidelman's avatar
      nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8
      Anton Eidelman authored
      nvme_mpath_clear_ctrl_paths() iterates through
      the ctrl->namespaces list while holding ctrl->scan_lock.
      This does not seem to be the correct way of protecting
      from concurrent list modification.
      
      Specifically, nvme_scan_work() sorts ctrl->namespaces
      AFTER unlocking scan_lock.
      
      This may result in the following (rare) crash in ctrl disconnect
      during scan_work:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000050
          Oops: 0000 [#1] SMP PTI
          CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
          RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
          ...
          Call Trace:
           nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
           nvme_remove_namespaces+0x35/0xe0 [nvme_core]
           nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
           nvme_sysfs_delete+0x49/0x60 [nvme_core]
           dev_attr_store+0x17/0x30
           sysfs_kf_write+0x3e/0x50
           kernfs_fop_write+0x11e/0x1a0
           __vfs_write+0x1b/0x40
           vfs_write+0xb9/0x1a0
           ksys_write+0x67/0xe0
           __x64_sys_write+0x1a/0x20
           do_syscall_64+0x5a/0x130
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
          RIP: 0033:0x7f8d02bfb154
      
      Fix:
      After taking scan_lock in nvme_mpath_clear_ctrl_paths()
      down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe.
      This will not cause deadlocks because taking scan_lock never happens
      while holding the namespaces_rwsem.
      Moreover, scan work downs namespaces_rwsem in the same order.
      
      Alternative: sort ctrl->namespaces in nvme_scan_work()
      while still holding the scan_lock.
      This would leave nvme_mpath_clear_ctrl_paths() without correct protection
      against ctrl->namespaces modification by anyone other than scan_work.
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      763303a8
    • Max Gurtovoy's avatar
      nvme-rdma: fix a segmentation fault during module unload · 9ad9e8d6
      Max Gurtovoy authored
      In case there are controllers that are not associated with any RDMA
      device (e.g. during unsuccessful reconnection) and the user will unload
      the module, these controllers will not be freed and will access already
      freed memory. The same logic appears in other fabric drivers as well.
      
      Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset")
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      9ad9e8d6
    • Christoph Hellwig's avatar
      block: avoid blk_bio_segment_split for small I/O operations · fa532287
      Christoph Hellwig authored
      __blk_queue_split() adds significant overhead for small I/O operations.
      Add a shortcut to avoid it for cases where we know we never need to
      split.
      
      Based on a patch from Ming Lei.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fa532287
  9. 04 Nov, 2019 4 commits
  10. 03 Nov, 2019 2 commits
  11. 02 Nov, 2019 1 commit
    • Ming Lei's avatar
      blk-mq: avoid sysfs buffer overflow with too many CPU cores · 8962842c
      Ming Lei authored
      It is reported that sysfs buffer overflow can be triggered if the system
      has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of
      hctx via /sys/block/$DEV/mq/$N/cpu_list.
      
      Use snprintf to avoid the potential buffer overflow.
      
      This version doesn't change the attribute format, and simply stops
      showing CPU numbers if the buffer is going to overflow.
      
      Cc: stable@vger.kernel.org
      Fixes: 676141e4("blk-mq: don't dump CPU -> hw queue map on driver load")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8962842c
  12. 01 Nov, 2019 1 commit
  13. 31 Oct, 2019 1 commit
  14. 30 Oct, 2019 1 commit
    • Jens Axboe's avatar
      io_uring: ensure we clear io_kiocb->result before each issue · 6873e0bd
      Jens Axboe authored
      We use io_kiocb->result == -EAGAIN as a way to know if we need to
      re-submit a polled request, as -EAGAIN reporting happens out-of-line
      for IO submission failures. This field is cleared when we originally
      allocate the request, but it isn't reset when we retry the submission
      from async context. This can cause issues where we think something
      needs a re-issue, but we're really just reading stale data.
      
      Reset ->result whenever we re-prep a request for polled submission.
      
      Cc: stable@vger.kernel.org
      Fixes: 9e645e11 ("io_uring: add support for sqe links")
      Reported-by: default avatarBijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6873e0bd
  15. 29 Oct, 2019 3 commits
    • Anton Ivanov's avatar
      um-ubd: Entrust re-queue to the upper layers · d848074b
      Anton Ivanov authored
      Fixes crashes due to ubd requeue logic conflicting with the block-mq
      logic. Crash is reproducible in 5.0 - 5.3.
      
      Fixes: 53766def ("um: Clean-up command processing in UML UBD driver")
      Cc: stable@vger.kernel.org # v5.0+
      Signed-off-by: default avatarAnton Ivanov <anton.ivanov@cambridgegreys.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d848074b
    • Anton Eidelman's avatar
      nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf
      Anton Eidelman authored
      groups_only mode in nvme_read_ana_log() is no longer used: remove it.
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      86cccfbf
    • Anton Eidelman's avatar
      nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042
      Anton Eidelman authored
      The following scenario results in an IO hang:
      1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
         NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
      2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
         This fails because ctrl disconnects.
         Therefore nvme_update_ns_ana_state() is not called
         and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
      3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
         nvme_read_ana_log(ctrl, groups_only=true).
         However, nvme_update_ana_state() does not update namespaces
         because nr_nsids = 0 (due to groups_only mode).
      4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK.
      
      Result:
      The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
      Consequently ctrl will never be considered a viable path by __nvme_find_path().
      IO will hang if ctrl is the only or the last path to the namespace.
      
      More generally, while ctrl is reconnecting, its ANA state may change.
      And because nvme_mpath_init() requests ANA log in groups_only mode,
      these changes are not propagated to the existing ctrl namespaces.
      This may result in a mal-function or an IO hang.
      
      Solution:
      nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false.
      This will not harm the new ctrl case (no namespaces present),
      and will make sure the ANA state of namespaces gets updated after reconnect.
      
      Note: Another option would be for nvme_mpath_init() to invoke
      nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      af8fd042