1. 08 May, 2024 2 commits
    • Sagi Grimberg's avatar
      nvmet-rdma: fix possible bad dereference when freeing rsps · 73964c1d
      Sagi Grimberg authored
      It is possible that the host connected and saw a cm established
      event and started sending nvme capsules on the qp, however the
      ctrl did not yet see an established event. This is why the
      rsp_wait_list exists (for async handling of these cmds, we move
      them to a pending list).
      
      Furthermore, it is possible that the ctrl cm times out, resulting
      in a connect-error cm event. in this case we hit a bad deref [1]
      because in nvmet_rdma_free_rsps we assume that all the responses
      are in the free list.
      
      We are freeing the cmds array anyways, so don't even bother to
      remove the rsp from the free_list. It is also guaranteed that we
      are not racing anything when we are releasing the queue so no
      other context accessing this array should be running.
      
      [1]:
      --
      Workqueue: nvmet-free-wq nvmet_rdma_free_queue_work [nvmet_rdma]
      [...]
      pc : nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma]
      lr : nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma]
       Call trace:
       nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma]
       nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma]
       process_one_work+0x1ec/0x4a0
       worker_thread+0x48/0x490
       kthread+0x158/0x160
       ret_from_fork+0x10/0x18
      --
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      73964c1d
    • Dan Carpenter's avatar
      nvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists() · d15dcd0f
      Dan Carpenter authored
      The nsid value is a u32 that comes from nvmet_req_find_ns().  It's
      endian data and we're on an error path and both of those raise red
      flags.  So let's make this safer.
      
      1) Make the buffer large enough for any u32.
      2) Remove the unnecessary initialization.
      3) Use snprintf() instead of sprintf() for even more safety.
      4) The sprintf() function returns the number of bytes printed, not
         counting the NUL terminator. It is impossible for the return value to
         be <= 0 so delete that.
      
      Fixes: 50536395 ("nvmet: fix nvme status code when namespace is disabled")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      d15dcd0f
  2. 07 May, 2024 4 commits
  3. 02 May, 2024 1 commit
    • Jens Axboe's avatar
      Merge commit '50abcc17' of... · fb15ffd0
      Jens Axboe authored
      Merge commit '50abcc17' of git://git.infradead.org/nvme into block-6.9
      
      Pull NVMe fixes from Keith.
      
      * git://git.infradead.org/nvme:
        nvme-tcp: strict pdu pacing to avoid send stalls on TLS
        nvmet: fix nvme status code when namespace is disabled
        nvmet-tcp: fix possible memory leak when tearing down a controller
        nvme: cancel pending I/O if nvme controller is in terminal state
        nvmet-auth: replace pr_debug() with pr_err() to report an error.
        nvmet-auth: return the error code to the nvmet_auth_host_hash() callers
        nvme: find numa distance only if controller has valid numa id
        nvme: fix warn output about shared namespaces without CONFIG_NVME_MULTIPATH
      fb15ffd0
  4. 01 May, 2024 7 commits
  5. 30 Apr, 2024 1 commit
    • Uday Shankar's avatar
      ublk: remove segment count and size limits · eaf4a9b1
      Uday Shankar authored
      ublk_drv currently creates block devices with the default max_segments
      and max_segment_size limits of BLK_MAX_SEGMENTS (128) and
      BLK_MAX_SEGMENT_SIZE (65536) respectively. These defaults can
      artificially constrain the I/O size seen by the ublk server - for
      example, suppose that the ublk server has configured itself to accept
      I/Os up to 1M and the application is also issuing 1M sized I/Os. If the
      I/O buffer used by the application is backed by 4K pages, the buffer
      could consist of up to 1M / 4K = 256 physically discontiguous segments
      (even if the buffer is virtually contiguous). As such, the I/O could
      exceed the default max_segments limit and get split. This can cause
      unnecessary performance issues if the ublk server is optimized to handle
      1M I/Os. The block layer's segment count/size limits exist to model
      hardware constraints which don't exist in ublk_drv's case, so just
      remove those limits for the block devices created by ublk_drv.
      Signed-off-by: default avatarUday Shankar <ushankar@purestorage.com>
      Reviewed-by: default avatarRiley Thomasson <riley@purestorage.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20240430211623.2802036-1-ushankar@purestorage.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      eaf4a9b1
  6. 19 Apr, 2024 1 commit
    • Li Nan's avatar
      blk-iocost: do not WARN if iocg was already offlined · 01bc4fda
      Li Nan authored
      In iocg_pay_debt(), warn is triggered if 'active_list' is empty, which
      is intended to confirm iocg is active when it has debt. However, warn
      can be triggered during a blkcg or disk removal, if iocg_waitq_timer_fn()
      is run at that time:
      
        WARNING: CPU: 0 PID: 2344971 at block/blk-iocost.c:1402 iocg_pay_debt+0x14c/0x190
        Call trace:
        iocg_pay_debt+0x14c/0x190
        iocg_kick_waitq+0x438/0x4c0
        iocg_waitq_timer_fn+0xd8/0x130
        __run_hrtimer+0x144/0x45c
        __hrtimer_run_queues+0x16c/0x244
        hrtimer_interrupt+0x2cc/0x7b0
      
      The warn in this situation is meaningless. Since this iocg is being
      removed, the state of the 'active_list' is irrelevant, and 'waitq_timer'
      is canceled after removing 'active_list' in ioc_pd_free(), which ensures
      iocg is freed after iocg_waitq_timer_fn() returns.
      
      Therefore, add the check if iocg was already offlined to avoid warn
      when removing a blkcg or disk.
      Signed-off-by: default avatarLi Nan <linan122@huawei.com>
      Reviewed-by: default avatarYu Kuai <yukuai3@huawei.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20240419093257.3004211-1-linan666@huaweicloud.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      01bc4fda
  7. 18 Apr, 2024 1 commit
  8. 12 Apr, 2024 2 commits
    • Yu Kuai's avatar
      block: fix that blk_time_get_ns() doesn't update time after schedule · 3ec48489
      Yu Kuai authored
      While monitoring the throttle time of IO from iocost, it's found that
      such time is always zero after the io_schedule() from ioc_rqos_throttle,
      for example, with the following debug patch:
      
      +       printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
              while (true) {
                      set_current_state(TASK_UNINTERRUPTIBLE);
                      if (wait.committed)
                              break;
                      io_schedule();
              }
      +       printk("%s-%d: %s exit  %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
      
      It can be observerd that blk_time_get_ns() always return the same time:
      
      [ 1068.096579] fio-1268: ioc_rqos_throttle enter 1067901962288
      [ 1068.272587] fio-1268: ioc_rqos_throttle exit  1067901962288
      [ 1068.274389] fio-1268: ioc_rqos_throttle enter 1067901962288
      [ 1068.472690] fio-1268: ioc_rqos_throttle exit  1067901962288
      [ 1068.474485] fio-1268: ioc_rqos_throttle enter 1067901962288
      [ 1068.672656] fio-1268: ioc_rqos_throttle exit  1067901962288
      [ 1068.674451] fio-1268: ioc_rqos_throttle enter 1067901962288
      [ 1068.872655] fio-1268: ioc_rqos_throttle exit  1067901962288
      
      And I think the root cause is that 'PF_BLOCK_TS' is always cleared
      by blk_flush_plug() before scheduel(), hence blk_plug_invalidate_ts()
      will never be called:
      
      blk_time_get_ns
       plug->cur_ktime = ktime_get_ns();
       current->flags |= PF_BLOCK_TS;
      
      io_schedule:
       io_schedule_prepare
        blk_flush_plug
         __blk_flush_plug
          /* the flag is cleared, while time is not */
          current->flags &= ~PF_BLOCK_TS;
       schedule
       sched_update_worker
        /* the flag is not set, hence plug->cur_ktime is not cleared */
        if (tsk->flags & PF_BLOCK_TS)
         blk_plug_invalidate_ts()
      
      blk_time_get_ns
       /* got the time stashed before schedule */
       return plug->cur_ktime;
      
      Fix the problem by clearing cached time in __blk_flush_plug().
      
      Fixes: 06b23f92 ("block: update cached timestamp post schedule/preemption")
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240411032349.3051233-2-yukuai1@huaweicloud.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3ec48489
    • Yi Zhang's avatar
      nvme: fix warn output about shared namespaces without CONFIG_NVME_MULTIPATH · 0bc2e80b
      Yi Zhang authored
      Move the stray '.' that is currently at the end of the line after
      newline '\n' to before newline character which is the right position.
      
      Fixes: ce8d7861 ("nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH")
      Signed-off-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Reviewed-by: default avatarChaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      0bc2e80b
  9. 09 Apr, 2024 1 commit
  10. 07 Apr, 2024 2 commits
  11. 06 Apr, 2024 1 commit
    • Rik van Riel's avatar
      blk-iocost: avoid out of bounds shift · beaa51b3
      Rik van Riel authored
      UBSAN catches undefined behavior in blk-iocost, where sometimes
      iocg->delay is shifted right by a number that is too large,
      resulting in undefined behavior on some architectures.
      
      [  186.556576] ------------[ cut here ]------------
      UBSAN: shift-out-of-bounds in block/blk-iocost.c:1366:23
      shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long')
      CPU: 16 PID: 0 Comm: swapper/16 Tainted: G S          E    N 6.9.0-0_fbk700_debug_rc2_kbuilder_0_gc85af715 #1
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x8f/0xe0
       __ubsan_handle_shift_out_of_bounds+0x22c/0x280
       iocg_kick_delay+0x30b/0x310
       ioc_timer_fn+0x2fb/0x1f80
       __run_timer_base+0x1b6/0x250
      ...
      
      Avoid that undefined behavior by simply taking the
      "delay = 0" branch if the shift is too large.
      
      I am not sure what the symptoms of an undefined value
      delay will be, but I suspect it could be more than a
      little annoying to debug.
      Signed-off-by: default avatarRik van Riel <riel@surriel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20240404123253.0f58010f@imladris.surriel.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      beaa51b3
  12. 04 Apr, 2024 5 commits
  13. 02 Apr, 2024 4 commits
  14. 31 Mar, 2024 8 commits