1. 26 Jan, 2018 3 commits
  2. 25 Jan, 2018 3 commits
  3. 23 Jan, 2018 1 commit
  4. 20 Jan, 2018 1 commit
  5. 19 Jan, 2018 6 commits
  6. 18 Jan, 2018 9 commits
  7. 17 Jan, 2018 11 commits
    • Bart Van Assche's avatar
      block: Fix __bio_integrity_endio() documentation · de99a346
      Bart Van Assche authored
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      de99a346
    • Christoph Hellwig's avatar
      nvme-pci: clean up SMBSZ bit definitions · 88de4598
      Christoph Hellwig authored
      Define the bit positions instead of macros using the magic values,
      and move the expanded helpers to calculate the size and size unit into
      the implementation C file.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      88de4598
    • Christoph Hellwig's avatar
      nvme-pci: clean up CMB initialization · f65efd6d
      Christoph Hellwig authored
      Refactor the call to nvme_map_cmb, and change the conditions for probing
      for the CMB.  First remove the version check as NVMe TPs always apply
      to earlier versions of the spec as well.  Second check for the whole CMBSZ
      register for support of the CMB feature instead of just the size field
      inside of it to simplify the code a bit.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      f65efd6d
    • James Smart's avatar
      nvme-fc: correct hang in nvme_ns_remove() · 0fd997d3
      James Smart authored
      When connectivity is lost to a device, the association is terminated
      and the blk-mq queues are quiesced/stopped. When connectivity is
      re-established, they are resumed.
      
      If connectivity is lost for a sufficient amount of time that the
      controller is then deleted, the delete path starts tearing down queues,
      and eventually calling nvme_ns_remove(). It appears that pending
      commands may cause blk_cleanup_queue() to never complete and the
      teardown stalls.
      
      Correct by starting the ns queues after transitioning to a DELETING
      state, allowing pending commands to be flushed with io failures. Thus
      the delete path is clear when reached.
      Signed-off-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      0fd997d3
    • James Smart's avatar
      nvme-fc: fix rogue admin cmds stalling teardown · d625d05e
      James Smart authored
      When connectivity is lost to a device, the association is terminated
      and the blk-mq queues are quiesced/stopped. When connectivity is
      re-established, they are resumed.
      
      If an admin command is received while connectivity is list, the ioctl
      queues the command on the admin_q and the command stalls (the thread
      issuing the ioctl hangs/waits). if the connectivity is lost long
      enough such that the controller is then deleted, the delete code
      makes its calls to initiate the delete, which then expects the core
      layer to call the transport when all references are removed and the
      controller can be freed.  Unfortunately, nothing in this path dequeued
      the admin command, so a reference sits outstanding and things stop,
      hanging the delete indefinitely.
      
      Correct by unquiescing the admin queue in the delete association. This
      means any admin command (which should only be from an ioctl) issued
      after connectivity is lost will detect the controller is in a
      reconnecting state and will (fast) fail the command. Thus, a pending
      reference can no longer be created.  Once connectivity is re-established,
      a new ioctl/admin command would see proper device state and function again.
      Signed-off-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      d625d05e
    • Mike Snitzer's avatar
      blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295
      Mike Snitzer authored
      After commit:
      
      923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")
      
      we no longer use the 'can_block' argument in
      blk_mq_sched_insert_request(). Kill it.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      
      Added actual commit message as to why it's being removed.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9e97d295
    • Ming Lei's avatar
      blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei authored
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request.  Whereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      396eaf21
    • Mike Snitzer's avatar
      blk-mq: factor out a few helpers from __blk_mq_try_issue_directly · 0f95549c
      Mike Snitzer authored
      No functional change.  Just makes code flow more logically.
      
      In following commit, __blk_mq_try_issue_directly() will be used to
      return the dispatch result (blk_status_t) to DM.  DM needs this
      information to improve IO merging.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0f95549c
    • Ming Lei's avatar
      blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · 7df938fb
      Ming Lei authored
      We know this WARN_ON is harmless and in reality it may be trigged,
      so convert it to printk() and dump_stack() to avoid to confusing
      people.
      
      Also add comment about two releated races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7df938fb
    • Ming Lei's avatar
      blk-mq: make sure hctx->next_cpu is set correctly · 7bed4595
      Ming Lei authored
      When hctx->next_cpu is set from possible online CPUs, there is one
      race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
      break workqueue.
      
      The race can be triggered in the following two sitations:
      
      1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
      to dispatch requests from the DEAD cpu context, but at that
      time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
      CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
      a bad value.
      
      2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
      should be run on the other CPU A, then CPU A may become offline at the
      same time and all CPUs in hctx->cpumask become offline.
      
      This patch deals with this issue by re-selecting next CPU, and making
      sure it is set correctly.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: default avatar"jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: default avatar"jianchao.wang" <jianchao.w.wang@oracle.com>
      Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7bed4595
    • Tina Ruchandani's avatar
      aoe: use ktime_t instead of timeval · 85cf955d
      Tina Ruchandani authored
      'struct frame' uses two variables to store the sent timestamp - 'struct
      timeval' and jiffies. jiffies is used to avoid discrepancies caused by
      updates to system time. 'struct timeval' is deprecated because it uses
      32-bit representation for seconds which will overflow in year 2038.
      
      This patch does the following:
      - Replace the use of 'struct timeval' and jiffies with ktime_t, which
        is the recommended type for timestamping
      - ktime_t provides both long range (like jiffies) and high resolution
        (like timeval). Using ktime_get (monotonic time) instead of wall-clock
        time prevents any discprepancies caused by updates to system time.
      
      [updates by Arnd below]
      The original patch from Tina never went anywhere as we discussed how
      to keep the impact on performance minimal. I've started over now but
      arrived at basically the same patch that she had originally, except for
      an slightly improved tsince_hr() function. I'm making it more robust
      against overflows, and also optimize explicitly for the common case
      in which a frame is less than 4.2 seconds old, using only a 32-bit
      division in that case.
      
      This should make the new version more efficient than the old code,
      since we replace the existing two 32-bit division in do_gettimeofday()
      plus one multiplication with a single single 32-bit division in
      tsince_hr() and drop the double bookkeeping. It's also more efficient
      than the ktime_get_us() API we discussed before, since that would
      also rely on multiple divisions.
      
      Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.htmlSigned-off-by: default avatarTina Ruchandani <ruchandani.tina@gmail.com>
      Cc: Ed Cashin <ed.cashin@acm.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      85cf955d
  8. 16 Jan, 2018 1 commit
    • Arnd Bergmann's avatar
      blkcg: simplify statistic accumulation code · ddc21231
      Arnd Bergmann authored
      Some older compilers (gcc-4.4 through 4.6 in particular) struggle
      with the way that blkg_rwstat_read() returns a structure, leading
      to excessive stack usage and rather inefficient code:
      
      block/blk-cgroup.c: In function 'blkg_destroy':
      block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
      block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/bfq-cgroup.c: In function 'bfqg_stats_add_aux':
      block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      I also notice that there is no point in using atomic accesses
      for the local variables, so storing the temporaries in simple 'u64'
      variables not only avoids the stack usage on older compilers but
      also improves the object code on modern versions.
      
      Fixes: e6269c44 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it")
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ddc21231
  9. 15 Jan, 2018 5 commits
    • Sagi Grimberg's avatar
      nvmet: release a ns reference in nvmet_req_uninit if needed · 423b4487
      Sagi Grimberg authored
      nvmet_req_init looked up a namespace and took a reference on it (unless it
      failed prior to that). If the request is uninitialized (in error cases) we
      need to remove that reference in case it was taken, otherwise we leak
      namespace reference when calling nvme_req_uninit.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      423b4487
    • Roland Dreier's avatar
      nvme-fabrics: fix memory leak when parsing host ID option · df351ef7
      Roland Dreier authored
      We use match_strdup() to get a copy of the option string for host ID string, but
      we just pass it to uuid_parse() and don't store the string pointer, so we need to
      kfree() the string after parsing it.
      Signed-off-by: default avatarRoland Dreier <roland@purestorage.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      df351ef7
    • Minwoo Im's avatar
      nvme: fix comment typos in nvme_create_io_queues · 8adb8c14
      Minwoo Im authored
      fix comment typos in nvme_create_io_queues() like below.
        _aount_ to _amount_
        _an_    to _can_
      Signed-off-by: default avatarMinwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      8adb8c14
    • Roy Shterman's avatar
      nvme: host delete_work and reset_work on separate workqueues · b227c59b
      Roy Shterman authored
      We need to ensure that delete_work will be hosted on a different
      workqueue than all the works we flush or cancel from it.
      Otherwise we may hit a circular dependency warning [1].
      
      Also, given that delete_work flushes reset_work, host reset_work
      on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
      fix the flushing in the individual drivers to flush nvme_delete_wq
      when draining queued deletes.
      
      [1]:
      [  178.491942] =============================================
      [  178.492718] [ INFO: possible recursive locking detected ]
      [  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
      [  178.494382] ---------------------------------------------
      [  178.495160] kworker/5:1/135 is trying to acquire lock:
      [  178.495894]  (
      [  178.496120] "nvme-wq"
      [  178.496471] ){++++.+}
      [  178.496599] , at:
      [  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
      [  178.497670]
                     but task is already holding lock:
      [  178.498499]  (
      [  178.498724] "nvme-wq"
      [  178.499074] ){++++.+}
      [  178.499202] , at:
      [  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.500343]
                     other info that might help us debug this:
      [  178.501269]  Possible unsafe locking scenario:
      
      [  178.502113]        CPU0
      [  178.502472]        ----
      [  178.502829]   lock(
      [  178.503115] "nvme-wq"
      [  178.503467] );
      [  178.503716]   lock(
      [  178.504001] "nvme-wq"
      [  178.504353] );
      [  178.504601]
                      *** DEADLOCK ***
      
      [  178.505441]  May be due to missing lock nesting notation
      
      [  178.506453] 2 locks held by kworker/5:1/135:
      [  178.507068]  #0:
      [  178.507330]  (
      [  178.507598] "nvme-wq"
      [  178.507726] ){++++.+}
      [  178.508079] , at:
      [  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.509004]  #1:
      [  178.509265]  (
      [  178.509532] (&ctrl->delete_work)
      [  178.509795] ){+.+.+.}
      [  178.510145] , at:
      [  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.511070]
                     stack backtrace:
      :
      [  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
      [  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
      [  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
      [  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
      [  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
      [  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
      [  178.518443] Call Trace:
      [  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
      [  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
      [  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
      [  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
      [  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
      [  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
      [  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
      [  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
      [  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
      [  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
      [  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
      [  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
      [  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
      [  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
      [  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
      [  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
      [  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
      [  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
      [  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
      Signed-off-by: default avatarRoy Shterman <roys@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      b227c59b
    • Mike Snitzer's avatar
      dm: fix incomplete request_queue initialization · c100ec49
      Mike Snitzer authored
      DM is no longer prone to having its request_queue be improperly
      initialized.
      
      Summary of changes:
      
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      
      A very welcome side-effect of these changes is DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk()).
      2) call elv_register_queue() to get .request_fn request-based DM
      device's "iosched" exposed in syfs.
      
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      https://patchwork.kernel.org/patch/10067961/
      
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Tested-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c100ec49