1. 05 Oct, 2017 1 commit
    • Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-linus · d7b544de
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "A trivial one-liner from Martin to fix the visible of the uuid attr,
      and another one (originally from Abhishek Shah, rewritten by me) to fix
      the CMB addresses passed back to the controller in case of a system that
      remaps BAR addresses between host and device."
  2. 04 Oct, 2017 2 commits
    • bsg-lib: fix use-after-free under memory-pressure · eab40cf3
      Benjamin Block authored
      Under memory pressure, it is possible that the mempool which backs the
      'struct request_queue' will fall back on up to BLKDEV_MIN_RQ
      preallocated emergency buffers when it can't get a regular allocation.
      Once those are in use as well, the pool is re-supplied with old,
      finished requests from the same request_queue (see mempool_free()).
      
      The bug is that when the emergency pool is re-supplied, the old
      requests are not run through the mempool_t->alloc() callback again, and
      thus also not through bsg_init_rq(). Initialization is therefore
      skipped, and while the sense-buffer should still be good,
      scsi_request->cmd may have become an invalid pointer in the meantime:
      when the request is initialized in bsg.c and the user's CDB is larger
      than BLK_MAX_CDB, bsg replaces it with a custom-allocated buffer, which
      is freed when the user's command finishes, so the pointer dangles
      afterwards. When the user next sends a command whose CDB is no larger
      than BLK_MAX_CDB, bsg assumes that scsi_request->cmd is backed by
      scsi_request->__cmd, makes no custom allocation, and writes into
      undefined memory.
      
      Fix this by splitting bsg_init_rq() into two functions (sketched below):
       - bsg_init_rq() is changed to only do the allocation of the
         sense-buffer, which is used to back the bsg job's reply buffer. This
         pointer should never change during the lifetime of a scsi_request, so
         it doesn't need re-initialization.
       - bsg_initialize_rq() is a new function that makes use of
         'struct request_queue's initialize_rq_fn callback (which was
         introduced in v4.12). This is always called before the request is
         given out via blk_get_request(). This function does the remaining
         initialization that was previously done in bsg_init_rq(), and will
         also do it when the request is taken from the emergency-pool of the
         backing mempool.
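
      A minimal sketch of the split, assuming helper and field names close to
      the bsg-lib of that era (not a verbatim copy of the patch):

      	/* mempool_t->alloc(): runs once per request allocation, so do
      	 * only the one-time work - allocate the sense buffer that
      	 * backs the bsg job's reply and never changes afterwards */
      	static int bsg_init_rq(struct request_queue *q, struct request *req,
      			       gfp_t gfp)
      	{
      		struct bsg_job *job = blk_mq_rq_to_pdu(req);

      		job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
      		return job->reply ? 0 : -ENOMEM;
      	}

      	/* initialize_rq_fn: runs from blk_get_request() every time the
      	 * request is handed out, including when it comes back from the
      	 * emergency pool - reset everything except the sense buffer */
      	static void bsg_initialize_rq(struct request *req)
      	{
      		struct bsg_job *job = blk_mq_rq_to_pdu(req);
      		void *reply = job->reply;

      		memset(job, 0, sizeof(*job));
      		/* the real function also re-inits the embedded
      		 * scsi_request and points its sense at 'reply' here */
      		job->reply = reply;
      		job->reply_len = SCSI_SENSE_BUFFERSIZE;
      		job->dd_data = job + 1;
      	}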
      
      Fixes: 50b4d485 ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
      Cc: <stable@vger.kernel.org> # 4.11+
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-pci: Use PCI bus address for data/queues in CMB · 8969f1f8
      Christoph Hellwig authored
      Currently, the NVMe PCI host driver programs the CMB's DMA address as
      the I/O SQ addresses. This results in failures on systems where a 1:1
      outbound mapping is not used (for example, Broadcom iProc SoCs),
      because the CMB BAR will be programmed with the PCI bus address while
      the NVMe PCI EP will try to access the CMB using the DMA address.

      To have the CMB working on systems without a 1:1 outbound mapping,
      program the PCI bus address for the I/O SQs instead of the DMA address.
      This approach works on systems with and without a 1:1 outbound mapping.
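
      A hedged sketch of the idea in nvme_map_cmb(); variable and field names
      are assumptions, and the offset computation from CMBLOC/CMBSZ is
      omitted:

      	struct pci_dev *pdev = to_pci_dev(dev->dev);
      	int bar = NVME_CMB_BIR(dev->cmbloc);

      	/* store the address the *device* must use (PCI bus address)
      	 * rather than the CPU-side resource address; the two differ on
      	 * systems that remap BARs between host and device */
      	dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;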
      
      Based on a report and previous patch from Abhishek Shah.
      
      Fixes: 8ffaadf7 ("NVMe: Use CMB for the IO SQes if available")
      Cc: stable@vger.kernel.org
      Reported-by: Abhishek Shah <abhishek.shah@broadcom.com>
      Tested-by: Abhishek Shah <abhishek.shah@broadcom.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 03 Oct, 2017 4 commits
    • blk-mq-debugfs: fix device sched directory for default scheduler · 70e62f4b
      Omar Sandoval authored
      In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
      directories if a default scheduler was already configured by
      blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
      the same for the device-wide sched directory. Fix it.
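
      A hedged sketch of the fix in blk_mq_debugfs_register(), assuming the
      helper names of that tree:

      	/* the per-hctx sched dirs are handled in the hctx loop below;
      	 * also register the device-wide sched dir if an elevator was
      	 * set up before debugfs registration ran */
      	if (q->elevator && !q->sched_debugfs_dir)
      		blk_mq_debugfs_register_sched(q);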
      
      Fixes: d332ce09 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • null_blk: change configfs dependency to select · 6cd1a6fe
      Jens Axboe authored
      A recent commit made null_blk depend on configfs, which is kind of
      annoying since you now have to find that dependency and enable it as
      well. I discovered this because null_blk was no longer available on a
      box I needed it on for debugging: the option got dropped when the
      config was updated after the configfs change was merged.
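
      Sketched as a diff against the null_blk entry in drivers/block/Kconfig
      (the menu text is an assumption):

       config BLK_DEV_NULL_BLK
       	tristate "Null test block driver"
      -	depends on CONFIGFS_FS
      +	select CONFIGFS_FS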
      
      Fixes: 3bf2bd20 ("nullb: add configfs interface")
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Joseph Qi authored
      There is a case that will lead to an io stall, described as follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      subtest1 and subtest2 each already have 32 queued bios.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still have 16 queued bios left;
      6) tg=/, try to schedule the next dispatch, but since disptime is now
      (updated in tg_update_disptime with wait=0), the pending timer is in
      fact not scheduled;
      7) throtl_upgrade_state thus dispatches 32 queued bios in total, with
      32 left over: test1 and test2 each still have 16 queued bios;
      8) throtl_pending_timer_fn sees the leftover bios but can do nothing,
      because throtl_select_dispatch returns 0 and test1/test2 have no
      pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force-schedule the dispatch if there are pending children (sketched
      below).
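
      A hedged sketch of the one-line fix in throtl_upgrade_state(), shown as
      a diff with the surrounding lines reconstructed from the description:

       		tg->disptime = jiffies - 1;
       		throtl_select_dispatch(sq);
      -		throtl_schedule_next_dispatch(sq, false);
      +		throtl_schedule_next_dispatch(sq, true);
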
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • MAINTAINERS: update list for NBD · 38b249bc
      Wouter Verhelst authored
      nbd-general@sourceforge.net becomes nbd@other.debian.org, because
      sourceforge is just a spamtrap these days.
      Signed-off-by: Wouter Verhelst <w@uter.be>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 02 Oct, 2017 1 commit
    • nbd: fix -ERESTARTSYS handling · 6e60a3bb
      Josef Bacik authored
      After Christoph's conversion, whenever we got -ERESTARTSYS from sending
      our packets, the BLK_STS_RESOURCE we meant to return was turned into
      BLK_STS_OK, which means we'd never requeue and just hang. We really
      need to return the right value to the upper layer.
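
      A hedged sketch of the fix in nbd_queue_rq(); the exact surrounding
      code is an assumption based on the description:

      	ret = nbd_handle_cmd(cmd, hctx->queue_num);
      	/* nbd_handle_cmd() may hand back a blk_status_t such as
      	 * BLK_STS_RESOURCE on the -ERESTARTSYS path; don't collapse
      	 * every non-negative value into BLK_STS_OK, or the requeue
      	 * never happens */
      	if (ret < 0)
      		ret = BLK_STS_IOERR;
      	else if (!ret)
      		ret = BLK_STS_OK;
      	return ret;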
      
      Fixes: fc17b653 ("blk-mq: switch ->queue_rq return value to blk_status_t")
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 01 Oct, 2017 1 commit
  6. 27 Sep, 2017 1 commit
    • bcache: use llist_for_each_entry_safe() in __closure_wake_up() · a5f3d8a5
      Coly Li authored
      Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing
      llist API") replaced the following while loop with
      llist_for_each_entry():
      
      -
      -	while (reverse) {
      -		cl = container_of(reverse, struct closure, list);
      -		reverse = llist_next(reverse);
      -
      +	llist_for_each_entry(cl, reverse, list) {
       		closure_set_waiting(cl, 0);
       		closure_sub(cl, CLOSURE_WAITING + 1);
       	}
      
      This modification introduces a potential race by iterating a corrupted
      list. Here is how it happens.

      In the above modification, closure_sub() may wake up a process that is
      waiting on the reverse list. If this process decides to wait again by
      calling closure_wait(), its cl->list will be added to another wait
      list. When llist_for_each_entry() then advances to the next node, it
      will traverse the new wait list added in closure_wait(), not the
      original reverse list in __closure_wake_up(). This is more likely to
      happen on a UP machine, because the woken process may preempt the
      process that woke it.
      
      Using llist_for_each_entry_safe() fixes the issue: the safe version
      fetches the next node before waking up a process, so the saved copy of
      the next node keeps the iteration on the original reverse list, as
      sketched below.
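
      A minimal sketch of the fixed loop in __closure_wake_up(), with 't'
      holding the prefetched next node:

      	struct closure *cl, *t;

      	/* fetch the next node before closure_sub() can wake the waiter
      	 * and let it re-add itself to a different wait list */
      	llist_for_each_entry_safe(cl, t, reverse, list) {
      		closure_set_waiting(cl, 0);
      		closure_sub(cl, CLOSURE_WAITING + 1);
      	}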
      
      Fixes: 09b3efec ("bcache: Don't reinvent the wheel but use existing llist API")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 26 Sep, 2017 4 commits
  8. 25 Sep, 2017 26 commits