1. 17 Nov, 2016 5 commits
  2. 16 Nov, 2016 3 commits
  3. 15 Nov, 2016 2 commits
  4. 14 Nov, 2016 3 commits
  5. 11 Nov, 2016 6 commits
  6. 10 Nov, 2016 8 commits
    • block: hook up writeback throttling · 87760e5e
      Jens Axboe authored
      Enable throttling of buffered writeback to make it a lot smoother,
      with far less impact on other system activity. Background writeback
      should be, by definition, background activity. The fact that we flush
      huge bundles of it at a time means that it can have a heavy impact on
      foreground workloads, which isn't ideal. We can't easily limit the
      sizes of the writes that we do, since that would impact file system
      layout in the presence of delayed allocation. So just throttle back
      buffered writeback, unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration from the
      CoDel network scheduling algorithm. Like CoDel, blk-wb monitors the
      minimum latencies of requests over a window of time. If, within that
      window, the minimum latency of any request exceeds a given target,
      then a scale count is incremented and the queue depth is shrunk. The
      next monitoring window is shrunk accordingly. Unlike CoDel, if we hit
      a window that exhibits good behavior, we simply decrement the scale
      count and re-calculate the limits for that scale value. This prevents
      us from oscillating between a close-to-ideal value and the maximum
      all the time, instead remaining in the windows where we get good
      behavior.
      
      Unlike CoDel, blk-wb allows the scale count to go negative. This
      happens if we primarily have writes going on. Unlike positive scale
      counts, this doesn't change the size of the monitoring window. When
      the heavy writers finish, blk-wb quickly snaps back to its stable
      state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to be met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed by CFQ and that have
      a non-root block cgroup attached: if a proportional share setup exists
      on a particular disk, wbt throttling would interfere with it. There is
      no strong need for wbt in that case, since we rely on CFQ to provide
      that control for us.
      Signed-off-by: Jens Axboe <axboe@fb.com>
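      To make the algorithm above concrete, here is a minimal user-space C
      sketch of the scale-count idea: a missed latency target increments the
      scale count and shrinks the allowed writeback depth, a good window
      decrements it, and the count may go negative. All names, thresholds and
      the halving rule are invented for the example; this is not the kernel's
      blk-wbt code.

      #include <stdio.h>

      /* Toy model of the wbt-style scale count: positive steps shrink the
       * depth granted to background writeback, negative steps are allowed
       * for write-heavy periods and leave the default depth alone here. */
      struct toy_wbt {
              int scale_step;         /* >0: throttled, <0: write-heavy, 0: steady */
              int max_depth;          /* current depth for background writeback */
      };

      static void recalc_depth(struct toy_wbt *wbt, int base_depth)
      {
              int depth = base_depth;
              int i;

              for (i = 0; i < wbt->scale_step; i++)   /* each bad step halves it */
                      depth = depth > 1 ? depth / 2 : 1;

              wbt->max_depth = depth;
      }

      static void window_done(struct toy_wbt *wbt, long min_lat_usec,
                              long target_usec, int base_depth)
      {
              if (min_lat_usec > target_usec)
                      wbt->scale_step++;      /* bad window: throttle harder */
              else
                      wbt->scale_step--;      /* good window: step back, may go negative */

              recalc_depth(wbt, base_depth);
              printf("step=%d depth=%d\n", wbt->scale_step, wbt->max_depth);
      }

      int main(void)
      {
              struct toy_wbt wbt = { .scale_step = 0, .max_depth = 32 };
              long min_lat[] = { 1200, 4800, 5100, 900, 800, 700 };
              unsigned int i;

              for (i = 0; i < sizeof(min_lat) / sizeof(min_lat[0]); i++)
                      window_done(&wbt, min_lat[i], 2000 /* target usec */, 32);
              return 0;
      }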
    • blk-wbt: add general throttling mechanism · e34cbd30
      Jens Axboe authored
      We can hook this up to the block layer, to help throttle buffered
      writes.
      
      wbt registers a few trace points that can be used to track what is
      happening in the system:
      
      wbt_lat: 259:0: latency 2446318
      wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
                     wmean=518866, wmin=15522, wmax=5330353, wsamples=57
      wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
      
      This shows a sync issue event (wbt_lat) that exceeded its latency
      target. wbt_stat dumps the current read/write stats for that window,
      and wbt_step shows a step-down event where we now scale back writes.
      Each trace includes the device, 259:0 in this case.
      Signed-off-by: Jens Axboe <axboe@fb.com>
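      For readers unfamiliar with the wbt_stat line, the following
      self-contained C sketch shows the kind of per-window mean/min/max/sample
      bookkeeping such a line reports. The structure and names are invented
      for the example, not taken from the kernel.

      #include <stdio.h>
      #include <limits.h>

      /* Per-window latency statistics, similar in spirit to a wbt_stat line. */
      struct lat_stat {
              long long sum;
              long min, max;
              unsigned long samples;
      };

      static void stat_init(struct lat_stat *s)
      {
              s->sum = 0;
              s->min = LONG_MAX;
              s->max = 0;
              s->samples = 0;
      }

      static void stat_add(struct lat_stat *s, long lat)
      {
              s->sum += lat;
              if (lat < s->min)
                      s->min = lat;
              if (lat > s->max)
                      s->max = lat;
              s->samples++;
      }

      static void stat_print(const char *p, const struct lat_stat *s)
      {
              if (!s->samples)
                      return;
              printf("%smean=%lld, %smin=%ld, %smax=%ld, %ssamples=%lu\n",
                     p, s->sum / (long long)s->samples, p, s->min,
                     p, s->max, p, s->samples);
      }

      int main(void)
      {
              struct lat_stat w;
              long lat[] = { 15522, 518866, 5330353 };
              unsigned int i;

              stat_init(&w);
              for (i = 0; i < sizeof(lat) / sizeof(lat[0]); i++)
                      stat_add(&w, lat[i]);
              stat_print("w", &w);    /* prints a "wmean=..., wmin=..." style line */
              return 0;
      }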
    • block: add scalable completion tracking of requests · cf43e6be
      Jens Axboe authored
      For legacy block, we simply track the completion stats in the request
      queue. For blk-mq, we track them on a per-software-queue basis, which
      we can then sum up through the hardware queues and finally into a
      per-device state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users can turn it on by setting QUEUE_FLAG_STATS in the queue flags.
      We currently don't turn it on when someone merely reads any of the
      stats files; that is something we could add as well.
      Signed-off-by: Jens Axboe <axboe@fb.com>
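      As a rough sketch of the aggregation described above
      (per-software-queue buckets summed into one device-wide view), consider
      the following self-contained C example; the types and names are
      invented and greatly simplified.

      #include <stdio.h>

      /* Invented, simplified stand-in for per software (per-CPU) queue stat
       * buckets that get summed into a single device-wide view. */
      struct toy_stat {
              unsigned long nr_samples;
              unsigned long long total_ns;
      };

      static void toy_stat_sum(struct toy_stat *dst, const struct toy_stat *src)
      {
              dst->nr_samples += src->nr_samples;
              dst->total_ns += src->total_ns;
      }

      int main(void)
      {
              struct toy_stat sw[4] = {       /* one bucket per software queue */
                      { 10, 12000000ULL }, { 7, 9100000ULL },
                      { 0, 0ULL },         { 22, 30500000ULL },
              };
              struct toy_stat dev = { 0, 0ULL };
              unsigned int i;

              for (i = 0; i < 4; i++)
                      toy_stat_sum(&dev, &sw[i]);

              if (dev.nr_samples)
                      printf("samples=%lu mean=%lluns\n", dev.nr_samples,
                             dev.total_ns / dev.nr_samples);
              return 0;
      }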
    • block: cfq_cpd_alloc() should use @gfp · ebc4ff66
      Tejun Heo authored
      cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
      incorrectly hard coding GFP_KERNEL instead of using the mask specified
      through the @gfp parameter.  This currently doesn't cause any actual
      issues because all current callers specify GFP_KERNEL.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: e4a9bde9 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
      Signed-off-by: Jens Axboe <axboe@fb.com>
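      The bug class here is generic enough to show outside the kernel: an
      allocation callback is handed flags by its caller and must pass them
      through rather than hardcoding a default. The user-space C sketch below
      uses invented names and flags; it is an analogy, not the cfq code.

      #include <stdlib.h>
      #include <string.h>

      #define ALLOC_ZERO      0x1     /* caller wants zeroed memory */

      /* Correct: honors the flags the caller passed in. */
      static void *buf_alloc(size_t size, unsigned int flags)
      {
              void *p = malloc(size);

              if (p && (flags & ALLOC_ZERO))
                      memset(p, 0, size);
              return p;
      }

      /* Buggy pattern: silently drops @flags, like hardcoding GFP_KERNEL. */
      static void *buf_alloc_buggy(size_t size, unsigned int flags)
      {
              (void)flags;
              return malloc(size);
      }

      int main(void)
      {
              void *a = buf_alloc(64, ALLOC_ZERO);            /* zeroed */
              void *b = buf_alloc_buggy(64, ALLOC_ZERO);      /* not zeroed: latent bug */

              free(a);
              free(b);
              return 0;
      }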
    • nvme: don't pass the full CQE to nvme_complete_async_event · 7bf58533
      Christoph Hellwig authored
      We only need the status and result fields, and passing them explicitly
      makes life a lot easier for the Fibre Channel transport which doesn't
      have a full CQE for the fast path case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: introduce struct nvme_request · d49187e9
      Christoph Hellwig authored
      This adds a shared per-request structure for all NVMe I/O.  This
      structure is embedded as the first member in every NVMe transport
      driver's request private data, and allows implementing common
      functionality across the drivers.
      
      The first use is to replace the current abuse of the SCSI command
      passthrough fields in struct request for the NVMe command passthrough,
      but it will grow more fields to allow implementing things like common
      abort handlers in the future.
      
      The passthrough commands are handled by having a pointer to the SQE
      (struct nvme_command) in struct nvme_request, and the union of the
      possible result fields, which had to be turned from an anonymous
      into a named union for that purpose.  This avoids having to pass
      a reference to a full CQE around and thus makes checking the result
      a lot more lightweight.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
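      The "shared structure embedded as the first member of each driver's
      private data" pattern can be sketched in plain user-space C as below;
      the types are invented for illustration and are not the actual nvme
      definitions.

      #include <stdio.h>

      struct shared_req {                     /* stands in for the shared structure */
              int status;
              unsigned long long result;
      };

      struct pci_driver_req {                 /* one transport's private request data */
              struct shared_req req;          /* must remain the first member */
              int pci_specific;
      };

      struct fabric_driver_req {              /* another transport's private data */
              struct shared_req req;          /* must remain the first member */
              int retries;
      };

      /* Common code only needs the shared part... */
      static void complete_common(struct shared_req *r, int status,
                                  unsigned long long result)
      {
              r->status = status;
              r->result = result;
      }

      int main(void)
      {
              struct pci_driver_req p = { { 0, 0 }, 42 };
              struct fabric_driver_req f = { { 0, 0 }, 3 };

              /* ...and since it is the first member, &x.req is all it takes. */
              complete_common(&p.req, 0, 0x1234);
              complete_common(&f.req, 0, 0x5678);

              printf("pci status=%d, fabric result=%llx\n",
                     p.req.status, f.req.result);
              return 0;
      }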
    • skd: fix function prototype · 41c9499b
      Arnd Bergmann authored
      Building with W=1 shows a harmless warning for the skd driver:
      
      drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration]
      
      This changes the prototype to the expected formatting.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
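      The warning itself is easy to reproduce outside the kernel; this tiny
      standalone C example (not the skd code) triggers gcc's
      -Wold-style-declaration on the first function and not on the second.

      /* Compile with: gcc -Wold-style-declaration -c example.c */

      int static bad_helper(void)     /* warns: 'static' is not at beginning */
      {
              return 0;
      }

      static int good_helper(void)    /* expected formatting, no warning */
      {
              return 0;
      }

      int main(void)
      {
              return bad_helper() + good_helper();
      }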
    • skd: fix msix error handling · 3bc8492f
      Arnd Bergmann authored
      As reported by gcc -Wmaybe-uninitialized, the cleanup path for
      skd_acquire_msix tries to free the already allocated msi-x vectors
      in reverse order, but the index variable may not have been
      used yet:
      
      drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
      drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This changes the failure path to skip releasing the interrupts
      if we have not started requesting them yet.
      
      Fixes: 180b0ae7 ("skd: use pci_alloc_irq_vectors")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
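      The shape of the fix can be illustrated with a self-contained C sketch
      (invented names, not the skd code): structure the error path so the
      reverse-order release loop only ever walks over entries that were
      actually acquired, and never runs with an index that was never set.

      #include <stdio.h>
      #include <stdlib.h>

      #define NVEC 4

      static int acquire_one(int idx)
      {
              return idx < 3 ? 0 : -1;        /* pretend the 4th acquisition fails */
      }

      static void release_one(int idx)
      {
              printf("released %d\n", idx);
      }

      static int acquire_all(void)
      {
              int i, rc = 0;

              for (i = 0; i < NVEC; i++) {
                      rc = acquire_one(i);
                      if (rc)
                              goto err_undo;  /* i == number acquired so far */
              }
              return 0;

      err_undo:
              while (--i >= 0)                /* only releases what we got */
                      release_one(i);
              return rc;
      }

      int main(void)
      {
              return acquire_all() ? EXIT_FAILURE : EXIT_SUCCESS;
      }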
  7. 09 Nov, 2016 1 commit
  8. 08 Nov, 2016 3 commits
  9. 07 Nov, 2016 1 commit
    • pktcdvd: don't scribble over the bvec array · feebd568
      Christoph Hellwig authored
      Hi Peter, hi Jens,
      
      I've been looking over the multi page bio vec work again recently, and
      one of the stumbling blocks is raw biovec access in the pktcdvd.
      
      The first issue is that it directly sets up the page and offset pointers
      in the biovec just before calling bio_add_page.  As bio_add_page already
      does the setup it's trivial to just switch it to stack variables for the
      arguments.
      
      The second issue is the copy code in pkt_make_local_copy, which
      effectively is an open-coded version of bio_copy_data, except that it
      skips pages that are already the same in the source and destination.
      But if we look at the only caller, we just set up the bio using
      bio_add_page to point exactly at the page array that
      pkt_make_local_copy compares, so the pages will always be the same and
      we can just remove this function.
      
      Note that all of this is based on code inspection; I don't have any
      packet writing hardware myself.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
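      The first cleanup (let the add-page helper fill in the vector entry
      instead of writing page and offset into the array by hand) can be
      sketched with toy types in user-space C; these are not the bio/biovec
      definitions.

      #include <stdio.h>

      struct toy_vec {
              void *page;
              unsigned int len;
              unsigned int offset;
      };

      struct toy_bio {
              struct toy_vec vecs[8];
              unsigned short cnt;
      };

      /* The helper owns the vector setup; callers pass plain arguments. */
      static int toy_add_page(struct toy_bio *bio, void *page,
                              unsigned int len, unsigned int offset)
      {
              if (bio->cnt >= 8)
                      return 0;
              bio->vecs[bio->cnt].page = page;
              bio->vecs[bio->cnt].len = len;
              bio->vecs[bio->cnt].offset = offset;
              bio->cnt++;
              return len;
      }

      int main(void)
      {
              static char buf[4096];
              struct toy_bio bio = { .cnt = 0 };

              /* Callers never scribble over bio.vecs[] directly. */
              if (!toy_add_page(&bio, buf, sizeof(buf), 0))
                      return 1;
              printf("segments=%u\n", bio.cnt);
              return 0;
      }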
  10. 06 Nov, 2016 1 commit
    • blk-mq: Always schedule hctx->next_cpu · c02ebfdd
      Gabriel Krisman Bertazi authored
      Commit 0e87e58b ("blk-mq: improve warning for running a queue on the
      wrong CPU") attempts to avoid triggering the WARN_ON in
      __blk_mq_run_hw_queue when the expected CPU is dead.  The problem is
      that in the last batch execution before round robin,
      blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu
      to the next alive CPU in the mask, which triggers the WARN_ON despite
      the previous workaround.
      
      The following patch fixes this scenario by always scheduling the value
      in hctx->next_cpu.  This changes the moment when we round-robin the CPU
      running the hctx, but it really doesn't matter, since it still executes
      BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
      
      Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
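      The scheduling rule described above (always run on the stored next_cpu
      and round-robin only after a fixed batch) can be modeled in a few lines
      of user-space C; the names and the toy CPU mask are invented, and this
      is not the blk-mq code.

      #include <stdio.h>

      #define WORK_BATCH      8
      #define NR_CPUS         4

      struct toy_hctx {
              int next_cpu;
              int batch_left;
              unsigned int cpumask;           /* bit set = CPU allowed */
      };

      static int next_allowed_cpu(unsigned int mask, int cpu)
      {
              int i;

              for (i = 1; i <= NR_CPUS; i++) {
                      int c = (cpu + i) % NR_CPUS;

                      if (mask & (1u << c))
                              return c;
              }
              return cpu;
      }

      static int pick_cpu(struct toy_hctx *h)
      {
              int cpu = h->next_cpu;          /* always schedule the stored value */

              if (--h->batch_left <= 0) {     /* round-robin only after the batch */
                      h->next_cpu = next_allowed_cpu(h->cpumask, cpu);
                      h->batch_left = WORK_BATCH;
              }
              return cpu;
      }

      int main(void)
      {
              struct toy_hctx h = { .next_cpu = 0, .batch_left = WORK_BATCH,
                                    .cpumask = 0x5 /* CPUs 0 and 2 */ };
              int i;

              for (i = 0; i < 20; i++)
                      printf("%d ", pick_cpu(&h));
              printf("\n");
              return 0;
      }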
  11. 05 Nov, 2016 1 commit
    • block: add code to track actual device queue depth · d278d4a8
      Jens Axboe authored
      For blk-mq, ->nr_requests does track queue depth, at least at init
      time. But for the older queue paths, it's simply a soft setting.
      On top of that, it's generally larger than the hardware setting
      on purpose, to allow backup of requests for merging.
      
      Fill a hole in struct request_queue with a 'queue_depth' member that
      drivers can set to more closely inform the block layer of the real
      device queue depth.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
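      A toy version of the idea (a per-queue field the driver fills in so
      consumers can prefer the real hardware depth over the larger software
      limit) might look like the following user-space C; the structs and
      helper are invented, not the kernel's.

      #include <stdio.h>

      struct toy_queue {
              unsigned int nr_requests;       /* soft limit, kept larger for merging */
              unsigned int queue_depth;       /* what the device can actually take */
      };

      static void toy_set_queue_depth(struct toy_queue *q, unsigned int depth)
      {
              q->queue_depth = depth;         /* called by the driver */
      }

      static unsigned int toy_effective_depth(const struct toy_queue *q)
      {
              /* Prefer the driver-reported depth when it is known. */
              return q->queue_depth ? q->queue_depth : q->nr_requests;
      }

      int main(void)
      {
              struct toy_queue q = { .nr_requests = 128, .queue_depth = 0 };

              printf("before: %u\n", toy_effective_depth(&q));
              toy_set_queue_depth(&q, 31);    /* e.g. an NCQ-style depth */
              printf("after:  %u\n", toy_effective_depth(&q));
              return 0;
      }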
  12. 04 Nov, 2016 2 commits
    • blk-mq: immediately dispatch big size request · 600271d9
      Shaohua Li authored
      This is the corresponding part for blk-mq. A disk with multiple
      hardware queues doesn't need this, as we hold at most one request
      there.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: immediately dispatch big size request · 50d24c34
      Shaohua Li authored
      Currently the block plug holds up to 16 non-mergeable requests. This
      makes sense if the request size is small, e.g. to reduce lock
      contention. But if the request size is big enough, we don't need to
      worry about lock contention. Holding such a request makes no sense,
      and it lowers disk utilization.
      
      In practice, this improves throughput by 10% for my raid5 sequential
      write workload.
      
      The size (128k) is arbitrary right now, but it is enough to keep lock
      contention small. This could probably be more intelligent, e.g.
      checking the average size of the held requests, but since this is
      mainly for sequential IO, it's probably not worth it.
      
      V2: check the last request instead of the first request, so that as
      long as there is one big request we flush the plug.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
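      The plugging rule described above can be condensed into a short
      user-space C sketch: hold small requests up to the count limit, but
      flush as soon as the most recently queued request is "big" (128k here,
      matching the commit message). Names and structure are invented for the
      example.

      #include <stdio.h>
      #include <stdbool.h>

      #define PLUG_MAX_RQ     16
      #define BIG_RQ_BYTES    (128 * 1024)

      struct toy_plug {
              unsigned int nr_held;
      };

      static bool queue_to_plug(struct toy_plug *plug, unsigned int rq_bytes)
      {
              plug->nr_held++;

              /* Flush when the plug is full, or when the last request is big
               * enough that holding it only hurts device utilization. */
              if (plug->nr_held >= PLUG_MAX_RQ || rq_bytes >= BIG_RQ_BYTES) {
                      printf("flush after %u request(s)\n", plug->nr_held);
                      plug->nr_held = 0;
                      return true;
              }
              return false;
      }

      int main(void)
      {
              struct toy_plug plug = { 0 };

              queue_to_plug(&plug, 4096);             /* small: keep plugging */
              queue_to_plug(&plug, 8192);             /* small: keep plugging */
              queue_to_plug(&plug, 256 * 1024);       /* big: flush right away */
              return 0;
      }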
  13. 03 Nov, 2016 1 commit
  14. 02 Nov, 2016 3 commits
    • nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code · a6eaa884
      Bart Van Assche authored
      Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
      QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
      that became superfluous because of this change. Change
      blk_queue_stopped() tests into blk_mq_queue_stopped().
      
      This patch fixes a race condition: using queue_flag_clear_unlocked()
      is not safe if any other function that manipulates the queue flags
      can be called concurrently, e.g. blk_cleanup_queue().
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: Fix a race condition related to stopping queues · 3174dd33
      Bart Van Assche authored
      Avoid that nvme_queue_rq() is still running when nvme_stop_queues()
      returns.
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • dm: Fix a race condition related to stopping and starting queues · 7b17c2f7
      Bart Van Assche authored
      Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
      calls have stopped before setting the "queue stopped" flag. This
      allows removing the "queue stopped" test from dm_mq_queue_rq() and
      dm_mq_requeue_request(). This patch fixes a race condition:
      dm_mq_queue_rq() is called without holding the queue lock, and hence
      BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is in
      progress. This patch prevents the following hang from occurring
      sporadically when using dm-mq:
      
      INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
      Call Trace:
       [<ffffffff8161f397>] schedule+0x37/0x90
       [<ffffffff816239ef>] schedule_timeout+0x27f/0x470
       [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110
       [<ffffffff8161fb36>] bit_wait_io+0x16/0x60
       [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0
       [<ffffffff8114fe69>] __lock_page+0xb9/0xc0
       [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760
       [<ffffffff81166120>] truncate_inode_pages+0x10/0x20
       [<ffffffff81212a20>] kill_bdev+0x30/0x40
       [<ffffffff81213d41>] __blkdev_put+0x71/0x360
       [<ffffffff81214079>] blkdev_put+0x49/0x170
       [<ffffffff812141c0>] blkdev_close+0x20/0x30
       [<ffffffff811d48e8>] __fput+0xe8/0x1f0
       [<ffffffff811d4a29>] ____fput+0x9/0x10
       [<ffffffff810842d3>] task_work_run+0x83/0xb0
       [<ffffffff8106606e>] do_exit+0x3ee/0xc40
       [<ffffffff8106694b>] do_group_exit+0x4b/0xc0
       [<ffffffff81073d9a>] get_signal+0x2ca/0x940
       [<ffffffff8101bf43>] do_signal+0x23/0x660
       [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0
       [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0
       [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>