1. 23 Sep, 2016 1 commit
    • cfq: fix starvation of asynchronous writes · 3932a86b
      Glauber Costa authored
      While debugging timeouts in my application workload (ScyllaDB), I observed
      calls to open() taking a long time, ranging from 2 seconds - already enough to
      time out my application - to more than 30 seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumstances, and that is confirmed by the following backtrace
      taken with the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits
      very high "Dispatch wait" times, in the tens-of-seconds range and consistent
      with the open() times I saw in that run.
      
      Still from the blktrace output, after searching a bit we can identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such a situation before we have the chance to execute anything (as seen in
      the trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. The flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired).
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
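      As a hedged sketch of the shape of the change (cfq_check_fifo(),
      cfq_mark_cfqq_must_dispatch() and cfq_cfqq_must_dispatch() are the existing
      FIFO and per-queue flag helpers in cfq-iosched.c; the real hunks differ in
      detail):
      
          /* In the dispatch path: if the FIFO holds an expired request,
           * pin the queue so cfq_may_dispatch() cannot bail on it. */
          rq = cfq_check_fifo(cfqq);
          if (rq)
                  cfq_mark_cfqq_must_dispatch(cfqq);
      
          /* In cfq_should_preempt(): a pinned queue may only be preempted
           * by RT requests, which are checked before this point. */
          if (cfq_cfqq_must_dispatch(cfqq))
                  return false;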
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: Glauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: Glauber Costa <glauber@scylladb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 22 Sep, 2016 4 commits
  3. 21 Sep, 2016 7 commits
    • lightnvm: propagate device_add() error code · 1e3aeae4
      Arnd Bergmann authored
      device_add() may fail, and all callers are supposed to check the
      return value, but one new user in lightnvm doesn't:
      
      drivers/lightnvm/sysfs.c: In function 'nvm_sysfs_register_dev':
      drivers/lightnvm/sysfs.c:184:2: error: ignoring return value of 'device_add',
        declared with attribute warn_unused_result [-Werror=unused-result]
      
      This changes the caller to propagate any error codes, which avoids
      the warning.
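      
      The shape of the fix, as a hedged sketch (the function is the one named in
      the warning; the embedded struct device member is assumed to be named dev,
      and the setup code is elided):
      
          static int nvm_sysfs_register_dev(struct nvm_dev *ndev)
          {
                  /* ... device_initialize()/name setup elided ... */
      
                  /* Return device_add()'s result instead of ignoring it,
                   * so callers can unwind on failure. */
                  return device_add(&ndev->dev);
          }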
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 38c9e260b9f9 ("lightnvm: expose device geometry through sysfs")
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • lightnvm: expose device geometry through sysfs · 40267efd
      Simon A. F. Lund authored
      For a host to access an Open-Channel SSD, it has to know its geometry,
      so that it reads and writes within the appropriate device bounds.
      
      Currently, the geometry information is kept within the kernel, and not
      exported to user-space for consumption. This patch exposes the
      configuration through sysfs and enables user-space libraries, such as
      liblightnvm, to use the sysfs implementation to get the geometry of an
      Open-Channel SSD.
      
      The sysfs entries are stored within the device hierarchy, and can be
      found using the "lightnvm" device type.
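      
      Each entry boils down to an ordinary read-only device attribute. A minimal
      sketch of one of them, assuming a dev_to_nvm_dev() container-of helper and an
      illustrative field name for the channel count:
      
          static ssize_t num_channels_show(struct device *dev,
                                           struct device_attribute *attr, char *buf)
          {
                  struct nvm_dev *ndev = dev_to_nvm_dev(dev); /* assumed helper */
      
                  /* nr_chnls is an illustrative geometry field name */
                  return scnprintf(buf, PAGE_SIZE, "%u\n", ndev->nr_chnls);
          }
          static DEVICE_ATTR_RO(num_channels);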
      
      An example configuration looks like this:
      
      /sys/class/nvme/
      └── nvme0n1
         ├── capabilities: 3
         ├── device_mode: 1
         ├── erase_max: 1000000
         ├── erase_typ: 1000000
         ├── flash_media_type: 0
         ├── media_capabilities: 0x00000001
         ├── media_type: 0
         ├── multiplane: 0x00010101
         ├── num_blocks: 1022
         ├── num_channels: 1
         ├── num_luns: 4
         ├── num_pages: 64
         ├── num_planes: 1
         ├── page_size: 4096
         ├── prog_max: 100000
         ├── prog_typ: 100000
         ├── read_max: 10000
         ├── read_typ: 10000
         ├── sector_oob_size: 0
         ├── sector_size: 4096
         ├── media_manager: gennvm
         ├── ppa_format: 0x380830082808001010102008
         ├── vendor_opcode: 0
         ├── max_phys_secs: 64
         └── version: 1
      Signed-off-by: Simon A. F. Lund <slund@cnexlabs.com>
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • lightnvm: control life of nvm_dev in driver · b0b4e09c
      Matias Bjørling authored
      LightNVM-compatible device drivers do not have a method to expose
      LightNVM-specific sysfs entries.
      
      To enable LightNVM sysfs entries to be exposed, LightNVM device
      drivers require a struct device to attach them to. To allow both the
      actual device driver and the LightNVM sysfs entries to coexist, the device
      driver tracks the lifetime of the nvm_dev structure.
      
      This patch refactors NVMe and null_blk to handle the lifetime of struct
      nvm_dev, which eliminates the need for struct gendisk when a lightnvm
      compatible device is provided.
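      
      A hedged sketch of the driver side after the refactor (helper names and
      signatures as this series suggests; error handling condensed):
      
          /* Probe: the driver now allocates and registers the nvm_dev itself. */
          ndev = nvm_alloc_dev(node);
          if (!ndev)
                  return -ENOMEM;
      
          ret = nvm_register(ndev);
          if (ret) {
                  kfree(ndev);
                  return ret;
          }
      
          /* Teardown: the driver also controls when the nvm_dev goes away. */
          nvm_unregister(ndev);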
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: register device instead of disk · b21d5b30
      Matias Bjørling authored
      Enable devices without a gendisk instance to register themselves with blk-mq
      and expose the associated multi-queue sysfs entries.
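      
      Concretely, the blk-mq registration entry point now takes a struct device
      rather than a gendisk; a hedged sketch of a call site:
      
          /* Where blk_mq_register_disk(disk) was called before, the caller
           * can now hand in the underlying struct device directly. */
          ret = blk_mq_register_dev(dev, q);
          if (ret)
                  return ret;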
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • null_blk: refactor to support non-gendisk devices · 9ae2d0aa
      Matias Bjørling authored
      With LightNVM enabled devices, the gendisk structure is not exposed
      to the user. This hides the device driver specific sysfs entries, and
      prevents binding of LightNVM geometry information to the device.
      
      Refactor the device registration process, so that gendisk and
      non-gendisk devices are easily managed.
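      
      The registration path then reduces to a simple branch; a hedged sketch
      (helper names follow this refactor, use_lightnvm is null_blk's existing
      module parameter, the cleanup label is illustrative):
      
          if (use_lightnvm)
                  rv = null_nvm_register(nullb);     /* no gendisk involved */
          else
                  rv = null_gendisk_register(nullb); /* classic block device */
          if (rv)
                  goto out_cleanup_queues;           /* label illustrative */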
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: refactor namespaces to support non-gendisk devices · ac81bfa9
      Matias Bjørling authored
      With LightNVM enabled namespaces, the gendisk structure is not exposed
      to the user. This prevents LightNVM users from accessing the NVMe device
      driver specific sysfs entries, and LightNVM namespace geometry.
      
      Refactor the revalidation process, so that a namespace, instead of a
      gendisk, is revalidated. This later allows patches to wire the sysfs
      entries up to a non-gendisk namespace.
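      
      A hedged sketch of the split (signatures assumed):
      
          /* Revalidate at namespace granularity; no gendisk required. */
          static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id);
      
          /* The gendisk entry point becomes a thin wrapper that looks up
           * the namespace and forwards to nvme_revalidate_ns(). */
          static int nvme_revalidate_disk(struct gendisk *disk);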
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • lightnvm: NVM should depend on HAS_DMA · e105ddb4
      Geert Uytterhoeven authored
      If NO_DMA=y:
      
          drivers/built-in.o: In function `nvme_nvm_dev_dma_free':
          lightnvm.c:(.text+0x23df1a): undefined reference to `dma_pool_free'
          drivers/built-in.o: In function `nvme_nvm_dev_dma_alloc':
          lightnvm.c:(.text+0x23df38): undefined reference to `dma_pool_alloc'
          drivers/built-in.o: In function `nvme_nvm_destroy_dma_pool':
          lightnvm.c:(.text+0x23df4c): undefined reference to `dma_pool_destroy'
          drivers/built-in.o: In function `nvme_nvm_create_dma_pool':
          lightnvm.c:(.text+0x23df7e): undefined reference to `dma_pool_create'
      
      and
      
          ERROR: "dma_pool_destroy" [drivers/nvme/host/nvme-core.ko] undefined!
          ERROR: "dma_pool_free" [drivers/nvme/host/nvme-core.ko] undefined!
          ERROR: "dma_pool_alloc" [drivers/nvme/host/nvme-core.ko] undefined!
          ERROR: "dma_pool_create" [drivers/nvme/host/nvme-core.ko] undefined!
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 19 Sep, 2016 1 commit
  5. 17 Sep, 2016 7 commits
  6. 16 Sep, 2016 1 commit
    • blk-mq: account higher order dispatch · 703fd1c0
      Jens Axboe authored
      We currently account a '0' dispatch, and anything above that as long as it
      still falls within the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch
      more than that, we don't account it at all.
      
      Change the last bucket to be inclusive of anything above the range we
      track, and have the sysfs file reflect that by including a '+' in the
      output:
      
      $ cat /sys/block/nvme0n1/mq/0/dispatched
              0	1006
              1	20229
              2	1
              4	0
              8	0
             16	0
             32+	0
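      
      A hedged sketch of the clamping this implies (helper name as suggested by
      the change; BLK_MQ_MAX_DISPATCH_ORDER bounds the dispatched[] array):
      
          static inline unsigned int queued_to_index(unsigned int queued)
          {
                  if (!queued)
                          return 0;
      
                  /* Everything past the tracked range lands in the last,
                   * now-inclusive '32+' bucket. */
                  return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
          }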
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
  7. 14 Sep, 2016 12 commits
  8. 08 Sep, 2016 4 commits
  9. 29 Aug, 2016 3 commits
    • mtip32xx: mark symbols static where possible · 99e6b87e
      Baoyou Xie authored
      We get one warning when building the kernel with W=1:
      drivers/block/mtip32xx/mtip32xx.c:3689:6: warning: no previous prototype for
       'mtip_block_release' [-Wmissing-prototypes]
      
      In fact, this function is only used in the file in which it is defined and
      doesn't need a separate declaration; it can simply be made static.
      This patch marks it 'static'.
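      
      The fix, sketched (body elided):
      
          /* static gives the function internal linkage, so no separate
           * prototype is needed and -Wmissing-prototypes is satisfied. */
          static void mtip_block_release(struct gendisk *disk)
          {
                  /* ... release logic unchanged ... */
          }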
      Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: prefetch request in blk_mq_tag_to_rq() · 88c7b2b7
      Jens Axboe authored
      When drivers or the core calls this function, they usually
      dereference the request shortly thereafter. Prefetch the first
      cache line.
      
      Profiling IO workloads shows that this is the most common cache
      miss on the block side of things.
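      
      A hedged sketch of the idea (prefetch() comes from linux/prefetch.h; the
      lookup shape follows blk_mq_tag_to_rq()):
      
          struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
          {
                  if (tag < tags->nr_tags) {
                          /* Pull in the request's first cache line early;
                           * callers dereference it almost immediately. */
                          prefetch(tags->rqs[tag]);
                          return tags->rqs[tag];
                  }
      
                  return NULL;
          }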
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: improve layout of blk_mq_hw_ctx · 8d354f13
      Jens Axboe authored
      Various cache line optimizations:
      
      - Move delay_work towards the end. It's huge, and we don't use it
        a lot (only SCSI).
      
      - Move the atomic state into the same cacheline as the dispatch
        list and lock.
      
      - Rearrange a few members to pack it better.
      
      - Shrink the max-order for dispatch accounting from 10 to 7. This
        means that ->dispatched[] and ->run now take up their own
        cacheline.
      
      This shrinks struct blk_mq_hw_ctx down to 8 cachelines.
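      
      A hedged sketch of the layout technique (members abbreviated;
      ____cacheline_aligned_in_smp is the standard annotation that starts the
      group on its own cache line):
      
          struct blk_mq_hw_ctx {
                  /* Hot path: dispatch lock, dispatch list and atomic
                   * state packed into one cache line. */
                  struct {
                          spinlock_t              lock;
                          struct list_head        dispatch;
                          unsigned long           state;
                  } ____cacheline_aligned_in_smp;
      
                  /* ... cold members such as delay_work pushed to the end ... */
          };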
      Signed-off-by: Jens Axboe <axboe@fb.com>