1. 21 Jul, 2016 4 commits
    • Tahsin Erdogan's avatar
      block: do not merge requests without consulting with io scheduler · 72ef799b
      Tahsin Erdogan authored
      Before merging a bio into an existing request, io scheduler is called to
      get its approval first. However, the requests that come from a plug
      flush may get merged by block layer without consulting with io
      scheduler.
      
      In case of CFQ, this can cause fairness problems. For instance, if a
      request gets merged into a low weight cgroup's request, high weight cgroup
      now will depend on low weight cgroup to get scheduled. If high weigt cgroup
      needs that io request to complete before submitting more requests, then it
      will also lose its timeslice.
      
      Following script demonstrates the problem. Group g1 has a low weight, g2
      and g3 have equal high weights but g2's requests are adjacent to g1's
      requests so they are subject to merging. Due to these merges, g2 gets
      poor disk time allocation.
      
      cat > cfq-merge-repro.sh << "EOF"
      #!/bin/bash
      set -e
      
      IO_ROOT=/mnt-cgroup/io
      
      mkdir -p $IO_ROOT
      
      if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
      fi
      
      cd $IO_ROOT
      
      for i in g1 g2 g3; do
        if [ -d $i ]; then
          rmdir $i
        fi
      done
      
      mkdir g1 && echo 10 > g1/blkio.weight
      mkdir g2 && echo 495 > g2/blkio.weight
      mkdir g3 && echo 495 > g3/blkio.weight
      
      RUNTIME=10
      
      (echo $BASHPID > g1/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=0k &> /dev/null)&
      
      (echo $BASHPID > g2/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=64k &> /dev/null)&
      
      (echo $BASHPID > g3/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=256k &> /dev/null)&
      
      sleep $((RUNTIME+1))
      
      for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
      done
      
      EOF
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 162
      ---- g2 ----
      8:16 165
      ---- g3 ----
      8:16 686
      
      After applying the patch:
      
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 90
      ---- g2 ----
      8:16 445
      ---- g3 ----
      8:16 471
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      72ef799b
    • Bart Van Assche's avatar
      block: Fix spelling in a source code comment · 68bdf1ac
      Bart Van Assche authored
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      68bdf1ac
    • Yigal Korman's avatar
      block: expose QUEUE_FLAG_DAX in sysfs · ea6ca600
      Yigal Korman authored
      Provides the ability to identify DAX enabled devices in userspace.
      Signed-off-by: default avatarYigal Korman <yigal@plexistor.com>
      Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ea6ca600
    • Toshi Kani's avatar
      block: add QUEUE_FLAG_DAX for devices to advertise their DAX support · 163d4baa
      Toshi Kani authored
      Currently, presence of direct_access() in block_device_operations
      indicates support of DAX on its block device.  Because
      block_device_operations is instantiated with 'const', this DAX
      capablity may not be enabled conditinally.
      
      In preparation for supporting DAX to device-mapper devices, add
      QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
      support.  This will allow to set the DAX capability based on how
      mapped device is composed.
      Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <linux-s390@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      163d4baa
  2. 18 Jul, 2016 1 commit
  3. 13 Jul, 2016 1 commit
  4. 28 Jun, 2016 5 commits
    • Masanari Iida's avatar
      Doc: block: Fix a typo in queue-sysfs.txt · 141fd28c
      Masanari Iida authored
      This patch fix a spelling typo found in queue-sysfs.txt.
      Signed-off-by: default avatarMasanari Iida <standby24x7@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      141fd28c
    • Jan Kara's avatar
      cfq-iosched: Charge at least 1 jiffie instead of 1 ns · 0b31c10c
      Jan Kara authored
      Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
      could result in charging just 1 ns to a cgroup submitting IO instead of 1
      jiffie we always charged before. It is arguable what is the right amount
      to change but for now lets retain the old behavior of always charging at
      least one jiffie.
      
      Fixes: 9a7f38c4Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0b31c10c
    • Jan Kara's avatar
      cfq-iosched: Fix regression in bonnie++ rewrite performance · 149321a6
      Jan Kara authored
      Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
      broke the condition for detecting starved sync IO in
      cfq_completed_request() because rq->start_time remained in jiffies but
      we compared it with nanosecond values. This manifested as a regression
      in bonnie++ rewrite performance because we always ended up considering
      sync IO starved and thus never increased async IO queue depth.
      
      Since rq->start_time is used in a lot of places, converting it to ns
      values would be non-trivial. So just revert the condition in CFQ to use
      comparison with jiffies. This will lead to suboptimal results if
      cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are
      relatively far from that with the storage used with CFQ (the default
      value is 128 ms).
      
      Fixes: 9a7f38c4Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      149321a6
    • Jan Kara's avatar
      cfq-iosched: Convert slice_resid from u64 to s64 · 93fdf147
      Jan Kara authored
      slice_resid can be both positive and negative. Commit 9a7f38c4
      (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
      long to u64. Although this did not introduce any functional regression
      (the operations just overflow and the result was fine), it is certainly
      wrong and could cause issues in future. So convert the type to more
      appropriate s64.
      
      Fixes: 9a7f38c4Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      93fdf147
    • Jan Kara's avatar
      block: Convert fifo_time from ulong to u64 · 9828c2c6
      Jan Kara authored
      Currently rq->fifo_time is unsigned long but CFQ stores nanosecond
      timestamp in it which would overflow on 32-bit archs. Convert it to u64
      to avoid the overflow. Since the rq->fifo_time is unioned with struct
      call_single_data(), this does not change the size of struct request in
      any way.
      
      We have to slightly fixup block/deadline-iosched.c so that comparison
      happens in the right types.
      
      Fixes: 9a7f38c4Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9828c2c6
  5. 17 Jun, 2016 1 commit
    • Arnd Bergmann's avatar
      blktrace: avoid using timespec · 59a37f8b
      Arnd Bergmann authored
      The blktrace code stores the current time in a 32-bit word in its
      user interface. This is a bad idea because 32-bit seconds overflow
      at some point.
      
      We probably have until 2106 before this one overflows, as it seems
      to use an 'unsigned' variable, but we should confirm that user
      space treats it the same way.
      
      Aside from this, we want to stop using 'struct timespec' here,
      so I'm adding a comment about the overflow and change the code
      to use timespec64 instead to make the loss of range more obvious.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      59a37f8b
  6. 14 Jun, 2016 3 commits
  7. 10 Jun, 2016 1 commit
  8. 09 Jun, 2016 10 commits
    • Jens Axboe's avatar
      cfq-iosched: temporarily boost queue priority for idle classes · b8269db4
      Jens Axboe authored
      If we're queuing REQ_PRIO IO and the task is running at an idle IO
      class, then temporarily boost the priority. This prevents livelocks
      due to priority inversion, when a low priority task is holding file
      system resources while attempting to do IO.
      
      An example of that is shown below. An ioniced idle task is holding
      the directory mutex, while a normal priority task is trying to do
      a directory lookup.
      
      [478381.198925] ------------[ cut here ]------------
      [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
      [478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
      [478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
      [478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
      [478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
      [478381.206874] Call Trace:
      [478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
      [478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
      [478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
      [478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
      [478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
      [478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
      [478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
      [478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
      [478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
      [478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
      [478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
      [478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
      [478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
      [478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
      [478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
      [478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
      [478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
      [478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
      [478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
      [478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
      [478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
      [478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
      [478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
      [478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
      [478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
      [478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
      [478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
      [478381.239126] ------------[ cut here ]------------
      [478381.239915] ------------[ cut here ]------------
      [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
      [478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
      [478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
      [478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
      [478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
      [478381.248252] Call Trace:
      [478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
      [478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
      [478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
      [478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
      [478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
      [478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
      [478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
      [478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
      [478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
      [478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
      [478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
      [478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
      [478381.264551] ------------[ cut here ]------------
      [478381.265377] ------------[ cut here ]------------
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      b8269db4
    • Ming Lei's avatar
      block: drbd: avoid to use BIO_MAX_SIZE · 8bf223c2
      Ming Lei authored
      Use BIO_MAX_PAGES instead and we will remove BIO_MAX_SIZE.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8bf223c2
    • Ming Lei's avatar
      block: bio: remove BIO_MAX_SECTORS · 30ac4607
      Ming Lei authored
      No one need this macro, so remove it. The motivation is for supporting
      multipage bvecs, in which we only know what the max count of bvecs is
      supported in the bio, instead of max size or max sectors.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      30ac4607
    • Ming Lei's avatar
      fs: xfs: replace BIO_MAX_SECTORS with BIO_MAX_PAGES · c908e380
      Ming Lei authored
      BIO_MAX_PAGES is used as maximum count of bvecs, so
      replace BIO_MAX_SECTORS with BIO_MAX_PAGES since
      BIO_MAX_SECTORS is to be removed.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c908e380
    • Ming Lei's avatar
      iov_iter: use bvec iterator to implement iterate_bvec() · 1bdc76ae
      Ming Lei authored
      bvec has one native/mature iterator for long time, so not
      necessary to use the reinvented wheel for iterating bvecs
      in lib/iov_iter.c.
      
      Two ITER_BVEC test cases are run:
      	- xfstest(-g auto) on loop dio/aio, no regression found
      	- swap file works well under extreme stress(stress-ng --all 64 -t
      	  800 -v), and lots of OOMs are triggerd, and the whole
      	system still survives
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      1bdc76ae
    • Ming Lei's avatar
      block: mark 1st parameter of bvec_iter_advance as const · 80f162ff
      Ming Lei authored
      bvec_iter_advance() only writes the parameter of iterator,
      so the base address of bvec can be marked as const safely.
      
      Without the change, we can see compiling warning in the
      following patch for implementing iterate_bvec(): lib/iov_iter.c
      with bvec iterator.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      80f162ff
    • Ming Lei's avatar
      block: move two bvec structure into bvec.h · 0781e79e
      Ming Lei authored
      This patch moves 'struct bio_vec' and 'struct bvec_iter'
      into 'include/linux/bvec.h', then always include this header
      into 'include/linux/blk_types.h'.
      
      With this change, both 'struct bvec_iter' and bvec iterator
      helpers don't depend on CONFIG_BLOCK any more, then we can
      use bvec iterator to implement iterate_bvec(): lib/iov_iter.c.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0781e79e
    • Ming Lei's avatar
      block: move bvec iterator into include/linux/bvec.h · 8fc55455
      Ming Lei authored
      bvec iterator helpers should be used to implement by
      iterate_bvec():lib/iov_iter.c too, and move them into
      one header, so that we can keep bvec iterator header
      out of CONFIG_BLOCK. Then we can remove the reinventing
      of wheel in iterate_bvec().
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8fc55455
    • Omar Sandoval's avatar
      blk-mq: actually hook up defer list when running requests · 52b9c330
      Omar Sandoval authored
      If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
      over the rest of the loop body. However, dptr is assigned later in the
      loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
      want it for.
      
      NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
      in-tree driver, but if the code's going to be there, it might as well
      work.
      
      Fixes: 74c45052 ("blk-mq: add a 'list' parameter to ->queue_rq()")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      52b9c330
    • Christoph Hellwig's avatar
      block: better packing for struct request · ca93e453
      Christoph Hellwig authored
      Keep the 32-bit CPU and cmd_type flags together to avoid holes on 64-bit
      architectures.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ca93e453
  9. 08 Jun, 2016 4 commits
  10. 07 Jun, 2016 10 commits