1. 04 May, 2020 4 commits
  2. 30 Apr, 2020 4 commits
    • Tejun Heo's avatar
      iocost_monitor: drop string wrap around numbers when outputting json · 21f3cfea
      Tejun Heo authored
      Wrapping numbers in strings is used by some to work around bit-width issues in
      some enviroments. The problem isn't innate to json and the workaround seems to
      cause more integration problems than help. Let's drop the string wrapping.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      21f3cfea
    • Tejun Heo's avatar
      iocost_monitor: exit successfully if interval is zero · f4fe3ea6
      Tejun Heo authored
      This is to help external tools to decide whether iocost_monitor has all its
      requirements met or not based on the exit status of an -i0 run.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f4fe3ea6
    • Tejun Heo's avatar
      blk-iocost: account for IO size when testing latencies · cd006509
      Tejun Heo authored
      On each IO completion, iocost decides whether the IO met or missed its latency
      target. Currently, the targets are fixed numbers per IO type. While this can be
      good enough for loose latency targets way higher than typical completion
      latencies, the effect of IO size makes it difficult to tighten the latency
      target - a target adequate for 4k IOs might be too tight for 512k IOs and
      vice-versa.
      
      iocost already has all the necessary information to account for different IO
      sizes when testing whether the latency target is met as iocost can calculate the
      size vtime cost of a given IO. This patch updates the completion path to
      calculate the size vtime cost of the IO, deduct the nsec equivalent from the
      observed latency and use the adjusted value to decide whether the target is met.
      
      This makes latency targets independent from IO size and enables determining
      adequate latency targets with fixed size fio runs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cd006509
    • Tejun Heo's avatar
      blk-iocost: switch to fixed non-auto-decaying use_delay · 54c52e10
      Tejun Heo authored
      The use_delay mechanism was introduced by blk-iolatency to hold memory
      allocators accountable for the reclaim and other shared IOs they cause. The
      duration of the delay is dynamically balanced between iolatency increasing the
      value on each target miss and it auto-decaying as time passes and threads get
      delayed on it.
      
      While this works well for iolatency, iocost's control model isn't compatible
      with it. There is no repeated "violation" events which can be balanced against
      auto-decaying. iocost instead knows how much a given cgroup is over budget and
      wants to prevent that cgroup from issuing IOs while over budget. Until now,
      iocost has been adding the cost of force-issued IOs. However, this doesn't
      reflect the amount which is already over budget and is simply not enough to
      counter the auto-decaying allowing anon-memory leaking low priority cgroup to
      go over its alloted share of IOs.
      
      As auto-decaying doesn't make much sense for iocost, this patch introduces a
      different mode of operation for use_delay - when blkcg_set_delay() are used
      insted of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
      is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
      delay duration synchronized to the budget overage amount.
      
      With this change, iocost can effectively police cgroups which generate
      significant amount of force-issued IOs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      54c52e10
  3. 29 Apr, 2020 5 commits
  4. 25 Apr, 2020 4 commits
  5. 24 Apr, 2020 2 commits
  6. 22 Apr, 2020 5 commits
  7. 20 Apr, 2020 14 commits
    • Christoph Hellwig's avatar
      block: fold bdev_unhash_inode into invalidate_partition · 9bc5c397
      Christoph Hellwig authored
      invalidate_partition and bdev_unhash_inode are always paired, and
      invalidate_partition already does an icache lookup for the block device
      inode.  Piggy back on that to remove the inode from the hash.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9bc5c397
    • Christoph Hellwig's avatar
      block: mark invalidate_partition static · 02d33b67
      Christoph Hellwig authored
      invalidate_partition is only used in genhd.c, so mark it static.  Also
      drop the return value given that is is always ignored.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      02d33b67
    • Christoph Hellwig's avatar
      block: simplify block device syncing in bdev_del_partition · d5f3178e
      Christoph Hellwig authored
      We just checked a little above that the block device for the partition
      im busy.  That implies no file system is mounted, and thus the only
      thing in fsync_bdev that actually is used is sync_blockdev.  Just call
      sync_blockdev directly.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d5f3178e
    • Christoph Hellwig's avatar
      block: don't call invalidate_partition from blk_drop_partitions · e669c1da
      Christoph Hellwig authored
      Given that the device must not be busy, most of the calls from
      invalidate_partition that are related to file system metadata are
      guranteed to not happen.  Just open code the calls to sync_blockdev
      and invalidate_bdev instead.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e669c1da
    • Christoph Hellwig's avatar
      dasd: use blk_drop_partitions instead of badly reimplementing it · 21be6cdc
      Christoph Hellwig authored
      Use the blk_drop_partitions function instead of messing around with
      ioctls that get kernel pointers.  For this blk_drop_partitions needs
      to be exported, which it normally shouldn't - make an exception for
      s390 only.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      21be6cdc
    • Christoph Hellwig's avatar
      block: remove the disk argument from blk_drop_partitions · d46430bf
      Christoph Hellwig authored
      The gendisk can be trivially deducted from the block_device.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d46430bf
    • Christoph Hellwig's avatar
      block: remove hd_struct_kill · 4377b48d
      Christoph Hellwig authored
      The function has a single caller, so just open code it.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4377b48d
    • Christoph Hellwig's avatar
      block: cleanup hd_struct freeing · 8da2892e
      Christoph Hellwig authored
      Move hd_ref_init out of line as there it isn't anywhere near a fast path,
      and rename the rcu ref freeing callbacks to be more descriptive.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8da2892e
    • Christoph Hellwig's avatar
      block: pass a hd_struct to delete_partition · cddae808
      Christoph Hellwig authored
      All callers have the hd_struct at hand, so pass it instead of performing
      another lookup.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cddae808
    • Christoph Hellwig's avatar
      block: refactor blkpg_ioctl · fa9156ae
      Christoph Hellwig authored
      Split each sub-command out into a separate helper, and move those helpers
      to block/partitions/core.c instead of having a lot of partition
      manipulation logic open coded in block/ioctl.c.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fa9156ae
    • Douglas Anderson's avatar
      Revert "scsi: core: run queue if SCSI device queue isn't ready and queue is idle" · b4fd63f4
      Douglas Anderson authored
      This reverts commit 7e70aa78.
      
      Now that we have the patches ("blk-mq: In blk_mq_dispatch_rq_list()
      "no budget" is a reason to kick") and ("blk-mq: Rerun dispatching in
      the case of budget contention") we should no longer need the fix in
      the SCSI code.  Revert it, resolving conflicts with other patches that
      have touched this code.
      
      With this revert (and the two new patches) I can run the script that
      was in commit 7e70aa78 ("scsi: core: run queue if SCSI device
      queue isn't ready and queue is idle") in a loop with no failure.  If I
      do this revert without the two new patches I can easily get a failure.
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Acked-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b4fd63f4
    • Douglas Anderson's avatar
      blk-mq: Rerun dispatching in the case of budget contention · a0823421
      Douglas Anderson authored
      If ever a thread running blk-mq code tries to get budget and fails it
      immediately stops doing work and assumes that whenever budget is freed
      up that queues will be kicked and whatever work the thread was trying
      to do will be tried again.
      
      One path where budget is freed and queues are kicked in the normal
      case can be seen in scsi_finish_command().  Specifically:
      - scsi_finish_command()
        - scsi_device_unbusy()
          - # Decrement "device_busy", AKA release budget
        - scsi_io_completion()
          - scsi_end_request()
            - blk_mq_run_hw_queues()
      
      The above is all well and good.  The problem comes up when a thread
      claims the budget but then releases it without actually dispatching
      any work.  Since we didn't schedule any work we'll never run the path
      of finishing work / kicking the queues.
      
      This isn't often actually a problem which is why this issue has
      existed for a while and nobody noticed.  Specifically we only get into
      this situation when we unexpectedly found that we weren't going to do
      any work.  Code that later receives new work kicks the queues.  All
      good, right?
      
      The problem shows up, however, if timing is just wrong and we hit a
      race.  To see this race let's think about the case where we only have
      a budget of 1 (only one thread can hold budget).  Now imagine that a
      thread got budget and then decided not to dispatch work.  It's about
      to call put_budget() but then the thread gets context switched out for
      a long, long time.  While in this state, any and all kicks of the
      queue (like the when we received new work) will be no-ops because
      nobody can get budget.  Finally the thread holding budget gets to run
      again and returns.  All the normal kicks will have been no-ops and we
      have an I/O stall.
      
      As you can see from the above, you need just the right timing to see
      the race.  To start with, the only case it happens if we thought we
      had work, actually managed to get the budget, but then actually didn't
      have work.  That's pretty rare to start with.  Even then, there's
      usually a very small amount of time between realizing that there's no
      work and putting the budget.  During this small amount of time new
      work has to come in and the queue kick has to make it all the way to
      trying to get the budget and fail.  It's pretty unlikely.
      
      One case where this could have failed is illustrated by an example of
      threads running blk_mq_do_dispatch_sched():
      
      * Threads A and B both run has_work() at the same time with the same
        "hctx".  Imagine has_work() is exact.  There's no lock, so it's OK
        if Thread A and B both get back true.
      * Thread B gets interrupted for a long time right after it decides
        that there is work.  Maybe its CPU gets an interrupt and the
        interrupt handler is slow.
      * Thread A runs, get budget, dispatches work.
      * Thread A's work finishes and budget is released.
      * Thread B finally runs again and gets budget.
      * Since Thread A already took care of the work and no new work has
        come in, Thread B will get NULL from dispatch_request().  I believe
        this is specifically why dispatch_request() is allowed to return
        NULL in the first place if has_work() must be exact.
      * Thread B will now be holding the budget and is about to call
        put_budget(), but hasn't called it yet.
      * Thread B gets interrupted for a long time (again).  Dang interrupts.
      * Now Thread C (maybe with a different "hctx" but the same queue)
        comes along and runs blk_mq_do_dispatch_sched().
      * Thread C won't do anything because it can't get budget.
      * Finally Thread B will run again and put the budget without kicking
        any queues.
      
      Even though the example above is with blk_mq_do_dispatch_sched() I
      believe the race is possible any time someone is holding budget but
      doesn't do work.
      
      Unfortunately, the unlikely has become more likely if you happen to be
      using the BFQ I/O scheduler.  BFQ, by design, sometimes returns "true"
      for has_work() but then NULL for dispatch_request() and stays in this
      state for a while (currently up to 9 ms).  Suddenly you only need one
      race to hit, not two races in a row.  With my current setup this is
      easy to reproduce in reboot tests and traces have actually shown that
      we hit a race similar to the one described above.
      
      Note that we only need to fix blk_mq_do_dispatch_sched() and
      blk_mq_do_dispatch_ctx() and not the other places that put budget.  In
      other cases we know that we have work to do on at least one "hctx" and
      code already exists to kick that "hctx"'s queue.  When that work
      finally finishes all the queues will be kicked using the normal flow.
      
      One last note is that (at least in the SCSI case) budget is shared by
      all "hctx"s that have the same queue.  Thus we need to make sure to
      kick the whole queue, not just re-run dispatching on a single "hctx".
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a0823421
    • Douglas Anderson's avatar
      blk-mq: Add blk_mq_delay_run_hw_queues() API call · b9151e7b
      Douglas Anderson authored
      We have:
      * blk_mq_run_hw_queue()
      * blk_mq_delay_run_hw_queue()
      * blk_mq_run_hw_queues()
      
      ...but not blk_mq_delay_run_hw_queues(), presumably because nobody
      needed it before now.  Since we need it for a later patch in this
      series, add it.
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b9151e7b
    • Douglas Anderson's avatar
      blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick · ab3cee37
      Douglas Anderson authored
      In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
      true and the driver returns BLK_STS_RESOURCE then we'll kick the
      queue.  However, there's another case where we might need to kick it.
      If we were unable to get budget we can be in much the same state as
      when the driver returns BLK_STS_RESOURCE, so we should treat it the
      same.
      
      It should be noted that even if we add a whole bunch of extra kicking
      to the queue in other patches this patch is still important.
      Specifically any kicking that happened before we re-spliced leftover
      requests into 'hctx->dispatch' wouldn't have found any work, so we
      really need to make sure we kick ourselves after we've done the
      splicing.
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ab3cee37
  8. 19 Apr, 2020 2 commits
    • Linus Torvalds's avatar
      Linux 5.7-rc2 · ae83d0b4
      Linus Torvalds authored
      ae83d0b4
    • Brian Geffon's avatar
      mm: Fix MREMAP_DONTUNMAP accounting on VMA merge · dadbd85f
      Brian Geffon authored
      When remapping a mapping where a portion of a VMA is remapped
      into another portion of the VMA it can cause the VMA to become
      split. During the copy_vma operation the VMA can actually
      be remerged if it's an anonymous VMA whose pages have not yet
      been faulted. This isn't normally a problem because at the end
      of the remap the original portion is unmapped causing it to
      become split again.
      
      However, MREMAP_DONTUNMAP leaves that original portion in place which
      means that the VMA which was split and then remerged is not actually
      split at the end of the mremap. This patch fixes a bug where
      we don't detect that the VMAs got remerged and we end up
      putting back VM_ACCOUNT on the next mapping which is completely
      unreleated. When that next mapping is unmapped it results in
      incorrectly unaccounting for the memory which was never accounted,
      and eventually we will underflow on the memory comittment.
      
      There is also another issue which is similar, we're currently
      accouting for the number of pages in the new_vma but that's wrong.
      We need to account for the length of the remap operation as that's
      all that is being added. If there was a mapping already at that
      location its comittment would have been adjusted as part of
      the munmap at the start of the mremap.
      
      A really simple repro can be seen in:
      https://gist.github.com/bgaff/e101ce99da7d9a8c60acc641d07f312c
      
      Fixes: e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarBrian Geffon <bgeffon@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dadbd85f