1. 26 May, 2022 1 commit
    • blk-iolatency: Fix inflight count imbalances and IO hangs on offline · 8a177a36
      Tejun Heo authored
      iolatency needs to track the number of inflight IOs per cgroup. As this
      tracking can be expensive, it is disabled when no cgroup has iolatency
      configured for the device. To ensure that the inflight counters stay
      balanced, iolatency_set_limit() freezes the request_queue while manipulating
      the enabled counter, which ensures that no IO is in flight and thus all
      counters are zero.
      
      Unfortunately, iolatency_set_limit() isn't the only place where the enabled
      counter is manipulated. iolatency_pd_offline() can also decrement the counter
      and trigger disabling. As this disabling happens without freezing the queue,
      it can easily happen while some IOs are in flight and thus leak the counts.
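A toy Python model may make the imbalance easier to see. It is a sketch under stated assumptions, not the blk-iolatency code: the class and method names are hypothetical, it assumes IOs are only counted and un-counted while tracking is enabled, and "freezing" is modelled by completing every pending IO before flipping the flag. Flipping the flag with IOs still pending leaves the counter stuck above zero, while flipping it after a drain keeps it balanced.

```python
# Toy model with hypothetical names; NOT the kernel implementation.
class ToyIolatency:
    def __init__(self):
        self.enabled = False   # analogous to the per-device enabled state
        self.inflight = 0      # analogous to iolat->rq_wait.inflight
        self.pending = []      # IOs issued but not yet completed

    def issue(self, io_id):
        # Assumption: account the IO only while tracking is enabled.
        if self.enabled:
            self.inflight += 1
        self.pending.append(io_id)

    def complete(self):
        # Assumption: un-account only while tracking is enabled, so an IO
        # issued while enabled but completed after disabling is never
        # subtracted.
        self.pending.pop(0)
        if self.enabled:
            self.inflight -= 1

    def set_enabled_unfrozen(self, on):
        # Like the pre-fix offline path: flip with IOs still in flight.
        self.enabled = on

    def set_enabled_frozen(self, on):
        # Like freezing the queue: drain all in-flight IOs, then flip.
        while self.pending:
            self.complete()
        self.enabled = on


def leaked_count(frozen):
    q = ToyIolatency()
    q.set_enabled_frozen(True)
    for i in range(4):
        q.issue(i)                  # 4 IOs accounted while enabled
    if frozen:
        q.set_enabled_frozen(False)
    else:
        q.set_enabled_unfrozen(False)
    while q.pending:
        q.complete()                # any remaining completions arrive
    return q.inflight


print(leaked_count(frozen=False))  # 4: counts leaked
print(leaked_count(frozen=True))   # 0: balanced
```

Each disable/enable cycle done without a drain leaks whatever was in flight at that moment, which matches the steadily growing counts in the monitoring output further down.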
      
      This can be easily demonstrated by enabling iolatency on a single empty
      cgroup while IOs are in flight in other cgroups and then removing the
      cgroup. Note that iolatency must not be enabled anywhere else in the
      system, so that removing the cgroup disables iolatency for the whole
      device.
      
      The following script keeps turning iolatency on and off for sda:
      
        echo +io > /sys/fs/cgroup/cgroup.subtree_control
        while true; do
            mkdir -p /sys/fs/cgroup/test
            echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
            sleep 1
            rmdir /sys/fs/cgroup/test
            sleep 1
        done
      
      while a concurrent fio job generates direct random reads:
      
        fio --name test --filename=/dev/sda --direct=1 --rw=randread \
            --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k
      
      while monitoring with the following drgn script:
      
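        # Run as `drgn <script>`; drgn provides the global `prog`.
        import time

        from drgn import container_of
        from drgn.helpers.linux.block import disk_name
        from drgn.helpers.linux.cgroup import cgroup_path, css_for_each_descendant_pre
        from drgn.helpers.linux.list import hlist_for_each

        # Once a second, print every iolatency group with a non-zero inflight count.
        while True:
          for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
              for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                  blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                  pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                  if pd.value_() == 0:
                      continue
                  iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                  inflight = iolat.rq_wait.inflight.counter.value_()
                  if inflight:
                      print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                            f'{cgroup_path(css.cgroup).decode("utf-8")}')
          time.sleep(1)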
      
      The monitoring output looks like the following:
      
        inflight=1 sda /user.slice
        inflight=1 sda /user.slice
        ...
        inflight=14 sda /user.slice
        inflight=13 sda /user.slice
        inflight=17 sda /user.slice
        inflight=15 sda /user.slice
        inflight=18 sda /user.slice
        inflight=17 sda /user.slice
        inflight=20 sda /user.slice
        inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
        inflight=19 sda /user.slice
        inflight=19 sda /user.slice
      
      If a cgroup with a stuck inflight count ends up getting throttled, the
      throttled IOs will never be issued, as there's no completion event to wake
      them up, leading to an indefinite hang.
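The hang can be illustrated with a hypothetical wait queue in Python, assuming (as the message describes) that throttled submitters are only woken by completion events. `ToyRqWait` and its methods are illustrative names, not kernel APIs, and the timeout exists only so the demo terminates; the real wait has none.

```python
import threading


class ToyRqWait:
    """Hypothetical stand-in for a throttling wait queue: a submitter may
    issue only while inflight < limit and is woken only by completions."""

    def __init__(self, inflight, limit):
        self.cond = threading.Condition()
        self.inflight = inflight
        self.limit = limit

    def complete_one(self):
        # A completion frees a slot and wakes any waiting submitters.
        with self.cond:
            self.inflight -= 1
            self.cond.notify_all()

    def try_issue(self, timeout):
        # Wait for a free slot; the timeout exists only so the demo ends.
        with self.cond:
            ok = self.cond.wait_for(lambda: self.inflight < self.limit,
                                    timeout=timeout)
            if ok:
                self.inflight += 1
            return ok


# Leaked count: 19 "in flight" IOs that will never complete, limit 1.
stuck = ToyRqWait(inflight=19, limit=1)
print(stuck.try_issue(timeout=0.2))     # False: no completion ever arrives

# Healthy case: one real IO in flight that completes shortly.
healthy = ToyRqWait(inflight=1, limit=1)
threading.Timer(0.05, healthy.complete_one).start()
print(healthy.try_issue(timeout=2.0))   # True
```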
      
      This patch fixes the bug by unifying enable handling into a work item,
      which is automatically kicked off from iolatency_set_min_lat_nsec(), which is
      called from both the iolatency_set_limit() and iolatency_pd_offline() paths.
      Punting to a work item is necessary as iolatency_pd_offline() is called
      under spinlocks while freezing a request_queue requires a sleepable context.
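The shape of the fix can be sketched as follows. This is a hypothetical Python model, not the kernel change itself: `ToyBlkIolat`, `enable_cnt`, and the queue-based worker are assumptions standing in for the counter and work item the message describes. The caller only updates a count and kicks the worker, which is the one place allowed to "sleep" to freeze the queue and flip `enabled`. Kicking only on 0 <-> 1 transitions is a design choice of this sketch; because the worker always reads the latest count, one run after the last change is enough.

```python
import queue
import threading


class ToyBlkIolat:
    """Hypothetical model of the work-item approach described above."""

    def __init__(self):
        self.enable_cnt = 0            # groups that currently have a target
        self.cnt_lock = threading.Lock()
        self.enabled = False           # only ever changed by the worker
        self.freezes = 0               # how many times the "queue" was frozen
        self.work = queue.Queue()      # stand-in for schedule_work()
        self.worker = threading.Thread(target=self._work_fn, daemon=True)
        self.worker.start()

    def set_min_lat_nsec(self, oldval, val):
        # May be called from atomic context: never freezes here, it only
        # updates the count and kicks the worker on 0 <-> 1 changes.
        with self.cnt_lock:
            if not oldval and val:
                self.enable_cnt += 1
                kick = self.enable_cnt == 1
            elif oldval and not val:
                self.enable_cnt -= 1
                kick = self.enable_cnt == 0
            else:
                kick = False
        if kick:
            self.work.put("enable_work")

    def _work_fn(self):
        # Sleepable context: act on the latest count and freeze only if
        # the enabled state actually has to change.
        while self.work.get() is not None:
            with self.cnt_lock:
                want = self.enable_cnt > 0
            if want != self.enabled:
                self.freezes += 1      # queue freeze would go here
                self.enabled = want    # flipped while no IO is in flight
                                       # queue unfreeze would go here

    def flush(self):
        # Process everything queued so far, then stop the worker.
        self.work.put(None)
        self.worker.join()


iolat = ToyBlkIolat()
iolat.set_min_lat_nsec(0, 100_000)   # first target: 0 -> 1, kicks the work
iolat.set_min_lat_nsec(0, 50_000)    # second target: 1 -> 2, no kick
iolat.set_min_lat_nsec(50_000, 0)    # e.g. an offlined group clears its target
iolat.flush()
print(iolat.enabled, iolat.enable_cnt, iolat.freezes)  # True 1 1
```

Setting or clearing a target on a second group does not trigger another freeze here, in line with the message's point about avoiding unnecessary freezes.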
      
      This also simplifies the code, reducing LOC (excluding comments), and avoids
      the unnecessary freezes that happened whenever a cgroup's latency target was
      newly set or cleared.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Liu Bo <bo.liu@linux.alibaba.com>
      Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter")
      Cc: stable@vger.kernel.org # v5.0+
      Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 23 May, 2022 1 commit
  3. 21 May, 2022 1 commit
  4. 19 May, 2022 4 commits
  5. 18 May, 2022 2 commits
    • blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() · 1305e2c9
      Jens Axboe authored
      A previous commit got rid of the unnecessary rcu_read_lock() inside the
      IRQ-disabling queue_lock, but this debug statement was left behind. It's now
      firing since we are indeed not inside an RCU read lock, but we don't
      need to be, as we're still preempt safe.
      
      Get rid of the check, as we have a lockdep assert for holding the
      queue lock right after it anyway.
      
      Link: https://lore.kernel.org/linux-block/46253c48-81cb-0787-20ad-9133afdd9e21@samsung.com/
      Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Fixes: 77c570a1 ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-throttle: Set BIO_THROTTLED when bio has been throttled · 5a011f88
      Laibin Qiu authored
      1. Currently, every bio has the BIO_THROTTLED flag set after
      __blk_throtl_bio() returns.
      
      2. If a bio needs to be throttled, the timer is started and the bio
      is not submitted directly; instead, it is submitted from
      blk_throtl_dispatch_work_fn() when the timer expires. However, the
      BIO_THROTTLED flag is set on a throttled bio only after the timer
      has been started. If the bio has already completed by then, this
      causes the use-after-free below.
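The ordering problem can be modelled in a few lines of Python. This is a hypothetical sketch rather than blk-throttle code: `ToyBio`, `ToyThrottler`, and the explicit `race` hook are illustrative, and a `freed` field stands in for freed memory since Python cannot actually free the object. The `race` hook forces the dispatcher to run in the window after queue_lock is dropped; setting the flag before that window, which is what this commit does by moving the BIO_THROTTLED setting under queue_lock, closes it.

```python
import threading


class ToyBio:
    def __init__(self):
        self.flags = set()
        self.freed = False    # models the memory having been freed


class ToyThrottler:
    """Hypothetical model: once a throttled bio is parked under queue_lock,
    a dispatcher may issue, complete and free it as soon as the lock drops."""

    def __init__(self):
        self.queue_lock = threading.Lock()
        self.parked = []

    def dispatch_all(self):
        # Stand-in for timer-driven dispatch, completion and bio_put().
        with self.queue_lock:
            bios, self.parked = self.parked, []
        for bio in bios:
            bio.freed = True

    def throttle(self, bio, flag_under_lock, race=None):
        with self.queue_lock:
            self.parked.append(bio)
            if flag_under_lock:
                bio.flags.add("BIO_THROTTLED")   # fixed ordering
        if race is not None:
            race()                               # dispatcher runs right here
        if not flag_under_lock:
            if bio.freed:
                raise RuntimeError("use-after-free: flag written to freed bio")
            bio.flags.add("BIO_THROTTLED")       # buggy ordering


buggy = ToyThrottler()
try:
    buggy.throttle(ToyBio(), flag_under_lock=False, race=buggy.dispatch_all)
except RuntimeError as err:
    print(err)              # use-after-free: flag written to freed bio

fixed = ToyThrottler()
bio = ToyBio()
fixed.throttle(bio, flag_under_lock=True, race=fixed.dispatch_all)
print("BIO_THROTTLED" in bio.flags)   # True
```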
      
      BUG: KASAN: use-after-free in blk_throtl_bio+0x12f0/0x2c70
      Read of size 2 at addr ffff88801b8902d4 by task fio/26380
      
       dump_stack+0x9b/0xce
       print_address_description.constprop.6+0x3e/0x60
       kasan_report.cold.9+0x22/0x3a
       blk_throtl_bio+0x12f0/0x2c70
       submit_bio_checks+0x701/0x1550
       submit_bio_noacct+0x83/0xc80
       submit_bio+0xa7/0x330
       mpage_readahead+0x380/0x500
       read_pages+0x1c1/0xbf0
       page_cache_ra_unbounded+0x471/0x6f0
       do_page_cache_ra+0xda/0x110
       ondemand_readahead+0x442/0xae0
       page_cache_async_ra+0x210/0x300
       generic_file_buffered_read+0x4d9/0x2130
       generic_file_read_iter+0x315/0x490
       blkdev_read_iter+0x113/0x1b0
       aio_read+0x2ad/0x450
       io_submit_one+0xc8e/0x1d60
       __se_sys_io_submit+0x125/0x350
       do_syscall_64+0x2d/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Allocated by task 26380:
       kasan_save_stack+0x19/0x40
       __kasan_kmalloc.constprop.2+0xc1/0xd0
       kmem_cache_alloc+0x146/0x440
       mempool_alloc+0x125/0x2f0
       bio_alloc_bioset+0x353/0x590
       mpage_alloc+0x3b/0x240
       do_mpage_readpage+0xddf/0x1ef0
       mpage_readahead+0x264/0x500
       read_pages+0x1c1/0xbf0
       page_cache_ra_unbounded+0x471/0x6f0
       do_page_cache_ra+0xda/0x110
       ondemand_readahead+0x442/0xae0
       page_cache_async_ra+0x210/0x300
       generic_file_buffered_read+0x4d9/0x2130
       generic_file_read_iter+0x315/0x490
       blkdev_read_iter+0x113/0x1b0
       aio_read+0x2ad/0x450
       io_submit_one+0xc8e/0x1d60
       __se_sys_io_submit+0x125/0x350
       do_syscall_64+0x2d/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Freed by task 0:
       kasan_save_stack+0x19/0x40
       kasan_set_track+0x1c/0x30
       kasan_set_free_info+0x1b/0x30
       __kasan_slab_free+0x111/0x160
       kmem_cache_free+0x94/0x460
       mempool_free+0xd6/0x320
       bio_free+0xe0/0x130
       bio_put+0xab/0xe0
       bio_endio+0x3a6/0x5d0
       blk_update_request+0x590/0x1370
       scsi_end_request+0x7d/0x400
       scsi_io_completion+0x1aa/0xe50
       scsi_softirq_done+0x11b/0x240
       blk_mq_complete_request+0xd4/0x120
       scsi_mq_done+0xf0/0x200
       virtscsi_vq_done+0xbc/0x150
       vring_interrupt+0x179/0x390
       __handle_irq_event_percpu+0xf7/0x490
       handle_irq_event_percpu+0x7b/0x160
       handle_irq_event+0xcc/0x170
       handle_edge_irq+0x215/0xb20
       common_interrupt+0x60/0x120
       asm_common_interrupt+0x1e/0x40
      
      Fix this by moving the setting of BIO_THROTTLED under the queue_lock.
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20220301123919.2381579-1-qiulaibin@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 17 May, 2022 2 commits
  7. 16 May, 2022 3 commits
  8. 12 May, 2022 2 commits
  9. 11 May, 2022 1 commit
  10. 05 May, 2022 3 commits
  11. 02 May, 2022 16 commits
  12. 23 Apr, 2022 4 commits