1. 28 May, 2020 2 commits
  2. 27 May, 2020 1 commit
    • Dongli Zhang's avatar
      nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll() · 9210c075
      Dongli Zhang authored
      There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g.,
      when doing live reset while polling the nvme device.
      
            CPU X                        CPU Y
                                     nvme_poll()
      nvme_dev_disable()
      -> nvme_stop_queues()
      -> nvme_suspend_io_queues()
      -> nvme_suspend_queue()
                                     -> spin_lock(&nvmeq->cq_poll_lock);
      -> nvme_reap_pending_cqes()
         -> nvme_process_cq()        -> nvme_process_cq()
      
      In the above scenario, the nvme_process_cq() for the same queue may be
      running on both CPU X and CPU Y concurrently.
      
      It is much more easier to reproduce the issue when CONFIG_PREEMPT is
      enabled in kernel. When CONFIG_PREEMPT is disabled, it would take longer
      time for nvme_stop_queues()-->blk_mq_quiesce_queue() to wait for grace
      period.
      
      This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
      nvme_reap_pending_cqes().
      
      Fixes: fa46c6fb ("nvme/pci: move cqe check after device shutdown")
      Signed-off-by: default avatarDongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      9210c075
  3. 21 May, 2020 2 commits
    • Chaitanya Kulkarni's avatar
      null_blk: don't allow discard for zoned mode · 1592cd15
      Chaitanya Kulkarni authored
      Zoned block device specification do not define the behavior of
      discard/trim command as this command is generally replaced by the reset
      write pointer (zone reset) command. Emulate this in null_blk by making
      zoned and discard options mutually exclusive.
      Suggested-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1592cd15
    • Chaitanya Kulkarni's avatar
      null_blk: return error for invalid zone size · e2748325
      Chaitanya Kulkarni authored
      In null_init_zone_dev() check if the zone size is larger than device
      capacity, return error if needed.
      
      This also fixes the following oops :-
      
      null_blk: changed the number of conventional zones to 4294967295
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      PGD 7d76c5067 P4D 7d76c5067 PUD 7d240c067 PMD 0
      Oops: 0002 [#1] SMP NOPTI
      CPU: 4 PID: 5508 Comm: nullbtests.sh Tainted: G OE 5.7.0-rc4lblk-fnext0
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e4
      RIP: 0010:null_init_zoned_dev+0x17a/0x27f [null_blk]
      RSP: 0018:ffffc90007007e00 EFLAGS: 00010246
      RAX: 0000000000000020 RBX: ffff8887fb3f3c00 RCX: 0000000000000007
      RDX: 0000000000000000 RSI: ffff8887ca09d688 RDI: ffff888810fea510
      RBP: 0000000000000010 R08: ffff8887ca09d688 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8887c26e8000
      R13: ffffffffa05e9390 R14: 0000000000000000 R15: 0000000000000001
      FS:  00007fcb5256f740(0000) GS:ffff888810e00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 000000081e8fe000 CR4: 00000000003406e0
      Call Trace:
       null_add_dev+0x534/0x71b [null_blk]
       nullb_device_power_store.cold.41+0x8/0x2e [null_blk]
       configfs_write_file+0xe6/0x150
       vfs_write+0xba/0x1e0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x60/0x250
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x7fcb51c71840
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e2748325
  4. 16 May, 2020 1 commit
  5. 12 May, 2020 1 commit
  6. 09 May, 2020 4 commits
  7. 07 May, 2020 2 commits
  8. 05 May, 2020 1 commit
    • Tejun Heo's avatar
      iocost: protect iocg->abs_vdebt with iocg->waitq.lock · 0b80f986
      Tejun Heo authored
      abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
      is and controls the activation of use_delay mechanism. Once a cgroup goes
      over budget from forced IOs, it has to pay it back with its future budget.
      The progress guarantee on debt paying comes from the iocg being active -
      active iocgs are processed by the periodic timer, which ensures that as time
      passes the debts dissipate and the iocg returns to normal operation.
      
      However, both iocg activation and vdebt handling are asynchronous and a
      sequence like the following may happen.
      
      1. The iocg is in the process of being deactivated by the periodic timer.
      
      2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
         without anything because it still sees that the iocg is already active.
      
      3. The iocg is deactivated.
      
      4. The bio from #2 is over budget but needs to be forced. It increases
         abs_vdebt and goes over the threshold and enables use_delay.
      
      5. IO control is enabled for the iocg's subtree and now IOs are attributed
         to the descendant cgroups and the iocg itself no longer issues IOs.
      
      This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
      further IOs which can activate it. This can end up unduly punishing all the
      descendants cgroups.
      
      The usual throttling path has the same issue - the iocg must be active while
      throttled to ensure that future event will wake it up - and solves the
      problem by synchronizing the throttling path with a spinlock. abs_vdebt
      handling is another form of overage handling and shares a lot of
      characteristics including the fact that it isn't in the hottest path.
      
      This patch fixes the above and other possible races by strictly
      synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarVlad Dmitriev <vvd@fb.com>
      Cc: stable@vger.kernel.org # v5.4+
      Fixes: e1518f63 ("blk-iocost: Don't let merges push vtime into the future")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0b80f986
  9. 30 Apr, 2020 2 commits
  10. 27 Apr, 2020 1 commit
    • Niklas Cassel's avatar
      nvme: prevent double free in nvme_alloc_ns() error handling · 132be623
      Niklas Cassel authored
      When jumping to the out_put_disk label, we will call put_disk(), which will
      trigger a call to disk_release(), which calls blk_put_queue().
      
      Later in the cleanup code, we do blk_cleanup_queue(), which will also call
      blk_put_queue().
      
      Putting the queue twice is incorrect, and will generate a KASAN splat.
      
      Set the disk->queue pointer to NULL, before calling put_disk(), so that the
      first call to blk_put_queue() will not free the queue.
      
      The second call to blk_put_queue() uses another pointer to the same queue,
      so this call will still free the queue.
      
      Fixes: 85136c01 ("lightnvm: simplify geometry enumeration")
      Signed-off-by: default avatarNiklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      132be623
  11. 23 Apr, 2020 2 commits
    • Damien Le Moal's avatar
      null_blk: Cleanup zoned device initialization · d205bde7
      Damien Le Moal authored
      Move all zoned mode related code from null_blk_main.c to
      null_blk_zoned.c, avoiding an ugly #ifdef in the process.
      Rename null_zone_init() into null_init_zoned_dev(), null_zone_exit()
      into null_free_zoned_dev() and add the new function
      null_register_zoned_dev() to finalize the zoned dev setup before
      add_disk().
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d205bde7
    • Damien Le Moal's avatar
      null_blk: Fix zoned command handling · 9dd44c7e
      Damien Le Moal authored
      For write operations issued to a null_blk device with zoned mode
      enabled, the state and write pointer position of the zone targeted by
      the command should be checked before badblocks and memory backing
      are handled as the write may be first failed due to, for instance, a
      sector position not aligned with the zone write pointer. This order of
      checking for errors reflects more accuratly the behavior of physical
      zoned devices.
      
      Furthermore, the write pointer position of the target zone should be
      incremented only and only if no errors are reported by badblocks and
      memory backing handling.
      
      To fix this, introduce the small helper function null_process_cmd()
      which execute null_handle_badblocks() and null_handle_memory_backed()
      and use this function in null_zone_write() to correctly handle write
      requests to zoned null devices depending on the type and state of the
      write target zone. Also call this function in null_handle_zoned() to
      process read requests to zoned null devices.
      
      null_process_cmd() is called directly from null_handle_cmd() for
      regular null devices, resulting in no functional change for these type
      of devices. To have symmetric names, the function null_handle_zoned()
      is renamed to null_process_zoned_cmd().
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9dd44c7e
  12. 21 Apr, 2020 2 commits
  13. 20 Apr, 2020 1 commit
    • Douglas Anderson's avatar
      bdev: Reduce time holding bd_mutex in sync in blkdev_close() · b849dd84
      Douglas Anderson authored
      While trying to "dd" to the block device for a USB stick, I
      encountered a hung task warning (blocked for > 120 seconds).  I
      managed to come up with an easy way to reproduce this on my system
      (where /dev/sdb is the block device for my USB stick) with:
      
        while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
      
      With my reproduction here are the relevant bits from the hung task
      detector:
      
       INFO: task udevd:294 blocked for more than 122 seconds.
       ...
       udevd           D    0   294      1 0x00400008
       Call trace:
        ...
        mutex_lock_nested+0x40/0x50
        __blkdev_get+0x7c/0x3d4
        blkdev_get+0x118/0x138
        blkdev_open+0x94/0xa8
        do_dentry_open+0x268/0x3a0
        vfs_open+0x34/0x40
        path_openat+0x39c/0xdf4
        do_filp_open+0x90/0x10c
        do_sys_open+0x150/0x3c8
        ...
      
       ...
       Showing all locks held in the system:
       ...
       1 lock held by dd/2798:
        #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
       ...
       dd              D    0  2798   2764 0x00400208
       Call trace:
        ...
        schedule+0x8c/0xbc
        io_schedule+0x1c/0x40
        wait_on_page_bit_common+0x238/0x338
        __lock_page+0x5c/0x68
        write_cache_pages+0x194/0x500
        generic_writepages+0x64/0xa4
        blkdev_writepages+0x24/0x30
        do_writepages+0x48/0xa8
        __filemap_fdatawrite_range+0xac/0xd8
        filemap_write_and_wait+0x30/0x84
        __blkdev_put+0x88/0x204
        blkdev_put+0xc4/0xe4
        blkdev_close+0x28/0x38
        __fput+0xe0/0x238
        ____fput+0x1c/0x28
        task_work_run+0xb0/0xe4
        do_notify_resume+0xfc0/0x14bc
        work_pending+0x8/0x14
      
      The problem appears related to the fact that my USB disk is terribly
      slow and that I have a lot of RAM in my system to cache things.
      Specifically my writes seem to be happening at ~15 MB/s and I've got
      ~4 GB of RAM in my system that can be used for buffering.  To write 4
      GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
      
      The 267 second number is a problem because in __blkdev_put() we call
      sync_blockdev() while holding the bd_mutex.  Any other callers who
      want the bd_mutex will be blocked for the whole time.
      
      The problem is made worse because I believe blkdev_put() specifically
      tells other tasks (namely udev) to go try to access the device at right
      around the same time we're going to hold the mutex for a long time.
      
      Putting some traces around this (after disabling the hung task detector),
      I could confirm:
       dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
       udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
       dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
       udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
      
      A simple fix for this is to realize that sync_blockdev() works fine if
      you're not holding the mutex.  Also, it's not the end of the world if
      you sync a little early (though it can have performance impacts).
      Thus we can make a guess that we're going to need to do the sync and
      then do it without holding the mutex.  We still do one last sync with
      the mutex but it should be much, much faster.
      
      With this, my hung task warnings for my test case are gone.
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarGuenter Roeck <groeck@chromium.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b849dd84
  14. 18 Apr, 2020 1 commit
  15. 17 Apr, 2020 17 commits