1. 06 Oct, 2019 1 commit
    • Mika Westerberg's avatar
      bdi: Do not use freezable workqueue · a2b90f11
      Mika Westerberg authored
      A removable block device, such as NVMe or SSD connected over Thunderbolt
      can be hot-removed any time including when the system is suspended. When
      device is hot-removed during suspend and the system gets resumed, kernel
      first resumes devices and then thaws the userspace including freezable
      workqueues. What happens in that case is that the NVMe driver notices
      that the device is unplugged and removes it from the system. This ends
      up calling bdi_unregister() for the gendisk which then schedules
      wb_workfn() to be run one more time.
      
      However, since the bdi_wq is still frozen flush_delayed_work() call in
      wb_shutdown() blocks forever halting system resume process. User sees
      this as hang as nothing is happening anymore.
      
      Triggering sysrq-w reveals this:
      
        Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
        Call Trace:
         ? __schedule+0x2c5/0x630
         ? wait_for_completion+0xa4/0x120
         schedule+0x3e/0xc0
         schedule_timeout+0x1c9/0x320
         ? resched_curr+0x1f/0xd0
         ? wait_for_completion+0xa4/0x120
         wait_for_completion+0xc3/0x120
         ? wake_up_q+0x60/0x60
         __flush_work+0x131/0x1e0
         ? flush_workqueue_prep_pwqs+0x130/0x130
         bdi_unregister+0xb9/0x130
         del_gendisk+0x2d2/0x2e0
         nvme_ns_remove+0xed/0x110 [nvme_core]
         nvme_remove_namespaces+0x96/0xd0 [nvme_core]
         nvme_remove+0x5b/0x160 [nvme]
         pci_device_remove+0x36/0x90
         device_release_driver_internal+0xdf/0x1c0
         nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
         process_one_work+0x1c2/0x3f0
         worker_thread+0x48/0x3e0
         kthread+0x100/0x140
         ? current_work+0x30/0x30
         ? kthread_park+0x80/0x80
         ret_from_fork+0x35/0x40
      
      This is not limited to NVMes so exactly same issue can be reproduced by
      hot-removing SSD (over Thunderbolt) while the system is suspended.
      
      Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.
      Reported-by: default avatarAceLan Kao <acelan.kao@canonical.com>
      Link: https://marc.info/?l=linux-kernel&m=138695698516487
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
      Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#tAcked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a2b90f11
  2. 04 Oct, 2019 1 commit
  3. 03 Oct, 2019 3 commits
  4. 01 Oct, 2019 4 commits
    • Stefan Haberland's avatar
      Revert "s390/dasd: Add discard support for ESE volumes" · 964ce509
      Stefan Haberland authored
      This reverts commit 7e64db15.
      
      The thin provisioning feature introduces an IOCTL and the discard support
      to allow userspace tools and filesystems to release unused and previously
      allocated space respectively.
      
      During some internal performance improvements and further tests, the
      release of allocated space revealed some issues that may lead to data
      corruption in some configurations when filesystems are mounted with
      discard support enabled.
      
      While we're working on a fix and trying to clarify the situation,
      this commit reverts the discard support for ESE volumes to prevent
      potential data corruption.
      
      Cc: <stable@vger.kernel.org> # 5.3
      Signed-off-by: default avatarStefan Haberland <sth@linux.ibm.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      964ce509
    • Jan Höppner's avatar
      s390/dasd: Fix error handling during online processing · dd454839
      Jan Höppner authored
      It is possible that the CCW commands for reading volume and extent pool
      information are not supported, either by the storage server (for
      dedicated DASDs) or by z/VM (for virtual devices, such as MDISKs).
      
      As a command reject will occur in such a case, the current error
      handling leads to a failing online processing and thus the DASD can't be
      used at all.
      
      Since the data being read is not essential for an fully operational
      DASD, the error handling can be removed. Information about the failing
      command is sent to the s390dbf debug feature.
      
      Fixes: c729696b ("s390/dasd: Recognise data for ESE volumes")
      Cc: <stable@vger.kernel.org> # 5.3
      Reported-by: default avatarFrank Heimes <frank.heimes@canonical.com>
      Signed-off-by: default avatarJan Höppner <hoeppner@linux.ibm.com>
      Signed-off-by: default avatarStefan Haberland <sth@linux.ibm.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dd454839
    • Arnd Bergmann's avatar
      io_uring: use __kernel_timespec in timeout ABI · bdf20073
      Arnd Bergmann authored
      All system calls use struct __kernel_timespec instead of the old struct
      timespec, but this one was just added with the old-style ABI. Change it
      now to enforce the use of __kernel_timespec, avoiding ABI confusion and
      the need for compat handlers on 32-bit architectures.
      
      Any user space caller will have to use __kernel_timespec now, but this
      is unambiguous and works for any C library regardless of the time_t
      definition. A nicer way to specify the timeout would have been a less
      ambiguous 64-bit nanosecond value, but I suppose it's too late now to
      change that as this would impact both 32-bit and 64-bit users.
      
      Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bdf20073
    • Martijn Coenen's avatar
      loop: change queue block size to match when using DIO · 85560117
      Martijn Coenen authored
      The loop driver assumes that if the passed in fd is opened with
      O_DIRECT, the caller wants to use direct I/O on the loop device.
      However, if the underlying block device has a different block size than
      the loop block queue, direct I/O can't be enabled. Instead of requiring
      userspace to manually change the blocksize and re-enable direct I/O,
      just change the queue block sizes to match, as well as the io_min size.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMartijn Coenen <maco@android.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      85560117
  5. 27 Sep, 2019 6 commits
    • Jens Axboe's avatar
      Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linus · 2d5ba0c7
      Jens Axboe authored
      Pull NVMe changes from Sagi:
      
      "This set consists of various fixes and cleanups:
       - controller removal race fix from Balbir
       - quirk additions from Gabriel and Jian-Hong
       - nvme-pci power state save fix from Mario
       - Add 64bit user commands (for 64bit registers) from Marta
       - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
       - Minor cleanups and nits from James, Dan and John"
      
      * 'nvme-5.4' of git://git.infradead.org/nvme:
        nvme-rdma: fix possible use-after-free in connect timeout
        nvme: Move ctrl sqsize to generic space
        nvme: Add ctrl attributes for queue_count and sqsize
        nvme: allow 64-bit results in passthru commands
        nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
        nvmet-tcp: remove superflous check on request sgl
        Added QUIRKs for ADATA XPG SX8200 Pro 512GB
        nvme-rdma: Fix max_hw_sectors calculation
        nvme: fix an error code in nvme_init_subsystem()
        nvme-pci: Save PCI state before putting drive into deepest state
        nvme-tcp: fix wrong stop condition in io_work
        nvme-pci: Fix a race in controller removal
        nvmet: change ppl to lpp
      2d5ba0c7
    • Ming Lei's avatar
      blk-mq: apply normal plugging for HDD · 3154df26
      Ming Lei authored
      Some HDD drive may expose multiple hardware queues, such as MegraRaid.
      Let's apply the normal plugging for such devices because sequential IO
      may benefit a lot from plug merging.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3154df26
    • Ming Lei's avatar
      blk-mq: honor IO scheduler for multiqueue devices · a12de1d4
      Ming Lei authored
      If a device is using multiple queues, the IO scheduler may be bypassed.
      This may hurt performance for some slow MQ devices, and it also breaks
      zoned devices which depend on mq-deadline for respecting the write order
      in one zone.
      
      Don't bypass io scheduler if we have one setup.
      
      This patch can double sequential write performance basically on MQ
      scsi_debug when mq-deadline is applied.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarJavier González <javier@javigon.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a12de1d4
    • Sagi Grimberg's avatar
      nvme-rdma: fix possible use-after-free in connect timeout · 67b483dd
      Sagi Grimberg authored
      If the connect times out, we may have already destroyed the
      queue in the timeout handler, so test if the queue is still
      allocated in the connect error handler.
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      67b483dd
    • Yufen Yu's avatar
      block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
      as following:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by the race between timeout handle and completion for
      flush request.
      
      When timeout handle function blk_mq_rq_timed_out() try to read
      'req->q->mq_ops', the 'req' have completed and reinitiated by next
      flush request, which would call blk_rq_init() to clear 'req' as 0.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"),
      normal requests lifetime are protected by refcount. Until 'rq->ref'
      drop to zero, the request can really be free. Thus, these requests
      cannot been reused before timeout handle finish.
      
      However, flush request has defined .end_io and rq->end_io() is still
      called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
      can be reused by the next flush request handle, resulting in null
      pointer deference BUG ON.
      
      We fix this problem by covering flush request with 'rq->ref'.
      If the refcount is not zero, flush_end_io() return and wait the
      last holder recall it. To record the request status, we add a new
      entry 'rq_status', which will be used in flush_end_io().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let spinlock to protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8d699663
    • Yufen Yu's avatar
      rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      We have updated limits after calling wbt_set_min_lat(). No need to
      update again.
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2af2783f
  6. 26 Sep, 2019 6 commits
    • Keith Busch's avatar
      nvme: Move ctrl sqsize to generic space · f968688f
      Keith Busch authored
      This isn't specific to fabrics.
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      f968688f
    • Tejun Heo's avatar
      iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo authored
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7afcccaf
    • Tejun Heo's avatar
      iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo authored
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the inbetween periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which are
      spanning from the previous period, busy_level is reset to 0 if
      negative thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7cd806a9
    • Tejun Heo's avatar
      iocost: better trace vrate changes · 25d41e4a
      Tejun Heo authored
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      25d41e4a
    • Ming Lei's avatar
      block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei authored
      cecf5d87 ("block: split .sysfs_lock into two locks") starts to
      release & acquire sysfs_lock before registering/un-registering elevator
      queue during switching elevator for avoiding potential deadlock from
      showing & storing 'queue/iosched' attributes and removing elevator's
      kobject.
      
      Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
      required in .show & .store of queue/iosched's attributes, and just
      elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold queue's sysfs lock when
      registering/un-registering elevator queue.
      
      The biggest issue is that commit cecf5d87 assumes that concurrent
      write on 'queue/scheduler' can't happen. However, this assumption isn't
      true, because kernfs_fop_write() only guarantees that concurrent write
      aren't called on the same open file, but the write could be from
      different open on the file. So we can't release & re-acquire queue's
      sysfs lock during switching elevator, otherwise use-after-free on
      elevator could be triggered.
      
      Fixes the issue by not releasing queue's sysfs lock during switching
      elevator.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b89f625e
    • Ming Lei's avatar
      blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei authored
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
      lockdep_assert_held() called in blk_mq_sched_free_requests() which is
      run in failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
      by 'mutex_lock(&q->sysfs_lock)'.
      
      So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() for fixing the report by syzbot.
      
      Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
      Fixed: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      284b94be
  7. 25 Sep, 2019 13 commits
  8. 24 Sep, 2019 6 commits