1. 15 Oct, 2019 4 commits
    • libata/ahci: Fix PCS quirk application · 09d6ac8d
      Dan Williams authored
      Commit c312ef17 "libata/ahci: Drop PCS quirk for Denverton and
      beyond" got the polarity wrong on the check for which board-ids should
      have the quirk applied. The board type board_ahci_pcs7 is defined at the
      end of the list such that "pcs7" boards can be special cased in the
      future if they need the quirk. All prior Intel board ids "<
      board_ahci_pcs7" should proceed with applying the quirk.
      Reported-by: Andreas Friedrich <afrie@gmx.net>
      Reported-by: Stephen Douthit <stephend@silicom-usa.com>
      Fixes: c312ef17 ("libata/ahci: Drop PCS quirk for Denverton and beyond")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
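      The polarity described in the libata/ahci message above can be
      sketched as follows. This is only an illustration: the shortened enum
      and the helper needs_pcs_quirk() are hypothetical, and only the
      "board_ahci_pcs7 is last, everything before it gets the quirk" rule is
      taken from the commit text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened board list; only the ordering matters here. */
enum board_id {
    board_ahci,
    board_example_legacy,   /* made-up stand-in for the older Intel boards */
    board_ahci_pcs7,        /* defined last so it can be special-cased */
};

/* Hypothetical helper: every board id before board_ahci_pcs7 gets the quirk. */
static bool needs_pcs_quirk(enum board_id id)
{
    return id < board_ahci_pcs7;
}

/* Small self-check of the intended polarity. */
void pcs_quirk_demo(void)
{
    assert(needs_pcs_quirk(board_ahci));
    assert(needs_pcs_quirk(board_example_legacy));
    assert(!needs_pcs_quirk(board_ahci_pcs7));
}
```

      With the inverted comparison, the legacy boards would skip the quirk
      and only the "pcs7" boards would receive it, which is the regression
      the message describes.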
    • blk-rq-qos: fix first node deletion of rq_qos_del() · 307f4065
      Tejun Heo authored
      rq_qos_del() incorrectly assigns the node being deleted to the head if
      it was the first on the list in the !prev path.  Fix it by iterating
      with ** instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Fixes: a7905043 ("blk-rq-qos: refactor out common elements of blk-wbt")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
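      The "iterate with **" approach named in the rq_qos_del() message above
      can be sketched like this. The two structs are simplified stand-ins
      that keep only the list linkage, and the self-check function is an
      addition for illustration; this is not a copy of the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the list linkage is modeled. */
struct rq_qos {
    int id;
    struct rq_qos *next;
};

struct request_queue {
    struct rq_qos *rq_qos;   /* head of the singly linked list */
};

/*
 * Unlink @rqos by walking a pointer to the link itself. *cur is either the
 * head pointer or some node's ->next, so deleting the first node needs no
 * special case and no "prev" variable.
 */
static void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
{
    struct rq_qos **cur;

    for (cur = &q->rq_qos; *cur; cur = &(*cur)->next) {
        if (*cur == rqos) {
            *cur = rqos->next;
            break;
        }
    }
}

/* Self-check: delete the head first, then the tail. */
void rq_qos_del_demo(void)
{
    struct rq_qos c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    struct request_queue q = { &a };

    rq_qos_del(&q, &a);          /* first node: head must move to b */
    assert(q.rq_qos == &b);
    rq_qos_del(&q, &c);          /* last node: b becomes the tail */
    assert(b.next == NULL);
}
```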
    • blkcg: Fix multiple bugs in blkcg_activate_policy() · 9d179b86
      Tejun Heo authored
      blkcg_activate_policy() has the following bugs.
      
      * cf09a8ee ("blkcg: pass @q and @blkcg into
        blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
        blkcg_activate_policy() ends up using pd's allocated for the root
        blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
        can be passed in pd's which are allocated for the root blkcg.
      
        For blk-iocost, this means that ->pd_init_fn() can write beyond the
        end of the allocated object as it determines the length of the flex
        array at the end based on the blkcg's nesting level.
      
      * Each pd is initialized as they get allocated.  If alloc fails, the
        policy will get freed with pd's initialized on it.
      
      * After the above partial failure, the partial pds are not freed.
      
      This patch fixes all the above issues by
      
      * Restructuring blkcg_activate_policy() so that alloc and init passes
        are separate.  Init takes place only after all allocs succeeded and
        on failure all allocated pds are freed.
      
      * Unifying and fixing the cleanup of the remaining pd_prealloc.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: cf09a8ee ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
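      The restructuring described in the blkcg_activate_policy() message
      above (allocate everything first, initialize only after every
      allocation succeeded, free everything on failure) can be sketched in
      userspace C. All names here (struct pd, pd_alloc(), activate_policy(),
      the allocs_left failure knob) are hypothetical and exist only to make
      the pattern testable; the real function works on blkgs under locks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pd { int initialized; };   /* hypothetical per-group policy data */

static int allocs_left;           /* test knob: allocations allowed before failure */
static int live_pds;              /* number of pds currently allocated */

static struct pd *pd_alloc(void)
{
    struct pd *pd;

    if (allocs_left <= 0)
        return NULL;              /* simulated allocation failure */
    allocs_left--;
    pd = calloc(1, sizeof(*pd));
    if (pd)
        live_pds++;
    return pd;
}

static void pd_free(struct pd *pd)
{
    free(pd);
    live_pds--;
}

/* Pass 1 allocates; pass 2 initializes only if pass 1 fully succeeded. */
static int activate_policy(struct pd **pds, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        pds[i] = pd_alloc();
        if (!pds[i])
            goto fail;
    }
    for (i = 0; i < n; i++)
        pds[i]->initialized = 1;
    return 0;

fail:
    /* Free every pd allocated so far; none of them was initialized. */
    while (i--) {
        pd_free(pds[i]);
        pds[i] = NULL;
    }
    return -1;
}

/* Self-check: a partial failure leaves nothing allocated or initialized. */
void activate_policy_demo(void)
{
    struct pd *pds[3] = { NULL, NULL, NULL };

    allocs_left = 2;
    assert(activate_policy(pds, 3) == -1);
    assert(live_pds == 0);

    allocs_left = 3;
    assert(activate_policy(pds, 3) == 0);
    assert(pds[0]->initialized && pds[2]->initialized);
    for (size_t i = 0; i < 3; i++)
        pd_free(pds[i]);
}
```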
    • io_uring: consider the overflow of sequence for timeout req · 5da0fb1a
      yangerkun authored
      The sequence of a timeout is now recalculated as 'req->sequence =
      ctx->cached_sq_head + count - 1', and the insertion point in
      timeout_list is chosen by comparing the number of requests still
      expected to complete. However, this does not account for overflow:
      
      1. ctx->cached_sq_head + count - 1 may overflow, so a new timeout req
      with a larger count can end up with a smaller req->sequence.
      
      2. The current cached_sq_head may have overflowed relative to an
      earlier req, which also gives the new timeout req a smaller
      req->sequence.
      
      Either overflow misorders timeout_list, which in turn can complete
      timeouts in the wrong order. Fix it by reusing req->submit.sequence to
      store the count, and by changing the insertion-sort logic in
      io_timeout.
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
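      The wraparound problem in the io_uring timeout message above can be
      reproduced with plain 32-bit arithmetic. The helper names below are
      hypothetical, and before_by_remaining() is a generic wrap-safe
      comparison shown for contrast; it is not claimed to be the logic the
      patch adds to io_timeout.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute target, as in 'cached_sq_head + count - 1'; u32 math wraps. */
static uint32_t timeout_seq(uint32_t cached_sq_head, uint32_t count)
{
    return cached_sq_head + count - 1;
}

/* Naive ordering on the absolute value: wrong once the sum has wrapped. */
static int before_naive(uint32_t seq_a, uint32_t seq_b)
{
    return seq_a < seq_b;
}

/* Ordering on how many completions are still expected relative to head. */
static int before_by_remaining(uint32_t head, uint32_t seq_a, uint32_t seq_b)
{
    return (uint32_t)(seq_a - head) < (uint32_t)(seq_b - head);
}

/* Self-check with a head close to the 32-bit limit. */
void timeout_overflow_demo(void)
{
    uint32_t head = 0xFFFFFFF0u;
    uint32_t a = timeout_seq(head, 4);      /* 0xFFFFFFF3, no wrap */
    uint32_t b = timeout_seq(head, 0x20);   /* wraps to 0x0000000F */

    assert(a == 0xFFFFFFF3u);
    assert(b == 0x0000000Fu);
    /* b waits for more completions, yet looks "earlier" to the naive test. */
    assert(!before_naive(a, b));
    assert(before_by_remaining(head, a, b));
}
```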
  2. 14 Oct, 2019 1 commit
  3. 11 Oct, 2019 1 commit
  4. 10 Oct, 2019 2 commits
  5. 09 Oct, 2019 1 commit
  6. 08 Oct, 2019 1 commit
  7. 06 Oct, 2019 3 commits
  8. 04 Oct, 2019 1 commit
  9. 03 Oct, 2019 3 commits
  10. 01 Oct, 2019 4 commits
    • Revert "s390/dasd: Add discard support for ESE volumes" · 964ce509
      Stefan Haberland authored
      This reverts commit 7e64db15.
      
      The thin provisioning feature introduces an IOCTL and the discard support
      to allow userspace tools and filesystems to release unused and previously
      allocated space respectively.
      
      During some internal performance improvements and further tests, the
      release of allocated space revealed some issues that may lead to data
      corruption in some configurations when filesystems are mounted with
      discard support enabled.
      
      While we're working on a fix and trying to clarify the situation,
      this commit reverts the discard support for ESE volumes to prevent
      potential data corruption.
      
      Cc: <stable@vger.kernel.org> # 5.3
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: Fix error handling during online processing · dd454839
      Jan Höppner authored
      It is possible that the CCW commands for reading volume and extent pool
      information are not supported, either by the storage server (for
      dedicated DASDs) or by z/VM (for virtual devices, such as MDISKs).
      
      As a command reject will occur in such a case, the current error
      handling leads to a failing online processing and thus the DASD can't be
      used at all.
      
      Since the data being read is not essential for a fully operational
      DASD, the error handling can be removed. Information about the failing
      command is sent to the s390dbf debug feature.
      
      Fixes: c729696b ("s390/dasd: Recognise data for ESE volumes")
      Cc: <stable@vger.kernel.org> # 5.3
      Reported-by: Frank Heimes <frank.heimes@canonical.com>
      Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use __kernel_timespec in timeout ABI · bdf20073
      Arnd Bergmann authored
      All system calls use struct __kernel_timespec instead of the old struct
      timespec, but this one was just added with the old-style ABI. Change it
      now to enforce the use of __kernel_timespec, avoiding ABI confusion and
      the need for compat handlers on 32-bit architectures.
      
      Any user space caller will have to use __kernel_timespec now, but this
      is unambiguous and works for any C library regardless of the time_t
      definition. A nicer way to specify the timeout would have been a less
      ambiguous 64-bit nanosecond value, but I suppose it's too late now to
      change that as this would impact both 32-bit and 64-bit users.
      
      Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
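      To illustrate why the __kernel_timespec layout in the message above is
      unambiguous: both fields are 64 bits wide regardless of the C
      library's time_t. The struct below is a hypothetical userspace mirror
      (named example_kernel_timespec to avoid clashing with real headers),
      and ts_from_ms() is an illustrative helper, not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the layout: 64-bit seconds, 64-bit nanoseconds. */
struct example_kernel_timespec {
    int64_t   tv_sec;
    long long tv_nsec;
};

/* Same size on 32-bit and 64-bit ABIs, so no compat translation is needed. */
_Static_assert(sizeof(struct example_kernel_timespec) == 16,
               "layout must not depend on time_t");

/* Illustrative helper: build a relative timeout from milliseconds. */
static struct example_kernel_timespec ts_from_ms(uint64_t ms)
{
    struct example_kernel_timespec ts;

    ts.tv_sec  = (int64_t)(ms / 1000);
    ts.tv_nsec = (long long)(ms % 1000) * 1000000LL;
    return ts;
}

/* Self-check of the conversion. */
void ts_from_ms_demo(void)
{
    struct example_kernel_timespec ts = ts_from_ms(1500);

    assert(ts.tv_sec == 1);
    assert(ts.tv_nsec == 500000000LL);
}
```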
    • loop: change queue block size to match when using DIO · 85560117
      Martijn Coenen authored
      The loop driver assumes that if the passed in fd is opened with
      O_DIRECT, the caller wants to use direct I/O on the loop device.
      However, if the underlying block device has a different block size than
      the loop block queue, direct I/O can't be enabled. Instead of requiring
      userspace to manually change the blocksize and re-enable direct I/O,
      just change the queue block sizes to match, as well as the io_min size.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martijn Coenen <maco@android.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 27 Sep, 2019 6 commits
    • Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linus · 2d5ba0c7
      Jens Axboe authored
      Pull NVMe changes from Sagi:
      
      "This set consists of various fixes and cleanups:
       - controller removal race fix from Balbir
       - quirk additions from Gabriel and Jian-Hong
       - nvme-pci power state save fix from Mario
       - Add 64bit user commands (for 64bit registers) from Marta
       - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
       - Minor cleanups and nits from James, Dan and John"
      
      * 'nvme-5.4' of git://git.infradead.org/nvme:
        nvme-rdma: fix possible use-after-free in connect timeout
        nvme: Move ctrl sqsize to generic space
        nvme: Add ctrl attributes for queue_count and sqsize
        nvme: allow 64-bit results in passthru commands
        nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
        nvmet-tcp: remove superflous check on request sgl
        Added QUIRKs for ADATA XPG SX8200 Pro 512GB
        nvme-rdma: Fix max_hw_sectors calculation
        nvme: fix an error code in nvme_init_subsystem()
        nvme-pci: Save PCI state before putting drive into deepest state
        nvme-tcp: fix wrong stop condition in io_work
        nvme-pci: Fix a race in controller removal
        nvmet: change ppl to lpp
    • blk-mq: apply normal plugging for HDD · 3154df26
      Ming Lei authored
      Some HDD drives may expose multiple hardware queues, such as MegaRAID.
      Let's apply the normal plugging for such devices because sequential IO
      may benefit a lot from plug merging.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: honor IO scheduler for multiqueue devices · a12de1d4
      Ming Lei authored
      If a device is using multiple queues, the IO scheduler may be bypassed.
      This may hurt performance for some slow MQ devices, and it also breaks
      zoned devices which depend on mq-deadline for respecting the write order
      in one zone.
      
      Don't bypass the IO scheduler if one is set up.
      
      This patch can roughly double sequential write performance on MQ
      scsi_debug when mq-deadline is applied.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Javier González <javier@javigon.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-rdma: fix possible use-after-free in connect timeout · 67b483dd
      Sagi Grimberg authored
      If the connect times out, we may have already destroyed the
      queue in the timeout handler, so test if the queue is still
      allocated in the connect error handler.
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      We hit a NULL pointer dereference in blk_mq_rq_timed_out(), as
      follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between timeout handling and completion of
      a flush request.
      
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' may already have completed and been
      reinitialized by the next flush request, which calls blk_rq_init() to
      zero 'req'.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of normal requests is protected by a refcount. A request can
      only be freed once 'rq->ref' drops to zero, so these requests cannot
      be reused before the timeout handler finishes.
      
      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
      can be reused by the next flush request, resulting in the NULL pointer
      dereference.
      
      Fix this by covering flush requests with 'rq->ref' as well. If the
      refcount is not zero, flush_end_io() returns and waits for the last
      holder to call it again. To record the request status, add a new
      field 'rq_status', which is used in flush_end_io().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let spinlock to protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      We have updated limits after calling wbt_set_min_lat(). No need to
      update again.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 26 Sep, 2019 6 commits
    • nvme: Move ctrl sqsize to generic space · f968688f
      Keith Busch authored
      This isn't specific to fabrics.
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo authored
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo authored
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the inbetween periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which are
      spanning from the previous period, busy_level is reset to 0 if
      negative thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost: better trace vrate changes · 25d41e4a
      Tejun Heo authored
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei authored
      cecf5d87 ("block: split .sysfs_lock into two locks") started to
      release and re-acquire sysfs_lock around registering/un-registering the
      elevator queue while switching elevators, to avoid a potential deadlock
      between showing and storing 'queue/iosched' attributes and removing the
      elevator's kobject.
      
      It turns out there is no such deadlock, because 'q->sysfs_lock' isn't
      required in .show and .store of the queue/iosched attributes; only the
      elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold the queue's sysfs lock when
      registering/un-registering the elevator queue.
      
      The biggest issue is that commit cecf5d87 assumes that concurrent
      writes to 'queue/scheduler' can't happen. This assumption isn't
      true: kernfs_fop_write() only guarantees that concurrent writes
      aren't issued on the same open file, but writes can still come from
      different opens of the file. So we can't release and re-acquire the
      queue's sysfs lock while switching elevators; otherwise a
      use-after-free on the elevator could be triggered.
      
      Fix the issue by not releasing the queue's sysfs lock while switching
      elevators.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei authored
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
      lockdep_assert_held() called in blk_mq_sched_free_requests() which is
      run in failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
      by 'mutex_lock(&q->sysfs_lock)'.
      
      So move the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() to fix the report from syzbot.
      
      Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
      Fixed: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 25 Sep, 2019 7 commits