1. 27 Sep, 2019 3 commits
    • blk-mq: honor IO scheduler for multiqueue devices · a12de1d4
      Ming Lei authored
      If a device is using multiple queues, the IO scheduler may be bypassed.
      This may hurt performance for some slow MQ devices, and it also breaks
      zoned devices which depend on mq-deadline for respecting the write order
      in one zone.
      
      Don't bypass the IO scheduler if one is set up.
      
      This patch can roughly double sequential write performance on MQ
      scsi_debug when mq-deadline is used.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Javier González <javier@javigon.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
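As a rough illustration of the behaviour change described in the commit above, the following Python sketch models the dispatch decision before and after the patch. It is a simplified user-space model, not the kernel code: the `Queue` class, the function names and the returned labels are hypothetical, and the "before" model treats the bypass as unconditional for multiqueue devices even though the message only says the scheduler "may be" bypassed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    """Hypothetical stand-in for a request queue (not the kernel struct)."""
    nr_hw_queues: int
    elevator: Optional[str] = None  # e.g. "mq-deadline"; None means no scheduler


def dispatch_before(q: Queue) -> str:
    """Model of the old behaviour: multiqueue devices skip the scheduler."""
    if q.nr_hw_queues > 1:
        return "bypass"
    return "scheduler" if q.elevator is not None else "bypass"


def dispatch_after(q: Queue) -> str:
    """Model of the new behaviour: an attached scheduler is always honored."""
    return "scheduler" if q.elevator is not None else "bypass"


mq_with_deadline = Queue(nr_hw_queues=4, elevator="mq-deadline")
print(dispatch_before(mq_with_deadline))  # bypass
print(dispatch_after(mq_with_deadline))   # scheduler
```

In this model a zoned multiqueue device that relies on mq-deadline only gets its writes ordered by the scheduler in the "after" case.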
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      We got a NULL pointer dereference BUG in blk_mq_rq_timed_out()
      as follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.
      
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' has already completed and been reinitialized
      by the next flush request, which calls blk_rq_init() to zero 'req'.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of normal requests is protected by a refcount. A request can
      only be freed once 'rq->ref' drops to zero, so such requests cannot
      be reused before the timeout handler finishes.
      
      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' has not dropped to zero. After that, the
      'flush_rq' can be reused by the next flush request, resulting in the
      NULL pointer dereference.
      
      We fix this problem by covering the flush request with 'rq->ref'.
      If the refcount is not zero, flush_end_io() returns and waits for the
      last holder to call it again. To record the request status, we add a
      new field 'rq_status', which is used in flush_end_io().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - protect 'fq->rq_status' with the spinlock
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
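The refcount scheme described in the commit above can be sketched with a small single-threaded Python model. This is only an illustration under stated assumptions: the `FlushRequest` class, its methods and the error value `-5` are hypothetical, the model merges the request and its flush queue into one object, and it ignores the spinlock that protects 'rq_status' in the kernel.

```python
class FlushRequest:
    """Hypothetical model of a flush request plus its saved status."""

    def __init__(self):
        self.ref = 1              # reference held by the submitter
        self.rq_status = 0        # status saved when end_io has to be deferred
        self.final_status = None  # set once the request really completes

    def get_ref(self):
        """A holder such as the timeout handler pins the request."""
        self.ref += 1

    def flush_end_io(self, error):
        """Drop one reference; return True only when it was the last one."""
        self.ref -= 1
        if self.ref != 0:
            # Someone still uses the request: remember the status and return,
            # so the request cannot be reused yet.
            self.rq_status = error
            return False
        if self.rq_status != 0:
            error = self.rq_status  # report the status recorded earlier
        self.final_status = error
        return True


rq = FlushRequest()
rq.get_ref()                      # timeout handler starts inspecting rq
deferred = rq.flush_end_io(-5)    # completion races in with an error
print(deferred, rq.final_status)  # False None
done = rq.flush_end_io(0)         # last holder calls end_io again
print(done, rq.final_status)      # True -5
```

The first call returns early while the timeout handler still holds its reference, so the request is not handed to the next flush; the second call, made by the last holder, completes it with the status saved earlier.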
    • rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      wbt_set_min_lat() already updates the limits, so there is no need to
      update them again.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 26 Sep, 2019 5 commits
    • iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo authored
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
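The commit above states that a 50ms latency target yields a 100ms period when the target percentiles are zero, which suggests a factor of two. Assuming that ratio also applies to the new default (an inference from those two figures, not something the message states), the arithmetic can be sketched as follows; the constant and function names are hypothetical.

```python
PERIOD_MULTIPLIER = 2  # assumption: inferred from "50ms target -> 100ms period"


def period_ms(latency_target_ms: int) -> int:
    """Period length derived from the latency target (simplified model)."""
    return PERIOD_MULTIPLIER * latency_target_ms


print(period_ms(50))   # 100, matching the figure in the commit message
print(period_ms(250))  # 500 under the assumed multiplier
```

Under this assumption the new period would be longer than the "several hundred msecs" average completion latency the message mentions for a saturated SATA drive.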
    • iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo authored
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the in-between periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which are
      spanning from the previous period, busy_level is reset to 0 if
      negative thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
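The two changes described in the commit above can be sketched as a before/after decision function. This is a simplified model rather than the iocost code: the function names and the `pct_set` flag are hypothetical, and it assumes, following the message, that a negative busy_level is what allows vrate to increase.

```python
def vrate_decision_before(busy_level: int, nr_lagging: int):
    """Old handling: lagging IOs wipe out a negative busy_level."""
    if nr_lagging and busy_level < 0:
        busy_level = 0
    may_increase = busy_level < 0
    return busy_level, may_increase


def vrate_decision_after(busy_level: int, nr_lagging: int, pct_set: bool):
    """New handling: busy_level is kept; lagging IOs only block increases."""
    if not pct_set:
        nr_lagging = 0  # lagging IOs are not counted without target percentiles
    may_increase = busy_level < 0 and nr_lagging == 0
    return busy_level, may_increase


print(vrate_decision_before(-3, 1))                # (0, False)
print(vrate_decision_after(-3, 1, pct_set=True))   # (-3, False)
print(vrate_decision_after(-3, 1, pct_set=False))  # (-3, True)
```

In the "after" model a lagging IO still holds back a vrate increase, but the accumulated busy_level survives, and with no percentiles set the lagging count has no effect at all.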
    • iocost: better trace vrate changes · 25d41e4a
      Tejun Heo authored
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei authored
      cecf5d87 ("block: split .sysfs_lock into two locks") started to
      release and re-acquire sysfs_lock before registering/un-registering the
      elevator queue while switching elevators, to avoid a potential deadlock
      between showing & storing 'queue/iosched' attributes and removing the
      elevator's kobject.
      
      It turns out there is no such deadlock, because 'q->sysfs_lock' isn't
      required in .show & .store of queue/iosched's attributes, and only the
      elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold the queue's sysfs lock when
      registering/un-registering the elevator queue.
      
      The biggest issue is that commit cecf5d87 assumes that concurrent
      writes to 'queue/scheduler' can't happen. However, this assumption isn't
      true, because kernfs_fop_write() only guarantees that concurrent writes
      aren't issued on the same open file; the writes could come from
      different opens of the file. So we can't release & re-acquire the queue's
      sysfs lock while switching elevators, otherwise a use-after-free on the
      elevator could be triggered.
      
      Fix the issue by not releasing the queue's sysfs lock while switching
      elevators.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei authored
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removed q->sysfs_lock from elevator_init_mq(), but forgot to deal with
      the lockdep_assert_held() called in blk_mq_sched_free_requests(), which
      runs in the failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
      by 'mutex_lock(&q->sysfs_lock)'.
      
      So move the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() to fix the report by syzbot.
      
      Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
      Fixed: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
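Why moving the assertion helps can be shown with a toy model of the call paths named in the commit above. Everything here is hypothetical scaffolding: a boolean flag stands in for ownership of q->sysfs_lock, an explicit check stands in for lockdep_assert_held(), and the function names only mirror the roles of the kernel functions.

```python
class Queue:
    """Hypothetical queue; the flag models whether q->sysfs_lock is held."""

    def __init__(self):
        self.sysfs_lock_held = False
        self.sched_requests = ["rq0", "rq1"]


def check_lock_held(q):
    """Stand-in for lockdep_assert_held(&q->sysfs_lock)."""
    if not q.sysfs_lock_held:
        raise RuntimeError("sysfs_lock not held")


def free_requests_before(q):
    check_lock_held(q)        # old location of the assertion
    q.sched_requests.clear()


def free_requests_after(q):
    q.sched_requests.clear()  # assertion no longer lives here


def elevator_exit_after(q):
    check_lock_held(q)        # new location of the assertion
    free_requests_after(q)


def init_failure_path(q, free_requests):
    """Models the error path of elevator_init_mq(), which runs without the lock."""
    free_requests(q)


try:
    init_failure_path(Queue(), free_requests_before)
except RuntimeError as exc:
    print("old placement trips the check:", exc)

init_failure_path(Queue(), free_requests_after)  # no check on this path now
```

The elevator_exit() path still verifies the lock in this model, while the lock-free failure path of initialization no longer runs into the check.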
  3. 25 Sep, 2019 4 commits
    • Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client · f41def39
      Linus Torvalds authored
      Pull ceph updates from Ilya Dryomov:
       "The highlights are:
      
         - automatic recovery of a blacklisted filesystem session (Zheng Yan).
           This is disabled by default and can be enabled by mounting with the
           new "recover_session=clean" option.
      
         - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
           taken to avoid serializing O_DIRECT reads and writes with each
           other; this is based on the exclusion scheme from NFS.
      
         - handle large osdmaps better in the face of fragmented memory
           (myself)
      
         - don't limit what security.* xattrs can be get or set (Jeff Layton).
           We were overly restrictive here, unnecessarily preventing things
           like file capability sets stored in security.capability from
           working.
      
         - allow copy_file_range() within the same inode and across different
           filesystems within the same cluster (Luis Henriques)"
      
      * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
        ceph: call ceph_mdsc_destroy from destroy_fs_client
        libceph: use ceph_kvmalloc() for osdmap arrays
        libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
        ceph: allow object copies across different filesystems in the same cluster
        ceph: include ceph_debug.h in cache.c
        ceph: move static keyword to the front of declarations
        rbd: pull rbd_img_request_create() dout out into the callers
        ceph: reconnect connection if session hang in opening state
        libceph: drop unused con parameter of calc_target()
        ceph: use release_pages() directly
        rbd: fix response length parameter for encoded strings
        ceph: allow arbitrary security.* xattrs
        ceph: only set CEPH_I_SEC_INITED if we got a MAC label
        ceph: turn ceph_security_invalidate_secctx into static inline
        ceph: add buffered/direct exclusionary locking for reads and writes
        libceph: handle OSD op ceph_pagelist_append() errors
        ceph: don't return a value from void function
        ceph: don't freeze during write page faults
        ceph: update the mtime when truncating up
        ceph: fix indentation in __get_snap_name()
        ...
    • Merge tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 7b1373dd
      Linus Torvalds authored
      Pull fuse updates from Miklos Szeredi:
      
       - Continue separating the transport (user/kernel communication) and the
         filesystem layers of fuse. Getting rid of most layering violations
         will allow for easier cleanup and optimization later on.
      
       - Prepare for the addition of the virtio-fs filesystem. The actual
         filesystem will be introduced by a separate pull request.
      
       - Convert to new mount API.
      
       - Various fixes, optimizations and cleanups.
      
      * tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits)
        fuse: Make fuse_args_to_req static
        fuse: fix memleak in cuse_channel_open
        fuse: fix beyond-end-of-page access in fuse_parse_cache()
        fuse: unexport fuse_put_request
        fuse: kmemcg account fs data
        fuse: on 64-bit store time in d_fsdata directly
        fuse: fix missing unlock_page in fuse_writepage()
        fuse: reserve byteswapped init opcodes
        fuse: allow skipping control interface and forced unmount
        fuse: dissociate DESTROY from fuseblk
        fuse: delete dentry if timeout is zero
        fuse: separate fuse device allocation and installation in fuse_conn
        fuse: add fuse_iqueue_ops callbacks
        fuse: extract fuse_fill_super_common()
        fuse: export fuse_dequeue_forget() function
        fuse: export fuse_get_unique()
        fuse: export fuse_send_init_request()
        fuse: export fuse_len_args()
        fuse: export fuse_end_request()
        fuse: fix request limit
        ...
    • Merge tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd · 301310c6
      Linus Torvalds authored
      Pull tpm fixes from Jarkko Sakkinen.
      
      * tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd:
        tpm: Wrap the buffer from the caller to tpm_buf in tpm_send()
        MAINTAINERS: keys: Update path to trusted.h
        KEYS: trusted: correctly initialize digests and fix locking issue
        selftests/tpm2: Add log and *.pyc to .gitignore
        selftests/tpm2: Add the missing TEST_FILES assignment
    • Merge tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 4ef5b13a
      Linus Torvalds authored
      Pull iomap updates from Darrick Wong:
       "After last week's failed pull request attempt, I scuttled everything
        in the branch except for the directio endio api changes, which were
        trivial. Everything else will simply have to wait for the next cycle.
      
        Summary:
      
         - Report both io errors and short io results to the directio endio
           handler.
      
         - Allow directio callers to pass an ops structure to iomap_dio_rw"
      
      * tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: move the iomap_dio_rw ->end_io callback into a structure
        iomap: split size and error for iomap_dio_rw ->end_io
  4. 24 Sep, 2019 28 commits