1. 01 Oct, 2019 2 commits
    • io_uring: use __kernel_timespec in timeout ABI · bdf20073
      Arnd Bergmann authored
      All system calls use struct __kernel_timespec instead of the old struct
      timespec, but this one was just added with the old-style ABI. Change it
      now to enforce the use of __kernel_timespec, avoiding ABI confusion and
      the need for compat handlers on 32-bit architectures.
      
      Any user space caller will have to use __kernel_timespec now, but this
      is unambiguous and works for any C library regardless of the time_t
      definition. A nicer way to specify the timeout would have been a less
      ambiguous 64-bit nanosecond value, but I suppose it's too late now to
      change that as this would impact both 32-bit and 64-bit users.
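
      As an illustration only (not part of this commit), here is a minimal
      user-space sketch of the new ABI, assuming a liburing recent enough to
      provide io_uring_prep_timeout(); the fixed-width fields make it work
      regardless of the C library's time_t definition:

      #include <liburing.h>

      /* Sketch: arm a 2.5s timeout that also fires after one completion. */
      static int arm_timeout(struct io_uring *ring)
      {
              /* tv_sec/tv_nsec are 64-bit even on 32-bit userlands */
              struct __kernel_timespec ts = { .tv_sec = 2, .tv_nsec = 500000000 };
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              if (!sqe)
                      return -EBUSY;
              io_uring_prep_timeout(sqe, &ts, 1, 0);
              return io_uring_submit(ring);
      }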
      
      Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bdf20073
    • loop: change queue block size to match when using DIO · 85560117
      Martijn Coenen authored
      The loop driver assumes that if the passed in fd is opened with
      O_DIRECT, the caller wants to use direct I/O on the loop device.
      However, if the underlying block device has a different block size than
      the loop block queue, direct I/O can't be enabled. Instead of requiring
      userspace to manually change the blocksize and re-enable direct I/O,
      just change the queue block sizes to match, as well as the io_min size.
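
      For context, a hedged sketch of the manual sequence this change makes
      unnecessary, using the ioctls from <linux/loop.h>; with this patch,
      attaching a backing file opened with O_DIRECT adjusts the queue block
      size automatically:

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <sys/ioctl.h>
      #include <linux/loop.h>

      /* Sketch: attach a backing fd, match the block size by hand, enable DIO. */
      static int loop_attach_dio(int loop_fd, const char *path, unsigned long blksz)
      {
              int backing_fd = open(path, O_RDWR | O_DIRECT);

              if (backing_fd < 0)
                      return -1;
              if (ioctl(loop_fd, LOOP_SET_FD, backing_fd) < 0)
                      return -1;
              /* previously required by hand when block sizes differed */
              if (ioctl(loop_fd, LOOP_SET_BLOCK_SIZE, blksz) < 0)
                      return -1;
              return ioctl(loop_fd, LOOP_SET_DIRECT_IO, 1UL);
      }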
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martijn Coenen <maco@android.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      85560117
  2. 27 Sep, 2019 6 commits
    • Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linus · 2d5ba0c7
      Jens Axboe authored
      Pull NVMe changes from Sagi:
      
      "This set consists of various fixes and cleanups:
       - controller removal race fix from Balbir
       - quirk additions from Gabriel and Jian-Hong
       - nvme-pci power state save fix from Mario
       - Add 64bit user commands (for 64bit registers) from Marta
       - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
       - Minor cleanups and nits from James, Dan and John"
      
      * 'nvme-5.4' of git://git.infradead.org/nvme:
        nvme-rdma: fix possible use-after-free in connect timeout
        nvme: Move ctrl sqsize to generic space
        nvme: Add ctrl attributes for queue_count and sqsize
        nvme: allow 64-bit results in passthru commands
        nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
        nvmet-tcp: remove superflous check on request sgl
        Added QUIRKs for ADATA XPG SX8200 Pro 512GB
        nvme-rdma: Fix max_hw_sectors calculation
        nvme: fix an error code in nvme_init_subsystem()
        nvme-pci: Save PCI state before putting drive into deepest state
        nvme-tcp: fix wrong stop condition in io_work
        nvme-pci: Fix a race in controller removal
        nvmet: change ppl to lpp
      2d5ba0c7
    • blk-mq: apply normal plugging for HDD · 3154df26
      Ming Lei authored
      Some HDD drives may expose multiple hardware queues, such as MegaRAID.
      Let's apply the normal plugging for such devices because sequential IO
      may benefit a lot from plug merging.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3154df26
    • blk-mq: honor IO scheduler for multiqueue devices · a12de1d4
      Ming Lei authored
      If a device is using multiple queues, the IO scheduler may be bypassed.
      This may hurt performance for some slow MQ devices, and it also breaks
      zoned devices which depend on mq-deadline for respecting the write order
      in one zone.
      
      Don't bypass the IO scheduler if one is set up.
      
      This patch can basically double sequential write performance on MQ
      scsi_debug when mq-deadline is applied.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Javier González <javier@javigon.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a12de1d4
    • nvme-rdma: fix possible use-after-free in connect timeout · 67b483dd
      Sagi Grimberg authored
      If the connect times out, we may have already destroyed the
      queue in the timeout handler, so test if the queue is still
      allocated in the connect error handler.
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      67b483dd
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      We got a NULL pointer dereference BUG_ON in blk_mq_rq_timed_out(),
      as follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.
      
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', the 'req' may already have been completed and
      reinitialized by the next flush request, which calls blk_rq_init() to
      clear 'req' to 0.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of a normal request is protected by a refcount: the request
      cannot really be freed until 'rq->ref' drops to zero, so such requests
      cannot be reused before the timeout handler finishes.
      
      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' has not dropped to zero. After that, the
      'flush_rq' can be reused by the next flush request, resulting in the
      NULL pointer dereference BUG_ON.
      
      Fix this problem by covering the flush request with 'rq->ref' as well.
      If the refcount is not zero, flush_end_io() returns and waits for the
      last holder to call it again. To record the request status, add a new
      field 'rq_status', which is used in flush_end_io().
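
      For orientation, a rough sketch of the guard described above (not the
      verbatim kernel diff; the locking and the rest of the completion path
      are abbreviated):

      /* Sketch: flush_end_io() bails out while another holder (e.g. the
       * timeout handler) still owns a reference, recording the status in
       * the new blk_flush_queue 'rq_status' field for the final caller. */
      static void flush_end_io(struct request *flush_rq, blk_status_t error)
      {
              struct blk_flush_queue *fq =
                      blk_get_flush_queue(flush_rq->q, flush_rq->mq_ctx);
              unsigned long flags;

              spin_lock_irqsave(&fq->mq_flush_lock, flags);
              if (!refcount_dec_and_test(&flush_rq->ref)) {
                      fq->rq_status = error;  /* remembered for the last holder */
                      spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                      return;
              }
              if (fq->rq_status != BLK_STS_OK)
                      error = fq->rq_status;
              /* ... the original completion logic continues here ... */
              spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
      }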
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let a spinlock protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d699663
    • rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      We have already updated the limits after calling wbt_set_min_lat();
      there is no need to update them again.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2af2783f
  3. 26 Sep, 2019 6 commits
    • nvme: Move ctrl sqsize to generic space · f968688f
      Keith Busch authored
      This isn't specific to fabrics.
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      f968688f
    • iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo authored
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7afcccaf
    • iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo authored
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the in-between periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs spanning
      from the previous period, busy_level is reset to 0 if negative, thus
      suppressing vrate increases.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active, which pulls in latency results and can keep
        vrate down unexpectedly.
      
      * When a lagging condition is detected, the entire negative busy_level
        is reset.  This turned out to be way too aggressive on some devices
        which sometimes experience extended latencies on a small subset of
        commands.  In addition, a lagging IO will be accounted as a latency
        target miss on completion anyway, and resetting busy_level amplifies
        its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and by blocking
      vrate increases when there are lagging IOs, while leaving busy_level
      as-is.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7cd806a9
    • iocost: better trace vrate changes · 25d41e4a
      Tejun Heo authored
      The vrate_adj tracepoint traces vrate changes; however, it does so
      only when busy_level is non-zero.  busy_level turning to zero can
      sometimes be just as interesting an event.  This patch also enables
      the vrate_adj tracepoint on other vrate-related events - busy_level
      changes and non-zero nr_lagging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      25d41e4a
    • block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei authored
      cecf5d87 ("block: split .sysfs_lock into two locks") started to
      release & re-acquire sysfs_lock around registering/un-registering the
      elevator queue while switching elevators, in order to avoid a potential
      deadlock between showing & storing the 'queue/iosched' attributes and
      removing the elevator's kobject.
      
      It turns out there is no such deadlock, because 'q->sysfs_lock' isn't
      required in the .show & .store handlers of the queue/iosched attributes;
      only the elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold the queue's sysfs lock while
      registering/un-registering the elevator queue.
      
      The biggest issue is that commit cecf5d87 assumes that concurrent
      writes to 'queue/scheduler' can't happen. However, this assumption isn't
      true, because kernfs_fop_write() only guarantees that concurrent writes
      aren't issued through the same open file; writes can come from different
      opens of the file. So we can't release & re-acquire the queue's sysfs
      lock while switching elevators, otherwise a use-after-free on the
      elevator could be triggered.
      
      Fix the issue by not releasing the queue's sysfs lock while switching
      elevators.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b89f625e
    • blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei authored
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removes q->sysfs_lock from elevator_init_mq(), but forgets to deal with
      the lockdep_assert_held() called in blk_mq_sched_free_requests(), which
      runs in the failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is called right
      after 'mutex_lock(&q->sysfs_lock)', so the lock is held there.
      
      So move the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() to fix the report from syzbot.
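
      A minimal sketch of the resulting shape (assuming elevator_exit() is
      the inline helper in block/blk.h; not the verbatim diff):

      /* The assertion now lives in elevator_exit(), whose callers do hold
       * q->sysfs_lock, instead of in blk_mq_sched_free_requests(), which
       * elevator_init_mq() calls on its failure path without the lock. */
      static inline void elevator_exit(struct request_queue *q,
                                       struct elevator_queue *e)
      {
              lockdep_assert_held(&q->sysfs_lock);

              blk_mq_sched_free_requests(q);
              __elevator_exit(q, e);
      }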
      
      Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
      Fixes: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      284b94be
  4. 25 Sep, 2019 13 commits
  5. 24 Sep, 2019 13 commits
    • Merge branch 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 351c8a09
      Linus Torvalds authored
      Pull i2c updates from Wolfram Sang:
      
       - new driver for ICY, an Amiga Zorro card :)
      
       - axxia driver gained slave mode support, NXP driver gained ACPI
      
       - the slave EEPROM backend gained 16 bit address support
      
       - and lots of regular driver updates and reworks
      
      * 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
        i2c: tegra: Move suspend handling to NOIRQ phase
        i2c: imx: ACPI support for NXP i2c controller
        i2c: uniphier(-f): remove all dev_dbg()
        i2c: uniphier(-f): use devm_platform_ioremap_resource()
        i2c: slave-eeprom: Add comment about address handling
        i2c: exynos5: Remove IRQF_ONESHOT
        i2c: stm32f7: Make structure stm32f7_i2c_algo constant
        i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe
        i2c-eeprom_slave: Add support for more eeprom models
        i2c: fsi: Add of_put_node() before break
        i2c: synquacer: Make synquacer_i2c_ops constant
        i2c: hix5hd2: Remove IRQF_ONESHOT
        i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond
        watchdog: iTCO: Add support for Cannon Lake PCH iTCO
        i2c: iproc: Make bcm_iproc_i2c_quirks constant
        i2c: iproc: Add full name of devicetree node to adapter name
        i2c: piix4: Add ACPI support
        i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h
        i2c: ocores: use request_any_context_irq() to register IRQ handler
        i2c: designware: Fix optional reset error handling
        ...
      351c8a09
    • Merge tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 3cf7487c
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A few small remaining wrap-up for this merge window.
      
        Most of patches are device-specific (HD-audio and USB-audio quirks,
        FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a
        simple fix (actually two commits) for ASoC core"
      
      * tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: usb-audio: Add DSD support for EVGA NU Audio
        ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
        ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies
        ASoC: pcm3168a: The codec does not support S32_LE
        ASoC: core: use list_del_init and move it back to soc_cleanup_component
        ALSA: hda/realtek - PCI quirk for Medion E4254
        ALSA: hda - Apply AMD controller workaround for Raven platform
        ASoC: rsnd: do error check after rsnd_channel_normalization()
        ASoC: atmel_ssc_dai: Remove wrong spinlock usage
        ASoC: core: delete component->card_list in soc_remove_component only
        ASoC: fsl_sai: Fix noise when using EDMA
        ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
        ALSA: hda/realtek - Fix alienware headset mic
        ALSA: dice: fix wrong packet parameter for Alesis iO26
      3cf7487c
    • tpm: Wrap the buffer from the caller to tpm_buf in tpm_send() · e13cd21f
      Jarkko Sakkinen authored
      tpm_send() does not give the result back to the caller anymore. Doing
      so would require another memcpy(), which kind of tells that the whole
      approach is somewhat broken. Instead, as Mimi suggested, this commit
      just wraps the caller's data in a tpm_buf, and thus the result no
      longer goes to the garbage.
      
      Obviously this assumes that the caller passes a large enough buffer,
      which makes the whole API somewhat broken because the response could
      be a different size than @buflen, but since trusted keys is the only
      module using this API right now, this fix is sufficient for the
      moment.
      
      In the near future the plan is to replace the parameters with a tpm_buf
      created by the caller.
      Reported-by: Mimi Zohar <zohar@linux.ibm.com>
      Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: 412eb585 ("use tpm_buf in tpm_transmit_cmd() as the IO parameter")
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
      e13cd21f
    • MAINTAINERS: keys: Update path to trusted.h · c980ecff
      Denis Efremov authored
      Update MAINTAINERS record to reflect that trusted.h
      was moved to a different directory in commit 22447981
      ("KEYS: Move trusted.h to include/keys [ver #2]").
      
      Cc: Denis Kenzior <denkenz@gmail.com>
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Mimi Zohar <zohar@linux.ibm.com>
      Cc: linux-integrity@vger.kernel.org
      Signed-off-by: Denis Efremov <efremov@linux.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      c980ecff
    • KEYS: trusted: correctly initialize digests and fix locking issue · 9f75c822
      Roberto Sassu authored
      Commit 0b6cf6b9 ("tpm: pass an array of tpm_extend_digest structures to
      tpm_pcr_extend()") modifies tpm_pcr_extend() to accept a digest for each
      PCR bank. After modification, tpm_pcr_extend() expects that digests are
      passed in the same order as the algorithms set in chip->allocated_banks.
      
      This patch fixes two issues introduced in the last iterations of the patch
      set: missing initialization of the TPM algorithm ID in the tpm_digest
      structures passed to tpm_pcr_extend() by the trusted key module, and
      unreleased locks in the TPM driver due to returning from tpm_pcr_extend()
      without calling tpm_put_ops().
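
      A rough sketch of the first half of the fix (the helper name is
      illustrative; it copies the algorithm ID of each allocated PCR bank
      into the matching tpm_digest entry before tpm_pcr_extend() is called):

      /* Sketch: set alg_id for every digest from the chip's allocated
       * banks, in the same order tpm_pcr_extend() expects them. */
      static void init_trusted_digests(struct tpm_chip *chip,
                                       struct tpm_digest *digests)
      {
              int i;

              for (i = 0; i < chip->nr_allocated_banks; i++)
                      digests[i].alg_id = chip->allocated_banks[i].alg_id;
      }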
      
      Cc: stable@vger.kernel.org
      Fixes: 0b6cf6b9 ("tpm: pass an array of tpm_extend_digest structures to tpm_pcr_extend()")
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Suggested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      9f75c822
    • selftests/tpm2: Add log and *.pyc to .gitignore · 34cd83bb
      Petr Vorel authored
      Fixes: 6ea3dfe1 ("selftests: add TPM 2.0 tests")
      Signed-off-by: Petr Vorel <pvorel@suse.cz>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      34cd83bb
    • selftests/tpm2: Add the missing TEST_FILES assignment · 981c107c
      Jarkko Sakkinen authored
      The Python files required by the selftests are not packaged because of
      the missing assignment to TEST_FILES. Add the assignment.
      
      Cc: stable@vger.kernel.org
      Fixes: 6ea3dfe1 ("selftests: add TPM 2.0 tests")
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Reviewed-by: Petr Vorel <pvorel@suse.cz>
      981c107c
    • Merge tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block · b6cb84b4
      Linus Torvalds authored
      Pull more io_uring updates from Jens Axboe:
       "A collection of later fixes and additions, that weren't quite ready
        for pushing out with the initial pull request.
      
        This contains:
      
         - Fix potential use-after-free of shadow requests (Jackie)
      
         - Fix potential OOM crash in request allocation (Jackie)
      
         - kmalloc+memcpy -> kmemdup cleanup (Jackie)
      
         - Fix poll crash regression (me)
      
         - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)
      
         - Add support for timeouts, making it easier to do epoll_wait()
           conversions, for instance (me)
      
         - Ensure io_uring works without f_ops->read_iter() and
           f_ops->write_iter() (me)"
      
      * tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
        io_uring: correctly handle non ->{read,write}_iter() file_operations
        io_uring: IORING_OP_TIMEOUT support
        io_uring: use cond_resched() in sqthread
        io_uring: fix potential crash issue due to io_get_req failure
        io_uring: ensure poll commands clear ->sqe
        io_uring: fix use-after-free of shadow_req
        io_uring: use kmemdup instead of kmalloc and memcpy
      b6cb84b4
    • Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block · 2e959dd8
      Linus Torvalds authored
      Pull more block updates from Jens Axboe:
       "Some later additions that weren't quite done for the first pull
        request, and also a few fixes that have arrived since.
      
        This contains:
      
         - Kill silly pktcdvd warning on attempting to register a non-scsi
           passthrough device (me)
      
         - Use symbolic constants for the block t10 protection types, and
           switch to handling it in core rather than in the drivers (Max)
      
         - libahci platform missing node put fix (Nishka)
      
         - Small series of fixes for BFQ (Paolo)
      
         - Fix possible nbd crash (Xiubo)"
      
      * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
        block: drop device references in bsg_queue_rq()
        block: t10-pi: fix -Wswitch warning
        pktcdvd: remove warning on attempting to register non-passthrough dev
        ata: libahci_platform: Add of_node_put() before loop exit
        nbd: fix possible page fault for nbd disk
        nbd: rename the runtime flags as NBD_RT_ prefixed
        block, bfq: push up injection only after setting service time
        block, bfq: increase update frequency of inject limit
        block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
        block, bfq: update inject limit only after injection occurred
        block: centralize PI remapping logic to the block layer
        block: use symbolic constants for t10_pi type
      2e959dd8
    • Merge branch 'akpm' (patches from Andrew) · 9c9fa97a
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few hot fixes
      
       - ocfs2 updates
      
       - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
         cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
         sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
         oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
         zsmalloc)
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
        mm/zsmalloc.c: fix a -Wunused-function warning
        zswap: do not map same object twice
        zswap: use movable memory if zpool support allocate movable memory
        zpool: add malloc_support_movable to zpool_driver
        shmem: fix obsolete comment in shmem_getpage_gfp()
        mm/madvise: reduce code duplication in error handling paths
        mm: mmap: increase sockets maximum memory size pgoff for 32bits
        mm/mmap.c: refine find_vma_prev() with rb_last()
        riscv: make mmap allocation top-down by default
        mips: use generic mmap top-down layout and brk randomization
        mips: replace arch specific way to determine 32bit task with generic version
        mips: adjust brk randomization offset to fit generic version
        mips: use STACK_TOP when computing mmap base address
        mips: properly account for stack randomization and stack guard gap
        arm: use generic mmap top-down layout and brk randomization
        arm: use STACK_TOP when computing mmap base address
        arm: properly account for stack randomization and stack guard gap
        arm64, mm: make randomization selected by generic topdown mmap layout
        arm64, mm: move generic mmap layout functions to mm
        arm64: consider stack randomization for mmap base only when necessary
        ...
      9c9fa97a
    • mm/zsmalloc.c: fix a -Wunused-function warning · 2b38d01b
      Qian Cai authored
      set_zspage_inuse() was introduced in commit 4f42047b ("zsmalloc:
      use accessor"), but all of its users were later removed by the commits,
      
      bdb0af7c ("zsmalloc: factor page chain functionality out")
      3783689a ("zsmalloc: introduce zspage structure")
      
      so the function can be safely removed now.
      
      Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b38d01b
    • zswap: do not map same object twice · 068619e3
      Vitaly Wool authored
      zswap_writeback_entry() maps a handle to read swpentry first, and
      then in the most common case it would map the same handle again.
      This is ok when zbud is the backend since its mapping callback is
      plain and simple, but it slows things down for z3fold.
      
      Since there's hardly a point in unmapping a handle as quickly as
      zswap_writeback_entry() does after reading swpentry, the suggestion
      is to keep the handle mapped until the end.
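
      A hedged before/after fragment of the mapping pattern described here,
      using the zpool API; the rest of zswap_writeback_entry() is elided:

      /* Before: map just to read swpentry, unmap, then map the handle again. */
      zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
      swpentry = zhdr->swpentry;
      zpool_unmap_handle(pool, handle);
      /* ... later in zswap_writeback_entry() ... */
      src = zpool_map_handle(pool, handle, ZPOOL_MM_RO);

      /* After (this patch): keep the first mapping until the work is done. */
      zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
      swpentry = zhdr->swpentry;
      src = (u8 *)zhdr + sizeof(struct zswap_header);
      /* ... decompress / write back ... */
      zpool_unmap_handle(pool, handle);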
      
      Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      068619e3
    • zswap: use movable memory if zpool support allocate movable memory · d2fcd82b
      Hui Zhu authored
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based
      memory pool.  The memory pool can be zbud, z3fold or zsmalloc.  All of
      them allocate unmovable pages, which increases the number of unmovable
      page blocks and is bad for anti-fragmentation.
      
      zsmalloc supports page migration if movable pages are requested:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      And the commit "zpool: Add malloc_support_movable to zpool_driver" adds
      zpool_malloc_support_movable(), which checks malloc_support_movable to
      determine whether a zpool supports allocating movable memory.
      
      This commit lets zswap allocate blocks with gfp
      __GFP_HIGHMEM | __GFP_MOVABLE if the zpool supports allocating movable
      memory.
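
      A short sketch of the allocation-site change described above (the
      surrounding zswap_frontswap_store() context is elided;
      zpool_malloc_support_movable() is the helper added by the companion
      zpool patch):

      /* Only add the highmem/movable flags when the zpool backend (e.g.
       * zsmalloc) has declared that it can migrate its allocations. */
      gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;

      if (zpool_malloc_support_movable(entry->pool->zpool))
              gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;

      ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);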
      
      The following is a test log on a PC with 8G of memory and 2G of swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks is decreased
      when the kernel has this commit applied.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2fcd82b