1. 13 Sep, 2018 1 commit
    • Jens Axboe's avatar
      null_blk: fix zoned support for non-rq based operation · b228ba1c
      Jens Axboe authored
      The supported added for zones in null_blk seem to assume that only rq
      based operation is possible. But this depends on the queue_mode setting,
      if this is set to 0, then cmd->bio is what we need to be operating on.
      Right now any attempt to load null_blk with queue_mode=0 will
      insta-crash, since cmd->rq is NULL and null_handle_cmd() assumes it to
      always be set.
      
      Make the zoned code deal with bio's instead, or pass in the
      appropriate sector/nr_sectors instead.
      
      Fixes: ca4b2a01 ("null_blk: add zone support")
      Tested-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b228ba1c
  2. 11 Sep, 2018 1 commit
  3. 10 Sep, 2018 1 commit
  4. 06 Sep, 2018 1 commit
  5. 05 Sep, 2018 2 commits
    • Mikulas Patocka's avatar
      block: don't warn when doing fsync on read-only devices · 8b2ded1c
      Mikulas Patocka authored
      It is possible to call fsync on a read-only handle (for example, fsck.ext2
      does it when doing read-only check), and this call results in kernel
      warning.
      
      The patch b089cfd9 ("block: don't warn for flush on read-only device")
      attempted to disable the warning, but it is buggy and it doesn't
      (op_is_flush tests flags, but bio_op strips off the flags).
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Fixes: 721c7fc7 ("block: fail op_is_write() requests to read-only partitions")
      Cc: stable@vger.kernel.org	# 4.18
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8b2ded1c
    • Sagi Grimberg's avatar
      nvmet-rdma: fix possible bogus dereference under heavy load · 8407879c
      Sagi Grimberg authored
      Currently we always repost the recv buffer before we send a response
      capsule back to the host. Since ordering is not guaranteed for send
      and recv completions, it is posible that we will receive a new request
      from the host before we got a send completion for the response capsule.
      
      Today, we pre-allocate 2x rsps the length of the queue, but in reality,
      under heavy load there is nothing that is really preventing the gap to
      expand until we exhaust all our rsps.
      
      To fix this, if we don't have any pre-allocated rsps left, we dynamically
      allocate a rsp and make sure to free it when we are done. If under memory
      pressure we fail to allocate a rsp, we silently drop the command and
      wait for the host to retry.
      Reported-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Tested-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      [hch: dropped a superflous assignment]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      8407879c
  6. 04 Sep, 2018 1 commit
  7. 31 Aug, 2018 3 commits
    • Dennis Zhou (Facebook)'s avatar
      blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Dennis Zhou (Facebook) authored
      There is a very small change a bio gets caught up in a really
      unfortunate race between a task migration, cgroup exiting, and itself
      trying to associate with a blkg. This is due to css offlining being
      performed after the css->refcnt is killed which triggers removal of
      blkgs that reach their blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fallback to
      using the root_blkg.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      31118850
    • Dennis Zhou (Facebook)'s avatar
      blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Dennis Zhou (Facebook) authored
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkg.
      
      It seems in the past blk-throttle didn't do the most understandable
      things with taking data from a blkg while associating with current. So,
      the simplification and unification of what blk-throttle is doing caused
      this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      59b57717
    • Dennis Zhou (Facebook)'s avatar
      Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Dennis Zhou (Facebook) authored
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkg's pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6b065462
  8. 29 Aug, 2018 2 commits
  9. 28 Aug, 2018 5 commits
    • Chaitanya Kulkarni's avatar
    • James Smart's avatar
      nvme-fcloop: Fix dropped LS's to removed target port · afd299ca
      James Smart authored
      When a targetport is removed from the config, fcloop will avoid calling
      the LS done() routine thinking the targetport is gone. This leaves the
      initiator reset/reconnect hanging as it waits for a status on the
      Create_Association LS for the reconnect.
      
      Change the filter in the LS callback path. If tport null (set when
      failed validation before "sending to remote port"), be sure to call
      done. This was the main bug. But, continue the logic that only calls
      done if tport was set but there is no remoteport (e.g. case where
      remoteport has been removed, thus host doesn't expect a completion).
      Signed-off-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      afd299ca
    • Michal Wnukowski's avatar
      nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event · f1ed3df2
      Michal Wnukowski authored
      In many architectures loads may be reordered with older stores to
      different locations.  In the nvme driver the following two operations
      could be reordered:
      
       - Write shadow doorbell (dbbuf_db) into memory.
       - Read EventIdx (dbbuf_ei) from memory.
      
      This can result in a potential race condition between driver and VM host
      processing requests (if given virtual NVMe controller has a support for
      shadow doorbell).  If that occurs, then the NVMe controller may decide to
      wait for MMIO doorbell from guest operating system, and guest driver may
      decide not to issue MMIO doorbell on any of subsequent commands.
      
      This issue is purely timing-dependent one, so there is no easy way to
      reproduce it. Currently the easiest known approach is to run "Oracle IO
      Numbers" (orion) that is shipped with Oracle DB:
      
      orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
      	concat -write 40 -duration 120 -matrix row -testname nvme_test
      
      Where nvme_test is a .lun file that contains a list of NVMe block
      devices to run test against. Limiting number of vCPUs assigned to given
      VM instance seems to increase chances for this bug to occur. On test
      environment with VM that got 4 NVMe drives and 1 vCPU assigned the
      virtual NVMe controller hang could be observed within 10-20 minutes.
      That correspond to about 400-500k IO operations processed (or about
      100GB of IO read/writes).
      
      Orion tool was used as a validation and set to run in a loop for 36
      hours (equivalent of pushing 550M IO operations). No issues were
      observed. That suggest that the patch fixes the issue.
      
      Fixes: f9f38e33 ("nvme: improve performance for virtual NVMe devices")
      Signed-off-by: default avatarMichal Wnukowski <wnukowski@google.com>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      [hch: updated changelog and comment a bit]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      f1ed3df2
    • John Pittman's avatar
      block: bsg: move atomic_t ref_count variable to refcount API · db193954
      John Pittman authored
      Currently, variable ref_count within the bsg_device struct is of
      type atomic_t.  For variables being used as reference counters,
      the refcount API should be used instead of atomic.  The newer
      refcount API works to prevent counter overflows and use-after-free
      bugs.  So, move this varable from the atomic API to refcount,
      potentially avoiding the issues mentioned.
      Signed-off-by: default avatarJohn Pittman <jpittman@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      db193954
    • Chengguang Xu's avatar
      block: remove unnecessary condition check · 62d2a194
      Chengguang Xu authored
      kmem_cache_destroy() can handle NULL pointer correctly, so there is
      no need to check e->icq_cache before calling kmem_cache_destroy().
      Signed-off-by: default avatarChengguang Xu <cgxu519@gmx.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      62d2a194
  10. 27 Aug, 2018 10 commits
  11. 25 Aug, 2018 6 commits
  12. 24 Aug, 2018 7 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · 05193597
      Linus Torvalds authored
      Pull libata updates from Tejun Heo:
       "Nothing too interesting. Mostly ahci and ahci_platform changes, many
        around power management"
      
      * 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (22 commits)
        ata: ahci_platform: enable to get and control reset
        ata: libahci_platform: add reset control support
        ata: add an extra argument to ahci_platform_get_resources()
        ata: sata_rcar: Add r8a77965 support
        ata: sata_rcar: exclude setting of PHY registers in Gen3
        ata: sata_rcar: really mask all interrupts on Gen2 and later
        Revert "ata: ahci_platform: allow disabling of hotplug to save power"
        ata: libahci: Allow reconfigure of DEVSLP register
        ata: libahci: Correct setting of DEVSLP register
        ata: ahci: Enable DEVSLP by default on x86 with SLP_S0
        ata: ahci: Support state with min power but Partial low power state
        Revert "ata: ahci_platform: convert kcalloc to devm_kcalloc"
        ata: sata_rcar: Add rudimentary Runtime PM support
        ata: sata_rcar: Provide a short-hand for &pdev->dev
        ata: Only output sg element mapped number in verbose debug
        ata: Guard ata_scsi_dump_cdb() by ATA_VERBOSE_DEBUG
        ata: ahci_platform: convert kcalloc to devm_kcalloc
        ata: ahci_platform: convert kzallloc to kcalloc
        ata: ahci_platform: correct parameter documentation for ahci_platform_shutdown
        libata: remove ata_sff_data_xfer_noirq()
        ...
      05193597
    • Linus Torvalds's avatar
      Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 59676610
      Linus Torvalds authored
      Pull cgroup updates from Tejun Heo:
       "Just one commit from Steven to take out spin lock from trace event
        handlers"
      
      * 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/tracing: Move taking of spin lock out of trace event handlers
      59676610
    • Linus Torvalds's avatar
      Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · 9022ada8
      Linus Torvalds authored
      Pull workqueue updates from Tejun Heo:
       "Over the lockdep cross-release churn, workqueue lost some of the
        existing annotations. Johannes Berg restored it and also improved
        them"
      
      * 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: re-add lockdep dependencies for flushing
        workqueue: skip lockdep wq dependency in cancel_work_sync()
      9022ada8
    • Linus Torvalds's avatar
      Merge tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 18b8bfdf
      Linus Torvalds authored
      Pull IOMMU updates from Joerg Roedel:
      
       - PASID table handling updates for the Intel VT-d driver. It implements
         a global PASID space now so that applications usings multiple devices
         will just have one PASID.
      
       - A new config option to make iommu passthroug mode the default.
      
       - New sysfs attribute for iommu groups to export the type of the
         default domain.
      
       - A debugfs interface (for debug only) usable by IOMMU drivers to
         export internals to user-space.
      
       - R-Car Gen3 SoCs support for the ipmmu-vmsa driver
      
       - The ARM-SMMU now aborts transactions from unknown devices and devices
         not attached to any domain.
      
       - Various cleanups and smaller fixes all over the place.
      
      * tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (42 commits)
        iommu/omap: Fix cache flushes on L2 table entries
        iommu: Remove the ->map_sg indirection
        iommu/arm-smmu-v3: Abort all transactions if SMMU is enabled in kdump kernel
        iommu/arm-smmu-v3: Prevent any devices access to memory without registration
        iommu/ipmmu-vmsa: Don't register as BUS IOMMU if machine doesn't have IPMMU-VMSA
        iommu/ipmmu-vmsa: Clarify supported platforms
        iommu/ipmmu-vmsa: Fix allocation in atomic context
        iommu: Add config option to set passthrough as default
        iommu: Add sysfs attribyte for domain type
        iommu/arm-smmu-v3: sync the OVACKFLG to PRIQ consumer register
        iommu/arm-smmu: Error out only if not enough context interrupts
        iommu/io-pgtable-arm-v7s: Abort allocation when table address overflows the PTE
        iommu/io-pgtable-arm: Fix pgtable allocation in selftest
        iommu/vt-d: Remove the obsolete per iommu pasid tables
        iommu/vt-d: Apply per pci device pasid table in SVA
        iommu/vt-d: Allocate and free pasid table
        iommu/vt-d: Per PCI device pasid table interfaces
        iommu/vt-d: Add for_each_device_domain() helper
        iommu/vt-d: Move device_domain_info to header
        iommu/vt-d: Apply global PASID in SVA
        ...
      18b8bfdf
    • Linus Torvalds's avatar
      Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · d972604f
      Linus Torvalds authored
      Pull thermal management updates from Zhang Rui:
      
       - Add Daniel Lezcano as the reviewer of thermal framework and SoC
         driver changes (Daniel Lezcano).
      
       - Fix a bug in intel_dts_soc_thermal driver, which does not translate
         IO-APIC GSI (Global System Interrupt) into Linux irq number (Hans de
         Goede).
      
       - For device tree bindings, allow cooling devices sharing same trip
         point with same contribution value to share cooling map (Viresh
         Kumar).
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
        dt-bindings: thermal: Allow multiple devices to share cooling map
        MAINTAINERS: Add Daniel Lezcano as designated reviewer for thermal
        Thermal: Intel SoC DTS: Translate IO-APIC GSI number to linux irq number
      d972604f
    • Linus Torvalds's avatar
      Merge tag 'apparmor-pr-2018-08-23' of... · 57bb8e37
      Linus Torvalds authored
      Merge tag 'apparmor-pr-2018-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
      
      Pull apparmor updates from John Johansen:
       "There is nothing major this time just four bug fixes and a patch to
        remove some dead code:
      
        Cleanups:
         - remove no-op permission check in policy_unpack
      
        Bug fixes:
         - fix an error code in __aa_create_ns()
         - fix failure to audit context info in build_change_hat
         - check buffer bounds when mapping permissions mask
         - fully initialize aa_perms struct when answering userspace query"
      
      * tag 'apparmor-pr-2018-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
        apparmor: remove no-op permission check in policy_unpack
        apparmor: fix an error code in __aa_create_ns()
        apparmor: Fix failure to audit context info in build_change_hat
        apparmor: Fully initialize aa_perms struct when answering userspace query
        apparmor: Check buffer bounds when mapping permissions mask
      57bb8e37
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · aa5b1054
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - An implementation for the newly added hv_ops->flush() for the OPAL
         hvc console driver backends, I forgot to apply this after merging the
         hvc driver changes before the merge window.
      
       - Enable all PCI bridges at boot on powernv, to avoid races when
         multiple children of a bridge try to enable it simultaneously. This
         is a workaround until the PCI core can be enhanced to fix the races.
      
       - A fix to query PowerVM for the correct system topology at boot before
         initialising sched domains, seen in some configurations to cause
         broken scheduling etc.
      
       - A fix for pte_access_permitted() on "nohash" platforms.
      
       - Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
         due to a workaround when using the nest MMU (GPUs, accelerators).
      
       - Another fix to the VFIO code used by KVM, the previous fix had some
         bugs which caused guests to not start in some configurations.
      
       - A handful of other minor fixes.
      
      Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
      Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
      Mackerras, Srikar Dronamraju.
      
      * tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/mce: Fix SLB rebolting during MCE recovery path.
        KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
        powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
        powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
        powerpc/nohash: fix pte_access_permitted()
        powerpc/topology: Get topology for shared processors at boot
        powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
        powerpc/powernv/pci: Work around races in PCI bridge enabling
        powerpc/fadump: cleanup crash memory ranges support
        powerpc/powernv: provide a console flush operation for opal hvc driver
        powerpc/traps: Avoid rate limit messages from show unhandled signals
        powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()
      aa5b1054