1. 25 Jul, 2019 1 commit
    • Merge branch 'access-creds' · a29a0a46
      Linus Torvalds authored
      The access() (and faccessat()) credentials change can cause an
      unnecessary load on the RCU machinery because every access() call ends
      up freeing the temporary access credential using RCU.
      
      This isn't really noticeable on small machines, but if you have hundreds
      of cores you can cause huge slowdowns due to RCU storms.
      
      It's easy to avoid: the temporary access credentials aren't actually
      normally accessed using RCU at all, so we can avoid the whole issue by
      just marking them as such.
      
      * access-creds:
        access: avoid the RCU grace period for the temporary subjective credentials
  2. 24 Jul, 2019 6 commits
    • access: avoid the RCU grace period for the temporary subjective credentials · d7852fbd
      Linus Torvalds authored
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves an RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having a
      nice alloc/free pattern that hits in hot per-CPU slab caches, you have
      all those delayed frees, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
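
The flag-in-a-union idea described above can be sketched in userspace C. This is only a toy model under stated assumptions: every `toy_` name is hypothetical, the "grace period" is just a function call that drains a pending list, and none of this is the kernel's actual cred code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct rcu_head; the layout is illustrative only. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

struct toy_cred {
	int usage;
	union {				/* the flag shares storage with the list head */
		int non_rcu;
		struct toy_rcu_head rcu;
	};
};

static struct toy_rcu_head *toy_pending;	/* frees waiting for a grace period */
static int toy_sync_frees, toy_deferred_frees;

static struct toy_cred *toy_alloc_cred(int non_rcu)
{
	struct toy_cred *cred = calloc(1, sizeof(*cred));

	if (!cred)
		return NULL;
	cred->usage = 1;
	cred->non_rcu = non_rcu;
	return cred;
}

/* Deferred-free callback: recover the cred from its embedded list head. */
static void toy_free_deferred(struct toy_rcu_head *head)
{
	struct toy_cred *cred =
		(struct toy_cred *)((char *)head - offsetof(struct toy_cred, rcu));

	toy_deferred_frees++;
	free(cred);
}

/* Drop a reference; on the last one, free now or queue for later. */
static void toy_put_cred(struct toy_cred *cred)
{
	if (--cred->usage > 0)
		return;
	if (cred->non_rcu) {		/* never visible to RCU readers */
		toy_sync_frees++;
		free(cred);
		return;
	}
	cred->rcu.func = toy_free_deferred;	/* overwrites the flag, which is 0 here */
	cred->rcu.next = toy_pending;
	toy_pending = &cred->rcu;
}

/* Pretend a grace period has elapsed and run the queued frees. */
static void toy_grace_period(void)
{
	while (toy_pending) {
		struct toy_rcu_head *head = toy_pending;

		toy_pending = head->next;
		head->func(head);
	}
}
```

Because the flag is only read before the list head is written, the two can share storage, which is why the message says no extra space is needed.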
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · bed38c3e
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "An assortment of non-regression fixes that have accumulated since the
        start of the merge window.
      
         - A fix for a user triggerable oops on machines where transactional
           memory is disabled, eg. Power9 bare metal, Power8 with TM disabled
           on the command line, or all Power7 or earlier machines.
      
         - Three fixes for handling of PMU and power saving registers when
           running nested KVM on Power9.
      
         - Two fixes for bugs found while stress testing the XIVE interrupt
           controller code, also on Power9.
      
         - A fix to allow guests to boot under Qemu/KVM on Power9 using the
           Hash MMU with >= 1TB of memory.
      
         - Two fixes for bugs in the recent DMA cleanup, one of which could
           lead to checkstops.
      
         - And finally three fixes for the PAPR SCM nvdimm driver.
      
        Thanks to: Alexey Kardashevskiy, Andrea Arcangeli, Cédric Le Goater,
        Christoph Hellwig, David Gibson, Gautham R. Shenoy, Michael Neuling,
        Oliver O'Halloran, Satheesh Rajendran, Shawn Anastasio, Suraj Jitindar
        Singh, Vaibhav Jain"
      
      * tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails
        powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
        powerpc/pseries: Update SCM hcall op-codes in hvcall.h
        powerpc/tm: Fix oops on sigreturn on systems without TM
        powerpc/dma: Fix invalid DMA mmap behavior
        KVM: PPC: Book3S HV: XIVE: fix rollback when kvmppc_xive_create fails
        powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
        powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA
        KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries
        powerpc/pmu: Set pmcregs_in_use in paca when running as LPAR
        KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting
        powerpc/mm: Limit rma_size to 1TB when running without HV mode
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 76260774
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Bugfixes, a pvspinlock optimization, and documentation moving"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption
        Documentation: move Documentation/virtual to Documentation/virt
        KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free
        KVM: X86: Dynamically allocate user_fpu
        KVM: X86: Fix fpu state crash in kvm guest
        Revert "kvm: x86: Use task structs fpu field for user"
        KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
    • Merge tag 'dma-mapping-5.3-2' of git://git.infradead.org/users/hch/dma-mapping · c2626876
      Linus Torvalds authored
      Pull dma-mapping regression fix from Christoph Hellwig:
       "Ensure that dma_addressing_limited doesn't crash on devices without a
        dma mask (Eric Auger)"
      
      * tag 'dma-mapping-5.3-2' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: use dma_get_mask in dma_addressing_limited
    • KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption · 266e85a5
      Wanpeng Li authored
      Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks)
      introduced hybrid PV queued/unfair locks:
       - queued mode (no starvation)
       - unfair mode (good performance on a lightly contended lock)
      Lock waiters fall into unfair mode especially in VMs with over-committed
      vCPUs, since increasing over-commitment increases the likelihood that the
      queue head vCPU has been preempted and is not actively spinning.
      
      However, promptly rescheduling the queue head vCPU so that it can acquire
      the lock still performs better than relying on lock stealing alone in
      over-subscribed scenarios.
      
      Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
      ebizzy -M
                   vanilla     boosting    improved
       1VM          23520        25040         6%
       2VM           8000        13600        70%
       3VM           3100         5400        74%
      
      The lock holder vCPU yields to the queue head vCPU on unlock, to boost a
      queue head vCPU that was either involuntarily preempted or voluntarily
      halted after failing to acquire the lock following a short spin in the guest.
      
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Documentation: move Documentation/virtual to Documentation/virt · 2f5947df
      Christoph Hellwig authored
      Renaming docs seems to be en vogue at the moment, so fix one of the
      grossly misnamed directories.  We usually never use "virtual" as
      a shortcut for virtualization in the kernel, but always virt,
      as seen in the virt/ top-level directory.  Fix up the documentation
      to match that.
      
      Fixes: ed16648e ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 23 Jul, 2019 2 commits
  4. 22 Jul, 2019 18 commits
    • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7b5cf701
      Linus Torvalds authored
      Pull preemption Kconfig fix from Thomas Gleixner:
       "The PREEMPT_RT stub config renamed PREEMPT to PREEMPT_LL and defined
        PREEMPT outside of the menu and made it selectable by both PREEMPT_LL
        and PREEMPT_RT.
      
        Stupid me missed that 114 defconfigs select CONFIG_PREEMPT which
        obviously can't work anymore. oldconfig builds are affected as well,
        but it's more obvious as the user gets asked. [old]defconfig silently
        fixes it up and selects PREEMPT_NONE.
      
        Unbreak it by undoing the rename and adding an intermediate config
        symbol which is selected by both PREEMPT and PREEMPT_RT. That requires
        chasing down a few #ifdefs, but it's better than tweaking 114
        defconfigs and annoying users"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
    • Merge tag 'for-linus-20190722' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 44b912cd
      Linus Torvalds authored
      Pull pidfd polling fix from Christian Brauner:
       "A fix for pidfd polling. It ensures that the task's exit state is
        visible to all waiters"
      
      * tag 'for-linus-20190722' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        pidfd: fix a poll race when setting exit_state
    • Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 21c730d7
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fixes for leaks caused by recently merged patches
      
       - one build fix
      
       - a fix to prevent mixing of incompatible features
      
      * tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: don't leak extent_map in btrfs_get_io_geometry()
        btrfs: free checksum hash on in close_ctree
        btrfs: Fix build error while LIBCRC32C is module
        btrfs: inode: Don't compress if NODATASUM or NODATACOW set
    • sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y · b8d33498
      Thomas Gleixner authored
      The merge of the CONFIG_PREEMPT_RT stub renamed CONFIG_PREEMPT to
      CONFIG_PREEMPT_LL which causes all defconfigs which have CONFIG_PREEMPT=y
      set to fall back to CONFIG_PREEMPT_NONE because CONFIG_PREEMPT depends on
      the preemption mode choice which defaults to NONE. This also affects
      oldconfig builds.
      
      So rather than changing 114 defconfig files and being an annoyance to
      users, revert the rename and select a new config symbol PREEMPTION. That
      keeps everything working smoothly and the relevant ifdefs are going to be
      fixed up step by step.
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Fixes: a50a3f4b ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • Merge tag 'media/v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · c92f0380
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "For two regressions in media core:
      
         - v4l2-subdev: fix regression in check_pad()
      
         - videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already
           in use"
      
      * tag 'media/v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already in use
        media: v4l2-subdev: fix regression in check_pad()
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 83768245
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Several netfilter fixes including a nfnetlink deadlock fix from
          Florian Westphal and fix for dropping VRF packets from Miaohe Lin.
      
       2) Flow offload fixes from Pablo Neira Ayuso including a fix to restore
          proper block sharing.
      
       3) Fix r8169 PHY init from Thomas Voegtle.
      
       4) Fix memory leak in mac80211, from Lorenzo Bianconi.
      
       5) Missing NULL check on object allocation in cxgb4, from Navid
          Emamdoost.
      
       6) Fix scaling of RX power in sfp phy driver, from Andrew Lunn.
      
       7) Check that there is actually an ip header to access in skb->data in
          VRF, from Peter Kosyh.
      
       8) Remove spurious rcu unlock in hv_netvsc, from Haiyang Zhang.
      
       9) One more tweak to the TCP fragmentation memory limit changes, to be
          less harmful to applications setting small SO_SNDBUF values. From
          Eric Dumazet.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
        tcp: be more careful in tcp_fragment()
        hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
        vrf: make sure skb->data contains ip header to make routing
        connector: remove redundant input callback from cn_dev
        qed: Prefer pcie_capability_read_word()
        igc: Prefer pcie_capability_read_word()
        cxgb4: Prefer pcie_capability_read_word()
        be2net: Synchronize be_update_queues with dev_watchdog
        bnx2x: Prevent load reordering in tx completion processing
        net: phy: sfp: hwmon: Fix scaling of RX power
        net: sched: verify that q!=NULL before setting q->flags
        chelsio: Fix a typo in a function name
        allocate_flower_entry: should check for null deref
        net: hns3: typo in the name of a constant
        kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist.
        tipc: Fix a typo
        mac80211: don't warn about CW params when not using them
        mac80211: fix possible memory leak in ieee80211_assign_beacon
        nl80211: fix NL80211_HE_MAX_CAPABILITY_LEN
        nl80211: fix VENDOR_CMD_RAW_DATA
        ...
    • pidfd: fix a poll race when setting exit_state · b191d649
      Suren Baghdasaryan authored
      There is a race between reading task->exit_state in pidfd_poll and
      writing it after do_notify_parent calls do_notify_pidfd. Expected
      sequence of events is:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
        tsk->exit_state = EXIT_DEAD
                                        pidfd_poll
                                           if (tsk->exit_state)
      
      However nothing prevents the following sequence:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
                                         pidfd_poll
                                            if (tsk->exit_state)
        tsk->exit_state = EXIT_DEAD
      
      This causes a polling task to wait forever, since poll blocks because
      exit_state is 0 and the waiting task is not notified again. A stress
      test continuously doing pidfd poll and process exits uncovered this bug.
      
      To fix it, we make sure that the task's exit_state is always set before
      calling do_notify_pidfd.
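
The two orderings in the diagrams above can be replayed deterministically in a small userspace model. This is a sketch only: the `toy_` names and `TOY_EXIT_ZOMBIE` are hypothetical, and a single function call stands in for the moment a woken poller re-checks the task.

```c
#define TOY_EXIT_ZOMBIE 1	/* any non-zero value means "has exited" here */

struct toy_task {
	int exit_state;
};

/* Models the check a poller makes: ready only once exit_state is set. */
static int toy_pidfd_poll(const struct toy_task *tsk)
{
	return tsk->exit_state != 0;
}

/*
 * Models the exit path. The call to toy_pidfd_poll() marks the instant a
 * notified poller re-checks the task; the return value is what it observed.
 */
static int toy_exit_notify(struct toy_task *tsk, int set_state_first)
{
	int poller_saw_exit;

	if (set_state_first)
		tsk->exit_state = TOY_EXIT_ZOMBIE;
	poller_saw_exit = toy_pidfd_poll(tsk);	/* notification fires here */
	if (!set_state_first)
		tsk->exit_state = TOY_EXIT_ZOMBIE;
	return poller_saw_exit;
}
```

With `set_state_first` equal to 0 the poller observes no exit and, since no second notification follows, would block indefinitely; with it equal to 1 the poller always sees the exit.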
      
      Fixes: b53b0b9d ("pidfd: add polling support")
      Cc: kernel-team@android.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
      [christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
      Signed-off-by: Christian Brauner <christian@brauner.io>
    • powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails · 3a855b7a
      Vaibhav Jain authored
      In some cases initial bind of scm memory for an lpar can fail if
      previously it wasn't released using a scm-unbind hcall. This situation
      can arise due to panic of the previous kernel or forced lpar
      fadump. In such cases the H_SCM_BIND_MEM hcall returns an H_OVERLAP error.
      
      To mitigate such cases the patch updates papr_scm_probe() to force a
      call to drc_pmem_unbind() in case the initial bind of scm memory fails
      with an EBUSY error. If the scm-bind operation fails again after the
      forced scm-unbind, we follow the existing error path. We also
      update drc_pmem_bind() to handle the H_OVERLAP error returned by phyp
      and report it as an EBUSY error back to the caller.
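
The bind, forced unbind, and single retry described above can be sketched as plain C. This is a userspace illustration with hypothetical `toy_` helpers standing in for the hypervisor calls; it is not the driver's actual code.

```c
#include <errno.h>

/* Toy view of one SCM region as the hypervisor might track it. */
struct toy_scm {
	int stale_binding;	/* left bound by a previous kernel */
	int bound;		/* bound by the current kernel */
};

/* Binding fails with -EBUSY while a stale binding is still in place. */
static int toy_bind(struct toy_scm *scm)
{
	if (scm->stale_binding)
		return -EBUSY;
	scm->bound = 1;
	return 0;
}

/* Unbinding clears whatever binding exists, stale or current. */
static int toy_unbind(struct toy_scm *scm)
{
	scm->stale_binding = 0;
	scm->bound = 0;
	return 0;
}

/* Probe: try to bind; on -EBUSY force an unbind and retry exactly once. */
static int toy_probe(struct toy_scm *scm)
{
	int rc = toy_bind(scm);

	if (rc == -EBUSY) {
		toy_unbind(scm);
		rc = toy_bind(scm);
	}
	return rc;	/* any remaining error takes the normal error path */
}
```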
      Suggested-by: "Oliver O'Halloran" <oohall@gmail.com>
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190629160610.23402-4-vaibhav@linux.ibm.com
    • powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL · 0d7fc080
      Vaibhav Jain authored
      A new hcall named H_SCM_UNBIND_ALL has been introduced that can
      unbind all or specific scm memory assigned to an lpar. This is
      more efficient than using H_SCM_UNBIND_MEM as currently we don't
      support partial unbind of scm memory.
      
      Hence this patch proposes the following changes to drc_pmem_unbind():
      
          * Update drc_pmem_unbind() to replace hcall H_SCM_UNBIND_MEM with
            H_SCM_UNBIND_ALL.
      
          * Update drc_pmem_unbind() to handle cases when PHYP asks the guest
            kernel to wait for specific amount of time before retrying the
            hcall via the 'LONG_BUSY' return value.
      
          * Ensure appropriate error code is returned back from the function
            in case of an error.
      Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190629160610.23402-3-vaibhav@linux.ibm.com
    • powerpc/pseries: Update SCM hcall op-codes in hvcall.h · 6d140e75
      Vaibhav Jain authored
      Update hvcall.h to include op-codes for the new hcalls introduced to
      manage SCM memory. Also update existing hcall definitions to reflect
      the current PAPR specification for SCM.
      
      The removed hcall op-codes H_SCM_MEM_QUERY and H_SCM_BLOCK_CLEAR were
      transient proposals; support for them was never implemented by
      Power-VM, nor were they used anywhere in the Linux kernel. Hence we don't
      expect anyone to be impacted by this change.
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190629160610.23402-2-vaibhav@linux.ibm.com
    • KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free · c6bf2ae9
      Jan Kiszka authored
      This should help find use-after-free bugs earlier.
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li authored
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server. PAGE_ALLOC_COSTLY_ORDER (3)
      is the order at which allocations are deemed costly to service. In serverless
      scenarios, one host can serve hundreds or thousands of firecracker/kata-container
      instances; however, a new instance will fail to launch once memory is too
      fragmented to allocate the kvm_vcpu struct on the host. This was observed in some
      cloud provider product environments.
      
      This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
      Skylake server.
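
To see why the two sizes matter, the page allocation order each would need can be worked out with a few lines of C. This sketch assumes 4 KiB pages and a power-of-two run of contiguous pages backing the object; `toy_alloc_order` is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size, i.e. how many
 * times a single page must be doubled to hold the object.
 */
static unsigned int toy_alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under those assumptions, 19456 bytes needs an order-3 run (eight contiguous pages, 32768 bytes), while 15168 bytes fits in an order-2 run (four pages, 16384 bytes), which is easier to find on a fragmented host.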
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix fpu state crash in kvm guest · e7517324
      Wanpeng Li authored
      The idea before commit 240c35a3 (which has just been reverted)
      was that we have the following FPU states:
      
                     userspace (QEMU)             guest
      ---------------------------------------------------------------------------
                     processor                    vcpu->arch.guest_fpu
      >>> KVM_RUN: kvm_load_guest_fpu
                     vcpu->arch.user_fpu          processor
      >>> preempt out
                     vcpu->arch.user_fpu          current->thread.fpu
      >>> preempt in
                     vcpu->arch.user_fpu          processor
      >>> back to userspace
      >>> kvm_put_guest_fpu
                     processor                    vcpu->arch.guest_fpu
      ---------------------------------------------------------------------------
      
      With the new lazy model we want to restore the state to the processor
      from current->thread.fpu when scheduling in.
      Reported-by: Thomas Lambertz <mail@thomaslambertz.de>
      Reported-by: anthony <antdev66@gmail.com>
      Tested-by: anthony <antdev66@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Lambertz <mail@thomaslambertz.de>
      Cc: anthony <antdev66@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: 5f409e20 (x86/fpu: Defer FPU state load until return to userspace)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add a comment in front of the warning. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Paolo Bonzini authored
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested · cf64527b
      Jan Kiszka authored
      Letting this pend may cause nested_get_vmcs12_pages to run against an
      invalid state, corrupting the effective vmcs of L1.
      
      This was triggerable in QEMU after a guest corruption in L2, followed by
      a L1 reset.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 7f7f1ba3 ("KVM: x86: do not load vmcs12 pages while still in SMM")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • tcp: be more careful in tcp_fragment() · b617158d
      Eric Dumazet authored
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15
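
The policy described above (exempt the first and last skb of the retransmit queue, and allow some slack over the send buffer) can be written as a small predicate. This is a simplified sketch with hypothetical names; the exact comparison and slack value used by the kernel are not reproduced here, and `slack` is left as a parameter for that reason.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if splitting an skb should be refused because queued
 * memory exceeds the send buffer plus some slack. The head and tail of
 * the retransmit queue are always allowed to split so that flows with
 * tiny SO_SNDBUF values can still make progress after losses.
 */
static bool toy_fragment_refused(size_t wmem_queued, size_t sndbuf,
				 size_t slack, bool is_rtx_head,
				 bool is_rtx_tail)
{
	if (is_rtx_head || is_rtx_tail)
		return false;
	return wmem_queued > sndbuf + slack;
}
```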
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback() · be4363bd
      Haiyang Zhang authored
      There is an extra rcu_read_unlock left in netvsc_recv_callback(),
      after a previous patch that removes RCU from this function.
      This patch removes the extra RCU unlock.
      
      Fixes: 345ac089 ("hv_netvsc: pass netvsc_device to receive callback")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • powerpc/tm: Fix oops on sigreturn on systems without TM · f16d80b7
      Michael Neuling authored
      On systems like P9 powernv where we have no TM (or P8 booted with
      ppc_tm=off), userspace can construct a signal context which still has
      the MSR TS bits set. The kernel tries to restore this context which
      results in the following crash:
      
        Unexpected TM Bad Thing exception at c0000000000022fc (msr 0x8000000102a03031) tm_scratch=800000020280f033
        Oops: Unrecoverable exception, sig: 6 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 0 PID: 1636 Comm: sigfuz Not tainted 5.2.0-11043-g0a8ad0ff #69
        NIP:  c0000000000022fc LR: 00007fffb2d67e48 CTR: 0000000000000000
        REGS: c00000003fffbd70 TRAP: 0700   Not tainted  (5.2.0-11045-g7142b497d8)
        MSR:  8000000102a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[E]>  CR: 42004242  XER: 00000000
        CFAR: c0000000000022e0 IRQMASK: 0
        GPR00: 0000000000000072 00007fffb2b6e560 00007fffb2d87f00 0000000000000669
        GPR04: 00007fffb2b6e728 0000000000000000 0000000000000000 00007fffb2b6f2a8
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR12: 0000000000000000 00007fffb2b76900 0000000000000000 0000000000000000
        GPR16: 00007fffb2370000 00007fffb2d84390 00007fffea3a15ac 000001000a250420
        GPR20: 00007fffb2b6f260 0000000010001770 0000000000000000 0000000000000000
        GPR24: 00007fffb2d843a0 00007fffea3a14a0 0000000000010000 0000000000800000
        GPR28: 00007fffea3a14d8 00000000003d0f00 0000000000000000 00007fffb2b6e728
        NIP [c0000000000022fc] rfi_flush_fallback+0x7c/0x80
        LR [00007fffb2d67e48] 0x7fffb2d67e48
        Call Trace:
        Instruction dump:
        e96a0220 e96a02a8 e96a0330 e96a03b8 394a0400 4200ffdc 7d2903a6 e92d0c00
        e94d0c08 e96d0c10 e82d0c18 7db242a6 <4c000024> 7db243a6 7db142a6 f82d0c18
      
      The problem is the signal code assumes TM is enabled when
      CONFIG_PPC_TRANSACTIONAL_MEM is enabled. This may not be the case as
      with P9 powernv or if `ppc_tm=off` is used on P8.
      
      This means any local user can crash the system.
      
      Fix the problem by returning a bad stack frame to the user if they try
      to set the MSR TS bits with sigreturn() on systems where TM is not
      supported.
      
      Found with sigfuz kernel selftest on P9.
      
      This fixes CVE-2019-13648.
      
      Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context")
      Cc: stable@vger.kernel.org # v3.9
      Reported-by: Praveen Pandey <Praveen.Pandey@in.ibm.com>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190719050502.405-1-mikey@neuling.org
  5. 21 Jul, 2019 13 commits
    • Linus 5.3-rc1 · 5f9e832c
      Linus Torvalds authored
    • vrf: make sure skb->data contains ip header to make routing · 107e47cc
      Peter Kosyh authored
      vrf_process_v4_outbound() and vrf_process_v6_outbound() do routing
      using ip/ipv6 addresses, but don't make sure the header is available
      in skb->data[] (skb_headlen() is less than the header size).
      
      Case:
      
      1) igb driver from Intel.
      2) Packet size is greater than 255.
      3) MPLS forwards to VRF device.
      
      So the patch adds pskb_may_pull() calls to the
      vrf_process_v4/v6_outbound() functions.
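
A minimal user-space sketch of the idea follows. The `pkt_model` structure, `may_pull_model()` and `outbound_v4_model()` are hypothetical stand-ins, not kernel APIs; the model only tracks lengths and pretends that pulling bytes into the linear area always succeeds when the packet is long enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an sk_buff: only the two lengths matter here. */
struct pkt_model {
	size_t headlen; /* bytes available in the linear area (skb->data) */
	size_t len;     /* total packet bytes, linear plus paged fragments */
};

/* Models pskb_may_pull(): make the first 'need' bytes linear if possible. */
static bool may_pull_model(struct pkt_model *p, size_t need)
{
	if (need <= p->headlen)
		return true;   /* already linear, nothing to do */
	if (need > p->len)
		return false;  /* packet is too short for the header */
	p->headlen = need;     /* pretend the bytes were copied in */
	return true;
}

#define MODEL_IPV4_HDR_LEN 20 /* minimal IPv4 header, no options */

/* Hypothetical outbound step: only read the header once it is linear. */
static bool outbound_v4_model(struct pkt_model *p)
{
	if (!may_pull_model(p, MODEL_IPV4_HDR_LEN))
		return false;  /* drop instead of reading past skb->data */
	/* ... the IP header could now be read safely ... */
	return true;
}

/* Small usage example: a fully paged packet and a truncated one. */
static void outbound_v4_demo(void)
{
	struct pkt_model paged = { .headlen = 0, .len = 300 };
	struct pkt_model runt = { .headlen = 0, .len = 10 };

	assert(outbound_v4_model(&paged) && paged.headlen == 20);
	assert(!outbound_v4_model(&runt));
}
```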
      Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      107e47cc
    • Vasily Averin's avatar
      connector: remove redundant input callback from cn_dev · 903e9d1b
      Vasily Averin authored
      A small cleanup: this callback is never used.
      Originally fixed by Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
      for OpenVZ7 bug OVZ-6877
      
      cc: stanislav.kinsburskiy@gmail.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      903e9d1b
    • Frederick Lawler's avatar
      qed: Prefer pcie_capability_read_word() · 93428c58
      Frederick Lawler authored
      Commit 8c0d3a02 ("PCI: Add accessors for PCI Express Capability")
      added accessors for the PCI Express Capability so that drivers didn't
      need to be aware of differences between v1 and v2 of the PCI
      Express Capability.
      
      Replace pci_read_config_word() and pci_write_config_word() calls with
      pcie_capability_read_word() and pcie_capability_write_word().
      Signed-off-by: Frederick Lawler <fred@fredlawl.com>
      Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93428c58
    • Frederick Lawler's avatar
      igc: Prefer pcie_capability_read_word() · a16f6d3a
      Frederick Lawler authored
      Commit 8c0d3a02 ("PCI: Add accessors for PCI Express Capability")
      added accessors for the PCI Express Capability so that drivers didn't
      need to be aware of differences between v1 and v2 of the PCI
      Express Capability.
      
      Replace pci_read_config_word() and pci_write_config_word() calls with
      pcie_capability_read_word() and pcie_capability_write_word().
      Signed-off-by: Frederick Lawler <fred@fredlawl.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a16f6d3a
    • Frederick Lawler's avatar
      cxgb4: Prefer pcie_capability_read_word() · 6133b920
      Frederick Lawler authored
      Commit 8c0d3a02 ("PCI: Add accessors for PCI Express Capability")
      added accessors for the PCI Express Capability so that drivers didn't
      need to be aware of differences between v1 and v2 of the PCI
      Express Capability.
      
      Replace pci_read_config_word() and pci_write_config_word() calls with
      pcie_capability_read_word() and pcie_capability_write_word().
      Signed-off-by: Frederick Lawler <fred@fredlawl.com>
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6133b920
    • Benjamin Poirier's avatar
      be2net: Synchronize be_update_queues with dev_watchdog · ffd342e0
      Benjamin Poirier authored
      As pointed out by Firo Yang, a netdev tx timeout may trigger just before an
      ethtool set_channels operation is started. be_tx_timeout(), which dumps
      some queue structures, is not written to run concurrently with
      be_update_queues(), which frees/allocates those queue structures. Add some
      synchronization between the two.
      
      Message-id: <CH2PR18MB31898E033896F9760D36BFF288C90@CH2PR18MB3189.namprd18.prod.outlook.com>
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ffd342e0
    • Brian King's avatar
      bnx2x: Prevent load reordering in tx completion processing · ea811b79
      Brian King authored
      This patch fixes an issue seen on Power systems with bnx2x, in which
      the "skb is NULL" WARN_ON in bnx2x_free_tx_pkt fires because the skb
      pointer is loaded in bnx2x_free_tx_pkt before the hw_cons load in
      bnx2x_tx_int. Adding a read memory barrier resolves the issue.
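
The ordering problem can be illustrated with a user-space analogue built on C11 atomics. This is a sketch under assumptions, not the driver code: `tx_ring_model`, `producer_complete()` and `consumer_peek()` are hypothetical names, and an acquire fence stands in for the kernel's read barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8

/* Toy TX ring: a slot array plus a consumer index published by a producer. */
struct tx_ring_model {
	void *skb[RING_SIZE];
	_Atomic unsigned int hw_cons;
};

/* Producer side: fill the slot first, then publish the new index. */
static void producer_complete(struct tx_ring_model *r, unsigned int idx,
			      void *skb)
{
	r->skb[idx % RING_SIZE] = skb;
	atomic_store_explicit(&r->hw_cons, idx + 1, memory_order_release);
}

/* Consumer side: read the index, then (and only then) read the slot. */
static void *consumer_peek(struct tx_ring_model *r, unsigned int sw_cons)
{
	unsigned int hw = atomic_load_explicit(&r->hw_cons,
					       memory_order_relaxed);

	if (hw == sw_cons)
		return NULL; /* nothing completed yet */

	/*
	 * Without this fence, a weakly ordered CPU may perform the slot
	 * load below before the index load above and observe a stale
	 * (NULL) pointer even though the index says the slot is ready.
	 */
	atomic_thread_fence(memory_order_acquire);
	return r->skb[sw_cons % RING_SIZE];
}

/* Single-threaded usage example; it shows the API, not the race itself. */
static void tx_ring_demo(void)
{
	struct tx_ring_model r = { 0 };
	int packet;

	assert(consumer_peek(&r, 0) == NULL);
	producer_complete(&r, 0, &packet);
	assert(consumer_peek(&r, 0) == &packet);
}
```

Strongly ordered CPUs do not reorder loads this way, which may be why the issue was reported on Power systems.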
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea811b79
    • Andrew Lunn's avatar
      net: phy: sfp: hwmon: Fix scaling of RX power · 0cea0e11
      Andrew Lunn authored
      The RX power read from the SFP uses units of 0.1uW. This must be
      scaled to units of uW for HWMON. This requires a divide by 10, not the
      current 100.
      
      With this change in place, sensors(1) and ethtool -m agree:
      
      sff2-isa-0000
      Adapter: ISA adapter
      in0:          +3.23 V
      temp1:        +33.1 C
      power1:      270.00 uW
      power2:      200.00 uW
      curr1:        +0.01 A
      
              Laser output power                        : 0.2743 mW / -5.62 dBm
              Receiver signal average optical power     : 0.2014 mW / -6.96 dBm
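
The scaling arithmetic can be checked with a small sketch. The function names are hypothetical and rounding to nearest is an assumption; only the divisors (100 before, 10 after) come from the description above. A reading of 0.2014 mW equals 201.4 uW, which is a raw value of 2014 in units of 0.1 uW.

```c
#include <assert.h>

/* Old scaling (wrong): divides by 100, so the result is 10x too small. */
static unsigned int rx_power_uw_old(unsigned int raw_tenth_uw)
{
	return (raw_tenth_uw + 50) / 100;
}

/* Fixed scaling: 0.1 uW units to uW is a divide by 10. */
static unsigned int rx_power_uw_fixed(unsigned int raw_tenth_uw)
{
	return (raw_tenth_uw + 5) / 10;
}

/* Usage example with the raw value derived from the ethtool output. */
static void rx_power_demo(void)
{
	assert(rx_power_uw_old(2014) == 20);    /* about 0.02 mW: wrong */
	assert(rx_power_uw_fixed(2014) == 201); /* about 0.2 mW: matches */
}
```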
      
      Reported-by: chris.healy@zii.aero
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Fixes: 1323061a ("net: phy: sfp: Add HWMON support for module sensors")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cea0e11
    • Vlad Buslov's avatar
      net: sched: verify that q!=NULL before setting q->flags · 503d81d4
      Vlad Buslov authored
      In tc_new_tfilter(), the q pointer can be NULL when adding a filter
      on a shared block. With the recent change that resets TCQ_F_CAN_BYPASS
      after filter creation, the following NULL pointer dereference happens
      when the parent block is shared:
      
      [  212.925060] BUG: kernel NULL pointer dereference, address: 0000000000000010
      [  212.925445] #PF: supervisor write access in kernel mode
      [  212.925709] #PF: error_code(0x0002) - not-present page
      [  212.925965] PGD 8000000827923067 P4D 8000000827923067 PUD 827924067 PMD 0
      [  212.926302] Oops: 0002 [#1] SMP KASAN PTI
      [  212.926539] CPU: 18 PID: 2617 Comm: tc Tainted: G    B             5.2.0+ #512
      [  212.926938] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  212.927364] RIP: 0010:tc_new_tfilter+0x698/0xd40
      [  212.927633] Code: 74 0d 48 85 c0 74 08 48 89 ef e8 03 aa 62 00 48 8b 84 24 a0 00 00 00 48 8d 78 10 48 89 44 24 18 e8 4d 0c 6b ff 48 8b 44 24 18 <83> 60 10 fb 48 85 ed 0f 85 3d fe ff ff e9 4f fe ff ff e8 81 26 f8
      [  212.928607] RSP: 0018:ffff88884fd5f5d8 EFLAGS: 00010296
      [  212.928905] RAX: 0000000000000000 RBX: 0000000000000000 RCX: dffffc0000000000
      [  212.929201] RDX: 0000000000000007 RSI: 0000000000000004 RDI: 0000000000000297
      [  212.929402] RBP: ffff88886bedd600 R08: ffffffffb91d4b51 R09: fffffbfff7616e4d
      [  212.929609] R10: fffffbfff7616e4c R11: ffffffffbb0b7263 R12: ffff88886bc61040
      [  212.929803] R13: ffff88884fd5f950 R14: ffffc900039c5000 R15: ffff88835e927680
      [  212.929999] FS:  00007fe7c50b6480(0000) GS:ffff88886f980000(0000) knlGS:0000000000000000
      [  212.930235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  212.930394] CR2: 0000000000000010 CR3: 000000085bd04002 CR4: 00000000001606e0
      [  212.930588] Call Trace:
      [  212.930682]  ? tc_del_tfilter+0xa40/0xa40
      [  212.930811]  ? __lock_acquire+0x5b5/0x2460
      [  212.930948]  ? find_held_lock+0x85/0xa0
      [  212.931081]  ? tc_del_tfilter+0xa40/0xa40
      [  212.931201]  rtnetlink_rcv_msg+0x4ab/0x5f0
      [  212.931332]  ? rtnl_dellink+0x490/0x490
      [  212.931454]  ? lockdep_hardirqs_on+0x260/0x260
      [  212.931589]  ? netlink_deliver_tap+0xab/0x5a0
      [  212.931717]  ? match_held_lock+0x1b/0x240
      [  212.931844]  netlink_rcv_skb+0xd0/0x200
      [  212.931958]  ? rtnl_dellink+0x490/0x490
      [  212.932079]  ? netlink_ack+0x440/0x440
      [  212.932205]  ? netlink_deliver_tap+0x161/0x5a0
      [  212.932335]  ? lock_downgrade+0x360/0x360
      [  212.932457]  ? lock_acquire+0xe5/0x210
      [  212.932579]  netlink_unicast+0x296/0x350
      [  212.932705]  ? netlink_attachskb+0x390/0x390
      [  212.932834]  ? _copy_from_iter_full+0xe0/0x3a0
      [  212.932976]  netlink_sendmsg+0x394/0x600
      [  212.937998]  ? netlink_unicast+0x350/0x350
      [  212.943033]  ? move_addr_to_kernel.part.0+0x90/0x90
      [  212.948115]  ? netlink_unicast+0x350/0x350
      [  212.953185]  sock_sendmsg+0x96/0xa0
      [  212.958099]  ___sys_sendmsg+0x482/0x520
      [  212.962881]  ? match_held_lock+0x1b/0x240
      [  212.967618]  ? copy_msghdr_from_user+0x250/0x250
      [  212.972337]  ? lock_downgrade+0x360/0x360
      [  212.976973]  ? rwlock_bug.part.0+0x60/0x60
      [  212.981548]  ? __mod_node_page_state+0x1f/0xa0
      [  212.986060]  ? match_held_lock+0x1b/0x240
      [  212.990567]  ? find_held_lock+0x85/0xa0
      [  212.994989]  ? do_user_addr_fault+0x349/0x5b0
      [  212.999387]  ? lock_downgrade+0x360/0x360
      [  213.003713]  ? find_held_lock+0x85/0xa0
      [  213.007972]  ? __fget_light+0xa1/0xf0
      [  213.012143]  ? sockfd_lookup_light+0x91/0xb0
      [  213.016165]  __sys_sendmsg+0xba/0x130
      [  213.020040]  ? __sys_sendmsg_sock+0xb0/0xb0
      [  213.023870]  ? handle_mm_fault+0x337/0x470
      [  213.027592]  ? page_fault+0x8/0x30
      [  213.031316]  ? lockdep_hardirqs_off+0xbe/0x100
      [  213.034999]  ? mark_held_locks+0x24/0x90
      [  213.038671]  ? do_syscall_64+0x1e/0xe0
      [  213.042297]  do_syscall_64+0x74/0xe0
      [  213.045828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  213.049354] RIP: 0033:0x7fe7c527c7b8
      [  213.052792] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
      [  213.060269] RSP: 002b:00007ffc3f7908a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  213.064144] RAX: ffffffffffffffda RBX: 000000005d34716f RCX: 00007fe7c527c7b8
      [  213.068094] RDX: 0000000000000000 RSI: 00007ffc3f790910 RDI: 0000000000000003
      [  213.072109] RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fe7c5340cc0
      [  213.076113] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000080
      [  213.080146] R13: 0000000000480640 R14: 0000000000000080 R15: 0000000000000000
      [  213.084147] Modules linked in: act_gact cls_flower sch_ingress nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc sunrpc intel_rapl_msr intel_rapl_common sb_edac rdma_ucm rdma_cm x86_pkg_temp_thermal iw_cm intel_powerclamp ib_cm coretemp kvm_intel kvm irqbypass mlx5_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mlx5_core intel_cstate intel_uncore iTCO_wdt igb iTCO_vendor_support mlxfw mei_me ptp ses intel_rapl_perf mei pcspkr ipmi_ssif i2c_i801 joydev enclosure pps_core lpc_ich ioatdma wmi dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper drm mpt3sas raid_class scsi_transport_sas
      [  213.112326] CR2: 0000000000000010
      [  213.117429] ---[ end trace adb58eb0a4ee6283 ]---
      
      Verify that q pointer is not NULL before setting the 'flags' field.
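
The guard can be sketched in user space as follows. `qdisc_model` and `clear_can_bypass()` are hypothetical names, and the flag value is an assumption made for illustration. The `flags` field is placed at offset 0x10 only to mirror the faulting address in the oops above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TCQ_F_CAN_BYPASS 4u /* assumed flag value, illustration only */

/* Toy stand-in for struct Qdisc with 'flags' at offset 0x10. */
struct qdisc_model {
	void *pad[2];       /* 16 bytes on a 64-bit target */
	unsigned int flags;
};

/* Clear the bypass flag only when a qdisc is actually attached. */
static void clear_can_bypass(struct qdisc_model *q)
{
	if (q) /* q is NULL when the filter is added on a shared block */
		q->flags &= ~MODEL_TCQ_F_CAN_BYPASS;
}

/* Usage example: the NULL case is now a no-op instead of an oops. */
static void clear_can_bypass_demo(void)
{
	struct qdisc_model q = { .flags = 0x7 };

	clear_can_bypass(&q);
	assert(q.flags == 0x3);
	clear_can_bypass(NULL);
}
```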
      
      Fixes: 3f05e688 ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      503d81d4
    • Christophe JAILLET's avatar
      chelsio: Fix a typo in a function name · 85d9bf97
      Christophe JAILLET authored
      It is likely that 'my3216_poll()' should be 'my3126_poll()' (1 and 2
      switched in 3126).
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85d9bf97
    • Navid Emamdoost's avatar
      allocate_flower_entry: should check for null deref · bb132083
      Navid Emamdoost authored
      allocate_flower_entry() does not check for allocation success, but tries
      to dereference the result. I only moved the spin_lock under the null
      check, because the caller is checking the allocation's status at line 652.
      Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb132083
    • Christophe JAILLET's avatar
      net: hns3: typo in the name of a constant · 4803d010
      Christophe JAILLET authored
      All constants in 'enum HCLGE_MBX_OPCODE' start with HCLGE, except
      'HLCGE_MBX_PUSH_VLAN_INFO' (C and L switched).
      
      s/HLC/HCL/
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4803d010