1. 20 Aug, 2021 18 commits
  2. 13 Aug, 2021 16 commits
    • Peter Xu's avatar
      KVM: Allow to have arch-specific per-vm debugfs files · 3165af73
      Peter Xu authored
      Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory
      besides the stats fields.  The new interface kvm_arch_create_vm_debugfs() is
      defined but not yet used.  It's called after kvm->debugfs_dentry is created, so
      it can be referenced directly in kvm_arch_create_vm_debugfs().  Arch should
      define their own versions when they want to create extra debugfs nodes.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-2-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3165af73
    • Sean Christopherson's avatar
      KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter · f7782bb8
      Sean Christopherson authored
      Clear nested.pi_pending on nested VM-Enter even if L2 will run without
      posted interrupts enabled.  If nested.pi_pending is left set from a
      previous L2, vmx_complete_nested_posted_interrupt() will pick up the
      stale flag and exit to userspace with an "internal emulation error" due
      the new L2 not having a valid nested.pi_desc.
      
      Arguably, vmx_complete_nested_posted_interrupt() should first check for
      posted interrupts being enabled, but it's also completely reasonable that
      KVM wouldn't screw up a fundamental flag.  Not to mention that the mere
      existence of nested.pi_pending is a long-standing bug as KVM shouldn't
      move the posted interrupt out of the IRR until it's actually processed,
      e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
      with a "pending" posted interrupt.  Fixing the mess is a future problem.
      
      Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
      descriptor as an error, this was a benign bug as the null PI descriptor
      effectively served as a check on PI not being enabled.  Even then, the
      new flow did not become problematic until KVM started checking the result
      of kvm_check_nested_events().
      
      Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing")
      Fixes: 966eefb8 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
      Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
      Cc: stable@vger.kernel.org
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810144526.2662272-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f7782bb8
    • Like Xu's avatar
      KVM: x86: Clean up redundant ROL16(val, n) macro definition · c1a527a1
      Like Xu authored
      The ROL16(val, n) macro is repeatedly defined in several vmcs-related
      files, and it has never been used outside the KVM context.
      
      Let's move it to vmcs.h without any intended functional changes.
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20210809093410.59304-4-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c1a527a1
    • Uros Bizjak's avatar
      KVM: x86: Move declaration of kvm_spurious_fault() to x86.h · 65297341
      Uros Bizjak authored
      Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
      it should never be called by anything other than low level KVM code.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarUros Bizjak <ubizjak@gmail.com>
      [sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210809173955.1710866-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      65297341
    • Sean Christopherson's avatar
      KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot() · ad0577c3
      Sean Christopherson authored
      Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
      VMX and SVM instructions use asm goto to handle the fault (or in the
      case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
      asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
      flow that invoked it from assembly code.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Cc: Like Xu <like.xu.linux@gmail.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210809173955.1710866-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ad0577c3
    • Sean Christopherson's avatar
      KVM: VMX: Hide VMCS control calculators in vmx.c · 2fba4fc1
      Sean Christopherson authored
      Now that nested VMX pulls KVM's desired VMCS controls from vmcs01 instead
      of re-calculating on the fly, bury the helpers that do the calcluations
      in vmx.c.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2fba4fc1
    • Sean Christopherson's avatar
      KVM: VMX: Drop caching of KVM's desired sec exec controls for vmcs01 · b6247686
      Sean Christopherson authored
      Remove the secondary execution controls cache now that it's effectively
      dead code; it is only read immediately after it is written.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b6247686
    • Sean Christopherson's avatar
      KVM: nVMX: Pull KVM L0's desired controls directly from vmcs01 · 389ab252
      Sean Christopherson authored
      When preparing controls for vmcs02, grab KVM's desired controls from
      vmcs01's shadow state instead of recalculating the controls from scratch,
      or in the secondary execution controls, instead of using the dedicated
      cache.  Calculating secondary exec controls is eye-poppingly expensive
      due to the guest CPUID checks, hence the dedicated cache, but the other
      calculations aren't exactly free either.
      
      Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on
      exit) as appropriate as they may be set in vmcs01, whereas the previous
      implementation relied on dynamic bits being cleared in the calculator.
      
      Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from
      vmcs01 to vmcs02.  Whether or not PERF_GLOBAL_CTRL is loaded depends on
      whether or not perf itself is active, so unless perf stops between the
      exit from L1 and entry to L2, vmcs01 will hold the desired value.  This
      is purely an optimization as atomic_switch_perf_msrs() will set/clear
      the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in
      the case where perf is active (versus starting with the bits clear in
      vmcs02, which was the previous behavior).
      
      Cc: Zeng Guang <guang.zeng@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      389ab252
    • Paolo Bonzini's avatar
      KVM: stats: remove dead stores · ee3b6e41
      Paolo Bonzini authored
      These stores are copied and pasted from the "if" statements above.
      They are dead and while they are not really a bug, they can be
      confusing to anyone reading the code as well.  Remove them.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee3b6e41
    • Paolo Bonzini's avatar
      KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXIT · 1ccb6f98
      Paolo Bonzini authored
      The commit efdab992 ("KVM: x86: fix escape of guest dr6 to the host")
      fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out.
      
      But writing to debug registers is slow, and it can be visible in perf results
      sometimes, even if neither the host nor the guest activate breakpoints.
      
      Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
      where DR6 gets the guest value, and it never happens at all on SVM,
      the register can be cleared in vmx.c right after reading it.
      Reported-by: default avatarLai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ccb6f98
    • Paolo Bonzini's avatar
      KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT · 375e28ff
      Paolo Bonzini authored
      Commit c77fb5fe ("KVM: x86: Allow the guest to run with dirty debug
      registers") allows the guest accessing to DRs without exiting when
      KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized
      on entry to the guest---including DR6 that was not synced before the commit.
      
      But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
      but also when KVM_DEBUGREG_BP_ENABLED.  The second case is unnecessary
      and just leads to a more case which leaks stale DR6 to the host which has
      to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put().
      
      Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
      on VMX because SVM always uses the DR6 value from the VMCB.  So move this
      line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.
      Reported-by: default avatarLai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      375e28ff
    • Lai Jiangshan's avatar
      KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD · 34e9f860
      Lai Jiangshan authored
      Commit ae561ede ("KVM: x86: DR0-DR3 are not clear on reset") added code to
      ensure eff_db are updated when they're modified through non-standard paths.
      
      But there is no reason to also update hardware DRs unless hardware breakpoints
      are active or DR exiting is disabled, and in those cases updating hardware is
      handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.
      
      KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better
      to be removed.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      34e9f860
    • Paolo Bonzini's avatar
      Merge branch 'kvm-tdpmmu-fixes' into HEAD · 9a63b451
      Paolo Bonzini authored
      Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
      9a63b451
    • Sean Christopherson's avatar
      KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson authored
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ce25681d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Sean Christopherson authored
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0103098f
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Sean Christopherson authored
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
      
      Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      524a1e4e
  3. 10 Aug, 2021 2 commits
    • Paolo Bonzini's avatar
      Merge branch 'kvm-vmx-secctl' into HEAD · c3e9434c
      Paolo Bonzini authored
      Merge common topic branch for 5.14-rc6 and 5.15 merge window.
      c3e9434c
    • Sean Christopherson's avatar
      KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation · 7b9cae02
      Sean Christopherson authored
      Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
      effectively get the controls for the current VMCS, as opposed to using
      vmx->secondary_exec_controls, which is the cached value of KVM's desired
      controls for vmcs01 and truly not reflective of any particular VMCS.
      
      While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
      the same waitpkg configuration as vmx->secondary_exec_controls, the same
      does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
      If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
      L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7b9cae02
  4. 06 Aug, 2021 4 commits