1. 19 Jan, 2022 28 commits
    • KVM: VMX: Don't do full kick when triggering posted interrupt "fails" · 0f65a9d3
      Sean Christopherson authored
      Replace the full "kick" with just the "wake" in the fallback path when
      triggering a virtual interrupt via a posted interrupt fails because the
      guest is not IN_GUEST_MODE.  If the guest transitions into guest mode
      between the check and the kick, then it's guaranteed to see the pending
      interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
      IN_GUEST_MODE.  Kicking the guest in this case is nothing more than an
      unnecessary VM-Exit (and host IRQ).
      
      Opportunistically update comments to explain the various ordering rules
      and barriers at play.
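      
      In sketch form, the fallback now looks roughly like this (names
      simplified; this is a sketch of the pattern, not the exact KVM code):
      
        if (vcpu->mode == IN_GUEST_MODE) {
                /* Target vCPU is running: send the notification IPI so
                 * the pCPU processes the PIR without a VM-Exit. */
                apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
                                    POSTED_INTR_VECTOR);
        } else {
                /* Not in guest mode: the PIR is synced to the IRR after
                 * IN_GUEST_MODE is set, so a lightweight wake suffices;
                 * a full kick would only force a spurious VM-Exit. */
                kvm_vcpu_wake_up(vcpu);
        }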
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0f65a9d3
    • KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU · 782f6455
      Sean Christopherson authored
      Don't bother updating the Physical APIC table or IRTE when loading a vCPU
      that is blocking, i.e. won't be marked IsRun{ning}=1, as the pCPU is
      queried if and only if IsRunning is '1'.  If the vCPU was migrated, the
      new pCPU will be picked up when avic_vcpu_load() is called by
      svm_vcpu_unblocking().
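      
      The early-out described above might look roughly like this (a sketch;
      the actual avic_vcpu_load() body and helpers may differ in detail):
      
        static void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
        {
                /*
                 * Nothing to do for a blocking vCPU: IsRunning stays 0,
                 * so the Physical APIC table and IRTE are not consulted,
                 * and this runs again from svm_vcpu_unblocking().
                 */
                if (kvm_vcpu_is_blocking(vcpu))
                        return;
      
                /* ... update Physical APIC table entry and IRTE for @cpu ... */
        }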
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      782f6455
    • KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption · af52f5aa
      Sean Christopherson authored
      Use kvm_vcpu_is_blocking() to determine whether or not the vCPU should be
      marked running during avic_vcpu_load().  Drop avic_is_running, which
      really should have been named "vcpu_is_not_blocking", as it tracked
      whether or not the vCPU was blocking, not whether it was actually
      running; e.g. it was set during svm_create_vcpu() when the vCPU was
      obviously not running.
      
      This is technically a teeny tiny functional change, as the vCPU will be
      marked IsRunning=1 on being reloaded if the vCPU is preempted between
      svm_vcpu_blocking() and prepare_to_rcuwait().  But that's a benign change
      as the vCPU will be marked IsRunning=0 when KVM voluntarily schedules out
      the vCPU.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      af52f5aa
    • KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path · e422b889
      Sean Christopherson authored
      Remove handling of KVM_REQ_APICV_UPDATE from svm_vcpu_unblocking(); it's
      no longer needed, as it was made obsolete by commit df7e4827 ("KVM:
      SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC").
      Prior to that commit, the manual check was necessary to ensure the AVIC
      stuff was updated by avic_set_running() when a request to enable APICv
      became pending while the vCPU was blocking, as the request handling
      itself would not do the update.  But, as evidenced by the commit, that
      logic was flawed and subject to various races.
      
      Now that svm_refresh_apicv_exec_ctrl() does avic_vcpu_load/put() in
      response to an APICv status change, drop the manual check in the
      unblocking path.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e422b889
    • KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs · 202470d5
      Sean Christopherson authored
      Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
      VM-Exit due to incomplete IPI delivery.  The check isn't wrong per se, but
      it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
      was one of the vCPUs that didn't receive the IPI.
      
      The check isn't required for correctness as blocking == !running in this
      context.
      
      From a performance perspective, waking a live task is not expensive as the
      only moderately costly operation is a locked operation to temporarily
      disable preemption.  And if that is indeed a performance issue,
      kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      202470d5
    • KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode · 31f251d4
      Sean Christopherson authored
      Signal the AVIC doorbell iff the vCPU is running in the guest.  If the vCPU
      is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
      next VMRUN, which unconditionally processes the vIRR.
      
      Add comments to document the logic.
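      
      A sketch of the guarded doorbell write (macro and helper names may
      differ slightly from the actual code; the barrier pairs with one on
      the VMRUN path):
      
        /* Order the vIRR update against reading vcpu->mode. */
        smp_mb__after_atomic();
        if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
                wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(vcpu->cpu));
        else
                kvm_vcpu_wake_up(vcpu);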
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      31f251d4
    • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Sean Christopherson authored
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c3e8abf0
    • KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers · b6d42bad
      Sean Christopherson authored
      Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
      transitions.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b6d42bad
    • KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 · 98c25ead
      Sean Christopherson authored
      Handle the switch to/from the hypervisor/software timer when a vCPU is
      blocking in common x86 instead of in VMX.  Even though VMX is the only
      user of a hypervisor timer, the logic and all functions involved are
      generic x86 (unless future CPUs do something completely different and
      implement a hypervisor timer that runs regardless of mode).
      
      Handling the switch in common x86 will allow for the elimination of the
      pre/post_block hooks, and also lets KVM switch back to the hypervisor
      timer if and only if it was in use (without additional params).  Add a
      comment explaining why the switch cannot be deferred to kvm_sched_out()
      or kvm_vcpu_block().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      98c25ead
    • KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx · 12a8eee5
      Sean Christopherson authored
      Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
      rename the list and all associated variables to clarify that it tracks
      the set of vCPUs that need to be poked on a posted interrupt to the wakeup
      vector.  The list is not used to track _all_ vCPUs that are blocking, and
      the term "blocked" can be misleading as it may refer to a blocking
      condition in the host or the guest, whereas the PI wakeup case is
      specifically for the vCPUs that are actively blocking from within the
      guest.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12a8eee5
    • KVM: Drop unused kvm_vcpu.pre_pcpu field · e6eec09b
      Sean Christopherson authored
      Remove kvm_vcpu.pre_pcpu as it no longer has any users.  No functional
      change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6eec09b
    • KVM: VMX: Handle PI descriptor updates during vcpu_put/load · d76fb406
      Sean Christopherson authored
      Move the posted interrupt pre/post_block logic into vcpu_put/load
      respectively, using kvm_vcpu_is_blocking() to determine whether or
      not the wakeup handler needs to be set (and unset).  This avoids updating
      the PI descriptor if halt-polling is successful, reduces the number of
      touchpoints for updating the descriptor, and eliminates the confusing
      behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
      is scheduled back in after preemption.
      
      The downside is that KVM will do the PID update twice if the vCPU is
      preempted after prepare_to_rcuwait() but before schedule(), but that's a
      rare case (and non-existent on !PREEMPT kernels).
      
      The notable wart is the need to send a self-IPI on the wakeup vector if
      an outstanding notification is pending after configuring the wakeup
      vector.  Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
      but the scheduler doesn't support waking a task from its preemption
      notifier callback, i.e. while the task is right in the middle of
      being scheduled out.
      
      Note, setting the wakeup vector before halt-polling is not necessary:
      once a pending IRQ is recorded in the PIR, kvm_vcpu_has_events()
      will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
      apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
      terminate the polling.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d76fb406
    • Merge branch 'kvm-pi-raw-spinlock' into HEAD · 4f5a884f
      Paolo Bonzini authored
      Bring in the fix for VT-d posted interrupts before changing the code further in 5.17.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f5a884f
    • KVM: avoid warning on s390 in mark_page_dirty · e09fccb5
      Christian Borntraeger authored
      Avoid warnings on s390 like
      [ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G            E     5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1
      [ 1801.980938] Workqueue: events irqfd_inject [kvm]
      [...]
      [ 1801.981057] Call Trace:
      [ 1801.981060]  [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm]
      [ 1801.981083]  [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm]
      [ 1801.981104]  [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm]
      [ 1801.981124]  [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm]
      [ 1801.981144]  [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm]
      [ 1801.981164]  [<0000000175e56906>] process_one_work+0x1fe/0x470
      [ 1801.981173]  [<0000000175e570a4>] worker_thread+0x64/0x498
      [ 1801.981176]  [<0000000175e5ef2c>] kthread+0x10c/0x110
      [ 1801.981180]  [<0000000175de73c8>] __ret_from_fork+0x40/0x58
      [ 1801.981185]  [<000000017698440a>] ret_from_fork+0xa/0x40
      
      when writing to a guest from an irqfd worker as long as we do not have
      the dirty ring.
      Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Reluctantly-acked-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220113122924.740496-1-borntraeger@linux.ibm.com>
      Fixes: 2efd61a6 ("KVM: Warn if mark_page_dirty() is called without an active vCPU")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e09fccb5
    • KVM: selftests: Add a test to force emulation with a pending exception · e337f7e0
      Sean Christopherson authored
      Add a VMX specific test to verify that KVM doesn't explode if userspace
      attempts KVM_RUN when emulation is required with a pending exception.
      KVM VMX's emulation support for !unrestricted_guest punts exceptions to
      userspace instead of attempting to synthesize the exception with all the
      correct state (and stack switching, etc...).
      
      Punting is acceptable as there's never been a request to support
      injecting exceptions when emulating due to invalid state, but KVM has
      historically assumed that userspace will do the right thing and either
      clear the exception or kill the guest.  Deliberately do the opposite and
      attempt to re-enter the guest with a pending exception and emulation
      required to verify KVM continues to punt the combination to userspace,
      e.g. doesn't explode, WARN, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e337f7e0
    • KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Sean Christopherson authored
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
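      
      A sketch of the deferred check (the hook name and return convention
      shown here are illustrative, not necessarily the exact ones KVM uses):
      
        static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
        {
                /* Emulation can't inject an exception; reject the combo
                 * here at KVM_RUN time, since ioctl ordering between
                 * KVM_SET_SREGS and KVM_SET_VCPU_EVENTS can't be forced. */
                if (to_vmx(vcpu)->emulation_required &&
                    vcpu->arch.exception.pending)
                        return -EIO;
      
                return 1;       /* positive return: OK to enter the guest */
        }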
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fc4fad79
    • selftests: kvm/x86: Add test for KVM_SET_PMU_EVENT_FILTER · bef9a701
      Jim Mattson authored
      Verify that the PMU event filter works as expected.
      
      Note that the virtual PMU doesn't work as expected on AMD Zen CPUs (an
      intercepted rdmsr is counted as a retired branch instruction), but the
      PMU event filter does work.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-7-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bef9a701
    • selftests: kvm/x86: Introduce x86_model() · 2ba90474
      Jim Mattson authored
      Extract the x86 model number from CPUID.01H:EAX.
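      
      For reference, a sketch of the extraction, matching the conventional
      family/model decoding of CPUID.01H:EAX (details of the selftest
      helper may differ):
      
        static inline unsigned int x86_model(unsigned int eax)
        {
                unsigned int model  = (eax >> 4) & 0xf;   /* bits 7:4  */
                unsigned int family = (eax >> 8) & 0xf;   /* bits 11:8 */
      
                /* Extended model (bits 19:16) applies to families 6 and 15. */
                if (family == 0x6 || family == 0xf)
                        model |= ((eax >> 16) & 0xf) << 4;
      
                return model;
        }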
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-6-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ba90474
    • selftests: kvm/x86: Export x86_family() for use outside of processor.c · 398f9240
      Jim Mattson authored
      Move this static inline function to processor.h, so that it can be
      used in individual tests, as needed.
      
      Opportunistically replace the bare 'unsigned' with 'unsigned int.'
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-5-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      398f9240
    • selftests: kvm/x86: Introduce is_amd_cpu() · 21066101
      Jim Mattson authored
      Replace the one ad hoc "AuthenticAMD" CPUID vendor string comparison
      with a new function, is_amd_cpu().
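      
      A sketch of the check (the cpuid helper shown is illustrative; CPUID
      leaf 0 returns the vendor string in EBX, EDX, ECX):
      
        static bool is_amd_cpu(void)
        {
                uint32_t eax = 0, ebx, ecx, edx;
      
                cpuid(&eax, &ebx, &ecx, &edx);  /* illustrative helper */
      
                /* "AuthenticAMD", spelled across EBX ("Auth"), EDX ("enti")
                 * and ECX ("cAMD") in little-endian byte order. */
                return ebx == 0x68747541 && edx == 0x69746e65 &&
                       ecx == 0x444d4163;
        }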
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-4-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      21066101
    • selftests: kvm/x86: Parameterize the CPUID vendor string check · b33b9c40
      Jim Mattson authored
      Refactor is_intel_cpu() to make it easier to reuse the bulk of the
      code for other vendors in the future.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-3-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b33b9c40
    • KVM: x86/pmu: Use binary search to check filtered events · 7ff775ac
      Jim Mattson authored
      The PMU event filter may contain up to 300 events. Replace the linear
      search in reprogram_gp_counter() with a binary search.
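      
      A sketch of the lookup using the kernel's bsearch() (comparator and
      key layout are illustrative; the filter's events array must be sorted
      once when the filter is installed):
      
        #include <linux/bsearch.h>
      
        static int filter_cmp(const void *pa, const void *pb)
        {
                u64 a = *(const u64 *)pa, b = *(const u64 *)pb;
      
                return (a > b) - (a < b);
        }
      
        /* O(log n) per reprogram_gp_counter() call instead of O(n). */
        bool allowed = bsearch(&event_key, filter->events, filter->nevents,
                               sizeof(filter->events[0]), filter_cmp);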
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-2-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7ff775ac
    • kvm: selftests: conditionally build vm_xsave_req_perm() · 1a1d1dbc
      Wei Wang authored
      vm_xsave_req_perm() is currently defined and used by x86_64 only.
      Compile it into vm_create_with_vcpus() only on x86_64 machines;
      otherwise, it causes linkage errors, e.g. on s390x.
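      
      A sketch of the guard (exact placement within the library code is
      illustrative):
      
        #ifdef __x86_64__
                /* XSAVE permission requests only exist on x86-64; guarding
                 * the call avoids link errors elsewhere, e.g. on s390x. */
                vm_xsave_req_perm();
        #endif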
      
      Fixes: 415a3c33 ("kvm: selftests: Add support for KVM_CAP_XSAVE2")
      Reported-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Tested-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
      Message-Id: <20220118014817.30910-1-wei.w.wang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1a1d1dbc
    • KVM: x86/cpuid: Clear XFD for component i if the base feature is missing · e9737468
      Like Xu authored
      According to Intel extended feature disable (XFD) spec, the sub-function i
      (i > 1) of CPUID function 0DH enumerates "details for state component i.
      ECX[2] enumerates support for XFD support for this state component."
      
      If KVM does not report the F(XFD) feature (e.g. due to CONFIG_X86_64
      being unset), then the corresponding XFD support for any state
      component i should also be removed. Translate this dependency into
      KVM terms.
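      
      In sketch form, the dependency could be enforced in the CPUID 0xD
      sub-leaf loop like this (exact placement is illustrative):
      
        /* CPUID.(EAX=0DH, ECX=i).ECX[2] advertises XFD support for state
         * component i; clear it if KVM itself does not expose XFD. */
        if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
                entry->ecx &= ~BIT(2);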
      
      Fixes: 690a757d ("kvm: x86: Add CPUID support for Intel AMX")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220117074531.76925-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e9737468
    • KVM: x86/mmu: Improve TLB flush comment in kvm_mmu_slot_remove_write_access() · 6ff94f27
      David Matlack authored
      Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
      why it is safe to flush TLBs outside of the MMU lock after
      write-protecting SPTEs for dirty logging. The current comment is a long
      run-on sentence that was difficult to understand. In addition, it was
      specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
      MMU has to handle this as well.
      
      The new comment explains:
       - Why the TLB flush is necessary at all.
       - Why it is desirable to do the TLB flush outside of the MMU lock.
       - Why it is safe to do the TLB flush outside of the MMU lock.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-5-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ff94f27
    • KVM: x86/mmu: Document and enforce MMU-writable and Host-writable invariants · 5f16bcac
      David Matlack authored
      SPTEs are tagged with software-only bits to indicate whether they are
      "MMU-writable" and "Host-writable". These bits are used to determine why
      KVM has marked an SPTE as read-only.
      
      Document these bits and their invariants, and enforce the invariants
      with new WARNs in spte_can_locklessly_be_made_writable() to ensure they
      are not accidentally violated in the future.
      
      Opportunistically move DEFAULT_SPTE_{MMU,HOST}_WRITABLE next to
      EPT_SPTE_{MMU,HOST}_WRITABLE since the new documentation applies to
      both.
      
      No functional change intended.
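      
      A sketch of the enforcement (mask names as used by KVM's SPTE code;
      body simplified):
      
        static bool spte_can_locklessly_be_made_writable(u64 spte)
        {
                /* Invariant: MMU-writable without Host-writable would let
                 * KVM make a host-read-only page writable by the guest. */
                WARN_ON_ONCE((spte & shadow_mmu_writable_mask) &&
                             !(spte & shadow_host_writable_mask));
      
                return spte & shadow_mmu_writable_mask;
        }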
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f16bcac
    • KVM: x86/mmu: Clear MMU-writable during changed_pte notifier · f082d86e
      David Matlack authored
      When handling the changed_pte notifier and the new PTE is read-only,
      clear both the Host-writable and MMU-writable bits in the SPTE. This
      preserves the invariant that MMU-writable is set if-and-only-if
      Host-writable is set.
      
      No functional change intended. Nothing currently relies on the
      aforementioned invariant and technically the changed_pte notifier is
      dead code.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-3-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f082d86e
    • KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU · 7c8a4742
      David Matlack authored
      When the TDP MMU is write-protecting GFNs for page table protection (as
      opposed to for dirty logging, or due to the HVA not being writable), it
      checks if the SPTE is already write-protected and if so skips modifying
      the SPTE and the TLB flush.
      
      This behavior is incorrect because it fails to check if the SPTE
      is write-protected for page table protection, i.e. fails to check
      that MMU-writable is '0'.  If the SPTE was write-protected for dirty
      logging but not page table protection, the SPTE could locklessly be made
      writable, and vCPUs could still be running with writable mappings cached
      in their TLB.
      
      Fix this by skipping the SPTE update only if the SPTE is already
      write-protected *and* MMU-writable is already clear.  Technically,
      checking only MMU-writable would suffice; a SPTE cannot be writable
      without MMU-writable being set.  But check both to be paranoid and
      because it arguably yields more readable code.
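      
      A sketch of the corrected condition (iterator and mask names
      simplified):
      
        /* Skip only if the SPTE is non-writable for page-table protection,
         * i.e. MMU-writable is clear; a dirty-logging write-protect alone
         * leaves the SPTE locklessly restorable to a writable state. */
        if (!is_writable_pte(iter.old_spte) &&
            !(iter.old_spte & shadow_mmu_writable_mask))
                continue;
      
        new_spte = iter.old_spte &
                   ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);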
      
      Fixes: 46044f72 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU")
      Cc: stable@vger.kernel.org
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-2-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c8a4742
  2. 18 Jan, 2022 1 commit
    • KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock · 5f02ef74
      Marcelo Tosatti authored
      blocked_vcpu_on_cpu_lock is taken from hard interrupt context
      (pi_wakeup_handler), therefore it cannot sleep.
      
      Switch it to a raw spinlock.
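      
      A sketch of the switch (declarations simplified; on PREEMPT_RT a
      spinlock_t becomes a sleeping rtmutex, while a raw_spinlock_t keeps
      spinning and so remains legal in hard-IRQ context):
      
        /* Before: sleeps under PREEMPT_RT when contended. */
        static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
      
        /* After: safe to take from the hard-IRQ pi_wakeup_handler(). */
        static DEFINE_PER_CPU(raw_spinlock_t, blocked_vcpu_on_cpu_lock);
      
        raw_spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
        /* ... walk the per-CPU wakeup list ... */
        raw_spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));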
      
      Fixes the following splat:
      
      [41297.066254] BUG: scheduling while atomic: CPU 0/KVM/635218/0x00010001
      [41297.066323] Preemption disabled at:
      [41297.066324] [<ffffffff902ee47f>] irq_enter_rcu+0xf/0x60
      [41297.066339] Call Trace:
      [41297.066342]  <IRQ>
      [41297.066346]  dump_stack_lvl+0x34/0x44
      [41297.066353]  ? irq_enter_rcu+0xf/0x60
      [41297.066356]  __schedule_bug.cold+0x7d/0x8b
      [41297.066361]  __schedule+0x439/0x5b0
      [41297.066365]  ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x1b0/0x440
      [41297.066369]  schedule_rtlock+0x1e/0x40
      [41297.066371]  rtlock_slowlock_locked+0xf1/0x260
      [41297.066374]  rt_spin_lock+0x3b/0x60
      [41297.066378]  pi_wakeup_handler+0x31/0x90 [kvm_intel]
      [41297.066388]  sysvec_kvm_posted_intr_wakeup_ipi+0x9d/0xd0
      [41297.066392]  </IRQ>
      [41297.066392]  asm_sysvec_kvm_posted_intr_wakeup_ipi+0x12/0x20
      ...
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f02ef74
  3. 17 Jan, 2022 6 commits
    • KVM: x86: Making the module parameter of vPMU more common · 4732f244
      Like Xu authored
      The new module parameter to control PMU virtualization should apply
      to Intel as well as AMD, for situations where userspace is not trusted.
      If the module parameter allows PMU virtualization, there could be a
      new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
      PMU virtualization on a per-VM basis.
      
      If the module parameter does not allow PMU virtualization, there
      should be no userspace override, since we have no precedent for
      authorizing that kind of override. If it's false, other counter-based
      profiling features (such as LBR including the associated CPUID bits
      if any) will not be exposed.
      
      Change its name from "pmu" to "enable_pmu" as we have temporary
      variables with the same name in our code like "struct kvm_pmu *pmu".
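      
      A sketch of the shared knob (file placement and permissions are
      illustrative):
      
        /* Common x86: one read-only module parameter for both VMX and SVM. */
        bool __read_mostly enable_pmu = true;
        EXPORT_SYMBOL_GPL(enable_pmu);
        module_param(enable_pmu, bool, 0444);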
      
      Fixes: b1d66dad ("KVM: x86/svm: Add module param to control PMU virtualization")
      Suggested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220111073823.21885-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4732f244
    • KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN · ecebb966
      Vitaly Kuznetsov authored
      KVM forbids KVM_SET_CPUID2 after KVM_RUN was performed on a vCPU unless
      the supplied CPUID data is equal to what was previously set. Test this.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ecebb966
    • KVM: selftests: Rename 'get_cpuid_test' to 'cpuid_test' · 9e6d484f
      Vitaly Kuznetsov authored
      In preparation for reusing the existing 'get_cpuid_test' for testing
      "KVM_SET_CPUID{,2} after KVM_RUN", rename it to 'cpuid_test' to avoid
      confusion.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9e6d484f
    • KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN · c6617c61
      Vitaly Kuznetsov authored
      Commit feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      forbade changing CPUID altogether but unfortunately this is not fully
      compatible with existing VMMs. In particular, QEMU reuses vCPU fds for
      CPU hotplug after unplug and it calls KVM_SET_CPUID2. Instead of a full
      ban, check whether the supplied CPUID data is equal to what was
      previously set.
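      
      A sketch of the equality check (helper name illustrative; fields are
      compared per entry so struct padding is ignored):
      
        static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu,
                                         struct kvm_cpuid_entry2 *e2, int nent)
        {
                int i;
      
                if (nent != vcpu->arch.cpuid_nent)
                        return -EINVAL;
      
                for (i = 0; i < nent; i++) {
                        struct kvm_cpuid_entry2 *o = &vcpu->arch.cpuid_entries[i];
      
                        if (e2[i].function != o->function ||
                            e2[i].index != o->index ||
                            e2[i].flags != o->flags ||
                            e2[i].eax != o->eax || e2[i].ebx != o->ebx ||
                            e2[i].ecx != o->ecx || e2[i].edx != o->edx)
                                return -EINVAL;
                }
      
                return 0;
        }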
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-3-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      [Do not call kvm_find_cpuid_entry repeatedly. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c6617c61
    • KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries · ee3a5f9e
      Vitaly Kuznetsov authored
      kvm_update_cpuid_runtime() mangles CPUID data coming from the userspace
      VMM after updating 'vcpu->arch.cpuid_entries'; this makes it
      impossible to compare an update with what was previously
      supplied. Introduce __kvm_update_cpuid_runtime() version which can be
      used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
      so the upcoming update check can compare tweaked data.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee3a5f9e
    • KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event · a2186448
      Like Xu authored
      According to the CPUID 0x0A.EBX bit vector, event [7] should be the
      unrealized event "Topdown Slots" instead of the *kernel* generalized
      common hardware event "REF_CPU_CYCLES", so we need to skip the CPUID
      unavailability check in intel_pmc_perf_hw_id() for the last
      REF_CPU_CYCLES event and update the confusing comment.
      
      If the event is marked as unavailable in the Intel guest CPUID
      0AH.EBX leaf, we need to avoid any perf_event creation, whether
      it's a gp or fixed counter. To distinguish whether it is a rejected
      event or an event that needs to be programmed with PERF_TYPE_RAW type,
      a new special returned value of "PERF_COUNT_HW_MAX + 1" is introduced.
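      
      A sketch of the sentinel (field name and surrounding logic simplified
      for illustration):
      
        /* Marked unavailable in guest CPUID 0AH.EBX: reject outright rather
         * than falling back to a PERF_TYPE_RAW event. */
        if (!(pmu->available_event_types & BIT(i)))
                return PERF_COUNT_HW_MAX + 1;  /* caller creates no perf_event */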
      
      Fixes: 62079d8a ("KVM: PMU: add proper support for fixed counter 2")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220105051509.69437-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a2186448
  4. 14 Jan, 2022 5 commits
    • x86/fpu: Fix inline prefix warnings · c862dcd1
      Yang Zhong authored
      Fix sparse warnings in the xstate code and remove the 'inline' prefix.
      
      Fixes: 980fe2fd ("x86/fpu: Extend fpu_xstate_prctl() with guest permissions")
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220113180825.322333-1-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c862dcd1
    • selftest: kvm: Add amx selftest · bf70636d
      Yang Zhong authored
      This selftest covers two aspects of AMX.  The first is triggering the
      #NM exception and checking the MSR XFD_ERR value.  The second case is
      loading tile config and tile data into guest registers and trapping to
      the host side for a complete save/load of the guest state.  TMM0
      is also checked against memory data after save/restore.
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20211223145322.2914028-4-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bf70636d
    • selftest: kvm: Move struct kvm_x86_state to header · 6559b4a5
      Yang Zhong authored
      These changes avoid a pointer-dereference compile issue when
      amx_test.c references state->xsave.
      
      Move struct kvm_x86_state definition to processor.h.
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20211223145322.2914028-3-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6559b4a5
    • selftest: kvm: Reorder vcpu_load_state steps for AMX · 551447cf
      Paolo Bonzini authored
      For AMX support it is recommended to load XCR0 after XFD, so
      that KVM does not see XFD=0, XCR0=1 for a save state that will
      eventually be disabled (which would lead to premature allocation
      of the space required for that save state).
      
      It is also required to load XSAVE data after XCR0 and XFD, so
      that KVM can trigger allocation of the extra space required to
      store AMX state.
      
      Adjust vcpu_load_state to obey these new requirements.
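      
      In sketch form, the resulting restore order (selftest helper names
      are illustrative):
      
        /* 1) XFD first, so KVM never observes XFD=0 together with an XCR0
         *    bit set for a component the guest had disabled via XFD --
         *    that combination triggers premature buffer allocation.
         * 2) XCR0 second.
         * 3) XSAVE data last, once XCR0/XFD let KVM size the state buffer. */
        vcpu_set_msr(vm, vcpuid, MSR_IA32_XFD, state->xfd);
        vcpu_xcrs_set(vm, vcpuid, &state->xcrs);
        vcpu_xsave_set(vm, vcpuid, state->xsave);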
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20211223145322.2914028-2-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      551447cf
    • kvm: x86: Disable interception for IA32_XFD on demand · b5274b1b
      Kevin Tian authored
      Always intercepting IA32_XFD causes non-negligible overhead when this
      register is updated frequently in the guest.
      
      Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
      with a non-zero value.
      
      Disabling WRMSR emulation implies that IA32_XFD becomes out-of-sync
      with the software states in fpstate and the per-CPU xfd cache. This
      leads to two additional changes accordingly:
      
        - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
          software states back in-sync with the MSR, before handle_exit_irqoff()
          is called.
      
        - Always trap #NM once write interception is disabled for IA32_XFD.
          The #NM exception is rare if the guest doesn't use dynamic
          features. Otherwise, there is at most one exception per guest
          task given a dynamic feature.
      
      P.S. We have confirmed that the SDM is being revised to say that
      when setting IA32_XFD[18] the AMX register state is not guaranteed
      to be preserved. This clarification avoids adding mess for a creative
      guest which sets IA32_XFD[18]=1 before saving active AMX state to
      its own storage.
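      
      A sketch of the flip on the first non-zero write (mirrors the
      mechanism described above; exact placement within the MSR write
      handler is illustrative):
      
        case MSR_IA32_XFD:
                ret = kvm_set_msr_common(vcpu, msr_info);
                if (!ret && data) {
                        /* First non-zero write: stop intercepting IA32_XFD
                         * and start trapping #NM to sync state lazily. */
                        vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD,
                                                      MSR_TYPE_RW);
                        vcpu->arch.xfd_no_write_intercept = true;
                        vmx_update_exception_bitmap(vcpu);
                }
                break;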
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5274b1b