1. 08 Jun, 2022 27 commits
    • Like Xu's avatar
      KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled · bef6ecca
      Like Xu authored
      On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
      detect whether the processor supports performance monitoring facility.
      
      It depends on the PMU is enabled for the guest, and a software write
      operation to this available bit will be ignored. The proposal to ignore
      the toggle in KVM is the way to go and that behavior matches bare metal.
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-5-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bef6ecca
    • Like Xu's avatar
      perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values · 39a4d779
      Like Xu authored
      Splitting the logic for determining the guest values is unnecessarily
      confusing, and potentially fragile. Perf should have full knowledge and
      control of what values are loaded for the guest.
      
      If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
      can generate the full set of guest values by grabbing guest ds_area and
      pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
      guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
      vendor agnostic, so we don't see any reason to not just pass the pointer.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarLike Xu <like.xu@linux.intel.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Message-Id: <20220411101946.20262-4-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      39a4d779
    • Like Xu's avatar
      perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest · 69e575dd
      Like Xu authored
      With PEBS virtualization, the guest PEBS records get delivered to the
      guest DS, and the host pmi handler uses perf_guest_cbs->is_in_guest()
      to distinguish whether the PMI comes from the guest code like Intel PT.
      
      No matter how many guest PEBS counters are overflowed, only triggering
      one fake event is enough. The fake event causes the KVM PMI callback to
      be called, thereby injecting the PEBS overflow PMI into the guest.
      
      KVM may inject the PMI with BUFFER_OVF set, even if the guest DS is
      empty. That should really be harmless. Thus guest PEBS handler would
      retrieve the correct information from its own PEBS records buffer.
      
      Cc: linux-perf-users@vger.kernel.org
      Originally-by: default avatarAndi Kleen <ak@linux.intel.com>
      Co-developed-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-3-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      69e575dd
    • Like Xu's avatar
      perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server · fb358e0b
      Like Xu authored
      Add support for EPT-Friendly PEBS, a new CPU feature that enlightens PEBS
      to translate guest linear address through EPT, and facilitates handling
      VM-Exits that occur when accessing PEBS records.  More information can
      be found in the December 2021 release of Intel's SDM, Volume 3,
      18.9.5 "EPT-Friendly PEBS". This new hardware facility makes sure the
      guest PEBS records will not be lost, which is available on Intel Ice Lake
      Server platforms (and later).
      
      KVM will check this field through perf_get_x86_pmu_capability() instead
      of hard coding the CPU models in the KVM code. If it is supported, the
      guest PEBS capability will be exposed to the guest. Guest PEBS can be
      enabled when and only when "EPT-Friendly PEBS" is supported and
      EPT is enabled.
      
      Cc: linux-perf-users@vger.kernel.org
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-2-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fb358e0b
    • Chao Gao's avatar
      KVM: VMX: enable IPI virtualization · d588bb9b
      Chao Gao authored
      With IPI virtualization enabled, the processor emulates writes to
      APIC registers that would send IPIs. The processor sets the bit
      corresponding to the vector in target vCPU's PIR and may send a
      notification (IPI) specified by NDST and NV fields in target vCPU's
      Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
      engine does when dealing with posted interrupt from devices.
      
      A PID-pointer table is used by the processor to locate the PID of a
      vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
      ID assigned for current VM session from userspace. Allocating memory
      for PID-pointer table is deferred to vCPU creation, because irqchip
      mode and VM-scope maximum APIC ID is settled at that point. KVM can
      skip PID-pointer table allocation if !irqchip_in_kernel().
      
      Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
      notification vector to wakeup vector. This can ensure that when an IPI
      for blocked vCPUs arrives, VMM can get control and wake up blocked
      vCPUs. And if a VCPU is preempted, its posted interrupt notification
      is suppressed.
      
      Note that IPI virtualization can only virualize physical-addressing,
      flat mode, unicast IPIs. Sending other IPIs would still cause a
      trap-like APIC-write VM-exit and need to be handled by VMM.
      Signed-off-by: default avatarChao Gao <chao.gao@intel.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154510.11938-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d588bb9b
    • Zeng Guang's avatar
      kvm: selftests: Add KVM_CAP_MAX_VCPU_ID cap test · 753dcf7a
      Zeng Guang authored
      Basic test coverage of KVM_CAP_MAX_VCPU_ID cap.
      
      This capability can be enabled before vCPU creation and only allowed
      to set once. if assigned vcpu id is beyond KVM_CAP_MAX_VCPU_ID
      capability, vCPU creation will fail.
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220422134456.26655-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      753dcf7a
    • Zeng Guang's avatar
      KVM: x86: Allow userspace to set maximum VCPU id for VM · 35875316
      Zeng Guang authored
      Introduce new max_vcpu_ids in KVM for x86 architecture. Userspace
      can assign maximum possible vcpu id for current VM session using
      KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().
      
      This is done for x86 only because the sole use case is to guide
      memory allocation for PID-pointer table, a structure needed to
      enable VMX IPI.
      
      By default, max_vcpu_ids set as KVM_MAX_VCPU_IDS.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154444.11888-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      35875316
    • Zeng Guang's avatar
      KVM: Move kvm_arch_vcpu_precreate() under kvm->lock · 1d5e740d
      Zeng Guang authored
      kvm_arch_vcpu_precreate() targets to handle arch specific VM resource
      to be prepared prior to the actual creation of vCPU. For example, x86
      platform may need do per-VM allocation based on max_vcpu_ids at the
      first vCPU creation. It probably leads to concurrency control on this
      allocation as multiple vCPU creation could happen simultaneously. From
      the architectual point of view, it's necessary to execute
      kvm_arch_vcpu_precreate() under protect of kvm->lock.
      
      Currently only arm64, x86 and s390 have non-nop implementations at the
      stage of vCPU pre-creation. Remove the lock acquiring in s390's design
      and make sure all architecture can run kvm_arch_vcpu_precreate() safely
      under kvm->lock without recrusive lock issue.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154409.11842-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1d5e740d
    • Zeng Guang's avatar
      KVM: VMX: Clean up vmx_refresh_apicv_exec_ctrl() · f08a06c9
      Zeng Guang authored
      Remove the condition check cpu_has_secondary_exec_ctrls(). Calling
      vmx_refresh_apicv_exec_ctrl() premises secondary controls activated
      and VMCS fields related to APICv valid as well. If it's invoked in
      wrong circumstance at the worst case, VMX operation will report
      VMfailValid error without further harmful impact and just functions
      as if all the secondary controls were 0.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153604.11786-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f08a06c9
    • Zeng Guang's avatar
      KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode · 5413bcba
      Zeng Guang authored
      Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
      i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
      the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
      were previously impossible.
      
      Note, x2APIC MSR writes are 64 bits wide.
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153516.11739-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5413bcba
    • Robert Hoo's avatar
      KVM: VMX: Report tertiary_exec_control field in dump_vmcs() · 0b85baa5
      Robert Hoo authored
      Add tertiary_exec_control field report in dump_vmcs(). Meanwhile,
      reorganize the dump output of VMCS category as follows.
      
      Before change:
      *** Control State ***
       PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
       EntryControls=0000d1ff ExitControls=002befff
      
      After change:
      *** Control State ***
       CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
       PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarRobert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153441.11687-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0b85baa5
    • Robert Hoo's avatar
      KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config · 1ad4e543
      Robert Hoo authored
      Check VMX features on tertiary execution control in VMCS config setup.
      Sub-features in tertiary execution control to be enabled are adjusted
      according to hardware capabilities although no sub-feature is enabled
      in this patch.
      
      EVMCSv1 doesn't support tertiary VM-execution control, so disable it
      when EVMCSv1 is in use. And define the auxiliary functions for Tertiary
      control field here, using the new BUILD_CONTROLS_SHADOW().
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarRobert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153400.11642-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ad4e543
    • Robert Hoo's avatar
      KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation · ed3905ba
      Robert Hoo authored
      The Tertiary VM-Exec Control, different from previous control fields, is 64
      bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
      support both 32 bit and 64 bit fields' auxiliary functions building.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarRobert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153318.11595-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ed3905ba
    • Robert Hoo's avatar
      x86/cpu: Add new VMX feature, Tertiary VM-Execution control · 465932db
      Robert Hoo authored
      A new 64-bit control field "tertiary processor-based VM-execution
      controls", is defined [1]. It's controlled by bit 17 of the primary
      processor-based VM-execution controls.
      
      Different from its brother VM-execution fields, this tertiary VM-
      execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
      TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
      
      Its companion VMX capability reporting MSR,MSR_IA32_VMX_PROCBASED_CTLS3
      (0x492), is also semantically different from its brothers, whose 64 bits
      consist of all allow-1, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
      Therefore, its init_vmx_capabilities() is a little different from others.
      
      [1] ISE 6.2 "VMCS Changes"
      https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
      
      [2] SDM Vol3. Appendix A.3
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarRobert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarZeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153240.11549-1-guang.zeng@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      465932db
    • Sean Christopherson's avatar
      KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic · b8b9156e
      Sean Christopherson authored
      Add a comment to FNAME(sync_page) to explain why the TLB flushing logic
      conspiculously doesn't handle the scenario of guest protections being
      reduced.  Specifically, if synchronizing a SPTE drops execute protections,
      KVM will not emit a TLB flush, whereas dropping writable or clearing A/D
      bits does trigger a flush via mmu_spte_update().  Architecturally, until
      the GPTE is implicitly or explicitly flushed from the guest's perspective,
      KVM is not required to flush any old, stale translations.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Message-Id: <20220513195000.99371-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b8b9156e
    • Sean Christopherson's avatar
      KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page() · 9fb35657
      Sean Christopherson authored
      All of sync_page()'s existing checks filter out only !PRESENT gPTE,
      because without execute-only, all upper levels are guaranteed to be at
      least READABLE.  However, if EPT with execute-only support is in use by
      L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
      (RWX=0) if the upper level combined permissions are R (or RW) and
      the leaf EPTE is changed from R (or RW) to X.  Because the EPTE is
      considered present when viewed in isolation, and no reserved bits are set,
      FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
      not-present SPTE to be created.
      
      The SPTE is "correct": the guest translation is inaccessible because
      the combined protections of all levels yield RWX=0, and KVM will just
      redirect any vmexits to the guest.  If EPT A/D bits are disabled, KVM
      can mistake the SPTE for an access-tracked SPTE, but again such confusion
      isn't fatal, as the "saved" protections are also RWX=0.  However,
      creating a useless SPTE in general means that KVM messed up something,
      even if this particular goof didn't manifest as a functional bug.
      So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
      add a WARN in make_spte() to detect creation of SPTEs that will
      result in RWX=0 protections.
      
      Fixes: d95c5568 ("kvm: mmu: track read permission explicitly for shadow EPT page tables")
      Cc: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220513195000.99371-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9fb35657
    • Maciej S. Szmigiero's avatar
      KVM: selftests: nSVM: Add svm_nested_soft_inject_test · d8969871
      Maciej S. Szmigiero authored
      Add a KVM self-test that checks whether a nSVM L1 is able to successfully
      inject a software interrupt, a soft exception and a NMI into its L2 guest.
      
      In practice, this tests both the next_rip field consistency and
      L1-injected event with intervening L0 VMEXIT during its delivery:
      the first nested VMRUN (that's also trying to inject a software interrupt)
      will immediately trigger a L0 NPF.
      This L0 NPF will have zero in its CPU-returned next_rip field, which if
      incorrectly reused by KVM will trigger a #PF when trying to return to
      such address 0 from the interrupt handler.
      
      For NMI injection this tests whether the L1 NMI state isn't getting
      incorrectly mixed with the L2 NMI state if a L1 -> L2 NMI needs to be
      re-injected.
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      [sean: check exact L2 RIP on first soft interrupt]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <d5f3d56528558ad8e28a9f1e1e4187f5a1e6770a.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d8969871
    • Maciej S. Szmigiero's avatar
      KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection · 159fc6fa
      Maciej S. Szmigiero authored
      A NMI that L1 wants to inject into its L2 should be directly re-injected,
      without causing L0 side effects like engaging NMI blocking for L1.
      
      It's also worth noting that in this case it is L1 responsibility
      to track the NMI window status for its L2 guest.
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      159fc6fa
    • Sean Christopherson's avatar
      KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint · 2d613912
      Sean Christopherson authored
      In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
      "IRQs", i.e. interrupts that are reinjected after incomplete delivery of
      a software interrupt from an INTn instruction.  Tag reinjected interrupts
      as such, even though the information is usually redundant since soft
      interrupts are only ever reinjected by KVM.  Though rare in practice, a
      hard IRQ can be reinjected.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      [MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2d613912
    • Sean Christopherson's avatar
      KVM: x86: Print error code in exception injection tracepoint iff valid · 21d4c575
      Sean Christopherson authored
      Print the error code in the exception injection tracepoint if and only if
      the exception has an error code.  Define the entire error code sequence
      as a set of formatted strings, print empty strings if there's no error
      code, and abuse __print_symbolic() by passing it an empty array to coerce
      it into printing the error code as a hex string.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <e8f0511733ed2a0410cbee8a0a7388eac2ee5bac.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      21d4c575
    • Sean Christopherson's avatar
      KVM: x86: Trace re-injected exceptions · a61d7c54
      Sean Christopherson authored
      Trace exceptions that are re-injected, not just those that KVM is
      injecting for the first time.  Debugging re-injection bugs is painful
      enough as is, not having visibility into what KVM is doing only makes
      things worse.
      
      Delay propagating pending=>injected in the non-reinjection path so that
      the tracing can properly identify reinjected exceptions.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <25470690a38b4d2b32b6204875dd35676c65c9f2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a61d7c54
    • Sean Christopherson's avatar
      KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" · 7e5b5ef8
      Sean Christopherson authored
      Re-inject INTn software interrupts instead of retrying the instruction if
      the CPU encountered an intercepted exception while vectoring the INTn,
      e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
      instruction is architecturally wrong e.g. will result in a spurious #DB
      if there's a code breakpoint on the INT3/O, and lack of re-injection also
      breaks nested virtualization, e.g. if L1 injects a software interrupt and
      vectoring the injected interrupt encounters an exception that is
      intercepted by L0 but not L1.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7e5b5ef8
    • Sean Christopherson's avatar
      KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction · 6ef88d6e
      Sean Christopherson authored
      Re-inject INT3/INTO instead of retrying the instruction if the CPU
      encountered an intercepted exception while vectoring the software
      exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
      shadow paging.  Retrying the instruction is architecturally wrong, e.g.
      will result in a spurious #DB if there's a code breakpoint on the INT3/O,
      and lack of re-injection also breaks nested virtualization, e.g. if L1
      injects a software exception and vectoring the injected exception
      encounters an exception that is intercepted by L0 but not L1.
      
      Due to, ahem, deficiencies in the SVM architecture, acquiring the next
      RIP may require flowing through the emulator even if NRIPS is supported,
      as the CPU clears next_rip if the VM-Exit is due to an exception other
      than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
      deal with this, "skip" the instruction to calculate next_rip (if it's
      not already known), and then unwind the RIP write and any side effects
      (RFLAGS updates).
      
      Save the computed next_rip and use it to re-stuff next_rip if injection
      doesn't complete.  This allows KVM to do the right thing if next_rip was
      known prior to injection, e.g. if L1 injects a soft event into L2, and
      there is no backing INTn instruction, e.g. if L1 is injecting an
      arbitrary event.
      
      Note, it's impossible to guarantee architectural correctness given SVM's
      architectural flaws.  E.g. if the guest executes INTn (no KVM injection),
      an exit occurs while vectoring the INTn, and the guest modifies the code
      stream while the exit is being handled, KVM will compute the incorrect
      next_rip due to "skipping" the wrong instruction.  A future enhancement
      to make this less awful would be for KVM to detect that the decoded
      instruction is not the correct INTn and drop the to-be-injected soft
      event (retrying is a lesser evil compared to shoving the wrong RIP on the
      exception stack).
      Reported-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6ef88d6e
    • Sean Christopherson's avatar
      KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported · 3741aec4
      Sean Christopherson authored
      If NRIPS is supported in hardware but disabled in KVM, set next_rip to
      the next RIP when advancing RIP as part of emulating INT3 injection.
      There is no flag to tell the CPU that KVM isn't using next_rip, and so
      leaving next_rip is left as is will result in the CPU pushing garbage
      onto the stack when vectoring the injected event.
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3741aec4
    • Sean Christopherson's avatar
      KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" · cd9e6da8
      Sean Christopherson authored
      Unwind the RIP advancement done by svm_queue_exception() when injecting
      an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
      vectoring the injected event, even if the exception reported by the CPU
      isn't the same event that was injected.  If vectoring INT3 encounters an
      exception, e.g. #NP, and vectoring the #NP encounters an intercepted
      exception, e.g. #PF when KVM is using shadow paging, then the #NP will
      be reported as the event that was in-progress.
      
      Note, this is still imperfect, as it will get a false positive if the
      INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
      handler in the guest, the instruction following the INT3 generates an
      exception (directly or indirectly), _and_ vectoring that exception
      encounters an exception that is intercepted by KVM.  The false positives
      could theoretically be solved by further analyzing the vectoring event,
      e.g. by comparing the error code against the expected error code were an
      exception to occur when vectoring the original injected exception, but
      SVM without NRIPS is a complete disaster, trying to make it 100% correct
      is a waste of time.
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cd9e6da8
    • Maciej S. Szmigiero's avatar
      KVM: SVM: Don't BUG if userspace injects an interrupt with GIF=0 · f17c31c4
      Maciej S. Szmigiero authored
      Don't BUG/WARN on interrupt injection due to GIF being cleared,
      since it's trivial for userspace to force the situation via
      KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct
      for KVM internally generated injections).
      
        kernel BUG at arch/x86/kvm/svm/svm.c:3386!
        invalid opcode: 0000 [#1] SMP
        CPU: 15 PID: 926 Comm: smm_test Not tainted 5.17.0-rc3+ #264
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:svm_inject_irq+0xab/0xb0 [kvm_amd]
        Code: <0f> 0b 0f 1f 00 0f 1f 44 00 00 80 3d ac b3 01 00 00 55 48 89 f5 53
        RSP: 0018:ffffc90000b37d88 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88810a234ac0 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffffc90000b37df7 RDI: ffff88810a234ac0
        RBP: ffffc90000b37df7 R08: ffff88810a1fa410 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff888109571000 R14: ffff88810a234ac0 R15: 0000000000000000
        FS:  0000000001821380(0000) GS:ffff88846fdc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f74fc550008 CR3: 000000010a6fe000 CR4: 0000000000350ea0
        Call Trace:
         <TASK>
         inject_pending_event+0x2f7/0x4c0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x791/0x17a0 [kvm]
         kvm_vcpu_ioctl+0x26d/0x650 [kvm]
         __x64_sys_ioctl+0x82/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Fixes: 219b65dc ("KVM: SVM: Improve nested interrupt injection")
      Cc: stable@vger.kernel.org
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <35426af6e123cbe91ec7ce5132ce72521f02b1b5.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f17c31c4
    • Maciej S. Szmigiero's avatar
      KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 · 00f08d99
      Maciej S. Szmigiero authored
      The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
      This field value (instead of the saved guest RIP) in used by the CPU for
      the return address pushed on stack when injecting a software interrupt or
      INT3 or INTO exception.
      
      Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
      loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
      in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
      vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
      saves RIP on the stack as-is).
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      00f08d99
  2. 07 Jun, 2022 10 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvm-s390-next-5.19-2' of... · 5552de7b
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: pvdump and selftest improvements
      
      - add an interface to provide a hypervisor dump for secure guests
      - improve selftests to show tests
      5552de7b
    • Paolo Bonzini's avatar
      b31455e9
    • Paolo Bonzini's avatar
      a280e358
    • Maxim Levitsky's avatar
      KVM: SVM: fix tsc scaling cache logic · 11d39e8c
      Maxim Levitsky authored
      SVM uses a per-cpu variable to cache the current value of the
      tsc scaling multiplier msr on each cpu.
      
      Commit 1ab9287a
      ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      broke this caching logic.
      
      Refactor the code so that all TSC scaling multiplier writes go through
      a single function which checks and updates the cache.
      
      This fixes the following scenario:
      
      1. A CPU runs a guest with some tsc scaling ratio.
      
      2. New guest with different tsc scaling ratio starts on this CPU
         and terminates almost immediately.
      
         This ensures that the short running guest had set the tsc scaling ratio just
         once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
         the per-cpu cache is not updated.
      
      3. The original guest continues to run, it doesn't restore the msr
         value back to its own value, because the cache matches,
         and thus continues to run with a wrong tsc scaling ratio.
      
      Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      Signed-off-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      11d39e8c
    • Vitaly Kuznetsov's avatar
      KVM: selftests: Make hyperv_clock selftest more stable · eae260be
      Vitaly Kuznetsov authored
      hyperv_clock doesn't always give a stable test result, especially with
      AMD CPUs. The test compares Hyper-V MSR clocksource (acquired either
      with rdmsr() from within the guest or KVM_GET_MSRS from the host)
      against rdtsc(). To increase the accuracy, increase the measured delay
      (done with nop loop) by two orders of magnitude and take the mean rdtsc()
      value before and after rdmsr()/KVM_GET_MSRS.
      Reported-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Tested-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220601144322.1968742-1-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      eae260be
    • Ben Gardon's avatar
      KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging · 5ba7c4c6
      Ben Gardon authored
      Currently disabling dirty logging with the TDP MMU is extremely slow.
      On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
      to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
      the shadow MMU.
      
      When disabling dirty logging, zap non-leaf parent entries to allow
      replacement with huge pages instead of recursing and zapping all of the
      child, leaf entries. This reduces the number of TLB flushes required.
      and reduces the disable dirty log time with the TDP MMU to ~3 seconds.
      
      Opportunistically add a WARN() to catch GFNs that are mapped at a
      higher level than their max level.
      Signed-off-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220525230904.1584480-1-bgardon@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5ba7c4c6
    • Jan Beulich's avatar
      x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() · 1df931d9
      Jan Beulich authored
      As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs
      and clobbering of "cc" don't work well together. The compiler appears to
      mean to reject such, but doesn't - in its upstream form - quite manage
      to yet for "cc". Furthermore two similar macros don't clobber "cc", and
      clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler
      always assumes status flags to be clobbered there.
      
      Fixes: 989b5db2 ("x86/uaccess: Implement macros for CMPXCHG on user addresses")
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1df931d9
    • Shaoqin Huang's avatar
      KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() · cf4a8693
      Shaoqin Huang authored
      When freeing obsolete previous roots, check prev_roots as intended, not
      the current root.
      Signed-off-by: default avatarShaoqin Huang <shaoqin.huang@intel.com>
      Fixes: 527d5cd7 ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped")
      Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cf4a8693
    • Seth Forshee's avatar
      entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set · 3e684903
      Seth Forshee authored
      A livepatch transition may stall indefinitely when a kvm vCPU is heavily
      loaded. To the host, the vCPU task is a user thread which is spending a
      very long time in the ioctl(KVM_RUN) syscall. During livepatch
      transition, set_notify_signal() will be called on such tasks to
      interrupt the syscall so that the task can be transitioned. This
      interrupts guest execution, but when xfer_to_guest_mode_work() sees that
      TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
      exit to user mode is unnecessary, and guest execution is resumed without
      transitioning the task for the livepatch.
      
      This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
      is expected to break tasks out of interruptible kernel loops and cause
      them to return to userspace. Change xfer_to_guest_mode_work() to handle
      TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
      loop that an exit to userpsace is needed. Any pending task_work will be
      run when get_signal() is called from exit_to_user_mode_loop(), so there
      is no longer any need to run task work from xfer_to_guest_mode_work().
      Suggested-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Signed-off-by: default avatarSeth Forshee <sforshee@digitalocean.com>
      Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3e684903
    • Alexey Kardashevskiy's avatar
      KVM: Don't null dereference ops->destroy · e8bc2427
      Alexey Kardashevskiy authored
      A KVM device cleanup happens in either of two callbacks:
      1) destroy() which is called when the VM is being destroyed;
      2) release() which is called when a device fd is closed.
      
      Most KVM devices use 1) but Book3s's interrupt controller KVM devices
      (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
      the machine execution. The error handling in kvm_ioctl_create_device()
      assumes destroy() is always defined which leads to NULL dereference as
      discovered by Syzkaller.
      
      This adds a checks for destroy!=NULL and adds a missing release().
      
      This is not changing kvm_destroy_devices() as devices with defined
      release() should have been removed from the KVM devices list by then.
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e8bc2427
  3. 06 Jun, 2022 3 commits