1. 01 Dec, 2022 16 commits
    • KVM: x86: Use current rather than snapshotted TSC frequency if it is constant · 3ebcbd22
      Anton Romanov authored
      Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is
      constant, in which case the actual TSC frequency will never change and thus
      capturing TSC during initialization is unnecessary; KVM can simply use
      tsc_khz.  The per-CPU value is snapshotted via
      kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL).
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency
      can lead a VM to think its TSC frequency is not what it actually is if
      refining the TSC completes after KVM snapshots tsc_khz.  The actual
      frequency never changes, only the kernel's calculation of what that
      frequency is changes.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside of
      the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can be
      hit if and only if userspace is able to create a VM before TSC refinement
      completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some use
      cases and not taking a snapshot is arguably correct irrespective of the
      race with refinement.
      Signed-off-by: Anton Romanov <romanton@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
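The selection rule described in the commit above can be sketched as follows. This is an illustration only, not the kernel's code: `struct tsc_state` and `effective_tsc_khz()` are hypothetical names, and the fields merely stand in for the kernel's `tsc_khz`, the per-CPU `cpu_tsc_khz`, and the constant-TSC property.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant kernel state (not a real KVM type). */
struct tsc_state {
    bool constant_tsc;     /* host TSC frequency never changes */
    uint32_t tsc_khz;      /* kernel's current, possibly refined, estimate */
    uint32_t cpu_tsc_khz;  /* per-CPU snapshot taken during initialization */
};

/*
 * With a constant TSC, read the live estimate so that a late refinement
 * is picked up; otherwise fall back to the per-CPU snapshot.
 */
static uint32_t effective_tsc_khz(const struct tsc_state *s)
{
    return s->constant_tsc ? s->tsc_khz : s->cpu_tsc_khz;
}
```

If refinement updates `tsc_khz` after the snapshot was taken, only the constant-TSC path observes the refined value, which is the behavior the commit message argues for.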
    • KVM: selftests: Verify userspace can stuff IA32_FEATURE_CONTROL at will · b80732fd
      Sean Christopherson authored
      Verify that KVM allows userspace to set all supported bits in the
      IA32_FEATURE_CONTROL MSR irrespective of the current guest CPUID, and
      that all unsupported bits are rejected.
      
      Throw the testcase into vmx_msrs_test even though it's not technically a
      VMX MSR; it's close enough, and the feature most frequently controlled by
      the MSR is VMX.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-4-seanjc@google.com
    • KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helper · 2d6cd686
      Sean Christopherson authored
      Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from
      the guest, into the helper to check the overall validity of the incoming
      value.  Opportunistically rename the helper to make it clear that it
      returns a bool.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com
    • KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bits · d2a00af2
      Sean Christopherson authored
      Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL
      irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS.  KVM's ABI
      is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs
      to values that would fault according to the guest CPUID model.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com
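Taken together, the two IA32_FEATURE_CONTROL commits above describe a validity check along these lines. The sketch is a simplified model under stated assumptions: the struct, its field names, and `feat_ctl_write_is_valid()` are hypothetical, and only the lock bit position (bit 0) is taken from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEAT_CTL_LOCKED (1ULL << 0) /* lock bit of IA32_FEATURE_CONTROL */

/* Hypothetical per-vCPU state; field names are illustrative only. */
struct feat_ctl_state {
    uint64_t current;          /* current MSR value */
    uint64_t guest_valid_bits; /* bits permitted by the guest CPUID model */
    uint64_t kvm_supported;    /* every bit KVM supports */
};

static bool feat_ctl_write_is_valid(const struct feat_ctl_state *s,
                                    uint64_t data, bool host_initiated)
{
    uint64_t valid_bits;

    /* A locked MSR is read-only from the guest, but not from userspace. */
    if (!host_initiated && (s->current & FEAT_CTL_LOCKED))
        return false;

    /* Userspace may set any supported bit, irrespective of guest CPUID. */
    valid_bits = host_initiated ? s->kvm_supported : s->guest_valid_bits;

    return (data & ~valid_bits) == 0;
}
```

Folding the lock check into the same helper keeps a single place that answers "is this write acceptable", which appears to be the motivation given for the refactoring.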
    • KVM: VMX: Make vmread_error_trampoline() uncallable from C code · 0b5e7a16
      Sean Christopherson authored
      Declare vmread_error_trampoline() as an opaque symbol so that it cannot
      be called from C code, at least not without some serious fudging.  The
      trampoline always passes parameters on the stack so that the inline
      VMREAD sequence doesn't need to clobber registers.  regparm(0) was
      originally added to document the stack behavior, but it ended up being
      confusing because regparm(0) is a nop for 64-bit targets.
      
      Opportunistically wrap the trampoline and its declaration in #ifdeffery
      to make it even harder to invoke incorrectly, to document why it exists,
      and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
      is true for all supported toolchains.
      
      No functional change intended.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
    • KVM: nVMX: Reword comments about generating nested CR0/4 read shadows · 4a8fd4a7
      Sean Christopherson authored
      Reword the comments that (attempt to) document nVMX's overrides of the
      CR0/4 read shadows for L2 after calling vmx_set_cr0/4().  The important
      behavior that needs to be documented is that KVM needs to override the
      shadows to account for L1's masks even though the shadows are set by the
      common helpers (and that setting the shadows first would result in the
      correct shadows being clobbered).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com
    • KVM: x86: Clean up KVM_CAP_X86_USER_SPACE_MSR documentation · 1f158147
      Sean Christopherson authored
      Clean up the KVM_CAP_X86_USER_SPACE_MSR documentation to eliminate
      misleading and/or inconsistent verbiage, and to actually document what
      accesses are intercepted by which flags.
      
        - s/will/may since not all #GPs are guaranteed to be intercepted
        - s/deflect/intercept to align with common KVM terminology
        - s/user space/userspace to align with the majority of KVM docs
        - Avoid using "trap" terminology, as KVM exits to userspace _before_
          stepping, i.e. doesn't exhibit trap-like behavior
        - Actually document the flags
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-4-seanjc@google.com
    • KVM: x86: Reword MSR filtering docs to more precisely define behavior · b93d2ec3
      Sean Christopherson authored
      Reword the MSR filtering documentation to more precisely define the
      behavior of filtering using common virtualization terminology.
      
        - Explicitly document KVM's behavior when an MSR is denied
        - s/handled/allowed as there is no guarantee KVM will "handle" the
          MSR access
        - Drop the "fall back" terminology, which incorrectly suggests that
          there is existing KVM behavior to fall back to
        - Fix an off-by-one error in the range (the end is exclusive)
        - Call out the interaction between MSR filtering and
          KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER
        - Delete the redundant paragraph on what '0' and '1' in the bitmap
          mean; it's covered by the sections on KVM_MSR_FILTER_{READ,WRITE}
        - Delete the clause on x2APIC MSR behavior depending on APIC base; this
          is covered by stating that KVM follows architectural behavior when
          emulating/virtualizing MSR accesses
      Reported-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-3-seanjc@google.com
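The exclusive range end mentioned in the list above is easy to get wrong, so here is a small sketch of a single-range lookup. It is not KVM's implementation: `struct msr_range` and `msr_filter_lookup()` are hypothetical, and the bitmap convention (1 = allowed, 0 = denied) is assumed from the KVM API documentation rather than stated in this commit message.

```c
#include <stdint.h>

/* Hypothetical model of one filter range covering [base, base + nmsrs). */
struct msr_range {
    uint32_t base;          /* first MSR index covered */
    uint32_t nmsrs;         /* number of MSRs covered; the end is exclusive */
    const uint8_t *bitmap;  /* bit i describes MSR base + i */
};

/*
 * Returns 1 if the access is allowed, 0 if it is denied, and -1 if the
 * MSR is not covered by this range (the caller applies its default).
 */
static int msr_filter_lookup(const struct msr_range *r, uint32_t index)
{
    uint32_t offset;

    /* Exclusive upper bound: base + nmsrs itself is NOT in the range. */
    if (index < r->base || index - r->base >= r->nmsrs)
        return -1;

    offset = index - r->base;
    return (r->bitmap[offset / 8] >> (offset % 8)) & 1;
}
```

Comparing `index - r->base` against `nmsrs` instead of computing `base + nmsrs` avoids wrap-around when a range sits near the top of the 32-bit MSR index space.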
    • KVM: x86: Delete documentation for READ|WRITE in KVM_X86_SET_MSR_FILTER · 5c8c0b32
      Sean Christopherson authored
      Delete the paragraph that describes the behavior when both
      KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE are set for a range.  There is
      nothing special about KVM's handling of this combination, whereas
      explicitly documenting the combination suggests that there is some magic
      behavior the user needs to be aware of.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-2-seanjc@google.com
    • KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS · 2e7eab81
      Jim Mattson authored
      According to Intel's document on Indirect Branch Restricted
      Speculation, "Enabling IBRS does not prevent software from controlling
      the predicted targets of indirect branches of unrelated software
      executed later at the same predictor mode (for example, between two
      different user applications, or two different virtual machines). Such
      isolation can be ensured through use of the Indirect Branch Predictor
      Barrier (IBPB) command." This applies to both basic and enhanced IBRS.
      
      Since L1 and L2 VMs share hardware predictor modes (guest-user and
      guest-kernel), hardware IBRS is not sufficient to virtualize
      IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts,
      hardware IBRS is actually sufficient in practice, even though it isn't
      sufficient architecturally.)
      
      For virtual CPUs that support IBRS, add an indirect branch prediction
      barrier on emulated VM-exit, to ensure that the predicted targets of
      indirect branches executed in L1 cannot be controlled by software that
      was executed in L2.
      
      Since we typically don't intercept guest writes to IA32_SPEC_CTRL,
      perform the IBPB at emulated VM-exit regardless of the current
      IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be
      deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is
      clear at emulated VM-exit.
      
      This is CVE-2022-2196.
      
      Fixes: 5c911bef ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02")
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: VMX: Guest usage of IA32_SPEC_CTRL is likely · 4f209989
      Jim Mattson authored
      At this point in time, most guests (in the default, out-of-the-box
      configuration) are likely to use IA32_SPEC_CTRL.  Therefore, drop the
      compiler hint that it is unlikely for KVM to be intercepting WRMSR of
      IA32_SPEC_CTRL.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check fails · 9cc40932
      Sean Christopherson authored
      Inject #GP if VMXON is attempted with a CR0/CR4 that fails the
      generic "is CRx valid" check, but passes the CR4.VMXE check, and do the
      generic checks _after_ handling the post-VMXON VM-Fail.
      
      The CR4.VMXE check, and all other #UD cases, are special pre-conditions
      that are enforced prior to pivoting on the current VMX mode, i.e. occur
      before interception if VMXON is attempted in VMX non-root mode.
      
      All other CR0/CR4 checks generate #GP and effectively have lower priority
      than the post-VMXON check.
      
      Per the SDM:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD;
          ELSIF not in VMX operation
              THEN
                  IF (CPL > 0) or (in A20M mode) or
                  (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
          ELSIF in VMX non-root operation
              THEN VMexit;
          ELSIF CPL > 0
              THEN #GP(0);
          ELSE VMfail("VMXON executed in VMX root operation");
          FI;
      
      which, if re-written without ELSIF, yields:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD
      
          IF in VMX non-root operation
              THEN VMexit;
      
          IF CPL > 0
              THEN #GP(0)
      
          IF in VMX operation
              THEN VMfail("VMXON executed in VMX root operation");
      
          IF (in A20M mode) or
             (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
      
      Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to L1,
      i.e. there is no need to check the vCPU is not in VMX non-root mode.  Add
      a comment to explain why unconditionally forwarding such exits is
      functionally correct.
      Reported-by: Eric Li <ercli@ucdavis.edu>
      Fixes: c7d855c2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com
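The flattened (ELSIF-free) pseudocode in the commit above translates almost line for line into a priority-ordered check. The following is a sketch of that ordering only, not KVM's VMXON handler: the struct and enum are hypothetical, and the #UD conditions elided by "..." in the SDM excerpt are deliberately left out.

```c
#include <stdbool.h>

enum vmxon_result { VMXON_UD, VMXON_VMEXIT, VMXON_GP, VMXON_VMFAIL, VMXON_OK };

/* Hypothetical snapshot of the state the checks look at. */
struct vmxon_ctx {
    bool register_operand;   /* VMXON was given a register operand */
    bool cr0_pe;             /* CR0.PE */
    bool cr4_vmxe;           /* CR4.VMXE */
    bool in_vmx_operation;   /* already post-VMXON (root or non-root) */
    bool in_non_root;        /* executing in VMX non-root operation */
    unsigned int cpl;        /* current privilege level */
    bool a20m;               /* in A20M mode */
    bool cr0_cr4_supported;  /* CR0/CR4 values are valid for VMX operation */
};

/* Checks are evaluated strictly in the priority order of the pseudocode. */
static enum vmxon_result vmxon_check(const struct vmxon_ctx *c)
{
    if (c->register_operand || !c->cr0_pe || !c->cr4_vmxe)
        return VMXON_UD;      /* special pre-conditions come first */
    if (c->in_non_root)
        return VMXON_VMEXIT;  /* intercepted before any #GP check */
    if (c->cpl > 0)
        return VMXON_GP;
    if (c->in_vmx_operation)
        return VMXON_VMFAIL;  /* "VMXON executed in VMX root operation" */
    if (c->a20m || !c->cr0_cr4_supported)
        return VMXON_GP;      /* generic CR0/CR4 checks have lowest priority */
    return VMXON_OK;
}
```

The ordering makes the commit's point visible: an unsupported CR0/CR4 with CR4.VMXE set yields #GP, and only after the post-VMXON VM-Fail case has been ruled out.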
    • KVM: SVM: Replace kmap_atomic() with kmap_local_page() · a8a12c00
      Zhao Liu authored
      The use of kmap_atomic() is being deprecated in favor of
      kmap_local_page()[1].
      
      The main difference between atomic and local mappings is that local
      mappings don't disable page faults or preemption.
      
      There are two reasons we can use kmap_local_page() here:
      1. SEV is 64-bit only and kmap_local_page() doesn't disable migration in
      this case, but here the function clflush_cache_range() uses the CLFLUSHOPT
      instruction to flush, and on x86 CLFLUSHOPT is not CPU-local and flushes
      the page out of the entire cache hierarchy on all CPUs (APM volume 3,
      chapter 3, CLFLUSHOPT). So there's no need to disable preemption to ensure
      CPU-locality.
      2. clflush_cache_range() doesn't need to disable page faults, and the
      mapping is still valid even if the task sleeps. This is also true for
      sched out/in when preempted.
      
      In addition, though kmap_local_page() is a thin wrapper around
      page_address() on 64-bit, kmap_local_page() should still be used here in
      preference to page_address() since page_address() isn't suitable to be used
      in a generic function (like sev_clflush_pages()) where the allocation source
      of the page passed in is not easy to determine. Keeping the kmap* API in
      place means it can be used for things other than highmem mappings[2].
      
      Therefore, sev_clflush_pages() is a function that should use
      kmap_local_page() in place of kmap_atomic().
      
      Convert the calls of kmap_atomic() / kunmap_atomic() to kmap_local_page() /
      kunmap_local().
      
      [1]: https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com
      [2]: https://lore.kernel.org/lkml/5d667258-b58b-3d28-3609-e7914c99b31b@intel.com/
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Suggested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928092748.463631-1-zhao1.liu@linux.intel.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid · 5c30e810
      Sean Christopherson authored
      Skip the WRMSR fastpath in SVM's VM-Exit handler if the next RIP isn't
      valid, e.g. because KVM is running with nrips=false.  SVM must decode and
      emulate to skip the WRMSR if the CPU doesn't provide the next RIP.
      Getting the instruction bytes to decode the WRMSR requires reading guest
      memory, which in turn means dereferencing memslots, and that isn't safe
      because KVM doesn't hold SRCU when the fastpath runs.
      
      Don't bother trying to enable the fastpath for this case, e.g. by doing
      only the WRMSR and leaving the "skip" until later.  NRIPS is supported on
      all modern CPUs (KVM has considered making it mandatory), and the next
      RIP will be valid the vast, vast majority of the time.
      
        =============================
        WARNING: suspicious RCU usage
        6.0.0-smp--4e557fcd3d80-skip #13 Tainted: G           O
        -----------------------------
        include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by stable/206475:
         #0: ffff9d9dfebcc0f0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8b/0x620 [kvm]
      
        stack backtrace:
        CPU: 152 PID: 206475 Comm: stable Tainted: G           O       6.0.0-smp--4e557fcd3d80-skip #13
        Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
        Call Trace:
         <TASK>
         dump_stack_lvl+0x69/0xaa
         dump_stack+0x10/0x12
         lockdep_rcu_suspicious+0x11e/0x130
         kvm_vcpu_gfn_to_memslot+0x155/0x190 [kvm]
         kvm_vcpu_gfn_to_hva_prot+0x18/0x80 [kvm]
         paging64_walk_addr_generic+0x183/0x450 [kvm]
         paging64_gva_to_gpa+0x63/0xd0 [kvm]
         kvm_fetch_guest_virt+0x53/0xc0 [kvm]
         __do_insn_fetch_bytes+0x18b/0x1c0 [kvm]
         x86_decode_insn+0xf0/0xef0 [kvm]
         x86_emulate_instruction+0xba/0x790 [kvm]
         kvm_emulate_instruction+0x17/0x20 [kvm]
         __svm_skip_emulated_instruction+0x85/0x100 [kvm_amd]
         svm_skip_emulated_instruction+0x13/0x20 [kvm_amd]
         handle_fastpath_set_msr_irqoff+0xae/0x180 [kvm]
         svm_vcpu_run+0x4b8/0x5a0 [kvm_amd]
         vcpu_enter_guest+0x16ca/0x22f0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x39d/0x900 [kvm]
         kvm_vcpu_ioctl+0x538/0x620 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930234031.1732249-1-seanjc@google.com
    • KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception · 17122c06
      Sean Christopherson authored
      Treat any exception during instruction decode for EMULTYPE_SKIP as a
      "full" emulation failure, i.e. signal failure instead of queuing the
      exception.  When decoding purely to skip an instruction, KVM and/or the
      CPU has already done some amount of emulation that cannot be unwound,
      e.g. on an EPT misconfig VM-Exit KVM has already processed the emulated
      MMIO.  KVM already does this if a #UD is encountered, but not for other
      exceptions, e.g. if a #PF is encountered during fetch.
      
      In SVM's soft-injection use case, queueing the exception is particularly
      problematic as queueing exceptions while injecting events can put KVM
      into an infinite loop due to bailing from VM-Enter to service the newly
      pending exception.  E.g. multiple warnings to detect such behavior fire:
      
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G        W          6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
    • KVM: x86: Keep the lock order consistent between SRCU and gpc spinlock · 4265df66
      Peng Hao authored
      Acquire SRCU before taking the gpc spinlock in wait_pending_event() so as
      to be consistent with all other functions that acquire both locks.  It's
      not illegal to acquire SRCU inside a spinlock, nor is there deadlock
      potential, but in general it's preferable to order locks from least
      restrictive to most restrictive, e.g. if wait_pending_event() needed to
      sleep for whatever reason, it could do so while holding SRCU, but would
      need to drop the spinlock.
      Signed-off-by: Peng Hao <flyingpeng@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  2. 30 Nov, 2022 7 commits
    • KVM: VMX: Resume guest immediately when injecting #GP on ECREATE · eb3992e8
      Sean Christopherson authored
      Resume the guest immediately when injecting a #GP on ECREATE due to an
      invalid enclave size, i.e. don't attempt ECREATE in the host.  The #GP is
      a terminal fault, e.g. skipping the instruction if ECREATE is successful
      would result in KVM injecting #GP on the instruction following ECREATE.
      
      Fixes: 70210c04 ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
      Cc: stable@vger.kernel.org
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Link: https://lore.kernel.org/r/20220930233132.1723330-1-seanjc@google.com
    • KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULT · df0bb47b
      Paolo Bonzini authored
      If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by
      turning it into a vmexit), there is no need to leave vcpu_enter_guest().
      Any vcpu->requests will be caught later before the actual vmentry,
      and in fact vcpu_enter_guest() was not initializing the "r" variable.
      Depending on the compiler's whims, this could cause the
      x86_64/triple_fault_event_test test to fail.
      
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 92e7d5c8 ("KVM: x86: allow L1 to not intercept triple fault")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove unused argument in gpc_unmap_khva() · c1a81f3b
      Michal Luczaj authored
      Remove the unused @kvm argument from gpc_unmap_khva().
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Shorten gfn_to_pfn_cache function names · aba3caef
      Michal Luczaj authored
      Formalize "gpc" as the acronym and use it in function names.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary · 8acc3518
      David Woodhouse authored
      Torture test the cases where the runstate crosses a page boundary, and
      especially the case where it's configured in 32-bit mode and doesn't,
      but then switching to 64-bit mode makes it go onto the second page.
      
      To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl
      also update the guest runstate area. It already did so if the actual
      runstate changed, as a side-effect of kvm_xen_update_runstate(). So
      doing it in the plain adjustment case is making it more consistent, as
      well as giving us a nice way to trigger the update without actually
      running the vCPU again and changing the values.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured · d8ba8ba4
      David Woodhouse authored
      Closer inspection of the Xen code shows that we aren't supposed to be
      using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
      explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
      If we randomly set the top bit of ->state_entry_time for a guest that
      hasn't asked for it and doesn't expect it, that could make the runtimes
      fail to add up and confuse the guest. Without the flag it's perfectly
      safe for a vCPU to read its own vcpu_runstate_info; just not for one
      vCPU to read *another's*.
      
      I briefly pondered adding a word for the whole set of VMASST_TYPE_*
      flags but the only one we care about for HVM guests is this, so it
      seemed a bit pointless.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Compatibility fixes for shared runstate area · 5ec3289b
      David Woodhouse authored
      The guest runstate area can be arbitrarily byte-aligned. In fact, even
      when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
      fields in the structure end up being unaligned due to the fact that the
      32-bit ABI only aligns them to 32 bits.
      
      So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
      is buggy, because if it's unaligned then we can't update the whole field
      atomically; the low bytes might be observable before the _UPDATE bit is.
      Xen actually updates the *byte* containing that top bit, on its own. KVM
      should do the same.
      
      In addition, we cannot assume that the runstate area fits within a single
      page. One option might be to make the gfn_to_pfn cache cope with regions
      that cross a page — but getting a contiguous virtual kernel mapping of a
      discontiguous set of IOMEM pages is a distinctly non-trivial exercise,
      and it seems this is the *only* current use case for the GPC which would
      benefit from it.
      
      An earlier version of the runstate code did use a gfn_to_hva cache for
      this purpose, but it still had the single-page restriction because it
      used the uhva directly — because it needs to be able to do so atomically
      when the vCPU is being scheduled out, so it used pagefault_disable()
      around the accesses and didn't just use kvm_write_guest_cached() which
      has a fallback path.
      
      So... use a pair of GPCs for the first and potential second page covering
      the runstate area. We can get away with locking both at once because
      nothing else takes more than one GPC lock at a time so we can invent
      a trivial ordering rule.
      
      The common case where it's all in the same page is kept as a fast path,
      but in both cases, the actual guest structure (compat or not) is built
      up from the fields in @vx, following preset pointers to the state and
      times fields. The only difference is whether those pointers point to
      the kernel stack (in the split case) or to guest memory directly via
      the GPC.  The fast path is also fixed to use a byte access for the
      XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual
      memcpy.
      
      Finally, Xen also does write the runstate area immediately when it's
      configured. Flip the kvm_xen_update_runstate() and …_guest() functions
      and call the latter directly when the runstate area is set. This means
      that other ioctls which modify the runstate also write it immediately
      to the guest when they do so, which is also intended.
      
      Update the xen_shinfo_test to exercise the pathological case where the
      XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
      actually in a different page to the rest of the 64-bit word.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
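The byte-wise handling of the XEN_RUNSTATE_UPDATE bit described in the commit above can be illustrated as follows. This is a user-space sketch, not the KVM code: the helper names are hypothetical, the field is treated as a little-endian 64-bit value at an arbitrary (possibly unaligned) address, and the memory barriers a real implementation needs between the steps are omitted.

```c
#include <stdint.h>

#define XEN_RUNSTATE_UPDATE (1ULL << 63) /* top bit of state_entry_time */

/* Byte 7 of a little-endian 64-bit field holds bits 56..63. */
#define UPDATE_BYTE_OFFSET 7
#define UPDATE_BYTE_MASK   ((uint8_t)(XEN_RUNSTATE_UPDATE >> 56)) /* 0x80 */

/* Mark the area as "update in progress" with one single-byte store. */
static void runstate_begin_update(uint8_t *entry_time)
{
    entry_time[UPDATE_BYTE_OFFSET] |= UPDATE_BYTE_MASK;
}

/* Store the new value, then clear the flag with one single-byte store. */
static void runstate_end_update(uint8_t *entry_time, uint64_t new_time)
{
    int i;

    /* Low seven bytes first; the flag byte stays untouched for now. */
    for (i = 0; i < UPDATE_BYTE_OFFSET; i++)
        entry_time[i] = (uint8_t)(new_time >> (8 * i));

    /* Top byte carries the new bits with the update flag still set. */
    entry_time[UPDATE_BYTE_OFFSET] =
        (uint8_t)(new_time >> 56) | UPDATE_BYTE_MASK;

    /* Final single-byte store publishes the completed update. */
    entry_time[UPDATE_BYTE_OFFSET] &= (uint8_t)~UPDATE_BYTE_MASK;
}
```

Because the flag is only ever set or cleared through a one-byte store, the scheme works even when the 64-bit field is unaligned or, as in the pathological test case, when its top byte lands on a different page from the rest of the word.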
  3. 28 Nov, 2022 12 commits
    • Merge tag 'kvm-s390-next-6.2-1' of... · 1e79a9e3
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      - Second batch of the lazy destroy patches
      - First batch of KVM changes for kernel virtual != physical address support
      - Removal of an unused function
    • KVM: x86: Advertise PREFETCHIT0/1 CPUID to user space · 29c46979
      Jiaxi Chen authored
      The latest Intel platform, Granite Rapids, introduces new instructions,
      PREFETCHIT0/1, which move code to memory (cache) closer to the
      processor depending on specific hints.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EDX[bit 14]
      
      PREFETCHIT0/1 is on a KVM-only subleaf. Add an X86_FEATURE definition
      for this feature bit to direct it to the KVM entry.
      
      Advertise PREFETCHIT0/1 to KVM userspace. This is safe because there are
      no new VMX controls or additional host enabling required for guests to
      use this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Message-Id: <20221125125845.1182922-9-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
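The bit definitions quoted in this series can be checked against raw CPUID.(EAX=7,ECX=1) register values. A minimal sketch in plain C, assuming the caller already holds the EAX and EDX outputs of that subleaf; the table and the helper name `leaf7_1_has` are hypothetical and are not KVM code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum cpuid_reg { REG_EAX, REG_EDX };

struct feature_bit {
    const char *name;
    enum cpuid_reg reg;
    unsigned bit;
};

/* Bit positions as quoted in the commit messages of this series. */
static const struct feature_bit leaf7_1_features[] = {
    { "CMPccXADD",      REG_EAX,  7 },
    { "AMX-FP16",       REG_EAX, 21 },
    { "AVX-IFMA",       REG_EAX, 23 },
    { "AVX-VNNI-INT8",  REG_EDX,  4 },
    { "AVX-NE-CONVERT", REG_EDX,  5 },
    { "PREFETCHIT0/1",  REG_EDX, 14 },
};

/* Returns 1 if the named feature bit is set, 0 if clear, -1 if unknown. */
static int leaf7_1_has(const char *name, uint32_t eax, uint32_t edx)
{
    size_t n = sizeof(leaf7_1_features) / sizeof(leaf7_1_features[0]);

    for (size_t i = 0; i < n; i++) {
        const struct feature_bit *f = &leaf7_1_features[i];

        if (strcmp(f->name, name) == 0) {
            uint32_t reg = (f->reg == REG_EAX) ? eax : edx;
            return (int)((reg >> f->bit) & 1u);
        }
    }
    return -1;
}
```

For example, `leaf7_1_has("PREFETCHIT0/1", eax, edx)` looks only at EDX bit 14 and ignores EAX entirely.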
      KVM: x86: Advertise AVX-NE-CONVERT CPUID to user space · 9977f087
      Jiaxi Chen authored
      AVX-NE-CONVERT is a new set of instructions which can convert low
      precision floating point formats such as BF16/FP16 to the higher
      precision FP32, and can also convert FP32 elements to BF16. These
      instructions allow the platform to have improved AI capabilities and
      better compatibility.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EDX[bit 5]
      
      AVX-NE-CONVERT is on a KVM-only subleaf. Add an X86_FEATURE definition
      for this feature bit to direct it to the KVM entry.
      
      Advertise AVX-NE-CONVERT to KVM userspace. This is safe because there
      are no new VMX controls or additional host enabling required for guests
      to use this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Message-Id: <20221125125845.1182922-8-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
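The BF16 and FP32 conversions mentioned above can be illustrated with a scalar sketch. It assumes BF16 is simply the upper 16 bits of an IEEE-754 FP32 value, and it assumes round-to-nearest-even when narrowing; NaN handling is deliberately left out. The function names are hypothetical and this does not claim to match the instructions bit for bit.

```c
#include <stdint.h>
#include <string.h>

/* Widen BF16 to FP32: the 16 bits become the top half of the float. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;

    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow FP32 to BF16 with round-to-nearest-even (assumption of this
 * sketch); NaN inputs are not treated specially. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}
```

Widening is exact because every BF16 value is representable in FP32; only the narrowing direction has to round.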
      KVM: x86: Advertise AVX-VNNI-INT8 CPUID to user space · 24d74b9f
      Jiaxi Chen authored
      AVX-VNNI-INT8 is a new set of instructions in the latest Intel platform
      Sierra Forest, intended to give the platform superior AI capabilities.
      These instructions multiply the individual bytes of two signed or
      unsigned source operands, then add and accumulate the results into the
      destination dword element size operand.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EDX[bit 4]
      
      AVX-VNNI-INT8 is on a new and sparse CPUID leaf, and none of the bits on
      this leaf has a true kernel use case for now. Given that, and to save
      space for kernel feature bits, move this new leaf to a KVM-only subleaf
      and add an X86_FEATURE definition for AVX-VNNI-INT8 to direct it to the
      KVM entry.
      
      Advertise AVX-VNNI-INT8 to KVM userspace. This is safe because there are
      no new VMX controls or additional host enabling required for guests to
      use this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Message-Id: <20221125125845.1182922-7-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
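The multiply-and-accumulate behaviour described above can be modelled for a single dword lane. This is a scalar sketch only: it assumes four byte products per 32-bit accumulator and wrapping (non-saturating) accumulation, and the function names are hypothetical.

```c
#include <stdint.h>

/* Signed-by-signed bytes: acc += a[0]*b[0] + ... + a[3]*b[3].
 * The sum is kept in uint32_t so that wrap-around is well defined. */
static int32_t dp4_s8s8(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    uint32_t sum = (uint32_t)acc;

    for (int i = 0; i < 4; i++)
        sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return (int32_t)sum;
}

/* Unsigned-by-unsigned bytes, same structure. */
static uint32_t dp4_u8u8(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)a[i] * (uint32_t)b[i];
    return acc;
}
```

A mixed signed-by-unsigned variant would follow the same pattern with one operand of each type.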
      x86: KVM: Advertise AVX-IFMA CPUID to user space · 5e85c4eb
      Jiaxi Chen authored
      AVX-IFMA is a new instruction in the latest Intel platform Sierra
      Forest. This instruction performs packed multiplication of unsigned
      52-bit integers and adds the low or high 52 bits of each product to
      qword accumulators.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EAX[bit 23]
      
      AVX-IFMA is on an expected-dense CPUID leaf and some other bits on this
      leaf have kernel usages. Given that, define this feature bit as
      X86_FEATURE_<name> in the kernel. Because AVX-IFMA itself has no real
      kernel usage and /proc/cpuinfo already has too many unreadable flags,
      hide this one from /proc/cpuinfo.
      
      Advertise AVX-IFMA to KVM userspace. This is safe because there are no
      new VMX controls or additional host enabling required for guests to use
      this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Message-Id: <20221125125845.1182922-6-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
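The 52-bit multiply-add described above can be written out for one qword lane. A scalar sketch, assuming the inputs are truncated to their low 52 bits and the accumulator wraps at 64 bits; it relies on the `unsigned __int128` extension available in GCC and Clang on 64-bit targets, and the function names are hypothetical.

```c
#include <stdint.h>

#define MASK52 ((1ULL << 52) - 1)

/* Full 104-bit product of the low 52 bits of each input. */
static unsigned __int128 mul52(uint64_t a, uint64_t b)
{
    return (unsigned __int128)(a & MASK52) * (b & MASK52);
}

/* Add the low 52 bits of the product to the accumulator. */
static uint64_t madd52lo(uint64_t acc, uint64_t a, uint64_t b)
{
    return acc + ((uint64_t)mul52(a, b) & MASK52);
}

/* Add the high 52 bits (bits 103:52) of the product to the accumulator. */
static uint64_t madd52hi(uint64_t acc, uint64_t a, uint64_t b)
{
    return acc + (uint64_t)(mul52(a, b) >> 52);
}
```

Splitting the product into low and high halves is what lets multi-precision integer code chain these operations across limbs.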
      x86: KVM: Advertise AMX-FP16 CPUID to user space · af2872f6
      Chang S. Bae authored
      The latest Intel platform, Granite Rapids, introduces a new instruction,
      AMX-FP16, which performs dot-products of two FP16 tiles and accumulates
      the results into a packed single precision tile. AMX-FP16 adds FP16
      capability and also allows an FP16 GPU-trained model to run faster
      without loss of accuracy or added SW overhead.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EAX[bit 21]
      
      AMX-FP16 is on an expected-dense CPUID leaf and some other bits on this
      leaf have kernel usages. Given that, define this feature bit as
      X86_FEATURE_<name> in the kernel. Because AMX-FP16 itself has no real
      kernel usage and /proc/cpuinfo already has too many unreadable flags,
      hide this one from /proc/cpuinfo.
      
      Advertise AMX-FP16 to KVM userspace. This is safe because there are no
      new VMX controls or additional host enabling required for guests to use
      this feature.
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Message-Id: <20221125125845.1182922-5-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      x86: KVM: Advertise CMPccXADD CPUID to user space · 6a19d7aa
      Jiaxi Chen authored
      CMPccXADD is a new set of instructions in the latest Intel platform
      Sierra Forest. This new instruction set includes a semaphore operation
      that can compare and add the operands if a condition is met, which can
      improve database performance.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EAX[bit 7]
      
      CMPccXADD is on an expected-dense CPUID leaf and some other bits on this
      leaf have kernel usages. Given that, define this feature bit as
      X86_FEATURE_<name> in the kernel. Because CMPccXADD itself has no real
      kernel usage and /proc/cpuinfo already has too many unreadable flags,
      hide this one from /proc/cpuinfo.
      
      Advertise CMPccXADD to KVM userspace. This is safe because there are no
      new VMX controls or additional host enabling required for guests to use
      this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Message-Id: <20221125125845.1182922-4-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
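The "compare and add if the condition is met" operation can be modelled in a few lines of C. This is a plain, non-atomic sketch with no attempt at the instruction's memory-ordering behaviour; the condition callback, the return of the old value, and all names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Condition callbacks stand in for the "cc" part of the mnemonic. */
typedef bool (*cond_fn)(uint64_t mem_val, uint64_t cmp_val);

static bool cond_below(uint64_t mem_val, uint64_t cmp_val)
{
    return mem_val < cmp_val;
}

static bool cond_equal(uint64_t mem_val, uint64_t cmp_val)
{
    return mem_val == cmp_val;
}

/* Compare *mem with cmp_val; if the condition holds, add addend to *mem.
 * The previous memory value is returned so the caller can tell what
 * happened (a convenience of this sketch). */
static uint64_t cmp_cond_add(uint64_t *mem, uint64_t cmp_val,
                             uint64_t addend, cond_fn cond)
{
    uint64_t old = *mem;

    if (cond(old, cmp_val))
        *mem = old + addend;
    return old;
}
```

A bounded counter is one use: with `cond_below` and a limit as `cmp_val`, the add only happens while the counter is under the limit.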
      KVM: x86: Update KVM-only leaf handling to allow for 100% KVM-only leafs · 047c7229
      Sean Christopherson authored
      Rename kvm_cpu_cap_init_scattered() to kvm_cpu_cap_init_kvm_defined() in
      anticipation of adding KVM-only CPUID leafs that aren't recognized by the
      kernel and thus not scattered, i.e. for leafs that are 100% KVM-defined.
      
      Adjust/add comments to kvm_only_cpuid_leafs and KVM_X86_FEATURE to
      document how to create new kvm_only_cpuid_leafs entries for scattered
      features as well as features that are entirely unknown to the kernel.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221125125845.1182922-3-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Add BUILD_BUG_ON() to detect bad usage of "scattered" flags · c4690d01
      Sean Christopherson authored
      Add a compile-time assert in the SF() macro to detect improper usage,
      i.e. to detect passing in an X86_FEATURE_* flag that isn't actually
      scattered by the kernel.  Upcoming feature flags will be 100% KVM-only
      and will have X86_FEATURE_* macros that point at a kvm_only_cpuid_leafs
      word, not a kernel-defined word.  Using SF() and thus boot_cpu_has() for
      such feature flags would access memory beyond x86_capability[NCAPINTS]
      and at best incorrectly hide a feature, and at worst leak kernel state to
      userspace.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221125125845.1182922-2-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
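The kind of compile-time check described above can be sketched outside the kernel with C11 `_Static_assert`. Everything here is a hypothetical stand-in: `NCAPINTS_SKETCH`, `FEATURE()`, `SCATTERED_HAS()` and the feature names are invented for illustration and are not the kernel's SF() macro or its real layout.

```c
#include <stdint.h>

/* Hypothetical layout: the "kernel" knows two capability words. */
#define NCAPINTS_SKETCH 2
#define FEATURE(word, bit)  ((word) * 32 + (bit))
#define KERNEL_FEATURE_A    FEATURE(1, 3)               /* kernel word */
#define KVM_ONLY_FEATURE_B  FEATURE(NCAPINTS_SKETCH, 4) /* past the array */

static const uint32_t capability_words[NCAPINTS_SKETCH] = { 0, 1u << 3 };

/* Evaluates to 0, or fails the build when cond is true. */
#define BUILD_BUG_ON_ZERO_SKETCH(cond, msg) \
    (0 * (int)sizeof(struct { _Static_assert(!(cond), msg); int unused; }))

/* Look up a feature bit, refusing at compile time any feature whose
 * word index lies outside the kernel-defined array. */
#define SCATTERED_HAS(f) \
    (BUILD_BUG_ON_ZERO_SKETCH((f) / 32 >= NCAPINTS_SKETCH, \
                              "feature is not in a kernel-defined word") + \
     (int)((capability_words[(f) / 32] >> ((f) % 32)) & 1u))
```

With these definitions, `SCATTERED_HAS(KERNEL_FEATURE_A)` compiles and reads the array, while `SCATTERED_HAS(KVM_ONLY_FEATURE_B)` is rejected by the compiler instead of indexing past the end at run time.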
      MAINTAINERS: Add KVM x86/xen maintainer list · 7927e275
      David Woodhouse authored
      Adding Paul as co-maintainer of Xen support to help ensure that things
      don't fall through the cracks when I spend three months at a time
      travelling...
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: always declare prototype for kvm_arch_irqchip_in_kernel · 3ca9d84e
      Paolo Bonzini authored
      Architecture code might want to use it even if CONFIG_HAVE_KVM_IRQ_ROUTING
      is false; for example PPC XICS has KVM_IRQ_LINE and wants to use
      kvm_arch_irqchip_in_kernel from there, but it does not have
      KVM_SET_GSI_ROUTING so the prototype was not provided.
      
      Fixes: d663b8a2 ("KVM: replace direct irq.h inclusion")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 24 Nov, 2022 1 commit
  5. 23 Nov, 2022 4 commits