1. 02 Aug, 2021 40 commits
    • KVM: SVM: Stuff save->dr6 during VMSA sync, not at RESET/INIT · d0f9f826
      Sean Christopherson authored
      Move code to stuff vmcb->save.dr6 to its architectural init value from
      svm_vcpu_reset() into sev_es_sync_vmsa().  Except for protected guests,
      a.k.a. SEV-ES guests, vmcb->save.dr6 is set during VM-Enter, i.e. the
      extra write is unnecessary.  For SEV-ES, stuffing save->dr6 handles a
      theoretical case where the VMSA could be encrypted before the first
      KVM_RUN.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-33-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT · 6cfe7b83
      Sean Christopherson authored
      Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the
      values being written are fully redundant with respect to
      svm_set_cr4(vcpu, 0) a few lines earlier.  Note, svm_set_cr4() also
      correctly forces X86_CR4_PAE when NPT is disabled.
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-32-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT · ef8a0fa5
      Sean Christopherson authored
      Hoist svm_set_cr0() up in the sequence of register initialization during
      vCPU RESET/INIT, purely to match VMX so that a future patch can move the
      sequences to common x86.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-31-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit · 816be9e9
      Sean Christopherson authored
      Use the "internal" variants of setting segment registers when stuffing
      state on nested VM-Exit in order to skip the "emulation required"
      updates.  VM-Exit must always go to protected mode, and all segments are
      mostly hardcoded (to valid values) on VM-Exit.  The bits of the segments
      that aren't hardcoded are explicitly checked during VM-Enter, e.g. the
      selector RPLs must all be zero.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Skip emulation required checks during pmode/rmode transitions · 1dd7a4f1
      Sean Christopherson authored
      Don't refresh "emulation required" when stuffing segments during
      transitions to/from real mode when running without unrestricted guest.
      The checks are unnecessary as vmx_set_cr0() unconditionally rechecks
      "emulation required".  They also happen to be broken, as enter_pmode()
      and enter_rmode() run with a stale vcpu->arch.cr0.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-29-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Process CR0.PG side effects after setting CR0 assets · 32437c2a
      Sean Christopherson authored
      Move the long mode and EPT w/o unrestricted guest side effect processing
      down in vmx_set_cr0() so that the EPT && !URG case doesn't have to stuff
      vcpu->arch.cr0 early.  This also fixes an oddity where CR0 might not be
      marked available, i.e. the early vcpu->arch.cr0 write would appear to be
      in danger of being overwritten, though that can't actually happen in the
      current code since CR0.TS is the only guest-owned bit, and CR0.TS is not
      read by vmx_set_cr4().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-28-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0 · 908b7d43
      Sean Christopherson authored
      Skip the MMU permission_fault() check if paging is disabled when
      verifying the cached MMIO GVA is usable.  The check is unnecessary and
      can theoretically get a false positive since the MMU doesn't zero out
      "permissions" or "pkru_mask" when guest paging is disabled.
      
      The obvious alternative is to zero out all the bitmasks when configuring
      nonpaging MMUs, but that's unnecessary work and doesn't align with the
      MMU's general approach of doing as little as possible for flows that are
      supposed to be unreachable.
      
      This is nearly a nop as the false positive is nothing more than an
      insignificant performance blip, and more or less limited to string MMIO
      when L1 is running with paging disabled.  KVM doesn't cache MMIO if L2 is
      active with nested TDP since the "GVA" is really an L2 GPA.  If L2 is
      active without nested TDP, then paging can't be disabled as neither VMX
      nor SVM allows entering the guest without paging of some form.
      
      Jumping back to L1 with paging disabled, in that case direct_map is true
      and so KVM will use CR2 as a GPA; the only time it doesn't is if the
      fault from the emulator doesn't match or emulator_can_use_gpa(), and that
      fails only on string MMIO and other instructions with multiple memory
      operands.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-27-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is disabled · 81ca0e73
      Sean Christopherson authored
      Tweak the logic for grabbing vmcs.GUEST_CR3 in vmx_cache_reg() to look
      directly at the execution controls, as opposed to effectively inferring
      the controls based on vCPUs.  Inferring the controls isn't wrong, but it
      creates a very subtle dependency between the caching logic, the state of
      vcpu->arch.cr0 (via is_paging()), and the behavior of vmx_set_cr0().
      
      Using the execution controls doesn't completely eliminate the dependency
      in vmx_set_cr0(), e.g. neglecting to cache CR3 before enabling
      interception would still break the guest, but it does reduce the
      code dependency and mostly eliminate the logical dependency (that CR3
      loads are intercepted in certain scenarios).  Eliminating the subtle
      read of vcpu->arch.cr0 will also allow for additional cleanup in
      vmx_set_cr0().
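
      The caching rule described above can be sketched in isolation.  This is
      a minimal illustration, not KVM's code: cr3_to_cache() and its
      parameters are hypothetical names, and the only architectural fact
      assumed is that CR3-load exiting is bit 15 of the primary
      processor-based execution controls.

```c
#include <assert.h>
#include <stdint.h>

/* CR3-load exiting: bit 15 of the primary processor-based controls. */
#define CR3_LOAD_EXITING (UINT32_C(1) << 15)

/*
 * Hypothetical helper: pick the CR3 value to cache.  When CR3 loads are
 * not intercepted, the guest can change CR3 without an exit, so the VMCS
 * field is the only up-to-date copy.  When loads are intercepted, the
 * value already tracked in software stays authoritative.
 */
static uint64_t cr3_to_cache(uint32_t exec_ctrl, uint64_t vmcs_guest_cr3,
                             uint64_t tracked_cr3)
{
    if (!(exec_ctrl & CR3_LOAD_EXITING))
        return vmcs_guest_cr3;
    return tracked_cr3;
}
```

      The decision depends only on the control bit, not on the guest's
      CR0.PG, which is the dependency the commit message wants to avoid.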
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-26-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em · 470750b3
      Sean Christopherson authored
      Keep CR3 load/store exiting enabled as needed when running L2 in order to
      honor L1's desires.  This fixes a largely theoretical bug where L1 could
      intercept CR3 but not CR0.PG and end up not getting the desired CR3 exits
      when L2 enables paging.  In other words, the existing !is_paging() check
      inadvertently handles the normal case for L2 where vmx_set_cr0() is
      called during VM-Enter, which is guaranteed to run with paging enabled,
      and thus will never clear the bits.
      
      Removing the !is_paging() check will also allow future consolidation and
      cleanup of the related code.  From a performance perspective, this is
      all a nop, as the VMCS controls shadow will optimize away the VMWRITE
      when the controls are in the desired state.
      
      Add a comment explaining why CR3 is intercepted, with a big disclaimer
      about not querying the old CR3.  Because vmx_set_cr0() is used for flows
      that are not directly tied to MOV CR3, e.g. vCPU RESET/INIT and nested
      VM-Enter, it's possible that is_paging() is not synchronized with CR3
      load/store exiting.  This is actually guaranteed in the current code, as
      KVM starts with CR3 interception disabled.  Obviously that can be fixed,
      but there's no good reason to play whack-a-mole, and it tends to end
      poorly, e.g. descriptor table exiting for UMIP emulation attempted to be
      precise in the past and ended up botching the interception toggling.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0() · c834fd7f
      Sean Christopherson authored
      Move the CR0/CR3/CR4 shenanigans for EPT without unrestricted guest back
      into vmx_set_cr0().  This will allow a future patch to eliminate the
      rather gross stuffing of vcpu->arch.cr0 in the paging transition cases
      by snapshotting the old CR0.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-24-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT · 4f0dcb54
      Sean Christopherson authored
      Remove a bogus write to vcpu->arch.cr0 that immediately precedes
      vmx_set_cr0() during vCPU RESET/INIT.  For RESET, this is a nop since
      the "old" CR0 value is meaningless.  But for INIT, if the vCPU is coming
      from paging enabled mode, crushing vcpu->arch.cr0 will cause the various
      is_paging() checks in vmx_set_cr0() to get false negatives.
      
      For the exit_lmode() case, the false negative is benign as vmx_set_efer()
      is called immediately after vmx_set_cr0().
      
      For EPT without unrestricted guest, the false negative will cause KVM to
      unnecessarily run with CR3 load/store exiting.  But again, this is
      benign, albeit sub-optimal.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest · ee5a5584
      Sean Christopherson authored
      Opt-in to forcing CR0.WP=1 for shadow paging, and stop lying about WP
      being "always on" for unrestricted guest.  In addition to making KVM a
      wee bit more honest, this paves the way for additional cleanup.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT · 9e90e215
      Sean Christopherson authored
      Drop unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT,
      as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip.
      
      No true functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move EDX initialization at vCPU RESET to common code · 49d8665c
      Sean Christopherson authored
      Move the EDX initialization at vCPU RESET, which is now identical between
      VMX and SVM, into common code.
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Consolidate APIC base RESET initialization code · 4547700a
      Sean Christopherson authored
      Consolidate the APIC base RESET logic, which is currently spread out
      across both x86 and vendor code.  For an in-kernel APIC, the vendor code
      is redundant.  But for a userspace APIC, KVM relies on the vendor code
      to initialize vcpu->arch.apic_base.  Hoist the vcpu->arch.apic_base
      initialization above the !apic check so that it applies to both flavors
      of APIC emulation, and delete the vendor code.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET · 42122123
      Sean Christopherson authored
      Stuff vcpu->arch.apic_base and apic->base_address directly during APIC
      reset, as opposed to bouncing through kvm_set_apic_base() while fudging
      the ENABLE bit during creation to avoid the other, unwanted side effects.
      
      This is a step towards consolidating the APIC RESET logic across x86,
      VMX, and SVM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET · f0428b3d
      Sean Christopherson authored
      Write vcpu->arch.apic_base directly instead of bouncing through
      kvm_set_apic_base().  This is a glorified nop, and is a step towards
      cleaning up the mess that is local APIC creation.
      
      When using an in-kernel APIC, kvm_create_lapic() explicitly sets
      vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own
      kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state
      changes.  That call during RESET exists purely to set apic->base_address
      to the default base value.  As a result, by the time VMX gets control,
      the only missing piece is the BSP bit being set for the reset BSP.
      
      For a userspace APIC, there are no side effects to process (for the APIC).
      
      In both cases, the call to kvm_update_cpuid_runtime() is a nop because
      the vCPU hasn't yet been exposed to userspace, i.e. there can't be any
      CPUID entries.
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default · 503bc494
      Sean Christopherson authored
      Set the BSP bit appropriately during local APIC "reset" instead of
      relying on vendor code to clean up at a later point.  This is a step
      towards consolidating the local APIC, VMX, and SVM xAPIC initialization
      code.
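
      As a rough sketch of what "set the BSP bit appropriately" amounts to,
      the APIC base value at RESET can be composed as below.  The helper name
      apic_base_at_reset() is hypothetical; the bit positions (BSP = bit 8,
      global enable = bit 11) and the 0xfee00000 default base are the
      architectural IA32_APIC_BASE layout.  That the enable bit is set at
      RESET is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_DEFAULT_PHYS_BASE UINT64_C(0xfee00000)
#define APICBASE_BSP           (UINT64_C(1) << 8)   /* bootstrap processor */
#define APICBASE_ENABLE        (UINT64_C(1) << 11)  /* xAPIC global enable */

/*
 * Hypothetical helper: compute the APIC base MSR value at RESET.  Only
 * the vCPU designated as the reset BSP gets the BSP flag; every other
 * vCPU comes up as an AP.
 */
static uint64_t apic_base_at_reset(bool is_reset_bsp)
{
    uint64_t base = APIC_DEFAULT_PHYS_BASE | APICBASE_ENABLE;

    if (is_reset_bsp)
        base |= APICBASE_BSP;
    return base;
}
```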
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't force set BSP bit when local APIC is managed by userspace · 01913c57
      Sean Christopherson authored
      Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is
      managed by userspace.  Forcing all vCPUs to be BSPs is nonsensical, and
      was dead code when it was added by commit 97222cc8 ("KVM: Emulate
      local APIC in kernel").  At the time, kvm_lapic_set_base() was invoked
      if and only if the local APIC was in-kernel (and it couldn't be called
      before the vCPU created its APIC).
      
      kvm_lapic_set_base() eventually gained generic usage, but the latent bug
      escaped notice because the only true consumer would be the guest itself
      in the form of explicit RDMSRs on APs.  Out of Linux, SeaBIOS, and
      EDK2/OVMF, only OVMF consumes the BSP bit from the APIC_BASE MSR.  For
      the vast majority of usage in OVMF, BSP confusion would be benign.
      OVMF's BSP election upon SMI rendezvous might be broken, but practically
      no one runs KVM with an out-of-kernel local APIC, let alone does so while
      utilizing SMIs with OVMF.
      
      Fixes: 97222cc8 ("KVM: Emulate local APIC in kernel")
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP · 0214f6bb
      Sean Christopherson authored
      Make vcpu0 the arbitrary owner of the PIT, as was intended when PIT
      migration was added by commit 2f599714 ("KVM: migrate PIT timer").
      The PIT was unintentionally turned into being owned by the BSP by commit
      c5af89b6 ("KVM: Introduce kvm_vcpu_is_bsp() function."), and was then
      unintentionally converted to a shared ownership model when
      kvm_vcpu_is_bsp() was modified to check the APIC base MSR instead of
      hardcoding vcpu0 as the BSP.
      
      Functionally, this just means the PIT's hrtimer is migrated less often.
      The real motivation is to remove the usage of kvm_vcpu_is_bsp(), so that
      more legacy/broken crud can be removed in a future patch.
      
      Fixes: 58d269d8 ("KVM: x86: BSP in MSR_IA32_APICBASE is writable")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove defunct BSP "update" in local APIC reset · 549240e8
      Sean Christopherson authored
      Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and
      confusing nop.  When the code was originally added, kvm_vcpu_is_bsp()
      queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the
      BSP vCPU's APIC.  But, stuffing the BSP bit at INIT was wrong since the
      guest can change its BSP(s); this was fixed by commit 58d269d8 ("KVM:
      x86: BSP in MSR_IA32_APICBASE is writable").
      
      In other words, kvm_vcpu_is_bsp() is now purely a reflection of
      vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always
      set the current value and kvm_lapic_set_base() is effectively a nop if
      the new and old values match.  The RESET case, which does need to stuff
      the BSP for the reset vCPU, is handled by vendor code (though this will
      soon be moved to common code).
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: WARN if the APIC map is dirty without an in-kernel local APIC · c2f79a65
      Sean Christopherson authored
      WARN if KVM ends up in a state where it thinks its APIC map needs to be
      recalculated, but KVM is not emulating the local APIC.  This is mostly
      to document KVM's "rules" in order to provide clarity in future cleanups.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Drop explicit MMU reset at RESET/INIT · 5d2d7e41
      Sean Christopherson authored
      Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the
      common x86 path correctly handles conditional MMU resets, e.g. if INIT
      arrives while the vCPU is in 64-bit mode.
      
      This reverts commit ebae871a ("kvm: svm: reset mmu on VCPU reset").
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Remove explicit MMU reset in enter_rmode() · 61152cd9
      Sean Christopherson authored
      Drop an explicit MMU reset when entering emulated real mode now that the
      vCPU INIT/RESET path correctly handles conditional MMU resets, e.g. if
      INIT arrives while the vCPU is in 64-bit mode.
      
      Note, while there are multiple other direct calls to vmx_set_cr0(), i.e.
      paths that change CR0 without invoking kvm_post_set_cr0(), only the INIT
      emulation can reach enter_rmode().  CLTS emulation only toggles CR0.TS,
      VM-Exit (and late VM-Fail) emulation cannot architecturally transition to
      Real Mode, and VM-Enter to Real Mode is possible if and only if
      Unrestricted Guest is enabled (exposed to L1).
      
      This effectively reverts commit 8668a3c4 ("KVM: VMX: Reset mmu
      context when entering real mode")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT · 665f4d92
      Sean Christopherson authored
      At vCPU RESET/INIT (mostly RESET), stuff EDX with KVM's hardcoded,
      default Family-Model-Stepping ID of 0x600 if CPUID.0x1 isn't defined.
      At RESET, the CPUID lookup is guaranteed to "miss" because KVM emulates
      RESET before exposing the vCPU to userspace, i.e. userspace can't
      possibly have set the vCPU's CPUID model, and thus KVM will always
      write '0'.  At INIT, using 0x600 is less bad than using '0'.
      
      While initializing EDX to '0' is _extremely_ unlikely to be noticed by
      the guest, let alone break the guest, and can be overridden by
      userspace for the RESET case, using 0x600 is preferable as it will allow
      consolidating the relevant VMX and SVM RESET/INIT logic in the future.
      And, digging through old specs suggests that neither Intel nor AMD have
      ever shipped a CPU that initialized EDX to '0' at RESET.
      
      Regarding 0x600 as KVM's default Family, it is a sane default and in
      many ways the most appropriate.  Prior to the 386 implementations, DX
      was undefined at RESET.  With the 386, 486, 586/P5, and 686/P6/Athlon,
      both Intel and AMD set EDX to 3, 4, 5, and 6 respectively.  AMD switched
      to using '15' as its primary Family with the introduction of AMD64, but
      Intel has continued using '6' for the last few decades.
      
      So, '6' is a valid Family for both Intel and AMD CPUs, is compatible
      with both 32-bit and 64-bit CPUs (albeit not a perfect fit for 64-bit
      AMD), and of the common Families (3 - 6), is the best fit with respect to
      KVM's virtual CPU model.  E.g. prior to the P6, Intel CPUs did not have a
      STI window.  Modern operating systems, Linux included, rely on the STI
      window, e.g. for "safe halt", and KVM unconditionally assumes the virtual
      CPU has an STI window.  Thus enumerating a Family ID of 3, 4, or 5 would
      be provably wrong.
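
      For reference, the Family encoded in a Family-Model-Stepping value such
      as 0x600 can be decoded as sketched below.  fms_family() is a
      hypothetical helper, not a kernel function; it follows the CPUID.0x1.EAX
      layout (base family in bits 11:8, extended family in bits 27:20, added
      only when the base family is 0xF).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the display family from an FMS value laid
 * out like CPUID.0x1.EAX.  The extended family field only contributes
 * when the base family is 0xF.
 */
static unsigned int fms_family(uint32_t fms)
{
    unsigned int base = (fms >> 8) & 0xf;
    unsigned int ext = (fms >> 20) & 0xff;

    return base == 0xf ? base + ext : base;
}
```

      With this decoding, 0x600 is Family 6 with Model and Stepping both zero.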
      
      Opportunistically remove a stale comment.
      
      Fixes: 66f7b72e ("KVM: x86: Make register state after reset conform to specification")
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT · 067a456d
      Sean Christopherson authored
      Do not allow an inexact CPUID "match" when querying the guest's CPUID.0x1
      to stuff EDX during INIT.  In the common case, where the guest CPU model
      is an AMD variant, allowing an inexact match is a nop since KVM doesn't
      emulate Intel's goofy "out-of-range" logic for AMD and Hygon.  If the
      vCPU model happens to be an Intel variant, an inexact match is possible
      if and only if the max CPUID leaf is precisely '0'. Aside from the fact
      that there's probably no CPU in existence with a single CPUID leaf, if
      the max CPUID leaf is '0', that means that CPUID.0.EAX is '0', and thus
      an inexact match for CPUID.0x1.EAX will also yield '0'.
      
      So, with lots of twisty logic, no functional change intended.
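
      The "twisty logic" can be checked with a toy model.  The sketch below
      is a simplification under stated assumptions: struct cpuid_entry,
      find_exact() and lookup_eax() are hypothetical, only the basic leaf
      range is modeled, and the "out-of-range" rule is reduced to "redirect
      a leaf above the maximum basic leaf to the maximum basic leaf".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy CPUID table entry: leaf number and the EAX it returns. */
struct cpuid_entry {
    uint32_t function;
    uint32_t eax;
};

static const struct cpuid_entry *find_exact(const struct cpuid_entry *tbl,
                                            size_t n, uint32_t function)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].function == function)
            return &tbl[i];
    return NULL;
}

/*
 * Return EAX for a leaf, or 0 on a miss.  With allow_inexact, a leaf
 * beyond the maximum basic leaf (CPUID.0.EAX) is redirected to that
 * maximum leaf, loosely mimicking Intel's out-of-range behavior.
 */
static uint32_t lookup_eax(const struct cpuid_entry *tbl, size_t n,
                           uint32_t function, bool allow_inexact)
{
    const struct cpuid_entry *e = find_exact(tbl, n, function);

    if (!e && allow_inexact) {
        const struct cpuid_entry *leaf0 = find_exact(tbl, n, 0);

        if (leaf0 && function > leaf0->eax)
            e = find_exact(tbl, n, leaf0->eax);
    }
    return e ? e->eax : 0;
}
```

      When leaf 0 is the only leaf, its EAX is 0, so an inexact lookup of
      leaf 1 lands on leaf 0 and returns 0, which is the same value an exact
      lookup yields on a miss.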
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping · 2a24be79
      Sean Christopherson authored
      Set EDX at RESET/INIT based on the userspace-defined CPUID model when
      possible, i.e. when CPUID.0x1.EAX is defined by userspace.  At RESET/INIT,
      all CPUs that support CPUID set EDX to the FMS enumerated in
      CPUID.0x1.EAX.  If no CPUID match is found, fall back to KVM's default
      of 0x600 (Family '6'), which is the least awful approximation of KVM's
      virtual CPU model.
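
      The selection rule reads roughly as follows in code.  This is a
      minimal sketch, not the actual patch: struct cpuid_leaf1 and
      reset_edx() are hypothetical stand-ins for the guest CPUID lookup, and
      only the 0x600 default comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_RESET_FMS UINT32_C(0x600)  /* Family 6, Model 0, Stepping 0 */

/* Hypothetical result of looking up the guest's CPUID.0x1. */
struct cpuid_leaf1 {
    bool present;  /* did userspace define CPUID.0x1? */
    uint32_t eax;  /* Family-Model-Stepping as enumerated in EAX */
};

/*
 * Hypothetical helper: choose the EDX value at RESET/INIT.  Use the
 * userspace-defined FMS when it exists, else fall back to the default.
 */
static uint32_t reset_edx(const struct cpuid_leaf1 *leaf)
{
    if (leaf && leaf->present)
        return leaf->eax;
    return DEFAULT_RESET_FMS;
}
```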
      
      Fixes: 6aa8b732 ("[PATCH] kvm: userspace interface")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Zero out GDTR.base and IDTR.base on INIT · 4f117ce4
      Sean Christopherson authored
      Explicitly set GDTR.base and IDTR.base to zero when initializing the VMCB.
      Functionally this only affects INIT, as the bases are implicitly set to
      zero on RESET by virtue of the VMCB being zero allocated.
      
      Per AMD's APM, GDTR.base and IDTR.base are zeroed after RESET and INIT.
      
      Fixes: 04d2cc77 ("KVM: Move main vcpu loop into subarch independent code")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit · afc8de01
      Sean Christopherson authored
      Set L1's LDTR on VM-Exit per the Intel SDM:
      
        The host-state area does not contain a selector field for LDTR. LDTR is
        established as follows on all VM exits: the selector is cleared to
        0000H, the segment is marked unusable and is otherwise undefined
        (although the base address is always canonical).
      
      This is likely a benign bug since the LDTR is unusable, as it means the
      L1 VMM is conditioned to reload its LDTR in order to function properly on
      bare metal.
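
      The SDM behavior quoted above can be modeled as below.  struct segment
      and load_ldtr_on_vmexit() are hypothetical names for illustration.
      Zeroing the base and limit is an assumption of this sketch, since the
      SDM leaves them undefined apart from requiring a canonical base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a segment register's cached state. */
struct segment {
    uint16_t selector;
    uint64_t base;
    uint32_t limit;
    bool unusable;
};

/*
 * Establish LDTR as on a VM exit: selector cleared to 0000H and the
 * segment marked unusable.  Base and limit are zeroed here by choice;
 * zero is trivially a canonical base address.
 */
static void load_ldtr_on_vmexit(struct segment *ldtr)
{
    memset(ldtr, 0, sizeof(*ldtr));
    ldtr->unusable = true;
}
```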
      
      Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Flush the guest's TLB on INIT · df37ed38
      Sean Christopherson authored
      Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
      AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
      that's correct as the APM also states that the TLB is flushed on "External
      initialization of the processor."  Regardless, relying on the guest to be
      paranoid is unnecessarily risky, while an unnecessary flush is benign
      from a functional perspective and likely has no measurable impact on
      guest performance.
      
      Note, as of the April 2021 version of Intel's SDM, it also contradicts
      itself with respect to TLB flushing.  The overview of INIT explicitly
      calls out the TLBs as being invalidated, while a table later in the same
      section says they are unchanged.
      
        9.1 INITIALIZATION OVERVIEW:
          The major difference is that during an INIT, the internal caches, MSRs,
          MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
          are invalidated as with a hardware reset)
      
        Table 9-1:
      
        Register                    Power up    Reset      INIT
        Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged
      
      Given Core2's erratum[*] about global TLB entries not being flushed on INIT,
      it's safe to assume that the table is simply wrong.
      
        AZ28. INIT Does Not Clear Global Entries in the TLB
        Problem: INIT may not flush a TLB entry when:
          • The processor is in protected mode with paging enabled and the page global enable
            flag is set (PGE bit of CR4 register)
          • G bit for the page table entry is set
          • TLB entry is present in TLB when INIT occurs
          • Software may encounter unexpected page fault or incorrect address translation due
            to a TLB entry erroneously left in TLB after INIT.
      
        Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
                    bits PG or PE) registers before writing to memory early in BIOS
                    code to clear all the global entries from TLB.
      
        Status: For the steppings affected, see the Summary Tables of Changes.
      
      [*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
      
      Fixes: 6aa8b732 ("[PATCH] kvm: userspace interface")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df37ed38
    • KVM: x86: APICv: drop immediate APICv disablement on current vCPU · df63202f
      Maxim Levitsky authored
      The special case of disabling APICv on the current vCPU right away in
      kvm_request_apicv_update doesn't bring much benefit over raising
      KVM_REQ_APICV_UPDATE on it instead, since that request will be processed
      on the next entry to the guest (the comment about having another #VMEXIT
      is wrong).
      
      It also hides various assumptions that the APICv enable state matches
      the APICv inhibit state, as this special case only makes those states
      match on the current vCPU.
      
      Previous patches fixed a few such assumptions, so it should now be safe
      to drop this special case.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df63202f
    • KVM: x86: enable TDP MMU by default · 71ba3f31
      Paolo Bonzini authored
      With the addition of fast page fault support, the TDP-specific MMU has reached
      feature parity with the original MMU.  All my testing in the last few months
      has been done with the TDP MMU; switch the default on 64-bit machines.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      71ba3f31
    • KVM: x86/mmu: fast_page_fault support for the TDP MMU · 6e8eb206
      David Matlack authored
      Make fast_page_fault interoperate with the TDP MMU by leveraging
      walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
      introducing a new helper function kvm_tdp_mmu_fast_pf_get_last_sptep to
      grab the lowest level sptep.
      Suggested-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210713220957.3493520-5-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6e8eb206
    • KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU · c5c8c7c5
      David Matlack authored
      Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
      it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
      should not introduce any functional changes but is used in the following
      commit to make fast_page_fault interoperate with the TDP MMU.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210713220957.3493520-4-dmatlack@google.com>
      [Use if...else instead of if(){return;}]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5c8c7c5
    • KVM: x86/mmu: Fix use of enums in trace_fast_page_fault · 61bcd360
      David Matlack authored
      Enum values have to be exported to userspace since the formatting is not
      done in the kernel.  Without this, perf maps RET_PF_FIXED and
      RET_PF_SPURIOUS to 0, which results in incorrect output:
      
        $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
        $ perf script | head -1
         [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1
      
      Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.
      
      Fixes: c4371c2a ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210713220957.3493520-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61bcd360
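      The failure mode described in this commit can be illustrated with a toy
      userspace model (this is not kernel or perf code).  It assumes the parser
      resolves symbol names that appear in an event's print format through a
      table of exported names and treats unknown names as 0.  The names
      exported_enum, resolve_symbol and format_fixed_flag are hypothetical, and
      RET_PF_FIXED == 3 is inferred from the "ret==3" filter in the commit
      message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair, standing in for what TRACE_DEFINE_ENUM exports. */
struct exported_enum {
    const char *name;
    int value;
};

/* Resolve a symbol the way a userspace parser might: unknown names become 0. */
static int resolve_symbol(const struct exported_enum *table, size_t n,
                          const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].name, name) == 0)
            return table[i].value;
    }
    return 0; /* name was never exported, so its value is unknown */
}

/* Evaluate the "fixed" field of the print format: ret == RET_PF_FIXED. */
static int format_fixed_flag(const struct exported_enum *table, size_t n,
                             int ret)
{
    return ret == resolve_symbol(table, n, "RET_PF_FIXED");
}
```

      With an empty table, a fault with ret == 3 is reported as "fixed 0"; once
      the name is present in the table, the same fault is reported as "fixed 1".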
    • KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault · 76cd325e
      David Matlack authored
      fast_page_fault is only called from direct_page_fault where we know the
      address is a gpa.
      
      Fixes: 736c291c ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210713220957.3493520-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      76cd325e
    • KVM: Introduce kvm_get_kvm_safe() · 605c7130
      Peter Xu authored
      Introduce a safe version of kvm_get_kvm() that can be called even
      during VM destruction.  Use it in kvm_debugfs_open() and remove the verbose
      comment.  This prepares the helper for use elsewhere.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210625153214.43106-3-peterx@redhat.com>
      [Preserve the comment in kvm_debugfs_open. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      605c7130
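      As a rough userspace sketch of what a "safe" get can look like, the model
      below uses C11 atomics.  It assumes the safe variant follows the usual
      increment-unless-zero refcount pattern; the commit message does not spell
      out the implementation, and vm_model, vm_get, vm_get_safe and vm_put are
      hypothetical names rather than KVM APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object protected by a user refcount. */
struct vm_model {
    atomic_int users;
};

/* Unconditional get: only valid while the caller already holds a reference. */
static void vm_get(struct vm_model *vm)
{
    int old = atomic_fetch_add(&vm->users, 1);

    assert(old > 0); /* taking a reference on a dying object is a bug */
    (void)old;
}

/* Safe get: take a reference only if the count has not already hit zero. */
static bool vm_get_safe(struct vm_model *vm)
{
    int old = atomic_load(&vm->users);

    while (old != 0) {
        /* On failure, old is refreshed with the current count and we retry. */
        if (atomic_compare_exchange_weak(&vm->users, &old, old + 1))
            return true;
    }
    return false; /* destruction already started; the caller must back off */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool vm_put(struct vm_model *vm)
{
    return atomic_fetch_sub(&vm->users, 1) == 1;
}
```

      A caller such as a debugfs open handler would check the return value of
      the safe get and fail the open when it is false, instead of resurrecting
      an object whose teardown is already in progress.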
    • x86/kvm: remove non-x86 stuff from arch/x86/kvm/ioapic.h · 1694caef
      Juergen Gross authored
      The file was moved to arch/x86 a long time ago.  Time to get rid of
      the non-x86 stuff.
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Message-Id: <20210701154105.23215-3-jgross@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1694caef
    • KVM: X86: Add per-vm stat for max rmap list size · ec1cf69c
      Peter Xu authored
      Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
      for the vm.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210625153214.43106-2-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ec1cf69c
    • KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() · 7fa2a347
      Sean Christopherson authored
      Return the old SPTE when clearing a SPTE and push the "old SPTE present"
      check to the caller.  Private shadow page support will use the old SPTE
      in rmap_remove() to determine whether or not there is a linked private
      shadow page.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <b16bac1fd1357aaf39e425aab2177d3f89ee8318.1625186503.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7fa2a347
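      The shape of this refactoring can be sketched in plain C.  This is a
      simplified model, not the KVM code: SPTE_PRESENT_MODEL, spte_clear_model
      and zap_model are hypothetical, and the real function also tracks
      accessed/dirty state, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPTE_PRESENT_MODEL (1ULL << 0) /* hypothetical "present" bit */

static bool spte_present_model(uint64_t spte)
{
    return (spte & SPTE_PRESENT_MODEL) != 0;
}

/* Clear the entry and hand the previous value back to the caller. */
static uint64_t spte_clear_model(uint64_t *sptep)
{
    uint64_t old = *sptep;

    *sptep = 0;
    return old;
}

/* Caller-side pattern: the "was the old entry present?" check lives here. */
static bool zap_model(uint64_t *sptep, int *rmap_removals)
{
    uint64_t old;

    assert(sptep != NULL && rmap_removals != NULL);

    old = spte_clear_model(sptep);
    if (!spte_present_model(old))
        return false; /* nothing was mapped, so nothing to unlink */

    /* The old value is still available here for further inspection. */
    (*rmap_removals)++;
    return true;
}
```

      Returning the old value rather than a boolean lets the caller inspect
      other bits of the cleared entry, which is the property the commit says
      later private shadow page support relies on.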