1. 28 Jul, 2022 32 commits
    • KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT · 6c6ab524
      Sean Christopherson authored
      Treat the NX bit as valid when using NPT, as KVM will set the NX bit when
      the NX huge page mitigation is enabled (mindblowing) and trigger the WARN
      that fires on reserved SPTE bits being set.
      
      KVM has required NX support for SVM since commit b26a71a1 ("KVM: SVM:
      Refuse to load kvm_amd if NX support is not available") for exactly this
      reason, but apparently it never occurred to anyone to actually test NPT
      with the mitigation enabled.
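
      As a rough illustration of why the WARN below fires, this C sketch ANDs
      the SPTE with the reserved-bit mask, using the two values reported in the
      splat.  The helper and macro names are hypothetical and this is not KVM's
      make_spte() logic; the only assumption is that bit 63 is the NX bit.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Bit 63 of a page-table entry is the no-execute (NX) bit. */
      #define SPTE_NX_BIT (UINT64_C(1) << 63)

      /* Hypothetical helper: true if the SPTE sets any bit in the reserved mask. */
      static bool spte_has_rsvd_bits(uint64_t spte, uint64_t rsvd_mask)
      {
      	return (spte & rsvd_mask) != 0;
      }
      ```

      With spte = 0x800000018a600ee7 and rsvd bits = 0x800f0000001fe000, the
      only overlapping bit is bit 63, so the check passes once NX is no longer
      part of the reserved mask.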
      
        ------------[ cut here ]------------
        spte = 0x800000018a600ee7, level = 2, rsvd bits = 0x800f0000001fe000
        WARNING: CPU: 152 PID: 15966 at arch/x86/kvm/mmu/spte.c:215 make_spte+0x327/0x340 [kvm]
        Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
        RIP: 0010:make_spte+0x327/0x340 [kvm]
        Call Trace:
         <TASK>
         tdp_mmu_map_handle_target_level+0xc3/0x230 [kvm]
         kvm_tdp_mmu_map+0x343/0x3b0 [kvm]
         direct_page_fault+0x1ae/0x2a0 [kvm]
         kvm_tdp_page_fault+0x7d/0x90 [kvm]
         kvm_mmu_page_fault+0xfb/0x2e0 [kvm]
         npf_interception+0x55/0x90 [kvm_amd]
         svm_invoke_exit_handler+0x31/0xf0 [kvm_amd]
         svm_handle_exit+0xf6/0x1d0 [kvm_amd]
         vcpu_enter_guest+0xb6d/0xee0 [kvm]
         ? kvm_pmu_trigger_event+0x6d/0x230 [kvm]
         vcpu_run+0x65/0x2c0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x355/0x610 [kvm]
         kvm_vcpu_ioctl+0x551/0x610 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220723013029.1753623-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do not block APIC write for non ICR registers · 1bd9dfec
      Suravee Suthikulpanit authored
      Commit 5413bcba ("KVM: x86: Add support for vICR APIC-write
      VM-Exits in x2APIC mode") introduced logic to prevent APIC writes
      to offsets other than ICR in kvm_apic_write_nodecode().
      This breaks x2AVIC support, which requires KVM to trap and emulate
      x2APIC MSR writes.
      
      Therefore, remove the warning and modify the logic to allow the MSR writes.
      
      Fixes: 5413bcba ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
      Cc: Zeng Guang <guang.zeng@intel.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220725053356.4275-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not virtualize MSR accesses for APIC LVTT register · 0a8735a6
      Suravee Suthikulpanit authored
      AMD does not support APIC TSC-deadline timer mode. AVIC hardware
      will generate a #GP fault when the guest kernel writes 1 to bit 18
      of the APIC LVTT register (offset 0x32) to set the timer mode.
      (Note: bit 18 is reserved on AMD systems.)
      
      Therefore, always intercept and let KVM emulate the MSR accesses.
      
      Fixes: f3d7c8aa6882 ("KVM: SVM: Fix x2APIC MSRs interception")
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220725033428.3699-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Verify VMX MSRs can be restored to KVM-supported values · ce30d8b9
      Sean Christopherson authored
      Verify that KVM allows toggling VMX MSR bits to be "more" restrictive,
      and also allows restoring each MSR to KVM's original, less restrictive
      value.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Set UMIP bit CR4_FIXED1 MSR when emulating UMIP · a910b5ab
      Sean Christopherson authored
      Make UMIP an "allowed-1" bit in the CR4_FIXED1 MSR when KVM is emulating UMIP.
      KVM emulates UMIP for both L1 and L2, and so should enumerate that L2 is
      allowed to have CR4.UMIP=1.  Not setting the bit doesn't immediately
      break nVMX, as KVM does set/clear the bit in CR4_FIXED1 in response to a
      guest CPUID update, i.e. KVM will correctly (dis)allow nested VM-Entry
      based on whether or not UMIP is exposed to L1.  That said, KVM should
      enumerate the bit as being allowed from time zero, e.g. userspace will
      see the wrong value if the MSR is read before CPUID is written.
      
      Fixes: 0367f205 ("KVM: vmx: add support for emulating UMIP")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control" · 9389d577
      Paolo Bonzini authored
      This reverts commit 03a8871a.
      
      Since commit 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
      VM-{Entry,Exit} control"), KVM has taken ownership of the "load
      IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these
      bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID
      supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear
      otherwise.
      
      This was a misguided attempt at mimicking what commit 5f76f6f5
      ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
      2018-10-01) did for MPX.  However, that commit was a workaround for
      another KVM bug and not something that should be imitated.  Mucking with
      the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must
      ensure that any internal changes, e.g. to how KVM handles _any_ guest
      CPUID changes, yield the same functional result.  Therefore, KVM's policy
      is to let userspace have full control of the guest vCPU model so long
      as the host kernel is not at risk.
      
      Now that KVM really truly ensures kvm_set_msr() will succeed by loading
      PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and
      roundabout behavior.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [sean: make it a pure revert]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it exists · 4496a6f9
      Sean Christopherson authored
      Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and
      only if the MSR exists (according to the guest vCPU model).  KVM has very
      misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and
      attempts to force the nVMX MSR settings to match the vPMU model, i.e. to
      hide/expose the control based on whether or not the MSR exists from the
      guest's perspective.
      
      KVM's modifications fail to handle the scenario where the vPMU is hidden
      from the guest _after_ being exposed to the guest, e.g. by userspace
      doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any
      KVM_RUN.  nested_vmx_pmu_refresh() is called if and only if there's a
      recognized vPMU, i.e. KVM will leave the bits in the allow state and then
      ultimately reject the MSR load and WARN.
      
      KVM should not force the VMX MSRs in the first place.  KVM taking control
      of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5
      ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
      2018-10-01) did for MPX.  However, the MPX commit was a workaround for
      another KVM bug and not something that should be imitated (and it should
      never have been done in the first place).
      
      In other words, KVM's ABI _should_ be that userspace has full control
      over the MSRs, at which point triggering the WARN that loading the MSR
      must not fail is trivial.
      
      The intent of the WARN is still valid; KVM has consistency checks to
      ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid.  The
      problem is that '0' must be considered a valid value at all times, and so
      the simple/obvious solution is to just not actually load the MSR when it
      does not exist.  It is userspace's responsibility to provide a sane vCPU
      model, i.e. KVM is well within its ABI and Intel's VMX architecture to
      skip the loads if the MSR does not exist.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL · b663f0b5
      Sean Christopherson authored
      Add a helper to check if the guest PMU has PERF_GLOBAL_CTRL, which is
      unintuitive _and_ diverges from Intel's architecturally defined behavior.
      Even worse, KVM currently implements the check using two different (but
      equivalent) checks, _and_ there has been at least one attempt to add a
      _third_ flavor.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMU · 93255bf9
      Sean Christopherson authored
      Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits
      as reserved if there is no guest vPMU.  The nVMX VM-Entry consistency
      checks do not check for a valid vPMU prior to consuming the masks via
      kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask
      to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the
      actual MSR load will be rejected by intel_is_valid_msr()).
      
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled" · 8805875a
      Paolo Bonzini authored
      Since commit 5f76f6f5 ("KVM: nVMX: Do not expose MPX VMX controls
      when guest MPX disabled"), KVM has taken ownership of the "load
      IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls,
      trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS
      MSRs if the guest's CPUID supports MPX, and clear otherwise.
      
      The intent of the patch was to apply it to L0 in order to work around
      L1 kernels that lack the fix in commit 691bd434 ("kvm: vmx: allow
      host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the
      control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST,
      and the L1 bug is neutralized even in the absence of commit 691bd434.
      
      This was perhaps a sensible kludge at the time, but a horrible
      idea in the long term and in fact it has not been extended to
      other CPUID bits like these:
      
        X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE,
                          VMX_MISC_SAVE_EFER_LMA
      
        X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING,
                           SECONDARY_EXEC_TSC_SCALING
      
        X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID
      
        X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING
      
        X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA,
                                VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL
      
        X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES
      
      These days it's sort of common knowledge that any MSR in
      KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR
      to a default value, so it is unlikely that something like commit
      5f76f6f5 will be needed again.  So revert it, at the potential cost
      of breaking L1s with a 6 year old kernel.  While in principle the L0 owner
      doesn't control what runs on L1, such an old hypervisor would probably
      have many other bugs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value · f8ae08f9
      Sean Christopherson authored
      Restrict the nVMX MSRs based on KVM's config, not based on the guest's
      current config.  Using the guest's config to audit the new config
      prevents userspace from restoring the original config (KVM's config) if
      at any point in the past the guest's config was restricted in any way.
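
      A minimal sketch of the problem, assuming a simplified "allowed-1" style
      MSR in which a new value is acceptable only if it is a bitwise subset of
      some reference value.  The helper names and sample values are
      hypothetical and chosen for illustration; this is not KVM's actual audit
      code.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* True if every bit set in @subset (within @mask) is also set in @superset. */
      static bool is_bitwise_subset(uint64_t superset, uint64_t subset, uint64_t mask)
      {
      	superset &= mask;
      	subset &= mask;
      	return (superset | subset) == superset;
      }

      /*
       * Hypothetical audit: accept @new_val only if it does not enable anything
       * beyond @reference.  Which value is passed as @reference decides whether
       * a previously restricted MSR can be restored.
       */
      static bool allowed1_value_ok(uint64_t reference, uint64_t new_val)
      {
      	return is_bitwise_subset(reference, new_val, UINT64_MAX);
      }
      ```

      If KVM supports 0xf and userspace restricts the MSR to 0x3, auditing a
      later write of 0xf against the guest's current value (0x3) rejects it,
      whereas auditing against KVM's value (0xf) accepts it.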
      
      Fixes: 62cc6b9d ("KVM: nVMX: support restore of VMX capability MSRs")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}() · a645c2b5
      Sean Christopherson authored
      Rename the exit handlers for VMXON and VMXOFF to match the instruction
      names; the terms "vmon" and "vmoff" are not used anywhere in Intel's
      documentation, nor are they used elsewhere in KVM.
      
      Sadly, the exit reasons are exposed to userspace and so cannot be renamed
      without breaking userspace. :-(
      
      Fixes: ec378aee ("KVM: nVMX: Implement VMXON and VMXOFF")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4 · c7d855c2
      Sean Christopherson authored
      Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
      per the associated nested VMX MSRs' fixed0/1 settings.  KVM cannot rely
      on hardware to perform the checks, even for the few checks that have
      higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
      hardware while running the guest, (b) there may be incompatible CR0/CR4 bits
      that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
      have further restricted the allowed CR0/CR4 values by manipulating the
      guest's nested VMX MSRs.
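
      The fixed0/fixed1 rule can be sketched as follows, assuming the usual
      VMX convention that a bit set in FIXED0 must be 1 and a bit clear in
      FIXED1 must be 0.  The function name and the sample masks in the note
      after the code are illustrative only and are not taken from this patch.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /*
       * True if @val honors both constraints: every bit set in @fixed0 is set
       * in @val, and no bit outside @fixed1 is set in @val.
       */
      static bool fixed_bits_valid(uint64_t val, uint64_t fixed0, uint64_t fixed1)
      {
      	return ((val & fixed1) | fixed0) == val;
      }
      ```

      For example, with fixed0 = 0x2000 (CR4.VMXE required) and fixed1 = 0x2020
      (only VMXE and PAE allowed), a CR4 of 0x2000 or 0x2020 is accepted, while
      0x0 (VMXE missing) and 0x2800 (a disallowed bit set) are rejected, which
      is the case where #UD would be injected.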
      
      Note, despite a very strong desire to throw shade at Jim, commit
      70f3aac9 ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
      is not to blame for the buggy behavior (though the comment...).  That
      commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
      (though it did erroneously drop the CPL check, but that has already been
      remedied).  KVM may force CR0.PE=1, but will do so only when also
      forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033
      Fixes: ec378aee ("KVM: nVMX: Implement VMXON and VMXOFF")
      Reported-by: Eric Li <ercli@ucdavis.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Account for KVM reserved CR4 bits in consistency checks · ca58f3aa
      Sean Christopherson authored
      Check that the guest (L2) and host (L1) CR4 values that would be loaded
      by nested VM-Enter and VM-Exit respectively are valid with respect to
      KVM's (L0 host) allowed CR4 bits.  Failure to check KVM reserved bits
      would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or
      failed VM-Entry) by massaging guest CPUID to allow features that are not
      supported by KVM.  Amusingly, KVM itself is an accomplice in its doom, as
      KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for
      L2 based on L1's CPUID model.
      
      Note, although nested_{guest,host}_cr4_valid() are _currently_ used if
      and only if the vCPU is post-VMXON (nested.vmxon == true), that may not
      be true in the future, e.g. emulating VMXON has a bug where it doesn't
      check the allowed/required CR0/CR4 bits.
      
      Cc: stable@vger.kernel.org
      Fixes: 3899152c ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits · c33f6f22
      Sean Christopherson authored
      Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
      checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
      inner helper to vendor code in order to prevent nested VMX from calling
      back into vmx_is_valid_cr4() via kvm_is_valid_cr4().
      
      On SVM, this is a nop as SVM doesn't place any additional restrictions on
      CR4.
      
      On VMX, this is also currently a nop, but only because nested VMX is
      missing checks on reserved CR4 bits for nested VM-Enter.  That bug will
      be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
      but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
      restrictions on CR0/CR4.  The cleanest and most intuitive way to fix the
      VMXON bug is to use nested_host_cr{0,4}_valid().  If the CR4 variant
      routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
      the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
      restrictions if and only if the vCPU is post-VMXON.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add an option to run vCPUs while disabling dirty logging · cfe12e64
      Sean Christopherson authored
      Add a command line option to dirty_log_perf_test to run vCPUs for the
      entire duration of disabling dirty logging.  By default, the test stops
      running vCPUs before disabling dirty logging, which is faster but
      less interesting as it doesn't stress KVM's handling of contention
      between page faults and the zapping of collapsible SPTEs.  Enabling the
      flag also lets the user verify that KVM is indeed rebuilding zapped SPTEs
      as huge pages by checking KVM's pages_{1g,2m,4k} stats.  Without vCPUs to
      fault in the zapped SPTEs, the stats will show that KVM is zapping pages,
      but they never show whether or not KVM actually allows huge pages to be
      recreated.
      
      Note!  Enabling the flag can _significantly_ increase runtime, especially
      if the thread that's disabling dirty logging doesn't have a dedicated
      pCPU, e.g. if all pCPUs are used to run vCPUs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs · 85f44f8c
      Sean Christopherson authored
      When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf
      SPTE now that KVM doesn't require a PFN to compute the host mapping level,
      i.e. now that there's no need to first find a leaf SPTE and then step
      back up.
      
      Drop the now unused tdp_iter_step_up(), as it is not the safest of
      helpers (using any of the low level iterators requires some understanding
      of the various side effects).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level() · 65e3b446
      Sean Christopherson authored
      Add a comment to document how host_pfn_mapping_level() can be used safely,
      as the line between safe and dangerous is quite thin.  E.g. if KVM were
      to ever support in-place promotion to create huge pages, consuming the
      level is safe if the caller holds mmu_lock and checks that there's an
      existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
      non-leaf SPTE.
      
      Opportunistically tweak the existing comments to explicitly document why
      KVM needs to use READ_ONCE().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs · a8ac499b
      Sean Christopherson authored
      Drop the requirement that a pfn be backed by a refcounted, compound or
      ZONE_DEVICE struct page, and instead rely solely on the host page
      tables to identify huge pages.  The PageCompound() check is a remnant of
      an old implementation that identified (well, attempted to identify) huge
      pages without walking the host page tables.  The ZONE_DEVICE check was
      added as an exception to the PageCompound() requirement.  In other words,
      neither check is actually a hard requirement; if the primary MMU has a pfn
      backed with a huge page, then KVM can back the pfn with a huge page
      regardless of the backing store.
      
      Dropping the @pfn parameter will also allow KVM to query the max host
      mapping level without having to first get the pfn, which is advantageous
      for use outside of the page fault path where KVM wants to take action if
      and only if a page can be mapped huge, i.e. avoids the pfn lookup for
      gfns that can't be backed with a huge page.
      
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220715232107.3775620-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used · d5e90a69
      Sean Christopherson authored
      Restrict the mapping level for SPTEs based on the guest MTRRs if and only
      if KVM may actually use the guest MTRRs to compute the "real" memtype.
      For all forms of paging, guest MTRRs are purely virtual in the sense that
      they are completely ignored by hardware, i.e. they affect the memtype
      only if software manually consumes them.  The only scenario where KVM
      consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the
      guest has non-coherent DMA; in all other cases KVM simply leaves the PAT
      field in SPTEs as '0' to encode WB memtype.
      
      Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing
      pfn is host MMIO, but false positives are ok as they only cause a slight
      performance blip (unless the guest is doing weird things with its MTRRs,
      which is extremely unlikely).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add shadow mask for effective host MTRR memtype · 38bf9d7b
      Sean Christopherson authored
      Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
      instead of relying on TDP being enabled, as NPT doesn't need a non-zero
      mask.  This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
      for NPT anyways.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop unnecessary goto+label in kvm_arch_init() · 82ffad2d
      Sean Christopherson authored
      Return directly if kvm_arch_init() detects an error before doing any real
      work; jumping through a label obfuscates what's happening and carries the
      unnecessary risk of leaving 'r' uninitialized.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Reject loading KVM if host.PAT[0] != WB · 94bda2f4
      Sean Christopherson authored
      Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to
      writeback (WB) memtype.  KVM subtly relies on IA32_PAT entry '0' to be
      programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs
      as '0'.  If something other than WB is in PAT[0], at _best_ guests will
      suffer very poor performance, and at worst KVM will crash the system by
      breaking cache-coherency expectations (e.g. using WC for guest memory).
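
      The check described above could look roughly like this, assuming the
      architectural layout of IA32_PAT (eight one-byte entries, memory type in
      the low three bits of each) and the WB encoding of 6.  The macro and
      function names are hypothetical; this is a sketch, not the code added by
      this patch.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* The memory type of a PAT entry lives in the low 3 bits of its byte. */
      #define PAT_ENTRY_TYPE_MASK UINT64_C(0x7)
      /* Architectural encoding of the writeback (WB) memory type. */
      #define PAT_MEMTYPE_WB UINT64_C(6)

      /* Hypothetical helper: true if entry 0 of the given IA32_PAT value is WB. */
      static bool pat_entry0_is_wb(uint64_t pat)
      {
      	return (pat & PAT_ENTRY_TYPE_MASK) == PAT_MEMTYPE_WB;
      }
      ```

      The power-on default PAT value, 0x0007040600070406, passes this check; a
      value whose entry 0 is WC (encoding 1) does not.
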
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Fix x2APIC MSRs interception · 01e69cef
      Suravee Suthikulpanit authored
      The index for svm_direct_access_msrs was incorrectly initialized with
      the APIC MMIO register macros. Fix by introducing a macro for calculating
      x2APIC MSRs.
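
      For illustration, the relationship between an xAPIC MMIO register offset
      and its x2APIC MSR index can be sketched as below, assuming the
      architectural mapping (MSR base 0x800, one MSR per 16-byte MMIO
      register).  The helper name is hypothetical and this is not necessarily
      the macro the patch introduces.

      ```c
      #include <stdint.h>

      /* Base of the x2APIC MSR range. */
      #define APIC_BASE_MSR 0x800u
      /* xAPIC MMIO offset of the LVT Timer register. */
      #define APIC_LVTT 0x320u

      /* Hypothetical helper: map an xAPIC MMIO offset to its x2APIC MSR index. */
      static inline uint32_t x2apic_msr_from_mmio(uint32_t mmio_offset)
      {
      	/* MMIO registers are 16 bytes apart; the MSRs are consecutive. */
      	return APIC_BASE_MSR + (mmio_offset >> 4);
      }
      ```

      Indexing an MSR table with the raw MMIO offset (0x320 for LVTT) therefore
      selects a different entry than the MSR index (0x832).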
      
      Fixes: 5c127c85 ("KVM: SVM: Adding support for configuring x2APIC MSRs interception")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220718083833.222117-1-suravee.suthikulpanit@amd.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove underscores from __pte_list_remove() · 3c2e1037
      Sean Christopherson authored
      Remove the underscores from __pte_list_remove().  The function formerly
      known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show
      that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEs · 9202aee8
      Sean Christopherson authored
      Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte()
      and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap
      SPTEs and (b) to better document how they differ (remove vs. destroy does
      not exactly scream "one vs. all").
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapper · f8480721
      Sean Christopherson authored
      Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and
      __kvm_zap_rmap() respectively to show that what was the "unmap" helper is
      just a wrapper for the "zap" helper, i.e. that they do the exact same
      thing, one just exists to deal with its caller passing in more params.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclature · 2833eda0
      Sean Christopherson authored
      Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future
      confusion with a soon-to-be-introduced __kvm_zap_rmap().  Using a plural
      "rmaps" is somewhat ambiguous without additional context, as it's not
      obvious whether it's referring to multiple rmap lists, versus multiple
      rmap entries within a single list.
      
      Use kvm_rmap_zap_gfn_range() to align with the pattern established by
      kvm_rmap_zap_collapsible_sptes(), without losing the information that it
      zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop the "p is for pointer" from rmap helpers · aed02fe3
      Sean Christopherson authored
      Drop the trailing "p" from rmap helpers, i.e. rename functions to simply
      be kvm_<action>_rmap().  Declaring that a function takes a pointer is
      completely unnecessary and goes against kernel style.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Directly "destroy" PTE list when recycling rmaps · a42989e7
      Sean Christopherson authored
      Use pte_list_destroy() directly when recycling rmaps instead of bouncing
      through kvm_unmap_rmapp() and kvm_zap_rmapp().  Calling kvm_unmap_rmapp()
      is unnecessary and odd as it requires passing dummy parameters; passing
      NULL for @slot when __rmap_add() already has a valid slot is especially
      weird and confusing.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits() · 35d539c3
      Sean Christopherson authored
      Return a u64, not an int, from mmu_spte_clear_track_bits().  The return
      value is the old SPTE value, which is very much a 64-bit value.  The sole
      caller that consumes the return value, drop_spte(), already uses a u64.
      The only reason that truncating the SPTE value is not problematic is
      that drop_spte() only queries the shadow-present bit, which is in the
      lower 32 bits.
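
      A small sketch of the truncation hazard, using uint32_t rather than int
      so that the narrowing conversion is well defined in standard C.  The
      position of the "present" flag (bit 11) and all names are assumptions
      made for the example, not taken from this patch.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Assumption for illustration: a "present" flag that lives in bit 11. */
      #define DEMO_PRESENT_BIT (UINT64_C(1) << 11)

      /* Keep only the low 32 bits, as a 32-bit return type would. */
      static uint32_t demo_truncate_spte(uint64_t spte)
      {
      	return (uint32_t)spte;
      }

      /* A flag in the low 32 bits survives the truncation. */
      static bool demo_present_after_truncation(uint64_t spte)
      {
      	return (demo_truncate_spte(spte) & DEMO_PRESENT_BIT) != 0;
      }
      ```

      A query of a low bit still works after truncation, but anything in the
      upper half of the value, e.g. bit 63, is silently lost, which is why
      returning the full 64-bit value is the safer interface.
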
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715224226.3749507-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injection · da0b93d6
      Maciej S. Szmigiero authored
      enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy
      control fields from VMCB12 to the current VMCB, then
      nested_vmcb02_prepare_save() to perform a similar copy of the save area.
      
      This means that nested_vmcb02_prepare_control() still runs with the
      previous save area values in the current VMCB so it shouldn't take the L2
      guest CS.Base from this area.
      
      Explicitly pull CS.Base from the actual VMCB12 instead in
      enter_svm_guest_mode().
      
      Granted, a non-zero CS.Base is very rare (and even impossible in 64-bit
      mode), and having it change between nested VMRUNs is probably even rarer,
      but if it happens it would create a really subtle bug, so it's better to
      fix it up front.
      
      Fixes: 6ef88d6e ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Jul, 2022 1 commit
  3. 20 Jul, 2022 3 commits
  4. 19 Jul, 2022 4 commits