1. 19 Jan, 2022 3 commits
    • David Matlack's avatar
      KVM: x86/mmu: Document and enforce MMU-writable and Host-writable invariants · 5f16bcac
      David Matlack authored
      SPTEs are tagged with software-only bits to indicate if it is
      "MMU-writable" and "Host-writable". These bits are used to determine why
      KVM has marked an SPTE as read-only.
      
      Document these bits and their invariants, and enforce the invariants
      with new WARNs in spte_can_locklessly_be_made_writable() to ensure they
      are not accidentally violated in the future.
      
      Opportunistically move DEFAULT_SPTE_{MMU,HOST}_WRITABLE next to
      EPT_SPTE_{MMU,HOST}_WRITABLE since the new documentation applies to
      both.
      
      No functional change intended.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-4-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5f16bcac
    • David Matlack's avatar
      KVM: x86/mmu: Clear MMU-writable during changed_pte notifier · f082d86e
      David Matlack authored
      When handling the changed_pte notifier and the new PTE is read-only,
      clear both the Host-writable and MMU-writable bits in the SPTE. This
      preserves the invariant that MMU-writable is set if-and-only-if
      Host-writable is set.
      
      No functional change intended. Nothing currently relies on the
      aforementioned invariant and technically the changed_pte notifier is
      dead code.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-3-dmatlack@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f082d86e
    • David Matlack's avatar
      KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU · 7c8a4742
      David Matlack authored
      When the TDP MMU is write-protection GFNs for page table protection (as
      opposed to for dirty logging, or due to the HVA not being writable), it
      checks if the SPTE is already write-protected and if so skips modifying
      the SPTE and the TLB flush.
      
      This behavior is incorrect because it fails to check if the SPTE
      is write-protected for page table protection, i.e. fails to check
      that MMU-writable is '0'.  If the SPTE was write-protected for dirty
      logging but not page table protection, the SPTE could locklessly be made
      writable, and vCPUs could still be running with writable mappings cached
      in their TLB.
      
      Fix this by only skipping setting the SPTE if the SPTE is already
      write-protected *and* MMU-writable is already clear.  Technically,
      checking only MMU-writable would suffice; a SPTE cannot be writable
      without MMU-writable being set.  But check both to be paranoid and
      because it arguably yields more readable code.
      
      Fixes: 46044f72 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-2-dmatlack@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7c8a4742
  2. 17 Jan, 2022 6 commits
    • Like Xu's avatar
      KVM: x86: Making the module parameter of vPMU more common · 4732f244
      Like Xu authored
      The new module parameter to control PMU virtualization should apply
      to Intel as well as AMD, for situations where userspace is not trusted.
      If the module parameter allows PMU virtualization, there could be a
      new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
      PMU virtualization on a per-VM basis.
      
      If the module parameter does not allow PMU virtualization, there
      should be no userspace override, since we have no precedent for
      authorizing that kind of override. If it's false, other counter-based
      profiling features (such as LBR including the associated CPUID bits
      if any) will not be exposed.
      
      Change its name from "pmu" to "enable_pmu" as we have temporary
      variables with the same name in our code like "struct kvm_pmu *pmu".
      
      Fixes: b1d66dad ("KVM: x86/svm: Add module param to control PMU virtualization")
      Suggested-by : Jim Mattson <jmattson@google.com>
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220111073823.21885-1-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4732f244
    • Vitaly Kuznetsov's avatar
      KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN · ecebb966
      Vitaly Kuznetsov authored
      KVM forbids KVM_SET_CPUID2 after KVM_RUN was performed on a vCPU unless
      the supplied CPUID data is equal to what was previously set. Test this.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-5-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ecebb966
    • Vitaly Kuznetsov's avatar
      KVM: selftests: Rename 'get_cpuid_test' to 'cpuid_test' · 9e6d484f
      Vitaly Kuznetsov authored
      In preparation to reusing the existing 'get_cpuid_test' for testing
      "KVM_SET_CPUID{,2} after KVM_RUN" rename it to 'cpuid_test' to avoid
      the confusion.
      
      No functional change intended.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-4-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9e6d484f
    • Vitaly Kuznetsov's avatar
      KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN · c6617c61
      Vitaly Kuznetsov authored
      Commit feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      forbade changing CPUID altogether but unfortunately this is not fully
      compatible with existing VMMs. In particular, QEMU reuses vCPU fds for
      CPU hotplug after unplug and it calls KVM_SET_CPUID2. Instead of full ban,
      check whether the supplied CPUID data is equal to what was previously set.
      Reported-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-3-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      [Do not call kvm_find_cpuid_entry repeatedly. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c6617c61
    • Vitaly Kuznetsov's avatar
      KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries · ee3a5f9e
      Vitaly Kuznetsov authored
      kvm_update_cpuid_runtime() mangles CPUID data coming from userspace
      VMM after updating 'vcpu->arch.cpuid_entries', this makes it
      impossible to compare an update with what was previously
      supplied. Introduce __kvm_update_cpuid_runtime() version which can be
      used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
      so the upcoming update check can compare tweaked data.
      
      No functional change intended.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-2-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee3a5f9e
    • Like Xu's avatar
      KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event · a2186448
      Like Xu authored
      According to CPUID 0x0A.EBX bit vector, the event [7] should be the
      unrealized event "Topdown Slots" instead of the *kernel* generalized
      common hardware event "REF_CPU_CYCLES", so we need to skip the cpuid
      unavaliblity check in the intel_pmc_perf_hw_id() for the last
      REF_CPU_CYCLES event and update the confusing comment.
      
      If the event is marked as unavailable in the Intel guest CPUID
      0AH.EBX leaf, we need to avoid any perf_event creation, whether
      it's a gp or fixed counter. To distinguish whether it is a rejected
      event or an event that needs to be programmed with PERF_TYPE_RAW type,
      a new special returned value of "PERF_COUNT_HW_MAX + 1" is introduced.
      
      Fixes: 62079d8a ("KVM: PMU: add proper support for fixed counter 2")
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220105051509.69437-1-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a2186448
  3. 14 Jan, 2022 21 commits
  4. 07 Jan, 2022 10 commits
    • Jing Liu's avatar
      kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID · 445ecdf7
      Jing Liu authored
      KVM_GET_SUPPORTED_CPUID should not include any dynamic xstates in
      CPUID[0xD] if they have not been requested with prctl. Otherwise
      a process which directly passes KVM_GET_SUPPORTED_CPUID to
      KVM_SET_CPUID2 would now fail even if it doesn't intend to use a
      dynamically enabled feature. Userspace must know that prctl is
      required and allocate >4K xstate buffer before setting any dynamic
      bit.
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJing Liu <jing2.liu@intel.com>
      Signed-off-by: default avatarYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-5-yang.zhong@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      445ecdf7
    • Jing Liu's avatar
      kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule · cc04b6a2
      Jing Liu authored
      CPUID.0xD.1.EBX enumerates the size of the XSAVE area (in compacted
      format) required by XSAVES. If CPUID.0xD.i.ECX[1] is set for a state
      component (i), this state component should be located on the next
      64-bytes boundary following the preceding state component in the
      compacted layout.
      
      Fix xstate_required_size() to follow the alignment rule. AMX is the
      first state component with 64-bytes alignment to catch this bug.
      Signed-off-by: default avatarJing Liu <jing2.liu@intel.com>
      Signed-off-by: default avatarYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-4-yang.zhong@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cc04b6a2
    • Thomas Gleixner's avatar
      x86/fpu: Prepare guest FPU for dynamically enabled FPU features · 36487e62
      Thomas Gleixner authored
      To support dynamically enabled FPU features for guests prepare the guest
      pseudo FPU container to keep track of the currently enabled xfeatures and
      the guest permissions.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJing Liu <jing2.liu@intel.com>
      Signed-off-by: default avatarYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-3-yang.zhong@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      36487e62
    • Thomas Gleixner's avatar
      x86/fpu: Extend fpu_xstate_prctl() with guest permissions · 980fe2fd
      Thomas Gleixner authored
      KVM requires a clear separation of host user space and guest permissions
      for dynamic XSTATE components.
      
      Add a guest permissions member to struct fpu and a separate set of prctl()
      arguments: ARCH_GET_XCOMP_GUEST_PERM and ARCH_REQ_XCOMP_GUEST_PERM.
      
      The semantics are equivalent to the host user space permission control
      except for the following constraints:
      
        1) Permissions have to be requested before the first vCPU is created
      
        2) Permissions are frozen when the first vCPU is created to ensure
           consistency. Any attempt to expand permissions via the prctl() after
           that point is rejected.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJing Liu <jing2.liu@intel.com>
      Signed-off-by: default avatarYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-2-yang.zhong@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      980fe2fd
    • Michael Roth's avatar
      kvm: selftests: move ucall declarations into ucall_common.h · 96c1a628
      Michael Roth authored
      Now that core kvm_util declarations have special home in
      kvm_util_base.h, move ucall-related declarations out into a separate
      header.
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-Id: <20211210164620.11636-3-michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      96c1a628
    • Michael Roth's avatar
      kvm: selftests: move base kvm_util.h declarations to kvm_util_base.h · 7d9a662e
      Michael Roth authored
      Between helper macros and interfaces that will be introduced in
      subsequent patches, much of kvm_util.h would end up being declarations
      specific to ucall. Ideally these could be separated out into a separate
      header since they are not strictly required for writing guest tests and
      are mostly self-contained interfaces other than a reliance on a few
      core declarations like struct kvm_vm. This doesn't make a big
      difference as far as how tests will be compiled/written since all these
      interfaces will still be packaged up into a single/common libkvm.a used
      by all tests, but it is still nice to be able to compartmentalize to
      improve readabilty and reduce merge conflicts in the future for common
      tasks like adding new interfaces to kvm_util.h.
      
      Furthermore, some of the ucall declarations will be arch-specific,
      requiring various #ifdef'ery in kvm_util.h. Ideally these declarations
      could live in separate arch-specific headers, e.g.
      include/<arch>/ucall.h, which would handle arch-specific declarations
      as well as pulling in common ucall-related declarations shared by all
      archs.
      
      One simple way to do this would be to #include ucall.h at the bottom of
      kvm_util.h, after declarations it relies upon like struct kvm_vm.
      This is brittle however, and doesn't scale easily to other sets of
      interfaces that may be added in the future.
      
      Instead, move all declarations currently in kvm_util.h into
      kvm_util_base.h, then have kvm_util.h #include it. With this change,
      non-base declarations can be selectively moved/introduced into separate
      headers, which can then be included in kvm_util.h so that individual
      tests don't need to be touched. Subsequent patches will then move
      ucall-related declarations into a separate header to meet the above
      goals.
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-Id: <20211210164620.11636-2-michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7d9a662e
    • Michael Roth's avatar
      KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests · 405329fc
      Michael Roth authored
      Normally guests will set up CR3 themselves, but some guests, such as
      kselftests, and potentially CONFIG_PVH guests, rely on being booted
      with paging enabled and CR3 initialized to a pre-allocated page table.
      
      Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
      VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
      is too late, since it will have switched over to using the VMSA page
      prior to that point, with the VMSA CR3 copied from the VMCB initial
      CR3 value: 0.
      
      Address this by sync'ing the CR3 value into the VMCB save area
      immediately when KVM_SET_SREGS* is issued so it will find it's way into
      the initial VMSA.
      Suggested-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
      [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
       new hook. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      405329fc
    • Peter Zijlstra's avatar
      KVM: VMX: Provide vmread version using asm-goto-with-outputs · 907d1393
      Peter Zijlstra authored
      Use asm-goto-output for smaller fast path code.
      
      Message-Id: <YbcbbGW2GcMx6KpD@hirez.programming.kicks-ass.net>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      907d1393
    • David Woodhouse's avatar
      KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty · 55749769
      David Woodhouse authored
      When dirty ring logging is enabled, any dirty logging without an active
      vCPU context will cause a kernel oops. But we've already declared that
      the shared_info page doesn't get dirty tracking anyway, since it would
      be kind of insane to mark it dirty every time we deliver an event channel
      interrupt. Userspace is supposed to just assume it's always dirty any
      time a vCPU can run or event channels are routed.
      
      So stop using the generic kvm_write_wall_clock() and just write directly
      through the gfn_to_pfn_cache that we already have set up.
      
      We can make kvm_write_wall_clock() static in x86.c again now, but let's
      not remove the 'sec_hi_ofs' argument even though it's not used yet. At
      some point we *will* want to use that for KVM guests too.
      
      Fixes: 629b5348 ("KVM: x86/xen: update wallclock region")
      Reported-by: default avatarbutt3rflyh4ck <butterflyhuangxx@gmail.com>
      Signed-off-by: default avatarDavid Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      55749769
    • David Woodhouse's avatar
      KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery · 14243b38
      David Woodhouse authored
      This adds basic support for delivering 2 level event channels to a guest.
      
      Initially, it only supports delivery via the IRQ routing table, triggered
      by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
      function which will use the pre-mapped shared_info page if it already
      exists and is still valid, while the slow path through the irqfd_inject
      workqueue will remap the shared_info page if necessary.
      
      It sets the bits in the shared_info page but not the vcpu_info; that is
      deferred to __kvm_xen_has_interrupt() which raises the vector to the
      appropriate vCPU.
      
      Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
      Signed-off-by: default avatarDavid Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      14243b38