1. 24 Jun, 2021 40 commits
    • KVM: x86/mmu: Use MMU's roles to compute last non-leaf level · b67a93a8
      Sean Christopherson authored
      Use the MMU's role to get CR4.PSE when determining the last level at
      which the guest _cannot_ create a non-leaf PTE, i.e. cannot create a
      huge page.
      
      Note, the existing logic is arguably wrong when considering 5-level
      paging and the case where 1gb pages aren't supported.  In practice, the
      logic is confusing but not broken, because except for 32-bit non-PAE
      paging, bit 7 (_PAGE_PSE) is reserved when a huge page isn't supported at
      that level.  I.e. setting bit 7 will terminate the guest walk one way or
      another.  Furthermore, last_nonleaf_level is only consulted after KVM has
      verified there are no reserved bits set.
      
      All that confusion will be addressed in a future patch by dropping
      last_nonleaf_level entirely.  For now, massage the code to continue the
      march toward using mmu_role for (almost) all MMU computations.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-35-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU's role to compute PKRU bitmask · 2e4c0661
      Sean Christopherson authored
      Use the MMU's role to calculate the Protection Keys (Restrict Userspace)
      bitmask instead of pulling bits from current vCPU state.  For some flows,
      the vCPU state may not be correct (or relevant), e.g. EPT doesn't
      interact with PKRU.  Case in point, the "ept" param simply disappears.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-34-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU's role to compute permission bitmask · c596f147
      Sean Christopherson authored
      Use the MMU's role to generate the permission bitmasks for the MMU.
      For some flows, the vCPU state may not be correct (or relevant), e.g.
      the nested NPT MMU can be initialized with incoherent vCPU state.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-33-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop vCPU param from reserved bits calculator · b705a277
      Sean Christopherson authored
      Drop the vCPU param from __reset_rsvds_bits_mask() as it's now unused,
      and ideally will remain unused in the future.  Any information that's
      needed by the low level helper should be explicitly provided as it's used
      for both shadow/host MMUs and guest MMUs, i.e. vCPU state may be
      meaningless or simply wrong.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-32-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU's role to get CR4.PSE for computing rsvd bits · 4e9c0d80
      Sean Christopherson authored
      Use the MMU's role to get CR4.PSE when calculating reserved bits for the
      guest's PTEs.  Practically speaking, this is a glorified nop as the role
      always comes from vCPU state for the relevant flows, but converting to
      the roles will provide consistency once everything else is converted, and
      will Just Work if the "always comes from vCPU" behavior were ever to
      change (unlikely).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-31-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't grab CR4.PSE for calculating shadow reserved bits · 8c985b2d
      Sean Christopherson authored
      Unconditionally pass pse=false when calculating reserved bits for shadow
      PTEs.  CR4.PSE is only relevant for 32-bit non-PAE paging, which KVM does
      not use for shadow paging (including nested NPT).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Always set new mmu_role immediately after checking old role · 18db1b17
      Sean Christopherson authored
      Refactor shadow MMU initialization to immediately set its new mmu_role
      after verifying it differs from the old role, and so that all flavors
      of MMU initialization share the same check-and-set pattern.  Immediately
      setting the role will allow future commits to use mmu_role to configure
      the MMU without consuming stale state.
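
The check-and-set pattern described above can be sketched roughly as follows. This is a minimal illustration, not KVM's code: `union mmu_role`, `struct mmu_ctx`, `mmu_init_context()` and the `reconfig_count` field are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for KVM's role and MMU context. */
union mmu_role {
	uint64_t as_u64;
};

struct mmu_ctx {
	union mmu_role role;
	unsigned int reconfig_count;	/* how many times the MMU was reconfigured */
};

/*
 * Returns true if the context was reconfigured.  The new role is stored
 * immediately after the comparison, so every later configuration step can
 * read ctx->role instead of stale state.
 */
static bool mmu_init_context(struct mmu_ctx *ctx, union mmu_role new_role)
{
	if (new_role.as_u64 == ctx->role.as_u64)
		return false;	/* same role, nothing to do */

	ctx->role = new_role;

	/* Configuration that follows may now rely on ctx->role. */
	ctx->reconfig_count++;
	return true;
}
```

Storing the role before any other work means there is no window in which a helper could observe the old role while the MMU is being set up for the new one.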
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-29-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Set CR4.PKE/LA57 in MMU role iff long mode is active · 84c679f5
      Sean Christopherson authored
      Don't set cr4_pke or cr4_la57 in the MMU role if long mode isn't active,
      which is required for protection keys and 5-level paging to be fully
      enabled.  Ignoring the bit avoids unnecessary reconfiguration on reuse,
      and also means consumers of mmu_role don't need to manually check for
      long mode.
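
As a rough sketch of the idea, the two role bits can be gated on EFER.LMA when the role is computed. The bit positions are the architectural ones; `struct role_bits` and `calc_long_mode_bits()` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_LA57  (1ULL << 12)
#define CR4_PKE   (1ULL << 22)
#define EFER_LMA  (1ULL << 10)

/* Hypothetical subset of the role: only the two bits discussed here. */
struct role_bits {
	bool cr4_pke;
	bool cr4_la57;
};

/* Record PKE/LA57 only when long mode is active; otherwise leave them clear. */
static struct role_bits calc_long_mode_bits(uint64_t cr4, uint64_t efer)
{
	bool lma = (efer & EFER_LMA) != 0;
	struct role_bits role = {
		.cr4_pke  = lma && (cr4 & CR4_PKE) != 0,
		.cr4_la57 = lma && (cr4 & CR4_LA57) != 0,
	};

	return role;
}
```

With this gating, two register states that differ only in CR4.PKE or CR4.LA57 outside long mode produce the same role, so no reconfiguration is triggered.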
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-28-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not set paging-related bits in MMU role if CR0.PG=0 · ca8d664f
      Sean Christopherson authored
      Don't set CR0/CR4/EFER bits in the MMU role if paging is disabled; paging
      modifiers are irrelevant if there is no paging in the first place.
      Somewhat arbitrarily clear gpte_is_8_bytes for shadow paging if paging is
      disabled in the guest.  Again, there are no guest PTEs to process, so the
      size is meaningless.
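
A minimal sketch of the early-out, assuming a trimmed-down role with three paging modifiers. `struct paging_role` and `calc_paging_role()` are hypothetical; only the bit positions are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP    (1ULL << 16)
#define CR0_PG    (1ULL << 31)
#define CR4_SMEP  (1ULL << 20)
#define EFER_NX   (1ULL << 11)

/* Hypothetical subset of the paging-related role bits. */
struct paging_role {
	bool cr0_wp;
	bool cr4_smep;
	bool efer_nx;
};

static struct paging_role calc_paging_role(uint64_t cr0, uint64_t cr4,
					   uint64_t efer)
{
	struct paging_role role = { false, false, false };

	/* With paging disabled, the modifiers have no effect: leave them clear. */
	if (!(cr0 & CR0_PG))
		return role;

	role.cr0_wp   = (cr0 & CR0_WP) != 0;
	role.cr4_smep = (cr4 & CR4_SMEP) != 0;
	role.efer_nx  = (efer & EFER_NX) != 0;
	return role;
}
```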
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-27-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add accessors to query mmu_role bits · 60667724
      Sean Christopherson authored
      Add accessors via a builder macro for all mmu_role bits that track a CR0,
      CR4, or EFER bit, abstracting whether the bits are in the base or the
      extended role.
      
      Future commits will switch to using mmu_role instead of vCPU state to
      configure the MMU, i.e. there are about to be a large number of users.
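
One way such a builder macro could look is sketched below. The role layout, `struct mmu`, and the macro name `BUILD_ROLE_ACCESSOR` are assumptions made for illustration; the point is only that callers use `is_<reg>_<bit>()` without knowing whether the bit lives in the base or the extended role.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down role: some bits in "base", others in "ext". */
struct mmu_role {
	struct {
		unsigned int cr0_wp : 1;
		unsigned int efer_nx : 1;
	} base;
	struct {
		unsigned int cr4_smep : 1;
		unsigned int cr4_pke : 1;
	} ext;
};

struct mmu {
	struct mmu_role role;
};

/* Generate is_<reg>_<name>(), hiding whether the bit is in base or ext. */
#define BUILD_ROLE_ACCESSOR(base_or_ext, reg, name)			\
static inline bool is_##reg##_##name(const struct mmu *mmu)		\
{									\
	return mmu->role.base_or_ext.reg##_##name;			\
}

BUILD_ROLE_ACCESSOR(base, cr0, wp)
BUILD_ROLE_ACCESSOR(base, efer, nx)
BUILD_ROLE_ACCESSOR(ext, cr4, smep)
BUILD_ROLE_ACCESSOR(ext, cr4, pke)
```

A caller would then write `is_cr4_smep(mmu)` rather than reaching into `mmu->role.ext` directly, so moving a bit between the two halves of the role only touches the macro invocation.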
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-26-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans · 167f8a5c
      Sean Christopherson authored
      Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
      <reg>_<bit> for all CR0, CR4, and EFER bits that are included in the role.
      Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
      not the NX bit in the corresponding PTE.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU's role_regs, not vCPU state, to compute mmu_role · 8626c120
      Sean Christopherson authored
      Use the provided role_regs to calculate the mmu_role instead of pulling
      bits from current vCPU state.  For some flows, e.g. nested TDP, the vCPU
      state may not be correct (or relevant).
      
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-24-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Ignore CR0 and CR4 bits in nested EPT MMU role · cd6767c3
      Sean Christopherson authored
      Do not incorporate CR0/CR4 bits into the role for the nested EPT MMU, as
      EPT behavior is not influenced by CR0/CR4.  Note, this is the guest_mmu,
      (L1's EPT), not nested_mmu (L2's IA32 paging); the nested_mmu does need
      CR0/CR4, and is initialized in a separate flow.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Consolidate misc updates into shadow_mmu_init_context() · af098972
      Sean Christopherson authored
      Consolidate the MMU metadata update calls to deduplicate code, and to
      prep for future cleanup.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add struct and helpers to retrieve MMU role bits from regs · 594e91a1
      Sean Christopherson authored
      Introduce "struct kvm_mmu_role_regs" to hold the register state that is
      incorporated into the mmu_role.  For nested TDP, the register state that
      is factored into the MMU isn't vCPU state; the dedicated struct will be
      used to propagate the correct state throughout the flows without having
      to pass multiple params, and also provides helpers for the various flag
      accessors.
      
      Intentionally make the new helpers cumbersome/ugly by prepending four
      underscores.  In the not-too-distant future, it will be preferable to use
      the mmu_role to query bits as the mmu_role can drop irrelevant bits
      without creating contradictions, e.g. clearing CR4 bits when CR0.PG=0.
      Reserve the clean helper names (no underscores) for the mmu_role.
      
      Add a helper for vCPU conversion, which is the common case.
      
      No functional change intended.
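
A rough sketch of the shape described above: a small struct carrying CR0, CR4 and EFER, deliberately ugly four-underscore accessors, and a conversion helper for the common vCPU case. All names here (`struct role_regs`, `BUILD_REGS_ACCESSOR`, `struct vcpu`, `vcpu_to_role_regs()`) are illustrative assumptions, and the leading-underscore identifiers follow kernel habit rather than portable C naming rules.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG   (1UL << 31)
#define CR4_PAE  (1UL << 5)
#define EFER_NX  (1ULL << 11)

/* Register state that feeds the role; need not be current vCPU state. */
struct role_regs {
	unsigned long cr0;
	unsigned long cr4;
	uint64_t efer;
};

/* Four underscores: intentionally cumbersome, clean names are reserved. */
#define BUILD_REGS_ACCESSOR(reg, name, flag)				\
static inline bool ____is_##reg##_##name(const struct role_regs *regs)	\
{									\
	return (regs->reg & (flag)) != 0;				\
}

BUILD_REGS_ACCESSOR(cr0, pg, CR0_PG)
BUILD_REGS_ACCESSOR(cr4, pae, CR4_PAE)
BUILD_REGS_ACCESSOR(efer, nx, EFER_NX)

/* Hypothetical vCPU holding only the three registers of interest. */
struct vcpu {
	unsigned long cr0;
	unsigned long cr4;
	uint64_t efer;
};

/* Common case: the role registers are simply the vCPU's registers. */
static struct role_regs vcpu_to_role_regs(const struct vcpu *vcpu)
{
	struct role_regs regs = {
		.cr0 = vcpu->cr0,
		.cr4 = vcpu->cr4,
		.efer = vcpu->efer,
	};

	return regs;
}
```

A nested TDP flow would fill `struct role_regs` from L1's saved state instead of calling the vCPU helper, and everything downstream would take the struct rather than a vCPU pointer.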
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Grab shadow root level from mmu_role for shadow MMUs · d555f705
      Sean Christopherson authored
      Use the mmu_role to initialize shadow root level instead of assuming the
      level of KVM's shadow root (host) is the same as that of the guest root
      (or is PAE for 32-bit non-PAE paging, where KVM forces PAE paging).
      For nested NPT, the shadow root level cannot be adapted to L1's NPT root
      level and is instead always the TDP root level because NPT uses the
      current host CR0/CR4/EFER, e.g. 64-bit KVM can't drop into 32-bit PAE to
      shadow L1's NPT.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU proper · 16be1d12
      Sean Christopherson authored
      Move nested NPT's invocation of reset_shadow_zero_bits_mask() into the
      MMU proper and unexport said function.  Aside from dropping an export,
      this is a baby step toward eliminating the call entirely by fixing the
      shadow_root_level confusion.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper · 20f632bd
      Sean Christopherson authored
      Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
      a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
      as the CR0/CR4 update masks must exactly match the mmu_role bits, with
      one exception (see below).  The "full" CR0/CR4 will be used by future
      commits to initialize the MMU and its role, as opposed to the current
      approach of pulling everything from vCPU, which is incorrect for certain
      flows, e.g. nested NPT.
      
      CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
      but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
      needs to be in the mmu_role, but technically doesn't need to be checked
      by kvm_post_set_cr4().  However, the extra check is completely benign as
      the hardware restrictions simply mean LA57 will never be _the_ cause of
      a MMU reset during MOV CR4.
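
The relationship between the update masks and the role bits can be sketched as below. The bit positions are architectural, but the mask names and the exact set of bits in each mask are assumptions for illustration, as are the two helper functions.

```c
#include <stdbool.h>

#define CR0_WP    (1UL << 16)
#define CR0_PG    (1UL << 31)
#define CR4_PSE   (1UL << 4)
#define CR4_PAE   (1UL << 5)
#define CR4_LA57  (1UL << 12)
#define CR4_SMEP  (1UL << 20)
#define CR4_SMAP  (1UL << 21)
#define CR4_PKE   (1UL << 22)

/* Illustrative masks; the precise membership is an assumption. */
#define MMU_CR0_ROLE_BITS (CR0_PG | CR0_WP)
#define MMU_CR4_ROLE_BITS (CR4_PSE | CR4_PAE | CR4_LA57 | \
			   CR4_SMEP | CR4_SMAP | CR4_PKE)

/* A write needs an MMU reset only if it flips a bit the role tracks. */
static bool cr0_write_needs_mmu_reset(unsigned long old_cr0,
				      unsigned long new_cr0)
{
	return ((old_cr0 ^ new_cr0) & MMU_CR0_ROLE_BITS) != 0;
}

static bool cr4_write_needs_mmu_reset(unsigned long old_cr4,
				      unsigned long new_cr4)
{
	return ((old_cr4 ^ new_cr4) & MMU_CR4_ROLE_BITS) != 0;
}
```

Keeping LA57 in the CR4 mask is harmless in this sketch for the reason given above: if MOV CR4 cannot toggle the bit while long mode is active, the XOR never has it set on that path.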
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop smep_andnot_wp check from "uses NX" for shadow MMUs · 18feaad3
      Sean Christopherson authored
      Drop the smep_andnot_wp role check from the "uses NX" calculation now
      that all non-nested shadow MMUs treat NX as used via the !TDP check.
      
      The shadow MMU for nested NPT, which shares the helper, does not need to
      deal with SMEP (or WP) as NPT walks are always "user" accesses and WP is
      explicitly noted as being ignored:
      
        Table walks for guest page tables are always treated as user writes at
        the nested page table level.
      
        A table walk for the guest page itself is always treated as a user
        access at the nested page table level
      
        The host hCR0.WP bit is ignored under nested paging.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state · 31e96bc6
      Sean Christopherson authored
      Add a comment in the nested NPT initialization flow to call out that it
      intentionally uses vmcb01 instead of current vCPU state to get the effective
      hCR4 and hEFER for L1's NPT context.
      
      Note, despite nSVM's efforts to handle the case where vCPU state doesn't
      reflect L1 state, the MMU may still do the wrong thing due to pulling
      state from the vCPU instead of the passed in CR0/CR4/EFER values.  This
      will be addressed in future commits.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER · dbc4739b
      Sean Christopherson authored
      When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
      as a u64 in various flows (mostly MMU).  Passing the params as u32s is
      functionally ok since all of the affected registers reserve bits 63:32 to
      zero (enforced by KVM), but it's technically wrong.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename unsync helper and update related comments · 0337f585
      Sean Christopherson authored
      Rename mmu_need_write_protect() to mmu_try_to_unsync_pages() and update
      a variety of related, stale comments.  Add several new comments to call
      out subtle details, e.g. that upper-level shadow pages are write-tracked,
      and that can_unsync is false iff KVM is in the process of synchronizing
      pages.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page() · 479a1efc
      Sean Christopherson authored
      Move the kvm_unlink_unsync_page() call out of kvm_sync_page() and into
      its sole caller, and fold __kvm_sync_page() into kvm_sync_page() since
      the latter becomes a pure pass-through.  There really should be no reason
      for code to do a complete sync of a shadow page outside of the full
      kvm_mmu_sync_roots(), e.g. the one use case that crept in turned out to
      be flawed and counter-productive.
      
      Drop the stale comment about @sp->gfn needing to be write-protected, as
      it directly contradicts the kvm_mmu_get_page() usage.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: comment on kvm_mmu_get_page's syncing of pages · 07dc4f35
      Sean Christopherson authored
      Explain the usage of sync_page() in kvm_mmu_get_page(), which is
      subtle in how and why it differs from mmu_sync_children().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      [Split out of a different patch by Sean. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatches · 2640b086
      Sean Christopherson authored
      When synchronizing a shadow page, WARN and zap the page if its mmu role
      isn't compatible with the current MMU context, where "compatible" is an
      exact match sans the bits that have no meaning in the overall MMU context
      or will be explicitly overwritten during the sync.  Many of the helpers
      used by sync_page() are specific to the current context; updating an SMM
      vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2
      PTEs might work but would be extremely bizarre, and so on and so forth.
      
      Drop the guard with respect to 8-byte vs. 4-byte PTEs in
      __kvm_sync_page(), it was made useless when kvm_mmu_get_page() stopped
      trying to sync shadow pages irrespective of the current MMU context.
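
The "exact match sans ignored bits" comparison can be sketched as an XOR under a mask. The role layout and the choice of ignored bits below are hypothetical; KVM's real role has more fields and its own ignore set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packed role layout, for illustration only. */
#define ROLE_LEVEL_MASK     0x0fu		/* bits 3:0 */
#define ROLE_ACCESS_MASK    (0x7u << 4)		/* bits 6:4 */
#define ROLE_GPTE_8_BYTES   (1u << 7)
#define ROLE_SMM            (1u << 8)
#define ROLE_GUEST_MODE     (1u << 9)

/*
 * Bits that carry no meaning for the overall MMU context or that are
 * overwritten during the sync; which bits belong here is an assumption.
 */
#define ROLE_SYNC_IGNORED_BITS (ROLE_LEVEL_MASK | ROLE_ACCESS_MASK)

/* Compatible means identical once the ignored bits are masked off. */
static bool role_compatible_for_sync(uint32_t sp_role, uint32_t mmu_role)
{
	return ((sp_role ^ mmu_role) & ~ROLE_SYNC_IGNORED_BITS) == 0;
}
```

A caller in the spirit of the commit would warn and zap the shadow page when this returns false, rather than attempt the sync with helpers that belong to a different context.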
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU role to check for matching guest page sizes · 00a66978
      Sean Christopherson authored
      Originally, __kvm_sync_page used to check the cr4_pae bit in the role
      to avoid zapping 4-byte kvm_mmu_pages when the guest page size is 8-byte
      or the other way round.  However, in commit 47c42e6b ("KVM: x86: fix
      handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it
      was observed that this did not work for nested EPT, where the page table
      size would be 8 bytes even if CR4.PAE=0.  (Note that the check still
      has to be done for nested *NPT*, so it is not possible to use tdp_enabled
      or similar).
      
      Therefore, a hack was introduced to identify nested EPT shadow pages
      and unconditionally call __kvm_sync_page() on them.  However, it is
      possible to do without the hack to identify nested EPT shadow pages:
      if EPT is active, there will be no shadow pages in non-EPT format,
      and all of them will have gpte_is_8_bytes set to true; we can just
      check the MMU role directly, and the test will always be true.
      
      Even for non-EPT shadow MMUs, this test should really always be true
      now that __kvm_sync_page() is called if and only if the role is an
      exact match (kvm_mmu_get_page()) or is part of the current MMU context
      (kvm_mmu_sync_roots()).  A future commit will convert the likely-pointless
      check into a meaningful WARN to enforce that the mmu_roles of the current
      context and the shadow page are compatible.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFN · ddc16abb
      Sean Christopherson authored
      When creating a new upper-level shadow page, zap unsync shadow pages at
      the same target gfn instead of attempting to sync the pages.  This fixes
      a bug where an unsync shadow page could be sync'd with an incompatible
      context, e.g. wrong smm, is_guest, etc... flags.  In practice, the bug is
      relatively benign as sync_page() is all but guaranteed to fail its check
      that the guest's desired gfn (for the to-be-sync'd page) matches the
      current gfn associated with the shadow page.  I.e. kvm_sync_page() would
      end up zapping the page anyways.
      
      Alternatively, __kvm_sync_page() could be modified to explicitly verify
      the mmu_role of the unsync shadow page is compatible with the current MMU
      context.  But, except for this specific case, __kvm_sync_page() is called
      iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
      requires an exact role match, and the call from kvm_mmu_sync_roots() is
      only synchronizing shadow pages from the current MMU (which better be
      compatible or KVM has problems).  And as described above, attempting to
      sync shadow pages when creating an upper-level shadow page is unlikely
      to succeed, e.g. zero successful syncs were observed when running Linux
      guests despite over a million attempts.
      
      Fixes: 9f1a122f ("KVM: MMU: allow more page become unsync at getting sp time")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-10-seanjc@google.com>
      [Remove WARN_ON after __kvm_sync_page. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: record maximum physical address width in kvm_mmu_extended_role" · 6c032f12
      Sean Christopherson authored
      Drop MAXPHYADDR from mmu_role now that all MMUs have their role
      invalidated after a CPUID update.  Invalidating the role forces all MMUs
      to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can
      be changed only through a CPUID update.
      
      This reverts commit de3ccd26.
      
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Sean Christopherson authored
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
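
The sentinel approach can be sketched as follows, assuming a hypothetical `struct vcpu` that tracks only `last_vmentry_cpu`. The helper names are illustrative; the sketch only decides whether a warning is due, and the CPUID update itself would still proceed.

```c
#include <stdbool.h>

/* Hypothetical vCPU: only the field used for "has it ever run" detection. */
struct vcpu {
	int last_vmentry_cpu;
};

/* -1 is never a valid CPU number, so it marks "never entered the guest". */
static void vcpu_create(struct vcpu *v)
{
	v->last_vmentry_cpu = -1;
}

/* Record the physical CPU on each entry into the guest. */
static void vcpu_run(struct vcpu *v, int cpu)
{
	v->last_vmentry_cpu = cpu;
}

/* True if userspace changes CPUID after the vCPU has run at least once. */
static bool set_cpuid_should_warn(const struct vcpu *v)
{
	return v->last_vmentry_cpu != -1;
}
```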
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
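
To put the "2 bytes for every 4kb" figure in perspective, the extra cost can be computed directly. The function name is hypothetical, and the calculation assumes one tracking entry per 4 KiB guest page, as the paragraph above implies.

```c
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL

/*
 * Extra bytes needed if each per-page tracking entry grows from a u16 to
 * a u32, i.e. by 2 bytes per 4 KiB page of guest memory.
 */
static uint64_t gfn_track_extra_bytes(uint64_t guest_mem_bytes)
{
	uint64_t pages = guest_mem_bytes / PAGE_SIZE_4K;

	return pages * (sizeof(uint32_t) - sizeof(uint16_t));
}
```

Under that assumption a 1 GiB guest has 262144 pages and would pay an extra 512 KiB, and a 64 GiB guest an extra 32 MiB.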
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified · 49c6f875
      Sean Christopherson authored
      Invalidate all MMUs' roles after a CPUID update to force reinitialization
      of the MMU context/helpers.  Despite the efforts of commit de3ccd26
      ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
      there are still a handful of CPUID-based properties that affect MMU
      behavior but are not incorporated into mmu_role.  E.g. 1gb hugepage
      support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
      factor into the guest's reserved PTE bits.
      
      The obvious alternative would be to add all such properties to mmu_role,
      but doing so provides no benefit over simply forcing a reinitialization
      on every CPUID update, as setting guest CPUID is a rare operation.
      
      Note, reinitializing all MMUs after a CPUID update does not fix all of
      KVM's woes.  Specifically, kvm_mmu_page_role doesn't track the CPUID
      properties, which means that a vCPU can reuse shadow pages that should
      not exist for the new vCPU model, e.g. that map GPAs that are now illegal
      (due to MAXPHYADDR changes) or that set bits that are now reserved
      (PAGE_SIZE for 1gb pages), etc...
      
      Tracking the relevant CPUID properties in kvm_mmu_page_role would address
      the majority of problems, but fully tracking that much state in the
      shadow page role comes with an unpalatable cost as it would require a
      non-trivial increase in KVM's memory footprint.  The GBPAGES case is even
      worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
      support in the hardware page walker, i.e. it's a virtualization hole that
      can't be closed when using TDP.
      
      In other words, resetting the MMU after a CPUID update is largely a
      superficial fix.  But, it will allow reverting the tracking of MAXPHYADDR
      in the mmu_role, and that case in particular needs to mostly work because
      KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
      is supported.  For cases where KVM botches guest behavior, the damage is
      limited to that guest.  But for the shadow_root_level, a misconfigured
      MMU can cause KVM to incorrectly access memory, e.g. due to walking off
      the end of its shadow page tables.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack" · f71a53d1
      Sean Christopherson authored
      Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
      virtualization.  When KVM (L0) is using TDP, CR4.LA57 is not reflected in
      mmu_role.base.level because that tracks the shadow root level, i.e. TDP
      level.  Normally, this is not an issue because LA57 can't be toggled
      while long mode is active, i.e. the guest has to first disable paging,
      then toggle LA57, then re-enable paging, thus ensuring an MMU
      reinitialization.
      
      But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
      without having to bounce through an unpaged section.  L1 can also load a
      new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a
      single entry PML5 is sufficient.  Such shenanigans are only problematic
      if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
      reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.
      
      Note, in the L2 case with nested TDP, even though L1 can switch between
      L2s with different LA57 settings, thus bypassing the paging requirement,
      in that case KVM's nested_mmu will track LA57 in base.level.
      
      This reverts commit 8053f924.
      
      Fixes: 8053f924 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walk · ef318b9e
      Sean Christopherson authored
      Use the MMU's role to get its effective SMEP value when injecting a fault
      into the guest.  When walking L1's (nested) NPT while L2 is active, vCPU
      state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0,
      CR4, EFER, etc...  If L1 and L2 have different settings for SMEP and
      L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH
      when injecting #NPF.
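
A simplified sketch of why the source of the SMEP value matters: the fetch bit in the error code is reported only if NX or SMEP is in effect for the MMU doing the walk. `struct mmu_role_bits` and `fault_error_code()` are hypothetical; only the position of the fetch bit (bit 4) is architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define PFERR_FETCH (1u << 4)	/* instruction-fetch bit of the error code */

/* Hypothetical role subset: the two bits that gate the fetch indication. */
struct mmu_role_bits {
	bool efer_nx;
	bool cr4_smep;
};

/* Build the fetch portion of the error code from the walking MMU's role. */
static uint32_t fault_error_code(const struct mmu_role_bits *role,
				 bool fetch_fault)
{
	uint32_t errcode = 0;

	if (fetch_fault && (role->efer_nx || role->cr4_smep))
		errcode |= PFERR_FETCH;

	return errcode;
}
```

If the role passed in reflected L2's settings (say SMEP clear) while L1 has SMEP set and NX clear, the same fetch fault would yield an error code without the fetch bit, which is the mismatch the commit describes.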
      
      Fixes: e57d4a35 ("KVM: Add instruction fetch checking when walking guest page table")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ef318b9e
    • Sean Christopherson's avatar
      KVM: x86: Properly reset MMU context at vCPU RESET/INIT · 0aa18375
      Sean Christopherson authored
      Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
      was set prior to INIT.  Simply re-initializing the current MMU is not
      sufficient as the current root HPA may not be usable in the new context.
      E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
      KVM will fail to switch to the 32-bit pae_root and bomb on the next
      VM-Enter due to running with a 64-bit CR3 in 32-bit mode.
      
      This bug was papered over in both VMX and SVM, but still managed to rear
      its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
      kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
      checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
      an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
      generate the wrong MMU role.
      
      In VMX, the INIT issue is specific to running without unrestricted guest
      since unrestricted guest is available if and only if EPT is enabled.
      Commit 8668a3c4 ("KVM: VMX: Reset mmu context when entering real
      mode") resolved the issue by forcing a reset when entering emulated real
      mode.
      
      In SVM, commit ebae871a ("kvm: svm: reset mmu on VCPU reset") forced
      an MMU reset on every INIT to work around the flaw in common x86.  Note, at
      the time the bug was fixed, the SVM problem was exacerbated by a complete
      lack of a CR4 update.
      
      The vendor resets will be reverted in future patches, primarily to aid
      bisection in case there are non-INIT flows that rely on the existing VMX
      logic.
      
      Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
      all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
      CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
      context reset.
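
The check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the patch itself: `reset_needs_mmu_reset()` is a hypothetical helper name, and the only architectural fact used is that CR0.PG is bit 31 of CR0.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR0.PG (paging enable) is bit 31 of CR0. */
#define X86_CR0_PG (UINT64_C(1) << 31)

/*
 * Hypothetical helper: INIT/RESET always leaves CR0.PG clear, and CR0.WP
 * plus the CR4/EFER paging bits are ignored while CR0.PG=0, so the old
 * value of CR0.PG alone tells whether the paging context changed.
 */
static bool reset_needs_mmu_reset(uint64_t old_cr0)
{
	return (old_cr0 & X86_CR0_PG) != 0;
}
```

A caller would sample CR0 before applying the INIT/RESET state and reset the MMU context afterwards only if this returns true.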
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0aa18375
    • Sean Christopherson's avatar
      KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs · 112022bd
      Sean Christopherson authored
      Mark NX as being used for all non-nested shadow MMUs, as KVM will set the
      NX bit for huge SPTEs if the iTLB multi-hit mitigation is enabled.
      Checking the mitigation itself is not sufficient as it can be toggled on
      at any time and KVM doesn't reset MMU contexts when that happens.  KVM
      could reset the contexts, but that would require purging all SPTEs in all
      MMUs, for no real benefit.  And, KVM already forces EFER.NX=1 when TDP is
      disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved
      for shadow MMUs.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      112022bd
    • Sean Christopherson's avatar
      KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested EPT · f0d43790
      Sean Christopherson authored
      Remove a misguided WARN that attempts to detect the scenario where using
      a special A/D tracking flag will set reserved bits on a non-MMIO spte.
      The WARN triggers false positives when using EPT with 32-bit KVM because
      of the !64-bit clause, which is just flat out wrong.  The whole A/D
      tracking goo is specific to EPT, and one of the big selling points of EPT
      is that EPT is decoupled from the host's native paging mode.
      
      Drop the WARN instead of trying to salvage the check.  Keeping a check
      specific to A/D tracking bits would essentially regurgitate the same code
      that led to KVM needing the tracking bits in the first place.
      
      A better approach would be to add a generic WARN on reserved bits being
      set, which would naturally cover the A/D tracking bits, work for all
      flavors of paging, and be self-documenting to some extent.
      
      Fixes: 8a406c89 ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f0d43790
    • Jing Zhang's avatar
      KVM: debugfs: Reuse binary stats descriptors · bc9e9e67
      Jing Zhang authored
      To remove code duplication, use the binary stats descriptors in the
      implementation of the debugfs interface for statistics. This unifies
      the definition of statistics for the binary and debugfs interfaces.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bc9e9e67
    • Jing Zhang's avatar
      KVM: selftests: Add selftest for KVM statistics data binary interface · 0b45d587
      Jing Zhang authored
      Add selftest to check KVM stats descriptors validity.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-7-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0b45d587
    • Jing Zhang's avatar
      KVM: stats: Add documentation for binary statistics interface · fdc09ddd
      Jing Zhang authored
      This new API provides a file descriptor for every VM and VCPU to read
      KVM statistics data in binary format.
      It is meant to provide a lightweight, flexible, scalable and efficient
      lock-free solution for user space telemetry applications to pull the
      statistics data periodically for large scale systems. The pulling
      frequency could be as high as a few times per second.
      The statistics descriptors are defined by KVM in the kernel and can be
      used by userspace to discover VM/VCPU statistics during the one-time
      setup stage.
      The statistics data itself could be read out by userspace telemetry
      periodically without any extra parsing or setup effort.
      There are a few existing interface protocols and definitions, but
      none of them fulfils all of the requirements that this interface
      implements, listed below:
      1. During high frequency periodic stats reading, there should be no
         extra efforts except the stats data read itself.
      2. Support stats annotation, like type (cumulative, instantaneous,
         peak, histogram, etc) and unit (counter, time, size, cycles, etc).
      3. The stats data reading should be free of locks/synchronization. We
         don't care about consistency between all the stats data, since all
         stats data cannot be read out at exactly the same time anyway. What
         we really care about is the change or trend of the stats data. The
         lock-free solution is not just for efficiency and scalability, but
         also for stats data accuracy and usability. For example, if all
         stats data reads were protected by a global lock and one VCPU died
         with that lock held, then all stats data reads would be blocked
         and there would be no way to tell from the stats data which VCPU
         has died.
      4. The stats data reading workload can be handed over to another
         unprivileged process.
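
The split between a one-time descriptor lookup and cheap periodic reads (requirement 1) can be sketched as below. Everything here is an assumption made for illustration: `struct stats_header`, `struct stats_desc`, `find_stat()` and `read_stat()` are hypothetical and deliberately do not reproduce the KVM uapi structures or ioctls; the sketch only parses an in-memory blob laid out as header, descriptors, then data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration only; NOT the KVM uapi structures. */
struct stats_header {
	uint32_t num_desc;     /* number of descriptors */
	uint32_t desc_offset;  /* byte offset of the descriptor array */
	uint32_t data_offset;  /* byte offset of the data block */
};

struct stats_desc {
	uint32_t offset;       /* byte offset of this stat in the data block */
	char name[28];         /* NUL-terminated stat name */
};

/* One-time setup: return the absolute byte offset of a stat, or -1. */
static long find_stat(const uint8_t *blob, size_t len, const char *name)
{
	struct stats_header hdr;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, blob, sizeof(hdr));

	for (uint32_t i = 0; i < hdr.num_desc; i++) {
		struct stats_desc d;
		size_t pos = hdr.desc_offset + (size_t)i * sizeof(d);

		if (pos + sizeof(d) > len)
			return -1;
		memcpy(&d, blob + pos, sizeof(d));
		d.name[sizeof(d.name) - 1] = '\0';

		if (strcmp(d.name, name) == 0) {
			size_t abs = (size_t)hdr.data_offset + d.offset;

			return abs + sizeof(uint64_t) <= len ? (long)abs : -1;
		}
	}
	return -1;
}

/* Periodic read: a single fixed-offset load, no parsing and no lock. */
static uint64_t read_stat(const uint8_t *blob, long abs_offset)
{
	uint64_t value;

	memcpy(&value, blob + abs_offset, sizeof(value));
	return value;
}
```

In this model a telemetry process would call `find_stat()` once per statistic and then only `read_stat()` on each polling cycle, which is the property the requirements list asks for.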
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-6-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdc09ddd
    • Jing Zhang's avatar
      KVM: stats: Support binary stats retrieval for a VCPU · ce55c049
      Jing Zhang authored
      Add a VCPU ioctl that returns a statistics file descriptor, which
      userspace can read to obtain the VCPU stats header, descriptors and
      data.
      Define VCPU statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce55c049
    • Jing Zhang's avatar
      KVM: stats: Support binary stats retrieval for a VM · fcfe1bae
      Jing Zhang authored
      Add a VM ioctl that returns a statistics file descriptor, which
      userspace can read to obtain the VM stats header, descriptors and
      data.
      Define VM statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fcfe1bae