1. 24 Jun, 2021 7 commits
  2. 23 Jun, 2021 1 commit
  3. 22 Jun, 2021 4 commits
  4. 21 Jun, 2021 5 commits
    • KVM: nVMX: Dynamically compute max VMCS index for vmcs12 · ba1f8245
      Sean Christopherson authored
      Calculate the max VMCS index for vmcs12 by walking the array to find the
      actual max index.  Hardcoding the index is prone to bitrot, and the
      calculation is only done on KVM bringup (albeit on every CPU, but there
      aren't _that_ many null entries in the array).
      
      Fixes: 3c0f9936 ("KVM: nVMX: Add a TSC multiplier field in VMCS12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210618214658.2700765-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
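
      The array walk described in this commit can be sketched in plain C as
      follows. This is a minimal userspace sketch, not the kernel code: the
      function names and the sample table are hypothetical, and a zero entry
      is assumed to mark an unused slot. It relies on the field index living
      in bits 9:1 of a VMCS field encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Bits 9:1 of a VMCS field encoding carry the field index. */
static unsigned int sketch_field_index(uint16_t encoding)
{
	return (encoding >> 1) & 0x1ff;
}

/*
 * Walk a sparse table of field encodings and return the largest index.
 * Simplification for this sketch: a zero entry means "slot unused".
 */
static unsigned int sketch_max_vmcs_index(const uint16_t *table, size_t n)
{
	unsigned int max_idx = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int idx;

		if (!table[i])
			continue;	/* skip null entries */
		idx = sketch_field_index(table[i]);
		if (idx > max_idx)
			max_idx = idx;
	}
	return max_idx;
}
```

      Because the maximum is derived from the table itself, adding a field
      with a higher index needs no matching update to a hardcoded constant.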
    • KVM: VMX: Skip #PF(RSVD) intercepts when emulating smaller maxphyaddr · 5140bc7d
      Jim Mattson authored
      As part of smaller maxphyaddr emulation, kvm needs to intercept
      present page faults to see if it needs to add the RSVD flag (bit 3) to
      the error code. However, there is no need to intercept page faults
      that already have the RSVD flag set. When setting up the page fault
      intercept, add the RSVD flag into the #PF error code mask field (but
      not the #PF error code match field) to skip the intercept when the
      RSVD flag is already set.

      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20210618235941.1041604-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
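
      The mask/match arrangement in this commit can be illustrated with a
      small sketch. It assumes the #PF bit is set in the exception bitmap, so
      that a fault is intercepted exactly when (error code & mask) == match.
      The macro and function names are hypothetical; only the bit positions
      (present = bit 0, RSVD = bit 3) are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PFEC_PRESENT	(1u << 0)	/* fault on a present page */
#define SKETCH_PFEC_RSVD	(1u << 3)	/* reserved bit violation */

/* Assumes the #PF bit is set in the exception bitmap. */
static bool sketch_pf_intercepted(uint32_t error_code, uint32_t mask,
				  uint32_t match)
{
	return (error_code & mask) == match;
}
```

      With mask = PRESENT | RSVD and match = PRESENT, present faults without
      RSVD are intercepted, while faults that already carry RSVD are not.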
    • KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE · f0c6fbbb
      Bharata B Rao authored
      H_RPT_INVALIDATE does two types of TLB invalidations:
      
      1. Process-scoped invalidations for guests when LPCR[GTSE]=0.
         This is currently not used in KVM as GTSE is not usually
         disabled in KVM.
      2. Partition-scoped invalidations that an L1 hypervisor does on
         behalf of an L2 guest. This is currently handled
         by the H_TLB_INVALIDATE hcall, which this new hcall replaces.
      
      This commit enables process-scoped invalidations for L1 guests.
      Support for process-scoped and partition-scoped invalidations
      from/for nested guests will be added separately.
      
      Process scoped tlbie invalidations from L1 and nested guests
      need RS register for TLBIE instruction to contain both PID and
      LPID.  This patch introduces primitives that execute tlbie
      instruction with both PID and LPID set in preparation for
      H_RPT_INVALIDATE hcall.
      
      A description of H_RPT_INVALIDATE follows:
      
      int64   /* H_Success: Return code on successful completion */
              /* H_Busy - repeat the call with the same */
              /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid
      	   parameters */
      hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT
      					translation
      					lookaside information */
            uint64 id,        /* PID/LPID to invalidate */
            uint64 target,    /* Invalidation target */
            uint64 type,      /* Type of lookaside information */
            uint64 pg_sizes,  /* Page sizes */
            uint64 start,     /* Start of Effective Address (EA)
      			   range (inclusive) */
            uint64 end)       /* End of EA range (exclusive) */
      
      Invalidation targets (target)
      -----------------------------
      Core MMU        0x01 /* All virtual processors in the
      			partition */
      Core local MMU  0x02 /* Current virtual processor */
      Nest MMU        0x04 /* All nest/accelerator agents
      			in use by the partition */
      
      A combination of the above can be specified,
      except core and core local.
      
      Type of translation to invalidate (type)
      ---------------------------------------
      NESTED       0x0001  /* invalidate nested guest partition-scope */
      TLB          0x0002  /* Invalidate TLB */
      PWC          0x0004  /* Invalidate Page Walk Cache */
      PRT          0x0008  /* Invalidate caching of Process Table
      			Entries if NESTED is clear */
      PAT          0x0008  /* Invalidate caching of Partition Table
      			Entries if NESTED is set */
      
      A combination of the above can be specified.
      
      Page size mask (pages)
      ----------------------
      4K              0x01
      64K             0x02
      2M              0x04
      1G              0x08
      All sizes       (-1UL)
      
      A combination of the above can be specified.
      All page sizes can be selected with -1.
      
      Semantics: Invalidate radix tree lookaside information
                 matching the parameters given.
      * Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters
        are different from the defined values.
      * Return H_PARAMETER if NESTED is set and pid is not a valid nested
        LPID allocated to this partition
      * Return H_P5 if (start, end) doesn't form a valid range. Start and
        end should be a valid Quadrant address and end > start.
      * Return H_NotSupported if the partition is not running in radix
        translation mode.
      * May invalidate more translation information than requested.
      * If start = 0 and end = -1, set the range to cover all valid
        addresses. Else start and end should be aligned to 4kB (lower 12
        bits clear).
      * If NESTED is clear, then invalidate process scoped lookaside
        information. Else pid specifies a nested LPID, and the invalidation
        is performed on nested guest partition table and nested guest
        partition scope real addresses.
      * If pid = 0 and NESTED is clear, then valid addresses are quadrant 3
        and quadrant 0 spaces. Else valid addresses are quadrant 0.
      * Pages which are fully covered by the range are to be invalidated.
        Those which are partially covered are considered outside
        invalidation range, which allows a caller to optimally invalidate
        ranges that may contain mixed page sizes.
      * Return H_SUCCESS on success.

      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210621085003.904767-4-bharata@linux.ibm.com
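
      The argument checks listed in the hcall description can be sketched as
      a standalone validator. This is only an illustration under stated
      assumptions: the enum values are placeholders rather than the real
      hcall return codes, an all-zero target, type, or page-size mask is
      assumed to be invalid, and the quadrant and nested-LPID checks are
      omitted.

```c
#include <stdint.h>

/* Placeholder status values, not the real H_* return codes. */
enum sketch_rpti_status {
	SKETCH_OK,		/* stands in for H_SUCCESS */
	SKETCH_BAD_TARGET,	/* stands in for H_P2 */
	SKETCH_BAD_TYPE,	/* stands in for H_P3 */
	SKETCH_BAD_PG_SIZES,	/* stands in for H_P4 */
	SKETCH_BAD_RANGE,	/* stands in for H_P5 */
};

#define SKETCH_TARGET_CORE		0x01ull
#define SKETCH_TARGET_CORE_LOCAL	0x02ull
#define SKETCH_TARGET_ALL		0x07ull	/* core, core local, nest */
#define SKETCH_TYPE_ALL			0x0full	/* NESTED, TLB, PWC, PRT/PAT */
#define SKETCH_PG_DEFINED		0x0full	/* 4K, 64K, 2M, 1G */
#define SKETCH_4K_MASK			0xfffull

static enum sketch_rpti_status
sketch_rpti_check(uint64_t target, uint64_t type, uint64_t pg_sizes,
		  uint64_t start, uint64_t end)
{
	/* Targets may be combined, except core with core local. */
	if (!target || (target & ~SKETCH_TARGET_ALL) ||
	    ((target & SKETCH_TARGET_CORE) &&
	     (target & SKETCH_TARGET_CORE_LOCAL)))
		return SKETCH_BAD_TARGET;

	if (!type || (type & ~SKETCH_TYPE_ALL))
		return SKETCH_BAD_TYPE;

	/* All ones selects every page size. */
	if (pg_sizes != UINT64_MAX &&
	    (!pg_sizes || (pg_sizes & ~SKETCH_PG_DEFINED)))
		return SKETCH_BAD_PG_SIZES;

	/* start = 0, end = -1 covers all valid addresses. */
	if (start == 0 && end == UINT64_MAX)
		return SKETCH_OK;

	/* Otherwise: end > start, both aligned to 4kB. */
	if (end <= start || (start & SKETCH_4K_MASK) || (end & SKETCH_4K_MASK))
		return SKETCH_BAD_RANGE;

	return SKETCH_OK;
}
```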
    • powerpc/book3s64/radix: Add H_RPT_INVALIDATE pgsize encodings to mmu_psize_def · d6265cb3
      Bharata B Rao authored
      Add a field to mmu_psize_def to store the page size encodings
      of H_RPT_INVALIDATE hcall. Initialize this while scanning the radix
      AP encodings. This will be used when invalidating with required
      page size encoding in the hcall.

      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210621085003.904767-3-bharata@linux.ibm.com
    • KVM: PPC: Book3S HV: Fix comments of H_RPT_INVALIDATE arguments · f09216a1
      Aneesh Kumar K.V authored
      The type values H_RPTI_TYPE_PRT and H_RPTI_TYPE_PAT indicate
      invalidating the caching of process and partition scoped entries
      respectively.

      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210621085003.904767-2-bharata@linux.ibm.com
  5. 20 Jun, 2021 1 commit
  6. 18 Jun, 2021 8 commits
  7. 17 Jun, 2021 14 commits
    • KVM: x86/mmu: Fix TDP MMU page table level · f1b83255
      Kai Huang authored
      TDP MMU iterator's level is identical to page table's actual level.  For
      instance, for the last level page table (whose entry points to one 4K
      page), iter->level is 1 (PG_LEVEL_4K), and in case of 5 level paging,
      the iter->level is mmu->shadow_root_level, which is 5.  However, struct
      kvm_mmu_page's level currently is not set correctly when it is allocated
      in kvm_tdp_mmu_map().  When iterator hits non-present SPTE and needs to
      allocate a new child page table, currently iter->level, which is the
      level of the page table where the non-present SPTE belongs to, is used.
      This results in struct kvm_mmu_page's level always having its parent's
      level (except root table's level, which is initialized explicitly using
      mmu->shadow_root_level).
      
      This is kinda wrong, and not consistent with existing non TDP MMU code.
      Fortunately sp->role.level is only used in handle_removed_tdp_mmu_page()
      and kvm_tdp_mmu_zap_sp(), and they are already aware of this and behave
      correctly.  However to make it consistent with legacy MMU code (and fix
      the issue that both root page table and its child page table have
      shadow_root_level), use iter->level - 1 in kvm_tdp_mmu_map(), and change
      handle_removed_tdp_mmu_page() and kvm_tdp_mmu_zap_sp() accordingly.

      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <bcb6569b6e96cb78aaa7b50640e6e6b53291a74e.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
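
      The level fix can be pictured with a toy walk. This is a sketch with a
      hypothetical function name: when the iterator sits at a given level and
      needs a new child table, the child is assigned that level minus one, so
      no child ever shares the root's level.

```c
#include <stddef.h>

/*
 * Walk from the root level toward the 4K leaf level (1) and record the
 * level assigned to each newly allocated child table (iter level - 1).
 * Returns the number of child tables recorded.
 */
static size_t sketch_child_table_levels(int root_level, int *levels,
					size_t cap)
{
	size_t n = 0;

	for (int iter_level = root_level; iter_level > 1 && n < cap;
	     iter_level--)
		levels[n++] = iter_level - 1;
	return n;
}
```

      With a 4-level root the children get levels 3, 2 and 1; with the
      previous behavior described above they would have been 4, 3 and 2.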
    • KVM: x86/mmu: Fix pf_fixed count in tdp_mmu_map_handle_target_level() · 857f8474
      Kai Huang authored
      Currently pf_fixed is not increased when prefault is true.  This is not
      correct, since prefault here really means "async page fault completed".
      In that case, the original page fault from the guest was morphed into an
      async page fault and pf_fixed was not increased.  So when prefault
      indicates async page fault is completed, pf_fixed should be increased.
      
      Additionally, currently pf_fixed is also increased even when page fault
      is spurious, while legacy MMU increases pf_fixed when page fault returns
      RET_PF_EMULATE or RET_PF_FIXED.
      
      To fix above two issues, change to increase pf_fixed when return value
      is not RET_PF_SPURIOUS (RET_PF_RETRY has already been ruled out by
      reaching here).
      
      More information:
      https://lore.kernel.org/kvm/cover.1620200410.git.kai.huang@intel.com/T/#mbb5f8083e58a2cd262231512b9211cbe70fc3bd5
      
      Fixes: bb18842e ("kvm: x86/mmu: Add TDP MMU PF handler")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <2ea8b7f5d4f03c99b32bc56fc982e1e4e3d3fc6b.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
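
      The accounting rule from this commit can be sketched as below. The
      struct and function names are hypothetical, and apart from
      RET_PF_RETRY being 0 (stated in the related commit), the enum values
      are illustrative only.

```c
#include <stdint.h>

/* Only "RET_PF_RETRY is 0" comes from the log; other values are made up. */
enum sketch_ret_pf {
	RET_PF_RETRY = 0,
	RET_PF_EMULATE,
	RET_PF_FIXED,
	RET_PF_SPURIOUS,
};

struct sketch_stats {
	uint64_t pf_fixed;
};

static void sketch_account_fault(struct sketch_stats *stats,
				 enum sketch_ret_pf ret)
{
	/* Models "RET_PF_RETRY has already been ruled out by reaching here". */
	if (ret == RET_PF_RETRY)
		return;

	/* Count everything except spurious faults, prefault or not. */
	if (ret != RET_PF_SPURIOUS)
		stats->pf_fixed++;
}
```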
    • KVM: x86/mmu: Fix return value in tdp_mmu_map_handle_target_level() · 57a3e96d
      Kai Huang authored
      Currently tdp_mmu_map_handle_target_level() returns 0, which is
      RET_PF_RETRY, when page fault is actually fixed.  This makes
      kvm_tdp_mmu_map() also return RET_PF_RETRY in this case, instead of
      RET_PF_FIXED.  Fix by initializing ret to RET_PF_FIXED.
      
      Note that kvm_mmu_page_fault() resumes the guest on both RET_PF_RETRY and
      RET_PF_FIXED, which means in practice returning either makes no
      difference, so this fix alone isn't necessary for the stable tree.
      
      Fixes: bb18842e ("kvm: x86/mmu: Add TDP MMU PF handler")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f9e8956223a586cd28c090879a8ff40f5eb6d609.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Keep stored TMCCT register value 0 after KVM_SET_LAPIC · 2735886c
      Wanpeng Li authored
      KVM_GET_LAPIC stores the current value of TMCCT, and KVM_SET_LAPIC's
      memcpy stores it in vcpu->arch.apic->regs.  KVM_SET_LAPIC could store
      zero in vcpu->arch.apic->regs after it uses it, so that the stored value
      would always be zero.  In addition, the TMCCT is always computed
      on-demand and never directly readable.

      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623223000-18116-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall · 0dbb1123
      Ashish Kalra authored
      This hypercall is used by the SEV guest to notify a change in the page
      encryption status to the hypervisor. The hypercall should be invoked
      only when the encryption attribute is changed from encrypted -> decrypted
      and vice versa. By default all guest pages are considered encrypted.
      
      The hypercall exits to userspace to manage the guest shared regions and
      integrate with the userspace VMM's migration code.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
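
      The "notify only on a change" rule can be sketched from the guest's
      point of view. Everything here is hypothetical: the tracker struct, the
      64-page limit, and the counter that stands in for issuing the
      hypercall. The sketch only models the two rules stated above: pages
      start out encrypted, and a notification is sent only when the status
      actually flips.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NPAGES 64u

struct sketch_enc_tracker {
	uint64_t decrypted;	/* bit i set: page i is decrypted (shared) */
	unsigned int notifications;	/* stand-in for hypercalls issued */
};

/* An all-zero tracker means every page is encrypted, the default state. */
static void sketch_set_page_encrypted(struct sketch_enc_tracker *t,
				      unsigned int page, bool encrypted)
{
	bool was_encrypted;

	if (page >= SKETCH_NPAGES)
		return;

	was_encrypted = !(t->decrypted & (1ull << page));
	if (was_encrypted == encrypted)
		return;		/* no status change, no notification */

	if (encrypted)
		t->decrypted &= ~(1ull << page);
	else
		t->decrypted |= 1ull << page;
	t->notifications++;	/* where the hypercall would be made */
}
```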
    • KVM: switch per-VM stats to u64 · e3cb6fa0
      Paolo Bonzini authored
      Make them the same type as vCPU stats.  There is no reason
      to limit the counters to unsigned long.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Grab nx_lpage_splits as an unsigned long before division · ade74e14
      Sean Christopherson authored
      Snapshot kvm->stats.nx_lpage_splits into a local unsigned long to avoid
      64-bit division on 32-bit kernels.  Casting to an unsigned long is safe
      because the maximum number of shadow pages, n_max_mmu_pages, is also an
      unsigned long, i.e. KVM will start recycling shadow pages before the
      number of splits can exceed a 32-bit value.
      
        ERROR: modpost: "__udivdi3" [arch/x86/kvm/kvm.ko] undefined!
      
      Fixes: 7ee093d4f3f5 ("KVM: switch per-VM stats to u64")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210615162905.2132937-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
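
      The snapshot-then-divide pattern looks roughly like this. The function
      name, the ratio parameter, and the round-up division are assumptions
      made for illustration. The point is that the division operates on an
      unsigned long, which on a 32-bit build avoids a 64-bit division helper
      such as __udivdi3.

```c
#include <stdint.h>

/* Hypothetical helper: how many pages to process for a given ratio. */
static unsigned long sketch_pages_to_zap(uint64_t nx_lpage_splits_stat,
					 unsigned int ratio)
{
	/* Snapshot the u64 stat into a native-width local before dividing. */
	unsigned long splits = (unsigned long)nx_lpage_splits_stat;

	/* Round-up division on unsigned long operands only. */
	return ratio ? (splits + ratio - 1) / ratio : 0;
}
```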
    • KVM: x86: Check for pending interrupts when APICv is getting disabled · bca66dbc
      Vitaly Kuznetsov authored
      When APICv is active, interrupt injection doesn't raise KVM_REQ_EVENT
      request (see __apic_accept_irq()) as the required work is done by hardware.
      In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt
      may never get delivered.
      
      Currently, the described situation is hardly possible: all
      kvm_request_apicv_update() calls normally happen upon VM creation when
      no interrupts are pending. We are, however, going to move unconditional
      kvm_request_apicv_update() call from kvm_hv_activate_synic() to
      synic_update_vector() and without this fix 'hyperv_connections' test from
      kvm-unit-tests gets stuck on IPI delivery attempt right after configuring
      a SynIC route which triggers APICv disablement.

      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210609150911.1471882c-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Drop redundant checks on vmcs12 in EPTP switching emulation · c5ffd408
      Sean Christopherson authored
      Drop the explicit check on EPTP switching being enabled.  The EPTP
      switching check is handled in the generic VMFUNC function check, while
      the underlying VMFUNC enablement check is done by hardware and redone
      by generic VMFUNC emulation.
      
      The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
      consistency check, keep it but add a WARN.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: WARN if subtly-impossible VMFUNC conditions occur · 546e8398
      Sean Christopherson authored
      WARN and inject #UD when emulating VMFUNC for L2 if the function is
      out-of-bounds or if VMFUNC is not enabled in vmcs12.  Neither condition
      should occur in practice, as the CPU is supposed to prioritize the #UD
      over VM-Exit for out-of-bounds input and KVM is supposed to enable
      VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
      those dependencies is obvious.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop pointless @reset_roots from kvm_init_mmu() · c9060662
      Sean Christopherson authored
      Remove the @reset_roots param from kvm_init_mmu(); the one user,
      kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
      invalidated all roots.  This also happens to be why the reset_roots=true
      paths don't leak roots; they're already invalid.
      
      No functional change intended.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Defer MMU sync on PCID invalidation · e62f1aa8
      Sean Christopherson authored
      Defer the MMU sync on PCID invalidation so that multiple sync requests in
      a single VM-Exit are batched.  This is a very minor optimization as
      checking for unsync'd children is quite cheap.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
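
      The batching effect of deferring the sync can be shown with a request
      bitmask. The names below are hypothetical and this is not KVM's request
      machinery, just the idea: several sync requests raised during one exit
      collapse into one bit, which is serviced once before the guest is
      re-entered.

```c
#include <stdint.h>

#define SKETCH_REQ_MMU_SYNC (1u << 0)	/* hypothetical request bit */

struct sketch_vcpu {
	uint32_t requests;	/* pending request bits */
	unsigned int syncs_done;	/* how many syncs actually ran */
};

/* Raising the same request repeatedly just leaves the bit set. */
static void sketch_make_request(struct sketch_vcpu *v, uint32_t req)
{
	v->requests |= req;
}

/* Called once before re-entering the guest. */
static void sketch_service_requests(struct sketch_vcpu *v)
{
	if (v->requests & SKETCH_REQ_MMU_SYNC) {
		v->requests &= ~SKETCH_REQ_MMU_SYNC;
		v->syncs_done++;	/* one sync for the whole batch */
	}
}
```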
    • KVM: nVMX: Use fast PGD switch when emulating VMFUNC[EPTP_SWITCH] · 39353ab5
      Sean Christopherson authored
      Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
      VMFUNC[EPTP_SWITCH] instead of nuking all MMUs.  EPTP_SWITCH is the EPT
      equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
      the only hiccup being that A/D enabling is buried in the EPTP.  But, that
      is easily handled by bouncing through kvm_init_shadow_ept_mmu().
      
      Explicitly request a guest TLB flush if VPID is disabled.  Per Intel's
      SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
      mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
      values, where EP4TA is the value of bits 51:12 of EPTP)".
      
      Note, this technically is a very bizarre bug fix of sorts if L2 is using
      PAE paging, as avoiding the full MMU reload also avoids incorrectly
      reloading the PDPTEs, which the SDM explicitly states are not touched:
      
        If PAE paging is in use, an EPTP-switching VMFUNC does not load the
        four page-directory-pointer-table entries (PDPTEs) from the
        guest-physical address in CR3. The logical processor continues to use
        the four guest-physical addresses already present in the PDPTEs. The
        guest-physical address in CR3 is not translated through the new EPT
        paging structures (until some operation that would load the PDPTEs).
      
      In addition to optimizing L2's MMU shenanigans, avoiding the full reload
      also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
      root_mmu and guest_mmu.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
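
      The two EPTP fields this commit talks about can be picked apart as
      follows. The helper names are hypothetical. EP4TA is taken as the value
      of bits 51:12, as quoted above, and the accessed/dirty enable is
      assumed to be bit 6 of the EPTP.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EPTP_AD_ENABLE	(1ull << 6)	/* assumed A/D enable bit */
#define SKETCH_EP4TA_BITS	40		/* bits 51:12 inclusive */

/* Value of bits 51:12 of the EPTP. */
static uint64_t sketch_eptp_ep4ta(uint64_t eptp)
{
	return (eptp >> 12) & ((1ull << SKETCH_EP4TA_BITS) - 1);
}

static bool sketch_eptp_ad_enabled(uint64_t eptp)
{
	return (eptp & SKETCH_EPTP_AD_ENABLE) != 0;
}
```

      Two EPTP values can share the same EP4TA and still differ in A/D
      enabling, which is why the A/D setting has to be extracted separately
      when switching.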
    • KVM: x86: Use KVM_REQ_TLB_FLUSH_GUEST to handle INVPCID(ALL) emulation · 28f28d45
      Sean Christopherson authored
      Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating
      INVPCID of all contexts.  In the current code, this is a glorified nop as
      TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP
      is disabled, which is the only time INVPCID is intercepted+emulated.
      In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths
      that emulate a guest TLB flush, e.g. by synchronizing as needed instead
      of completely unloading all MMUs.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>