1. 21 Mar, 2022 1 commit
  2. 18 Mar, 2022 3 commits
  3. 15 Mar, 2022 2 commits
  4. 14 Mar, 2022 6 commits
  5. 11 Mar, 2022 9 commits
  6. 09 Mar, 2022 4 commits
  7. 08 Mar, 2022 15 commits
    • Suravee Suthikulpanit's avatar
      KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255 · 4a204f78
      Suravee Suthikulpanit authored
      Expand KVM's mask for the AVIC host physical ID to the full 12 bits defined
      by the architecture.  The number of bits consumed by hardware is model
      specific, e.g. early CPUs ignored bits 11:8, but there is no way for KVM
      to enumerate the "true" size.  So, KVM must allow using all bits, else it
      risks rejecting completely legal x2APIC IDs on newer CPUs.
      
      This means KVM relies on hardware to not assign x2APIC IDs that exceed the
      "true" width of the field, but presumably hardware is smart enough to tie
      the width to the max x2APIC ID.  KVM also relies on hardware to support at
      least 8 bits, as the legacy xAPIC ID is writable by software.  But, those
      assumptions are unavoidable due to the lack of any way to enumerate the
      "true" width.
      
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Fixes: 44a95dae ("KVM: x86: Detect and Initialize AVIC support")
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220211000851.185799-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4a204f78
    • Sean Christopherson's avatar
      KVM: selftests: Add test to populate a VM with the max possible guest mem · b58c55d5
      Sean Christopherson authored
      Add a selftest that enables populating a VM with the maximum amount of
      guest memory allowed by the underlying architecture.  Abuse KVM's
      memslots by mapping a single host memory region into multiple memslots so
      that the selftest doesn't require a system with terabytes of RAM.
      
      Default to 512gb of guest memory, which isn't all that interesting, but
      should work on all MMUs and doesn't take an exorbitant amount of memory
      or time.  E.g. testing with ~64tb of guest memory takes the better part
      of an hour, and requires 200gb of memory for KVM's page tables when using
      4kb pages.
      
      To inflicit maximum abuse on KVM' MMU, default to 4kb pages (or whatever
      the not-hugepage size is) in the backing store (memfd).  Use memfd for
      the host backing store to ensure that hugepages are guaranteed when
      requested, and to give the user explicit control of the size of hugepage
      being tested.
      
      By default, spin up as many vCPUs as there are available to the selftest,
      and distribute the work of dirtying each 4kb chunk of memory across all
      vCPUs.  Dirtying guest memory forces KVM to populate its page tables, and
      also forces KVM to write back accessed/dirty information to struct page
      when the guest memory is freed.
      
      On x86, perform two passes with a MMU context reset between each pass to
      coerce KVM into dropping all references to the MMU root, e.g. to emulate
      a vCPU dropping the last reference.  Perform both passes and all
      rendezvous on all architectures in the hope that arm64 and s390x can gain
      similar shenanigans in the future.
      
      Measure and report the duration of each operation, which is helpful not
      only to verify the test is working as intended, but also to easily
      evaluate the performance differences different page sizes.
      
      Provide command line options to limit the amount of guest memory, set the
      size of each slot (i.e. of the host memory region), set the number of
      vCPUs, and to enable usage of hugepages.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-29-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b58c55d5
    • Sean Christopherson's avatar
      KVM: selftests: Define cpu_relax() helpers for s390 and x86 · 17ae5ebc
      Sean Christopherson authored
      Add cpu_relax() for s390 and x86 for use in arch-agnostic tests.  arm64
      already defines its own version.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-28-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      17ae5ebc
    • Sean Christopherson's avatar
      KVM: selftests: Split out helper to allocate guest mem via memfd · a4187c9b
      Sean Christopherson authored
      Extract the code for allocating guest memory via memfd out of
      vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
      A future selftest to populate a guest with the maximum amount of guest
      memory will abuse KVM's memslots to alias guest memory regions to a
      single memfd-backed host region, i.e. needs to back a guest with memfd
      memory without a 1:1 association between a memslot and a memfd instance.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-27-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a4187c9b
    • Sean Christopherson's avatar
      KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils · 3d7d6043
      Sean Christopherson authored
      Move set_memory_region_test's KVM_SET_USER_MEMORY_REGION helper to KVM's
      utils so that it can be used by other tests.  Provide a raw version as
      well as an assert-success version to reduce the amount of boilerplate
      code need for basic usage.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-26-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3d7d6043
    • Sean Christopherson's avatar
      KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE · 396fd74d
      Sean Christopherson authored
      Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE.
      This solves a conundrum introduced by commit 3255530a ("KVM: x86/mmu:
      Automatically update iter->old_spte if cmpxchg fails"); if the helper
      doesn't update old_spte in the REMOVED case, then theoretically the
      caller could get stuck in an infinite loop as it will fail indefinitely
      on the REMOVED SPTE.  E.g. until recently, clear_dirty_gfn_range() didn't
      check for a present SPTE and would have spun until getting rescheduled.
      
      In practice, only the page fault path should "create" a new SPTE, all
      other paths should only operate on existing, a.k.a. shadow present,
      SPTEs.  Now that the page fault path pre-checks for a REMOVED SPTE in all
      cases, require all other paths to indirectly pre-check by verifying the
      target SPTE is a shadow-present SPTE.
      
      Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that
      scenario disallowed.  The invariant is only that the caller mustn't
      invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last
      observed by the caller.
      
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-25-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      396fd74d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE · 58298b06
      Sean Christopherson authored
      Explicitly check for a REMOVED leaf SPTE prior to attempting to map
      the final SPTE when handling a TDP MMU fault.  Functionally, this is a
      nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
      Pre-checking for a REMOVED SPTE is a minor optmization, but the real goal
      is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
      SPTE is never a REMOVED SPTE.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-24-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      58298b06
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Zap defunct roots via asynchronous worker · efd995da
      Paolo Bonzini authored
      Zap defunct roots, a.k.a. roots that have been invalidated after their
      last reference was initially dropped, asynchronously via the existing work
      queue instead of forcing the work upon the unfortunate task that happened
      to drop the last reference.
      
      If a vCPU task drops the last reference, the vCPU is effectively blocked
      by the host for the entire duration of the zap.  If the root being zapped
      happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
      being active, the zap can take several hundred seconds.  Unsurprisingly,
      most guests are unhappy if a vCPU disappears for hundreds of seconds.
      
      E.g. running a synthetic selftest that triggers a vCPU root zap with
      ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
      Offloading the zap to a worker drops the block time to <100ms.
      
      There is an important nuance to this change.  If the same work item
      was queued twice before the work function has run, it would only
      execute once and one reference would be leaked.  Therefore, now that
      queueing and flushing items is not anymore protected by kvm->slots_lock,
      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
      must return only after those skipped roots have been zapped as well.
      These two requirements can be satisfied only if _all_ places that
      change invalid to true now schedule the worker before releasing the
      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
      kvm_tdp_mmu_invalidate_all_roots().
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-23-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      efd995da
    • Sean Christopherson's avatar
      KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls · 1b6043e8
      Sean Christopherson authored
      When zapping a TDP MMU root, perform the zap in two passes to avoid
      zapping an entire top-level SPTE while holding RCU, which can induce RCU
      stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
      zap top-level entries in the second pass.
      
      With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
      SPTEs take up to ~7 or so seconds (time varies based on kernel config,
      number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
      well into hundreds of seconds.
      
      Before remote TLB flushes were omitted, the problem was even worse as
      waiting for all active vCPUs to respond to the IPI introduced significant
      overhead for VMs with large numbers of vCPUs.
      
      By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
      the amount of work that is done without dropping RCU protection is
      strictly bounded, with the worst case latency for a single operation
      being less than 100ms.
      
      Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
      KVM relies on being able to zap 1gb shadow pages in a single shot when
      when repacing a shadow page with a hugepage.  Zapping a 1gb shadow page
      that is fully populated with 4kb dirty SPTEs also triggers the worst case
      latency due writing back the struct page accessed/dirty bits for each 4kb
      page, i.e. the two-pass approach is guaranteed to work so long as KVM can
      cleany zap a 1gb shadow page.
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                                softirq=15759/15759 fqs=5058
         (t=21016 jiffies g=66453 q=238577)
        NMI backtrace for cpu 52
        Call Trace:
         ...
         mark_page_accessed+0x266/0x2f0
         kvm_set_pfn_accessed+0x31/0x40
         handle_removed_tdp_mmu_page+0x259/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         zap_gfn_range+0x141/0x3b0
         kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
         kvm_mmu_zap_all_fast+0x121/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x172/0x4e0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0x49c/0xf80
      Reported-by: default avatarDavid Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-22-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1b6043e8
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root · 8351779c
      Paolo Bonzini authored
      Allow yielding when zapping SPTEs after the last reference to a valid
      root is put.  Because KVM must drop all SPTEs in response to relevant
      mmu_notifier events, mark defunct roots invalid and reset their refcount
      prior to zapping the root.  Keeping the refcount elevated while the zap
      is in-progress ensures the root is reachable via mmu_notifier until the
      zap completes and the last reference to the invalid, defunct root is put.
      
      Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
      root in being put has a massive paging structure, e.g. zapping a root
      that is backed entirely by 4kb pages for a guest with 32tb of memory can
      take hundreds of seconds to complete.
      
        watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
        RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
         __handle_changed_spte+0x1b2/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         tdp_mmu_zap_root+0x307/0x4d0 [kvm]
         kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
         kvm_mmu_free_roots+0x22d/0x350 [kvm]
         kvm_mmu_reset_context+0x20/0x60 [kvm]
         kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
         kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KVM currently doesn't put a root from a non-preemptible context, so other
      than the mmu_notifier wrinkle, yielding when putting a root is safe.
      
      Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
      take a reference to each root (it requires mmu_lock be held for the
      entire duration of the walk).
      
      tdp_mmu_next_root() is used only by the yield-friendly iterator.
      
      tdp_mmu_zap_root_work() is explicitly yield friendly.
      
      kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
      but is still yield-friendly in all call sites, as all callers can be
      traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
      kvm_create_vm().
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-21-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8351779c
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini authored
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      22b94c4b
    • Sean Christopherson's avatar
      KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Sean Christopherson authored
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bb95dfb9
    • Sean Christopherson's avatar
      KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched · bd296779
      Sean Christopherson authored
      When yielding in the TDP MMU iterator, service any pending TLB flush
      before dropping RCU protections in anticipation of using the caller's RCU
      "lock" as a proxy for vCPUs in the guest.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-19-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bd296779
    • Sean Christopherson's avatar
      KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Sean Christopherson authored
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cf3e2642
    • Sean Christopherson's avatar
      KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range · acbda82a
      Sean Christopherson authored
      Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
      support for zapping with mmu_lock held for read.  That all callers hold
      mmu_lock for write isn't a random coincidence; now that the paths that
      need to zap _everything_ have their own path, the only callers left are
      those that need to zap for functional correctness.  And when zapping is
      required for functional correctness, mmu_lock must be held for write,
      otherwise the caller has no guarantees about the state of the TDP MMU
      page tables after it has run, e.g. the SPTE(s) it zapped can be
      immediately replaced by a vCPU faulting in a page.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-17-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      acbda82a