1. 08 Mar, 2022 26 commits
    • KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls · 1b6043e8
      Sean Christopherson authored
      When zapping a TDP MMU root, perform the zap in two passes to avoid
      zapping an entire top-level SPTE while holding RCU, which can induce RCU
      stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
      zap top-level entries in the second pass.
      
      With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
      SPTEs takes up to ~7 seconds (time varies based on kernel config,
      number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
      well into hundreds of seconds.
      
      Before remote TLB flushes were omitted, the problem was even worse as
      waiting for all active vCPUs to respond to the IPI introduced significant
      overhead for VMs with large numbers of vCPUs.
      
      By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
      the amount of work that is done without dropping RCU protection is
      strictly bounded, with the worst case latency for a single operation
      being less than 100ms.
      
      Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
      KVM relies on being able to zap 1gb shadow pages in a single shot when
      replacing a shadow page with a hugepage.  Zapping a 1gb shadow page
      that is fully populated with 4kb dirty SPTEs also triggers the worst case
      latency due to writing back the struct page accessed/dirty bits for each
      4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM
      can cleanly zap a 1gb shadow page (see the sketch following the stack
      trace below).
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                                softirq=15759/15759 fqs=5058
         (t=21016 jiffies g=66453 q=238577)
        NMI backtrace for cpu 52
        Call Trace:
         ...
         mark_page_accessed+0x266/0x2f0
         kvm_set_pfn_accessed+0x31/0x40
         handle_removed_tdp_mmu_page+0x259/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         zap_gfn_range+0x141/0x3b0
         kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
         kvm_mmu_zap_all_fast+0x121/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x172/0x4e0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0x49c/0xf80
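      As an illustration of the two-pass structure, a minimal sketch (the helper
      name and exact shape are assumptions for illustration, not the actual KVM
      code):

        static void zap_root_in_two_passes(struct kvm *kvm, struct kvm_mmu_page *root)
        {
                /*
                 * Pass 1: zap SPTEs at PG_LEVEL_1G, one 1gb region (and its
                 * subtree) at a time, so the work done under RCU is bounded.
                 */
                zap_root_at_level(kvm, root, PG_LEVEL_1G);      /* hypothetical */

                /* Pass 2: the top-level entries are now cheap to zap. */
                zap_root_at_level(kvm, root, root->role.level); /* hypothetical */
        }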
      Reported-by: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1b6043e8
    • KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root · 8351779c
      Paolo Bonzini authored
      Allow yielding when zapping SPTEs after the last reference to a valid
      root is put.  Because KVM must drop all SPTEs in response to relevant
      mmu_notifier events, mark defunct roots invalid and reset their refcount
      prior to zapping the root.  Keeping the refcount elevated while the zap
      is in-progress ensures the root is reachable via mmu_notifier until the
      zap completes and the last reference to the invalid, defunct root is put.
      
      Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
      root being put has a massive paging structure, e.g. zapping a root
      that is backed entirely by 4kb pages for a guest with 32tb of memory can
      take hundreds of seconds to complete.
      
        watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
        RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
         __handle_changed_spte+0x1b2/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         tdp_mmu_zap_root+0x307/0x4d0 [kvm]
         kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
         kvm_mmu_free_roots+0x22d/0x350 [kvm]
         kvm_mmu_reset_context+0x20/0x60 [kvm]
         kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
         kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KVM currently doesn't put a root from a non-preemptible context, so other
      than the mmu_notifier wrinkle, yielding when putting a root is safe.
      
      Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
      take a reference to each root (it requires mmu_lock be held for the
      entire duration of the walk).
      
      tdp_mmu_next_root() is used only by the yield-friendly iterator.
      
      tdp_mmu_zap_root_work() is explicitly yield friendly.
      
      kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
      but is still yield-friendly in all call sites, as all callers can be
      traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
      kvm_create_vm().
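      The ordering this relies on, as a minimal sketch (the helper names below
      are hypothetical stand-ins, not the real KVM functions):

        static void put_defunct_root(struct kvm *kvm, struct kvm_mmu_page *root)
        {
                /* Mark the root invalid so readers won't take new references. */
                root->role.invalid = true;

                /*
                 * Zap with yielding allowed; the refcount stays elevated so the
                 * root remains reachable via mmu_notifier until the zap is done.
                 */
                tdp_mmu_zap_root_yielding(kvm, root);   /* hypothetical */

                /* Only now is the last reference to the defunct root dropped. */
                tdp_mmu_drop_last_ref(kvm, root);       /* hypothetical */
        }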
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8351779c
    • KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini authored
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
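      Roughly, as a sketch (the field and workqueue names are assumptions for
      illustration, not necessarily the in-tree names):

        /* Each invalidated root gets its own work item... */
        static void invalidate_root(struct kvm *kvm, struct kvm_mmu_page *root)
        {
                root->role.invalid = true;
                INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
                queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
        }

        /* ...and the "fast zap" path simply waits for all of them to finish. */
        static void zap_invalidated_roots(struct kvm *kvm)
        {
                flush_workqueue(kvm->arch.tdp_mmu_zap_wq); /* under kvm->slots_lock */
        }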
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      22b94c4b
    • KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Sean Christopherson authored
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
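      The resulting calling convention, sketched (simplified; these are not the
      exact KVM signatures):

        /* The zapper reports whether a TLB flush is needed instead of flushing. */
        static bool zap_range(struct kvm *kvm, gfn_t start, gfn_t end); /* sketch */

        static void caller_example(struct kvm *kvm, gfn_t start, gfn_t end)
        {
                bool flush = zap_range(kvm, start, end);

                /* Service the pending flush before dropping RCU protection. */
                if (flush)
                        kvm_flush_remote_tlbs(kvm);
        }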
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb95dfb9
    • KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched · bd296779
      Sean Christopherson authored
      When yielding in the TDP MMU iterator, service any pending TLB flush
      before dropping RCU protections in anticipation of using the caller's RCU
      "lock" as a proxy for vCPUs in the guest.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd296779
    • KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Sean Christopherson authored
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cf3e2642
    • KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range · acbda82a
      Sean Christopherson authored
      Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
      support for zapping with mmu_lock held for read.  That all callers hold
      mmu_lock for write isn't a random coincidence; now that the paths that
      need to zap _everything_ have their own path, the only callers left are
      those that need to zap for functional correctness.  And when zapping is
      required for functional correctness, mmu_lock must be held for write,
      otherwise the caller has no guarantees about the state of the TDP MMU
      page tables after it has run, e.g. the SPTE(s) it zapped can be
      immediately replaced by a vCPU faulting in a page.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      acbda82a
    • KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page · e2b5b21d
      Sean Christopherson authored
      Add a dedicated helper for zapping a TDP MMU root, and use it in the three
      flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
      are zapped (zapping an entire root is safe if and only if it cannot be in
      use by any vCPU).  Because a TLB flush is never required, unconditionally
      pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
      
      Opportunistically document why KVM must not yield when zapping roots that
      are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
      reached zero, and further harden the flow to detect improper KVM behavior
      with respect to roots that are supposed to be unreachable.
      
      In addition to hardening zapping of roots, isolating zapping of roots
      will allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs, and by removing its tricky "zap all" heuristic.  By having
      all paths that truly need to free _all_ SPs flow through the dedicated
      root zapper, the generic zapper can be freed of those concerns.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e2b5b21d
    • KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU · 77c8cd6b
      Sean Christopherson authored
      Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
      uses the slow version of "zap everything" is when the VM is being
      destroyed or the owning mm has exited.  In either case, KVM_RUN is
      unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-15-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      77c8cd6b
    • KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery · c10743a1
      Sean Christopherson authored
      When recovering a potential hugepage that was shattered for the iTLB
      multihit workaround, precisely zap only the target page instead of
      iterating over the TDP MMU to find the SP that was passed in.  This will
      allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c10743a1
    • KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values · 626808d1
      Sean Christopherson authored
      Refactor __tdp_mmu_set_spte() to work with raw values instead of a
      tdp_iter object so that a future patch can modify SPTEs without doing a
      walk, and without having to synthesize a tdp_iter.
      
      No functional change intended.
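      Illustrative before/after prototypes (the exact parameter lists below are
      assumptions for illustration, not the committed signatures):

        /* Before: the helper digs everything out of a tdp_iter. */
        static void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
                                       u64 new_spte);

        /* After (sketch): callers pass the raw values directly. */
        static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
                                       u64 old_spte, u64 new_spte, gfn_t gfn,
                                       int level);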
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-13-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      626808d1
    • KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path · 966da62a
      Sean Christopherson authored
      WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
      which is called out by the comment as being disallowed but not actually
      checked.  Keep the WARN on the old_spte as well, because overwriting a
      REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by the
      lack of splats with the existing WARN).
      
      Fixes: 08f07c80 ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-12-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      966da62a
    • KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU · 0e587aa7
      Sean Christopherson authored
      Add helpers to read and write TDP MMU SPTEs instead of open coding
      rcu_dereference() all over the place, and to provide a convenient
      location to document why KVM doesn't exempt holding mmu_lock for write
      from having to hold RCU (and any future changes to the rules).
      
      No functional change intended.
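      A minimal sketch of what such helpers boil down to (illustrative; the real
      helpers may differ in detail):

        /* SPTEs in the TDP MMU can change under RCU, so accesses go through
         * rcu_dereference() plus READ_ONCE()/WRITE_ONCE(). */
        static inline u64 tdp_mmu_read_spte(tdp_ptep_t sptep)
        {
                return READ_ONCE(*rcu_dereference(sptep));
        }

        static inline void tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
        {
                WRITE_ONCE(*rcu_dereference(sptep), new_spte);
        }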
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-11-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0e587aa7
    • KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks · a151acec
      Sean Christopherson authored
      Drop RCU protection after processing each root when handling MMU notifier
      hooks that aren't the "unmap" path, i.e. aren't zapping.  Temporarily
      drop RCU to let RCU do its thing between roots, and to make it clear that
      there's no special behavior that relies on holding RCU across all roots.
      
      Currently, the RCU protection is completely superficial; it's necessary
      only to make rcu_dereference() of SPTE pointers happy.  A future patch
      will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
      ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
      rather, acknowledge the need for a flush), but in that case RCU needs to
      be held until the flush is complete if and only if the flush is needed
      because a shadow page may have been removed.  And except for the "unmap"
      path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
      and can't change the PFN for a SP).
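      In sketch form (the handler callback is a hypothetical stand-in for the
      per-root notifier handler):

        static bool handle_hook_per_root(struct kvm *kvm, struct kvm_gfn_range *range,
                                         bool (*handler)(struct kvm *kvm,
                                                         struct kvm_mmu_page *root,
                                                         struct kvm_gfn_range *range))
        {
                struct kvm_mmu_page *root;
                bool ret = false;

                /* Hold RCU only while processing a single root. */
                for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
                        rcu_read_lock();
                        ret |= handler(kvm, root, range);
                        rcu_read_unlock();
                }
                return ret;
        }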
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-10-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a151acec
    • KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte · 93fa50f6
      Sean Christopherson authored
      Batch TLB flushes (with other MMUs) when handling ->change_spte()
      notifications in the TDP MMU.  The MMU notifier path in question doesn't
      allow yielding and correctly flushes before dropping mmu_lock.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-9-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      93fa50f6
    • KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal · c8e5a0d0
      Sean Christopherson authored
      Look for a !leaf=>leaf conversion instead of a PFN change when checking
      if a SPTE change removed a TDP MMU shadow page.  Convert the PFN check
      into a WARN, as KVM should never change the PFN of a shadow page (except
      when it's being zapped or replaced).
      
      From a purely theoretical perspective, it's not illegal to replace a SP
      with a hugepage pointing at the same PFN.  In practice, it's impossible
      as that would require mapping guest memory overtop a kernel-allocated SP.
      Either way, the check is odd.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-8-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c8e5a0d0
    • KVM: x86/mmu: do not allow readers to acquire references to invalid roots · 614f6970
      Paolo Bonzini authored
      Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
      that readers do not ever acquire a reference to an invalid root.  After this
      patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
      refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
      same way.  kvm_tdp_mmu_zap_invalidated_roots() is different but it also
      does not acquire a reference to the invalid root, and it cannot see
      refcount=0/invalid because it is guaranteed to run after
      kvm_tdp_mmu_invalidate_all_roots().
      
      Opportunistically add a lockdep assertion to the yield-safe iterator.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      614f6970
    • KVM: x86/mmu: only perform eager page splitting on valid roots · 7c554d8e
      Paolo Bonzini authored
      Eager page splitting is an optimization; it does not have to be performed on
      invalid roots.  It is also the only case in which a reader might acquire
      a reference to an invalid root, so after this change we know that readers
      will skip both dying and invalid roots.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c554d8e
    • KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter · 226b8c8f
      Sean Christopherson authored
      Assert that mmu_lock is held for write by users of the yield-unfriendly
      TDP iterator.  The nature of a shared walk means that the caller needs to
      play nice with other tasks modifying the page tables, which is more or
      less the same thing as playing nice with yielding.  Theoretically, KVM
      could gain a flow where it could legitimately take mmu_lock for read in
      a non-preemptible context, but that's highly unlikely and any such case
      should be viewed with a fair amount of scrutiny.
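      The assertion itself is tiny; a sketch of where it conceptually lives (the
      placement and helper name are illustrative, not the exact KVM macro):

        /* The yield-unfriendly walk must only run with mmu_lock held for write. */
        static void assert_unyielding_walk(struct kvm *kvm)
        {
                lockdep_assert_held_write(&kvm->mmu_lock);
        }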
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      226b8c8f
    • KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush · 7ae5840e
      Sean Christopherson authored
      Remove the misleading flush "handling" when zapping invalidated TDP MMU
      roots, and document that flushing is unnecessary for all flavors of MMUs
      when zapping invalid/obsolete roots/pages.  The "handling" in the TDP MMU
      is dead code, as zap_gfn_range() is called with shared=true, in which
      case it will never return true due to the flushing being handled by
      tdp_mmu_zap_spte_atomic().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7ae5840e
    • KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic · db01416b
      Sean Christopherson authored
      Explicitly ignore the result of zap_gfn_range() when putting the last
      reference to a TDP MMU root, and add a pile of comments to formalize the
      TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
      only affects the !shared case, as zap_gfn_range() subtly never returns
      true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
      
      Putting the root without a flush is ok because even if there are stale
      references to the root in the TLB, they are unreachable because KVM will
      not run the guest with the same ASID without first flushing (where ASID
      in this context refers to both SVM's explicit ASID and Intel's implicit
      ASID that is constructed from VPID+PCID+EPT4A+etc...).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-5-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      db01416b
    • KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap · f28e9c7f
      Sean Christopherson authored
      Fix misleading and arguably wrong comments in the TDP MMU's fast zap
      flow.  The comments, and the fact that actually zapping invalid roots was
      added separately, strongly suggests that zapping invalid roots is an
      optimization and not required for correctness.  That is a lie.
      
      KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
      because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
      KVM is relying on it to fully remove all references to the memslot.  Once
      the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
      stale references as the hva=>gfn translation is done via the memslots.
      If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
      deleting a memslot, KVM will fail to zap in response to the mmu_notifier
      due to not finding a memslot corresponding to the notifier's range, which
      leads to a variation of use-after-free.
      
      The other misleading comment (and code) explicitly states that roots
      without a reference should be skipped.  While that's technically true,
      it's also extremely misleading as it should be impossible for KVM to
      encounter a defunct root on the list while holding mmu_lock for write.
      Opportunistically add a WARN to enforce that invariant.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f28e9c7f
    • KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU · 3354ef5a
      Sean Christopherson authored
      Explicitly check for present SPTEs when clearing dirty bits in the TDP
      MMU.  This isn't strictly required for correctness, as setting the dirty
      bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
      However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
      complain if anyone actually turned on KVM's MMU debugging.
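      The shape of the fix, sketched (the loop body is heavily simplified and the
      function name is illustrative):

        static void clear_dirty_sketch(struct kvm_mmu_page *root, gfn_t start, gfn_t end)
        {
                struct tdp_iter iter;

                tdp_root_for_each_pte(iter, root, start, end) {
                        /* Skip SPTEs that don't map anything. */
                        if (!is_shadow_present_pte(iter.old_spte))
                                continue;

                        /* ... clear the dirty/writable bit on the present SPTE ... */
                }
        }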
      
      Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-3-seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3354ef5a
    • KVM: use __vcalloc for very large allocations · 37b2a651
      Paolo Bonzini authored
      Allocations whose size is related to the memslot size can be arbitrarily
      large.  Do not use kvzalloc/kvcalloc, as those are limited to "not crazy"
      sizes that fit in 32 bits.
      
      Cc: stable@vger.kernel.org
      Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37b2a651
    • mm: use vmalloc_array and vcalloc for array allocations · 3000f2e2
      Paolo Bonzini authored
      Instead of using array_size or just a multiply, use a function that
      takes care of both the multiplication and the overflow checks.
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3000f2e2
    • mm: vmalloc: introduce array allocation functions · a8749a35
      Paolo Bonzini authored
      Linux has dozens of occurrences of vmalloc(array_size()) and
      vzalloc(array_size()).  Allow simplifying the code by providing
      vmalloc_array and vcalloc, as well as the underscored variants that let
      the caller specify the GFP flags.
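      At its core, such a helper is just an overflow-checked multiply in front of
      __vmalloc(); a simplified sketch (the in-tree implementation may differ):

        #include <linux/overflow.h>
        #include <linux/vmalloc.h>

        void *vmalloc_array_sketch(size_t n, size_t size)
        {
                size_t bytes;

                /* Refuse allocations whose size computation would overflow. */
                if (check_mul_overflow(n, size, &bytes))
                        return NULL;

                return __vmalloc(bytes, GFP_KERNEL);
        }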
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a8749a35
  2. 04 Mar, 2022 1 commit
  3. 02 Mar, 2022 2 commits
    • KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be
      Paolo Bonzini authored
      kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
      places, namely vcpu_run and post_kvm_run_save, and a third is actually
      needed around the call to vcpu->arch.complete_userspace_io to avoid
      the following splat:
      
        WARNING: suspicious RCU usage
        arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by CPU 28/KVM/370841:
        #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x59/0x73
         reprogram_fixed_counter+0x15d/0x1a0 [kvm]
         kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
         ? free_moved_vector+0x1b4/0x1e0
         complete_fast_pio_in+0x8a/0xd0 [kvm]
      
      This splat is not at all unexpected, since complete_userspace_io callbacks
      can execute similar code to vmexits.  For example, SVM with nrips=false
      will call into the emulator from svm_skip_emulated_instruction().
      
      While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
      practically speaking there's no penalty to acquiring kvm->srcu "early"
      as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
      the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
      and sync_regs() can theoretically reach code that might access
      SRCU-protected data structures, e.g. sync_regs() can trigger a forced
      exit from nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
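      A sketch of the resulting locking shape (simplified; vcpu_run_locked() is a
      hypothetical stand-in for the existing run loop, and the real flow is more
      involved):

        int kvm_arch_vcpu_ioctl_run_sketch(struct kvm_vcpu *vcpu)
        {
                int idx, r;

                /* Take kvm->srcu once, around everything the ioctl does... */
                idx = srcu_read_lock(&vcpu->kvm->srcu);

                /* ...including the complete_userspace_io callback. */
                if (vcpu->arch.complete_userspace_io)
                        r = vcpu->arch.complete_userspace_io(vcpu);
                else
                        r = vcpu_run_locked(vcpu);   /* hypothetical */

                srcu_read_unlock(&vcpu->kvm->srcu, idx);
                return r;
        }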
      Reported-by: Like Xu <likexu@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d25b7be
    • KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots() · c6c937d6
      Like Xu authored
      Just like on the optional mmu_alloc_direct_roots() path, once the shadow
      path reaches "r = -EIO" somewhere, the caller needs to know the actual
      state in order to enter error handling and avoid something worse.
      
      Fixes: 4a38162e ("KVM: MMU: load PDPTRs outside mmu_lock")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220301124941.48412-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c6c937d6
  4. 01 Mar, 2022 11 commits
    • KVM: SVM: Disable preemption across AVIC load/put during APICv refresh · b652de1e
      Sean Christopherson authored
      Disable preemption when loading/putting the AVIC during an APICv refresh.
      If the vCPU task is preempted and migrated to a different pCPU, the
      unprotected avic_vcpu_load() could set the wrong pCPU in the physical ID
      cache/table.
      
      Pull the necessary code out of avic_vcpu_{,un}blocking() and into a new
      helper to reduce the probability of introducing this exact bug a third
      time.
      
      Fixes: df7e4827 ("KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC")
      Cc: stable@vger.kernel.org
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b652de1e
    • KVM: SVM: Exit to userspace on ENOMEM/EFAULT GHCB errors · aa9f5841
      Sean Christopherson authored
      Exit to userspace if setup_vmgexit_scratch() fails due to OOM or because
      copying data from guest (userspace) memory failed/faulted.  The OOM
      scenario is clearcut, it's userspace's decision as to whether it should
      terminate the guest, free memory, etc...
      
      As for -EFAULT, arguably, any guest issue is a violation of the guest's
      contract with userspace, and thus userspace needs to decide how to
      proceed.  E.g. userspace defines what is RAM vs. MMIO and communicates
      that directly to the guest, KVM is not involved in deciding what is/isn't
      RAM nor in communicating that information to the guest.  If the scratch
      GPA doesn't resolve to a memslot, then the guest is not honoring the
      memory configuration as defined by userspace.
      
      And if userspace unmaps an hva for whatever reason, then exiting to
      userspace with -EFAULT is absolutely the right thing to do.  KVM's ABI
      currently sucks and doesn't provide enough information to act on the
      -EFAULT, but that will hopefully be remedied in the future as there are
      multiple use cases, e.g. uffd and virtiofs truncation, that shouldn't
      require any work in KVM beyond returning -EFAULT with a small amount of
      metadata.
      
      KVM could define its ABI such that failure to access the scratch area is
      reflected into the guest, i.e. establish a contract with userspace, but
      that's undesirable as it limits KVM's options in the future, e.g. in the
      potential uffd case any failure on a uaccess needs to kick out to
      userspace.  KVM does have several cases where it reflects these errors
      into the guest, e.g. kvm_pv_clock_pairing() and Hyper-V emulation, but
      KVM would preferably "fix" those instead of propagating the falsehood
      that any memory failure is the guest's fault.
      
      Lastly, returning a boolean as an "error" from a helper that isn't
      named accordingly never works out well.
      
      Fixes: ad5b3532 ("KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure")
      Cc: Alper Gun <alpergun@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225205209.3881130-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aa9f5841
    • KVM: WARN if is_unsync_root() is called on a root without a shadow page · 5d6a3221
      Sean Christopherson authored
      WARN and bail if is_unsync_root() is passed a root for which there is no
      shadow page, i.e. is passed the physical address of one of the special
      roots, which do not have an associated shadow page.  The current usage
      squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
      kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
      and 5-level AMD CPUs are not generally available, i.e. no one can coerce
      KVM into calling is_unsync_root() on pml5_root.
      
      Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
      prevents KVM from crashing.
      
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5d6a3221
    • KVM: Drop KVM_REQ_MMU_RELOAD and update vcpu-requests.rst documentation · e65a3b46
      Sean Christopherson authored
      Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
      unoccupied space, and update vcpu-requests.rst, which was missing an
      entry for KVM_REQ_VM_DEAD.  Switching KVM_REQ_VM_DEAD to entry '1' also
      fixes the stale comment about bits 4-7 being reserved.
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e65a3b46
    • KVM: s390: Replace KVM_REQ_MMU_RELOAD usage with arch specific request · cc65c3a1
      Sean Christopherson authored
      Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
      prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD.  This will
      allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
      actually used by generic KVM.
      
      No functional change intended.
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cc65c3a1
    • KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7
      Sean Christopherson authored
      Zap only obsolete roots when responding to zapping a single root shadow
      page.  Because KVM keeps root_count elevated when stuffing a previous
      root into its PGD cache, shadowing a 64-bit guest means that zapping any
      root causes all vCPUs to reload all roots, even if their current root is
      not affected by the zap.
      
      For many kernels, zapping a single root is a frequent operation, e.g. in
      Linux it happens whenever an mm is dropped, e.g. process exits, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      527d5cd7
    • KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users · 2f6f66cc
      Sean Christopherson authored
      Remove the generic kvm_reload_remote_mmus() and open code its
      functionality into the two x86 callers.  x86 is (obviously) the only
      architecture that uses the hook, and is also the only architecture that
      uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name.  That
      will change in a future patch: when zapping a single shadow page, x86
      doesn't actually _need_ to reload all vCPUs' MMUs; only MMUs whose root
      is being zapped need to be reloaded.
      
      s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
      
      Drop the generic code in anticipation of implementing s390 and x86 arch
      specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
      
      Opportunistically reword the x86 TDP MMU comment to avoid making
      references to functions (and requests!) when possible, and to remove the
      rather ambiguous "this".
      
      No functional change intended.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f6f66cc
    • KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change · f6d0a252
      Sean Christopherson authored
      Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
      when the guest's CR4.PCIDE changes.  This will allow tweaking the logic
      of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
      historical intent of KVM_REQ_MMU_RELOAD.  The recent PCIDE behavior is
      the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
      obsolete, needs to unconditionally unload the entire MMU, _and_ affects
      only the current vCPU.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f6d0a252
    • KVM: x86/emulator: Move the unhandled outer privilege level logic of far return into __load_segment_descriptor() · 1e326ad4
      Hou Wenlong authored
      Outer-privilege level return is not implemented in the emulator.  Move
      the unhandled logic into __load_segment_descriptor() to make it easier
      to understand why the checks for RET are incomplete.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <5b7188e6388ac9f4567d14eab32db9adf3e00119.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1e326ad4
    • KVM: x86/emulator: Fix wrong privilege check for code segment in __load_segment_descriptor() · 31c66dab
      Hou Wenlong authored
      A code segment descriptor can be loaded by jmp/call/ret, iret
      and int.  The privilege checks differ among those instructions
      outside real mode.  Although the emulator has the x86_transfer_type
      enumeration to differentiate them, it is not really used in
      __load_segment_descriptor().  Note, far jump/call to a call gate,
      task gate or task state segment is not implemented in the emulator.
      
      As for far jump/call to code segment, if DPL > CPL for conforming
      code or (RPL > CPL or DPL != CPL) for non-conforming code, it
      should trigger #GP. The current checks are ok.
      
      As for far return, if RPL < CPL or DPL > RPL for conforming
      code or DPL != RPL for non-conforming code, it should trigger #GP.
      Outer-privilege level return is not implemented above virtual-8086 mode
      in the emulator, so it implies that RPL <= CPL, but the current checks
      wouldn't trigger #GP if RPL < CPL.
      
      As for code segment loading in task switch, if DPL > RPL for conforming
      code or DPL != RPL for non-conforming code, it should trigger #TS. Since
      segment selector is loaded before segment descriptor when load state from
      tss, it implies that RPL = CPL, so the current checks are ok.
      
      The only problem in the current implementation is the missing RPL < CPL
      check for far return.  However, changing the code to follow the manual
      is better.
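      The far-return check described above, as a sketch (the names are
      illustrative, not the emulator's actual code):

        /* Returns true if loading CS via far RET should be allowed. */
        static bool far_return_cs_ok(u8 cpl, u8 rpl, u8 dpl, bool conforming)
        {
                if (rpl < cpl)
                        return false;           /* should #GP */
                if (conforming)
                        return dpl <= rpl;      /* DPL > RPL  => #GP */
                return dpl == rpl;              /* DPL != RPL => #GP */
        }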
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <e01f5ea70fc1f18f23da1182acdbc5c97c0e5886.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      31c66dab
    • KVM: x86/emulator: Defer not-present segment check in __load_segment_descriptor() · ca85f002
      Hou Wenlong authored
      Per Intel's SDM on the "Instruction Set Reference", when loading a
      segment descriptor, the not-present segment check should come after all
      type and privilege checks.  But the emulator checks it first, so #NP is
      triggered instead of #GP if a privilege check fails and the segment is
      not present.  Put the not-present segment check after the type and
      privilege checks in __load_segment_descriptor().
      
      Fixes: 38ba30ba (KVM: x86 emulator: Emulate task switch in emulator.c)
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <52573c01d369f506cadcf7233812427cf7db81a7.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ca85f002