1. 06 Aug, 2021 3 commits
    • KVM: Cache the last used slot index per vCPU · fe22ed82
      David Matlack authored
      The memslot for a given gfn is looked up multiple times during page
      fault handling. Avoid binary searching for it multiple times by caching
      the most recently used slot. There is an existing VM-wide last_used_slot
      but that does not work well for cases where vCPUs are accessing memory
      in different slots (see performance data below).
      
      Another benefit of caching the most recently used slot (versus looking
      up the slot once and passing around a pointer) is speeding up memslot
      lookups *across* faults and during spte prefetching.
      
      To measure the performance of this change I ran dirty_log_perf_test with
      64 vCPUs and 64 memslots and measured "Populate memory time" and
      "Iteration 2 dirty memory time".  Tests were ran with eptad=N to force
      dirty logging to use fast_page_fault so its performance could be
      measured.
      
      Config     | Metric                        | Before | After
      ---------- | ----------------------------- | ------ | ------
      tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
      tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
      tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
      tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s
      
      The "Iteration 2 dirty memory time" results are especially compelling
      because they are equivalent to running the same test with a single
      memslot. In other words, fast_page_fault performance no longer scales
      with the number of memslots.
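      
      As a rough illustration of the idea (the function below exists in KVM,
      but the body is a simplified sketch: bounds/validity checks are elided
      and the fast-path test is written inline rather than through the real
      helpers):
      
        struct kvm_memory_slot *
        kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
        {
                struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
                struct kvm_memory_slot *slot;

                /* Fast path: the slot this vCPU hit last still covers gfn. */
                slot = &slots->memslots[vcpu->last_used_slot];
                if (gfn >= slot->base_gfn &&
                    gfn < slot->base_gfn + slot->npages)
                        return slot;

                /* Slow path: binary search, then cache the hit per vCPU. */
                slot = search_memslots(slots, gfn);
                if (slot)
                        vcpu->last_used_slot = slot - slots->memslots;
                return slot;
        }
      
      Because the cache is per vCPU, vCPUs faulting on different memslots no
      longer evict each other's cached index, which is where the multi-memslot
      gains above come from.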
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move last_used_slot logic out of search_memslots · 0f22af94
      David Matlack authored
      Make search_memslots unconditionally search all memslots and move the
      last_used_slot logic up one level to __gfn_to_memslot. This is in
      preparation for introducing a per-vCPU last_used_slot.
      
      As part of this change convert existing callers of search_memslots to
      __gfn_to_memslot to avoid making any functional changes.
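      
      The intent, sketched with illustrative prototypes (the exact upstream
      signatures may differ):
      
        /* Pure binary search over all memslots; no caching side effects. */
        static inline struct kvm_memory_slot *
        search_memslots(struct kvm_memslots *slots, gfn_t gfn);

        /* Checks slots->last_used_slot first, falls back to search_memslots()
         * and refreshes last_used_slot on a hit. */
        static inline struct kvm_memory_slot *
        __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn);

        /* Existing search_memslots() callers become __gfn_to_memslot()
         * callers, so behaviour is unchanged. */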
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Rename lru_slot to last_used_slot · 87689270
      David Matlack authored
      lru_slot is used to keep track of the index of the most-recently used
      memslot. The correct acronym would be "mru" but that is not a common
      acronym. So call it last_used_slot which is a bit more obvious.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 05 Aug, 2021 1 commit
    • KVM: xen: do not use struct gfn_to_hva_cache · 319afe68
      Paolo Bonzini authored
      gfn_to_hva_cache is not thread-safe, so it is usually used only within
      a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
      implementation has such a cache in kvm->arch, but it is not really
      used except to store the location of the shared info page.  Replace
      shinfo_set and shinfo_cache with just the value that is passed via
      KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
      initialization value is not zero anymore and therefore kvm_xen_init_vm
      needs to be introduced.
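      
      A minimal sketch of the resulting init hook, assuming the cached
      location is kept as a gfn and an invalid-GPA sentinel marks "no shared
      info page yet" (both the field name and the sentinel are assumptions
      based on the description, not quoted from the patch):
      
        void kvm_xen_init_vm(struct kvm *kvm)
        {
                /*
                 * The stored value no longer starts out as zero, so the
                 * zero-initialized kvm->arch is not enough; hence this hook.
                 * shinfo_gfn and GPA_INVALID are illustrative names.
                 */
                kvm->arch.xen.shinfo_gfn = GPA_INVALID;
        }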
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 04 Aug, 2021 4 commits
    • KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces · e79f49c3
      Like Xu authored
      Based on our observations, after any VM-exit associated with the vPMU,
      at least two perf interfaces have to be called for guest counter
      emulation, such as perf_event_{pause, read_value, period}(), and each
      one will {lock, unlock} the same perf_event_ctx. The call overhead gets
      even worse when the guest uses counters in a multiplexed manner.
      
      Holding a lock once and completing the KVM request operations in the perf
      context would introduce a set of impractical new interfaces. So we can
      further optimize the vPMU implementation by avoiding repeated calls to
      these interfaces in the KVM context for at least one pattern:
      
      After we call perf_event_pause() once, the event will be disabled and its
      internal count will be reset to 0. So there is no need to pause it again
      or read its value. Once the event is paused, event period will not be
      updated until the next time it's resumed or reprogrammed. And there is
      also no need to call perf_event_period twice for a non-running counter,
      considering the perf_event for a running counter is never paused.
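      
      A sketch of the pause-once pattern (perf_event_pause() and
      perf_event_read_value() are the real perf interfaces; the KVM-side
      helpers and is_paused handling are simplified from the description
      above, not quoted from the patch):
      
        static void pmc_pause_counter(struct kvm_pmc *pmc)
        {
                u64 counter = pmc->counter;

                /* Pause (and reset) the event only if it isn't paused yet. */
                if (!pmc->perf_event || pmc->is_paused)
                        return;

                counter += perf_event_pause(pmc->perf_event, true);
                pmc->counter = counter & pmc_bitmask(pmc);
                pmc->is_paused = true;
        }

        static u64 pmc_read_counter(struct kvm_pmc *pmc)
        {
                u64 counter, enabled, running;

                counter = pmc->counter;
                /* A paused event's count is already folded into pmc->counter,
                 * so skip the (lock-taking) perf_event_read_value() call. */
                if (pmc->perf_event && !pmc->is_paused)
                        counter += perf_event_read_value(pmc->perf_event,
                                                         &enabled, &running);
                return counter & pmc_bitmask(pmc);
        }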
      
      Based on this implementation, for the following common usage of
      sampling 4 events using perf on a 4u8g guest:
      
        echo 0 > /proc/sys/kernel/watchdog
        echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
        echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
        echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
        for i in `seq 1 1 10`
        do
        taskset -c 0 perf record \
        -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
        /root/br_instr a
        done
      
      the average latency of the guest NMI handler is reduced from
      37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
      Also, in addition to collecting more samples, no loss of sampling
      accuracy was observed compared to before the optimization.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20210728120705.6855-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
    • KVM: X86: Optimize zapping rmap · a75b5404
      Peter Xu authored
      Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could
      be slow.  The easier way is to traverse the rmap list, collecting the a/d
      bits and freeing the slots along the way.
      
      Provide a pte_list_destroy() and do exactly that.
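      
      A hedged sketch of such a helper (the single-spte-inline encoding of
      rmap_head->val is glossed over, spte_count is the per-array counter from
      the patch below, and the MMU helpers are named as I understand them
      rather than quoted from the patch):
      
        static void pte_list_destroy(struct kvm_rmap_head *rmap_head)
        {
                struct pte_list_desc *desc, *next;
                int i;

                if (!rmap_head->val)
                        return;

                /* (The "single spte stored inline" case is elided here.) */

                desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
                while (desc) {
                        /* Collect A/D bits and drop each spte in one pass... */
                        for (i = 0; i < desc->spte_count; i++)
                                mmu_spte_clear_track_bits(desc->sptes[i]);
                        next = desc->more;
                        /* ...then free the whole descriptor, instead of
                         * repeated rmap_get_first()/rmap_remove() rounds. */
                        mmu_free_pte_list_desc(desc);
                        desc = next;
                }
                rmap_head->val = 0;
        }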
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220605.26377-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Optimize pte_list_desc with per-array counter · 13236e25
      Peter Xu authored
      Add a counter field to pte_list_desc to simplify the add/remove/loop
      logic.  E.g., we don't need to loop over the array any more in most cases.
      
      This will make more sense after we've switched the array size to be
      larger; otherwise the counter would be a waste.
      
      Initially I wanted to store a tail pointer at the head of the array list
      so we wouldn't need to traverse the list, at least when pushing new
      entries (without the counter we traverse both the list and the array).
      However, that would need slightly more change without a whole lot of
      benefit; e.g., after we grow the number of entries per array, the list
      traversal is not so expensive.
      
      So let's keep it simple but still try to get as much benefit as we can
      with just these few extra lines of change (not to mention the code reads
      more easily without looping over arrays).
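      
      A sketch of the shape of the change (the field layout is illustrative,
      and pte_list_grow() is a hypothetical helper standing in for the real
      chaining logic):
      
        struct pte_list_desc {
                struct pte_list_desc *more;
                /* Number of valid entries in sptes[]; lets the add/remove/loop
                 * paths index directly instead of scanning for holes. */
                u32 spte_count;
                u64 *sptes[PTE_LIST_EXT];
        };

        /* Adding an spte no longer loops over the array: */
        if (desc->spte_count == PTE_LIST_EXT)
                desc = pte_list_grow(rmap_head, desc);  /* hypothetical helper */
        desc->sptes[desc->spte_count++] = spte;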
      
      Using the same test case that forks 500 children and recycles them
      ("./rmap_fork 500" [1]), this patch further speeds up the total fork time
      by about 4%, for a total of about 33% over the vanilla kernel:
      
              Vanilla:      473.90 (+-5.93%)
              3->15 slots:  366.10 (+-4.94%)
              Add counter:  351.00 (+-3.70%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220602.26327-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu authored
      Currently an rmap array element only contains 3 entries.  However, with
      EPT=N there can be a lot of guest pages that have tens or even hundreds
      of rmap entries.
      
      A typical distribution for a 6G guest (even when idle) shows this in the
      rmap count statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      With some more forking, some pages will grow even larger rmap counts.
      
      This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the
      common EPT=N case: we dereference the list less often and the loops over
      PTE_LIST_EXT become slightly more efficient, while still keeping it small
      enough that little is wasted when an array is not full.
      
      It should not affect EPT=Y, since EPT normally has only zero or one rmap
      entry per page, so no array is even allocated.
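      
      The tuning itself boils down to a one-line change; the new value below is
      taken from the "3->15 slots" row of the benchmark quoted in the per-array
      counter patch above:
      
        -#define PTE_LIST_EXT 3
        +#define PTE_LIST_EXT 15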
      
      With a test case that forks 500 children and recycles them ("./rmap_fork
      500" [1]), this patch speeds up the fork time by about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 03 Aug, 2021 3 commits
    • KVM: const-ify all relevant uses of struct kvm_memory_slot · 269e9552
      Hamza Mahfooz authored
      As alluded to in commit f36f3f28 ("KVM: add "new" argument to
      kvm_arch_commit_memory_region"), a bunch of other places where struct
      kvm_memory_slot is used need to be refactored to preserve the "const"ness
      of struct kvm_memory_slot across the board.
      Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
      Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
      [Do not touch body of slot_rmap_walk_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't take mmu_lock for range invalidation unless necessary · 071064f1
      Paolo Bonzini authored
      Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
      that are unrelated to KVM.  This is possible now that memslot updates are
      blocked from range_start() to range_end(); that ensures that lock elision
      happens in both or none, and therefore that mmu_notifier_count updates
      (which must occur while holding mmu_lock for write) are always paired
      across start->end.
      
      Based on patches originally written by Ben Gardon.
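      
      A simplified sketch of the elision (the real code threads this through
      KVM's common hva-range handler; kvm_hva_range_overlaps_memslots() is a
      hypothetical stand-in for that check, the lock macro naming is as I
      recall it, and the zap itself is elided):
      
        static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
        {
                struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

                /*
                 * If the range touches no memslot, skip mmu_lock entirely and
                 * leave mmu_notifier_count alone.  Memslot updates are blocked
                 * from range_start() to range_end(), so range_end() reaches
                 * the same conclusion and the count stays balanced.
                 */
                if (!kvm_hva_range_overlaps_memslots(kvm, range->start,
                                                     range->end))
                        return 0;

                KVM_MMU_LOCK(kvm);
                kvm->mmu_notifier_count++;
                /* ... zap the affected sptes as before ... */
                KVM_MMU_UNLOCK(kvm);
                return 0;
        }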
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Block memslot updates across range_start() and range_end() · 52ac8b35
      Paolo Bonzini authored
      We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
      notifications that are unrelated to KVM.  Because mmu_notifier_count
      must be modified while holding mmu_lock for write, and must always
      be paired across start->end to stay balanced, lock elision must
      happen in both or none.  Therefore, in preparation for this change,
      this patch prevents memslot updates across range_start() and range_end().
      
      Note, technically flag-only memslot updates could be allowed in parallel,
      but stalling a memslot update for a relatively short amount of time is
      not a scalability issue, and this is all more than complex enough.
      
      A long note on the locking: a previous version of the patch used an rwsem
      to block the memslot update while MMU notifiers run, but this resulted
      in the following deadlock involving the pseudo-lock tagged as
      "mmu_notifier_invalidate_range_start".
      
         ======================================================
         WARNING: possible circular locking dependency detected
         5.12.0-rc3+ #6 Tainted: G           OE
         ------------------------------------------------------
         qemu-system-x86/3069 is trying to acquire lock:
         ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
      
         but task is already holding lock:
         ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
      
         which lock already depends on the new lock.
      
      This corresponds to the following MMU notifier logic:
      
          invalidate_range_start
            take pseudo lock
            down_read()           (*)
            release pseudo lock
          invalidate_range_end
            take pseudo lock      (**)
            up_read()
            release pseudo lock
      
      At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
      at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
      
      This could cause a deadlock (ignoring for a second that the pseudo lock
      is not a lock):
      
      - invalidate_range_start waits on down_read(), because the rwsem is
      held by install_new_memslots
      
      - install_new_memslots waits on down_write(), because the rwsem is
      held till (another) invalidate_range_end finishes
      
      - invalidate_range_end waits on the pseudo lock, held by
      invalidate_range_start.
      
      Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
      it would change the *shared* rwsem readers into *shared recursive*
      readers), so open-code the wait using a readers count and a
      spinlock.  This also allows handling blockable and non-blockable
      critical sections in the same way.
      
      Losing the rwsem fairness does theoretically allow MMU notifiers to
      block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
      retry scheme in mmu_interval_read_begin also uses wait/wake_up
      and is likewise not fair.
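      
      A hedged sketch of the open-coded wait (the field names and the wait
      queue are illustrative; upstream may use a different wake-up primitive,
      but the shape is a readers count protected by a spinlock):
      
        /* range_start(): register an in-flight invalidation. */
        spin_lock(&kvm->mn_invalidate_lock);
        kvm->mn_active_invalidate_count++;
        spin_unlock(&kvm->mn_invalidate_lock);

        /* range_end(): deregister and wake a blocked memslot update. */
        spin_lock(&kvm->mn_invalidate_lock);
        if (!--kvm->mn_active_invalidate_count)
                wake_up(&kvm->mn_memslots_update_wq);
        spin_unlock(&kvm->mn_invalidate_lock);

        /*
         * install_new_memslots(): do not publish the new memslots while an
         * invalidation is in flight, otherwise range_start() and range_end()
         * could see different memslots and mmu_notifier_count would become
         * unbalanced.  Unlike with a fair rwsem, a new range_start() never
         * queues behind this waiter, which is what breaks the cycle above.
         */
        spin_lock(&kvm->mn_invalidate_lock);
        while (kvm->mn_active_invalidate_count) {
                spin_unlock(&kvm->mn_invalidate_lock);
                wait_event(kvm->mn_memslots_update_wq,
                           !READ_ONCE(kvm->mn_active_invalidate_count));
                spin_lock(&kvm->mn_invalidate_lock);
        }
        rcu_assign_pointer(kvm->memslots[as_id], slots);  /* publish under the lock */
        spin_unlock(&kvm->mn_invalidate_lock);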
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 02 Aug, 2021 29 commits