1. 20 Feb, 2019 40 commits
    • Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints" · b59c4830
      Sean Christopherson authored
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of a revert of all patches from the series that introduced the
      mechanism[1].
      
      This reverts commit 2248b023.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages" · 42560fb1
      Sean Christopherson authored
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of a revert of all patches from the series that introduced the
      mechanism[1].
      
      This reverts commit 35006126.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: zap pages in batch" · 43d2b14b
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit e7d11c7a.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: collapse TLB flushes when zap all pages" · 210f4942
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit f34d251d.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: reclaim the zapped-obsolete page first" · 52d5dedc
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit 365c8868.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove is_obsolete() call · 5ff05683
      Sean Christopherson authored
      Unwinding usage of is_obsolete() is a step towards removing x86's fast
      invalidate mechanism, i.e. this is one part of a revert of all patches from
      the series that introduced the mechanism[1].
      
      This is a partial revert of commit 05988d72 ("KVM: MMU: reduce
      KVM_REQ_MMU_RELOAD when root page is zapped").
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes · 571c5af0
      Sean Christopherson authored
      Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
      release and reacquire mmu_lock in case of contention.  There is no need
      to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
      is done when holding the memslots lock and with the "update in-progress"
      bit set in the memslots generation, which disables MMIO spte caching.
      The walk does need to be restarted if mmu_lock is dropped as the active
      pages list may be modified.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes" · 4771450c
      Sean Christopherson authored
      Revert back to a dedicated (and slower) mechanism for handling the
      scenario where all MMIO shadow PTEs need to be zapped due to overflowing
      the MMIO generation number.  The MMIO generation scenario is almost
      literally a one-in-a-million occurrence, i.e. is not a performance
      sensitive scenario.
      
      Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
      of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
      the fast invalidate mechanism altogether.
      
      This reverts commit a8eca9dc.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: document fast invalidate all pages" · a592a3b8
      Sean Christopherson authored
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1].
      
      Though not explicitly stated, for all intents and purposes the fast
      invalidate mechanism was added to speed up the scenario where removing
      a memslot, e.g. as part of reading PCI ROM, caused KVM to
      flush all shadow entries[1].  Now that the memslot case flushes only
      shadow entries belonging to the memslot, i.e. doesn't use the fast
      invalidate mechanism, the only remaining usages of the mechanism are
      when the VM is being destroyed and when the MMIO generation rolls
      over.
      
      When a VM is being destroyed, either there are no active vcpus, i.e.
      there's no lock contention, or the VM has ungracefully terminated, in
      which case we want to reclaim its pages as quickly as possible, i.e.
      not release the MMU lock if there are still CPUs executing in the VM.
      
      The MMIO generation scenario is almost literally a one-in-a-million
      occurrence, i.e. is not a performance sensitive scenario.
      
      Given that lock-breaking is not desirable (VM teardown) or irrelevant
      (MMIO generation overflow), remove the fast invalidate mechanism to
      simplify the code (a small amount) and to discourage future code from
      zapping all pages as using such a big hammer should be a last resort.
      
      This reverts commit f6f8adee.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only the relevant pages when removing a memslot · 4e103134
      Sean Christopherson authored
      Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
      handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
      that actually belong to the memslot being removed.  This improves
      performance, especially when the deleted memslot has only a few shadow
      entries, or even no entries.  E.g. a microbenchmark to access regular
      memory while concurrently reading PCI ROM to trigger memslot deletion
      showed a 5% improvement in throughput.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap() · a2113634
      Sean Christopherson authored
      ...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
      not require a vcpu so that the code can be (re)used by
      kvm_mmu_invalidate_zap_pages_in_memslot().
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move slot_level_*() helper functions up a few lines · 85875a13
      Sean Christopherson authored
      ...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
      helpers in future patches.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson authored
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
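
A minimal sketch of what keeping the flag in bit 63 allows. The names `SKETCH_GEN_UPDATE_IN_PROGRESS`, `sketch_gen_begin_update()`, `sketch_gen_end_update()` and `sketch_gen_update_in_progress()` are hypothetical and are not taken from the kernel sources; the point is only that the counter bits can be advanced by a plain increment once the flag no longer lives in bit 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: the "update in-progress" flag held in bit 63. */
#define SKETCH_GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* Mark an update as in flight without disturbing the counter bits. */
static inline uint64_t sketch_gen_begin_update(uint64_t gen)
{
	return gen | SKETCH_GEN_UPDATE_IN_PROGRESS;
}

/* Clear the flag and advance the counter by exactly one. */
static inline uint64_t sketch_gen_end_update(uint64_t gen)
{
	return (gen & ~SKETCH_GEN_UPDATE_IN_PROGRESS) + 1;
}

/* Test the flag with a single mask; no parity tricks are needed. */
static inline bool sketch_gen_update_in_progress(uint64_t gen)
{
	return (gen & SKETCH_GEN_UPDATE_IN_PROGRESS) != 0;
}
```

Under these assumptions, beginning and then ending an update on generation 5 yields generation 6 with the flag clear.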
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e
      Sean Christopherson authored
      x86 captures a subset of the memslot generation (19 bits) in its MMIO
      sptes so that it can expedite emulated MMIO handling by checking only
      the relevant spte, i.e. doesn't need to do a full page fault walk.
      
      Because the MMIO sptes capture only 19 bits (due to limited space in
      the sptes), there is a non-zero probability that the MMIO generation
      could wrap, e.g. after 500k memslot updates.  Since normal usage is
      extremely unlikely to result in 500k memslot updates, a hack was added
      by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
      wrap-around value") to offset the MMIO generation in order to trigger
      a wraparound, e.g. after 150 memslot updates.
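
The "500k" figure follows from the counter width: 2^19 = 524288. The sketch below spells out that arithmetic, treating each memslot update as one step of a 19-bit counter, which is a simplification. The names `SKETCH_MMIO_GEN_BITS`, `SKETCH_MMIO_GEN_SIZE` and `sketch_updates_until_wrap()` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical names; 19 is the bit count quoted in the text above. */
#define SKETCH_MMIO_GEN_BITS 19
#define SKETCH_MMIO_GEN_SIZE (UINT64_C(1) << SKETCH_MMIO_GEN_BITS)

/* Steps remaining before a 19-bit counter currently at `gen` wraps to 0. */
static inline uint64_t sketch_updates_until_wrap(uint64_t gen)
{
	return SKETCH_MMIO_GEN_SIZE - (gen % SKETCH_MMIO_GEN_SIZE);
}
```

Starting the counter 150 steps short of 2^19 reproduces the "wrap after 150 updates" behavior that the offset was meant to provoke.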
      
      When separate memslot generation sequences were assigned to each
      address space, commit 00f034a1 ("KVM: do not bias the generation
      number in kvm_current_mmio_generation") moved the offset logic into the
      initialization of the memslot generation itself so that the per-address
      space bit(s) were not dropped/corrupted by the MMIO shenanigans.
      
      Remove the offset hack for three reasons:
      
        - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
          wrapping the generation doesn't actually test the interesting case
          of having stale MMIO sptes with the new generation number, e.g. old
          sptes with a generation number of 0.
      
        - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
          performance rather important since the probability of invalidating
          MMIO sptes jumps from "effectively never" to "fairly likely".  This
          limits what can be done in future patches, e.g. to simplify the
          invalidation code, as doing so without proper caution could lead to
          a noticeable performance regression.
      
        - Forcing the memslots generation, which is a 64-bit number, to wrap
          prevents KVM from assuming the memslots generation will never wrap.
          This in turn prevents KVM from using an arbitrary bit for the
          "update in-progress" flag, e.g. using bit 63 would immediately
          collide with using a large value as the starting generation number.
          The "update in-progress" flag is effectively forced into bit 0 so
          that it's (subtly) taken into account when incrementing the
          generation.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Refactor the MMIO SPTE generation handling · cae7ed3c
      Sean Christopherson authored
      The code to propagate the memslots generation number into MMIO sptes is
      a bit convoluted.  The "what" is relatively straightforward, e.g. the
      comment explaining which bits go where is quite readable, but the "how"
      requires a lot of staring to understand what is happening.  For example,
      'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
      spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.
      
      Refactor the code to:
      
        - use #defines whose values align with the bits defined in the comment
        - use consistent code for both the high and low mask
        - explicitly highlight the handling of bit 0 (update in-progress flag)
        - explicitly call out that the defines are for MMIO sptes (to avoid
          confusion with the per-vCPU MMIO cache, which uses the full memslots
          generation)
      
      In addition to making the code a little less magical, this paves the way
      for moving the update in-progress flag to bit 63 without having to
      simultaneously rewrite all of the MMIO spte code.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use a u64 when passing the MMIO gen around · 5192f9b9
      Sean Christopherson authored
      KVM currently uses an 'unsigned int' for the MMIO generation number
      despite it being derived from the 64-bit memslots generation and
      being propagated to (potentially) 64-bit sptes.  There is no hidden
      agenda behind using an 'unsigned int', it's done simply because the
      MMIO generation will never set bits above bit 19.
      
      Passing a u64 will allow the "update in-progress" flag to be relocated
      from bit 0 to bit 63 and removes the need to cast the generation back
      to a u64 when propagating it to a spte.
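
To illustrate why the narrower type blocks the relocation, the sketch below passes a generation through an `unsigned int` and through a 64-bit type. It assumes `unsigned int` is 32 bits wide, and the names `SKETCH_FLAG_BIT63`, `sketch_pass_as_uint()` and `sketch_pass_as_u64()` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical name for a flag stored in bit 63 of the generation. */
#define SKETCH_FLAG_BIT63 (UINT64_C(1) << 63)

/* Round-trip through 'unsigned int': bits above bit 31 are discarded
 * (assuming a 32-bit unsigned int), so a bit-63 flag cannot survive. */
static inline uint64_t sketch_pass_as_uint(uint64_t gen)
{
	unsigned int narrow = (unsigned int)gen;

	return narrow;
}

/* Round-trip through a 64-bit type: every bit is preserved. */
static inline uint64_t sketch_pass_as_u64(uint64_t gen)
{
	return gen;
}
```

With a generation of 7 and the flag set, the first helper returns plain 7, while the second returns the value unchanged.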
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson authored
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. incrementing the
      generation increments twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
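
The even/odd scheme described above can be sketched as follows for a single generation sequence. The helper names are hypothetical and the code is an illustration of the description, not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* With the flag in bit 0, starting an update is an increment: even -> odd. */
static inline uint64_t sketch_bit0_begin_update(uint64_t gen)
{
	return gen + 1;
}

/* Finishing the update is a second increment: odd -> even. */
static inline uint64_t sketch_bit0_end_update(uint64_t gen)
{
	return gen + 1;
}

/* An odd generation is a "bogus" one, i.e. an update is in flight. */
static inline bool sketch_bit0_update_in_progress(uint64_t gen)
{
	return (gen & 1) != 0;
}
```

One complete update therefore moves an even generation such as 4 to 6, passing through the odd value 5 while the memslots are in flux.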
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of an awkward extension to the generation
      number.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux · ddfd1730
      Sean Christopherson authored
      When installing new memslots, KVM sets bit 0 of the generation number to
      indicate that an update is in-progress.  Until the update is complete,
      there are no guarantees as to whether a vCPU will see the old or the new
      memslots.  Explicitly prevent caching MMIO accesses so as to avoid using
      an access cached from the old memslots after the new memslots have been
      installed.
      
      Note that it is unclear whether or not disabling caching during the
      update window is strictly necessary as there is no definitive
      documentation as to what ordering guarantees KVM provides with respect
      to updating memslots.  That being said, the MMIO spte code does not
      allow reusing sptes created while an update is in-progress, and the
      associated documentation explicitly states:
      
          We do not want to use an MMIO sptes created with an odd generation
          number, ...  If KVM is unlucky and creates an MMIO spte while the
          low bit is 1, the next access to the spte will always be a cache miss.
      
      At the very least, disabling the per-vCPU MMIO cache during updates will
      make its behavior consistent with the MMIO spte behavior and
      documentation.
      
      Fixes: 56f17dd3 ("kvm: x86: fix stale mmio cache bug")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Detect MMIO generation wrap in any address space · e1359e2b
      Sean Christopherson authored
      The check to detect a wrap of the MMIO generation explicitly looks for a
      generation number of zero.  Now that unique memslots generation numbers
      are assigned to each address space, only address space 0 will get a
      generation number of exactly zero when wrapping.  E.g. when address
      space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
      wrap to 0x2.  Adjust the MMIO generation to strip the address space
      modifier prior to checking for a wrap.
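
The difference between the two checks can be sketched as below. The masks are assumptions chosen only to match the 0x7fffe to 0x80002 example in the text (a 19-bit MMIO generation and an address space modifier confined to the low two bits); they are not claimed to be the kernel's actual definitions, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 19-bit MMIO generation, modifier in bits 1:0. */
#define SKETCH_MMIO_GEN_MASK    ((UINT64_C(1) << 19) - 1)
#define SKETCH_AS_MODIFIER_MASK UINT64_C(0x3)

/* Truncate the memslots generation to the bits kept for MMIO. */
static inline uint64_t sketch_mmio_gen(uint64_t memslots_gen)
{
	return memslots_gen & SKETCH_MMIO_GEN_MASK;
}

/* Original style of check: only an exact zero counts as a wrap. */
static inline bool sketch_wrapped_naive(uint64_t memslots_gen)
{
	return sketch_mmio_gen(memslots_gen) == 0;
}

/* Adjusted check: strip the address space modifier before comparing. */
static inline bool sketch_wrapped_stripped(uint64_t memslots_gen)
{
	return (sketch_mmio_gen(memslots_gen) & ~SKETCH_AS_MODIFIER_MASK) == 0;
}
```

Under these assumptions, 0x80002 truncates to 0x2, so the exact-zero check misses the wrap while the stripped check catches it.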
      
      Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson authored
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Add memcg accounting to KVM allocations · 41836839
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: svm: Add memcg accounting to KVM allocations · 1ec69647
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add memcg accounting to KVM allocations · 254272ce
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In x86, they include:
      	vcpu->arch.pio_data
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Add memcg accounting to KVM allocations · b12ce36a
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. They include:
              vcpu->run
              kvm->coalesced_mmio_ring
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: do not start the preemption timer hrtimer unnecessarily · 359a6c3d
      Paolo Bonzini authored
      The preemption timer can be started even if there is a vmentry
      failure during or after loading guest state.  That is pointless,
      move the call after all conditions have been checked.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Fix typos in vmentry/vmexit control setting · d9293597
      Yu Zhang authored
      Previously, commit f99e3daf ("KVM: x86: Add Intel PT virtualization
      work mode") offered a framework to support Intel PT virtualization.
      However, the patch has some typos in vmx_vmentry_ctrl() and
      vmx_vmexit_ctrl(), e.g. it used the wrong flags and the wrong
      variable, which will cause VM entry to fail later.
      
      Fixes: f99e3daf ("KVM: x86: Add Intel PT virtualization work mode")
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: cleanup freeing of nested state · b4b65b56
      Paolo Bonzini authored
      Ensure that the VCPU free path goes through vmx_leave_nested and
      thus nested_vmx_vmexit, so that the cancellation of the timer does
      not have to be in free_nested.  In addition, because some paths through
      nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
      the timer is moved there.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Sync the pending Posted-Interrupts · 81b01667
      Luwei Kang authored
      Some Posted-Interrupts from passthrough devices may be lost or
      overwritten when the vCPU is in the runnable state.
      
      The SN (Suppress Notification) bit of the PID (Posted Interrupt
      Descriptor) will be set when the vCPU is preempted (vCPU in
      KVM_MP_STATE_RUNNABLE state but not running on a physical CPU). If a
      posted interrupt arrives at this time, the IRQ remapping facility will
      set the corresponding bit in PIR (Posted Interrupt Requests) without
      setting ON (Outstanding Notification). So this interrupt can't be
      synced to the APIC virtualization registers and will not be handled by
      the guest because ON is zero.
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      [Eliminate the pi_clear_sn fast path. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: expose MOVDIR64B CPU feature into VM. · c029b5de
      Liu Jingqi authored
      MOVDIR64B moves 64 bytes as a direct store with 64-byte write atomicity.
      Direct store is implemented by using write combining (WC) for writing
      data directly into memory without caching the data.
      
      Availability of the MOVDIR64B instruction is indicated by the presence
      of the CPUID feature flag MOVDIR64B (CPUID.0x07.0x0:ECX[bit 28]).
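
As a rough illustration of the enumeration described above, the helper below tests a single bit of the ECX value returned by CPUID leaf 0x07, subleaf 0. The names `SKETCH_CPUID7_ECX_MOVDIR64B_BIT` and `sketch_cpuid7_ecx_has()` are hypothetical; only the bit position comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position quoted in the text: CPUID.0x07.0x0:ECX[bit 28]. */
#define SKETCH_CPUID7_ECX_MOVDIR64B_BIT 28

/* 'ecx' is the value returned in ECX by CPUID leaf 0x07, subleaf 0. */
static inline bool sketch_cpuid7_ecx_has(uint32_t ecx, unsigned int bit)
{
	return ((ecx >> bit) & 1u) != 0;
}
```

On GCC or Clang the ECX value could be obtained with `__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`; that call is left out of the sketch so the helper stays portable.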
      
      This patch exposes the movdir64b feature to the guest.
      
      The release document is available at the link below:
      https://software.intel.com/sites/default/files/managed/c5/15/\
      architecture-instruction-set-extensions-programming-reference.pdf
      Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
      Cc: Xu Tao <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: expose MOVDIRI CPU feature into VM. · 74f2370b
      Liu Jingqi authored
      MOVDIRI moves a doubleword or quadword from a register to memory
      through a direct store, which is implemented by using write combining
      (WC) for writing data directly into memory without caching the data.
      
      Availability of the MOVDIRI instruction is indicated by the presence of
      the CPUID feature flag MOVDIRI (CPUID.0x07.0x0:ECX[bit 27]).
      
      This patch exposes the movdiri feature to the guest.
      
      The release document is available at the link below:
      https://software.intel.com/sites/default/files/managed/c5/15/\
      architecture-instruction-set-extensions-programming-reference.pdf
      Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
      Cc: Xu Tao <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm, x86, mmu: Use kernel generic dynamic physical address mask · 8acc0993
      Kai Huang authored
      AMD's SME/SEV is no longer the only case which reduces supported
      physical address bits, since Intel introduced Multi-key Total Memory
      Encryption (MKTME), which repurposes high bits of the physical address
      as the keyID, thus effectively shrinking the supported physical address
      bits. To cover both cases (and potential similar future features), the
      kernel MM introduced a generic dynamic physical address mask instead of
      the hard-coded __PHYSICAL_MASK in commit 94d49eb3 ("x86/mm: Decouple
      dynamic __PHYSICAL_MASK from AMD SME"). KVM should use that too.
      
      Change PT64_BASE_ADDR_MASK to use kernel dynamic physical address mask
      when it is enabled, instead of sme_clr. PT64_DIR_BASE_ADDR_MASK is also
      deleted since it is not used at all.
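
      To make the effect of a shrinking physical address width concrete,
      here is a minimal sketch that derives a page-frame address mask from
      the number of usable physical address bits. The helper name, the
      4 KiB page size and the example bit counts are assumptions for
      illustration, not the kernel's definitions.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed 4 KiB base page */

/*
 * Hypothetical helper: build a page-frame address mask from the number of
 * physical address bits that remain usable after any keyID bits are
 * carved out. Valid for 12 <= phys_bits < 64.
 */
static uint64_t base_addr_mask(unsigned int phys_bits)
{
    uint64_t physical_mask = (1ULL << phys_bits) - 1;

    return physical_mask & ~(SKETCH_PAGE_SIZE - 1);
}
```

      With 52 usable bits the mask covers bits 12..51; if six high bits
      were repurposed as a keyID, base_addr_mask(46) would cover only bits
      12..45.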
      Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8acc0993
    • Paolo Bonzini's avatar
      KVM: nVMX: remove useless is_protmode check · e0dfacbf
      Paolo Bonzini authored
      VMX is only accessible in protected mode, so remove a confusing check
      that causes the conditional to lack a final "else" branch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0dfacbf
    • Sean Christopherson's avatar
      KVM: nVMX: Ignore limit checks on VMX instructions using flat segments · 34333cc6
      Sean Christopherson authored
      Regarding segments with a limit==0xffffffff, the SDM officially states:
      
          When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
          or may not cause the indicated exceptions.  Behavior is
          implementation-specific and may vary from one execution to another.
      
      In practice, all CPUs that support VMX ignore limit checks for "flat
      segments", i.e. an expand-up data or code segment with base=0 and
      limit=0xffffffff.  This is subtly different than wrapping the effective
      address calculation based on the address size, as the flat segment
      behavior also applies to accesses that would wrap the 4 GiB boundary, e.g.
      a 4-byte access starting at 0xffffffff will access linear addresses
      0xffffffff, 0x0, 0x1 and 0x2.
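
      The behavior described above can be sketched as follows. The struct
      and function names are hypothetical and heavily simplified (only
      expand-up limit checks are modeled); this is an illustration of the
      idea, not KVM's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified segment descriptor for illustration only. */
struct seg {
    uint64_t base;
    uint32_t limit;
    bool expand_down;   /* true for expand-down data segments */
};

/* "Flat" as described above: expand-up, base=0, limit=0xffffffff. */
static bool seg_is_flat(const struct seg *s)
{
    return !s->expand_down && s->base == 0 && s->limit == 0xffffffffu;
}

/*
 * Limit check for an access of 'len' bytes at effective address 'off' in an
 * expand-up segment; skipped entirely for flat segments.
 */
static bool limit_violation(const struct seg *s, uint64_t off, unsigned int len)
{
    if (seg_is_flat(s))
        return false;
    return off + len - 1 > s->limit;
}

/* Linear address of byte 'i' of an access that wraps at 4 GiB. */
static uint32_t wrapped_byte_addr(uint32_t start, unsigned int i)
{
    return start + i;   /* result is reduced modulo 2^32 */
}
```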
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      34333cc6
    • Sean Christopherson's avatar
      KVM: nVMX: Apply addr size mask to effective address for VMX instructions · 8570f9e8
      Sean Christopherson authored
      The address size of an instruction affects the effective address, not
      the virtual/linear address.  The final address may still be truncated,
      e.g. to 32-bits outside of long mode, but that happens irrespective of
      the address size, e.g. a 32-bit address size can yield a 64-bit virtual
      address when using FS/GS with a non-zero base.
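
      A minimal sketch of the ordering described above: the address-size
      mask is applied to the effective address (the segment offset), and
      the segment base is added afterwards. Function names are
      hypothetical, and the final truncation to 32 bits outside of long
      mode is deliberately left out.

```c
#include <stdint.h>

/* Mask for a 16-, 32- or 64-bit address size. */
static uint64_t addr_size_mask(unsigned int addr_bits)
{
    return addr_bits >= 64 ? ~0ULL : (1ULL << addr_bits) - 1;
}

/* The address size truncates the effective address (the segment offset)... */
static uint64_t effective_addr(uint64_t raw_offset, unsigned int addr_bits)
{
    return raw_offset & addr_size_mask(addr_bits);
}

/* ...and the segment base is added afterwards, without that mask. */
static uint64_t linear_addr(uint64_t seg_base, uint64_t raw_offset,
                            unsigned int addr_bits)
{
    return seg_base + effective_addr(raw_offset, addr_bits);
}
```

      With a 32-bit address size and a non-zero 64-bit FS/GS base, the
      result of linear_addr() can still be a full 64-bit address.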
      
      Fixes: 064aea77 ("KVM: nVMX: Decoding memory operands of VMX instructions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8570f9e8
    • Sean Christopherson's avatar
      KVM: nVMX: Sign extend displacements of VMX instr's mem operands · 946c522b
      Sean Christopherson authored
      The VMCS.EXIT_QUALIFICATION field reports the displacements of memory
      operands for various instructions, including VMX instructions, as a
      naturally sized unsigned value, but masks the value by the addr size,
      e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
      reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
      wording regarding sign extension, the SDM explicitly states that bits
      beyond the instruction's address size are undefined:
      
          In all cases, bits of this field beyond the instruction's address
          size are undefined.
      
      Failure to sign extend the displacement results in KVM incorrectly
      treating a negative displacement as a large positive displacement when
      the address size of the VMX instruction is smaller than KVM's native
      size, e.g. a 32-bit address size on a 64-bit KVM.
      
      The very original decoding, added by commit 064aea77 ("KVM: nVMX:
      Decoding memory operands of VMX instructions"), sort of modeled sign
      extension by truncating the final virtual/linear address for a 32-bit
      address size.  I.e. it messed up the effective address but made it work
      by adjusting the final address.
      
      When segmentation checks were added, the truncation logic was kept
      as-is and no sign extension logic was introduced.  In other words, it
      kept calculating the wrong effective address while mostly generating
      the correct virtual/linear address.  As the effective address is what's
      used in the segment limit checks, this results in KVM incorrectly
      injecting #GP/#SS faults due to non-existent segment violations when
      a nested VMM uses negative displacements with an address size smaller
      than KVM's native address size.
      
      Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
      KVM using 0x100000fd8 as the effective address when checking for a
      segment limit violation.  This causes a 100% failure rate when running
      a 32-bit KVM build as L1 on top of a 64-bit KVM L0.
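
      The -0x28(%ebp) example can be reproduced with a small sketch. The
      helper names are hypothetical; the code only illustrates sign
      extending a displacement from the instruction's address size before
      computing the effective address, and is not the KVM code itself.

```c
#include <stdint.h>

/*
 * Hypothetical helper: interpret the low 'addr_bits' bits of 'value' as a
 * two's-complement number and widen it to 64 bits. Valid for
 * 1 <= addr_bits <= 64.
 */
static uint64_t sign_extend_disp(uint64_t value, unsigned int addr_bits)
{
    uint64_t sign = 1ULL << (addr_bits - 1);
    uint64_t mask = sign | (sign - 1);

    value &= mask;
    return (value ^ sign) - sign;
}

/* Effective address, truncated to the instruction's address size. */
static uint64_t disp_effective_addr(uint64_t base_reg, uint64_t disp,
                                    unsigned int addr_bits)
{
    uint64_t sign = 1ULL << (addr_bits - 1);
    uint64_t mask = sign | (sign - 1);

    return (base_reg + sign_extend_disp(disp, addr_bits)) & mask;
}
```

      For EBP = 0x1000 and a reported displacement of 0xffffffd8, the
      sign-extended calculation yields 0xfd8, whereas adding the raw
      value gives the bogus 0x100000fd8 mentioned above.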
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      946c522b
    • Suthikulpanit, Suravee's avatar
      svm: Fix improper check when deactivate AVIC · c57cd3c8
      Suthikulpanit, Suravee authored
      The function svm_refresh_apicv_exec_ctrl() always returns prematurely
      because kvm_vcpu_apicv_active() always returns false when called from
      arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv(). This is because
      apicv_active is set to false just before refresh_apicv_exec_ctrl()
      is called.
      
      Also, we need to mark the VMCB_AVIC bit as dirty instead of VMCB_INTR.
      
      So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.
      
      Fixes: 67034bb9 ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c57cd3c8
    • Paolo Bonzini's avatar
      KVM: x86: cull apicv code when userspace irqchip is requested · f7589cca
      Paolo Bonzini authored
      Currently apicv_active can be true even if in-kernel LAPIC
      emulation is disabled.  Avoid this by properly initializing
      it in kvm_arch_vcpu_init, and then do not do anything to
      deactivate APICv when it is actually not used.
      
      (Currently APICv is only deactivated by SynIC code that in turn
      is only reachable when in-kernel LAPIC is in use.  However, it is
      cleaner if kvm_vcpu_deactivate_apicv avoids relying on this.)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7589cca
    • Suthikulpanit, Suravee's avatar
      svm: Fix AVIC DFR and LDR handling · 98d90582
      Suthikulpanit, Suravee authored
      The current SVM AVIC driver makes two incorrect assumptions:
        1. APIC LDR register cannot be zero
        2. APIC DFR for all vCPUs must be the same
      
      LDR=0 means the local APIC does not support logical destination mode.
      Therefore, the driver should mark any previously assigned logical APIC ID
      table entry as invalid, and return success.  Also, DFR is specific to
      a particular local APIC, and can be different among all vCPUs
      (as observed on Windows 10).
      
      These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
      to boot with AVIC enabled. So, instead of flushing the whole logical
      APIC ID table, handle DFR and LDR for each vCPU independently.
      
      Fixes: 18f40c53 ('svm: Add VMEXIT handlers for AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: Julian Stecklina <jsteckli@amazon.de>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      98d90582
    • Gustavo A. R. Silva's avatar
      kvm: Use struct_size() in kmalloc() · 90952cd3
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
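
      For readers outside the kernel tree, the idea can be approximated in
      user space. The sketch below is a hypothetical stand-in, not the
      kernel's struct_size() macro: foo_size() and foo_alloc() are made-up
      names, and the overflow checks rely on the GCC/Clang builtins
      __builtin_mul_overflow() and __builtin_add_overflow().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
    int stuff;
    void *entry[];   /* flexible array member */
};

/*
 * Hypothetical stand-in for struct_size(): returns SIZE_MAX when the size
 * computation would overflow, so that the allocation fails instead of
 * silently returning a buffer that is too small.
 */
static size_t foo_size(size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, sizeof(void *), &bytes) ||
        __builtin_add_overflow(bytes, sizeof(struct foo), &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocate a struct foo with room for 'count' entries. */
static struct foo *foo_alloc(size_t count)
{
    return malloc(foo_size(count));
}
```

      Saturating to SIZE_MAX rather than wrapping means an absurd count
      produces a request the allocator is expected to reject, instead of
      a short buffer.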
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      90952cd3
    • Pavel Tatashin's avatar
      x86/kvmclock: set offset for kvm unstable clock · b5179ec4
      Pavel Tatashin authored
      VMs may show incorrect uptime and dmesg printk offsets on hypervisors with
      an unstable clock. The problem occurs when a VM is rebooted without
      exiting from qemu.
      
      The fix is to calculate the clock offset not only for the stable clock but
      for the unstable clock as well, and to use kvm_sched_clock_read(), which
      subtracts the offset for both clocks.
      
      This is safe, because pvclock_clocksource_read() does the right thing and
      makes sure that clock always goes forward, so once offset is calculated
      with unstable clock, we won't get new reads that are smaller than offset,
      and thus won't get negative results.
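
      The offset scheme can be modeled in a few lines. This is a sketch
      under the assumption that the raw clock is monotonic, as stated
      above; the variable and function names are hypothetical and the raw
      clock value is passed in rather than read from pvclock.

```c
#include <stdint.h>

/* Hypothetical state: raw clock value captured when the guest boots. */
static uint64_t sched_clock_offset;

/* Record the offset at boot, for stable and unstable clocks alike. */
static void sched_clock_init(uint64_t raw_now)
{
    sched_clock_offset = raw_now;
}

/*
 * Guest-visible clock. 'raw_now' is assumed to be monotonic, so it is never
 * smaller than the recorded offset and the result never goes negative.
 */
static uint64_t sched_clock_read(uint64_t raw_now)
{
    return raw_now - sched_clock_offset;
}
```

      After a reboot without exiting the hypervisor process, recording a
      fresh offset makes the guest's clock start from zero again instead
      of continuing from the previous boot's value.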
      
      Thank you Jon DeVree for helping to reproduce this issue.
      
      Fixes: 857baa87 ("sched/clock: Enable sched clock early")
      Cc: stable@vger.kernel.org
      Reported-by: Dominique Martinet <asmadeus@codewreck.org>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5179ec4