1. 08 Jun, 2021 9 commits
    • kvm: avoid speculation-based attacks from out-of-range memslot accesses · da27a83f
      Paolo Bonzini authored
      KVM's mechanism for accessing guest memory translates a guest physical
      address (gpa) to a host virtual address using the right-shifted gpa
      (also known as gfn) and a struct kvm_memory_slot.  The translation is
      performed in __gfn_to_hva_memslot using the following formula:
      
            hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
      
      It is expected that gfn falls within the boundaries of the guest's
      physical memory.  However, a guest can access invalid physical addresses
      in such a way that the gfn is invalid.
      
      __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
      retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
      does check whether the gfn falls within the boundaries of the guest's
      physical memory, a CPU can speculate the result of the check and
      continue execution speculatively using an illegal gfn. The speculation
      can result in calculating an out-of-bounds hva.  If the resulting host
      virtual address is used to load another guest physical address, this
      is effectively a Spectre gadget consisting of two consecutive reads,
      the second of which is data dependent on the first.
      
      Right now it's not clear if there are any cases in which this is
      exploitable.  One interesting case was reported by the original author
      of this patch, and involves visiting guest page tables on x86.  Right
      now these are not vulnerable because the hva read goes through get_user(),
      which contains an LFENCE speculation barrier.  However, there are
      patches in progress for x86 uaccess.h to mask kernel addresses instead of
      using LFENCE; once these land, a guest could use speculation to read
      from the VMM's ring 3 address space.  Other architectures such as ARM
      already use the address masking method, and would be susceptible to
      the same kind of data-dependent access gadget.  Therefore, this patch
      proactively protects from these attacks by masking out-of-bounds gfns
      in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
      
      Sean Christopherson noted that this patch does not cover
      kvm_read_guest_offset_cached.  This however is limited to a few bytes
      past the end of the cache, and therefore it is unlikely to be useful in
      the context of building a chain of data dependent accesses.
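
The masking idea can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct memslot`, `index_mask_nospec()` and `gfn_to_hva()` are hypothetical stand-ins loosely modeled on the kernel's array_index_nospec() helper, and a 4 KiB page size is assumed. The point is that the offset is clamped with a data-dependent mask rather than a branch, so a mispredicted bounds check cannot feed an out-of-range offset into the address calculation.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for struct kvm_memory_slot. */
struct memslot {
    uint64_t base_gfn;
    uint64_t npages;
    uint64_t userspace_addr;
};

/*
 * Returns all ones if index < size and zero otherwise, without branching.
 * Valid for index and size below 2^63.  Relies on arithmetic right shift
 * of a negative signed value, which is implementation-defined in C but is
 * what common compilers do.  A real implementation also has to keep the
 * compiler from turning this back into a branch.
 */
static uint64_t index_mask_nospec(uint64_t index, uint64_t size)
{
    return (uint64_t)(~(int64_t)(index | (size - 1 - index)) >> 63);
}

/* hva = userspace_addr + (gfn - base_gfn) * PAGE_SIZE, with a masked offset. */
static uint64_t gfn_to_hva(const struct memslot *slot, uint64_t gfn)
{
    uint64_t offset = gfn - slot->base_gfn;

    /* Out-of-range offsets collapse to 0, i.e. the first page of the slot. */
    offset &= index_mask_nospec(offset, slot->npages);
    return slot->userspace_addr + (offset << PAGE_SHIFT);
}
```

With a slot of 16 pages starting at gfn 0x100, gfn 0x103 maps to the fourth page, while gfn 0x110 or 0x0ff both collapse to the start of the slot instead of producing an address outside it.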
      Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync · b53e84ee
      Lai Jiangshan authored
      When using shadow paging, unload the guest MMU when emulating a guest TLB
      flush to ensure all roots are synchronized.  From the guest's perspective,
      flushing the TLB ensures any and all modifications to its PTEs will be
      recognized by the CPU.
      
      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
      can be done to more precisely synchronize roots when servicing a guest
      TLB flush.
      
      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
      guest-physical mappings; for NPT, guest mappings are always tagged with
      an ASID and thus can only be invalidated via the VMCB's ASID control.
      
      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
      It was only recently exposed after Linux guests stopped flushing the
      local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eab,
      "x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
      visible in Windows 10 guests.
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: f38a7b75 ("KVM: X86: support paravirtualized help for TLB shootdowns")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      [sean: massaged comment and changelog]
      Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message · f31500b0
      Sean Christopherson authored
      Use the __string() machinery provided by the tracing subsystem to make a
      copy of the string literals consumed by the "nested VM-Enter failed"
      tracepoint.  A complete copy is necessary to ensure that the tracepoint
      can't outlive the data/memory it consumes and dereference stale memory.
      
      Because the tracepoint itself is defined by kvm, if kvm-intel and/or
      kvm-amd are built as modules, the memory holding the string literals
      defined by the vendor modules will be freed when the module is unloaded,
      whereas the tracepoint and its data in the ring buffer will live until
      kvm is unloaded (or "indefinitely" if kvm is built-in).
      
      This bug has existed since the tracepoint was added, but was recently
      exposed by a new check in tracing to detect exactly this type of bug.
      
        fmt: '%s%s
        ' current_buffer: ' vmx_dirty_log_t-140127  [003] ....  kvm_nested_vmenter_failed: '
        WARNING: CPU: 3 PID: 140134 at kernel/trace/trace.c:3759 trace_check_vprintf+0x3be/0x3e0
        CPU: 3 PID: 140134 Comm: less Not tainted 5.13.0-rc1-ce2e73ce600a-req #184
        Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
        RIP: 0010:trace_check_vprintf+0x3be/0x3e0
        Code: <0f> 0b 44 8b 4c 24 1c e9 a9 fe ff ff c6 44 02 ff 00 49 8b 97 b0 20
        RSP: 0018:ffffa895cc37bcb0 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffffa895cc37bd08 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9766cfad74f8
        RBP: ffffffffc0a041d4 R08: ffff9766cfad74f0 R09: ffffa895cc37bad8
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffc0a041d4
        R13: ffffffffc0f4dba8 R14: 0000000000000000 R15: ffff976409f2c000
        FS:  00007f92fa200740(0000) GS:ffff9766cfac0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559bd11b0000 CR3: 000000019fbaa002 CR4: 00000000001726e0
        Call Trace:
         trace_event_printf+0x5e/0x80
         trace_raw_output_kvm_nested_vmenter_failed+0x3a/0x60 [kvm]
         print_trace_line+0x1dd/0x4e0
         s_show+0x45/0x150
         seq_read_iter+0x2d5/0x4c0
         seq_read+0x106/0x150
         vfs_read+0x98/0x180
         ksys_read+0x5f/0xe0
         do_syscall_64+0x40/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
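
The lifetime problem described in this message can be illustrated with a small user-space analogy. Nothing below is the tracing API: `struct record_by_pointer`, `struct record_by_copy` and `record_copy()` are hypothetical names. A record that stores only a pointer depends on the string's owner staying alive, whereas a record that copies the bytes stays valid after the owner (here, a buffer standing in for module memory) goes away.

```c
#include <stdio.h>
#include <string.h>

#define MSG_MAX 64

/* Stores only a pointer: valid only while the string's owner is alive. */
struct record_by_pointer {
    const char *msg;
};

/* Stores a private copy: valid for as long as the record itself exists. */
struct record_by_copy {
    char msg[MSG_MAX];
};

/* Copy the message into the record, truncating if it does not fit. */
static void record_copy(struct record_by_copy *rec, const char *msg)
{
    snprintf(rec->msg, sizeof(rec->msg), "%s", msg);
}
```

If the source buffer is wiped after `record_copy()` returns, the copied record still reads back the original message, while a `record_by_pointer` aimed at the same buffer would observe whatever replaced it.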
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Fixes: 380e0055 ("KVM: nVMX: trace nested VM-Enter failures detected by H/W")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Message-Id: <20210607175748.674002-1-seanjc@google.com>
    • selftests: kvm: Add support for customized slot0 memory size · f53b16ad
      Zhenzhong Duan authored
      Until commit 39fe2fc9 ("selftests: kvm: make allocation of extra
      memory take effect", 2021-05-27), parameter extra_mem_pages was used
      only to calculate the page table size for all the memory chunks,
      because real memory allocation happened with calls of
      vm_userspace_mem_region_add() after vm_create_default().
      
      Commit 39fe2fc9 however changed the meaning of extra_mem_pages to
      the size of memory slot 0.  This makes the memory allocation more
      flexible, but makes it harder to account for the number of
      pages needed for the page tables.  For example, memslot_perf_test
      has a small amount of memory in slot 0 but a lot in other slots,
      and adding that memory twice (both in slot 0 and with later
      calls to vm_userspace_mem_region_add()) causes an error that
      was fixed in commit 000ac429 ("selftests: kvm: fix overlapping
      addresses in memslot_perf_test", 2021-05-29).
      
      Since both uses are sensible, add a new parameter slot0_mem_pages
      to vm_create_with_vcpus() and some comments to clarify the meaning of
      slot0_mem_pages and extra_mem_pages.  With this change,
      memslot_perf_test can go back to passing the number of memory
      pages as extra_mem_pages.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Message-Id: <20210608233816.423958-4-zhenzhong.duan@intel.com>
      [Squashed in a single patch and rewrote the commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: introduce P47V64 for s390x · 1bc603af
      Christian Borntraeger authored
      s390x can have up to 47 bits of guest physical address and 64 bits of
      virtual address. Add a new address mode to avoid errors in test cases
      that go beyond 47 bits.
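
To make the 47-bit limit concrete, here is a small sketch of how a maximum guest frame number follows from the number of physical address bits. The helper name `max_gfn()` and the 4 KiB page shift are assumptions for illustration; whether the selftest library computes the value exactly this way is not stated in the message.

```c
#include <stdint.h>

/*
 * Highest guest frame number addressable with pa_bits of physical address
 * and pages of size 1 << page_shift.  64-bit arithmetic throughout, so the
 * result is not truncated to 32 bits.
 */
static uint64_t max_gfn(unsigned int pa_bits, unsigned int page_shift)
{
    return ((UINT64_C(1) << pa_bits) >> page_shift) - 1;
}
```

With 47 physical bits and 4 KiB pages the highest gfn is 2^35 - 1; a mode that assumes 52 physical bits would allow gfns up to 2^40 - 1, which is beyond what the message says s390x can back.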
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Message-Id: <20210608123954.10991-1-borntraeger@de.ibm.com>
      Fixes: ef4c9f4f ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
      Cc: stable@vger.kernel.org
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior · af3511ff
      Lai Jiangshan authored
      In record_steal_time(), st->preempted is read twice, and
      trace_kvm_pv_tlb_flush() might output an inconsistent result if
      kvm_vcpu_flush_tlb_guest() sees a different st->preempted later.

      The problem is minor and unlikely to cause actual harm, and it can be
      avoided by resetting and reading st->preempted atomically via xchg().
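
The read-and-reset pattern can be sketched with C11 atomics, where atomic_exchange() plays the role that xchg() plays in the kernel. The flag names and the `consume_preempted()` helper are hypothetical; the point is only that a single exchange yields one consistent snapshot, which is then used both for the decision and for whatever gets traced.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define EXAMPLE_PREEMPTED (1u << 0)
#define EXAMPLE_FLUSH_TLB (1u << 1)

/*
 * Atomically fetch the current flags and reset them to 0.  The caller works
 * from the single returned snapshot instead of reading the field twice.
 * Returns nonzero if a TLB flush was requested in that snapshot.
 */
static int consume_preempted(_Atomic uint8_t *preempted)
{
    uint8_t old = atomic_exchange(preempted, 0);

    return (old & EXAMPLE_FLUSH_TLB) != 0;
}
```

A second call right after the first reports no flush request, because the first exchange already cleared the field.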
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: MMU: Use the correct inherited permissions to get shadow page · b1bd5cba
      Lai Jiangshan authored
      When computing the access permissions of a shadow page, use the effective
      permissions of the walk up to that point, i.e. the logical AND of its parents'
      permissions.  Two guest PxE entries that point at the same table gfn need to
      be shadowed with different shadow pages if their parents' permissions are
      different.  KVM currently uses the effective permissions of the last
      non-leaf entry for all non-leaf entries.  Because all non-leaf SPTEs have
      full ("uwx") permissions, and the effective permissions are recorded only
      in role.access and merged into the leaves, this can lead to incorrect
      reuse of a shadow page and eventually to a missing guest protection page
      fault.
      
      For example, here is a shared pagetable:
      
         pgd[]   pud[]        pmd[]            virtual address pointers
                           /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
              /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
         pgd-|           (shared pmd[] as above)
              \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                           \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
      
        pud1 and pud2 point to the same pmd table, so:
        - ptr1 and ptr3 point to the same page.
        - ptr2 and ptr4 point to the same page.
      
      (pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
      
      - First, the guest reads from ptr1 and KVM prepares a shadow
        page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
        "u--" comes from the effective permissions of pgd, pud1 and
        pmd1, which are stored in pt->access.  "u--" is used also to get
        the pagetable for pud1, instead of "uw-".
      
      - Then the guest writes to ptr2 and KVM reuses pud1 which is present.
        The hypervisor sets up a shadow page for ptr2 with pt->access "uw-",
        even though pud1's shadow pmd (because of the incorrect argument to
        kvm_mmu_get_page in the previous step) has role.access="u--".
      
      - Then the guest reads from ptr3.  The hypervisor reuses pud1's
        shadow pmd for pud2, because both use "u--" for their permissions.
        Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
      
      - At last, the guest writes to ptr4.  This causes no vmexit or pagefault,
        because pud1's shadow page structures included an "uw-" page even though
        its role.access was "u--".
      
      Any kind of shared pagetable can hit a similar problem in a virtual
      machine without TDP enabled, if the permissions inherited from
      different ancestors differ.
      
      In order to fix the problem, we change pt->access to be an array, and
      any access in it will not include permissions ANDed from child ptes.
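
The per-level bookkeeping can be sketched as a running AND over the guest entries of a walk. This is an illustration only: the `EX_ACC_*` bits and `walk_access()` are hypothetical, and the pgd entry is assumed to be "uw-" since the diagram above does not give its permissions. Element i of the output is the effective access of the table that entry i points to, built only from entry i and its ancestors.

```c
#include <stddef.h>

/* Hypothetical permission bits, for illustration only. */
#define EX_ACC_X 1u
#define EX_ACC_W 2u
#define EX_ACC_U 4u

/*
 * perms[0..levels-1] are the permissions of the guest entries visited from
 * the root down.  out[i] receives the AND of perms[0..i], i.e. the access
 * that should key the shadow page for the table referenced by entry i.
 * Permissions of entries below level i never leak into out[i].
 */
static void walk_access(const unsigned int *perms, size_t levels,
                        unsigned int *out)
{
    unsigned int acc = EX_ACC_U | EX_ACC_W | EX_ACC_X;

    for (size_t i = 0; i < levels; i++) {
        acc &= perms[i];
        out[i] = acc;
    }
}
```

For the example above, the walk pgd -> pud1 -> pmd1 yields "uw-" for the shared pmd table, while pgd -> pud2 -> pmd1 yields "u--" for the same table, so the two walks would ask for different shadow pages rather than sharing one.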
      
      The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
      Remember to test it with TDP disabled.
      
      The problem had existed long before the commit 41074d07 ("KVM: MMU:
      Fix inherited permissions for emulated guest pte updates"), and it
      is hard to find which is the culprit.  So there is no fixes tag here.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: cea0f0e7 ("[PATCH] KVM: MMU: Shadow page table caching")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer · e898da78
      Wanpeng Li authored
      According to the SDM 10.5.4.1:
      
        A write of 0 to the initial-count register effectively stops the local
        APIC timer, in both one-shot and periodic mode.
      
      However, when the lapic timer's oneshot/periodic mode is emulated by the
      vmx-preemption timer, writing 0 to TMICT does not stop it:
      vmx->hv_deadline_tsc is still programmed, and the guest later receives a
      spurious timer interrupt. This patch fixes it by also cancelling the
      vmx-preemption timer when 0 is written to the initial-count register.
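
The intended behavior can be modeled with a tiny state machine. This is a toy model, not KVM code: `struct lapic_timer_model` and `write_tmict()` are hypothetical, and the boolean simply stands in for "the hardware-assisted timer is armed".

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of an emulated LAPIC timer backed by a hardware timer. */
struct lapic_timer_model {
    uint32_t initial_count;
    bool hv_timer_armed; /* stands in for the vmx-preemption timer */
};

/*
 * Model of a guest write to the initial-count register.  A write of 0 must
 * leave no backing timer armed, otherwise a stale deadline would still fire.
 */
static void write_tmict(struct lapic_timer_model *t, uint32_t val)
{
    t->initial_count = val;
    if (val == 0) {
        t->hv_timer_armed = false; /* cancel instead of leaving it armed */
        return;
    }
    t->hv_timer_armed = true;
}
```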
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length... · 4f13d471
      Ashish Kalra authored
      KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca82
      
      Commit 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      allocates the structures used to communicate with the PSP on the local
      stack; they were previously allocated with kzalloc(). This breaks SEV live
      migration when computing the SEND_START session length and the
      SEND_UPDATE_DATA query length, because the session_len field (for
      SEND_START) and the trans_len and hdr_len fields (for SEND_UPDATE_DATA)
      are no longer zeroed before issuing the SEV firmware API call, so the
      firmware returns an incorrect session length or update data header/trans
      length.

      Also, the SEV firmware API returns the SEV_RET_INVALID_LEN firmware error
      for these length query calls. The return value and the firmware error
      need to be passed to userspace as they are, so remove the return check in
      the KVM code.
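
The underlying pitfall is generic C: heap memory from a zeroing allocator starts out as all zeroes, while a struct on the stack holds whatever was there before unless it is cleared. A minimal sketch, with a hypothetical `struct send_start_model` that is not the real SEV command layout:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical command layout, for illustration only. */
struct send_start_model {
    uint32_t handle;
    uint32_t session_len; /* must be 0 for a length query */
};

/*
 * Prepare a length-query command in caller-provided (e.g. stack) storage.
 * Clearing the whole struct first guarantees that every field not set
 * explicitly, including session_len, is zero.
 */
static void prepare_length_query(struct send_start_model *cmd, uint32_t handle)
{
    memset(cmd, 0, sizeof(*cmd));
    cmd->handle = handle;
}
```

Even if the storage previously held garbage, the prepared command carries a zero length field.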
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com>
      Fixes: 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 29 May, 2021 1 commit
    • selftests: kvm: fix overlapping addresses in memslot_perf_test · 000ac429
      Paolo Bonzini authored
      vm_create allocates memory and maps it close to GPA.  This memory
      is separate from what is allocated in subsequent calls to
      vm_userspace_mem_region_add, so it is incorrect to pass the
      test memory size to vm_create_default.  Just pass a small
      fixed amount of memory which can be used later for page tables,
      otherwise GPAs are already allocated at MEM_GPA and the
      test aborts.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 28 May, 2021 4 commits
    • Merge tag 'kvmarm-fixes-5.13-2' of... · a3d2ec9d
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 5.13, take #2
      
      - Another state update on exit to userspace fix
      - Prevent the creation of mixed 32/64 VMs
    • KVM: X86: Kill off ctxt->ud · b35491e6
      Wanpeng Li authored
      ctxt->ud is consumed only by x86_decode_insn(), so we can kill it off by
      passing emulation_type to x86_decode_insn() and dropping ctxt->ud
      altogether. Tracking that info in ctxt for literally one call is silly.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
    • KVM: X86: Fix warning caused by stale emulation context · da6393cd
      Wanpeng Li authored
      Reported by syzkaller:
      
        WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
        RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
        Call Trace:
         kvm_mmu_page_fault+0x126/0x8f0 [kvm]
         vmx_handle_exit+0x11e/0x680 [kvm_intel]
         vcpu_enter_guest+0xd95/0x1b40 [kvm]
         kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
         kvm_vcpu_ioctl+0x389/0x630 [kvm]
         __x64_sys_ioctl+0x8e/0xd0
         do_syscall_64+0x3c/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Commit 4a1e10d5 ("KVM: x86: handle hardware breakpoints during emulation")
      adds a hardware breakpoint check before emulating the instruction and
      before part of the emulation context initialization; the
      EMULTYPE_NO_DECODE flag is not set here, and the emulation context will
      not be reused. Commit c8848cee ("KVM: x86: set ctxt->have_exception in
      x86_decode_insn()") triggers the warning because it catches that the stale
      emulation context has #UD; however, the #UD was not raised during
      instruction decoding, which should result in EMULATION_FAILED. This patch
      fixes it by moving the second part of the emulation context
      initialization into init_emulate_ctxt(), before the hardware breakpoint
      check. The ctxt->ud will be dropped by a follow-up patch.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000
      
      Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
      Fixes: 4a1e10d5 (KVM: x86: handle hardware breakpoints during emulation)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
    • KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception · e87e46d5
      Yuan Yao authored
      kvm_get_linear_rip() handles the x86/long mode cases well and is more
      readable. __kvm_set_rflags() also uses the paired function
      kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip set in
      kvm_arch_vcpu_ioctl_set_guest_debug(), so replace the open-coded
      "CS.BASE + RIP" in kvm_arch_vcpu_ioctl_set_guest_debug() and
      handle_exception_nmi() with kvm_get_linear_rip().
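
As a rough sketch of why a dedicated helper is preferable to open-coding "CS.BASE + RIP": outside 64-bit mode the linear address is the segment base plus the offset, truncated to 32 bits, while in 64-bit mode the CS base is treated as zero. The function below is an assumption about what such a helper computes, written as plain user-space C; `linear_rip()` and its parameters are hypothetical and not copied from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute a linear instruction pointer.
 * - 64-bit mode: the CS base is ignored, so the result is RIP itself.
 * - otherwise:   CS.base + RIP, wrapped to 32 bits.
 */
static uint64_t linear_rip(bool is_64bit_mode, uint64_t cs_base, uint64_t rip)
{
    if (is_64bit_mode)
        return rip;
    return (uint32_t)(cs_base + rip);
}
```

Blindly adding the base in both modes would be wrong in the first case and would miss the 32-bit wraparound in the second.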
      Signed-off-by: Yuan Yao <yuan.yao@intel.com>
      Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 27 May, 2021 26 commits
    • KVM: x86/mmu: Fix comment mentioning skip_4k · bedd9195
      David Matlack authored
      This comment was left over from a previous version of the patch that
      introduced wrprot_gfn_range, when skip_4k was passed in instead of
      min_level.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210526163227.3113557-1-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: update vcpu posted-interrupt descriptor when assigning device · a2486020
      Marcelo Tosatti authored
      For VMX, when a vcpu enters HLT emulation, pi_post_block will:
      
      1) Add vcpu to per-cpu list of blocked vcpus.
      
      2) Program the posted-interrupt descriptor "notification vector"
      to POSTED_INTR_WAKEUP_VECTOR
      
      With interrupt remapping, an interrupt will set the PIR bit for the
      vector programmed for the device on the CPU, test-and-set the
      ON bit on the posted interrupt descriptor, and if the ON bit is clear
      generate an interrupt for the notification vector.
      
      This way, the target CPU wakes upon a device interrupt and wakes up
      the target vcpu.
      
      The problem is that pi_post_block only programs the notification vector
      if kvm_arch_has_assigned_device() is true. It's possible for the
      following to happen:
      
      1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
      notification vector is not programmed
      2) device is assigned to VM
      3) device interrupts vcpu V, sets ON bit
      (notification vector not programmed, so pcpu P remains in idle)
      4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
      kvm_vcpu_kick is skipped
      5) vcpu 0 busy spins on vcpu V's response for several seconds, until
      RCU watchdog NMIs all vCPUs.
      
      To fix this, use the start_assignment kvm_x86_ops callback to kick
      vcpus out of the halt loop, so the notification vector is
      properly reprogrammed to the wakeup vector.
      Reported-by: Pei Zhang <pezhang@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Message-Id: <20210526172014.GA29007@fuller.cnet>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK · 084071d5
      Marcelo Tosatti authored
      KVM_REQ_UNBLOCK will be used to exit a vcpu from
      its inner vcpu halt emulation loop.
      
      Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
      PowerPC to arch specific request bit.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Message-Id: <20210525134321.303768132@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add start_assignment hook to kvm_x86_ops · 57ab8794
      Marcelo Tosatti authored
      Add a start_assignment hook to kvm_x86_ops, which is called when
      kvm_arch_start_assignment is done.
      
      The hook is required to update the wakeup vector of a sleeping vCPU
      when a device is assigned to the guest.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Message-Id: <20210525134321.254128742@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch · 9805cf03
      Wanpeng Li authored
      Treat the lapic_timer_advance_ns automatic tuning logic as hypervisor
      overhead and move it before wait_lapic_expire, instead of between
      wait_lapic_expire and the world switch.  The wait duration should be
      calculated from the up-to-date guest_tsc, after the overhead of the
      automatic tuning logic. This patch saves ~30+ cycles in
      kvm-unit-tests/tscdeadline-latency when testing busy waits.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • selftests: kvm: do only 1 memslot_perf_test run by default · fb0f9479
      Paolo Bonzini authored
      The test takes a long time with the current implementation of
      memslots, so cut the run time a bit.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Use _BITUL() macro in UAPI headers · fb1070d1
      Joe Richey authored
      Replace BIT() in KVM's UAPI header with _BITUL(). BIT() is not defined
      in the UAPI headers and its usage may cause userspace build errors.
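
A short sketch of what this looks like from the userspace side. The fallback definition of `_BITUL()` below is a local stand-in so the snippet compiles without kernel headers, and the `EXAMPLE_*` flag names are hypothetical. The idea is that a header meant for userspace should only rely on macros that are themselves exported to userspace.

```c
/* Local stand-in in case the exported UAPI definition is not available. */
#ifndef _BITUL
#define _BITUL(x) (1UL << (x))
#endif

/* Hypothetical flag definitions in the style of a UAPI header. */
#define EXAMPLE_GFN_F_DIRTY _BITUL(0)
#define EXAMPLE_GFN_F_RESET _BITUL(1)
#define EXAMPLE_GFN_F_MASK  (EXAMPLE_GFN_F_DIRTY | EXAMPLE_GFN_F_RESET)
```

Had these been written with BIT(), a userspace program including the header would fail to build unless it happened to define BIT() itself.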
      
      Fixes: fb04a1ed ("KVM: X86: Implement ring-based dirty memory tracking")
      Signed-off-by: Joe Richey <joerichey@google.com>
      Message-Id: <20210521085849.37676-3-joerichey94@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: add shared hugetlbfs backing source type · 33090a88
      Axel Rasmussen authored
      This lets us run the demand paging test on top of a shared
      hugetlbfs-backed area. The "shared" is key, as this allows us to
      exercise userfaultfd minor faults on hugetlbfs.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-11-axelrasmussen@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: allow using UFFD minor faults for demand paging · a4b9722a
      Axel Rasmussen authored
      UFFD handling of MINOR faults is a new feature whose use case is to
      speed up demand paging (compared to MISSING faults). So, it's
      interesting to let this selftest exercise this new mode.
      
      Modify the demand paging test to have the option of using UFFD minor
      faults, as opposed to missing faults. Now, when turning on userfaultfd
      with '-u', the desired mode has to be specified ("MISSING" or "MINOR").
      
      If we're in minor mode, before registering, prefault via the *alias*.
      This way, the guest will trigger minor faults, instead of missing
      faults, and we can UFFDIO_CONTINUE to resolve them.
      
      Modify the page fault handler function to use the right ioctl depending
      on the mode we're running in. In MINOR mode, use UFFDIO_CONTINUE.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-10-axelrasmussen@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: create alias mappings when using shared memory · 94f3f2b3
      Axel Rasmussen authored
      When a memory region is added with a src_type specifying that it should
      use some kind of shared memory, also create an alias mapping to the same
      underlying physical pages.
      
      And, add an API so tests can get access to these alias addresses.
      Basically, for a guest physical address, let us look up the analogous
      host *alias* address.
      
      In a future commit, we'll modify the demand paging test to take
      advantage of this to exercise UFFD minor faults. The idea is, we
      pre-fault the underlying pages *via the alias*. When the *guest*
      faults, it gets a "minor" fault (PTEs don't exist yet, but a page is
      already in the page cache). Then, the userfaultfd threads can handle the
      fault: they could potentially modify the underlying memory *via the
      alias* if they wanted to, and then they install the PTEs and let the
      guest carry on via a UFFDIO_CONTINUE ioctl.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-9-axelrasmussen@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: add shmem backing source type · c9befd59
      Axel Rasmussen authored
      This lets us run the demand paging test on top of a shmem-backed area.
      In follow-up commits, we'll 1) leverage this new capability to create an
      alias mapping, and then 2) use the alias mapping to exercise UFFD minor
      faults.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-8-axelrasmussen@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: refactor vm_mem_backing_src_type flags · b3784bc2
      Axel Rasmussen authored
      Each struct vm_mem_backing_src_alias has a flags field, which denotes
      the flags used to mmap() an area of that type. Previously, this field
      never included MAP_PRIVATE | MAP_ANONYMOUS, because
      vm_userspace_mem_region_add assumed that *all* types would always use
      those flags, and so it hardcoded them.
      
      In a follow-up commit, we'll add a new type: shmem. Areas of this type
      must not have MAP_PRIVATE | MAP_ANONYMOUS, and instead they must have
      MAP_SHARED.
      
      So, refactor things. Make it so that the flags field of
      struct vm_mem_backing_src_alias really is a complete set of flags, and
      don't add in any extras in vm_userspace_mem_region_add. This will let us
      easily tack on shmem.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-7-axelrasmussen@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: allow different backing source types · 0368c2c1
      Axel Rasmussen authored
      Add an argument which lets us specify a different backing memory type
      for the test. The default is just to use anonymous, matching existing
      behavior.
      
      This is in preparation for testing UFFD minor faults. For that, we'll
      need to use a new backing memory type which is set up with MAP_SHARED.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-6-axelrasmussen@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0368c2c1
    • Axel Rasmussen's avatar
      KVM: selftests: compute correct demand paging size · 32ffa4f7
      Axel Rasmussen authored
      This is a preparatory commit needed before we can use different kinds of
      backing pages for guest memory.
      
      Previously, we used perf_test_args.host_page_size, which is the host's
      native page size (commonly 4K). For VM_MEM_SRC_ANONYMOUS this turns out
      to be okay, but in a follow-up commit we want to allow using different
      kinds of backing memory.
      
      Take VM_MEM_SRC_ANONYMOUS_HUGETLB for example. Without this change, if
      we used that backing page type, when we issued a UFFDIO_COPY ioctl we'd
      only do so with 4K, rather than the full 2M of a backing hugepage. In
      this case, UFFDIO_COPY returns -EINVAL (__mcopy_atomic_hugetlb checks
      the size).
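
      The size mismatch described above can be sketched as follows. This is a
      simplified model, not the selftest or kernel code: demo_src,
      demand_paging_size and copy_len_ok are hypothetical names, and the length
      check only approximates the kernel's validation by requiring a non-zero
      multiple of the backing page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for two of the backing source types. */
enum demo_src { DEMO_SRC_ANONYMOUS, DEMO_SRC_ANONYMOUS_HUGETLB };

/* Pick the granularity at which faults should be resolved. */
static size_t demand_paging_size(enum demo_src src, size_t host_page_size,
				 size_t huge_page_size)
{
	return src == DEMO_SRC_ANONYMOUS_HUGETLB ? huge_page_size
						 : host_page_size;
}

/* Simplified model: a copy length must cover whole backing pages. */
static bool copy_len_ok(size_t len, size_t backing_page_size)
{
	assert(backing_page_size != 0);
	return len != 0 && len % backing_page_size == 0;
}
```

      Under this model, a 4K copy into a 2M hugepage-backed area is rejected,
      while a copy sized by demand_paging_size() is accepted.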
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-5-axelrasmussen@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      32ffa4f7
    • Axel Rasmussen's avatar
      KVM: selftests: simplify setup_demand_paging error handling · 25408e5a
      Axel Rasmussen authored
      A small cleanup. Our caller writes:
      
        r = setup_demand_paging(...);
        if (r < 0) exit(-r);
      
      Since we're just going to exit anyway, instead of returning an error we
      can just re-use TEST_ASSERT. This makes the caller simpler, as well as
      the function itself - no need to write out branches, etc.
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-3-axelrasmussen@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      25408e5a
    • David Matlack's avatar
      KVM: selftests: Print a message if /dev/kvm is missing · 2aab4b35
      David Matlack authored
      If a KVM selftest is run on a machine without /dev/kvm, it will exit
      silently. Make it easy to tell what's happening by printing an error
      message.
      
      Opportunistically consolidate all codepaths that open /dev/kvm into a
      single function so they all print the same message.
      
      This slightly changes the semantics of vm_is_unrestricted_guest() by
      changing a TEST_ASSERT() to exit(KSFT_SKIP). However,
      vm_is_unrestricted_guest() is only called in one place
      (x86_64/mmio_warning_test.c) and that is to determine if the test should
      be skipped or not.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210511202120.1371800-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2aab4b35
    • Axel Rasmussen's avatar
      KVM: selftests: trivial comment/logging fixes · c887d6a1
      Axel Rasmussen authored
      Some trivial fixes I found while touching related code in this series,
      factored out into a separate commit for easier reviewing:
      
      - s/gor/got/ and add a newline in demand_paging_test.c
      - s/backing_src/src_type/ in a comment to be consistent with the real
        function signature in kvm_util.c
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Message-Id: <20210519200339.829146-2-axelrasmussen@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c887d6a1
    • David Matlack's avatar
      KVM: selftests: Fix hang in hardware_disable_test · a10453c0
      David Matlack authored
      If /dev/kvm is not available then hardware_disable_test will hang
      indefinitely because the child process exits before posting to the
      semaphore for which the parent is waiting.
      
      Fix this by making the parent periodically check if the child has
      exited. We have to be careful to forward the child's exit status to
      preserve a KSFT_SKIP status.
      
      I considered just checking for /dev/kvm before creating the child
      process, but there are so many other reasons why the child could exit
      early that it seemed better to handle that as the general case.
      
      Tested:
      
      $ ./hardware_disable_test
      /dev/kvm not available, skipping test
      $ echo $?
      4
      $ modprobe kvm_intel
      $ ./hardware_disable_test
      $ echo $?
      0
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210514230521.2608768-1-dmatlack@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a10453c0
    • David Matlack's avatar
      KVM: selftests: Ignore CPUID.0DH.1H in get_cpuid_test · 50bc913d
      David Matlack authored
      Similar to CPUID.0DH.0H, this entry depends on the vCPU's XCR0 register
      and IA32_XSS MSR. Since this test does not control for either before
      assigning the vCPU's CPUID, these entries will not necessarily match
      the supported CPUID exposed by KVM.
      
      This fixes get_cpuid_test on Cascade Lake CPUs.
      Suggested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210519211345.3944063-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      50bc913d
    • David Matlack's avatar
      KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn() · ef4c9f4f
      David Matlack authored
      vm_get_max_gfn() casts vm->max_gfn from a uint64_t to an unsigned int,
      which causes the upper 32 bits of the max_gfn to get truncated.
      
      Nobody noticed until now, likely because vm_get_max_gfn() is only used
      as a mechanism to create a memslot in an unused region of the guest
      physical address space (the top), and the top of the 32-bit physical
      address space was always good enough.
      
      This fix reveals a bug in memslot_modification_stress_test which was
      trying to create a dummy memslot past the end of guest physical memory.
      Fix that by moving the dummy memslot lower.
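
      The truncation is easy to reproduce in isolation. The sketch below is a
      standalone illustration with hypothetical function names; it is not the
      selftest code, and it assumes a platform where unsigned int is 32 bits
      wide.

```c
#include <assert.h>
#include <stdint.h>

/* The demonstration only makes sense if unsigned int is narrower. */
static_assert(sizeof(unsigned int) < sizeof(uint64_t),
	      "unsigned int must be narrower than uint64_t");

/* Buggy shape: the return type is narrower than the stored value. */
static unsigned int max_gfn_truncated(uint64_t max_gfn)
{
	return (unsigned int)max_gfn;	/* drops the upper bits */
}

/* Fixed shape: the return type matches the stored value. */
static uint64_t max_gfn_full(uint64_t max_gfn)
{
	return max_gfn;
}
```

      For a guest with a 48-bit physical address space and 4K pages, the
      largest gfn needs 36 bits, so the truncated variant silently reports a
      much lower maximum than the real one.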
      
      Fixes: 52200d0d ("KVM: selftests: Remove duplicate guest mode handling")
      Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210521173828.1180619-1-dmatlack@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ef4c9f4f
    • Maciej S. Szmigiero's avatar
      KVM: selftests: add a memslot-related performance benchmark · cad347fa
      Maciej S. Szmigiero authored
      This benchmark contains the following tests:
      * Map test, where the host unmaps guest memory while the guest writes to
      it (maps it).
      
      The test is designed in a way to make the unmap operation on the host
      take a negligible amount of time in comparison with the mapping
      operation in the guest.
      
      The test area is actually split in two: the first half is being mapped
      by the guest while the second half is being unmapped by the host.
      Then a guest <-> host sync happens and the areas are reversed.
      
      * Unmap test which is broadly similar to the above map test, but it is
      designed in an opposite way: to make the mapping operation in the guest
      take a negligible amount of time in comparison with the unmap operation
      on the host.
      This test is available in two variants: with per-page unmap operation
      or a chunked one (using 2 MiB chunk size).
      
      * Move active area test which involves moving the last (highest gfn)
      memslot a bit back and forth on the host while the guest is
      concurrently writing around the area being moved (including over the
      moved memslot).
      
      * Move inactive area test which is similar to the previous move active
      area test, but now guest writes all happen outside of the area being
      moved.
      
      * Read / write test in which the guest writes to the beginning of each
      page of the test area while the host writes to the middle of each such
      page.
      Then each side checks the values the other side has written.
      This particular test is not expected to give different results depending
      on the particular memslots implementation; it is meant as a rough sanity
      check and to provide insight into the spread of test results expected.
      
      Each test performs its operation in a loop until a test period ends
      (this is 5 seconds by default, but it is configurable).
      Then the total count of loops done is divided by the actual elapsed
      time to give the test result.
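
      The result calculation described above amounts to a rate over the
      measured interval. A minimal sketch, assuming timestamps are taken as
      struct timespec values; elapsed_seconds and loops_per_second are
      hypothetical helper names, not functions from the benchmark:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
static double elapsed_seconds(const struct timespec *start,
			      const struct timespec *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Test result: loops completed divided by the actual elapsed time. */
static double loops_per_second(uint64_t loops, const struct timespec *start,
			       const struct timespec *end)
{
	double elapsed = elapsed_seconds(start, end);

	assert(elapsed > 0.0);
	return (double)loops / elapsed;
}
```

      Dividing by the measured time rather than the nominal test period keeps
      the result comparable when the last loop overruns the period.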
      
      The tests have a configurable memslot cap, set with the "-s" test option;
      by default the system maximum is used.
      Each test is repeated a particular number of times (by default 20
      times), and the best result achieved is printed.
      
      The test memory area is divided equally between memslots; the remainder
      is added to the last memslot.
      The test area size does not depend on the number of memslots in use.
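
      The split described above can be sketched as follows. This is an
      illustration under the assumption that sizes are counted in pages;
      memslot_pages is a hypothetical helper, not a function from the
      benchmark.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pages assigned to memslot number 'slot' (0-based) when 'total_pages'
 * are divided equally between 'nslots' memslots and the remainder goes
 * to the last one.
 */
static uint64_t memslot_pages(uint64_t total_pages, uint32_t nslots,
			      uint32_t slot)
{
	assert(nslots > 0 && slot < nslots);

	uint64_t per_slot = total_pages / nslots;

	if (slot == nslots - 1)
		return per_slot + total_pages % nslots;
	return per_slot;
}
```

      Summing memslot_pages() over all slots always gives total_pages, which
      is why the test area size stays the same regardless of the memslot
      count.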
      
      The tests also measure the time that it took to add all these memslots.
      The best result from the tests that use the whole test area is printed
      after all the requested tests are done.
      
      In general, these tests are designed to use as much memory as possible
      (within reason) while still doing 100+ loops even on high memslot counts
      with the default test length.
      Increasing the test runtime makes it increasingly likely that some
      event will happen on the system during the test run, which might lower
      the test result.
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <8d31bb3d92bc8fa33a9756fa802ee14266ab994e.1618253574.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cad347fa
    • Maciej S. Szmigiero's avatar
      KVM: selftests: Keep track of memslots more efficiently · 22721a56
      Maciej S. Szmigiero authored
      The KVM selftest framework was using a simple list for keeping track of
      the memslots currently in use.
      This resulted in lookups and adding a single memslot being O(n), the
      latter due to linear scanning of the existing memslot set to check for
      the presence of any conflicting entries.
      
      Before this change, benchmarking a high count of memslots was more or less
      impossible as pretty much all the benchmark time was spent in the
      selftest framework code.
      
      We can simply use an rbtree for keeping track of both gfn and hva.
      We don't need an interval tree for hva here as we can't have overlapping
      memslots because we allocate a completely new memory chunk for each new
      memslot.
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <b12749d47ee860468240cf027412c91b76dbe3db.1618253574.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      22721a56
    • Paolo Bonzini's avatar
      selftests: kvm: fix potential issue with ELF loading · a13534d6
      Paolo Bonzini authored
      vm_vaddr_alloc() sets up the GVA to GPA mapping page by page; therefore, GPAs
      may not be contiguous if the same memslot is used for data and page table allocation.
      
      kvm_vm_elf_load(), however, expects a contiguous range of HVAs (and thus GPAs)
      because it does not try to read file data page by page.  Fix this mismatch
      by allocating memory in one step.
      Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a13534d6
    • Zhenzhong Duan's avatar
      selftests: kvm: make allocation of extra memory take effect · 39fe2fc9
      Zhenzhong Duan authored
      The extra memory pages are not allocated during VM creation, even though
      perf_test_util and kvm_page_table_test currently rely on them to allocate
      extra memory.
      
      Fix this by adding extra_mem_pages to the total memory calculation before
      allocating.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Message-Id: <20210512043107.30076-1-zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      39fe2fc9
    • Wanpeng Li's avatar
      KVM: X86: hyper-v: Task srcu lock when accessing kvm_memslots() · da6d63a0
      Wanpeng Li authored
         WARNING: suspicious RCU usage
         5.13.0-rc1 #4 Not tainted
         -----------------------------
         ./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
         1 lock held by hyperv_clock/8318:
          #0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]
      
        stack backtrace:
        CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
        Call Trace:
         dump_stack+0x87/0xb7
         lockdep_rcu_suspicious+0xce/0xf0
         kvm_write_guest_page+0x1c1/0x1d0 [kvm]
         kvm_write_guest+0x50/0x90 [kvm]
         kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
         kvm_gen_update_masterclock+0x1d/0x110 [kvm]
         kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
         kvm_vm_ioctl+0x123/0x11d0 [kvm]
         __x64_sys_ioctl+0x3ed/0x9d0
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock.
      
      Fixes: e880c6ea (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs)
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da6d63a0
    • Wanpeng Li's avatar
      KVM: X86: Fix vCPU preempted state from guest's point of view · 1eff0ada
      Wanpeng Li authored
      Commit 66570e96 (kvm: x86: only provide PV features if enabled in guest's
      CPUID) avoids accessing the host-side pv tlb shootdown logic when this pv
      feature is not exposed to the guest. However, kvm_steal_time.preempted is
      used not only by the pv tlb shootdown logic but also to mitigate the lock
      holder preemption issue. From the guest's point of view, the vCPU is always
      preempted, since we lose the reset of kvm_steal_time.preempted before
      vmentry if the pv tlb shootdown feature is not exposed. This patch fixes it
      by clearing kvm_steal_time.preempted before vmentry.
      
      Fixes: 66570e96 (kvm: x86: only provide PV features if enabled in guest's CPUID)
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1eff0ada