1. 01 Feb, 2022 3 commits
    • kvm/mips: rework guest entry logic · 72e32445
      Mark Rutland authored
      In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
      guest_exit_irqoff() directly, with interrupts masked between these. As
      we don't handle any timer ticks during this window, we will not account
      time spent within the guest as guest time, which is unfortunate.
      
      Additionally, we do not inform lockdep or tracing that interrupts will
      be enabled during guest execution, which can lead to misleading traces
      and warnings that interrupts have been enabled for overly-long periods.
      
      This patch fixes these issues by using the new timing and context
      entry/exit helpers to ensure that interrupts are handled during guest
      vtime but with RCU watching, with a sequence:
      
      	guest_timing_enter_irqoff();
      
      	guest_state_enter_irqoff();
      	< run the vcpu >
      	guest_state_exit_irqoff();
      
      	< take any pending IRQs >
      
      	guest_timing_exit_irqoff();
      
      In addition, as guest exits during the "run the vcpu" step are handled
      by kvm_mips_handle_exit(), a wrapper function is added which ensures
      that such exits are handled with a sequence:
      
      	guest_state_exit_irqoff();
      	< handle the exit >
      	guest_state_enter_irqoff();
      
      This means that exits which stop the vCPU running will have a redundant
      guest_state_enter_irqoff() .. guest_state_exit_irqoff() sequence, which
      can be addressed with future rework.
      
      Since instrumentation may make use of RCU, we must also ensure that no
      instrumented code is run during the EQS. I've split out the critical
      section into a new kvm_mips_enter_exit_vcpu() helper which is marked
      noinstr.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Message-Id: <20220201132926.3301912-6-mark.rutland@arm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
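The exit-handling wrapper described in the MIPS rework above can be modelled in user space. This is only a sketch, not the kernel code: the two `guest_state_*_irqoff()` bodies are stand-ins that flip a single flag, and `handle_exit()`, `handle_exit_wrapped()`, `enter_exit_vcpu()` and `redundant_reentries` are hypothetical names invented for illustration. It assumes at least one guest exit per run.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for "lockdep, tracing and RCU believe we are in the guest". */
static bool in_guest_state;
/* Counts re-entries that are immediately undone because the vCPU stops. */
static int redundant_reentries;

static void guest_state_enter_irqoff(void)
{
	assert(!in_guest_state);	/* entries and exits must alternate */
	in_guest_state = true;
}

static void guest_state_exit_irqoff(void)
{
	assert(in_guest_state);
	in_guest_state = false;
}

/* Hypothetical exit handler: returns true if the guest should resume. */
static bool handle_exit(int exits_left)
{
	assert(!in_guest_state);	/* handlers run outside guest state */
	return exits_left > 0;
}

/* Wrapper: leave guest state, handle the exit, re-enter guest state. */
static bool handle_exit_wrapped(int exits_left)
{
	bool resume;

	guest_state_exit_irqoff();
	resume = handle_exit(exits_left);
	guest_state_enter_irqoff();
	if (!resume)
		redundant_reentries++;	/* undone right away by the caller */
	return resume;
}

/* Models the critical section; nr_exits must be at least 1. */
int enter_exit_vcpu(int nr_exits)
{
	int handled = 0;

	guest_state_enter_irqoff();
	do {
		handled++;
	} while (handle_exit_wrapped(nr_exits - handled));
	guest_state_exit_irqoff();
	return handled;
}
```

In this model the final exit still re-enters guest state before the caller leaves it again, which corresponds to the redundant enter/exit pair the commit message mentions.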
    • kvm: add guest_state_{enter,exit}_irqoff() · ef9989af
      Mark Rutland authored
      When transitioning to/from guest mode, it is necessary to inform
      lockdep, tracing, and RCU in a specific order, similar to the
      requirements for transitions to/from user mode. Additionally, it is
      necessary to perform vtime accounting for a window around running the
      guest, with RCU enabled, such that timer interrupts taken from the guest
      can be accounted as guest time.
      
      Most architectures don't handle all the necessary pieces, and have a
      number of common bugs, including unsafe usage of RCU during the window
      between guest_enter() and guest_exit().
      
      On x86, this was dealt with across commits:
      
        87fa7f3e ("x86/kvm: Move context tracking where it belongs")
        0642391e ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
        9fc975e9 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
        3ebccdf3 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
        135961e0 ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
        16045714 ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
        bc908e09 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
      
      ... but those fixes are specific to x86, and as the resulting logic
      (while correct) is split across generic helper functions and
      x86-specific helper functions, it is difficult to see that the
      entry/exit accounting is balanced.
      
      This patch adds generic helpers which architectures can use to handle
      guest entry/exit consistently and correctly. The guest_{enter,exit}()
      helpers are split into guest_timing_{enter,exit}() to perform vtime
      accounting, and guest_context_{enter,exit}() to perform the necessary
      context tracking and RCU management. The existing guest_{enter,exit}()
      helpers are left as wrappers of these.
      
      Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
      helpers are added to handle the ordering of lockdep, tracing, and RCU
      management. These are intended to mirror exit_to_user_mode() and
      enter_from_user_mode().
      
      Subsequent patches will migrate architectures over to the new helpers,
      following a sequence:
      
      	guest_timing_enter_irqoff();
      
      	guest_state_enter_irqoff();
      	< run the vcpu >
      	guest_state_exit_irqoff();
      
      	< take any pending IRQs >
      
      	guest_timing_exit_irqoff();
      
      This sequence handles all of the above correctly, and more clearly
      balances the entry and exit portions, making it easier to understand.
      
      The existing helpers are marked as deprecated, and will be removed once
      all architectures have been converted.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
      Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
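To make the balance of the entry/exit sequence above easier to see, here is a small user-space trace of the call order. It is a sketch under stated assumptions: the four `guest_*_irqoff()` functions share names with the kernel helpers but are stubs that only record that they ran, and `record()`, `run_the_vcpu()`, `take_pending_irqs()` and `vcpu_run_once()` are hypothetical placeholders.

```c
#include <stdio.h>
#include <string.h>

/* Each stub appends its name to this log instead of doing real work. */
static char event_log[256];

static void record(const char *event)
{
	size_t used = strlen(event_log);

	snprintf(event_log + used, sizeof(event_log) - used, "%s;", event);
}

/* Stubs standing in for the vtime accounting helpers. */
static void guest_timing_enter_irqoff(void) { record("timing_enter"); }
static void guest_timing_exit_irqoff(void)  { record("timing_exit"); }

/* Stubs standing in for the lockdep/tracing/RCU ordering helpers. */
static void guest_state_enter_irqoff(void)  { record("state_enter"); }
static void guest_state_exit_irqoff(void)   { record("state_exit"); }

/* Hypothetical placeholders for the architecture-specific steps. */
static void run_the_vcpu(void)      { record("run"); }
static void take_pending_irqs(void) { record("irqs"); }

/* One pass through the sequence; returns the recorded call order. */
const char *vcpu_run_once(void)
{
	event_log[0] = '\0';

	guest_timing_enter_irqoff();

	guest_state_enter_irqoff();
	run_the_vcpu();
	guest_state_exit_irqoff();

	take_pending_irqs();

	guest_timing_exit_irqoff();
	return event_log;
}
```

The state helpers bracket only the guest run, while the timing helpers bracket the whole window including IRQ handling, so a tick taken in `take_pending_irqs()` still falls inside guest vtime.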
    • KVM: x86: Move delivery of non-APICv interrupt into vendor code · 57dfd7b5
      Sean Christopherson authored
      Handle non-APICv interrupt delivery in vendor code, even though it means
      VMX and SVM will temporarily have duplicate code.  SVM's AVIC has a race
      condition that requires KVM to fall back to legacy interrupt injection
      _after_ the interrupt has been logged in the vIRR, i.e. to fix the race,
      SVM will need to open code the full flow anyways[*].  Refactor the code
      so that the SVM bug can be fixed without introducing other issues, e.g. SVM would
      return "success" and thus invoke trace_kvm_apicv_accept_irq() even when
      delivery through the AVIC failed, and to opportunistically prepare for
      using KVM_X86_OP to fill each vendor's kvm_x86_ops struct, which will
      rely on the vendor function matching the kvm_x86_op pointer name.
      
      No functional change intended.
      
      [*] https://lore.kernel.org/all/20211213104634.199141-4-mlevitsk@redhat.com

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 31 Jan, 2022 1 commit
  3. 28 Jan, 2022 10 commits
    • Merge tag 'kvmarm-fixes-5.17-1' of... · 17179d00
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 5.17, take #1
      
      - Correctly update the shadow register on exception injection when
        running in nVHE mode
      
      - Correctly use the mm_ops indirection when performing cache invalidation
        from the page-table walker
      
      - Restrict the vgic-v3 workaround for SEIS to the two known broken
        implementations
    • KVM: eventfd: Fix false positive RCU usage warning · 6a0c6170
      Hou Wenlong authored
      Fix the following false positive warning:
       =============================
       WARNING: suspicious RCU usage
       5.16.0-rc4+ #57 Not tainted
       -----------------------------
       arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by fc_vcpu 0/330:
        #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
        #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
        #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]
      
       stack backtrace:
       CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x44/0x57
        kvm_notify_acked_gsi+0x6b/0x70 [kvm]
        kvm_notify_acked_irq+0x8d/0x180 [kvm]
        kvm_ioapic_update_eoi+0x92/0x240 [kvm]
        kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
        handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
        vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
        vcpu_enter_guest+0x66e/0x1860 [kvm]
        kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
        kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
        __x64_sys_ioctl+0x89/0xc0
        do_syscall_64+0x3a/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
      kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
      kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
      a false positive warning. So use hlist_for_each_entry_srcu() instead of
      hlist_for_each_entry_rcu().
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use · 6cbbaab6
      Vitaly Kuznetsov authored
      Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
      Enlightened VMCS interface is in use:
      
      "Any VMREAD or VMWRITE instructions while an enlightened VMCS is
      active is unsupported and can result in unexpected behavior."
      
      Windows 11 + WSL2 seems to ignore this: attempts to VMREAD VMCS field
      0x4404 ("VM-exit interruption information") are observed. Failing
      these attempts with nested_vmx_failInvalid() makes such guests
      unbootable.
      
      Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
      eventually but for the time being we need a workaround. (Temporarily) allow
      VMREAD to get data from the currently loaded Enlightened VMCS.
      
      Note: VMWRITE instructions remain forbidden; it is not clear how to
      handle them properly, and hopefully they won't ever be needed.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread() · 892a42c1
      Vitaly Kuznetsov authored
      In preparation for allowing reads from Enlightened VMCS from
      handle_vmread(), implement evmcs_field_offset() to get the correct
      read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V,
      is almost what's needed but a few things need to be adjusted. First,
      WARN_ON() is unacceptable for handle_vmread() as any field can (in
      theory) be supplied by the guest and not all fields are defined in
      eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields).
      It also sounds like a good idea to WARN_ON() if such fields are ever
      accessed by KVM-on-Hyper-V.
      
      Implement dedicated evmcs_field_offset() helper.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
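The two requirements named above (no WARN on guest-supplied fields, and explicit handling of holes) amount to a lookup that reports a missing field as an ordinary error. The sketch below illustrates that idea only; it is not the kernel's `evmcs_field_offset()`. The struct, the table, the byte offsets and the name `field_offset_lookup()` are all hypothetical, and only the encoding 0x4404 is taken from the log above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping from a field encoding to a byte offset. */
struct field_offset {
	uint32_t field;	/* VMCS field encoding supplied by the guest */
	int offset;	/* made-up offset inside the enlightened VMCS */
};

/* Illustrative table; fields not listed here are "holes". */
static const struct field_offset offset_table[] = {
	{ 0x4402, 0x10 },	/* offset value is invented */
	{ 0x4404, 0x18 },	/* "VM-exit interruption information" */
};

/*
 * Return the offset for a field, or -ENOENT for a hole. No warning is
 * raised because, in the VMREAD case, the guest may ask for any field.
 */
int field_offset_lookup(uint32_t field)
{
	size_t n = sizeof(offset_table) / sizeof(offset_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (offset_table[i].field == field)
			return offset_table[i].offset;
	}
	return -ENOENT;
}
```

A caller that only ever passes fields it defined itself could still choose to warn on a negative return, which keeps the quiet path for guest-controlled input separate from the noisy path for internal bugs.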
    • KVM: nVMX: Rename vmcs_to_field_offset{,_table} · 2423a4c0
      Vitaly Kuznetsov authored
      vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
      blob which is not supposed to be accessed directly. In fact,
      vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.
      
      Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
      
      No functional change intended.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER · 7a601e2c
      Vitaly Kuznetsov authored
      Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field,
      PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes
      sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to enable 'save VMX-preemption timer value' when eVMCS is in use, the
      change is aimed at making the filtering future proof.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS · f80ae0ef
      Vitaly Kuznetsov authored
      Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS,
      MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair,
      MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way
      MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely
      on 'true' MSR data.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change
      is aimed at making the filtering future proof.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP · b19c99b9
      Paolo Bonzini authored
      Provide coverage for the new API.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add system attribute to retrieve full set of supported xsave states · dd6e6312
      Paolo Bonzini authored
      Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
      VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
      have not been enabled.  Probing those, for example so that they can be
      passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
      The latter is in fact worse, even though that is what the rest of the
      API uses, because it would require supported_xcr0 to be moved from the
      KVM module to the kernel just for this use.  In addition, the value
      would be nonsensical (or an error would have to be returned) until
      the KVM module is loaded in.
      
      Therefore, to limit the growth of system ioctls, add a /dev/kvm
      variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
      with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr · 56f289a8
      Sean Christopherson authored
      Add a helper to handle converting the u64 userspace address embedded in
      struct kvm_device_attr into a userspace pointer; it's all too easy to
      forget the intermediate "unsigned long" cast as well as the truncation
      check.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
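The cast-plus-truncation check that the helper above centralizes can be illustrated in portable user-space C. This is a sketch, not the kernel helper: `attr_addr_to_ptr()` is a hypothetical name, `uintptr_t` stands in for the kernel's "unsigned long" intermediate cast, and returning `-EFAULT` on truncation is an assumption made for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Convert a 64-bit address field into a native pointer.
 * On targets where pointers are narrower than 64 bits, an address that
 * does not survive the round trip is rejected instead of being silently
 * truncated.
 */
int attr_addr_to_ptr(uint64_t addr, void **out)
{
	uintptr_t narrowed = (uintptr_t)addr;	/* the intermediate cast */

	if ((uint64_t)narrowed != addr)		/* the truncation check */
		return -EFAULT;

	*out = (void *)narrowed;
	return 0;
}
```

On a 64-bit build the check can never fail, which is exactly why it is easy to forget when it is open coded at each call site.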
  4. 26 Jan, 2022 24 commits
    • selftests: kvm: move vm_xsave_req_perm call to amx_test · dd4516ae
      Paolo Bonzini authored
      There is no need for tests other than amx_test to enable dynamic xsave
      states.  Remove the call to vm_xsave_req_perm from generic code,
      and move it inside the test.  While at it, allow customizing the bit
      that is requested, so that future tests can use it differently.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Sync the states size with the XCR0/IA32_XSS at any time · 05a9e065
      Like Xu authored
      XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
      both RESET and INIT. kvm_set_msr_common()'s handling of MSR_IA32_XSS
      also needs to call kvm_update_cpuid_runtime(). In the above cases, the
      size in bytes of the XSAVE area containing all states enabled by XCR0 or
      (XCR0 | IA32_XSS) needs to be updated.
      
      For simplicity and consistency, existing helpers are used to write values
      and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS · 4c282e51
      Like Xu authored
      Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the
      size in bytes of the XSAVE area is affected by the states enabled in XSS.
      
      Fixes: 20300099 ("kvm: vmx: add MSR logic for XSAVES")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      [sean: split out as a separate patch, adjust Fixes tag]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Keep MSR_IA32_XSS unchanged for INIT · be4f3b3f
      Xiaoyao Li authored
      As corrected in SDM version 075, MSR_IA32_XSS is reset to zero on
      power-up and RESET but remains unchanged on INIT.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2} · 811f95ff
      Sean Christopherson authored
      Free the "struct kvm_cpuid_entry2" array on successful post-KVM_RUN
      KVM_SET_CPUID{,2} to fix a memory leak; the callers of kvm_set_cpuid()
      free the array only on failure.
      
       BUG: memory leak
       unreferenced object 0xffff88810963a800 (size 2048):
        comm "syz-executor025", pid 3610, jiffies 4294944928 (age 8.080s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 0d 00 00 00  ................
          47 65 6e 75 6e 74 65 6c 69 6e 65 49 00 00 00 00  GenuntelineI....
        backtrace:
          [<ffffffff814948ee>] kmalloc_node include/linux/slab.h:604 [inline]
          [<ffffffff814948ee>] kvmalloc_node+0x3e/0x100 mm/util.c:580
          [<ffffffff814950f2>] kvmalloc include/linux/slab.h:732 [inline]
          [<ffffffff814950f2>] vmemdup_user+0x22/0x100 mm/util.c:199
          [<ffffffff8109f5ff>] kvm_vcpu_ioctl_set_cpuid2+0x8f/0xf0 arch/x86/kvm/cpuid.c:423
          [<ffffffff810711b9>] kvm_arch_vcpu_ioctl+0xb99/0x1e60 arch/x86/kvm/x86.c:5251
          [<ffffffff8103e92d>] kvm_vcpu_ioctl+0x4ad/0x950 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4066
          [<ffffffff815afacc>] vfs_ioctl fs/ioctl.c:51 [inline]
          [<ffffffff815afacc>] __do_sys_ioctl fs/ioctl.c:874 [inline]
          [<ffffffff815afacc>] __se_sys_ioctl fs/ioctl.c:860 [inline]
          [<ffffffff815afacc>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
          [<ffffffff844a3335>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff844a3335>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+be576ad7655690586eec@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125210445.2053429-1-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd
      Sean Christopherson authored
      WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
      VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
      a "real" shadow VMCS for L2.
      
      The previous code WARNed but continued anyway with the allocation,
      presumably in an attempt to avoid NULL pointer dereference.
      However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
      indeed the sole caller does:
      
      	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
      		goto out_shadow_vmcs;
      
      which makes it not a useful attempt.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220527.2093146-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest · 4cf3d3eb
      Sean Christopherson authored
      Don't skip the vmcall() in l2_guest_code() prior to re-entering L2, doing
      so will result in L2 running to completion, popping '0' off the stack for
      RET, jumping to address '0', and ultimately dying with a triple fault
      shutdown.
      
      It's not at all obvious why the test re-enters L2 and re-executes VMCALL,
      but presumably it serves a purpose.  The VMX path doesn't skip vmcall(),
      and the test can't possibly have passed on SVM, so just do what VMX does.
      
      Fixes: d951b221 ("KVM: selftests: smm_test: Test SMM enter from L2")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125221725.2101126-1-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Check .flags in kvm_cpuid_check_equal() too · 033a3ea5
      Vitaly Kuznetsov authored
      kvm_cpuid_check_equal() checks for the (full) equality of the supplied
      CPUID data so .flags need to be checked too.
      Reported-by: Sean Christopherson <seanjc@google.com>
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220126131804.2839410-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
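A full-equality check of the kind described above has to compare every member of each entry, including the flags. The following user-space sketch shows the shape of such a comparison. `struct cpuid_entry` and `cpuid_entries_equal()` are hypothetical: the struct is a simplified stand-in whose member list is assumed for illustration, and this is not the kernel's `kvm_cpuid_check_equal()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one CPUID entry. */
struct cpuid_entry {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax, ebx, ecx, edx;
};

/* True only if both arrays have the same length and identical entries. */
bool cpuid_entries_equal(const struct cpuid_entry *a, size_t na,
			 const struct cpuid_entry *b, size_t nb)
{
	if (na != nb)
		return false;

	for (size_t i = 0; i < na; i++) {
		if (a[i].function != b[i].function ||
		    a[i].index != b[i].index ||
		    a[i].flags != b[i].flags ||	/* the member that is easy to miss */
		    a[i].eax != b[i].eax ||
		    a[i].ebx != b[i].ebx ||
		    a[i].ecx != b[i].ecx ||
		    a[i].edx != b[i].edx)
			return false;
	}
	return true;
}
```

Comparing member by member rather than with `memcmp()` is a deliberate choice in this sketch, since it avoids depending on padding bytes if the struct layout ever changes.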
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson authored
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments() · aa3b39f3
      Vitaly Kuznetsov authored
      Commit 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after
      svm->vmcb is initialized") re-arranged svm_vcpu_init_msrpm() call in
      svm_create_vcpu(), thus making the comment about vmcb being NULL
      obsolete. Drop it.
      
      While at it, drop the superfluous vmcb_is_clean() check: vmcb_mark_dirty()
      is a bit flip, an extra check is unlikely to bring any performance gain.
      Drop now-unneeded vmcb_is_clean() helper as well.
      
      Fixes: 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real · 38dfa830
      Vitaly Kuznetsov authored
      Commit c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      introduced enlightened MSR-Bitmap support for KVM-on-Hyper-V but it didn't
      actually enable the support. Similar to enlightened NPT TLB flush and
      direct TLB flush features, the guest (KVM) has to tell L0 (Hyper-V) that
      it's using the feature by setting the appropriate feature bit in VMCB
      control area (sw reserved fields).
      
      Fixes: c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't kill SEV guest if SMAP erratum triggers in usermode · cdf85e0c
      Sean Christopherson authored
      Inject a #GP instead of synthesizing triple fault to try to avoid killing
      the guest if emulation of an SEV guest fails due to encountering the SMAP
      erratum.  The injected #GP may still be fatal to the guest, e.g. if the
      userspace process is providing critical functionality, but KVM should
      make every attempt to keep the guest alive.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't apply SEV+SMAP workaround on code fetch or PT access · 3280cc22
      Sean Christopherson authored
      Resume the guest instead of synthesizing a triple fault shutdown if the
      instruction bytes buffer is empty due to the #NPF being on the code fetch
      itself or on a page table access.  The SMAP erratum applies if and only if
      the code fetch was successful and ucode's subsequent data read from the
      code page encountered a SMAP violation.  In practice, the guest is likely
      hosed either way, but crashing the guest on a code fetch to emulated MMIO
      is technically wrong according to the behavior described in the APM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3280cc22
    • Sean Christopherson's avatar
      KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn buffer · 04c40f34
      Sean Christopherson authored
      Inject #UD if KVM attempts emulation for an SEV guest without an insn
      buffer and instruction decoding is required.  The previous behavior of
      allowing emulation if there is no insn buffer is undesirable as doing so
      means KVM is reading guest private memory and thus decoding cyphertext,
      i.e. is emulating garbage.  The check was previously necessary as the
      emulation type was not provided, i.e. SVM needed to allow emulation to
      handle completion of emulation after exiting to userspace to handle I/O.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      04c40f34
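      The rule described in the "Inject #UD on attempted emulation" commit
      above can be modelled roughly as follows. This is a simplified sketch
      under stated assumptions, not the SVM implementation: the emul_request
      struct, its fields and the emul_verdict enum are hypothetical, and the
      SMAP erratum handling from the neighbouring commits is left out.

```c
#include <stdbool.h>

/* Hypothetical outcome of the "can this instruction be emulated?" check. */
enum emul_verdict {
	EMUL_ALLOW,
	EMUL_INJECT_UD,
};

/* Hypothetical description of one emulation attempt. */
struct emul_request {
	bool sev_guest;       /* guest memory is encrypted */
	bool needs_decode;    /* false when only completing earlier emulation,
	                         e.g. after userspace handled I/O */
	bool has_insn_buffer; /* CPU supplied the instruction bytes */
};

static enum emul_verdict can_emulate_sketch(const struct emul_request *req)
{
	/* KVM can read the code stream of a !SEV guest from memory. */
	if (!req->sev_guest)
		return EMUL_ALLOW;

	/* Nothing needs to be decoded, so no guest memory is read. */
	if (!req->needs_decode)
		return EMUL_ALLOW;

	/* Decoding without CPU-provided bytes would decode ciphertext. */
	if (!req->has_insn_buffer)
		return EMUL_INJECT_UD;

	return EMUL_ALLOW;
}
```

      Passing the emulation type is what makes the needs_decode distinction
      possible; without it, the check had to allow every request that lacked
      an insn buffer.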
    • Sean Christopherson's avatar
      KVM: SVM: WARN if KVM attempts emulation on #UD or #GP for SEV guests · 132627c6
      Sean Christopherson authored
      WARN if KVM attempts to emulate in response to #UD or #GP for SEV guests,
      i.e. if KVM intercepts #UD or #GP, as emulation on any fault except #NPF
      is impossible since KVM cannot read guest private memory to get the code
      stream, and the CPU's DecodeAssists feature only provides the instruction
      bytes on #NPF.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-7-seanjc@google.com>
      [Warn on EMULTYPE_TRAP_UD_FORCED according to Liam Merwick's review. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      132627c6
    • Sean Christopherson's avatar
      KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Sean Christopherson authored
      Pass the emulation type to kvm_x86_ops.can_emulate_instruction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4d31d9ef
    • Sean Christopherson's avatar
      KVM: SVM: Explicitly require DECODEASSISTS to enable SEV support · c532f290
      Sean Christopherson authored
      Add a sanity check on DECODEASSISTS being supported if SEV is supported, as
      KVM cannot read guest private memory and thus relies on the CPU to
      provide the instruction byte stream on #NPF for emulation.  The intent of
      the check is to document the dependency; it should never fail in practice
      as producing hardware that supports SEV but not DECODEASSISTS would be
      nonsensical.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c532f290
    • Sean Christopherson's avatar
      KVM: SVM: Don't intercept #GP for SEV guests · 0b0be065
      Sean Christopherson authored
      Never intercept #GP for SEV guests as reading SEV guest private memory
      will return cyphertext, i.e. emulating on #GP can't work as intended.
      
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0b0be065
    • Sean Christopherson's avatar
      Revert "KVM: SVM: avoid infinite loop on NPF from bad address" · 31c25585
      Sean Christopherson authored
      Revert a completely broken check on an "invalid" RIP in SVM's workaround
      for the DecodeAssists SMAP errata.  kvm_vcpu_gfn_to_memslot() obviously
      expects a gfn, i.e. operates in the guest physical address space, whereas
      RIP is a virtual (not even linear) address.  The "fix" worked for the
      problematic KVM selftest because the test identity mapped RIP.
      
      Fully revert the hack instead of trying to translate RIP to a GPA, as the
      non-SEV case is now handled earlier, and KVM cannot access guest page
      tables to translate RIP.
      
      This reverts commit e72436bc.
      
      Fixes: e72436bc ("KVM: SVM: avoid infinite loop on NPF from bad address")
      Reported-by: Liam Merwick <liam.merwick@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      31c25585
    • Sean Christopherson's avatar
      KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests · 55467fcd
      Sean Christopherson authored
      Always signal that emulation is possible for !SEV guests regardless of
      whether or not the CPU provided a valid instruction byte stream.  KVM can
      read all guest state (memory and registers) for !SEV guests, i.e. can
      fetch the code stream from memory even if the CPU failed to do so because
      of the SMAP errata.
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      55467fcd
    • Denis Valeev's avatar
      KVM: x86: nSVM: skip eax alignment check for non-SVM instructions · 47c28d43
      Denis Valeev authored
      The bug occurs on a #GP triggered by the VMware backdoor when the eax
      value is unaligned. The eax alignment check should not be applied to
      non-SVM instructions because it causes their emulation to be incorrectly
      skipped.
      Apply the alignment check only to SVM instructions to fix this.
      
      Fixes: d1cba6c9 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
      Signed-off-by: Denis Valeev <lemniscattaden@gmail.com>
      Message-Id: <Yexlhaoe1Fscm59u@q>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47c28d43
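      The corrected ordering in the "skip eax alignment check for non-SVM
      instructions" commit above can be sketched as below. This is an
      illustrative model only: the gp_opcode and gp_action enums and
      gp_intercept_sketch() are hypothetical names, and treating a failed
      alignment check as "reinject the #GP" is an assumption of the sketch.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u /* the 4K alignment named in the Fixes: tag */

/* Hypothetical classification of the instruction that raised #GP. */
enum gp_opcode {
	GP_NON_SVM_INSTR, /* e.g. a VMware backdoor port access */
	GP_SVM_INSTR,     /* an SVM instruction taking an address in rax */
};

/* Hypothetical outcomes of the intercept handler. */
enum gp_action {
	GP_EMULATE,
	GP_REINJECT,
};

static enum gp_action gp_intercept_sketch(enum gp_opcode op, uint64_t rax)
{
	/* Non-SVM instructions place no alignment requirement on rax. */
	if (op == GP_NON_SVM_INSTR)
		return GP_EMULATE;

	/* Only SVM instructions are subject to the alignment check. */
	if (rax & (PAGE_SIZE_SKETCH - 1))
		return GP_REINJECT;

	return GP_EMULATE;
}
```

      With the check applied before the opcode test, as in the buggy version,
      a backdoor access with an unaligned eax would never reach emulation.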
    • Like Xu's avatar
      KVM: x86/cpuid: Exclude unpermitted xfeatures sizes at KVM_GET_SUPPORTED_CPUID · 1ffce092
      Like Xu authored
      With the help of xstate_get_guest_group_perm(), KVM can exclude unpermitted
      xfeatures in cpuid.0xd.0.eax, in which case the corresponding xfeatures
      sizes should also be matched to the permitted xfeatures.
      
      To fix this inconsistency, the permitted_xcr0 and permitted_xss are defined
      consistently, which implies 'supported' plus certain permissions for this
      task, and it also fixes cpuid.0xd.1.ebx and later leaf-by-leaf queries.
      
      Fixes: 445ecdf7 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220125115223.33707-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1ffce092
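      To see why the reported sizes must follow the permitted mask in the
      "Exclude unpermitted xfeatures sizes" commit above, consider this
      minimal sketch. The component table holds illustrative values rather
      than data read from CPUID, and xcomp_table and xstate_size_sketch()
      are hypothetical names.

```c
#include <stdint.h>

/* Legacy XSAVE region (512 bytes) plus the XSAVE header (64 bytes). */
#define XSAVE_LEGACY_AND_HEADER 576u

struct xcomp {
	uint32_t size;   /* bytes used by the component */
	uint32_t offset; /* offset of the component in the XSAVE area */
};

/* Illustrative layout indexed by xfeature bit; unlisted entries are zero. */
static const struct xcomp xcomp_table[64] = {
	[2]  = { 256,  576 },  /* a small extended component */
	[18] = { 8192, 2816 }, /* a large, permission-gated component */
};

/* Size of the XSAVE area needed for the extended features in mask. */
static uint32_t xstate_size_sketch(uint64_t mask)
{
	uint32_t size = XSAVE_LEGACY_AND_HEADER;

	for (int bit = 2; bit < 64; bit++) {
		uint32_t end;

		if (!(mask & (1ULL << bit)))
			continue;
		end = xcomp_table[bit].offset + xcomp_table[bit].size;
		if (end > size)
			size = end;
	}
	return size;
}
```

      If the feature mask is filtered down to the permitted set but the size
      is still computed from the supported set, userspace sees a mask without
      bit 18 next to a size that still includes it.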
    • Wanpeng Li's avatar
      KVM: LAPIC: Also cancel preemption timer during SET_LAPIC · 35fe7cfb
      Wanpeng Li authored
      The warning below is triggered during guest reboot.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1931 at arch/x86/kvm/x86.c:10322 kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        CPU: 0 PID: 1931 Comm: qemu-system-x86 Tainted: G          I       5.17.0-rc1+ #5
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        Call Trace:
         <TASK>
         kvm_vcpu_ioctl+0x279/0x710 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fd39797350b
      
      This can be triggered by not exposing tsc-deadline mode and doing a reboot in
      the guest. The lapic_shutdown() function, which is called in the sys_reboot
      path, does not disarm the in-flight timer; it just masks LVTT. lapic_shutdown()
      clears APIC state with LVT_MASKED set and the timer-mode bits at 0, which can
      trigger a timer-mode switch between tsc-deadline and oneshot/periodic, which in
      turn can result in the preemption timer being cancelled in apic_update_lvtt().
      However, we can't depend on this when tsc-deadline mode is not exposed and the
      oneshot/periodic modes are emulated by the preemption timer. QEMU synchronises
      state around reset, so cancel the preemption timer under KVM_SET_LAPIC.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1643102220-35667-1-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      35fe7cfb
    • Jim Mattson's avatar
      KVM: VMX: Remove vmcs_config.order · 519669cc
      Jim Mattson authored
      The maximum size of a VMCS (or VMXON region) is 4096. By definition,
      these are order 0 allocations.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220125004359.147600-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      519669cc
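      The arithmetic behind the "Remove vmcs_config.order" commit above can be
      checked with a short sketch. It assumes 4 KiB pages, as on x86;
      PAGE_SHIFT_SKETCH and order_for_size() are hypothetical names that
      mirror the usual "smallest power-of-two page count" definition of an
      allocation order.

```c
#include <stddef.h>

#define PAGE_SHIFT_SKETCH 12 /* assumes 4 KiB pages */

/* Smallest order such that (1 << order) pages hold size bytes. */
static unsigned int order_for_size(size_t size)
{
	size_t page_size = (size_t)1 << PAGE_SHIFT_SKETCH;
	size_t pages = (size + page_size - 1) >> PAGE_SHIFT_SKETCH;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

      Any size up to 4096 bytes fits in a single page, so the order is 0 and
      there is nothing left for a per-configuration order field to record.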
  5. 25 Jan, 2022 2 commits