1. 28 Jan, 2022 9 commits
    • Hou Wenlong's avatar
      KVM: eventfd: Fix false positive RCU usage warning · 6a0c6170
      Hou Wenlong authored
      Fix the following false positive warning:
       =============================
       WARNING: suspicious RCU usage
       5.16.0-rc4+ #57 Not tainted
       -----------------------------
       arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by fc_vcpu 0/330:
        #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
        #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
        #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]
      
       stack backtrace:
       CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x44/0x57
        kvm_notify_acked_gsi+0x6b/0x70 [kvm]
        kvm_notify_acked_irq+0x8d/0x180 [kvm]
        kvm_ioapic_update_eoi+0x92/0x240 [kvm]
        kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
        handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
        vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
        vcpu_enter_guest+0x66e/0x1860 [kvm]
        kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
        kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
        __x64_sys_ioctl+0x89/0xc0
        do_syscall_64+0x3a/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
      kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
      kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
      a false positive warning. So use hlist_for_each_entry_srcu() instead of
      hlist_for_each_entry_rcu().
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarHou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6a0c6170
    • Vitaly Kuznetsov's avatar
      KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use · 6cbbaab6
      Vitaly Kuznetsov authored
      Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
      Enlightened VMCS interface is in use:
      
      "Any VMREAD or VMWRITE instructions while an enlightened VMCS is
      active is unsupported and can result in unexpected behavior.""
      
      Windows 11 + WSL2 seems to ignore this, attempts to VMREAD VMCS field
      0x4404 ("VM-exit interruption information") are observed. Failing
      these attempts with nested_vmx_failInvalid() makes such guests
      unbootable.
      
      Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
      eventually but for the time being we need a workaround. (Temporary) allow
      VMREAD to get data from the currently loaded Enlightened VMCS.
      
      Note: VMWRITE instructions remain forbidden, it is not clear how to
      handle them properly and hopefully won't ever be needed.
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6cbbaab6
    • Vitaly Kuznetsov's avatar
      KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread() · 892a42c1
      Vitaly Kuznetsov authored
      In preparation to allowing reads from Enlightened VMCS from
      handle_vmread(), implement evmcs_field_offset() to get the correct
      read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V,
      is almost what's needed but a few things need to be adjusted. First,
      WARN_ON() is unacceptable for handle_vmread() as any field can (in
      theory) be supplied by the guest and not all fields are defined in
      eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields).
      It also sounds like a good idea to WARN_ON() if such fields are ever
      accessed by KVM-on-Hyper-V.
      
      Implement dedicated evmcs_field_offset() helper.
      
      No functional change intended.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      892a42c1
    • Vitaly Kuznetsov's avatar
      KVM: nVMX: Rename vmcs_to_field_offset{,_table} · 2423a4c0
      Vitaly Kuznetsov authored
      vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
      blob which is not supposed to be accessed directly. In fact,
      vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.
      
      Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
      
      No functional change intended.
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2423a4c0
    • Vitaly Kuznetsov's avatar
      KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER · 7a601e2c
      Vitaly Kuznetsov authored
      Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field,
      PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes
      sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to enable 'save VMX-preemption timer value' when eVMCS is in use, the
      change is aimed at making the filtering future proof.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7a601e2c
    • Vitaly Kuznetsov's avatar
      KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS · f80ae0ef
      Vitaly Kuznetsov authored
      Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS,
      MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair,
      MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way
      MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely
      on 'true' MSR data.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change
      is aimed at making the filtering future proof.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f80ae0ef
    • Paolo Bonzini's avatar
      selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP · b19c99b9
      Paolo Bonzini authored
      Provide coverage for the new API.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b19c99b9
    • Paolo Bonzini's avatar
      KVM: x86: add system attribute to retrieve full set of supported xsave states · dd6e6312
      Paolo Bonzini authored
      Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
      VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
      have not been enabled.  Probing those, for example so that they can be
      passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
      The latter is in fact worse, even though that is what the rest of the
      API uses, because it would require supported_xcr0 to be moved from the
      KVM module to the kernel just for this use.  In addition, the value
      would be nonsensical (or an error would have to be returned) until
      the KVM module is loaded in.
      
      Therefore, to limit the growth of system ioctls, add a /dev/kvm
      variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
      with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      dd6e6312
    • Sean Christopherson's avatar
      KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr · 56f289a8
      Sean Christopherson authored
      Add a helper to handle converting the u64 userspace address embedded in
      struct kvm_device_attr into a userspace pointer, it's all too easy to
      forget the intermediate "unsigned long" cast as well as the truncation
      check.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      56f289a8
  2. 26 Jan, 2022 24 commits
    • Paolo Bonzini's avatar
      selftests: kvm: move vm_xsave_req_perm call to amx_test · dd4516ae
      Paolo Bonzini authored
      There is no need for tests other than amx_test to enable dynamic xsave
      states.  Remove the call to vm_xsave_req_perm from generic code,
      and move it inside the test.  While at it, allow customizing the bit
      that is requested, so that future tests can use it differently.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      dd4516ae
    • Like Xu's avatar
      KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time · 05a9e065
      Like Xu authored
      XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
      both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS
      also needs to update kvm_update_cpuid_runtime(). In the above cases, the
      size in bytes of the XSAVE area containing all states enabled by XCR0 or
      (XCRO | IA32_XSS) needs to be updated.
      
      For simplicity and consistency, existing helpers are used to write values
      and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      05a9e065
    • Like Xu's avatar
      KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS · 4c282e51
      Like Xu authored
      Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the
      size in bytes of the XSAVE area is affected by the states enabled in XSS.
      
      Fixes: 20300099 ("kvm: vmx: add MSR logic for XSAVES")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      [sean: split out as a separate patch, adjust Fixes tag]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4c282e51
    • Xiaoyao Li's avatar
      KVM: x86: Keep MSR_IA32_XSS unchanged for INIT · be4f3b3f
      Xiaoyao Li authored
      It has been corrected from SDM version 075 that MSR_IA32_XSS is reset to
      zero on Power up and Reset but keeps unchanged on INIT.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      be4f3b3f
    • Sean Christopherson's avatar
      KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2} · 811f95ff
      Sean Christopherson authored
      Free the "struct kvm_cpuid_entry2" array on successful post-KVM_RUN
      KVM_SET_CPUID{,2} to fix a memory leak, the callers of kvm_set_cpuid()
      free the array only on failure.
      
       BUG: memory leak
       unreferenced object 0xffff88810963a800 (size 2048):
        comm "syz-executor025", pid 3610, jiffies 4294944928 (age 8.080s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 0d 00 00 00  ................
          47 65 6e 75 6e 74 65 6c 69 6e 65 49 00 00 00 00  GenuntelineI....
        backtrace:
          [<ffffffff814948ee>] kmalloc_node include/linux/slab.h:604 [inline]
          [<ffffffff814948ee>] kvmalloc_node+0x3e/0x100 mm/util.c:580
          [<ffffffff814950f2>] kvmalloc include/linux/slab.h:732 [inline]
          [<ffffffff814950f2>] vmemdup_user+0x22/0x100 mm/util.c:199
          [<ffffffff8109f5ff>] kvm_vcpu_ioctl_set_cpuid2+0x8f/0xf0 arch/x86/kvm/cpuid.c:423
          [<ffffffff810711b9>] kvm_arch_vcpu_ioctl+0xb99/0x1e60 arch/x86/kvm/x86.c:5251
          [<ffffffff8103e92d>] kvm_vcpu_ioctl+0x4ad/0x950 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4066
          [<ffffffff815afacc>] vfs_ioctl fs/ioctl.c:51 [inline]
          [<ffffffff815afacc>] __do_sys_ioctl fs/ioctl.c:874 [inline]
          [<ffffffff815afacc>] __se_sys_ioctl fs/ioctl.c:860 [inline]
          [<ffffffff815afacc>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
          [<ffffffff844a3335>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff844a3335>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+be576ad7655690586eec@syzkaller.appspotmail.com
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220125210445.2053429-1-seanjc@google.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      811f95ff
    • Sean Christopherson's avatar
      KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd
      Sean Christopherson authored
      WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
      VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
      a "real" shadow VMCS for L2.
      
      The previous code WARNed but continued anyway with the allocation,
      presumably in an attempt to avoid NULL pointer dereference.
      However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
      indeed the sole caller does:
      
      	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
      		goto out_shadow_vmcs;
      
      which makes it not a useful attempt.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220125220527.2093146-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d6e656cd
    • Sean Christopherson's avatar
      KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest · 4cf3d3eb
      Sean Christopherson authored
      Don't skip the vmcall() in l2_guest_code() prior to re-entering L2, doing
      so will result in L2 running to completion, popping '0' off the stack for
      RET, jumping to address '0', and ultimately dying with a triple fault
      shutdown.
      
      It's not at all obvious why the test re-enters L2 and re-executes VMCALL,
      but presumably it serves a purpose.  The VMX path doesn't skip vmcall(),
      and the test can't possibly have passed on SVM, so just do what VMX does.
      
      Fixes: d951b221 ("KVM: selftests: smm_test: Test SMM enter from L2")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220125221725.2101126-1-seanjc@google.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4cf3d3eb
    • Vitaly Kuznetsov's avatar
      KVM: x86: Check .flags in kvm_cpuid_check_equal() too · 033a3ea5
      Vitaly Kuznetsov authored
      kvm_cpuid_check_equal() checks for the (full) equality of the supplied
      CPUID data so .flags need to be checked too.
      Reported-by: default avatarSean Christopherson <seanjc@google.com>
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220126131804.2839410-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      033a3ea5
    • Sean Christopherson's avatar
      KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson authored
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f7e57078
    • Vitaly Kuznetsov's avatar
      KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments() · aa3b39f3
      Vitaly Kuznetsov authored
      Commit 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after
      svm->vmcb is initialized") re-arranged svm_vcpu_init_msrpm() call in
      svm_create_vcpu(), thus making the comment about vmcb being NULL
      obsolete. Drop it.
      
      While on it, drop superfluous vmcb_is_clean() check: vmcb_mark_dirty()
      is a bit flip, an extra check is unlikely to bring any performance gain.
      Drop now-unneeded vmcb_is_clean() helper as well.
      
      Fixes: 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-2-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      aa3b39f3
    • Vitaly Kuznetsov's avatar
      KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real · 38dfa830
      Vitaly Kuznetsov authored
      Commit c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      introduced enlightened MSR-Bitmap support for KVM-on-Hyper-V but it didn't
      actually enable the support. Similar to enlightened NPT TLB flush and
      direct TLB flush features, the guest (KVM) has to tell L0 (Hyper-V) that
      it's using the feature by setting the appropriate feature fit in VMCB
      control area (sw reserved fields).
      
      Fixes: c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-3-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      38dfa830
    • Sean Christopherson's avatar
      KVM: SVM: Don't kill SEV guest if SMAP erratum triggers in usermode · cdf85e0c
      Sean Christopherson authored
      Inject a #GP instead of synthesizing triple fault to try to avoid killing
      the guest if emulation of an SEV guest fails due to encountering the SMAP
      erratum.  The injected #GP may still be fatal to the guest, e.g. if the
      userspace process is providing critical functionality, but KVM should
      make every attempt to keep the guest alive.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-10-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cdf85e0c
    • Sean Christopherson's avatar
      KVM: SVM: Don't apply SEV+SMAP workaround on code fetch or PT access · 3280cc22
      Sean Christopherson authored
      Resume the guest instead of synthesizing a triple fault shutdown if the
      instruction bytes buffer is empty due to the #NPF being on the code fetch
      itself or on a page table access.  The SMAP errata applies if and only if
      the code fetch was successful and ucode's subsequent data read from the
      code page encountered a SMAP violation.  In practice, the guest is likely
      hosed either way, but crashing the guest on a code fetch to emulated MMIO
      is technically wrong according to the behavior described in the APM.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-9-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3280cc22
    • Sean Christopherson's avatar
      KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn buffer · 04c40f34
      Sean Christopherson authored
      Inject #UD if KVM attempts emulation for an SEV guests without an insn
      buffer and instruction decoding is required.  The previous behavior of
      allowing emulation if there is no insn buffer is undesirable as doing so
      means KVM is reading guest private memory and thus decoding cyphertext,
      i.e. is emulating garbage.  The check was previously necessary as the
      emulation type was not provided, i.e. SVM needed to allow emulation to
      handle completion of emulation after exiting to userspace to handle I/O.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-8-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      04c40f34
    • Sean Christopherson's avatar
      KVM: SVM: WARN if KVM attempts emulation on #UD or #GP for SEV guests · 132627c6
      Sean Christopherson authored
      WARN if KVM attempts to emulate in response to #UD or #GP for SEV guests,
      i.e. if KVM intercepts #UD or #GP, as emulation on any fault except #NPF
      is impossible since KVM cannot read guest private memory to get the code
      stream, and the CPU's DecodeAssists feature only provides the instruction
      bytes on #NPF.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-7-seanjc@google.com>
      [Warn on EMULTYPE_TRAP_UD_FORCED according to Liam Merwick's review. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      132627c6
    • Sean Christopherson's avatar
      KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Sean Christopherson authored
      Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4d31d9ef
    • Sean Christopherson's avatar
      KVM: SVM: Explicitly require DECODEASSISTS to enable SEV support · c532f290
      Sean Christopherson authored
      Add a sanity check on DECODEASSIST being support if SEV is supported, as
      KVM cannot read guest private memory and thus relies on the CPU to
      provide the instruction byte stream on #NPF for emulation.  The intent of
      the check is to document the dependency, it should never fail in practice
      as producing hardware that supports SEV but not DECODEASSISTS would be
      non-sensical.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c532f290
    • Sean Christopherson's avatar
      KVM: SVM: Don't intercept #GP for SEV guests · 0b0be065
      Sean Christopherson authored
      Never intercept #GP for SEV guests as reading SEV guest private memory
      will return cyphertext, i.e. emulating on #GP can't work as intended.
      
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0b0be065
    • Sean Christopherson's avatar
      Revert "KVM: SVM: avoid infinite loop on NPF from bad address" · 31c25585
      Sean Christopherson authored
      Revert a completely broken check on an "invalid" RIP in SVM's workaround
      for the DecodeAssists SMAP errata.  kvm_vcpu_gfn_to_memslot() obviously
      expects a gfn, i.e. operates in the guest physical address space, whereas
      RIP is a virtual (not even linear) address.  The "fix" worked for the
      problematic KVM selftest because the test identity mapped RIP.
      
      Fully revert the hack instead of trying to translate RIP to a GPA, as the
      non-SEV case is now handled earlier, and KVM cannot access guest page
      tables to translate RIP.
      
      This reverts commit e72436bc.
      
      Fixes: e72436bc ("KVM: SVM: avoid infinite loop on NPF from bad address")
      Reported-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      31c25585
    • Sean Christopherson's avatar
      KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests · 55467fcd
      Sean Christopherson authored
      Always signal that emulation is possible for !SEV guests regardless of
      whether or not the CPU provided a valid instruction byte stream.  KVM can
      read all guest state (memory and registers) for !SEV guests, i.e. can
      fetch the code stream from memory even if the CPU failed to do so because
      of the SMAP errata.
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      55467fcd
    • Denis Valeev's avatar
      KVM: x86: nSVM: skip eax alignment check for non-SVM instructions · 47c28d43
      Denis Valeev authored
      The bug occurs on #GP triggered by VMware backdoor when eax value is
      unaligned. eax alignment check should not be applied to non-SVM
      instructions because it leads to incorrect omission of the instructions
      emulation.
      Apply the alignment check only to SVM instructions to fix.
      
      Fixes: d1cba6c9 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
      Signed-off-by: default avatarDenis Valeev <lemniscattaden@gmail.com>
      Message-Id: <Yexlhaoe1Fscm59u@q>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      47c28d43
    • Like Xu's avatar
      KVM: x86/cpuid: Exclude unpermitted xfeatures sizes at KVM_GET_SUPPORTED_CPUID · 1ffce092
      Like Xu authored
      With the help of xstate_get_guest_group_perm(), KVM can exclude unpermitted
      xfeatures in cpuid.0xd.0.eax, in which case the corresponding xfeatures
      sizes should also be matched to the permitted xfeatures.
      
      To fix this inconsistency, the permitted_xcr0 and permitted_xss are defined
      consistently, which implies 'supported' plus certain permissions for this
      task, and it also fixes cpuid.0xd.1.ebx and later leaf-by-leaf queries.
      
      Fixes: 445ecdf7 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20220125115223.33707-1-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ffce092
    • Wanpeng Li's avatar
      KVM: LAPIC: Also cancel preemption timer during SET_LAPIC · 35fe7cfb
      Wanpeng Li authored
      The below warning is splatting during guest reboot.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1931 at arch/x86/kvm/x86.c:10322 kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        CPU: 0 PID: 1931 Comm: qemu-system-x86 Tainted: G          I       5.17.0-rc1+ #5
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        Call Trace:
         <TASK>
         kvm_vcpu_ioctl+0x279/0x710 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fd39797350b
      
      This can be triggered by not exposing tsc-deadline mode and doing a reboot in
      the guest. The lapic_shutdown() function which is called in sys_reboot path
      will not disarm the flying timer, it just masks LVTT. lapic_shutdown() clears
      APIC state w/ LVT_MASKED and timer-mode bit is 0, this can trigger timer-mode
      switch between tsc-deadline and oneshot/periodic, which can result in preemption
      timer be cancelled in apic_update_lvtt(). However, We can't depend on this when
      not exposing tsc-deadline mode and oneshot/periodic modes emulated by preemption
      timer. Qemu will synchronise states around reset, let's cancel preemption timer
      under KVM_SET_LAPIC.
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1643102220-35667-1-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      35fe7cfb
    • Jim Mattson's avatar
      KVM: VMX: Remove vmcs_config.order · 519669cc
      Jim Mattson authored
      The maximum size of a VMCS (or VMXON region) is 4096. By definition,
      these are order 0 allocations.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Message-Id: <20220125004359.147600-1-jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      519669cc
  3. 25 Jan, 2022 4 commits
    • Quanfa Fu's avatar
      KVM/X86: Make kvm_vcpu_reload_apic_access_page() static · d081a343
      Quanfa Fu authored
      Make kvm_vcpu_reload_apic_access_page() static
      as it is no longer invoked directly by vmx
      and it is also no longer exported.
      
      No functional change intended.
      Signed-off-by: default avatarQuanfa Fu <quanfafu@gmail.com>
      Message-Id: <20211219091446.174584-1-quanfafu@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d081a343
    • David Matlack's avatar
      KVM: selftests: Re-enable access_tracking_perf_test · de1956f4
      David Matlack authored
      This selftest was accidentally removed by commit 6a581508
      ("selftest: KVM: Add intra host migration tests"). Add it back.
      
      Fixes: 6a581508 ("selftest: KVM: Add intra host migration tests")
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20220120003826.2805036-1-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      de1956f4
    • Sean Christopherson's avatar
      KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow · b9bed78e
      Sean Christopherson authored
      Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
      breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
      MOVSS blocking is active.  Setting the flag is necessary to make VM-Entry
      consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
      set and STI/MOVSS blocking is true, then the previous instruction must
      have been STI or MOV/POP, and therefore a single-step #DB must be pending
      since the RFLAGS.TF cannot have been set by the previous instruction,
      i.e. the one instruction delay after setting RFLAGS.TF must have already
      expired.
      
      Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
      when recording guest state as part of a VM-Exit, but #DB VM-Exits
      intentionally do not treat the #DB as "guest state" as interception of
      the #DB effectively makes the #DB host-owned, thus KVM needs to manually
      set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.
      
      Note, although this bug can be triggered by guest userspace, doing so
      requires IOPL=3, and guest userspace running with IOPL=3 has full access
      to all I/O ports (from the guest's perspective) and can crash/reboot the
      guest any number of ways.  IOPL=3 is required because STI blocking kicks
      in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
      takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.
      
      MOVSS blocking can be initiated by userspace, but can be coincident with
      a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
      executed in the MOVSS shadow.  MOV DR #GPs at CPL>0, thus MOVSS blocking
      is problematic only for CPL0 (and only if the guest is crazy enough to
      access a DR in a MOVSS shadow).  All other sources of #DBs are either
      suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
      are mutually exclusive with MOVSS blocking (T-bit task switch), or are
      already handled by KVM (ICEBP, a.k.a. INT1).
      
      This bug was originally found by running tests[1] created for XSA-308[2].
      Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
      presumably why the Xen bug was deemed to be an exploitable DOS from guest
      userspace.  KVM already handles ICEBP by skipping the ICEBP instruction
      and thus clears MOVSS blocking as a side effect of its "emulation".
      
      [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html
      [2] https://xenbits.xen.org/xsa/advisory-308.htmlReported-by: default avatarDavid Woodhouse <dwmw2@infradead.org>
      Reported-by: default avatarAlexander Graf <graf@amazon.de>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220120000624.655815-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b9bed78e
    • Vitaly Kuznetsov's avatar
      KVM: x86: Move CPUID.(EAX=0x12,ECX=1) mangling to __kvm_update_cpuid_runtime() · 5c89be1d
      Vitaly Kuznetsov authored
      Full equality check of CPUID data on update (kvm_cpuid_check_equal()) may
      fail for SGX enabled CPUs as CPUID.(EAX=0x12,ECX=1) is currently being
      mangled in kvm_vcpu_after_set_cpuid(). Move it to
      __kvm_update_cpuid_runtime() and split off cpuid_get_supported_xcr0()
      helper  as 'vcpu->arch.guest_supported_xcr0' update needs (logically)
      to stay in kvm_vcpu_after_set_cpuid().
      
      Cc: stable@vger.kernel.org
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220124103606.2630588-2-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5c89be1d
  4. 24 Jan, 2022 3 commits
    • Xianting Tian's avatar
      KVM: remove async parameter of hva_to_pfn_remapped() · 1625566e
      Xianting Tian authored
      The async parameter of hva_to_pfn_remapped() is not used, so remove it.
      Signed-off-by: default avatarXianting Tian <xianting.tian@linux.alibaba.com>
      Message-Id: <20220124020456.156386-1-xianting.tian@linux.alibaba.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1625566e
    • Peter Zijlstra's avatar
      x86,kvm/xen: Remove superfluous .fixup usage · adb759e5
      Peter Zijlstra authored
      Commit 14243b38 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and
      event channel delivery") adds superfluous .fixup usage after the whole
      .fixup section was removed in commit e5eefda5 ("x86: Remove .fixup
      section").
      
      Fixes: 14243b38 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Message-Id: <20220123124219.GH20638@worktop.programming.kicks-ass.net>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      adb759e5
    • Sean Christopherson's avatar
      KVM: VMX: Zero host's SYSENTER_ESP iff SYSENTER is NOT used · 94fea1d8
      Sean Christopherson authored
      Zero vmcs.HOST_IA32_SYSENTER_ESP when initializing *constant* host state
      if and only if SYSENTER cannot be used, i.e. the kernel is a 64-bit
      kernel and is not emulating 32-bit syscalls.  As the name suggests,
      vmx_set_constant_host_state() is intended for state that is *constant*.
      When SYSENTER is used, SYSENTER_ESP isn't constant because stacks are
      per-CPU, and the VMCS must be updated whenever the vCPU is migrated to a
      new CPU.  The logic in vmx_vcpu_load_vmcs() doesn't differentiate between
      "never loaded" and "loaded on a different CPU", i.e. setting SYSENTER_ESP
      on VMCS load also handles setting correct host state when the VMCS is
      first loaded.
      
      Because a VMCS must be loaded before it is initialized during vCPU RESET,
      zeroing the field in vmx_set_constant_host_state() obliterates the value
      that was written when the VMCS was loaded.  If the vCPU is run before it
      is migrated, the subsequent VM-Exit will zero out MSR_IA32_SYSENTER_ESP,
      leading to a #DF on the next 32-bit syscall.
      
        double fault: 0000 [#1] SMP
        CPU: 0 PID: 990 Comm: stable Not tainted 5.16.0+ #97
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        EIP: entry_SYSENTER_32+0x0/0xe7
        Code: <9c> 50 eb 17 0f 20 d8 a9 00 10 00 00 74 0d 25 ff ef ff ff 0f 22 d8
        EAX: 000000a2 EBX: a8d1300c ECX: a8d13014 EDX: 00000000
        ESI: a8f87000 EDI: a8d13014 EBP: a8d12fc0 ESP: 00000000
        DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00210093
        CR0: 80050033 CR2: fffffffc CR3: 02c3b000 CR4: 00152e90
      
      Fixes: 6ab8a405 ("KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)")
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220122015211.1468758-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      94fea1d8