  1. 20 Jun, 2022 3 commits
    • KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_* · 5d76b1f8
      Sean Christopherson authored
      Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter
      values to reflect that they can hold L2's values when restoring nested
      state, e.g. if userspace restores MSRs before nested state.  As crazy as
      it seems, restoring MSRs before nested state actually works (because KVM
      goes out of its way to make it work), even though the initial MSR writes
      will hit vmcs01 despite holding L2 values.
      
      Add a related comment to vmx_enter_smm() to call out that using the
      common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and
      broken.  The few MSRs that have snapshots _could_ be fixed by taking a
      snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but
      that's just the tip of the iceberg as the rather long list of MSRs that
      aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this
      way.
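
      For reference, a rough sketch of how the renamed fields might look in
      struct nested_vmx (an excerpt only; surrounding members are elided and
      assumed):

        struct nested_vmx {
        	...
        	/*
        	 * Snapshots of state taken before a (forced) VM-Enter; these can
        	 * hold L2 values if userspace restores MSRs before nested state.
        	 */
        	u64 pre_vmenter_debugctl;
        	u64 pre_vmenter_bndcfgs;
        	...
        };
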
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case · 764643a6
      Sean Christopherson authored
      If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
      irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
      vmcs12.  When restoring nested state, e.g. after migration, without a
      nested run pending, prepare_vmcs02() will propagate
      nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
      vmcs02.GUEST_IA32_DEBUGCTL.
      
      If userspace restores nested state before MSRs, then loading garbage is a
      non-issue as loading DEBUGCTL will also update vmcs02.  But if userspace
      restores MSRs first, then KVM is responsible for propagating L2's value,
      which is actually thrown into vmcs01, into vmcs02.
      
      Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
      is all kinds of bizarre and ideally would not be supported.  Sadly, some
      VMMs do exactly that and rely on KVM to make things work.
      
      Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
      to vmcs02 across RSM may corrupt L2's DEBUGCTL.  But KVM's entire VMX+SMM
      emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the
      "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
      
      Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Snapshot pre-VM-Enter BNDCFGS for !nested_run_pending case · fa578398
      Sean Christopherson authored
      If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
      of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12.  When restoring
      nested state, e.g. after migration, without a nested run pending,
      prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
      i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.
      
      If userspace restores nested state before MSRs, then loading garbage is a
      non-issue as loading BNDCFGS will also update vmcs02.  But if userspace
      restores MSRs first, then KVM is responsible for propagating L2's value,
      which is actually thrown into vmcs01, into vmcs02.
      
      Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
      is all kinds of bizarre and ideally would not be supported.  Sadly, some
      VMMs do exactly that and rely on KVM to make things work.
      
      Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDCFGS
      to vmcs02 across RSM may corrupt L2's BNDCFGS.  But KVM's entire VMX+SMM
      emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the
      "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
      
      Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
      Fixes: 62cf9bd8 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS")
      Cc: stable@vger.kernel.org
      Cc: Lei Wang <lei4.wang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 08 Jun, 2022 2 commits
    • KVM: VMX: Enable Notify VM exit · 2f4073e0
      Tao Xu authored
      There are cases where a malicious virtual machine can get the CPU stuck
      (because event windows never open up), e.g., an infinite loop in
      microcode when delivering a nested #AC (CVE-2015-5307). No event window
      means no event (NMI, SMI or IRQ) can be delivered, leaving the CPU
      unavailable to the host or other VMs.
      
      A VMM can enable notify VM exit so that a VM exit is generated if no
      event window occurs in VM non-root mode for a specified amount of time
      (the notify window).
      
      Feature enabling:
      - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
        enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
        the expected notify window.
      - Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that userspace
        can query and enable this feature in per-VM scope (see the enabling
        sketch after this description). The argument is a 64-bit value:
        bits 63:32 hold the notify window and bits 31:0 hold flags. Currently
        supported flags:
        - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
          window provided.
        - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace when such an exit
          occurs.
      - It's safe to set the notify window to zero since an internal hardware
        threshold is added to vmcs.notify_window.
      
      VM exit handling:
      - Introduce a vCPU stat notify_window_exits to record the count of
        notify VM exits and expose it through debugfs.
      - Notify VM exit can happen incident to delivery of a vector event.
        Allow it in KVM.
      - Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
        bit is set.
      
      Nested handling:
      - Nested notify VM exits are not supported yet. Keep the same notify
        window control in vmcs02 as vmcs01, so that L1 can't escape the
        restriction of notify VM exits through launching L2 VM.
      
      Notify VM exit is defined in the latest Intel Architecture Instruction Set
      Extensions Programming Reference, chapter 9.2.
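
      A hedged userspace sketch of enabling the capability per the layout
      above, assuming the definitions from <linux/kvm.h>; the helper name and
      window value are illustrative:

        /* Enable notify VM exit on a VM fd; returns the ioctl() result. */
        static int enable_notify_vmexit(int vm_fd, unsigned int window)
        {
        	struct kvm_enable_cap cap = {
        		.cap = KVM_CAP_X86_NOTIFY_VMEXIT,
        		/* bits 63:32 = notify window, bits 31:0 = flags */
        		.args[0] = ((__u64)window << 32) |
        			   KVM_X86_NOTIFY_VMEXIT_ENABLED |
        			   KVM_X86_NOTIFY_VMEXIT_USER,
        	};

        	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }
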
      Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745
      Sean Christopherson authored
      Add kvm_caps to hold a variety of capabilities and defaults that aren't
      handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
      the amount of boilerplate code required to add a new feature.  The vast
      majority (all?) of the caps interact with vendor code and are written
      only during initialization, i.e. should be tagged __read_mostly, declared
      extern in x86.h, and exported.
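
      A minimal sketch of the idea (the members shown here are examples of
      such caps/defaults, not the exact set):

        /* x86.h (sketch) */
        struct kvm_caps {
        	bool has_tsc_control;
        	u64 default_tsc_scaling_ratio;
        	u64 supported_xcr0;
        	u64 supported_mce_cap;
        };
        extern struct kvm_caps kvm_caps;

        /* x86.c */
        struct kvm_caps kvm_caps __read_mostly;
        EXPORT_SYMBOL_GPL(kvm_caps);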
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 29 Apr, 2022 1 commit
  4. 21 Apr, 2022 1 commit
    • KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 7c69661e
      Sean Christopherson authored
      Defer APICv updates that occur while L2 is active until nested VM-Exit,
      i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
      is active and (a) stomps all over vmcs02 and (b) neglects to ever update
      vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
      APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
      becomes uninhibited while L2 is active, KVM will set various APICv controls
      in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
      running with nested_early_check=1, KVM blames L1 and chaos ensues.
      
      In all cases, ignoring vmcs02 and always deferring the inhibition change
      to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
      inhibitions cannot truly change while L2 is active (see below).
      
      IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
      Furthermore, only L2's APIC is accelerated/virtualized to the full extent
      possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
      interception will apply to the virtual APIC managed by KVM.
      The exception is the SELF_IPI register when x2APIC is enabled, but that's
      an acceptable hole.
      
      Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
      MSRs to L2, but for that to work in any sane capacity, L1 would need to
      pass through IRQs to L2 as well, and IRQs must be intercepted to enable
      virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
      VID for L2 are, for all intents and purposes, mutually exclusive.
      
      Lack of dynamic toggling is also why this scenario is all but impossible
      to encounter in KVM's current form.  But a future patch will pend an
      APICv update request _during_ vCPU creation to plug a race where a vCPU
      that's being created doesn't get included in the "all vCPUs request"
      because it's not yet visible to other vCPUs.  If userspace restores L2
      after VM creation (hello, KVM selftests), the first KVM_RUN will occur
      while L2 is active and thus service the APICv update request made during
      VM creation.
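
      A rough sketch of the deferral, with the flag name assumed for
      illustration:

        /* vmx_refresh_apicv_exec_ctrl(), sketch */
        if (is_guest_mode(vcpu)) {
        	vmx->nested.update_vmcs01_apicv_status = true;
        	return;
        }

        /* nested_vmx_vmexit(), after switching back to vmcs01, sketch */
        if (vmx->nested.update_vmcs01_apicv_status) {
        	vmx->nested.update_vmcs01_apicv_status = false;
        	kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
        }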
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420013732.3308816-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 13 Apr, 2022 3 commits
    • KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault · 9bd1f0ef
      Sean Christopherson authored
      Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double
      or triple fault.  Per the SDM, a VM-Exit isn't considered to occur during
      event delivery if the exit is due to an intercepted double fault or a
      triple fault.  Opportunistically move the default clearing (no event
      "pending") into the helper so that it's more obvious that KVM does indeed
      handle this case.
      
      Note, the double fault case is worded rather weirdly in the SDM:
      
        The original event results in a double-fault exception that causes the
        VM exit directly.
      
      Temporarily ignoring injected events, double faults can _only_ occur if
      an exception occurs while attempting to deliver a different exception,
      i.e. there's _always_ an original event.  And for injected double fault,
      while there's no original event, injected events are never subject to
      interception.
      
      Presumably the SDM is calling out that the vectoring info will be valid
      if a different exit occurs after a double fault, e.g. if a #PF occurs and
      is intercepted while vectoring #DF, then the vectoring info will show the
      double fault.  In other words, the clause can simply be read as:
      
        The VM exit is caused by a double-fault exception.
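
      A simplified sketch of the intended handling in the vmcs12 save path;
      the is_double_fault() helper is assumed here for illustration:

        /* vmcs12_save_pending_event(), simplified sketch */
        vmcs12->idt_vectoring_info_field = 0;

        /* Double/triple fault exits never occur during event delivery. */
        if ((u16)vm_exit_reason == EXIT_REASON_TRIPLE_FAULT ||
            ((u16)vm_exit_reason == EXIT_REASON_EXCEPTION_NMI &&
             is_double_fault(exit_intr_info)))
        	return;

        if (vcpu->arch.exception.injected) {
        	/* ...encode the injected event into idt_vectoring_info_field... */
        }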
      
      Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
      Cc: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry · c3634d25
      Sean Christopherson authored
      Don't modify vmcs12 exit fields except EXIT_REASON and EXIT_QUALIFICATION
      when performing a nested VM-Exit due to failed VM-Entry.  Per the SDM,
      only the two aforementioned fields are filled and "All other VM-exit
      information fields are unmodified".
      
      Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2 · 45846661
      Sean Christopherson authored
      Remove WARNs that sanity check that KVM never lets a triple fault for L2
      escape and incorrectly end up in L1.  In normal operation, the sanity
      check is perfectly valid, but it incorrectly assumes that it's impossible
      for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
      KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
      the triple fault).
      
      The WARN can currently be triggered if userspace injects a machine check
      while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
      of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
      lost on migration, will make it trivially easy for userspace to trigger
      the WARN.
      
      Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
      tempting, but wrong, especially if/when the request is saved/restored,
      e.g. if userspace restores events (including a triple fault) and then
      restores nested state (which may forcibly leave guest mode).  Ignoring
      the fact that KVM doesn't currently provide the necessary APIs, it's
      userspace's responsibility to manage pending events during save/restore.
      
        ------------[ cut here ]------------
        WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Call Trace:
         <TASK>
         vmx_leave_nested+0x30/0x40 [kvm_intel]
         vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
         kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
         kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1")
      Cc: stable@vger.kernel.org
      Cc: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 25 Feb, 2022 5 commits
    • KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Paolo Bonzini authored
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it's not necessary anymore to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1
      Paolo Bonzini authored
      These functions only operate on a given MMU, of which there is more
      than one in a vCPU (we care about two, because the third does not have
      any roots and is only used to walk guest page tables).  They do need a
      struct kvm in order to lock the mmu_lock, but they do not need anything
      else in the struct kvm_vcpu.  So, pass the vcpu->kvm directly to them.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c
      Paolo Bonzini authored
      The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
      Use the struct to have more consistency between mmu->root and
      mmu->prev_roots.
      
      The patch is entirely search and replace except for cached_root_available,
      which does not need a temporary struct kvm_mmu_root_info anymore.
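
      For reference, the struct is essentially the following, so
      mmu->root_hpa/root_pgd become mmu->root.hpa/mmu->root.pgd:

        struct kvm_mmu_root_info {
        	gpa_t pgd;
        	hpa_t hpa;
        };
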
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()" · 1a715810
      Sean Christopherson authored
      Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
      The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
      out, as the kernel will switch CR3 during __text_poke(), e.g. in response
      to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
      the mm associated with KVM, KVM will do VM-Enter => VM-Exit with a stale
      vmcs.HOST_CR3.
      
      Add a comment to explain why KVM must wait until VM-Enter is imminent to
      refresh vmcs.HOST_CR3.
      
      The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
      and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
      loaded for the KVM-associated mm while KVM has a "running" vCPU:
      
        static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
        {
      	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
      
      	...
      
      	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
      	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
      	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
        }
      
        ------------[ cut here ]------------
        KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
        WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
        Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
        CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:load_new_mm_cr3+0x82/0xe0
        RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
        RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
        RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
        R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
        FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         switch_mm_irqs_off+0x1cb/0x460
         __text_poke+0x308/0x3e0
         text_poke_bp_batch+0x168/0x220
         text_poke_finish+0x1b/0x30
         arch_jump_label_transform_apply+0x18/0x30
         static_key_slow_inc_cpuslocked+0x7c/0x90
         static_key_slow_inc+0x16/0x20
         kvm_lapic_set_base+0x116/0x190
         kvm_set_apic_base+0xa5/0xe0
         kvm_set_msr_common+0x2f4/0xf60
         vmx_set_msr+0x355/0xe70 [kvm_intel]
         kvm_set_msr_ignored_check+0x91/0x230
         kvm_emulate_wrmsr+0x36/0x120
         vmx_handle_exit+0x609/0x6c0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
         kvm_vcpu_ioctl+0x279/0x690
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      This reverts commit 15ad9762.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Reported-by: Wanpeng Li <kernellwp@gmail.com>
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Message-Id: <20220224191917.3508476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()" · bca06b85
      Sean Christopherson authored
      Undo a nested VMX fix as a step toward reverting the commit it fixed,
      15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
      as the underlying premise that "host CR3 in the vcpu thread can only be
      changed when scheduling" is wrong.
      
      This reverts commit a9f2705e.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220224191917.3508476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 10 Feb, 2022 1 commit
    • KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops · 0bcd556e
      Sean Christopherson authored
      Refactor the nested VMX PMU refresh helper to pass it a flag stating
      whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
      helper query the information by bouncing through kvm_x86_ops.pmu_ops.
      This will allow a future patch to use static_call() for the PMU ops
      without having to export any static call definitions from common x86, and
      it is also a step toward making kvm_x86_ops unexported.
      
      Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
      kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
      and would require exporting kvm_pmu_is_valid_msr().
      
      Opportunistically rename the helper to keep line lengths somewhat
      reasonable, and to better capture its high-level role.
      
      No functional change intended.
      
      Cc: Like Xu <like.xu.linux@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 28 Jan, 2022 2 commits
    • KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use · 6cbbaab6
      Vitaly Kuznetsov authored
      Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
      Enlightened VMCS interface is in use:
      
      "Any VMREAD or VMWRITE instructions while an enlightened VMCS is
      active is unsupported and can result in unexpected behavior."
      
      Windows 11 + WSL2 seems to ignore this; attempts to VMREAD VMCS field
      0x4404 ("VM-exit interruption information") are observed. Failing
      these attempts with nested_vmx_failInvalid() makes such guests
      unbootable.
      
      Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
      eventually, but for the time being we need a workaround: temporarily allow
      VMREAD to get data from the currently loaded Enlightened VMCS.
      
      Note: VMWRITE instructions remain forbidden; it is not clear how to
      handle them properly and hopefully that will never be needed.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Rename vmcs_to_field_offset{,_table} · 2423a4c0
      Vitaly Kuznetsov authored
      vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
      blob which is not supposed to be accessed directly. In fact,
      vmcs_to_field_offset{,_table} are related to the KVM-defined VMCS12 structure.
      
      Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
      
      No functional change intended.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 26 Jan, 2022 2 commits
    • KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd
      Sean Christopherson authored
      WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
      VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
      a "real" shadow VMCS for L2.
      
      The previous code WARNed but continued with the allocation anyway,
      presumably in an attempt to avoid a NULL pointer dereference.
      However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
      indeed the sole caller does:
      
      	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
      		goto out_shadow_vmcs;
      
      which makes it not a useful attempt.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220527.2093146-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson authored
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
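
      A hedged sketch of the shape of the fix (the nested_ops callback name is
      assumed):

        /* kvm_vcpu_ioctl_x86_set_vcpu_events(), simplified sketch */
        if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
        	if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm) {
        		/* Toggling SMM invalidates any in-progress nested operation. */
        		kvm_x86_ops.nested_ops->leave_nested(vcpu);
        		kvm_smm_changed(vcpu, events->smi.smm);
        	}
        	/* remaining SMM event handling unchanged */
        }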
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 07 Jan, 2022 2 commits
    • KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff
      Eric Hankland authored
      When KVM retires a guest branch instruction through emulation,
      increment any vPMCs that are configured to monitor "branch
      instructions retired," and update the sample period of those counters
      so that they will overflow at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
          emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
          that code.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20211130074221.93635-7-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs() · a9f2705e
      Lai Jiangshan authored
      The host CR3 in the vcpu thread can only be changed when scheduling,
      so commit 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      changed vmx.c to only save it in vmx_prepare_switch_to_guest().
      
      However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
      vmx_set_host_fs_gs() is called in both places, so rename it to
      vmx_set_vmcs_host_state() and make it update HOST_CR3.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 08 Dec, 2021 8 commits
  12. 02 Dec, 2021 1 commit
  13. 26 Nov, 2021 3 commits
    • KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12 · 712494de
      Sean Christopherson authored
      Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
      i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
      TLB entries for vpid02.  From L1's perspective, changing L2's VPID is
      effectively a TLB flush unless "hardware" has previously cached entries
      for the new vpid12.  Because KVM tracks only a single vpid12, KVM doesn't
      know if the new vpid12 has been used in the past and so must treat it as
      a brand new, never been used VPID, i.e. must assume that the new vpid12
      represents a TLB flush from L1's perspective.
      
      For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
      VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
      and thus can't have cached entries in the TLB for vpid12.
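
      A simplified sketch of the emulation in the nested TLB-flush path (field
      and helper names assumed):

        /* nested_vmx_transition_tlb_flush(), VM-Enter side, simplified sketch */
        if (is_vmenter && nested_cpu_has_vpid(vmcs12) &&
            vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
        	vmx->nested.last_vpid = vmcs12->virtual_processor_id;
        	/* Treat a never-before-seen vpid12 as a guest TLB flush. */
        	kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
        }
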
      Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
      Fixes: 5c614b35 ("KVM: nVMX: nested VPID emulation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211125014944.536398-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit · 40e5f908
      Sean Christopherson authored
      Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
      nested transitions, as KVM doesn't track requests for L1 vs L2.  E.g. if
      there's a pending flush when a nested VM-Exit occurs, then the flush was
      requested in the context of L2 and needs to be handled before switching
      to L1, otherwise the flush for L2 would effectively be lost.
      
      Opportunistically add a helper to handle CURRENT and GUEST as a pair; the
      logic for when they need to be serviced is identical, as both requests are
      tied to L1 vs. L2, and the only difference is the scope of the flush.
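
      A sketch of such a helper (the name is assumed), serviced on both nested
      VM-Enter and VM-Exit:

        static void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu)
        {
        	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
        		kvm_vcpu_flush_tlb_current(vcpu);

        	if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu))
        		kvm_vcpu_flush_tlb_guest(vcpu);
        }
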
      Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
      Fixes: 07ffaf34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211125014944.536398-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: do not use uninitialized gfn_to_hva_cache · 8503fea6
      Paolo Bonzini authored
      An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
      the accessors to croak very loudly.  While a BUG_ON is definitely
      _too_ loud and a bug on its own, there is indeed an issue of using
      the caches in such a way that they could not have been initialized,
      because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
      would not be called.
      
      For the vmcs12_cache, the solution is simply to invoke
      kvm_gfn_to_hva_cache_init unconditionally: we already know
      that the cache does not match the current VMCS pointer.
      For the shadow_vmcs12_cache, there is no similar condition
      that checks the VMCS link pointer, so invalidate the cache
      on VMXON.
      
      Fixes: cee66664 ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
      Acked-by: David Woodhouse <dwmw@amazon.co.uk>
      Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 18 Nov, 2021 4 commits
  15. 11 Nov, 2021 2 commits