1. 25 Feb, 2022 7 commits
    • x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64 · 92e68cc5
      Dexuan Cui authored
      When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP
      but it's partially enlightened, i.e. cc_platform_has(
      CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false.
      
      Commit 4d96f910 per se is good, but with it now
      kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls
      set_memory_decrypted(), and later gets stuck when trying to zero out
      the pages pointed to by 'hvclock_mem', if Linux runs as an Isolated VM on
      Hyper-V. The cause is that here now the Linux VM should no longer access
      the original guest physical address (GPA); instead the VM should do
      memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary:
      see the example code in drivers/hv/connection.c: vmbus_connect() or
      drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to
      access the original GPA, it keeps getting a fault injected by Hyper-V
      and gets stuck there.
      
      Here the issue happens only when the VM has >=65 vCPUs, because the
      global static array hv_clock_boot[] can hold 64 "struct
      pvclock_vsyscall_time_info" (the size of the struct is 64 bytes), so
      kvmclock_init_mem() only allocates memory in the case of vCPUs > 64.
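      The vCPU threshold follows from simple arithmetic, sketched below. This
      is a minimal userspace illustration, not the kernel code:
      BOOT_ARRAY_ENTRIES, ENTRY_SIZE and extra_clock_bytes() are hypothetical
      names, and only the two figures of 64 come from the text above.

```c
#include <stddef.h>

/* Figures taken from the commit message; the names are illustrative. */
#define BOOT_ARRAY_ENTRIES 64u  /* entries in the static boot array */
#define ENTRY_SIZE         64u  /* bytes per per-vCPU time info entry */

/*
 * Bytes that would have to be allocated dynamically for the vCPUs that
 * do not fit in the static array. Returns 0 when the array suffices.
 */
static size_t extra_clock_bytes(unsigned int ncpus)
{
    if (ncpus <= BOOT_ARRAY_ENTRIES)
        return 0;
    return (size_t)(ncpus - BOOT_ARRAY_ENTRIES) * ENTRY_SIZE;
}
```

      With 64 vCPUs or fewer nothing extra is needed, so the problematic
      allocation path is never reached.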
      
      Since the 'hvclock_mem' pages are only useful when the kvm clock is
      supported by the underlying hypervisor, fix the issue by returning
      early when Linux VM runs on Hyper-V, which doesn't support kvm clock.
      
      Fixes: 4d96f910 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
      Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
      Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Message-Id: <20220225084600.17817-1-decui@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Don't waste memory if kvmclock is disabled · 3c51d0a6
      Wanpeng Li authored
      Even if "no-kvmclock" is passed on the kernel command line, the guest
      kernel still allocates hvclock_mem, which scales with the number of
      vCPUs. Check whether kvmclock is enabled in advance to avoid wasting
      this memory.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1645520523-30814-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Don't use PV TLB/yield when mwait is advertised · 40cd58db
      Wanpeng Li authored
      MWAIT is advertised to the guest when the host is not overcommitted,
      whereas PV TLB flush and PV sched yield should only be enabled when the
      host is overcommitted. Add an MWAIT check when enabling PV TLB/sched
      yield.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvmarm-fixes-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · ece32a75
      Paolo Bonzini authored
      
      KVM/arm64 fixes for 5.17, take #4
      
      - Correctly synchronise PMR and co on PSCI CPU_SUSPEND
      
      - Skip tests that depend on GICv3 when the HW isn't available
    • KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3 · 456f89e0
      Mark Brown authored
      The arch_timer and vgic_irq kselftests assume that they can create a
      vgic-v3, using the library function vgic_v3_setup() which aborts with a
      test failure if it is not possible to do so. Since vgic-v3 can only be
      instantiated on systems where the host has GICv3 this leads to false
      positives on older systems where that is not the case.
      
      Fix this by changing vgic_v3_setup() to return an error if the vgic can't
      be instantiated and have the callers skip if this happens. We could also
      exit flagging a skip in vgic_v3_setup() but this would prevent future test
      cases conditionally deciding which GIC to use or generally doing more
      complex output.
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Tested-by: Ricardo Koller <ricarkol@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220223131624.1830351-1-broonie@kernel.org
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()" · 1a715810
      Sean Christopherson authored
      Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
      The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
      out, as the kernel will switch CR3 during __text_poke(), e.g. in response
      to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
      the mm associated with KVM, KVM will do VM-Enter => VM-Exit with a stale
      vmcs.HOST_CR3.
      
      Add a comment to explain why KVM must wait until VM-Enter is imminent to
      refresh vmcs.HOST_CR3.
      
      The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
      and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
      loaded for the KVM-associated mm while KVM has a "running" vCPU:
      
        static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
        {
      	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
      
      	...
      
      	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
      	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
      	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
        }
      
        ------------[ cut here ]------------
        KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
        WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
        Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
        CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:load_new_mm_cr3+0x82/0xe0
        RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
        RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
        RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
        R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
        FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         switch_mm_irqs_off+0x1cb/0x460
         __text_poke+0x308/0x3e0
         text_poke_bp_batch+0x168/0x220
         text_poke_finish+0x1b/0x30
         arch_jump_label_transform_apply+0x18/0x30
         static_key_slow_inc_cpuslocked+0x7c/0x90
         static_key_slow_inc+0x16/0x20
         kvm_lapic_set_base+0x116/0x190
         kvm_set_apic_base+0xa5/0xe0
         kvm_set_msr_common+0x2f4/0xf60
         vmx_set_msr+0x355/0xe70 [kvm_intel]
         kvm_set_msr_ignored_check+0x91/0x230
         kvm_emulate_wrmsr+0x36/0x120
         vmx_handle_exit+0x609/0x6c0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
         kvm_vcpu_ioctl+0x279/0x690
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
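      The condition tested by the WARN above can be restated as a small
      standalone predicate. This is only a sketch: host_cr3_is_stale() and
      the two mask macros are hypothetical names, and the page-address mask
      assumes 52 physical address bits rather than using the kernel's
      PHYSICAL_PAGE_MASK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative masks; the 52-bit physical address width is an assumption. */
#define PCID_MASK      0x0000000000000fffULL  /* CR3 bits 11:0  */
#define PAGE_ADDR_MASK 0x000ffffffffff000ULL  /* CR3 bits 51:12 */

/*
 * True when the new CR3 points at the same page tables as the cached
 * value but carries a different PCID, i.e. the cached copy is stale.
 */
static bool host_cr3_is_stale(uint64_t cached_cr3, uint64_t new_cr3)
{
    bool same_mm   = (cached_cr3 & PAGE_ADDR_MASK) == (new_cr3 & PAGE_ADDR_MASK);
    bool pcid_diff = (cached_cr3 & PCID_MASK) != (new_cr3 & PCID_MASK);

    return same_mm && pcid_diff;
}
```

      Fed the two values from the splat (0x105393003 cached, 0x8000000105393004
      loaded), the predicate reports a stale value: the page address matches
      while the PCID moved from 3 to 4.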
      
      This reverts commit 15ad9762.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Reported-by: Wanpeng Li <kernellwp@gmail.com>
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Message-Id: <20220224191917.3508476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()" · bca06b85
      Sean Christopherson authored
      Undo a nested VMX fix as a step toward reverting the commit it fixed,
      15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
      as the underlying premise that "host CR3 in the vcpu thread can only be
      changed when scheduling" is wrong.
      
      This reverts commit a9f2705e.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220224191917.3508476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 24 Feb, 2022 2 commits
    • KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled · e910a53f
      Maxim Levitsky authored
      
      If nested TSC scaling is disabled, MSR_AMD64_TSC_RATIO should
      never have a non-default value.
      
      Due to the way nested TSC scaling support was implemented in QEMU,
      it would set this MSR to 0 when nested TSC scaling was disabled.
      Ignore that value for now, as it causes no harm.
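      The acceptance rule described above can be sketched as a predicate. This
      is an illustration only: tsc_ratio_write_ok() is a hypothetical helper,
      the default ratio is passed in rather than assumed, and whatever
      validation applies when scaling is enabled is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a userspace write to the TSC ratio MSR is acceptable.
 * With scaling disabled, only the default value and 0 are tolerated.
 * With scaling enabled, this sketch accepts everything; real validation
 * of the value is out of scope here.
 */
static bool tsc_ratio_write_ok(bool scaling_enabled, uint64_t data,
                               uint64_t default_ratio)
{
    if (scaling_enabled)
        return true;
    return data == default_ratio || data == 0;
}
```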
      
      Fixes: 5228eb96 ("KVM: x86: nSVM: implement nested TSC scaling")
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: make apf token non-zero to fix bug · 6f3c1fc5
      Liang Zhang authored
      In the current async page fault logic, when a page is ready, KVM relies on
      kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
      a READY event to the guest. This function tests the token value of struct
      kvm_vcpu_pv_apf_data, which the guest kernel must reset to zero once it
      has finished handling a READY event. A zero value means that the previous
      READY event is done, so KVM can deliver another.
      However, kvm_arch_setup_async_pf() may produce a valid token whose value
      is zero, which is indistinguishable from the "done" state above and may
      cause this READY event to be lost.
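      One way to guarantee a non-zero token is sketched below. The token
      layout used here (a counter shifted left by 12 bits, OR-ed with the vCPU
      id) is an assumption for illustration, and struct apf_state and
      apf_next_token() are hypothetical names rather than the kernel's.

```c
#include <stdint.h>

/* Hypothetical per-vCPU state used only for this illustration. */
struct apf_state {
    uint32_t id;       /* monotonically increasing request counter */
    uint32_t vcpu_id;  /* assumed to fit in the low 12 bits of the token */
};

/*
 * Return the next token, never 0. A counter whose shifted value would be
 * zero (0 itself, or a value that wraps in 32 bits) is skipped, because
 * together with vcpu_id 0 it would yield the reserved "done" value.
 */
static uint32_t apf_next_token(struct apf_state *s)
{
    if ((uint32_t)(s->id << 12) == 0)
        s->id = 1;
    return (uint32_t)(s->id++ << 12) | s->vcpu_id;
}
```

      Checking the shifted counter rather than the raw counter matters: a
      counter of 1 << 20 is non-zero, yet its shifted form wraps to zero.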
      
      This bug may cause a task to block forever in the guest:
       INFO: task stress:7532 blocked for more than 1254 seconds.
             Not tainted 5.10.0 #16
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:stress          state:D stack:    0 pid: 7532 ppid:  1409
       flags:0x00000080
       Call Trace:
        __schedule+0x1e7/0x650
        schedule+0x46/0xb0
        kvm_async_pf_task_wait_schedule+0xad/0xe0
        ? exit_to_user_mode_prepare+0x60/0x70
        __kvm_handle_async_pf+0x4f/0xb0
        ? asm_exc_page_fault+0x8/0x30
        exc_page_fault+0x6f/0x110
        ? asm_exc_page_fault+0x8/0x30
        asm_exc_page_fault+0x1e/0x30
       RIP: 0033:0x402d00
       RSP: 002b:00007ffd31912500 EFLAGS: 00010206
       RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
       RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
       RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
       R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
       R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
      Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
      Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 22 Feb, 2022 2 commits
  4. 18 Feb, 2022 2 commits
  5. 17 Feb, 2022 6 commits
    • x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0 · 988896bb
      Leonardo Bras authored
      kvm_vcpu_arch currently contains the guest supported features in both
      guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.
      
      Currently both fields are set to the same value in
      kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.
      
      Since it's not good to keep duplicated data, remove guest_supported_xcr0.
      
      To keep the code more readable, introduce kvm_guest_supported_xcr()
      and kvm_guest_supported_xfd() to replace the previous usages of
      guest_supported_xcr0.
      Signed-off-by: Leonardo Bras <leobras@redhat.com>
      Message-Id: <20220217053028.96432-3-leobras@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 · ad856280
      Leonardo Bras authored
      During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
      swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().
      
      When xsave feature is available, the fpu swap is done by:
      - xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
        to store the current state of the fpu registers to a buffer.
      - xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
        XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.
      
      For xsave(s) the mask is used to limit what parts of the fpu regs will
      be copied to the buffer. Likewise on xrstor(s), the mask is used to
      limit what parts of the fpu regs will be changed.
      
      The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
      kvm_arch_vcpu_create(), which (in summary) sets it to all features
      supported by the cpu which are enabled on kernel config.
      
      This means that xsave(s) will save to guest buffer all the fpu regs
      contents the cpu has enabled when the guest is paused, even if they
      are not used.
      
      This would not be an issue, if xrstor(s) would also do that.
      
      xrstor(s)'s mask for host/guest swap is basically every valid feature
      contained in kernel config, except XFEATURE_MASK_PKRU.
      According to the kernel source, PKRU is instead switched in switch_to() and
      flush_thread().
      
      Then, the following happens when a host supporting PKRU starts a
      guest that does not support it:
      1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
      2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
      3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
      4 - guest runs, then switch back to host,
      5 - xsave(s) fpu regs to guest fpstate (buffer now has XFEATURE_MASK_PKRU)
      6 - xrstor(s) host fpstate to fpu regs.
      7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with
          XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)
      
      On 5, even though the guest does not support PKRU, it does have the flag
      set on guest fpstate, which is transferred to userspace via vcpu ioctl
      KVM_GET_XSAVE.
      
      This becomes a problem when the user decides on migrating the above guest
      to another machine that does not support PKRU: the new host restores
      guest's fpu regs to what they were before (xrstor(s)), but since the new
      host doesn't support PKRU, a general-protection exception occurs in xrstor(s)
      and that crashes the guest.
      
      This can be solved by making the guest's fpstate->user_xfeatures hold
      a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
      userspace will be the ones compatible to guest requirements, and thus
      there will be no issue during migration.
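      The masking idea can be sketched in a few lines. This is an
      illustration, not the kernel implementation: user_visible_xfeatures()
      is a hypothetical helper, and only three XCR0 feature bits are defined
      (x87 = bit 0, SSE = bit 1, PKRU = bit 9).

```c
#include <stdint.h>

/* A few XCR0 feature bits, defined locally for the example. */
#define XFEATURE_MASK_FP   (1ULL << 0)
#define XFEATURE_MASK_SSE  (1ULL << 1)
#define XFEATURE_MASK_PKRU (1ULL << 9)

/*
 * Features reported to userspace: whatever ended up in the saved buffer,
 * limited to what the guest was configured to support.
 */
static uint64_t user_visible_xfeatures(uint64_t saved, uint64_t guest_supported)
{
    return saved & guest_supported;
}
```

      For a guest configured without PKRU, a buffer that picked up the PKRU
      bit from the host is reported without it, so the state can be restored
      on a destination host that lacks PKRU.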
      
      As a bonus, it will also fail if userspace tries to set fpu features
      (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
      configuration.  Such features will never be returned by KVM_GET_XSAVE
      or KVM_GET_XSAVE2.
      
      Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
      there is no need to set it in kvm_check_cpuid(). So, change
      fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
      non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
      calls it.
      Signed-off-by: Leonardo Bras <leobras@redhat.com>
      Message-Id: <20220217053028.96432-2-leobras@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode · 3a55f729
      Anton Romanov authored
      If a vCPU has tsc_always_catchup set, each request updates the pvclock
      data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on a TSC
      read on the host's side and do the hypercall inside a pvclock_read_retry
      loop, which leads to an infinite loop in this situation.
      
      v3:
          Removed warn
          Changed return code to KVM_EFAULT
      v2:
          Added warn
      Signed-off-by: Anton Romanov <romanton@google.com>
      Message-Id: <20220216182653.506850-1-romanton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix lockdep false negative during host resume · 4cb9a998
      Wanpeng Li authored
      I saw the splat below after the host suspended and resumed.
      
         WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
         CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
         RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
         Call Trace:
          <TASK>
          syscore_resume+0x90/0x340
          suspend_devices_and_enter+0xaee/0xe90
          pm_suspend.cold+0x36b/0x3c2
          state_store+0x82/0xf0
          kernfs_fop_write_iter+0x1b6/0x260
          new_sync_write+0x258/0x370
          vfs_write+0x33f/0x510
          ksys_write+0xc9/0x160
          do_syscall_64+0x3b/0xc0
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      lockdep_is_held() can return -1 when lockdep is disabled which triggers
      this warning. Let's use lockdep_assert_not_held() which can detect
      incorrect calls while holding a lock and it also avoids false negatives
      when lockdep is disabled.
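      The difference between the two checks can be modelled with a tri-state
      result. This is a userspace sketch under stated assumptions: enum
      lock_state, naive_check_warns() and strict_check_warns() are
      hypothetical names, and only the -1 "unknown" return value comes from
      the text above.

```c
#include <stdbool.h>

/* Tri-state result of a "is this lock held?" query. */
enum lock_state {
    LOCK_UNKNOWN  = -1,  /* lock checking disabled, no answer available */
    LOCK_NOT_HELD = 0,
    LOCK_HELD     = 1,
};

/* Treats any non-zero answer as "held", so UNKNOWN also warns. */
static bool naive_check_warns(enum lock_state st)
{
    return st != LOCK_NOT_HELD;
}

/* Warns only when checking is enabled and the lock is definitely held. */
static bool strict_check_warns(bool checking_enabled, enum lock_state st)
{
    return checking_enabled && st == LOCK_HELD;
}
```

      With checking disabled and an UNKNOWN result, the naive form warns and
      the strict form stays silent, which matches the spurious warning
      described above.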
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add KVM_CAP_ENABLE_CAP to x86 · 127770ac
      Aaron Lewis authored
      Follow the precedent set by other architectures that support the VCPU
      ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
      This way, userspace can ensure that KVM_ENABLE_CAP is available on a
      vcpu before using it.
      
      Fixes: 5c919412 ("kvm/x86: Hyper-V synthetic interrupt controller")
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Message-Id: <20220214212950.1776943-1-aaronlewis@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Don't miss pending interrupts for suspended vCPU · a867e9d0
      Oliver Upton authored
      In order to properly emulate the WFI instruction, KVM reads back
      ICH_VMCR_EL2 and enables doorbells for GICv4. These preparations are
      necessary in order to recognize pending interrupts in
      kvm_arch_vcpu_runnable() and return to the guest. Until recently, this
      work was done by kvm_arch_vcpu_{blocking,unblocking}(). Since commit
      6109c5a6 ("KVM: arm64: Move vGIC v4 handling for WFI out arch
      callback hook"), these callbacks were gutted and superseded by
      kvm_vcpu_wfi().
      
      It is important to note that KVM implements PSCI CPU_SUSPEND calls as
      a WFI within the guest. However, the implementation calls directly into
      kvm_vcpu_halt(), which skips the needed work done in kvm_vcpu_wfi()
      to detect pending interrupts. Fix the issue by calling the WFI helper.
      
      Fixes: 6109c5a6 ("KVM: arm64: Move vGIC v4 handling for WFI out arch callback hook")
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220217101242.3013716-1-oupton@google.com
  6. 14 Feb, 2022 2 commits
  7. 11 Feb, 2022 6 commits
  8. 10 Feb, 2022 1 commit
    • KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU · fcb732d8
      David Woodhouse authored
      There are circumstances when kvm_xen_update_runstate_guest() should not
      sleep because it ends up being called from __schedule() when the vCPU
      is preempted:
      
      [  222.830825]  kvm_xen_update_runstate_guest+0x24/0x100
      [  222.830878]  kvm_arch_vcpu_put+0x14c/0x200
      [  222.830920]  kvm_sched_out+0x30/0x40
      [  222.830960]  __schedule+0x55c/0x9f0
      
      To handle this, make it use the same trick as __kvm_xen_has_interrupt(),
      of using the hva from the gfn_to_hva_cache directly. Then it can use
      pagefault_disable() around the accesses and just bail out if the page
      is absent (which is unlikely).
      
      I almost switched to using a gfn_to_pfn_cache here and bailing out if
      kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on
      closer inspection it looks like kvm_map_gfn() will *always* fail in
      atomic context for a page in IOMEM, which means it will silently fail
      to make the update every single time for such guests, AFAICT. So I
      didn't do it that way after all. And will probably fix that one too.
      
      Cc: stable@vger.kernel.org
      Fixes: 30b5c851 ("KVM: x86/xen: Add support for vCPU runstate information")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 08 Feb, 2022 9 commits
    • KVM: x86: SVM: move avic definitions from AMD's spec to svm.h · 39150352
      Maxim Levitsky authored
      asm/svm.h is the correct place for all values that are defined in
      the SVM spec, and that includes AVIC.
      
      Also add some values from the spec that were not defined before
      and will be soon useful.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it · 755c2bf8
      Maxim Levitsky authored
      kvm_apic_update_apicv is called when AVIC is still active, thus IRR bits
      can be set by the CPU after it is called without causing irr_pending
      to be set to true.
      
      Also, the logic in avic_kick_target_vcpu doesn't expect a race with this
      function, so to keep it simple, just leave irr_pending set to true and
      let the next interrupt injection to the guest clear it.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them · 2b0ecccb
      Maxim Levitsky authored
      Fix a corner case in which the L1 hypervisor intercepts
      interrupts (INTERCEPT_INTR) and either doesn't set
      virtual interrupt masking (V_INTR_MASKING) or enters a
      nested guest with EFLAGS.IF disabled prior to the entry.
      
      In this case, despite the fact that L1 intercepts the interrupts,
      KVM still needs to set up an interrupt window to wait before
      injecting the INTR vmexit.
      
      Currently KVM instead enters an endless loop of 'req_immediate_exit'.
      
      Exactly the same issue also happens for SMIs and NMIs.
      Fix this as well.
      
      Note that on VMX, this case is impossible as there is only the
      'vmexit on external interrupts' execution control, which is either set,
      in which case both host and guest's EFLAGS.IF
      are ignored, or not set, in which case no VMexits are delivered.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: expose clean bit support to the guest · 91f673b3
      Maxim Levitsky authored
      KVM already honours a few clean bits, thus it makes sense
      to let the nested guest know about it.
      
      Note that KVM also doesn't check if the hardware supports
      clean bits, and therefore nested KVM was
      already setting clean bits and L0 KVM
      was already honouring them.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM · 759cbd59
      Maxim Levitsky authored
      While RSM-induced VM entries are not full VM entries,
      they still need to be followed by an actual VM entry to complete them,
      unlike setting the nested state.
      
      This patch fixes the boot of Hyper-V and SMM-enabled
      Windows VMs running nested on KVM, which fail due
      to this issue combined with the lack of dirty bit setting.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state · e8efa4ff
      Maxim Levitsky authored
      While restoring the SMM state usually makes KVM enter
      the nested guest, and thus a different vmcb (vmcb02 vs vmcb01),
      KVM should still mark it as dirty, since hardware
      can in theory cache multiple vmcbs.
      
      Failure to do so, combined with the lack of setting
      nested_run_pending (which is fixed in the next patch),
      might make KVM re-enter vmcb01, which was just exited from,
      with a completely different set of guest state registers
      (SMM vs non-SMM) and without proper dirty bits set,
      which results in the CPU reusing a stale IDTR pointer,
      which leads to a guest shutdown on any interrupt.
      
      On real hardware this usually doesn't happen,
      but when running nested, L0's KVM does check and
      honour a few dirty bits, causing this issue to happen.
      
      This patch fixes the boot of Hyper-V and SMM-enabled
      Windows VMs running nested on KVM.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: fix potential NULL derefernce on nested migration · e1779c27
      Maxim Levitsky authored
      Turns out that due to review feedback and/or rebases
      I accidentally moved the call to nested_svm_load_cr3 to be too early,
      before the NPT is enabled, which is very wrong to do.
      
      KVM can't even access guest memory at that point as nested NPT
      is needed for that, and of course it won't initialize the walk_mmu,
      which is the main issue the patch was addressing.
      
      Fix this for real.
      
      Fixes: 232f75d3 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load")
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case · c53bbe21
      Maxim Levitsky authored
      When the guest doesn't enable paging, and NPT/EPT is disabled, we
      use the guest's paging CR3 as KVM's shadow paging pointer and
      we are technically in direct mode as if we were to use NPT/EPT.
      
      In direct mode we create SPTEs with user mode permissions
      because usually in the direct mode the NPT/EPT doesn't
      need to restrict access based on guest CPL
      (there are MBE/GMET extensions for that but KVM doesn't use them).
      
      In this special "use guest paging as direct" mode however,
      and if CR4.SMAP/CR4.SMEP are enabled, that will make the CPU
      fault on each access and KVM will enter an endless loop of page faults.
      
      Since page protection doesn't have any meaning in !PG case,
      just don't passthrough these bits.
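      The rule can be sketched as a pure function on register values. This is
      a minimal illustration, not the KVM code: effective_hw_cr4() is a
      hypothetical helper, and the bit positions are the architectural ones
      (CR0.PG = bit 31, CR4.SMEP = bit 20, CR4.SMAP = bit 21, CR4.PKE = bit 22).

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural control-register bits, defined locally for the example. */
#define CR0_PG   (1ULL << 31)
#define CR4_SMEP (1ULL << 20)
#define CR4_SMAP (1ULL << 21)
#define CR4_PKE  (1ULL << 22)

/*
 * CR4 value to run the guest with. When nested paging is off and the
 * guest has paging disabled, the protection bits are masked out instead
 * of being passed through.
 */
static uint64_t effective_hw_cr4(uint64_t guest_cr4, uint64_t guest_cr0,
                                 bool npt_enabled)
{
    uint64_t hw_cr4 = guest_cr4;

    if (!npt_enabled && !(guest_cr0 & CR0_PG))
        hw_cr4 &= ~(CR4_SMEP | CR4_SMAP | CR4_PKE);

    return hw_cr4;
}
```

      In every other combination the guest's bits pass through unchanged.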
      
      The fix is the same as was done for VMX in commit:
      commit 656ec4a4 ("KVM: VMX: fix SMEP and SMAP without EPT")
      
      This fixes the boot of Windows 10 without NPT for good.
      (Without this patch, the BSP boots, but the APs were stuck in an endless
      loop of page faults, causing the VM to boot with 1 CPU.)
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "svm: Add warning message for AVIC IPI invalid target" · dd4589ee
      Sean Christopherson authored
      Remove a WARN on an "AVIC IPI invalid target" exit; the WARN is trivial
      to trigger from the guest, as it will fail on any destination APIC ID that
      doesn't exist from the guest's perspective.
      
      Don't bother recording anything in the kernel log, the common tracepoint
      for kvm_avic_incomplete_ipi() is sufficient for debugging.
      
      This reverts commit 37ef0c44.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 06 Feb, 2022 3 commits
    • Linux 5.17-rc3 · dfd42fac
      Linus Torvalds authored
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d8ad2ce8
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Various bug fixes for ext4 fast commit and inline data handling.
      
        Also fix regression introduced as part of moving to the new mount API"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        fs/ext4: fix comments mentioning i_mutex
        ext4: fix incorrect type issue during replay_del_range
        jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}()
        ext4: fix potential NULL pointer dereference in ext4_fill_super()
        jbd2: refactor wait logic for transaction updates into a common function
        jbd2: cleanup unused functions declarations from jbd2.h
        ext4: fix error handling in ext4_fc_record_modified_inode()
        ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()
        ext4: fix error handling in ext4_restore_inline_data()
        ext4: fast commit may miss file actions
        ext4: fast commit may not fallback for ineligible commit
        ext4: modify the logic of ext4_mb_new_blocks_simple
        ext4: prevent used blocks from being allocated during fast commit replay
    • Merge tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · 18118a42
      Linus Torvalds authored
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix display of grouped aliased events in 'perf stat'.
      
       - Add missing branch_sample_type to perf_event_attr__fprintf().
      
       - Apply correct label to user/kernel symbols in branch mode.
      
       - Fix 'perf ftrace' system_wide tracing, it has to be set before
         creating the maps.
      
       - Return error if procfs isn't mounted for PID namespaces when
         synthesizing records for pre-existing processes.
      
       - Set error stream of objdump process for 'perf annotate' TUI, to avoid
         garbling the screen.
      
       - Add missing arm64 support to perf_mmap__read_self(), the kernel part
         got into 5.17.
      
       - Check for NULL pointer before dereference writing debug info about a
         sample.
      
       - Update UAPI copies for asound, perf_event, prctl and kvm headers.
      
       - Fix a typo in bpf_counter_cgroup.c.
      
      * tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf ftrace: system_wide collection is not effective by default
        libperf: Add arm64 support to perf_mmap__read_self()
        tools include UAPI: Sync sound/asound.h copy with the kernel sources
        perf stat: Fix display of grouped aliased events
        perf tools: Apply correct label to user/kernel symbols in branch mode
        perf bpf: Fix a typo in bpf_counter_cgroup.c
        perf synthetic-events: Return error if procfs isn't mounted for PID namespaces
        perf session: Check for NULL pointer before dereference
        perf annotate: Set error stream of objdump process for TUI
        perf tools: Add missing branch_sample_type to perf_event_attr__fprintf()
        tools headers UAPI: Sync linux/kvm.h with the kernel sources
        tools headers UAPI: Sync linux/prctl.h with the kernel sources
        perf beauty: Make the prctl arg regexp more strict to cope with PR_SET_VMA
        tools headers cpufeatures: Sync with the kernel sources
        tools headers UAPI: Sync linux/perf_event.h with the kernel sources
        tools include UAPI: Sync sound/asound.h copy with the kernel sources