- 22 Jun, 2023 2 commits
-
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #4

- Correctly save/restore PMUSERENR_EL0 when host userspace is using
  PMU counters directly

- Fix GICv2 emulation on GICv3 after the locking rework

- Don't use smp_processor_id() in kvm_pmu_probe_armpmu(), and
  document why...
-
Gavin Shan authored
We run into a guest hang in edk2 firmware when KSM is kept running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's
pflash device (TYPE_PFLASH_CFI01) during the operation of sector
erasing or buffered write. The status is returned by reading the memory
region of the pflash device, and the read request should have been
forwarded to QEMU and emulated by it. Unfortunately, the read request
is covered by an illegal stage2 mapping when the guest hang issue
occurs. The read request is completed with QEMU bypassed and a wrong
status is fetched. The edk2 firmware runs into an infinite loop with
the wrong status.

The illegal stage2 mapping is populated due to same page sharing by KSM
at (C), even though the associated memory slot has been marked as
invalid at (B) when the memory slot is requested to be deleted. It's
notable that the active and inactive memory slots can't be swapped when
we're in the middle of kvm_mmu_notifier_change_pte() because
kvm->mn_active_invalidate_count is elevated, and
kvm_swap_active_memslots() will busy loop until it reaches zero again.
Besides, the swapping from the active to the inactive memory slots is
also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in
kvm_swap_active_memslots().

  CPU-A                                  CPU-B
  -----                                  -----
                                         ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                                         kvm_vm_ioctl_set_memory_region
                                         kvm_set_memory_region
                                         __kvm_set_memory_region
                                         kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                                           kvm_invalidate_memslot
                                             kvm_copy_memslot
                                             kvm_replace_memslot
                                             kvm_swap_active_memslots        (A)
                                             kvm_arch_flush_shadow_memslot   (B)
  same page sharing by KSM
  kvm_mmu_notifier_invalidate_range_start
        :
  kvm_mmu_notifier_change_pte
    kvm_handle_hva_range
      __kvm_handle_hva_range
        kvm_set_spte_gfn               (C)
        :
  kvm_mmu_notifier_invalidate_range_end

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping, so that the read request for the pflash's
status is forwarded to QEMU and emulated by it. In this way, the
correct pflash status can be returned from QEMU to break the infinite
loop in the edk2 firmware.

We tried a git-bisect and the first problematic commit is cd4c7183
("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With
this, clean_dcache_guest_page() is called after the memory slots are
iterated in kvm_mmu_notifier_change_pte(); before this commit,
clean_dcache_guest_page() was called before the iteration over the
memory slots. This change enlarges the race window between
kvm_mmu_notifier_change_pte() and memory slot removal, so that we're
able to reproduce the issue in a practical test case. However, the
issue exists since commit d5d8184d ("KVM: ARM: Memory virtualization
setup").

Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
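A minimal sketch of the idea, in the style of the hva-range walker (the
helper and flag names below follow mainline KVM, but treat the exact
placement of the check as an assumption, not the literal upstream diff):

	/*
	 * Sketch: while walking memslots for a change_pte notification,
	 * skip any slot that an in-flight deletion has already marked
	 * invalid, so no stage2 mapping is re-created at (C).
	 */
	kvm_for_each_memslot_in_hva_range(node, slots, range->start, range->end - 1) {
		struct kvm_memory_slot *slot;

		slot = container_of(node, struct kvm_memory_slot,
				    hva_node[slots->node_idx]);
		if (slot->flags & KVM_MEMSLOT_INVALID)
			continue;	/* assumption: where the skip lands */
		/* ... invoke kvm_set_spte_gfn() on the remaining slots ... */
	}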
-
- 07 Jun, 2023 2 commits
-
-
Oliver Upton authored
Sebastian reports that commit 1c913a1c ("KVM: arm64: Iterate arm_pmus
list to probe for default PMU") introduced the following splat with
CONFIG_DEBUG_PREEMPT enabled:

  [70506.110187] BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-aar/3078242
  [70506.119077] caller is debug_smp_processor_id+0x20/0x30
  [70506.124229] CPU: 129 PID: 3078242 Comm: qemu-system-aar Tainted: G        W          6.4.0-rc5 #25
  [70506.133176] Hardware name: GIGABYTE R181-T92-00/MT91-FS4-00, BIOS F34 08/13/2020
  [70506.140559] Call trace:
  [70506.142993]  dump_backtrace+0xa4/0x130
  [70506.146737]  show_stack+0x20/0x38
  [70506.150040]  dump_stack_lvl+0x48/0x60
  [70506.153704]  dump_stack+0x18/0x28
  [70506.157007]  check_preemption_disabled+0xe4/0x108
  [70506.161701]  debug_smp_processor_id+0x20/0x30
  [70506.166046]  kvm_arm_pmu_v3_set_attr+0x460/0x628
  [70506.170662]  kvm_arm_vcpu_arch_set_attr+0x88/0xd8
  [70506.175363]  kvm_arch_vcpu_ioctl+0x258/0x4a8
  [70506.179632]  kvm_vcpu_ioctl+0x32c/0x6b8
  [70506.183465]  __arm64_sys_ioctl+0xb4/0x100
  [70506.187467]  invoke_syscall+0x78/0x108
  [70506.191205]  el0_svc_common.constprop.0+0x4c/0x100
  [70506.195984]  do_el0_svc+0x34/0x50
  [70506.199287]  el0_svc+0x34/0x108
  [70506.202416]  el0t_64_sync_handler+0xf4/0x120
  [70506.206674]  el0t_64_sync+0x194/0x198

Fix the issue by using the raw variant that bypasses the debug
assertion. While at it, stick all of the nuance and UAPI baggage into a
comment for posterity.

Fixes: 1c913a1c ("KVM: arm64: Iterate arm_pmus list to probe for default PMU")
Reported-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230606184814.456743-1-oliver.upton@linux.dev
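A sketch of the probe with the raw variant (list and entry types are
simplified; treat structure details as assumptions):

	/*
	 * Sketch: any CPU is good enough to pick a default PMU from
	 * KVM's own list, and the choice is validated against the vCPU
	 * later, so the preemption-debug check buys nothing here.
	 */
	static struct arm_pmu *kvm_pmu_probe_armpmu(void)
	{
		struct arm_pmu *pmu = NULL;
		struct arm_pmu_entry *entry;
		int cpu = raw_smp_processor_id();	/* was: smp_processor_id() */

		list_for_each_entry(entry, &arm_pmus, entry) {
			if (cpumask_test_cpu(cpu, &entry->arm_pmu->supported_cpus)) {
				pmu = entry->arm_pmu;
				break;
			}
		}
		return pmu;
	}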
-
Marc Zyngier authored
When reworking the vgic locking, the vgic distributor registration got
simplified, which was a very good cleanup. But just a tad too radical,
as we now register the *native* vgic only, ignoring the GICv2-on-GICv3
that allows pre-historic VMs (or so I thought) to run.

As it turns out, QEMU still defaults to GICv2 in some cases, and this
breaks Nathan's setup!

Fix it by propagating the *requested* vgic type rather than the host's
version.

Fixes: 59112e9c ("KVM: arm64: vgic: Fix a circular locking issue")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230606221525.GA2269598@dev-arch.thelio-3990X
-
- 04 Jun, 2023 2 commits
-
-
Reiji Watanabe authored
Currently, with VHE, KVM sets the ER, CR, SW and EN bits of
PMUSERENR_EL0 to 1 on vcpu_load(), and saves and restores the register
value for the host on vcpu_load() and vcpu_put(). If the values of
those bits are cleared on a pCPU with a vCPU loaded (armv8pmu_start()
would do that when PMU counters are programmed for the guest), PMU
access from the guest EL0 might be trapped to the guest EL1 directly,
regardless of the current PMUSERENR_EL0 value of the vCPU.

Fix this by not letting armv8pmu_start() overwrite PMUSERENR_EL0 on the
pCPU where PMUSERENR_EL0 for the guest is loaded, and instead updating
the saved shadow register value for the host, so that the value can be
restored on vcpu_put() later.

While vcpu_{put,load}() are manipulating PMUSERENR_EL0, disable IRQs to
prevent a race condition between these processes and IPIs that attempt
to update PMUSERENR_EL0 for the host EL0.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Fixes: 83a7a4d6 ("arm64: perf: Enable PMU counter userspace access for perf event")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230603025035.3781797-3-reijiw@google.com
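A sketch of the vcpu_put() side under the scheme described above (the
location of the host's shadow copy is an assumption for illustration;
the exact upstream plumbing differs):

	/*
	 * Sketch: restore the host's shadow PMUSERENR_EL0 with IRQs
	 * masked, so an IPI from armv8pmu_start() updating the host
	 * value cannot race with the restore.  While the guest's value
	 * is loaded, armv8pmu_start() writes the shadow copy instead of
	 * the register itself.
	 */
	unsigned long flags;

	local_irq_save(flags);
	/* 'saved_pmuserenr' stands in for wherever the host copy lives. */
	write_sysreg(saved_pmuserenr, pmuserenr_el0);
	local_irq_restore(flags);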
-
Reiji Watanabe authored
Restore the host's PMUSERENR_EL0 value instead of clearing it, before
returning back to userspace, as the host's EL0 might have direct access
to PMU registers (some bits of PMUSERENR_EL0 might not be zero for the
host EL0).

Fixes: 83a7a4d6 ("arm64: perf: Enable PMU counter userspace access for perf event")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230603025035.3781797-2-reijiw@google.com
-
- 03 Jun, 2023 5 commits
-
-
https://github.com/kvm-x86/linux
Paolo Bonzini authored
KVM x86 fixes for 6.4

- Fix a memslot lookup bug in the NX recovery thread that could
  theoretically let userspace bypass the NX hugepage mitigation

- Fix a s/BLOCKING/PENDING bug in SVM's vNMI support

- Account exit stats for fastpath VM-Exits that never leave the super
  tight run-loop

- Fix an out-of-bounds bug in the optimized APIC map code, and add a
  regression test for the race.
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #3

- Fix the reported address of a watchpoint forwarded to userspace

- Fix the freeing of the root of stage-2 page tables

- Stop creating spurious PMU events to perform detection of the
  default PMU and use the existing PMU list instead.
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #2

- Address some fallout of the locking rework, this time affecting the
  way the vgic is configured

- Fix an issue where the page table walker frees a subtree and then
  proceeds with walking what it has just freed...

- Check that a given PA donated to the guest is actually memory (only
  affecting pKVM)

- Correctly handle MTE CMOs by Set/Way
-
Michal Luczaj authored
Keep switching between LAPIC_MODE_X2APIC and LAPIC_MODE_DISABLED during
APIC map construction to hunt for TOCTOU bugs in KVM. KVM's optimized
map recalc makes multiple passes over the list of vCPUs, and the
calculations ignore vCPUs whose APIC is hardware-disabled, i.e. there's
a window where toggling LAPIC_MODE_DISABLED is quite interesting.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230602233250.1014316-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Bail from kvm_recalculate_phys_map() and disable the optimized map if
the target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was
added and/or enabled its local APIC after the map was allocated. This
fixes an out-of-bounds access bug in the !x2apic_format path where KVM
would write beyond the end of phys_map.

Check the x2APIC ID regardless of whether or not x2APIC is enabled, as
KVM hardcodes the x2APIC ID to be the vCPU ID, i.e. it can't change,
and the map allocation in kvm_recalculate_apic_map() doesn't check for
x2APIC being enabled, i.e. the check won't get false positives.

Note, this also affects the x2apic_format path, which previously just
ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a
bug fix, as ignoring the vCPU meant that KVM would not send interrupts
to the vCPU until the next map recalculation. In practice, that "bug"
is likely benign, as a newly present vCPU/APIC would immediately
trigger a recalc. But, there's no functional downside to disabling the
map, and a future patch will gracefully handle the -E2BIG case by
retrying instead of simply disabling the optimized map.

Opportunistically add a sanity check on the xAPIC ID size, along with a
comment explaining why the xAPIC ID is guaranteed to be "good".

Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 5b84b029 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
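A sketch of the bounds check described above (simplified; the exact
error handling around the -E2BIG case is an assumption):

	/*
	 * Sketch: the x2APIC ID is hardcoded to the vCPU ID, so it can
	 * be validated against the map size whether or not x2APIC is
	 * currently enabled.
	 */
	u32 x2apic_id = kvm_x2apic_id(vcpu->arch.apic);

	if (x2apic_id > new->max_apic_id)
		return -E2BIG;	/* caller disables the optimized map */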
-
- 02 Jun, 2023 3 commits
-
-
Sean Christopherson authored
Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.

Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
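A sketch of where the counter lands in the inner run loop (shape
simplified from vcpu_enter_guest(); treat placement details as an
assumption):

	for (;;) {
		/* ... enter the guest ... */
		if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
			break;	/* slow path accounts the exit below */

		if (unlikely(kvm_vcpu_exit_request(vcpu))) {
			exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
			break;
		}

		/* Fastpath re-entry never reaches the slow path: count it here. */
		++vcpu->stat.exits;
	}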
-
Maciej S. Szmigiero authored
While testing Hyper-V enabled Windows Server 2019 guests on Zen4
hardware I noticed that with a vCPU count large enough (> 16) they
sometimes froze at boot. With a vCPU count of 64 they never booted
successfully - suggesting some kind of a race condition.

Since adding a "vnmi=0" module parameter made these guests boot
successfully it was clear that the problem is most likely
(v)NMI-related.

Running kvm-unit-tests quickly showed failing NMI-related test cases,
like "multiple nmi" and "pending nmi" from the apic-split, x2apic and
xapic tests, and the NMI parts of the eventinj test.

The issue was that once one NMI was being serviced no other NMI was
allowed to be set pending (NMI limit = 0), which was traced to
svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
than for the "NMI pending" flag.

Fix this by testing for the right flag in svm_is_vnmi_pending(). Once
this is done, the NMI-related kvm-unit-tests pass successfully and the
Windows guest no longer freezes at boot.

Fixes: fa4c027a ("KVM: x86: Add support for SVM's Virtual NMI")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
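A sketch of the one-line s/BLOCKING/PENDING fix (mask names follow the
vNMI definitions; simplified):

	static bool svm_is_vnmi_pending(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);

		if (!is_vnmi_enabled(svm))
			return false;

		/* was: int_ctl & V_NMI_BLOCKING_MASK */
		return !!(svm->vmcb->control.int_ctl & V_NMI_PENDING_MASK);
	}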
-
Sean Christopherson authored
Factor in the address space (non-SMM vs. SMM) of the target shadow page
when recovering potential NX huge pages, otherwise KVM will retrieve
the wrong memslot when zapping shadow pages that were created for SMM.
The bug most visibly manifests as a WARN on the memslot being non-NULL,
but the worst case scenario is that KVM could unaccount the shadow page
without ensuring KVM won't install a huge page, i.e. if the non-SMM
slot is being dirty logged, but the SMM slot is not.

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015 kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
  CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
  RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
  RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
  R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
  Call Trace:
   <TASK>
   __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
   kvm_vm_worker_thread+0x106/0x1c0 [kvm]
   kthread+0xd9/0x100
   ret_from_fork+0x2c/0x50
   </TASK>
  ---[ end trace 0000000000000000 ]---

This bug was exposed by commit edbdb43f ("KVM: x86: Preserve TDP MMU
roots until they are explicitly invalidated"), which allowed KVM to
retain SMM TDP MMU roots effectively indefinitely. Before commit
edbdb43f, KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU
shadow pages once all vCPUs exited SMM, which made the window where
this bug (recovering an SMM NX huge page) could be encountered quite
tiny. To hit the bug, the NX recovery thread would have to run while at
least one vCPU was in SMM. Most VMs typically only use SMM during boot,
and so the problematic shadow pages were gone by the time the NX
recovery thread ran.

Now that KVM preserves TDP MMU roots until they are explicitly
invalidated (e.g. by a memslot deletion), the window to trigger the bug
is effectively never closed, because most VMMs don't delete memslots
after boot (except for a handful of special scenarios).

Fixes: eb298605 ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
Reported-by: Fabio Coatti <fabio.coatti@gmail.com>
Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
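A sketch of the memslot lookup with the address space factored in
(helper names follow the x86 MMU code; simplified):

	/*
	 * Sketch: pick the memslots instance matching the shadow page's
	 * role (which encodes SMM vs. non-SMM) before resolving the gfn,
	 * so an SMM page is checked against the SMM slots.
	 */
	struct kvm_memory_slot *slot = NULL;

	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
		struct kvm_memslots *slots;

		slots = kvm_memslots_for_spte_role(kvm, sp->role);
		slot = __gfn_to_memslot(slots, sp->gfn);
		WARN_ON_ONCE(!slot);
	}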
-
- 31 May, 2023 3 commits
-
-
Oliver Upton authored
KVM maintains a mask of supported CPUs when a vPMU type is explicitly
selected by userspace, and it is used to reject any attempt to run the
vCPU on an unsupported CPU. This is great, but we're still beholden to
the default behavior where vCPUs can be scheduled anywhere and guest
counters may silently stop working.

Avoid confusing the next poor sod to look at this code and document the
intended behavior.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-3-oliver.upton@linux.dev
-
Oliver Upton authored
To date KVM has relied on using a perf event to probe the core PMU at
the time of vPMU initialization. Behind the scenes, perf_event_init()
would iteratively walk the PMUs of the system and return the PMU that
could handle the event. However, an upcoming change in perf core will
drop the iterative walk, thereby breaking the fragile dance we do on
the KVM side.

Avoid the problem altogether by iterating over the list of supported
PMUs maintained in KVM, returning the core PMU that matches the CPU we
were called on.

Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230525212723.3361524-2-oliver.upton@linux.dev
-
Oliver Upton authored
The reference count on page table allocations is increased for every
'counted' PTE (valid or donated) in the table, in addition to the
initial reference from ->zalloc_page(). kvm_pgtable_stage2_free_removed()
fails to drop the last reference on the root of the table walk, meaning
we leak memory.

Fix it by dropping the last reference after the free walker returns, at
which point all references for 'counted' PTEs have been released.

Cc: stable@vger.kernel.org
Fixes: 5c359cca ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530193213.1663411-1-oliver.upton@linux.dev
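A sketch of the shape of the fix (the walker invocation is elided; the
final put_page on the root is the point, the rest is an assumption):

	void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops,
					     void *pgtable, u32 level)
	{
		/* Free walker: drops one reference per counted PTE. */
		/* ... walk the detached table ... */

		/* Now drop the initial ->zalloc_page() reference on the root. */
		WARN_ON(mm_ops->page_count(pgtable) != 1);
		mm_ops->put_page(pgtable);
	}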
-
- 30 May, 2023 1 commit
-
-
Akihiko Odaki authored
When handling ESR_ELx_EC_WATCHPT_LOW, the far_el2 member of struct
kvm_vcpu_fault_info will be copied to the far member of struct
kvm_debug_exit_arch and exposed to userspace. The userspace will see
stale values from older faults if the fault info does not get
populated.

Fixes: 8fb20461 ("KVM: arm64: Move early handlers to per-EC handlers")
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530024651.10014-1-akihiko.odaki@daynix.com
Cc: stable@vger.kernel.org
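A sketch of a per-EC hyp handler that populates the fault info before
the exit is forwarded (wiring per the per-EC handler scheme; treat the
exact function as an assumption):

	/*
	 * Sketch: mirror the other per-EC memory-fault handlers so
	 * FAR_EL2 is saved into vcpu->arch.fault for watchpoints too,
	 * then let the exit be forwarded as usual.
	 */
	static bool kvm_hyp_handle_watchpt_low(struct kvm_vcpu *vcpu, u64 *exit_code)
	{
		if (!__populate_fault_info(vcpu))
			return true;

		return false;
	}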
-
- 24 May, 2023 3 commits
-
-
Fuad Tabba authored
The preorder callback on the kvm_pgtable_stage2_map() path can replace
a table with a block, then recursively free the detached table. The
higher-level walking logic stashes the old page table entry and then
walks the freed table, invoking the leaf callback and potentially
freeing pgtable pages prematurely.

In normal operation, the call to tear down the detached stage-2 is
indirected and uses an RCU callback to trigger the freeing. RCU is not
available to pKVM, which is where this bug is triggered.

Change the behavior of the walker to reload the page table entry after
invoking the walker callback on preorder traversal, as it does for leaf
entries.

Tested on Pixel 6.

Fixes: 5c359cca ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230522103258.402272-1-tabba@google.com
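A sketch of the reload in the walker (context simplified from the
table-visit path; treat the surrounding details as assumptions):

	if (ctx.flags & KVM_PGTABLE_WALK_TABLE_PRE) {
		ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_PRE);

		/*
		 * Sketch: the callback may have replaced the table with
		 * a block and freed the old table, so reload the PTE
		 * before deciding whether to descend.
		 */
		ctx.old = READ_ONCE(*ptep);
		table = kvm_pte_table(ctx.old, level);
	}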
-
Marc Zyngier authored
We appear to have missed the Set/Way CMOs when adding MTE support. Not
that we really expect anyone to use them, but you never know what
stupidity some people can come up with...

Treat these mostly like we deal with the classic S/W CMOs, only with an
additional check that MTE really is enabled.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20230515204601.1270428-3-maz@kernel.org
-
Marc Zyngier authored
Add the missing Set/Way CMOs that apply to tagged memory.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20230515204601.1270428-2-maz@kernel.org
-
- 21 May, 2023 3 commits
-
-
Mingwei Zhang authored
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace
to save/restore the register value during migration. Missing this may
cause userspace that relies on the KVM ioctl(KVM_GET_MSR_INDEX_LIST) to
fail to port the value to the target VM.

In addition, there is no need to add MSR_IA32_TSX_CTRL when
ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities().
So add the check in kvm_probe_msr_to_save().

Fixes: c11f83e0 ("KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20230509032348.1153070-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
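A sketch of the probe-time filter, as one more case in the
kvm_probe_msr_to_save() switch (simplified):

	case MSR_IA32_TSX_CTRL:
		/* Only report the MSR when the host can virtualize it. */
		if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
			return;
		break;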
-
Sean Christopherson authored
Drop KVM's manipulation of the guest's CPUID.0x12.1 ECX and EDX, i.e.
the allowed XFRM of SGX enclaves, now that KVM explicitly checks the
guest's allowed XCR0 when emulating ECREATE.

Note, this could theoretically break a setup where userspace advertises
a "bad" XFRM and relies on KVM to provide a sane CPUID model, but QEMU
is the only known user of KVM SGX, and QEMU explicitly sets the SGX
CPUID XFRM subleaf based on the guest's XCR0.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly check the vCPU's supported XCR0 when determining whether or
not the XFRM for ECREATE is valid. Checking CPUID works because KVM
updates guest CPUID.0x12.1 to restrict the leaf to a subset of the
guest's allowed XCR0, but that is rather subtle, and KVM should not
modify guest CPUID except for modeling true runtime behavior (allowed
XFRM is most definitely not "runtime" behavior).

Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
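A sketch of the XFRM validity check during ECREATE emulation (the
x87/SSE carve-out reflects the architectural rule that XFRM must have
those bits set; treat the exact condition as an assumption):

	/*
	 * Sketch: an enclave's XFRM must be a subset of what the guest
	 * may set in XCR0; FP/SSE are always permitted.
	 */
	if (xfrm & ~(vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE))
		return 1;	/* fail the ECREATE emulation */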
-
- 19 May, 2023 9 commits
-
-
Michal Luczaj authored
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when
there's no failure path requiring vcpu removal and destruction. Such
ordering is important because vcpu_array accessors may end up
referencing vcpu at vcpu_array[0] even before online_vcpus is set to 1.

When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):

	int num_vcpus = atomic_read(&kvm->online_vcpus);
	i = array_index_nospec(i, num_vcpus);
	return xa_load(&kvm->vcpu_array, i);

Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate
over an "empty" range, but actually [0, ULONG_MAX]:

	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
			  (atomic_read(&kvm->online_vcpus) - 1))

In both cases, such an online_vcpus=0 edge case, even if leading to
unnecessary calls to the XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and
xa_for_each_range().

However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first
vCPU.

This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, the newly inserted vcpu is
removed from the vcpu_array, then destroyed:

	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);

	kvm_get_kvm(kvm);
	r = create_vcpu_fd(vcpu);
	if (r < 0) {
		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
		kvm_put_kvm_no_destroy(kvm);
		goto unlock_vcpu_destroy;
	}
	atomic_inc(&kvm->online_vcpus);

This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before
said vcpu is destroyed.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b07754 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
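A sketch of the reordering described above (simplified; the
reserve/store split is an assumption about how the slot is held while
the fd is created):

	/* Reserve the slot without publishing the vcpu yet. */
	r = xa_reserve(&kvm->vcpu_array, vcpu->vcpu_idx, GFP_KERNEL_ACCOUNT);
	if (r)
		goto unlock_vcpu_destroy;

	/* Create the fd first: this is the last failure point. */
	kvm_get_kvm(kvm);
	r = create_vcpu_fd(vcpu);
	if (r < 0)
		goto kvm_put_xa_release;

	/* Now it's all set up; let kvm_get_vcpu() and friends see it. */
	if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
		r = -EINVAL;
		goto kvm_put_xa_release;
	}
	atomic_inc(&kvm->online_vcpus);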
-
Jacob Xu authored
Include a definition of WARN_ON_ONCE() before using it.

Fixes: bb1fcc70 ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jacob Xu <jacobhxu@google.com>
[reworded commit message; changed <asm/bug.h> to <linux/bug.h>]
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225012959.1554168-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated, to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.

Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g.
`reboot -f`. In a "normal" reboot, userspace will gracefully tear down
userspace before triggering the kernel reboot (modulo bugs, errors,
etc), i.e. any process that might do ioctl(KVM_CREATE_VM) is long gone.

Fixes: 8e1c1815 ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use syscore_ops.shutdown to disable hardware virtualization during a
reboot instead of using the dedicated reboot_notifier, so that KVM
disables virtualization _after_ system_state has been updated. This
will allow fixing a race in KVM's handling of a forced reboot where KVM
can end up enabling hardware virtualization between
kernel_restart_prepare() and machine_restart().

Rename KVM's hook to match the syscore op to avoid any possible
confusion from wiring up a "reboot" helper to a "shutdown" hook
(neither "shutdown" nor "reboot" is completely accurate as the hook
handles both).

Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
specific, and to explain why kvm_rebooting exists.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: kvmarm@lists.linux.dev
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Anup Patel <anup@brainfault.org>
Cc: Atish Patra <atishp@atishpatra.org>
Cc: kvm-riscv@lists.infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
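A sketch of the syscore wiring (the suspend/resume members are shown
for context and are an assumption here):

	static struct syscore_ops kvm_syscore_ops = {
		.suspend  = kvm_suspend,
		.resume   = kvm_resume,
		.shutdown = kvm_shutdown,	/* replaces the reboot_notifier */
	};

	/* registered once at init */
	register_syscore_ops(&kvm_syscore_ops);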
-
Will Deacon authored
Since host stage-2 mappings are created lazily, we cannot rely solely
on the pte in order to recover the target physical address when
checking a host-initiated memory transition, as this permits donation
of unmapped regions corresponding to MMIO or "no-map" memory.

Instead of inspecting the pte, move the addr_is_allowed_memory() check
into the host callback function where it is passed the physical address
directly from the walker.

Cc: Quentin Perret <qperret@google.com>
Fixes: e82edcc7 ("KVM: arm64: Implement do_share() helper for sharing memory")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518095844.1178-1-will@kernel.org
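A sketch of the reworked callback (state helpers per the pKVM
mem-protect code; treat the exact signature as an assumption):

	/*
	 * Sketch: the walker now hands the physical address straight to
	 * the host's state check, so unmapped MMIO / "no-map" ranges are
	 * rejected even when no pte exists.
	 */
	static enum pkvm_page_state host_get_page_state(kvm_pte_t pte, u64 addr)
	{
		if (!addr_is_allowed_memory(addr))
			return PKVM_NOPAGE;

		if (!kvm_pte_valid(pte) && pte)
			return PKVM_NOPAGE;

		return pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte));
	}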
-
Jean-Philippe Brucker authored
It is host userspace, not the guest, that issues
KVM_DEV_ARM_VGIC_GRP_CTRL.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-5-jean-philippe@linaro.org
-
Jean-Philippe Brucker authored
It is now config_lock that must be held, not the kvm lock. Replace the
comment with a lockdep annotation.

Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-4-jean-philippe@linaro.org
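A one-line sketch of the annotation that replaces the comment (the
assertion both documents and enforces the locking rule):

	lockdep_assert_held(&kvm->arch.config_lock);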
-
Jean-Philippe Brucker authored
vgic_its_create() changes the vgic state without holding the
config_lock, which triggers a lockdep warning in vgic_v4_init():

  [ 358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8
  ...
  [ 358.707410]  vgic_v4_init+0x15c/0x7a8
  [ 358.708550]  vgic_its_create+0x37c/0x4a4
  [ 358.709640]  kvm_vm_ioctl+0x1518/0x2d80
  [ 358.710688]  __arm64_sys_ioctl+0x7ac/0x1ba8
  [ 358.711960]  invoke_syscall.constprop.0+0x70/0x1e0
  [ 358.713245]  do_el0_svc+0xe4/0x2d4
  [ 358.714289]  el0_svc+0x44/0x8c
  [ 358.715329]  el0t_64_sync_handler+0xf4/0x120
  [ 358.716615]  el0t_64_sync+0x190/0x194

Wrap the whole of vgic_its_create() with config_lock since, in addition
to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic
state.

Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
-
Jean-Philippe Brucker authored
Lockdep reports a circular lock dependency between the srcu and the
config_lock:

  [ 262.179917] -> #1 (&kvm->srcu){.+.+}-{0:0}:
  [ 262.182010]        __synchronize_srcu+0xb0/0x224
  [ 262.183422]        synchronize_srcu_expedited+0x24/0x34
  [ 262.184554]        kvm_io_bus_register_dev+0x324/0x50c
  [ 262.185650]        vgic_register_redist_iodev+0x254/0x398
  [ 262.186740]        vgic_v3_set_redist_base+0x3b0/0x724
  [ 262.188087]        kvm_vgic_addr+0x364/0x600
  [ 262.189189]        vgic_set_common_attr+0x90/0x544
  [ 262.190278]        vgic_v3_set_attr+0x74/0x9c
  [ 262.191432]        kvm_device_ioctl+0x2a0/0x4e4
  [ 262.192515]        __arm64_sys_ioctl+0x7ac/0x1ba8
  [ 262.193612]        invoke_syscall.constprop.0+0x70/0x1e0
  [ 262.195006]        do_el0_svc+0xe4/0x2d4
  [ 262.195929]        el0_svc+0x44/0x8c
  [ 262.196917]        el0t_64_sync_handler+0xf4/0x120
  [ 262.198238]        el0t_64_sync+0x190/0x194
  [ 262.199224]
  [ 262.199224] -> #0 (&kvm->arch.config_lock){+.+.}-{3:3}:
  [ 262.201094]        __lock_acquire+0x2b70/0x626c
  [ 262.202245]        lock_acquire+0x454/0x778
  [ 262.203132]        __mutex_lock+0x190/0x8b4
  [ 262.204023]        mutex_lock_nested+0x24/0x30
  [ 262.205100]        vgic_mmio_write_v3_misc+0x5c/0x2a0
  [ 262.206178]        dispatch_mmio_write+0xd8/0x258
  [ 262.207498]        __kvm_io_bus_write+0x1e0/0x350
  [ 262.208582]        kvm_io_bus_write+0xe0/0x1cc
  [ 262.209653]        io_mem_abort+0x2ac/0x6d8
  [ 262.210569]        kvm_handle_guest_abort+0x9b8/0x1f88
  [ 262.211937]        handle_exit+0xc4/0x39c
  [ 262.212971]        kvm_arch_vcpu_ioctl_run+0x90c/0x1c04
  [ 262.214154]        kvm_vcpu_ioctl+0x450/0x12f8
  [ 262.215233]        __arm64_sys_ioctl+0x7ac/0x1ba8
  [ 262.216402]        invoke_syscall.constprop.0+0x70/0x1e0
  [ 262.217774]        do_el0_svc+0xe4/0x2d4
  [ 262.218758]        el0_svc+0x44/0x8c
  [ 262.219941]        el0t_64_sync_handler+0xf4/0x120
  [ 262.221110]        el0t_64_sync+0x190/0x194

Note that the current report, which can be triggered by the vgic_irq
kselftest, is a triple chain that includes slots_lock, but after
inverting the slots_lock/config_lock dependency, the actual problem
reported above remains.

In several places, the vgic code calls kvm_io_bus_register_dev(), which
synchronizes the srcu, while holding config_lock (#1). And the MMIO
handler takes the config_lock while holding the srcu read lock (#0).

Break dependency #1 by registering the distributor and redistributors
without holding config_lock. The ITS also uses
kvm_io_bus_register_dev(), but already relies on slots_lock to
serialize calls.

The distributor iodev is created on the first KVM_RUN call. Multiple
threads will race for vgic initialization, and only the first one will
see !vgic_ready() under the lock. To serialize those threads, rely on
slots_lock rather than config_lock.

Redistributors are created earlier, through KVM_DEV_ARM_VGIC_GRP_ADDR
ioctls and vCPU creation. Similarly, serialize the iodev creation with
slots_lock, and the rest with config_lock.

Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-2-jean-philippe@linaro.org
-
- 17 May, 2023 1 commit
-
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #1

- Plug a race in the stage-2 mapping code where the IPA and the PA
  would end up being out of sync

- Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)

- FP/SVE/SME documentation update, in the hope that this field becomes
  clearer...

- Add workaround for the usual Apple SEIS brokenness

- Random comment fixes
-
- 14 May, 2023 6 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Linus Torvalds authored
Pull compute express link fixes from Dan Williams:

- Fix a compilation issue with DEFINE_STATIC_SRCU() in the unit tests

- Fix leaking kernel memory to a root-only sysfs attribute

* tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl: Add missing return to cdat read error path
  tools/testing/cxl: Use DEFINE_STATIC_SRCU()
-
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Linus Torvalds authored
Pull parisc architecture fixes from Helge Deller:

- Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag

- Include reboot.h to avoid gcc-12 compiler warning

* tag 'parisc-for-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag
  parisc: kexec: include reboot.h
-
git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds authored
Pull ARM fixes from Russell King:

- fix unwinder for uleb128 case

- fix kernel-doc warnings for HP Jornada 7xx

- fix unbalanced stack on vfp success path

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 9297/1: vfp: avoid unbalanced stack on 'success' return path
  ARM: 9296/1: HP Jornada 7XX: fix kernel-doc warnings
  ARM: 9295/1: unwind:fix unwind abort for uleb128 case
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Linus Torvalds authored
Pull locking fix from Borislav Petkov:

- Make sure __down_read_common() is always inlined so that the
  callers' names land in traceevents output, and thus the blocked
  function can be identified

* tag 'locking_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Linus Torvalds authored
Pull perf fixes from Borislav Petkov:

- Make sure the PEBS buffer is flushed before reprogramming the
  hardware so that the correct record sizes are used

- Update the sample size for AMD BRS events

- Fix a confusion with using the same on-stack struct with different
  events in the event processing path

* tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
  perf/x86: Fix missing sample size update on AMD BRS
  perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()
-