- 15 Mar, 2024 1 commit
-
-
Paolo Bonzini authored
$(shell ...) expands to the output of the command. It expands to the empty string when the command does not print anything to stdout. Hence, $(shell mkdir ...) is sufficient and does not need any variable assignment in front of it. Commit c2bd08ba ("treewide: remove meaningless assignments in Makefiles", 2024-02-23) did this to all of tools/ but ignored in-flight changes to tools/testing/selftests/kvm/Makefile, so reapply the change. Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 14 Mar, 2024 1 commit
-
-
Paolo Bonzini authored
Merge tag 'kvm-s390-next-6.9-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - Memop selftest rotate fix - SCLP event bits over indication fix - Missing virt_to_phys for the CRYCB fix
-
- 11 Mar, 2024 11 commits
-
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM x86 PMU changes for 6.9: - Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM x86 MMU changes for 6.9: - Clean up code related to unprotecting shadow pages when retrying a guest instruction after failed #PF-induced emulation. - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if a reschedule is needed, e.g. if a high priority task needs to run. Because KVM doesn't support yielding in the middle of processing a zapped non-leaf SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when attempting to schedule in a high priority. - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. - Allocate write-tracking metadata on-demand to avoid the memory overhead when running kernels built with KVMGT support (external write-tracking enabled), but for workloads that don't use nested virtualization (shadow paging) or KVMGT.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM common MMU changes for 6.9: - Harden KVM against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel. - Fix a benign bug in __kvm_mmu_topup_memory_cache() where the object size and number of objects parameters to kvmalloc_array() were swapped.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM async page fault changes for 6.9: - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a use-after-free if kvm.ko is unloaded. - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, e.g. so that there's no need to remember to *conditionally* clean up after the worker.
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM selftests changes for 6.9: - Add macros to reduce the amount of boilerplate code needed to write "simple" selftests, and to utilize selftest TAP infrastructure, which is especially beneficial for KVM selftests with multiple testcases. - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory. - Fix benign bugs where tests neglect to close() guest_memfd files.
-
https://github.com/kvm-riscv/linuxPaolo Bonzini authored
KVM/riscv changes for 6.9 - Exception and interrupt handling for selftests - Sstc (aka arch_timer) selftest - Forward seed CSR access to KVM userspace - Ztso extension support for Guest/VM - Zacas extension support for Guest/VM
-
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarmPaolo Bonzini authored
KVM/arm64 updates for 6.9 - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests
-
Paolo Bonzini authored
Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.9 * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest.
-
- 09 Mar, 2024 1 commit
-
-
https://github.com/kvm-x86/linuxPaolo Bonzini authored
KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
-
- 07 Mar, 2024 6 commits
-
-
Oliver Upton authored
* kvm-arm64/kerneldoc: : kerneldoc warning fixes, courtesy of Randy Dunlap : : Fixes addressing the widespread misuse of kerneldoc-style comments : throughout KVM/arm64. KVM: arm64: vgic: fix a kernel-doc warning KVM: arm64: vgic-its: fix kernel-doc warnings KVM: arm64: vgic-init: fix a kernel-doc warning KVM: arm64: sys_regs: fix kernel-doc warnings KVM: arm64: PMU: fix kernel-doc warnings KVM: arm64: mmu: fix a kernel-doc warning KVM: arm64: vhe: fix a kernel-doc warning KVM: arm64: hyp/aarch32: fix kernel-doc warnings KVM: arm64: guest: fix kernel-doc warnings KVM: arm64: debug: fix kernel-doc warnings Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
Oliver Upton authored
* kvm-arm64/vfio-normal-nc: : Normal-NC support for vfio-pci @ stage-2, courtesy of Ankit Agrawal : : KVM's policy to date has been that any and all MMIO mapping at stage-2 : is treated as Device-nGnRE. This is primarily done due to concerns of : the guest triggering uncontainable failures in the system if they manage : to tickle the device / memory system the wrong way, though this is : unnecessarily restrictive for devices that can be reasoned as 'safe'. : : Unsurprisingly, the Device-* mapping can really hurt the performance of : assigned devices that can handle Gathering, and can be an outright : correctness issue if the guest driver does unaligned accesses. : : Rather than opening the floodgates to the full ecosystem of devices that : can be exposed to VMs, take the conservative approach and allow PCI : devices to be mapped as Normal-NC since it has been determined to be : 'safe'. vfio: Convey kvm that the vfio-pci device is wc safe KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device mm: Introduce new flag to indicate wc safe KVM: arm64: Introduce new flag for non-cacheable IO memory Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
Oliver Upton authored
* kvm-arm64/lpi-xarray: : xarray-based representation of vgic LPIs : : KVM's linked-list of LPI state has proven to be a bottleneck in LPI : injection paths, due to lock serialization when acquiring / releasing a : reference on an IRQ. : : Start the tedious process of reworking KVM's LPI injection by replacing : the LPI linked-list with an xarray, leveraging this to allow RCU readers : to walk it outside of the spinlock. KVM: arm64: vgic: Don't acquire the lpi_list_lock in vgic_put_irq() KVM: arm64: vgic: Ensure the irq refcount is nonzero when taking a ref KVM: arm64: vgic: Rely on RCU protection in vgic_get_lpi() KVM: arm64: vgic: Free LPI vgic_irq structs in an RCU-safe manner KVM: arm64: vgic: Use atomics to count LPIs KVM: arm64: vgic: Get rid of the LPI linked-list KVM: arm64: vgic-its: Walk the LPI xarray in vgic_copy_lpi_list() KVM: arm64: vgic-v3: Iterate the xarray to find pending LPIs KVM: arm64: vgic: Use xarray to find LPI in vgic_get_lpi() KVM: arm64: vgic: Store LPIs in an xarray Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
Oliver Upton authored
* kvm-arm64/vm-configuration: (29 commits) : VM configuration enforcement, courtesy of Marc Zyngier : : Userspace has gained the ability to control the features visible : through the ID registers, yet KVM didn't take this into account as the : effective feature set when determing trap / emulation behavior. This : series adds: : : - Mechanism for testing the presence of a particular CPU feature in the : guest's ID registers : : - Infrastructure for computing the effective value of VNCR-backed : registers, taking into account the RES0 / RES1 bits for a particular : VM configuration : : - Implementation of 'fine-grained UNDEF' controls that shadow the FGT : register definitions. KVM: arm64: Don't initialize idreg debugfs w/ preemption disabled KVM: arm64: Fail the idreg iterator if idregs aren't initialized KVM: arm64: Make build-time check of RES0/RES1 bits optional KVM: arm64: Add debugfs file for guest's ID registers KVM: arm64: Snapshot all non-zero RES0/RES1 sysreg fields for later checking KVM: arm64: Make FEAT_MOPS UNDEF if not advertised to the guest KVM: arm64: Make AMU sysreg UNDEF if FEAT_AMU is not advertised to the guest KVM: arm64: Make PIR{,E0}_EL1 UNDEF if S1PIE is not advertised to the guest KVM: arm64: Make TLBI OS/Range UNDEF if not advertised to the guest KVM: arm64: Streamline save/restore of HFG[RW]TR_EL2 KVM: arm64: Move existing feature disabling over to FGU infrastructure KVM: arm64: Propagate and handle Fine-Grained UNDEF bits KVM: arm64: Add Fine-Grained UNDEF tracking information KVM: arm64: Rename __check_nv_sr_forward() to triage_sysreg_trap() KVM: arm64: Use the xarray as the primary sysreg/sysinsn walker KVM: arm64: Register AArch64 system register entries with the sysreg xarray KVM: arm64: Always populate the trap configuration xarray KVM: arm64: nv: Move system instructions to their own sys_reg_desc array KVM: arm64: Drop the requirement for XARRAY_MULTI KVM: arm64: nv: Turn encoding ranges into discrete XArray stores ... Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
Oliver Upton authored
* kvm-arm64/misc: : Miscellaneous updates : : - Fix handling of features w/ nonzero safe values in set_id_regs : selftest : : - Cleanup the unused kern_hyp_va() asm macro : : - Differentiate nVHE and hVHE in boot-time message : : - Several selftests cleanups : : - Drop bogus return value from kvm_arch_create_vm_debugfs() : : - Make save/restore of SPE and TRBE control registers affect EL1 state : in hVHE mode : : - Typos KVM: arm64: Fix TRFCR_EL1/PMSCR_EL1 access in hVHE mode KVM: selftests: aarch64: Remove unused functions from vpmu test KVM: arm64: Fix typos KVM: Get rid of return value from kvm_arch_create_vm_debugfs() KVM: selftests: Print timer ctl register in ISTATUS assertion KVM: selftests: Fix GUEST_PRINTF() format warnings in ARM code KVM: arm64: removed unused kern_hyp_va asm macro KVM: arm64: add comments to __kern_hyp_va KVM: arm64: print Hyp mode KVM: arm64: selftests: Handle feature fields with nonzero minimum value correctly Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
Oliver Upton authored
* kvm-arm64/feat_e2h0: : Support for FEAT_E2H0, courtesy of Marc Zyngier : : As described in the cover letter: : : Since ARMv8.1, the architecture has grown the VHE feature, which makes : EL2 a superset of EL1. With ARMv9.5 (and retroactively allowed from : ARMv8.1), the architecture allows implementations to have VHE as the : *only* implemented behaviour, meaning that HCR_EL2.E2H can be : implemented as RES1. As a follow-up, HCR_EL2.NV1 can also be : implemented as RES0, making the VHE-ness of the architecture : recursive. : : This series adds support for detecting the architectural feature of E2H : being RES1, leveraging the existing infrastructure for handling : out-of-spec CPUs that are VHE-only. Additionally, the (incomplete) NV : infrastructure in KVM is updated to enforce E2H=1 for guest hypervisors : on implementations that do not support NV1. arm64: cpufeatures: Fix FEAT_NV check when checking for FEAT_NV1 arm64: cpufeatures: Only check for NV1 if NV is present arm64: cpufeatures: Add missing ID_AA64MMFR4_EL1 to __read_sysreg_by_encoding() KVM: arm64: Handle Apple M2 as not having HCR_EL2.NV1 implemented KVM: arm64: Force guest's HCR_EL2.E2H RES1 when NV1 is not implemented KVM: arm64: Expose ID_AA64MMFR4_EL1 to guests arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative arm64: cpufeature: Detect HCR_EL2.NV1 being RES0 arm64: cpufeature: Add ID_AA64MMFR4_EL1 handling arm64: sysreg: Add layout for ID_AA64MMFR4_EL1 arm64: cpufeature: Correctly display signed override values arm64: cpufeatures: Correctly handle signed values arm64: Add macro to compose a sysreg field value Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
-
- 06 Mar, 2024 13 commits
-
-
Anup Patel authored
The KVM RISC-V allows Zacas extension for Guest/VM so add this extension to get-reg-list test. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Anup Patel authored
Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Zacas extension for Guest/VM. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Anup Patel authored
The KVM RISC-V allows Ztso extension for Guest/VM so add this extension to get-reg-list test. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Anup Patel authored
Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Ztso extension for Guest/VM. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Anup Patel authored
The SEED CSR access from VS/VU mode (guest) will always trap to HS-mode (KVM) when Zkr extension is available to the Guest/VM. Forward this CSR access to KVM user space so that it can be emulated based on the method chosen by VMM. Fixes: f370b4e6 ("RISC-V: KVM: Allow scalar crypto extensions for Guest/VM") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Haibo Xu authored
Add a KVM selftests to validate the Sstc timer functionality. The test was ported from arm64 arch timer test. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Haibo Xu authored
Move vcpu_has_ext to the processor.c and rename it to __vcpu_has_ext so that other test cases can use it for vCPU extension check. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Haibo Xu authored
Add guest_get_vcpuid() helper to simplify accessing to per-cpu private data. The sscratch CSR was used to store the vcpu id. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Haibo Xu authored
Add the infrastructure for guest exception handling in riscv selftests. Customized handlers can be enabled by vm_install_exception_handler(vector) or vm_install_interrupt_handler(). The code is inspired from that of x86/arm64. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
-
Bibo Mao authored
Some CSR registers like CRMD/PRMD are saved during enter VM mode now. However they are not restored for actual use, so saving for these CSR registers can be removed. Reviewed-by: Tianrui Zhao <zhaotianrui@loongson.cn> Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
-
Bibo Mao authored
LoongArch VCPUs have their own separate HW timers. SW timer is to wake up blocked vcpu thread, rather than HW timer emulation. When blocking vcpu scheduled out, SW timer is used to wakeup blocked vcpu thread and injects timer interrupt. It does not care about whether guest timer is in period mode or oneshot mode, and SW timer needs not to be restarted since vcpu has been woken. This patch does not restart SW timer when it is expired. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
-
Bibo Mao authored
SW timer is enabled when vcpu thread is scheduled out, and it is to wake up vcpu from blocked queue. If vcpu thread is scheduled out but is not blocked, such as it is preempted by other threads, it is not necessary to enable SW timer. Since vcpu thread is still on running queue if it is preempted and SW timer is only to wake up vcpu on blocking queue, so SW timer is not useful in this situation. This patch enables SW timer only when vcpu is scheduled out and is blocking. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
-
Bibo Mao authored
Supported CPUCFG information comes from function _kvm_get_cpucfg_mask(). A bit should be zero if it is reserved by HW or if it is not supported by KVM. Also LoongArch software page table walk feature defined in CPUCFG2_LSPW is supported by KVM, it should be enabled by default. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
-
- 05 Mar, 2024 6 commits
-
-
Dongli Zhang authored
Explicitly close() guest_memfd files in various guest_memfd and private_mem_conversions tests, there's no reason to keep the files open until the test exits. Fixes: 8a89efd4 ("KVM: selftests: Add basic selftest for guest_memfd()") Fixes: 43f623f3 ("KVM: selftests: Add x86-only selftest for private memory conversions") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240227015716.27284-1-dongli.zhang@oracle.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
-
David Woodhouse authored
The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr(). ============================================ WARNING: possible recursive locking detected 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O -------------------------------------------- xen_shinfo_test/250013 is trying to acquire lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm] but task is already holding lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm] Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn(). Fixes: 77c9b9de ("KVM: x86/xen: Use fast path for Xen timer delivery") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.orgSigned-off-by: Sean Christopherson <seanjc@google.com>
-
David Woodhouse authored
The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its *own* locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being *destroyed*, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org [sean: use guard(mutex) to fix a missed unlock] Signed-off-by: Sean Christopherson <seanjc@google.com>
-
David Woodhouse authored
The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID *may* then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.orgSigned-off-by: Sean Christopherson <seanjc@google.com>
-
David Woodhouse authored
Linux guests since commit b1c3497e ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is *disabled*. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.orgSigned-off-by: Sean Christopherson <seanjc@google.com>
-
David Woodhouse authored
A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times: root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and *theoretically* each vCPU could be running at a *different* frequency. But this patch is needed anyway because... The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the *same* time. Any time which *actually* elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired. Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just *once*, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the *absolute* values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does. Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org [sean: massage moved comment, tweak if statement formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>
-
- 03 Mar, 2024 1 commit
-
-
Linus Torvalds authored
-