1. 15 Mar, 2024 1 commit
    • selftests: kvm: remove meaningless assignments in Makefiles · 47811790
      Paolo Bonzini authored
      $(shell ...) expands to the output of the command. It expands to the
      empty string when the command does not print anything to stdout.
      Hence, $(shell mkdir ...) is sufficient and does not need any
      variable assignment in front of it.
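As a rough illustration outside of make itself, the Python sketch below mimics what GNU make's $(shell ...) does with a command's output: capture stdout, drop the trailing newline, and turn the remaining newlines into spaces. The helper name shell_expand and the temporary directory are assumptions made for this example; the only point is that mkdir prints nothing, so the expansion is empty and there is nothing worth assigning.

```python
import os
import shlex
import subprocess
import tempfile


def shell_expand(command):
    """Approximate make's $(shell ...): return the command's stdout as one line."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    # make removes the trailing newline and replaces the others with spaces.
    return result.stdout.rstrip("\n").replace("\n", " ")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "objs")
    # mkdir does its work as a side effect and writes nothing to stdout.
    expansion = shell_expand("mkdir -p " + shlex.quote(target))
    created = os.path.isdir(target)

print(repr(expansion), created)
```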
      
      Commit c2bd08ba ("treewide: remove meaningless assignments in
      Makefiles", 2024-02-23) did this to all of tools/ but ignored in-flight
      changes to tools/testing/selftests/kvm/Makefile, so reapply the change.
      
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 14 Mar, 2024 1 commit
  3. 11 Mar, 2024 11 commits
    • Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD · e9a2bba4
      Paolo Bonzini authored
      KVM Xen and pfncache changes for 6.9:
      
       - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
         that are "mapped" into guests via physical addresses.
      
       - Add support for using gfn_to_pfn caches with only a host virtual address,
         i.e. to bypass the "gfn" stage of the cache.  The primary use case is
         overlay pages, where the guest may change the gfn used to reference the
         overlay page, but the backing hva+pfn remains the same.
      
       - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
         of a gpa, so that userspace doesn't need to reconfigure and invalidate the
         cache/mapping if the guest changes the gpa (but userspace keeps the resolved
         hva the same).
      
       - When possible, use a single host TSC value when computing the deadline for
         Xen timers in order to improve the accuracy of the timer emulation.
      
       - Inject pending upcall events when the vCPU software-enables its APIC to fix
         a bug where an upcall can be lost (and to follow Xen's behavior).
      
       - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
         events fails, e.g. if the guest has aliased xAPIC IDs.
      
       - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
         refresh), and drop a now-redundant acquisition of xen_lock (that was
         protecting the shared_info cache) to fix a deadlock due to recursively
         acquiring xen_lock.
    • Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD · e9025cdd
      Paolo Bonzini authored
      KVM x86 PMU changes for 6.9:
      
       - Fix several bugs where KVM speciously prevents the guest from utilizing
         fixed counters and architectural event encodings based on whether or not
         guest CPUID reports support for the _architectural_ encoding.
      
       - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
         priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
      
       - Add a selftest to verify KVM correctly emulates RDPMC, counter availability,
         and a variety of other PMC-related behaviors that depend on guest CPUID,
         i.e. are difficult to validate via KVM-Unit-Tests.
      
       - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
         cycles, e.g. when checking if a PMC event needs to be synthesized when
         skipping an instruction.
      
       - Optimize triggering of emulated events, e.g. for "count instructions" events
         when skipping an instruction, which yields a ~10% performance improvement in
         VM-Exit microbenchmarks when a vPMU is exposed to the guest.
      
       - Tighten the check for "PMI in guest" to reduce false positives if an NMI
         arrives in the host while KVM is handling an IRQ VM-Exit.
    • Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD · b00471a5
      Paolo Bonzini authored
      KVM VMX changes for 6.9:
      
       - Fix a bug where KVM would report stale/bogus exit qualification information
         when exiting to userspace due to an unexpected VM-Exit while the CPU was
         vectoring an exception.
      
       - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
      
       - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
         changes its MSR filter.
    • Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD · 41ebae2e
      Paolo Bonzini authored
      KVM x86 MMU changes for 6.9:
      
       - Clean up code related to unprotecting shadow pages when retrying a guest
         instruction after failed #PF-induced emulation.
      
       - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
         a reschedule is needed, e.g. if a high priority task needs to run.  Because
         KVM doesn't support yielding in the middle of processing a zapped non-leaf
         SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
         attempting to schedule in a high priority task.
      
       - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
         read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
      
       - Allocate write-tracking metadata on-demand to avoid the memory overhead when
         running kernels built with KVMGT support (external write-tracking enabled),
         but for workloads that don't use nested virtualization (shadow paging) or
         KVMGT.
    • Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD · c9cd0bea
      Paolo Bonzini authored
      KVM x86 misc changes for 6.9:
      
       - Explicitly initialize a variety of on-stack variables in the emulator that
         triggered KMSAN false positives (though in fairness to KMSAN, it's comically
         difficult to see that the uninitialized memory is never truly consumed).
      
       - Fix the debugregs ABI for 32-bit KVM, and clean up code related to reading
         DR6 and DR7.
      
       - Rework the "force immediate exit" code so that vendor code ultimately
         decides how and when to force the exit.  This allows VMX to further optimize
         handling preemption timer exits, and allows SVM to avoid sending a duplicate
         IPI (SVM also has a need to force an exit).
      
       - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
         vCPU creation ultimately failed, and add WARN to guard against similar bugs.
      
       - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
         (for directed yield), and simplify the logic for checking if the currently
         loaded vCPU is in-kernel.
      
       - Misc cleanups and fixes.
    • Merge tag 'kvm-x86-generic-6.9' of https://github.com/kvm-x86/linux into HEAD · 507e72f8
      Paolo Bonzini authored
      KVM common MMU changes for 6.9:
      
        - Harden KVM against underflowing the active mmu_notifier invalidation
          count, so that "bad" invalidations (usually due to bugs elsewhere in the
          kernel) are detected earlier and are less likely to hang the kernel.
      
        - Fix a benign bug in __kvm_mmu_topup_memory_cache() where the object size
          and number of objects parameters to kvmalloc_array() were swapped.
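The swap is benign because the number of bytes requested is the product of the two arguments, and the overflow check applies to that product, so the argument order cannot change the outcome. A minimal Python model of that reasoning follows; array_bytes is a hypothetical stand-in rather than the kernel's kvmalloc_array(), and the 64-bit SIZE_MAX is an assumption.

```python
SIZE_MAX = 2**64 - 1  # assumed 64-bit size_t


def array_bytes(n, size):
    """Return the byte count for n objects of the given size, or None on overflow."""
    total = n * size
    if total > SIZE_MAX:
        return None  # an array allocator would refuse this request
    return total


# Passing (count, object size) or (object size, count) yields the same request.
correct_order = array_bytes(40, 8)
swapped_order = array_bytes(8, 40)
print(correct_order, swapped_order)
```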
    • Merge tag 'kvm-x86-asyncpf-6.9' of https://github.com/kvm-x86/linux into HEAD · a81d95ae
      Paolo Bonzini authored
      KVM async page fault changes for 6.9:
      
       - Always flush the async page fault workqueue when a work item is being
         removed, especially during vCPU destruction, to ensure that there are no
         workers running in KVM code when all references to KVM-the-module are gone,
         i.e. to prevent a use-after-free if kvm.ko is unloaded.
      
       - Grab a reference to the VM's mm_struct in the async #PF worker itself instead
         of gifting the worker a reference, e.g. so that there's no need to remember
         to *conditionally* clean up after the worker.
    • Merge tag 'kvm-x86-selftests-6.9' of https://github.com/kvm-x86/linux into HEAD · 4d4c0285
      Paolo Bonzini authored
      KVM selftests changes for 6.9:
      
       - Add macros to reduce the amount of boilerplate code needed to write "simple"
         selftests, and to utilize selftest TAP infrastructure, which is especially
         beneficial for KVM selftests with multiple testcases.
      
       - Add basic smoke tests for SEV and SEV-ES, along with a pile of library
         support for handling private/encrypted/protected memory.
      
       - Fix benign bugs where tests neglect to close() guest_memfd files.
    • Merge tag 'kvm-riscv-6.9-1' of https://github.com/kvm-riscv/linux into HEAD · f074158a
      Paolo Bonzini authored
      KVM/riscv changes for 6.9
      
      - Exception and interrupt handling for selftests
      - Sstc (aka arch_timer) selftest
      - Forward seed CSR access to KVM userspace
      - Ztso extension support for Guest/VM
      - Zacas extension support for Guest/VM
    • Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · 961e2bfc
      Paolo Bonzini authored
      KVM/arm64 updates for 6.9
      
       - Infrastructure for building KVM's trap configuration based on the
         architectural features (or lack thereof) advertised in the VM's ID
         registers
      
       - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
         x86's WC) at stage-2, improving the performance of interacting with
         assigned devices that can tolerate it
      
       - Conversion of KVM's representation of LPIs to an xarray, utilized to
         address some of the serialization on the LPI injection path
      
       - Support for _architectural_ VHE-only systems, advertised through the
         absence of FEAT_E2H0 in the CPU's ID register
      
       - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
         selftests
    • Merge tag 'loongarch-kvm-6.9' of... · 233d0bc4
      Paolo Bonzini authored
      Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
      
      LoongArch KVM changes for v6.9
      
      * Set reserved bits as zero in CPUCFG.
      * Start SW timer only when vcpu is blocking.
      * Do not restart SW timer when it is expired.
      * Remove unnecessary CSR register saving during enter guest.
  4. 09 Mar, 2024 1 commit
    • Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD · 7d8942d8
      Paolo Bonzini authored
      KVM GUEST_MEMFD fixes for 6.8:
      
       - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
         avoid creating ABI that KVM can't sanely support.
      
       - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
         clear that such VMs are purely a development and testing vehicle, and
         come with zero guarantees.
      
       - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
         is to support confidential VMs with deterministic private memory (SNP
         and TDX) only in the TDP MMU.
      
       - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
         when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
  5. 07 Mar, 2024 6 commits
    • Merge branch kvm-arm64/kerneldoc into kvmarm/next · 4a09ddb8
      Oliver Upton authored
      * kvm-arm64/kerneldoc:
        : kerneldoc warning fixes, courtesy of Randy Dunlap
        :
        : Fixes addressing the widespread misuse of kerneldoc-style comments
        : throughout KVM/arm64.
        KVM: arm64: vgic: fix a kernel-doc warning
        KVM: arm64: vgic-its: fix kernel-doc warnings
        KVM: arm64: vgic-init: fix a kernel-doc warning
        KVM: arm64: sys_regs: fix kernel-doc warnings
        KVM: arm64: PMU: fix kernel-doc warnings
        KVM: arm64: mmu: fix a kernel-doc warning
        KVM: arm64: vhe: fix a kernel-doc warning
        KVM: arm64: hyp/aarch32: fix kernel-doc warnings
        KVM: arm64: guest: fix kernel-doc warnings
        KVM: arm64: debug: fix kernel-doc warnings
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
    • Merge branch kvm-arm64/vfio-normal-nc into kvmarm/next · 9bd8d7df
      Oliver Upton authored
      * kvm-arm64/vfio-normal-nc:
        : Normal-NC support for vfio-pci @ stage-2, courtesy of Ankit Agrawal
        :
        : KVM's policy to date has been that any and all MMIO mapping at stage-2
        : is treated as Device-nGnRE. This is primarily done due to concerns of
        : the guest triggering uncontainable failures in the system if they manage
        : to tickle the device / memory system the wrong way, though this is
        : unnecessarily restrictive for devices that can be reasoned as 'safe'.
        :
        : Unsurprisingly, the Device-* mapping can really hurt the performance of
        : assigned devices that can handle Gathering, and can be an outright
        : correctness issue if the guest driver does unaligned accesses.
        :
        : Rather than opening the floodgates to the full ecosystem of devices that
        : can be exposed to VMs, take the conservative approach and allow PCI
        : devices to be mapped as Normal-NC since it has been determined to be
        : 'safe'.
        vfio: Convey kvm that the vfio-pci device is wc safe
        KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device
        mm: Introduce new flag to indicate wc safe
        KVM: arm64: Introduce new flag for non-cacheable IO memory
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
    • Merge branch kvm-arm64/lpi-xarray into kvmarm/next · 8dbc4110
      Oliver Upton authored
      * kvm-arm64/lpi-xarray:
        : xarray-based representation of vgic LPIs
        :
        : KVM's linked-list of LPI state has proven to be a bottleneck in LPI
        : injection paths, due to lock serialization when acquiring / releasing a
        : reference on an IRQ.
        :
        : Start the tedious process of reworking KVM's LPI injection by replacing
        : the LPI linked-list with an xarray, leveraging this to allow RCU readers
        : to walk it outside of the spinlock.
        KVM: arm64: vgic: Don't acquire the lpi_list_lock in vgic_put_irq()
        KVM: arm64: vgic: Ensure the irq refcount is nonzero when taking a ref
        KVM: arm64: vgic: Rely on RCU protection in vgic_get_lpi()
        KVM: arm64: vgic: Free LPI vgic_irq structs in an RCU-safe manner
        KVM: arm64: vgic: Use atomics to count LPIs
        KVM: arm64: vgic: Get rid of the LPI linked-list
        KVM: arm64: vgic-its: Walk the LPI xarray in vgic_copy_lpi_list()
        KVM: arm64: vgic-v3: Iterate the xarray to find pending LPIs
        KVM: arm64: vgic: Use xarray to find LPI in vgic_get_lpi()
        KVM: arm64: vgic: Store LPIs in an xarray
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
    • Merge branch kvm-arm64/vm-configuration into kvmarm/next · 0d874858
      Oliver Upton authored
      * kvm-arm64/vm-configuration: (29 commits)
        : VM configuration enforcement, courtesy of Marc Zyngier
        :
        : Userspace has gained the ability to control the features visible
        : through the ID registers, yet KVM didn't take this into account as the
        : effective feature set when determining trap / emulation behavior. This
        : series adds:
        :
        :  - Mechanism for testing the presence of a particular CPU feature in the
        :    guest's ID registers
        :
        :  - Infrastructure for computing the effective value of VNCR-backed
        :    registers, taking into account the RES0 / RES1 bits for a particular
        :    VM configuration
        :
        :  - Implementation of 'fine-grained UNDEF' controls that shadow the FGT
        :    register definitions.
        KVM: arm64: Don't initialize idreg debugfs w/ preemption disabled
        KVM: arm64: Fail the idreg iterator if idregs aren't initialized
        KVM: arm64: Make build-time check of RES0/RES1 bits optional
        KVM: arm64: Add debugfs file for guest's ID registers
        KVM: arm64: Snapshot all non-zero RES0/RES1 sysreg fields for later checking
        KVM: arm64: Make FEAT_MOPS UNDEF if not advertised to the guest
        KVM: arm64: Make AMU sysreg UNDEF if FEAT_AMU is not advertised to the guest
        KVM: arm64: Make PIR{,E0}_EL1 UNDEF if S1PIE is not advertised to the guest
        KVM: arm64: Make TLBI OS/Range UNDEF if not advertised to the guest
        KVM: arm64: Streamline save/restore of HFG[RW]TR_EL2
        KVM: arm64: Move existing feature disabling over to FGU infrastructure
        KVM: arm64: Propagate and handle Fine-Grained UNDEF bits
        KVM: arm64: Add Fine-Grained UNDEF tracking information
        KVM: arm64: Rename __check_nv_sr_forward() to triage_sysreg_trap()
        KVM: arm64: Use the xarray as the primary sysreg/sysinsn walker
        KVM: arm64: Register AArch64 system register entries with the sysreg xarray
        KVM: arm64: Always populate the trap configuration xarray
        KVM: arm64: nv: Move system instructions to their own sys_reg_desc array
        KVM: arm64: Drop the requirement for XARRAY_MULTI
        KVM: arm64: nv: Turn encoding ranges into discrete XArray stores
        ...
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
    • Merge branch kvm-arm64/misc into kvmarm/next · a040adfb
      Oliver Upton authored
      * kvm-arm64/misc:
        : Miscellaneous updates
        :
        :  - Fix handling of features w/ nonzero safe values in set_id_regs
        :    selftest
        :
        :  - Cleanup the unused kern_hyp_va() asm macro
        :
        :  - Differentiate nVHE and hVHE in boot-time message
        :
        :  - Several selftests cleanups
        :
        :  - Drop bogus return value from kvm_arch_create_vm_debugfs()
        :
        :  - Make save/restore of SPE and TRBE control registers affect EL1 state
        :    in hVHE mode
        :
        :  - Typos
        KVM: arm64: Fix TRFCR_EL1/PMSCR_EL1 access in hVHE mode
        KVM: selftests: aarch64: Remove unused functions from vpmu test
        KVM: arm64: Fix typos
        KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
        KVM: selftests: Print timer ctl register in ISTATUS assertion
        KVM: selftests: Fix GUEST_PRINTF() format warnings in ARM code
        KVM: arm64: removed unused kern_hyp_va asm macro
        KVM: arm64: add comments to __kern_hyp_va
        KVM: arm64: print Hyp mode
        KVM: arm64: selftests: Handle feature fields with nonzero minimum value correctly
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
    • Merge branch kvm-arm64/feat_e2h0 into kvmarm/next · 262cd16e
      Oliver Upton authored
      * kvm-arm64/feat_e2h0:
        : Support for FEAT_E2H0, courtesy of Marc Zyngier
        :
        : As described in the cover letter:
        :
        :   Since ARMv8.1, the architecture has grown the VHE feature, which makes
        :   EL2 a superset of EL1. With ARMv9.5 (and retroactively allowed from
        :   ARMv8.1), the architecture allows implementations to have VHE as the
        :   *only* implemented behaviour, meaning that HCR_EL2.E2H can be
        :   implemented as RES1. As a follow-up, HCR_EL2.NV1 can also be
        :   implemented as RES0, making the VHE-ness of the architecture
        :   recursive.
        :
        : This series adds support for detecting the architectural feature of E2H
        : being RES1, leveraging the existing infrastructure for handling
        : out-of-spec CPUs that are VHE-only. Additionally, the (incomplete) NV
        : infrastructure in KVM is updated to enforce E2H=1 for guest hypervisors
        : on implementations that do not support NV1.
        arm64: cpufeatures: Fix FEAT_NV check when checking for FEAT_NV1
        arm64: cpufeatures: Only check for NV1 if NV is present
        arm64: cpufeatures: Add missing ID_AA64MMFR4_EL1 to __read_sysreg_by_encoding()
        KVM: arm64: Handle Apple M2 as not having HCR_EL2.NV1 implemented
        KVM: arm64: Force guest's HCR_EL2.E2H RES1 when NV1 is not implemented
        KVM: arm64: Expose ID_AA64MMFR4_EL1 to guests
        arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative
        arm64: cpufeature: Detect HCR_EL2.NV1 being RES0
        arm64: cpufeature: Add ID_AA64MMFR4_EL1 handling
        arm64: sysreg: Add layout for ID_AA64MMFR4_EL1
        arm64: cpufeature: Correctly display signed override values
        arm64: cpufeatures: Correctly handle signed values
        arm64: Add macro to compose a sysreg field value
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
  6. 06 Mar, 2024 13 commits
  7. 05 Mar, 2024 6 commits
    • KVM: selftests: Explicitly close guest_memfd files in some gmem tests · e9da6f08
      Dongli Zhang authored
      Explicitly close() guest_memfd files in various guest_memfd and
      private_mem_conversions tests; there's no reason to keep the files open
      until the test exits.
      
      Fixes: 8a89efd4 ("KVM: selftests: Add basic selftest for guest_memfd()")
      Fixes: 43f623f3 ("KVM: selftests: Add x86-only selftest for private memory conversions")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Link: https://lore.kernel.org/r/20240227015716.27284-1-dongli.zhang@oracle.com
      [sean: massage changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/xen: fix recursive deadlock in timer injection · 7a36d680
      David Woodhouse authored
      The fast-path timer delivery introduced a recursive locking deadlock
      when userspace configures a timer which has already expired and is
      delivered immediately. The call to kvm_xen_inject_timer_irqs() can
      call kvm_xen_set_evtchn(), which may take kvm->arch.xen.xen_lock,
      which is already held in kvm_xen_vcpu_get_attr().
      
       ============================================
       WARNING: possible recursive locking detected
       6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G           O
       --------------------------------------------
       xen_shinfo_test/250013 is trying to acquire lock:
       ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]
      
       but task is already holding lock:
       ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]
      
      Now that the gfn_to_pfn_cache has its own self-sufficient locking, its
      callers no longer need to ensure serialization, so just stop taking
      kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().
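The shape of the problem can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the KVM code: xen_lock is a plain non-recursive lock standing in for kvm->arch.xen.xen_lock, the function names are hypothetical, and a non-blocking acquire is used so the model reports the deadlock instead of hanging.

```python
import threading

xen_lock = threading.Lock()  # stands in for kvm->arch.xen.xen_lock


def set_evtchn(take_xen_lock):
    """Model of the event channel delivery path."""
    if take_xen_lock:
        # A blocking acquire would hang forever when the caller already holds
        # the lock, so the model probes with a non-blocking attempt instead.
        if not xen_lock.acquire(blocking=False):
            return "deadlock"
        xen_lock.release()
    return "delivered"


def vcpu_attr_path(take_xen_lock):
    """Model of a caller that holds xen_lock while injecting an expired timer."""
    with xen_lock:
        return set_evtchn(take_xen_lock)


before_fix = vcpu_attr_path(take_xen_lock=True)
after_fix = vcpu_attr_path(take_xen_lock=False)
print(before_fix, after_fix)
```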
      
      Fixes: 77c9b9de ("KVM: x86/xen: Use fast path for Xen timer delivery")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: pfncache: simplify locking and make more self-contained · 6addfcf2
      David Woodhouse authored
      The locking on the gfn_to_pfn_cache is... interesting. And awful.
      
      There is a rwlock in ->lock which readers take to ensure protection
      against concurrent changes. But __kvm_gpc_refresh() makes assumptions
      that certain fields will not change even while it drops the write lock
      and performs MM operations to revalidate the target PFN and kernel
      mapping.
      
      Commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via
      mutex") partly addressed that — not by fixing it, but by adding a new
      mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
      calls on a given gfn_to_pfn_cache, but is still only a partial solution.
      
      There is still a theoretical race where __kvm_gpc_refresh() runs in
      parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
      dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
      and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
      ->pfn and ->khva are still valid, and reinstalls those values into the
      structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
      but ->active clear. And a ->khva which looks like a reasonable kernel
      address but is actually unmapped.
      
      All it takes is a subsequent reactivation to cause that ->khva to be
      dereferenced. This would theoretically cause an oops which would look
      something like this:
      
      [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
      [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
      
      I say "theoretically" because theoretically, that oops that was seen in
      production cannot happen. The code which uses the gfn_to_pfn_cache is
      supposed to have its *own* locking, to further paper over the fact that
      the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
      rwlock abuse is not sufficient.
      
      For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
      shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
      the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
      but the cases where the vcpu or kvm object is being *destroyed*, in
      which case the subsequent reactivation should never happen.
      
      Theoretically.
      
      Nevertheless, this locking abuse is awful and should be fixed, even if
      no clear explanation can be found for how the oops happened. So expand
      the use of the ->refresh_lock mutex to ensure serialization of
      activate/deactivate vs. refresh and make the pfncache locking entirely
      self-sufficient.
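The interleaving described above can be replayed deterministically with a small Python model. Everything here is an assumption made for illustration: GpcModel carries only the fields named in this message, a generator's yield marks the window in which the write lock is dropped, and try_deactivate refuses instead of blocking so the example stays single-threaded.

```python
import threading


class GpcModel:
    """Toy stand-in for a gfn_to_pfn_cache with only the fields named above."""

    def __init__(self):
        self.active = True
        self.valid = True
        self.khva = "mapped"
        self.refresh_lock = threading.Lock()


def deactivate(gpc):
    gpc.active = False
    gpc.valid = False
    gpc.khva = None  # the kernel mapping is gone


def refresh_steps(gpc, hold_refresh_lock):
    """Generator whose yield marks the window where the write lock is dropped."""
    if hold_refresh_lock:
        gpc.refresh_lock.acquire()
    try:
        old_khva = gpc.khva  # snapshot taken before dropping the write lock
        yield                # window: other code can run here
        gpc.khva = old_khva  # reinstall what was believed to be still valid
        gpc.valid = True
    finally:
        if hold_refresh_lock:
            gpc.refresh_lock.release()


def try_deactivate(gpc):
    """Deactivate only if no refresh is in flight (a real caller would block)."""
    if not gpc.refresh_lock.acquire(blocking=False):
        return False
    try:
        deactivate(gpc)
        return True
    finally:
        gpc.refresh_lock.release()


# Racy interleaving: deactivate runs inside the refresh window.
racy = GpcModel()
steps = refresh_steps(racy, hold_refresh_lock=False)
next(steps)
deactivate(racy)
next(steps, None)

# Serialized interleaving: deactivate is refused until refresh completes.
safe = GpcModel()
steps = refresh_steps(safe, hold_refresh_lock=True)
next(steps)
refused = not try_deactivate(safe)
next(steps, None)
try_deactivate(safe)

print(racy.valid, racy.active, racy.khva)
print(refused, safe.valid, safe.active, safe.khva)
```

In the racy run the model ends up valid but inactive with a stale khva, which is the state described above; in the serialized run the deactivation only proceeds once the refresh has finished.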
      
      This means that a future commit can simplify the locking in the callers,
      such as the Xen emulation code which has an outstanding problem with
      recursive locking of kvm->arch.xen.xen_lock, which will no longer be
      necessary.
      
      The rwlock abuse described above is still not best practice, although
      it's harmless now that the ->refresh_lock is held for the entire duration
      while the offending code drops the write lock, does some other stuff,
      then takes the write lock again and assumes nothing changed. That can
      also be fixed^W cleaned up in a subsequent commit, but this commit is
      a simpler basis for the Xen deadlock fix mentioned above.

      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org
      [sean: use guard(mutex) to fix a missed unlock]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery · 66e3cf72
      David Woodhouse authored
      The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast
      version will always work for physical unicast", justifying its use of
      kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails.
      
      In fact that assumption isn't true if X2APIC isn't in use by the guest
      and there is (8-bit x)APIC ID aliasing. A single "unicast" destination
      APIC ID *may* then be delivered to multiple vCPUs. Remove the warning,
      and in fact it might as well just call kvm_irq_delivery_to_apic().

      Reported-by: Michal Luczaj <mhal@rbox.co>
      Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled · 8e62bf2b
      David Woodhouse authored
      Linux guests since commit b1c3497e ("x86/xen: Add support for
      HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU
      upcall vector when it's advertised in the Xen CPUID leaves.
      
      This upcall is injected through the guest's local APIC as an MSI, unlike
      the older system vector which was merely injected by the hypervisor any
      time the CPU was able to receive an interrupt and the upcall_pending
      flag is set in its vcpu_info.
      
      Effectively, that makes the per-CPU upcall edge triggered instead of
      level triggered, which results in the upcall being lost if the MSI is
      delivered when the local APIC is *disabled*.
      
      Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC
      for a vCPU is software enabled (in fact, on any write to the SPIV
      register which doesn't disable the APIC). Do the same in KVM since KVM
      doesn't provide a way for userspace to intervene and trap accesses to
      the SPIV register of a local APIC emulated by KVM.
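A toy Python model of the lost upcall and of the recheck on software enable is sketched below. The class and method names are hypothetical, and the model reduces the local APIC to a single enabled flag; it only illustrates why an edge-style delivery needs the pending flag to be re-examined when the APIC comes back.

```python
class VcpuModel:
    """Minimal model of a vCPU receiving per-vCPU upcalls through its APIC."""

    def __init__(self):
        self.apic_enabled = False
        self.upcall_pending = False
        self.delivered = 0

    def _inject(self):
        # An MSI aimed at a disabled local APIC is simply dropped.
        if self.apic_enabled:
            self.delivered += 1

    def send_upcall(self):
        self.upcall_pending = True  # the flag in vcpu_info stays set
        self._inject()              # edge-style: one delivery attempt only

    def software_enable_apic(self, recheck_pending):
        self.apic_enabled = True
        # The fix: look at the pending flag again when the APIC is enabled.
        if recheck_pending and self.upcall_pending:
            self._inject()


without_recheck = VcpuModel()
without_recheck.send_upcall()
without_recheck.software_enable_apic(recheck_pending=False)

with_recheck = VcpuModel()
with_recheck.send_upcall()
with_recheck.software_enable_apic(recheck_pending=True)

print(without_recheck.delivered, with_recheck.delivered)
```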
      
      Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/xen: improve accuracy of Xen timers · 451a7078
      David Woodhouse authored
      A test program such as http://david.woodhou.se/timerlat.c confirms user
      reports that timers are increasingly inaccurate as the lifetime of a
      guest increases. Reporting the actual delay observed when asking for
      100µs of sleep, it starts off OK on a newly-launched guest but gets
      worse over time, giving incorrect sleep times:
      
      root@ip-10-0-193-21:~# ./timerlat -c -n 5
      00000000 latency 103243/100000 (3.2430%)
      00000001 latency 103243/100000 (3.2430%)
      00000002 latency 103242/100000 (3.2420%)
      00000003 latency 103245/100000 (3.2450%)
      00000004 latency 103245/100000 (3.2450%)
      
      The biggest problem is that get_kvmclock_ns() returns inaccurate values
      when the guest TSC is scaled. The guest sees a TSC value scaled from the
      host TSC by a mul/shift conversion (hopefully done in hardware). The
      guest then converts that guest TSC value into nanoseconds using the
      mul/shift conversion given to it by the KVM pvclock information.
      
      But get_kvmclock_ns() performs only a single conversion directly from
      host TSC to nanoseconds, giving a different result. A test program at
      http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
      over a day.
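The effect of one truncating conversion versus two can be reproduced with plain integer arithmetic. The Python sketch below is a simplified model, not the kernel's pvclock code: the host and guest frequencies are made-up values, every ratio uses 32 fractional bits, and the additional shift that real mul/shift pairs carry is left out.

```python
FRAC_BITS = 32            # assumed fixed-point precision for every ratio here
NSEC_PER_SEC = 1_000_000_000
HOST_HZ = 3_000_000_000   # assumed host TSC frequency
GUEST_HZ = 2_500_000_000  # assumed scaled guest TSC frequency


def ratio(numerator, denominator):
    """Truncated fixed-point representation of numerator / denominator."""
    return (numerator << FRAC_BITS) // denominator


def scale(value, fixed_point_ratio):
    """Multiply by a fixed-point ratio, then shift the fraction bits away."""
    return (value * fixed_point_ratio) >> FRAC_BITS


def guest_view_ns(host_tsc):
    """Two conversions, as the guest sees it: host TSC -> guest TSC -> ns."""
    guest_tsc = scale(host_tsc, ratio(GUEST_HZ, HOST_HZ))
    return scale(guest_tsc, ratio(NSEC_PER_SEC, GUEST_HZ))


def single_step_ns(host_tsc):
    """One conversion straight from host TSC to nanoseconds."""
    return scale(host_tsc, ratio(NSEC_PER_SEC, HOST_HZ))


one_day_of_ticks = HOST_HZ * 86_400
exact_ns = 86_400 * NSEC_PER_SEC

# Each path loses a different amount to truncation, so the two disagree.
print(exact_ns - guest_view_ns(one_day_of_ticks))
print(exact_ns - single_step_ns(one_day_of_ticks))
```

Because every ratio is rounded down, both paths fall short of the exact value, but by different amounts, so a deadline computed with one path does not line up with the clock the guest derives with the other.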
      
      It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
      that. The actual guest hv_clock is per-CPU, and *theoretically* each
      vCPU could be running at a *different* frequency. But this patch is
      needed anyway because...
      
      The other issue with Xen timers was that the code would snapshot the
      host CLOCK_MONOTONIC at some point in time, and then... after a few
      interrupts may have occurred, some preemption perhaps... would also read
      the guest's kvmclock. Then it would proceed under the false assumption
      that those two happened at the *same* time. Any time which *actually*
      elapsed between reading the two clocks was introduced as inaccuracies
      in the time at which the timer fired.
      
      Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
      host TSC just *once*, then use the returned TSC value to calculate the
      kvmclock (making sure to do that the way the guest would instead of
      making the same mistake get_kvmclock_ns() does).
      
      Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
      timers still have to use CLOCK_MONOTONIC. In practice the difference
      between the two won't matter over the timescales involved, as the
      *absolute* values don't matter; just the delta.
      
      This does mean a new variant of kvm_get_time_and_clockread() is needed;
      called kvm_get_monotonic_and_clockread() because that's what it does.
      
      Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
      [sean: massage moved comment, tweak if statement formatting]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  8. 03 Mar, 2024 1 commit