- 08 Mar, 2022 17 commits
-
-
Sean Christopherson authored
When recovering a potential hugepage that was shattered for the iTLB multihit workaround, precisely zap only the target page instead of iterating over the TDP MMU to find the SP that was passed in. This will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Refactor __tdp_mmu_set_spte() to work with raw values instead of a tdp_iter object so that a future patch can modify SPTEs without doing a walk, and without having to synthesize a tdp_iter. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-13-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE, which is called out by the comment as being disallowed but not actually checked. Keep the WARN on the old_spte as well, because overwriting a REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by the lack of splats with the existing WARN). Fixes: 08f07c80 ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-12-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add helpers to read and write TDP MMU SPTEs instead of open coding rcu_dereference() all over the place, and to provide a convenient location to document why KVM doesn't exempt holding mmu_lock for write from having to hold RCU (and any future changes to the rules). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-11-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop RCU protection after processing each root when handling MMU notifier hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily drop RCU to let RCU do its thing between roots, and to make it clear that there's no special behavior that relies on holding RCU across all roots. Currently, the RCU protection is completely superficial, it's necessary only to make rcu_dereference() of SPTE pointers happy. A future patch will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to ensure shadow pages aren't freed before all vCPUs do a TLB flush (or rather, acknowledge the need for a flush), but in that case RCU needs to be held until the flush is complete if and only if the flush is needed because a shadow page may have been removed. And except for the "unmap" path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit, and can't change the PFN for a SP). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-10-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Batch TLB flushes (with other MMUs) when handling ->change_spte() notifications in the TDP MMU. The MMU notifier path in question doesn't allow yielding and correctly flushes before dropping mmu_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-9-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Look for a !leaf=>leaf conversion instead of a PFN change when checking if a SPTE change removed a TDP MMU shadow page. Convert the PFN check into a WARN, as KVM should never change the PFN of a shadow page (except when it's being zapped or replaced). From a purely theoretical perspective, it's not illegal to replace a SP with a hugepage pointing at the same PFN. In practice, it's impossible as that would require mapping guest memory overtop a kernel-allocated SP. Either way, the check is odd. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-8-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
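The check described above can be sketched roughly as follows. This is a simplified userspace model, not the kernel's code: `struct spte_state` and `change_removes_sp()` are hypothetical names, and treating a present=>non-present transition as removal is an assumption added for completeness (the message itself only names the !leaf=>leaf case).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a SPTE before or after a change. */
struct spte_state {
	bool present;	/* entry maps something */
	bool leaf;	/* entry maps memory rather than pointing at a shadow page */
};

/*
 * A shadow page hangs off a present, non-leaf SPTE.  In this sketch the
 * shadow page counts as removed if that SPTE becomes a leaf (the
 * !leaf=>leaf conversion) or, by assumption, becomes non-present.
 */
static bool change_removes_sp(struct spte_state old, struct spte_state cur)
{
	if (!old.present || old.leaf)
		return false;	/* no shadow page was linked here */

	return cur.leaf || !cur.present;
}
```

Note that the PFN is deliberately absent from the decision: per the message, a PFN change on a shadow page is an invariant violation to WARN about, not the signal that a page was removed.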
-
Paolo Bonzini authored
Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring that readers do not ever acquire a reference to an invalid root. After this patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the same way. kvm_tdp_mmu_zap_invalidated_roots() is different but it also does not acquire a reference to the invalid root, and it cannot see refcount=0/invalid because it is guaranteed to run after kvm_tdp_mmu_invalidate_all_roots(). Opportunistically add a lockdep assertion to the yield-safe iterator. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
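As an illustration of the rule stated above, the following userspace sketch models which roots a reader would walk. The type `struct root_state` and both helpers are hypothetical; the only thing taken from the message is that refcount=0/valid, refcount=0/invalid and refcount=1/invalid are all treated the same way (skipped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of root state the text discusses. */
struct root_state {
	int refcount;	/* 0 means the root is dying */
	bool invalid;	/* root has been invalidated */
};

/* A reader only walks roots that are both live and valid. */
static bool reader_may_walk(const struct root_state *r)
{
	return r->refcount > 0 && !r->invalid;
}

/* Count the roots a reader would actually visit. */
static size_t count_walkable(const struct root_state *roots, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (reader_may_walk(&roots[i]))
			count++;
	return count;
}
```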
-
Paolo Bonzini authored
Eager page splitting is an optimization; it does not have to be performed on invalid roots. It is also the only case in which a reader might acquire a reference to an invalid root, so after this change we know that readers will skip both dying and invalid roots. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Assert that mmu_lock is held for write by users of the yield-unfriendly TDP iterator. The nature of a shared walk means that the caller needs to play nice with other tasks modifying the page tables, which is more or less the same thing as playing nice with yielding. Theoretically, KVM could gain a flow where it could legitimately take mmu_lock for read in a non-preemptible context, but that's highly unlikely and any such case should be viewed with a fair amount of scrutiny. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove the misleading flush "handling" when zapping invalidated TDP MMU roots, and document that flushing is unnecessary for all flavors of MMUs when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU is dead code, as zap_gfn_range() is called with shared=true, in which case it will never return true due to the flushing being handled by tdp_mmu_zap_spte_atomic(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly ignore the result of zap_gfn_range() when putting the last reference to a TDP MMU root, and add a pile of comments to formalize the TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this only affects the !shared case, as zap_gfn_range() subtly never returns true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic(). Putting the root without a flush is ok because even if there are stale references to the root in the TLB, they are unreachable because KVM will not run the guest with the same ASID without first flushing (where ASID in this context refers to both SVM's explicit ASID and Intel's implicit ASID that is constructed from VPID+PCID+EPT4A+etc...). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-5-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Fix misleading and arguably wrong comments in the TDP MMU's fast zap flow. The comments, and the fact that actually zapping invalid roots was added separately, strongly suggests that zapping invalid roots is an optimization and not required for correctness. That is a lie. KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(), because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(), KVM is relying on it to fully remove all references to the memslot. Once the memslot is gone, KVM's mmu_notifier hooks will be unable to find the stale references as the hva=>gfn translation is done via the memslots. If KVM doesn't immediately zap SPTEs and userspace unmaps a range after deleting a memslot, KVM will fail to zap in response to the mmu_notifier due to not finding a memslot corresponding to the notifier's range, which leads to a variation of use-after-free. The other misleading comment (and code) explicitly states that roots without a reference should be skipped. While that's technically true, it's also extremely misleading as it should be impossible for KVM to encounter a defunct root on the list while holding mmu_lock for write. Opportunistically add a WARN to enforce that invariant. Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly check for present SPTEs when clearing dirty bits in the TDP MMU. This isn't strictly required for correctness, as setting the dirty bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT. However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would complain if anyone actually turned on KVM's MMU debugging. Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-3-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
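A minimal sketch of the explicit present check, assuming a made-up bit layout: `SPTE_PRESENT`, `SPTE_DIRTY` and `clear_dirty_bits()` are hypothetical and do not match KVM's real SPTE encoding or helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPTE_PRESENT	(1ull << 0)	/* hypothetical bit position */
#define SPTE_DIRTY	(1ull << 9)	/* hypothetical bit position */

/*
 * Clear the dirty bit on every present entry and return how many entries
 * were changed.  Non-present entries are skipped up front so they are
 * never handed to code that expects a present SPTE.
 */
static size_t clear_dirty_bits(uint64_t *sptes, size_t n)
{
	size_t cleared = 0;

	for (size_t i = 0; i < n; i++) {
		if (!(sptes[i] & SPTE_PRESENT))
			continue;	/* explicit present check */

		if (sptes[i] & SPTE_DIRTY) {
			sptes[i] &= ~SPTE_DIRTY;
			cleared++;
		}
	}
	return cleared;
}
```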
-
Paolo Bonzini authored
Allocations whose size is related to the memslot size can be arbitrarily large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy" sizes that fit in 32 bits. Cc: stable@vger.kernel.org Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls") Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Instead of using array_size or just a multiply, use a function that takes care of both the multiplication and the overflow checks. Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Linux has dozens of occurrences of vmalloc(array_size()) and vzalloc(array_size()). Allow to simplify the code by providing vmalloc_array and vcalloc, as well as the underscored variants that let the caller specify the GFP flags. Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
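The idea behind such array helpers, one call that does both the multiplication and the overflow check, can be sketched in userspace as below. The names `array_alloc()` and `array_zalloc()` are hypothetical stand-ins, `malloc()` takes the place of the kernel allocator, and `__builtin_mul_overflow` is a GCC/Clang builtin, so this is an illustration of the pattern rather than the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate n elements of 'size' bytes; fail cleanly if n * size overflows. */
static void *array_alloc(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;	/* product does not fit in size_t */
	return malloc(bytes);
}

/* Same as array_alloc(), but the returned memory is zeroed. */
static void *array_zalloc(size_t n, size_t size)
{
	size_t bytes;
	void *p;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	p = malloc(bytes);
	if (p)
		memset(p, 0, bytes);
	return p;
}
```

Callers then pass the element count and element size separately and never write the multiplication themselves, which is what removes the opportunity to forget the overflow check.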
-
- 04 Mar, 2022 1 commit
-
-
Paolo Bonzini authored
Merge bugfixes from 5.17 before merging more tricky work.
-
- 02 Mar, 2022 2 commits
-
-
Paolo Bonzini authored
kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two places, namely vcpu_run and post_kvm_run_save, and a third is actually needed around the call to vcpu->arch.complete_userspace_io to avoid the following splat: WARNING: suspicious RCU usage arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by CPU 28/KVM/370841: #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm] Call Trace: <TASK> dump_stack_lvl+0x59/0x73 reprogram_fixed_counter+0x15d/0x1a0 [kvm] kvm_pmu_trigger_event+0x1a3/0x260 [kvm] ? free_moved_vector+0x1b4/0x1e0 complete_fast_pio_in+0x8a/0xd0 [kvm] This splat is not at all unexpected, since complete_userspace_io callbacks can execute similar code to vmexits. For example, SVM with nrips=false will call into the emulator from svm_skip_emulated_instruction(). While it's tempting to never acquire kvm->srcu for an uninitialized vCPU, practically speaking there's no penalty to acquiring kvm->srcu "early" as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU. On the other hand, seemingly innocuous helpers like kvm_apic_accept_events() and sync_regs() can theoretically reach code that might access SRCU-protected data structures, e.g. sync_regs() can trigger forced exiting of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events(). Reported-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
Just like on the optional mmu_alloc_direct_roots() path, once shadow path reaches "r = -EIO" somewhere, the caller needs to know the actual state in order to enter error handling and avoid something worse. Fixes: 4a38162e ("KVM: MMU: load PDPTRs outside mmu_lock") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220301124941.48412-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 01 Mar, 2022 20 commits
-
-
Sean Christopherson authored
Disable preemption when loading/putting the AVIC during an APICv refresh. If the vCPU task is preempted and migrated to a different pCPU, the unprotected avic_vcpu_load() could set the wrong pCPU in the physical ID cache/table. Pull the necessary code out of avic_vcpu_{,un}blocking() and into a new helper to reduce the probability of introducing this exact bug a third time. Fixes: df7e4827 ("KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC") Cc: stable@vger.kernel.org Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Exit to userspace if setup_vmgexit_scratch() fails due to OOM or because copying data from guest (userspace) memory failed/faulted. The OOM scenario is clearcut, it's userspace's decision as to whether it should terminate the guest, free memory, etc... As for -EFAULT, arguably, any guest issue is a violation of the guest's contract with userspace, and thus userspace needs to decide how to proceed. E.g. userspace defines what is RAM vs. MMIO and communicates that directly to the guest, KVM is not involved in deciding what is/isn't RAM nor in communicating that information to the guest. If the scratch GPA doesn't resolve to a memslot, then the guest is not honoring the memory configuration as defined by userspace. And if userspace unmaps an hva for whatever reason, then exiting to userspace with -EFAULT is absolutely the right thing to do. KVM's ABI currently sucks and doesn't provide enough information to act on the -EFAULT, but that will hopefully be remedied in the future as there are multiple use cases, e.g. uffd and virtiofs truncation, that shouldn't require any work in KVM beyond returning -EFAULT with a small amount of metadata. KVM could define its ABI such that failure to access the scratch area is reflected into the guest, i.e. establish a contract with userspace, but that's undesirable as it limits KVM's options in the future, e.g. in the potential uffd case any failure on a uaccess needs to kick out to userspace. KVM does have several cases where it reflects these errors into the guest, e.g. kvm_pv_clock_pairing() and Hyper-V emulation, but KVM would preferably "fix" those instead of propagating the falsehood that any memory failure is the guest's fault. Lastly, returning a boolean as an "error" for a helper that isn't named accordingly never works out well.
Fixes: ad5b3532 ("KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure") Cc: Alper Gun <alpergun@google.com> Cc: Peter Gonda <pgonda@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225205209.3881130-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
WARN and bail if is_unsync_root() is passed a root for which there is no shadow page, i.e. is passed the physical address of one of the special roots, which do not have an associated shadow page. The current usage squeaks by without bug reports because neither kvm_mmu_sync_roots() nor kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root, and 5-level AMD CPUs are not generally available, i.e. no one can coerce KVM into calling is_unsync_root() on pml5_root. Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully) prevents KVM from crashing. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the unoccupied space, and update vcpu-requests.rst, which was missing an entry for KVM_REQ_VM_DEAD. Switching KVM_REQ_VM_DEAD to entry '1' also fixes the stale comment about bits 4-7 being reserved. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD. This will allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't actually used by generic KVM. No functional change intended. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove the generic kvm_reload_remote_mmus() and open code its functionality into the two x86 callers. x86 is (obviously) the only architecture that uses the hook, and is also the only architecture that uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That will change in a future patch, as x86's usage when zapping a single shadow page doesn't actually _need_ to reload all vCPUs' MMUs; only MMUs whose root is being zapped need to be reloaded. s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose. Drop the generic code in anticipation of implementing s390 and x86 arch specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely. Opportunistically reword the x86 TDP MMU comment to avoid making references to functions (and requests!) when possible, and to remove the rather ambiguous "this". No functional change intended. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call when the guest's CR4.PCIDE changes. This will allow tweaking the logic of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as obsolete, needs to unconditionally unload the entire MMU, _and_ affects only the current vCPU. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Hou Wenlong authored
KVM: x86/emulator: Move the unhandled outer privilege level logic of far return into __load_segment_descriptor() Outer-privilege level return is not implemented in emulator, move the unhandled logic into __load_segment_descriptor to make it easier to understand why the checks for RET are incomplete. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <5b7188e6388ac9f4567d14eab32db9adf3e00119.1644292363.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Hou Wenlong authored
A code segment descriptor can be loaded by jmp/call/ret, iret and int. The privilege checks differ between those instructions above real mode. Although the emulator has the x86_transfer_type enumeration to differentiate them, it is not really used in __load_segment_descriptor(). Note, far jump/call to a call gate, task gate or task state segment is not implemented in the emulator. As for far jump/call to a code segment, if DPL > CPL for conforming code or (RPL > CPL or DPL != CPL) for non-conforming code, it should trigger #GP. The current checks are ok. As for far return, if RPL < CPL or DPL > RPL for conforming code or DPL != RPL for non-conforming code, it should trigger #GP. Outer level return is not implemented above virtual-8086 mode in the emulator. So it implies that RPL <= CPL, but the current checks wouldn't trigger #GP if RPL < CPL. As for code segment loading in a task switch, if DPL > RPL for conforming code or DPL != RPL for non-conforming code, it should trigger #TS. Since the segment selector is loaded before the segment descriptor when loading state from the TSS, it implies that RPL = CPL, so the current checks are ok. The only problem in the current implementation is the missing RPL < CPL check for far return. However, changing the code to follow the manual is better. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <e01f5ea70fc1f18f23da1182acdbc5c97c0e5886.1644292363.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
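The three sets of privilege rules listed in this message can be written down as a small predicate. This is a sketch of the rules as stated, not the emulator's code: `enum xfer_kind` and `cs_load_faults()` are hypothetical names, the function only reports whether a fault is raised (#GP for jump/call and return, #TS for a task switch), and it does not model the unimplemented outer-privilege-level return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transfer kinds, loosely following the cases in the text. */
enum xfer_kind { XFER_CALL_JMP, XFER_RET, XFER_TASK_SWITCH };

/* Return true if loading the code segment should fault under the stated rules. */
static bool cs_load_faults(enum xfer_kind kind, unsigned int cpl,
			   unsigned int rpl, unsigned int dpl, bool conforming)
{
	switch (kind) {
	case XFER_CALL_JMP:
		/* Conforming: DPL > CPL.  Non-conforming: RPL > CPL or DPL != CPL. */
		return conforming ? dpl > cpl : (rpl > cpl || dpl != cpl);
	case XFER_RET:
		/* The check the message says was missing: RPL < CPL. */
		if (rpl < cpl)
			return true;
		return conforming ? dpl > rpl : dpl != rpl;
	case XFER_TASK_SWITCH:
		/* Same DPL/RPL comparison as far return, but raises #TS. */
		return conforming ? dpl > rpl : dpl != rpl;
	}
	return false;
}
```

For example, a far return executed at CPL 3 with a selector whose RPL is 0 faults under these rules even when DPL equals RPL, which is the case the message identifies as previously unchecked.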
-
Hou Wenlong authored
Per Intel's SDM on the "Instruction Set Reference", when loading a segment descriptor, the not-present segment check should come after all type and privilege checks. But the emulator checks it first, so #NP is triggered instead of #GP if the privilege check fails and the segment is not present. Put the not-present segment check after the type and privilege checks in __load_segment_descriptor(). Fixes: 38ba30ba (KVM: x86 emulator: Emulate task switch in emulator.c) Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <52573c01d369f506cadcf7233812427cf7db81a7.1644292363.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
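The effect of the check ordering can be illustrated with a tiny model. The names `enum seg_fault` and `load_desc_fault()` are hypothetical, and the type and privilege checks are collapsed into a single boolean for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for this sketch. */
enum seg_fault { SEG_FAULT_NONE, SEG_FAULT_GP, SEG_FAULT_NP };

/*
 * Type and privilege checks come first, so a descriptor that fails them
 * yields #GP even when it is also marked not-present.  #NP is reported
 * only for a descriptor that passed every earlier check.
 */
static enum seg_fault load_desc_fault(bool type_and_priv_ok, bool present)
{
	if (!type_and_priv_ok)
		return SEG_FAULT_GP;
	if (!present)
		return SEG_FAULT_NP;
	return SEG_FAULT_NONE;
}
```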
-
Sean Christopherson authored
The main thing that the selftest verifies is that KVM copies x2APIC's ICR[63:32] to/from ICR2 when userspace accesses the vAPIC page via KVM_{G,S}ET_LAPIC. KVM previously split x2APIC ICR to ICR+ICR2 at the time of write (from the guest), and so KVM must preserve that behavior for backwards compatibility between different versions of KVM. It will also test other invariants, e.g. that KVM clears the BUSY flag on ICR writes, that the reserved bits in ICR2 are dropped on writes from the guest, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code to go through proper helpers when modifying the vAPIC state. Keep the read helper visible to outsiders for now, refactoring KVM to hide it too is possible, it will just take more work to do so. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Emulate the x2APIC ICR as a single 64-bit register, as opposed to forking it across ICR and ICR2 as two 32-bit registers. This mirrors hardware behavior for Intel's upcoming IPI virtualization support, which does not split the access. Previous versions of Intel's SDM and AMD's APM don't explicitly state exactly how ICR is reflected in the vAPIC page for x2APIC, KVM just happened to speculate incorrectly. Handling the upcoming behavior is necessary in order to maintain backwards compatibility with KVM_{G,S}ET_LAPIC, e.g. failure to shuffle the 64-bit ICR to ICR+ICR2 and vice versa would break live migration if IPI virtualization support isn't symmetrical across the source and dest. Cc: Zeng Guang <guang.zeng@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
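The shuffle between a single 64-bit ICR and the ICR+ICR2 pair used for KVM_{G,S}ET_LAPIC compatibility can be sketched as below. `struct icr_pair` and both helpers are hypothetical names for illustration; the only behavior taken from the commit messages is that ICR[63:32] is copied to and from ICR2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the legacy two-register layout. */
struct icr_pair {
	uint32_t icr;	/* low 32 bits of the x2APIC ICR */
	uint32_t icr2;	/* ICR[63:32] */
};

/* Split a 64-bit x2APIC ICR value into the ICR + ICR2 layout. */
static struct icr_pair icr64_to_pair(uint64_t icr64)
{
	struct icr_pair p = {
		.icr  = (uint32_t)icr64,
		.icr2 = (uint32_t)(icr64 >> 32),
	};
	return p;
}

/* Rebuild the 64-bit value from the ICR + ICR2 layout. */
static uint64_t pair_to_icr64(struct icr_pair p)
{
	return ((uint64_t)p.icr2 << 32) | p.icr;
}
```

Because the two conversions are exact inverses, state saved by a KVM that stores the ICR one way can be restored by a KVM that stores it the other way, which is the live-migration concern the message raises.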
-
Sean Christopherson authored
Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the x2APIC and Hyper-V code needed to service reads/writes to ICR. Future support for IPI virtualization will add yet another path where KVM must handle 64-bit APIC MSR reads/write (to ICR). Opportunistically fix the comment in the write path; ICR2 holds the destination (if there's no shorthand), not the vector. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Make the low level read/write lapic helpers static, any accesses to the local APIC from vendor code or non-APIC code should be routed through proper helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
WARN if KVM emulates an IPI without clearing the BUSY flag, failure to do so could hang the guest if it waits for the IPI to be sent. Opportunistically use APIC_ICR_BUSY macro instead of open coding the magic number, and add a comment to clarify why kvm_recalculate_apic_map() is unconditionally invoked (it's really, really confusing for IPIs due to the existence of fast paths that don't trigger a potential recalc). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't bother rewriting the ICR value into the vAPIC page on an AVIC IPI virtualization failure, the access is a trap, i.e. the value has already been written to the vAPIC page. The one caveat is if hardware left the BUSY flag set (which appears to happen somewhat arbitrarily), in which case go through the "nodecode" APIC-write path in order to clear the BUSY flag. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use the common kvm_apic_write_nodecode() to handle AVIC/APIC-write traps instead of open coding the same exact code. This will allow making the low level lapic helpers inaccessible outside of lapic.c code. Opportunistically clean up the params to eliminate a bunch of svm=>vcpu reflection. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use the "raw" helper to read the vAPIC register after an APIC-write trap VM-Exit. Hardware is responsible for vetting the write, and the caller is responsible for sanitizing the offset. This is a functional change, as it means KVM will consume whatever happens to be in the vAPIC page if the write was dropped by hardware. But, unless userspace deliberately wrote garbage into the vAPIC page via KVM_SET_LAPIC, the value should be zero since it's not writable by the guest. This aligns common x86 with SVM's AVIC logic, i.e. paves the way for using the nodecode path to handle APIC-write traps when AVIC is enabled. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-