- 17 Jun, 2021 38 commits
-
-
Vitaly Kuznetsov authored
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common 'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210609150911.1471882c-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sergey Senozhatsky authored
Implement PM hibernation/suspend prepare notifiers so that KVM can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly suspend VMs. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Message-Id: <20210606021045.14159-2-senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sergey Senozhatsky authored
Add KVM PM-notifier so that architectures can have arch-specific VM suspend/resume routines. Such architectures need to select CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier(). Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20210606021045.14159-1-senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
Standardize reads and writes of the x2APIC MSRs. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210604172611.281819-11-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
Move the APIC functions into the library to encourage code reuse and to avoid unintended deviations. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210604172611.281819-10-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
Processor.h is a hodgepodge of definitions. Though the local APIC is technically built into the CPU these days, move the APIC definitions into a new header file: apic.h. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210604172611.281819-9-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
Don't allow posted interrupts to modify a stale posted interrupt descriptor (including the initial value of 0). Empirical tests on real hardware reveal that a posted interrupt descriptor referencing an unbacked address has PCI bus error semantics (reads as all 1's; writes are ignored). However, kvm can't distinguish unbacked addresses from device-backed (MMIO) addresses, so it should really ask userspace for an MMIO completion. That's overly complicated, so just punt with KVM_INTERNAL_ERROR. Don't return the error until the posted interrupt descriptor is actually accessed. We don't want to break the existing kvm-unit-tests that assume they can launch an L2 VM with a posted interrupt descriptor that references MMIO space in L1. Fixes: 6beb7bd5 ("kvm: nVMX: Refactor nested_get_vmcs12_pages()") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-8-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
When the kernel has no mapping for the vmcs02 virtual APIC page, userspace MMIO completion is necessary to process nested posted interrupts. This is not a configuration that KVM supports. Rather than silently ignoring the problem, try to exit to userspace with KVM_INTERNAL_ERROR. Note that the event that triggers this error is consumed as a side-effect of a call to kvm_check_nested_events. On some paths (notably through kvm_vcpu_check_block), the error is dropped. In any case, this is an incremental improvement over always ignoring the error. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-7-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
No functional change intended. At present, the only negative value returned by kvm_check_nested_events is -EBUSY. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-6-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
No functional change intended. At present, 'r' will always be -EBUSY on a control transfer to the 'out' label. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-5-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
No functional change intended. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210604172611.281819-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
A survey of the callsites reveals that they all ensure the vCPU is in guest mode before calling kvm_check_nested_events. Remove this dead code so that the only negative value this function returns (at the moment) is -EBUSY. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
Test that nested TSC scaling works as expected with both L1 and L2 scaled. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-12-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
Calculate the TSC offset and multiplier on nested transitions and expose the TSC scaling feature to L1. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-11-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the VMCS every time the VMCS is loaded. Instead of doing this, set this field from common code on initialization and whenever the scaling ratio changes. Additionally remove vmx->current_tsc_ratio. This field is redundant as vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling ratio. The vmx->current_tsc_ratio field is only used for avoiding unnecessary writes but it is no longer needed after removing the code from the VMCS load path. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Message-Id: <20210607105438.16541-1-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
The write_l1_tsc_offset() callback has a misleading name. It does not set L1's TSC offset, it rather updates the current TSC offset which might be different if a nested guest is executing. Additionally, both the vmx and svm implementations use the same logic for calculating the current TSC before writing it to hardware. Rename the function and move the common logic to the caller. The vmx/svm specific code now merely sets the given offset to the corresponding hardware structure. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-9-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
When L2 is entered we need to "merge" the TSC multiplier and TSC offset values of 01 and 12 together. The merging is done using the following equations: offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12 mult_02 = (mult_01 * mult_12) >> shift_bits Where shift_bits is kvm_tsc_scaling_ratio_frac_bits. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-8-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
In order to implement as much of the nested TSC scaling logic as possible in common code, we need these vendor callbacks for retrieving the TSC offset and the TSC multiplier that L1 has set for L2. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-7-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
This is required for supporting nested TSC scaling. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-6-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and other times (like when reading the TSC from user space) it needs to use L1's scaling ratio. Have the caller specify this by passing the ratio as a parameter. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-5-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
All existing code uses kvm_compute_tsc_offset() passing L1 TSC values to it. Let's document this by renaming it to kvm_compute_l1_tsc_offset(). Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-4-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do for L1's TSC offset. This allows for easy save/restore when we enter and then exit the nested guest. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-3-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ilias Stamatis authored
This function is needed for KVM's nested virtualization. The nested TSC scaling implementation requires multiplying the signed TSC offset with the unsigned TSC multiplier. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-2-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
If the TDP MMU is in use, wait to allocate the rmaps until the shadow MMU is actually used. (i.e. a nested VM is launched.) This saves memory equal to 0.2% of guest memory in cases where the TDP MMU is used and there are no nested guests involved. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-8-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
If only the TDP MMU is being used to manage the memory mappings for a VM, then many rmap operations can be skipped as they are guaranteed to be no-ops. This saves some time which would be spent on the rmap operation. It also avoids acquiring the MMU lock in write mode for many operations. This makes it safe to run the VM without rmaps allocated, when only using the TDP MMU and sets the stage for waiting to allocate the rmaps until they're needed. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Add a field to control whether new memslots should have rmaps allocated for them. As of this change, it's not safe to skip allocating rmaps, so the field is always set to allocate rmaps. Future changes will make it safe to operate without rmaps, using the TDP MMU. Then further changes will allow the rmaps to be allocated lazily when needed for nested oprtation. No functional change expected. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Add a new lock to protect the arch-specific fields of memslots if they need to be modified in a kvm->srcu read critical section. A future commit will use this lock to lazily allocate memslot rmaps for x86. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-5-bgardon@google.com> [Add Documentation/ hunk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Factor out copying kvm_memslots from allocating the memory for new ones in preparation for adding a new lock to protect the arch-specific fields of the memslots. No functional change intended. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-4-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Small refactor to facilitate allocating rmaps for all memslots at once. No functional change expected. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-3-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Small code deduplication. No functional change expected. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-2-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Keqian Zhu authored
Currently, when dirty logging is started in initially-all-set mode, we write protect huge pages to prepare for splitting them into 4K pages, and leave normal pages untouched as the logging will be enabled lazily as dirty bits are cleared. However, enabling dirty logging lazily is also feasible for huge pages. This not only reduces the time of start dirty logging, but it also greatly reduces side-effect on guest when there is high dirty rate. Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com> Message-Id: <20210429034115.35560-3-zhukeqian1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Keqian Zhu authored
Prepare for write protecting large page lazily during dirty log tracking, for which we will only need to write protect gfns at large page granularity. No functional or performance change expected. Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com> Message-Id: <20210429034115.35560-2-zhukeqian1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Siddharth Chandrasekaran authored
Now that kvm_hv_flush_tlb() has been patched to support XMM hypercall inputs, we can start advertising this feature to guests. Cc: Alexander Graf <graf@amazon.com> Cc: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <e63fc1c61dd2efecbefef239f4f0a598bd552750.1622019134.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Siddharth Chandrasekaran authored
Hyper-V supports the use of XMM registers to perform fast hypercalls. This allows guests to take advantage of the improved performance of the fast hypercall interface even though a hypercall may require more than (the current maximum of) two input registers. The XMM fast hypercall interface uses six additional XMM registers (XMM0 to XMM5) to allow the guest to pass an input parameter block of up to 112 bytes. Add framework to read from XMM registers in kvm_hv_hypercall() and use the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb() when possible. Cc: Alexander Graf <graf@amazon.com> Co-developed-by: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <fc62edad33f1920fe5c74dde47d7d0b4275a9012.1622019134.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Siddharth Chandrasekaran authored
As of now there are 7 parameters (and flags) that are used in various hyper-v hypercall handlers. There are 6 more input/output parameters passed from XMM registers which are to be added in an upcoming patch. To make passing arguments to the handlers more readable, capture all these parameters into a single structure. Cc: Alexander Graf <graf@amazon.com> Cc: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <273f7ed510a1f6ba177e61b73a5c7bfbee4a4a87.1622019133.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Siddharth Chandrasekaran authored
Hyper-v XMM fast hypercalls use XMM registers to pass input/output parameters. To access these, hyperv.c can reuse some FPU register accessors defined in emulator.c. Move them to a common location so both can access them. While at it, reorder the parameters of these accessor methods to make them more readable. Cc: Alexander Graf <graf@amazon.com> Cc: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <01a85a6560714d4d3637d3d86e5eba65073318fa.1622019133.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Shaokun Zhang authored
Function 'is_nx_huge_page_enabled' is called only by kvm/mmu, so make it as inline fucntion and remove the unnecessary declaration. Cc: Ben Gardon <bgardon@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Message-Id: <1622102271-63107-1-git-send-email-zhangshaokun@hisilicon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Fuad Tabba authored
KVM_CHECK_EXTENSION ioctl can return any negative value on error, and not necessarily -1. Change the assertion to reflect that. Signed-off-by: Fuad Tabba <tabba@google.com> Message-Id: <20210615150443.1183365-1-tabba@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 11 Jun, 2021 2 commits
-
-
Sean Christopherson authored
Calculate and check the full mmu_role when initializing the MMU context for the nested MMU, where "full" means the bits and pieces of the role that aren't handled by kvm_calc_mmu_role_common(). While the nested MMU isn't used for shadow paging, things like the number of levels in the guest's page tables are surprisingly important when walking the guest page tables. Failure to reinitialize the nested MMU context if L2's paging mode changes can result in unexpected and/or missed page faults, and likely other explosions. E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the "common" role calculation will yield the same role for both L2s. If the 64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize the nested MMU context, ultimately resulting in a bad walk of L2's page tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL. WARNING: CPU: 4 PID: 167334 at arch/x86/kvm/vmx/vmx.c:3075 ept_save_pdptrs+0x15/0xe0 [kvm_intel] Modules linked in: kvm_intel] CPU: 4 PID: 167334 Comm: CPU 3/KVM Not tainted 5.13.0-rc1-d849817d5673-reqs #185 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:ept_save_pdptrs+0x15/0xe0 [kvm_intel] Code: <0f> 0b c3 f6 87 d8 02 00f RSP: 0018:ffffbba702dbba00 EFLAGS: 00010202 RAX: 0000000000000011 RBX: 0000000000000002 RCX: ffffffff810a2c08 RDX: ffff91d7bc30acc0 RSI: 0000000000000011 RDI: ffff91d7bc30a600 RBP: ffff91d7bc30a600 R08: 0000000000000010 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: ffff91d7bc30a600 R13: ffff91d7bc30acc0 R14: ffff91d67c123460 R15: 0000000115d7e005 FS: 00007fe8e9ffb700(0000) GS:ffff91d90fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000029f15a001 CR4: 00000000001726e0 Call Trace: kvm_pdptr_read+0x3a/0x40 [kvm] paging64_walk_addr_generic+0x327/0x6a0 [kvm] paging64_gva_to_gpa_nested+0x3f/0xb0 [kvm] kvm_fetch_guest_virt+0x4c/0xb0 [kvm] __do_insn_fetch_bytes+0x11a/0x1f0 [kvm] x86_decode_insn+0x787/0x1490 [kvm] x86_decode_emulated_instruction+0x58/0x1e0 [kvm] x86_emulate_instruction+0x122/0x4f0 [kvm] vmx_handle_exit+0x120/0x660 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xe25/0x1cb0 [kvm] kvm_vcpu_ioctl+0x211/0x5a0 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x40/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Fixes: bf627a92 ("x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210610220026.1364486-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Commit c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context) tries to allocate per-vCPU emulation context dynamically, however, the x86_emulator slab cache is still exiting after the kvm module is unload as below after destroying the VM and unloading the kvm module. grep x86_emulator /proc/slabinfo x86_emulator 36 36 2672 12 8 : tunables 0 0 0 : slabdata 3 3 0 This patch fixes this slab cache leak by destroying the x86_emulator slab cache when the kvm module is unloaded. Fixes: c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context) Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-