- 07 May, 2024 13 commits
-
-
Sean Christopherson authored
Move the checks related to the validity of an access to a memslot from the inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This allows emulating accesses to the APIC access page, which don't need to resolve a pfn, even if there is a relevant in-progress mmu_notifier invalidation. Ditto for accesses to KVM internal memslots from L2, which KVM also treats as emulated MMIO. More importantly, this will allow for future cleanup by having the "no memslot" case bail from kvm_faultin_pfn() very early on. Go to rather extreme and gross lengths to make the change a glorified nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the related code is very subtle. E.g. fault->slot can be nullified if it points at the APIC access page, some flows in KVM x86 expect fault->pfn to be KVM_PFN_NOSLOT, while others check only fault->slot, etc. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly detect and disallow private accesses to emulated MMIO in kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private() to perform the check. This will allow the page fault path to go straight to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Allow mapping KVM's internal memslots used for EPT without unrestricted guest into L2, i.e. allow mapping the hidden TSS and the identity mapped page tables into L2. Unlike the APIC access page, there is no correctness issue with letting L2 access the "hidden" memory. Allowing these memslots to be mapped into L2 fixes a largely theoretical bug where KVM could incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures consistent KVM behavior for L2. If KVM is using TDP, but L1 is using shadow paging for L2, then routing through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode ensure KVM uses different roots for L1 vs. L2. But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to incorrectly treat an L1 access to the hidden TSS or identity mapped page tables as MMIO. Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't actually prevent exposing KVM's internal memslots to L2, it simply forces KVM to emulate the access. In most cases, that will trigger MMIO, amusingly due to filling vcpu->arch.mmio_gfn, but also because vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC accesses are ok. But the hidden TSS and identity mapped page tables could go either way (MMIO or access the private memslot's backing memory). Alternatively, the inconsistent emulator behavior could be addressed by forcing MMIO emulation for L2 access to all internal memslots, not just to the APIC. But that's arguably less correct than letting L2 access the hidden TSS and identity mapped page tables, not to mention that it's *extremely* unlikely anyone cares what KVM does in this case. From L1's perspective there is R/W memory at those memslots, the memory just happens to be initialized with non-zero data. Making the memory disappear when it is accessed by L2 is far more magical and arbitrary than the memory existing in the first place. The APIC access page is special because KVM _must_ emulate the access to do the right thing (emulate an APIC access instead of reading/writing the APIC access page). And despite what commit 3a2936de ("kvm: mmu: Don't expose private memslots to L2") said, it's not just necessary when L1 is accelerating L2's virtual APIC, it's just as important (likely *more* imporant for correctness when L1 is passing through its own APIC to L2. Fixes: 3a2936de ("kvm: mmu: Don't expose private memslots to L2") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Prioritize private vs. shared gfn attribute checks above slot validity checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to userspace if there is no memslot, but emulate accesses to the APIC access page even if the attributes mismatch. Fixes: 8dd2eee9 ("KVM: x86/mmu: Handle page fault for private memory") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
WARN and skip the emulated MMIO fastpath if a private, reserved page fault is encountered, as private+reserved should be an impossible combination (KVM should never create an MMIO SPTE for a private access). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Right now the error code is not used when an async page fault is completed. This is not a problem in the current code, but it is untidy. For protected VMs, we will also need to check that the page attributes match the current state of the page, because asynchronous page faults can only occur on shared pages (private pages go through kvm_faultin_pfn_private() instead of __gfn_to_pfn_memslot()). Start by piping the error code from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the architecture-specific async page fault data. For now, it can be used to assert that there are no async page faults on private memory. Extracted from a patch by Isaku Yamahata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add and use a synthetic, KVM-defined page fault error code to indicate whether a fault is to private vs. shared memory. TDX and SNP have different mechanisms for reporting private vs. shared, and KVM's software-protected VMs have no mechanism at all. Usurp an error code flag to avoid having to plumb another parameter to kvm_mmu_page_fault() and friends. Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it for TDX and software-protected VMs as appropriate, but that would require *clearing* the flag for SEV and SEV-ES VMs, which support encrypted memory at the hardware layer, but don't utilize private memory at the KVM layer. Opportunistically add a comment to call out that the logic for software- protected VMs is (and was before this commit) broken for nested MMUs, i.e. for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with nested MMUs as there is a _lot_ of functionality that simply doesn't work for software-protected VMs, e.g. all of the paths where KVM accesses guest memory need to be updated to be aware of private vs. shared memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240228024147.41573-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF, as the error code for #PF is limited to 32 bits (and in practice, 16 bits on Intel CPUS). This behavior is architectural, is part of KVM's ABI (see kvm_vcpu_events.error_code), and is explicitly documented as being preserved for intecerpted #PF in both the APM: The error code saved in EXITINFO1 is the same as would be pushed onto the stack by a non-intercepted #PF exception in protected mode. and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a 32-bit field. Simply drop the upper bits if hardware provides garbage, as spurious information should do no harm (though in all likelihood hardware is buggy and the kernel is doomed). Handling all upper 32 bits in the #PF path will allow moving the sanity check on synthetic checks from kvm_mmu_page_fault() to npf_interception(), which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's PFERR_GUEST_ENC_MASK without running afoul of the sanity check. Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs) and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest bit minimizes the probability of a collision with the "other" vendor, without needing to plumb more bits through microcode. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Isaku Yamahata authored
Plumb the full 64-bit error code throughout the page fault handling code so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK will be used to determine whether or not a fault is private vs. shared. Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change the behavior of permission_fault() when invoked in the page fault path, as KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault(). Continue passing '0' from the async #PF worker, as guest_memfd and thus private memory doesn't support async page faults. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [mdr: drop references/changes on rebase, update commit message] Signed-off-by: Michael Roth <michael.roth@amd.com> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240228024147.41573-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the sanity check that hardware never sets bits that collide with KVM- define synthetic bits from kvm_mmu_page_fault() to npf_interception(), i.e. make the sanity check #NPF specific. The legacy #PF path already WARNs if _any_ of bits 63:32 are set, and the error code that comes from VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's EXIT_QUALIFICATION into error code flags). Add a compile-time assert in the legacy #PF handler to make sure that KVM- define flags are covered by its existing sanity check on the upper bits. Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we are removing the comment that defined it. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240228024147.41573-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP) guests, as specified by the APM: * Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a VMPL check failure, 0 otherwise. * Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise. * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between PVALIDATE or RMPADJUST and the RMP, 0 otherwise. * Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission check failure, 0 otherwise. Note, the APM is *extremely* misleading, and strongly implies that the above flags can _only_ be set for #NPF exits from SNP guests. That is a lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_ flavor of SEV guest on SNP capable hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240228024147.41573-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault triggers emulation of any kind, as KVM doesn't currently support emulating access to guest private memory. Practically speaking, private faults and emulation are already mutually exclusive, but there are many flow that can result in KVM returning RET_PF_EMULATE, and adding one last check to harden against weird, unexpected combinations and/or KVM bugs is inexpensive. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 Apr, 2024 1 commit
-
-
Paolo Bonzini authored
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 12 Apr, 2024 7 commits
-
-
Paolo Bonzini authored
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. The first source of variability that was encountered is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. This series adds all the APIs that are needed to customize the features, with room for future enhancements: - a new /dev/kvm device attribute to retrieve the set of supported features (right now, only debug swap) - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct, replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT It then puts the new op to work by including the VMSA features as a field of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility; but I am considering also making them use zero as the feature mask, and will gladly adjust the patches if so requested. In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP to reuse the KVM_SEV_INIT2 ioctl. And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all, SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT, reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor once the VMSA has been encrypted... which is how the API should have always behaved. Second, they also synchronize the FPU and AVX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The .change_pte() MMU notifier callback was intended as an optimization and for this reason it was initially called without a surrounding mmu_notifier_invalidate_range_{start,end}() pair. It was only ever implemented by KVM (which was also the original user of MMU notifiers) and the rules on when to call set_pte_at_notify() rather than set_pte_at() have always been pretty obscure. It may seem a miracle that it has never caused any hard to trigger bugs, but there's a good reason for that: KVM's implementation has been nonfunctional for a good part of its existence. Already in 2012, commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) changed the .change_pte() callback to occur within an invalidate_range_start/end() pair; and because KVM unmaps the sPTEs during .invalidate_range_start(), .change_pte() has no hope of finding a sPTE to change. Therefore, all the code for .change_pte() can be removed from both KVM and mm/, and set_pte_at_notify() can be replaced with just set_pte_at(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 11 Apr, 2024 19 commits
-
-
Paolo Bonzini authored
The scope of set_pte_at_notify() has reduced more and more through the years. Initially, it was meant for when the change to the PTE was not bracketed by mmu_notifier_invalidate_range_{start,end}(). However, that has not been so for over ten years. During all this period the only implementation of .change_pte() was KVM and it had no actual functionality, because it was called after mmu_notifier_invalidate_range_start() zapped the secondary PTE. Now that this (nonfunctional) user of the .change_pte() callback is gone, the whole callback can be removed. For now, leave in place set_pte_at_notify() even though it is just a synonym for set_pte_at(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The only user was kvm_mmu_notifier_change_pte(), which is now gone. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Allow the caller to set the initial state of the VM. Doing this before sev_vm_launch() matters for SEV-ES, since that is the place where the VMSA is updated and after which the guest state becomes sealed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-17-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This removes the concept of "subtypes", instead letting the tests use proper VM types that were recently added. While the sev_init_vm() and sev_es_init_vm() are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT ioctls, this is limited to VMs that are created manually with vm_create_barebones(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-16-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. Its status is record in VMSA, and therefore attestation signatures depend on whether it is enabled or not. In order to avoid invalidating the signatures depending on the host machine, it was disabled by default (see commit 5abf6dce, "SEV: disable SEV-ES DebugSwap by default", 2024-03-09). However, we now have a new API to create SEV VMs that allows enabling DebugSwap based on what the user tells KVM to do, and we also changed the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore possible to re-enable the feature without breaking compatibility with kernels that pre-date the introduction of DebugSwap, so go ahead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA. Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark FPU contents as confidential after it has been copied and encrypted into the VMSA. Since the XSAVE state for AVX is the first, it does not need the compacted-state handling of get_xsave_addr(). However, there are other parts of XSAVE state in the VMSA that currently are not handled, and the validation logic of get_xsave_addr() is pointless to duplicate in KVM, so move get_xsave_addr() to public FPU API; it is really just a facility to operate on XSAVE state and does not expose any internal details of arch/x86/kernel/fpu. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES), and also allows the vendor module to specify which VM types are supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Right now, the set of features that are stored in the VMSA upon initialization is fixed and depends on the module parameters for kvm-amd.ko. However, the hypervisor cannot really change it at will because the feature word has to match between the hypervisor and whatever computes a measurement of the VMSA for attestation purposes. Add a field to kvm_sev_info that holds the set of features to be stored in the VMSA; and query it instead of referring to the module parameters. Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this does not yet introduce any functional change, but it paves the way for an API that allows customization of the features per-VM. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-6-pbonzini@redhat.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Compute the set of features to be stored in the VMSA when KVM is initialized; move it from there into kvm_sev_info when SEV is initialized, and then into the initial VMSA. The new variable can then be used to return the set of supported features to userspace, via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
There is no danger to the kernel if 32-bit userspace provides a 64-bit value that has the high bits set, but for whatever reason happens to resolve to an address that has something mapped there. KVM uses the checked version of get_user() and put_user(), so any faults are caught properly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers is quite confusing. To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks and the #VMGEXIT handler. Stubs are also restricted to functions that check sev_enabled and to the destruction functions sev_free_cpu() and sev_vm_destroy(), where the style of their callers is to leave checks to the callers. Most call sites instead rely on dead code elimination to take care of functions that are guarded with sev_guest() or sev_es_guest(). Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-