- 12 Nov, 2021 1 commit
-
-
Paolo Bonzini authored
When UBSAN is enabled, the code emitted for the call to guest_pv_has includes a call to __ubsan_handle_load_invalid_value. objtool complains that this call happens with UACCESS enabled; to avoid the warning, pull the calls to user_access_begin into both arms of the "if" statement, after the check for guest_pv_has. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 11 Nov, 2021 38 commits
-
-
Paolo Bonzini authored
* Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status * Fix selftests on APICv machines * Fix sparse warnings * Fix detection of KVM features in CPUID * Cleanups for bogus writes to MSR_KVM_PV_EOI_EN * Fixes and cleanups for MSR bitmap handling * Cleanups for INVPCID * Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
-
Paolo Bonzini authored
Add support for AMD SEV and SEV-ES intra-host migration support. Intra host migration provides a low-cost mechanism for userspace VMM upgrades. In the common case for intra host migration, we can rely on the normal ioctls for passing data from one VMM to the next. SEV, SEV-ES, and other confidential compute environments make most of this information opaque, and render KVM ioctls such as "KVM_GET_REGS" irrelevant. As a result, we need the ability to pass this opaque metadata from one VMM to the next. The easiest way to do this is to leave this data in the kernel, and transfer ownership of the metadata from one KVM VM (or vCPU) to the next. In-kernel hand off makes it possible to move any data that would be unsafe/impossible for the kernel to hand directly to userspace, and cannot be reproduced using data that can be handed to userspace. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports either num_online_cpus() or num_present_cpus(), s390 has multiple constants depending on hardware features. On x86, KVM reports an arbitrary value of '710' which is supposed to be the maximum tested value but it's possible to test all KVM_MAX_VCPUS even when there are less physical CPUs available. Drop the arbitrary '710' value and return num_online_cpus() on x86 as well. The recommendation will match other architectures and will mean 'no CPU overcommit'. For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning when the requested vCPU number exceeds it. The static limit of '710' is quite weird as smaller systems with just a few physical CPUs should certainly "recommend" less. Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211111134733.86601-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vipin Sharma authored
Handle #GP on INVPCID due to an invalid type in the common switch statement instead of relying on the callers (VMX and SVM) to manually validate the type. Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check the type before reading the operand from memory, so deferring the type validity check until after that point is architecturally allowed. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-3-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vipin Sharma authored
handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2 field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that holds the invalidation type. Add a helper to retrieve reg2 from VMX instruction info to consolidate and document the shift+mask magic. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-2-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Clean up the x2APIC MSR bitmap intereption code for L2, which is the last holdout of open coded bitmap manipulations. Freshen up the SDM/PRM comment, rename the function to make it abundantly clear the funky behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored (the previous comment was flat out wrong for x2APIC behavior). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add builder macros to generate the MSR bitmap helpers to reduce the amount of copy-paste code, especially with respect to all the magic numbers needed to calc the correct bit location. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2, and always update the relevant bits in vmcs02. This fixes two distinct, but intertwined bugs related to dynamic MSR bitmap modifications. The first issue is that KVM fails to enable MSR interception in vmcs02 for the FS/GS base MSRs if L1 first runs L2 with interception disabled, and later enables interception. The second issue is that KVM fails to honor userspace MSR filtering when preparing vmcs02. Fix both issues simultaneous as fixing only one of the issues (doesn't matter which) would create a mess that no one should have to bisect. Fixing only the first bug would exacerbate the MSR filtering issue as userspace would see inconsistent behavior depending on the whims of L1. Fixing only the second bug (MSR filtering) effectively requires fixing the first, as the nVMX code only knows how to transition vmcs02's bitmap from 1->0. Move the various accessor/mutators that are currently buried in vmx.c into vmx.h so that they can be shared by the nested code. Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering") Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible") Cc: stable@vger.kernel.org Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Check the current VMCS controls to determine if an MSR write will be intercepted due to MSR bitmaps being disabled. In the nested VMX case, KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or if KVM can't map L1's bitmaps for whatever reason. Note, the bad behavior is relatively benign in the current code base as KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and only if L0 KVM also disables interception of an MSR, and only uses the buggy helper for MSR_IA32_SPEC_CTRL. Because KVM explicitly tests WRMSR before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it isn't strictly necessary. Tag the fix for stable in case a future fix wants to use msr_write_intercepted(), in which case a buggy implementation in older kernels could prove subtly problematic. Fixes: d28b387f ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
KVM: x86: Don't update vcpu->arch.pv_eoi.msr_val when a bogus value was written to MSR_KVM_PV_EOI_EN When kvm_gfn_to_hva_cache_init() call from kvm_lapic_set_pv_eoi() fails, MSR write to MSR_KVM_PV_EOI_EN results in #GP so it is reasonable to expect that the value we keep internally in KVM wasn't updated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
kvm_lapic_enable_pv_eoi() is a misnomer as the function is also used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paul Durrant authored
Currently when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introdues as new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves and function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is intoduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Message-Id: <20211105095101.5384-3-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the core logic of SET_CPUID and SET_CPUID2 to a common helper, the only difference between the two ioctls() is the format of the userspace struct. A future fix will add yet more code to the core logic. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105095101.5384-2-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Junaid Shahid authored
The fast page fault path bails out on write faults to huge pages in order to accommodate dirty logging. This change adds a check to do that only when dirty logging is actually enabled, so that access tracking for huge pages can still use the fast path for write faults in the common case. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211104003359.2201967-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Wrap the read of iter->sptep in tdp_mmu_map_handle_target_level() with rcu_dereference(). Shadow pages in the TDP MMU, and thus their SPTEs, are protected by rcu. This fixes a Sparse warning at tdp_mmu.c:900:51: warning: incorrect type in argument 1 (different address spaces) expected unsigned long long [usertype] *sptep got unsigned long long [noderef] [usertype] __rcu *[usertype] sptep Fixes: 7158bee4 ("KVM: MMU: pass kvm_mmu_page struct to make_spte") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211103161833.3769487-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using standard kvm's inject_pending_event, and not via APICv/AVIC. Since this is a debug feature, just inhibit APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU. Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jim Mattson authored
These function names sound like predicates, and they have siblings, *is_valid_msr(), which _are_ predicates. Moreover, there are comments that essentially warn that these functions behave unexpectedly. Flip the polarity of the return values, so that they become predicates, and convert the boolean result to a success/failure code at the outer call site. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105202058.1048757-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems. Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set. Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change. In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty. Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial. Cc: stable@vger.kernel.org Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org> [I didn't entirely agree with David's assessment of the usefulness of the gfn_to_pfn cache, and integrated the outcome of the discussion in the above commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
Adds testcases for intra host migration for SEV and SEV-ES. Also adds locking test to confirm no deadlock exists. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-6-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
Refactors out open path support from open_kvm_dev_path_or_exit() and adds new helper for SEV device path. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-5-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
For SEV-ES to work with intra host migration the VMSAs, GHCB metadata, and other SEV-ES info needs to be preserved along with the guest's memory. Signed-off-by: Peter Gonda <pgonda@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-4-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
For SEV to work with intra host migration, contents of the SEV info struct such as the ASID (used to index the encryption key in the AMD SP) and the list of memory regions need to be transferred to the target VM. This change adds a commands for a target VMM to get a source SEV VM's sev info. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-3-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Avoid code duplication across all callers of misc_cg_try_charge and misc_cg_uncharge. The resource type for KVM is always derived from sev->es_active, and the quantity is always 1. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases where it is by design that the VM cannot be operated upon. In this case any KVM_BUG_ON should still warn, so introduce a new flag kvm->vm_dead that is separate from kvm->vm_bugged. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
Move SEV-ES vCPU metadata into new sev_es_state struct from vcpu_svm. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-2-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Add guest api and guest kernel support for SEV live migration. Introduces a new hypercall to notify the host of changes to the page encryption status. If the page is encrypted then it must be migrated through the SEV firmware or a helper VM sharing the key. If page is not encrypted then it can be migrated normally by userspace. This new hypercall is invoked using paravirt_ops. Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
-
Ashish Kalra authored
Reset the host's shared pages list related to kernel specific page encryption status settings before we load a new kernel by kexec. We cannot reset the complete shared pages list here as we need to retain the UEFI/OVMF firmware specific settings. The host's shared pages list is maintained for the guest to keep track of all unencrypted guest memory regions, therefore we need to explicitly mark all shared pages as encrypted again before rebooting into the new guest kernel. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Message-Id: <3e051424ab839ea470f88333273d7a185006754f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ashish Kalra authored
The guest support for detecting and enabling SEV Live migration feature uses the following logic : - kvm_init_plaform() checks if its booted under the EFI - If not EFI, i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl() to enable the SEV live migration support - If EFI, i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read the UEFI variable which indicates OVMF support for live migration ii) the variable indicates live migration is supported, issue a wrmsrl() to enable the SEV live migration support The EFI live migration check is done using a late_initcall() callback. Also, ensure that _bss_decrypted section is marked as decrypted in the hypervisor's guest page encryption status tracking. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ashish Kalra authored
Introduce a new AMD Memory Encryption GUID which is currently used for defining a new UEFI environment variable which indicates UEFI/OVMF support for the SEV live migration feature. This variable is setup when UEFI/OVMF detects host/hypervisor support for SEV live migration and later this variable is read by the kernel using EFI runtime services to verify if OVMF supports the live migration feature. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Message-Id: <1cea22976d2208f34d47e0c1ce0ecac816c13111.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Brijesh Singh authored
Invoke a hypercall when a memory region is changed from encrypted -> decrypted and vice versa. Hypervisor needs to know the page encryption status during the guest migration. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Borislav Petkov <bp@suse.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford <srutherford@google.com> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Brijesh Singh authored
KVM hypercall framework relies on alternative framework to patch the VMCALL -> VMMCALL on AMD platform. If a hypercall is made before apply_alternative() is called then it defaults to VMCALL. The approach works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor will be able to decode the instruction and do the right things. But when SEV is active, guest memory is encrypted with guest key and hypervisor will not be able to decode the instruction bytes. To highlight the need to provide this interface, capturing the flow of apply_alternatives() : setup_arch() call init_hypervisor_platform() which detects the hypervisor platform the kernel is running under and then the hypervisor specific initialization code can make early hypercalls. For example, KVM specific initialization in case of SEV will try to mark the "__bss_decrypted" section's encryption state via early page encryption status hypercalls. Now, apply_alternatives() is called much later when setup_arch() calls check_bugs(), so we do need some kind of an early, pre-alternatives hypercall interface. Other cases of pre-alternatives hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for SEV generally. Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall will be used by the SEV guest to notify encrypted pages to the hypervisor. This kvm_sev_hypercall3() function is abstracted and used as follows : All these early hypercalls are made through early_set_memory_XX() interfaces, which in turn invoke pv_ops (paravirt_ops). This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed() is a generic interface and can easily have SEV, TDX and any other future platform specific abstractions added to it. Currently, pv_ops.mmu.notify_page_enc_status_changed() callback is setup to invoke kvm_sev_hypercall3() in case of SEV. Similarly, in case of TDX, pv_ops.mmu.notify_page_enc_status_changed() can be setup to a TDX specific callback. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Borislav Petkov <bp@suse.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford <srutherford@google.com> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds authored
Pull ext4 updates from Ted Ts'o: "Only bug fixes and cleanups for ext4 this merge window. Of note are fixes for the combination of the inline_data and fast_commit fixes, and more accurately calculating when to schedule additional lazy inode table init, especially when CONFIG_HZ is 100HZ" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix error code saved on super block during file system abort ext4: inline data inode fast commit replay fixes ext4: commit inline data during fast commit ext4: scope ret locally in ext4_try_to_trim_range() ext4: remove an unused variable warning with CONFIG_QUOTA=n ext4: fix boolreturn.cocci warnings in fs/ext4/name.c ext4: prevent getting empty inode buffer ext4: move ext4_fill_raw_inode() related functions ext4: factor out ext4_fill_raw_inode() ext4: prevent partial update of the extent blocks ext4: check for inconsistent extents between index and leaf block ext4: check for out-of-order index extents in ext4_valid_extent_entries() ext4: convert from atomic_t to refcount_t on ext4_io_end->count ext4: refresh the ext4_ext_path struct after dropping i_data_sem. ext4: ensure enough credits in ext4_ext_shift_path_extents ext4: correct the left/middle/right debug message for binsearch ext4: fix lazy initialization next schedule time computation in more granular unit Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
-
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds authored
Pull btrfs fix from David Sterba: "Fix for a deadlock when direct/buffered IO is done on a mmaped file and a fault happens (details in the patch). There's a fstest generic/647 that triggers the problem and makes testing hard" * tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix deadlock due to page faults during direct IO reads and writes
-
git://linux-nfs.org/~bfields/linuxLinus Torvalds authored
Pull nfsd updates from Bruce Fields: "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping support for a filehandle format deprecated 20 years ago, and further xdr-related cleanup from Chuck" * tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits) nfsd4: remove obselete comment nfsd: document server-to-server-copy parameters NFSD:fix boolreturn.cocci warning nfsd: update create verifier comment SUNRPC: Change return value type of .pc_encode SUNRPC: Replace the "__be32 *p" parameter to .pc_encode NFSD: Save location of NFSv4 COMPOUND status SUNRPC: Change return value type of .pc_decode SUNRPC: Replace the "__be32 *p" parameter to .pc_decode SUNRPC: De-duplicate .pc_release() call sites SUNRPC: Simplify the SVC dispatch code path SUNRPC: Capture value of xdr_buf::page_base SUNRPC: Add trace event when alloc_pages_bulk() makes no progress svcrdma: Split svcrmda_wc_{read,write} tracepoints svcrdma: Split the svcrdma_wc_send() tracepoint svcrdma: Split the svcrdma_wc_receive() tracepoint NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases NFSD: Initialize pointer ni with NULL and not plain integer 0 NFSD: simplify struct nfsfh ...
-
git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds authored
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN - Optimisations for READDIR and the 'ls -l' style workload - Further replacements of dprintk() with tracepoints and other tracing improvements - Ensure we re-probe NFSv4 server capabilities when the user does a "mount -o remount" Bugfixes: - Fix an Oops in pnfs_mark_request_commit() - Fix up deadlocks in the commit code - Fix regressions in NFSv2/v3 attribute revalidation due to the change_attr_type optimisations - Fix some dentry verifier races - Fix some missing dentry verifier settings - Fix a performance regression in nfs_set_open_stateid_locked() - SUNRPC was sending multiple SYN calls when re-establishing a TCP connection. - Fix multiple NFSv4 issues due to missing sanity checking of server return values - Fix a potential Oops when FREE_STATEID races with an unmount Cleanups: - Clean up the labelled NFS code - Remove unused header <linux/pnfs_osd_xdr.h>" * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits) NFSv4: Sanity check the parameters in nfs41_update_target_slotid() NFS: Remove the nfs4_label argument from decode_getattr_*() functions NFS: Remove the nfs4_label argument from nfs_setsecurity NFS: Remove the nfs4_label argument from nfs_fhget() NFS: Remove the nfs4_label argument from nfs_add_or_obtain() NFS: Remove the nfs4_label argument from nfs_instantiate() NFS: Remove the nfs4_label from the nfs_setattrres NFS: Remove the nfs4_label from the nfs4_getattr_res NFS: Remove the f_label from the nfs4_opendata and nfs_openres NFS: Remove the nfs4_label from the nfs4_lookupp_res struct NFS: Remove the label from the nfs4_lookup_res struct NFS: Remove the nfs4_label from the nfs4_link_res struct NFS: Remove the nfs4_label from the nfs4_create_res struct NFS: Remove the nfs4_label from the nfs_entry struct NFS: Create a new nfs_alloc_fattr_with_label() function NFS: Always initialise fattr->label in nfs_fattr_alloc() NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open() NFSv4: Remove unnecessary 'minor version' check NFSv4: Fix potential Oops in decode_op_map() ...
-
Linus Torvalds authored
Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exit cleanups from Eric Biederman: "While looking at some issues related to the exit path in the kernel I found several instances where the code is not using the existing abstractions properly. This set of changes introduces force_fatal_sig a way of sending a signal and not allowing it to be caught, and corrects the misuse of the existing abstractions that I found. A lot of the misuse of the existing abstractions are silly things such as doing something after calling a no return function, rolling BUG by hand, doing more work than necessary to terminate a kernel thread, or calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL). In the review a deficiency in force_fatal_sig and force_sig_seccomp where ptrace or sigaction could prevent the delivery of the signal was found. I have added a change that adds SA_IMMUTABLE to change that makes it impossible to interrupt the delivery of those signals, and allows backporting to fix force_sig_seccomp And Arnd found an issue where a function passed to kthread_run had the wrong prototype, and after my cleanup was failing to build." * 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) soc: ti: fix wkup_m3_rproc_boot_thread return type signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) exit/r8188eu: Replace the macro thread_exit with a simple return 0 exit/rtl8712: Replace the macro thread_exit with a simple return 0 exit/rtl8723bs: Replace the macro thread_exit with a simple return 0 signal/x86: In emulate_vsyscall force a signal instead of calling do_exit signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails exit/syscall_user_dispatch: Send ordinary signals on failure signal: Implement force_fatal_sig exit/kthread: Have kernel threads return instead of calling do_exit signal/s390: Use force_sigsegv in default_trap_handler signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved. signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON signal/sparc: In setup_tsb_params convert open coded BUG into BUG signal/powerpc: On swapcontext failure force SIGSEGV signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT signal/sparc32: Remove unreachable do_exit in do_sparc_fault ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds authored
Pull prctl updates from Christian Brauner: "This contains the missing prctl uapi pieces for PR_SCHED_CORE. In order to activate core scheduling the caller is expected to specify the scope of the new core scheduling domain. For example, passing 2 in the 4th argument of prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>, 2, 0); would indicate that the new core scheduling domain encompasses all tasks in the process group of <pid>. Specifying 0 would only create a core scheduling domain for the thread identified by <pid> and 2 would encompass the whole thread-group of <pid>. Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID, and PIDTYPE_PGID. A first version tried to expose those values directly to which I objected because: - PIDTYPE_* is an enum that is kernel internal which we should not expose to userspace directly. - PIDTYPE_* indicates what a given struct pid is used for it doesn't express a scope. But what the 4th argument of PR_SCHED_CORE prctl() expresses is the scope of the operation, i.e. the scope of the core scheduling domain at creation time. So Eugene's patch now simply introduces three new defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP, and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what happens. This has been on the mailing list for quite a while with all relevant scheduler folks Cced. I announced multiple times that I'd pick this up if I don't see or her anyone else doing it. None of this touches proper scheduler code but only concerns uapi so I think this is fine. With core scheduling being quite common now for vm managers (e.g. moving individual vcpu threads into their own core scheduling domain) and container managers (e.g. moving the init process into its own core scheduling domain and letting all created children inherit it) having to rely on raw numbers passed as the 4th argument in prctl() is a bit annoying and everyone is starting to come up with their own defines" * tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
-
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds authored
Pull pidfd updates from Christian Brauner: "Various places in the kernel have picked up pidfds. The two most recent additions have probably been the ability to use pidfds in bpf maps and the usage of pidfds in mm-based syscalls such as process_mrelease() and process_madvise(). The same pattern to turn a pidfd into a struct task exists in two places. One of those places used PIDTYPE_TGID while the other one used PIDTYPE_PID even though it is clearly documented in all pidfd-helpers that pidfds __currently__ only refer to thread-group leaders (subject to change in the future if need be). This isn't a bug per se but has the potential to be one if we allow pidfds to refer to individual threads. If that happens we want to audit all codepaths that make use of them to ensure they can deal with pidfds refering to individual threads. This adds a simple helper to turn a pidfd into a struct task making it easy to grep for such places. Plus, it gets rid of code-duplication" * tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mm: use pidfd_get_task() pid: add pidfd_get_task() helper
-
- 10 Nov, 2021 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds authored
Pull more thermal control updates from Rafael Wysocki: "These fix two issues in the thermal core and one in the int340x thermal driver. Specifics: - Replace pr_warn() with pr_warn_once() in user_space_bind() to reduce kernel log noise (Rafael Wysocki). - Extend the RFIM mailbox interface in the int340x thermal driver to return 64 bit values to allow all values returned by the hardware to be handled correctly (Srinivas Pandruvada). - Fix possible NULL pointer dereferences in the of_thermal_ family of functions (Subbaraman Narayanamurthy)" * tag 'thermal-5.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: Replace pr_warn() with pr_warn_once() in user_space_bind() thermal: Fix NULL pointer dereferences in of_thermal_ functions thermal/drivers/int340x: processor_thermal: Suppot 64 bit RFIM responses
-