1. 12 May, 2024 19 commits
    • KVM: SEV: Support SEV-SNP AP Creation NAE event · e366f92e
      Tom Lendacky authored
      Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
      guests to alter the register state of their APs on their own, giving the
      guest a way to simulate INIT-SIPI.
      
      A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
      so as to avoid updating the VMSA pointer while the vCPU is running.
      
      For CREATE:
        The guest supplies the GPA of the VMSA to be used for the vCPU with
        the specified APIC ID. The GPA is saved in the svm struct of the
        target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
        to the vCPU and then the vCPU is kicked.
      
      For CREATE_ON_INIT:
        The guest supplies the GPA of the VMSA to be used for the vCPU with
        the specified APIC ID the next time an INIT is performed. The GPA is
        saved in the svm struct of the target vCPU.
      
      For DESTROY:
        The guest indicates it wishes to stop the vCPU. The GPA is cleared
        from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
      added to the vCPU and then the vCPU is kicked.
      
      The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
      as a result of the event or as a result of an INIT. If a new VMSA is to
      be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB
      and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not
      to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state
      is set to KVM_MP_STATE_HALTED to prevent it from being run.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-13-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e366f92e
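
      A minimal, self-contained C model of the CREATE/DESTROY handling described
      above, assuming a simplified vCPU representation; the type and function
      names are illustrative, not the kernel code:

```c
#include <stdint.h>
#include <stdio.h>

enum mp_state { MP_STATE_RUNNABLE, MP_STATE_HALTED };

struct vcpu_model {
    uint64_t vmsa_gpa;     /* GPA supplied by the guest; 0 when cleared (DESTROY) */
    uint64_t vmcb_vmsa;    /* what the VMCB points at */
    enum mp_state state;
};

/* Runs when KVM_REQ_UPDATE_PROTECTED_GUEST_STATE (or an INIT) is processed. */
static void update_protected_guest_state(struct vcpu_model *v)
{
    if (v->vmsa_gpa) {
        v->vmcb_vmsa = v->vmsa_gpa;   /* install the guest-supplied VMSA */
        v->state = MP_STATE_RUNNABLE; /* vCPU may now run */
    } else {
        v->vmcb_vmsa = 0;             /* clear the VMSA */
        v->state = MP_STATE_HALTED;   /* prevent the vCPU from running */
    }
}

int main(void)
{
    struct vcpu_model ap = { .vmsa_gpa = 0x100000, .state = MP_STATE_HALTED };

    update_protected_guest_state(&ap);   /* CREATE: becomes runnable */
    printf("runnable=%d\n", ap.state == MP_STATE_RUNNABLE);

    ap.vmsa_gpa = 0;
    update_protected_guest_state(&ap);   /* DESTROY: halted again */
    printf("halted=%d\n", ap.state == MP_STATE_HALTED);
    return 0;
}
```
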
    • KVM: SEV: Add support to handle RMP nested page faults · c63cf135
      Brijesh Singh authored
      When SEV-SNP is enabled in the guest, the hardware places restrictions
      on all memory accesses based on the contents of the RMP table. When
      the hardware encounters an RMP check failure caused by a guest memory
      access, it raises a #NPF. The error code contains additional information
      on the access type. See the APM, Volume 2, for additional information.
      
      When using gmem, RMP faults resulting from mismatches between the state
      in the RMP table vs. what the guest expects via its page table result
      in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
      means the only expected case that needs to be handled in the kernel is
      when the page size of the entry in the RMP table is larger than the
      mapping in the nested page table, in which case a PSMASH instruction
      needs to be issued to split the large RMP entry into individual 4K
      entries so that subsequent accesses can succeed.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-12-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c63cf135
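
      A hedged sketch of the dispatch decision described above, using stand-in
      types rather than KVM's: the only case resolved in the kernel is the
      RMP/NPT page-size mismatch, handled via PSMASH; everything else goes to
      userspace.

```c
#include <stdbool.h>
#include <stdio.h>

#define PG_LEVEL_4K 1
#define PG_LEVEL_2M 2

struct rmp_fault {
    int rmp_level;       /* page size recorded in the RMP entry */
    int npt_level;       /* page size of the nested page table mapping */
    bool size_mismatch;  /* #NPF error code indicated an RMP size mismatch */
};

/* Returns true if the fault is resolved in the kernel (via PSMASH),
 * false if it must be forwarded to userspace as KVM_EXIT_MEMORY_FAULT. */
static bool handle_rmp_fault(const struct rmp_fault *f)
{
    if (f->size_mismatch && f->rmp_level > f->npt_level) {
        /* Split the large RMP entry into 4K entries so the access can retry. */
        printf("psmash: splitting 2M RMP entry to 4K\n");
        return true;
    }
    /* Private/shared state mismatch: let userspace fix the attributes. */
    return false;
}

int main(void)
{
    struct rmp_fault split = { PG_LEVEL_2M, PG_LEVEL_4K, true };
    struct rmp_fault state = { PG_LEVEL_4K, PG_LEVEL_4K, false };

    printf("size mismatch handled in kernel:  %d\n", handle_rmp_fault(&split));
    printf("state mismatch handled in kernel: %d\n", handle_rmp_fault(&state));
    return 0;
}
```
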
    • KVM: SEV: Add support to handle Page State Change VMGEXIT · 9b54e248
      Michael Roth authored
      SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
      table to be private or shared using the Page State Change NAE event
      as defined in the GHCB specification version 2.
      
      Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
      it is done for requests that don't use a GHCB page.
      
      As with the MSR-based page-state changes, use the existing
      KVM_HC_MAP_GPA_RANGE hypercall format to deliver these requests to
      userspace via KVM_EXIT_HYPERCALL.
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Co-developed-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-11-michael.roth@amd.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9b54e248
    • KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT · d46b7b6a
      Michael Roth authored
      SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
      table to be private or shared using the Page State Change MSR protocol
      as defined in the GHCB specification.
      
      When using gmem, private/shared memory is allocated through separate
      pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
      ioctl to tell the KVM MMU whether or not a particular GFN should be
      backed by private memory.
      
      Forward these page state change requests to userspace so that it can
      issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
      entries when it is ready to map a private page into a guest.
      
      Use the existing KVM_HC_MAP_GPA_RANGE hypercall format to deliver these
      requests to userspace via KVM_EXIT_HYPERCALL.
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Co-developed-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-10-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d46b7b6a
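
      A hedged sketch of how a VMM might service the forwarded request: a
      KVM_EXIT_HYPERCALL carrying KVM_HC_MAP_GPA_RANGE is translated into a
      KVM_SET_MEMORY_ATTRIBUTES ioctl. Field and constant names follow my
      reading of the KVM UAPI (kernel headers 6.8 or newer assumed), the
      KVM_MAP_GPA_RANGE_ENCRYPTED fallback define is an assumption, and
      handle_map_gpa_range() is an illustrative helper, not a reference VMM
      implementation:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MAP_GPA_RANGE_ENCRYPTED
#define KVM_MAP_GPA_RANGE_ENCRYPTED (1ULL << 4)   /* assumption: "make private" flag */
#endif

/* Called when kvm_run reports KVM_EXIT_HYPERCALL with nr == KVM_HC_MAP_GPA_RANGE. */
static int handle_map_gpa_range(int vm_fd, struct kvm_run *run)
{
    uint64_t gpa    = run->hypercall.args[0];
    uint64_t npages = run->hypercall.args[1];
    uint64_t flags  = run->hypercall.args[2];

    struct kvm_memory_attributes attrs = {
        .address    = gpa,
        .size       = npages * 4096,
        .attributes = (flags & KVM_MAP_GPA_RANGE_ENCRYPTED) ?
                      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
    };

    /* Tell the KVM MMU whether this range should be backed by private (gmem)
     * or shared memory; KVM updates the RMP entries when it next maps it. */
    int ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);

    run->hypercall.ret = ret ? (uint64_t)-1 : 0;  /* result reported to the guest */
    return ret;
}
```
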
    • KVM: SEV: Add support to handle GHCB GPA register VMGEXIT · 0c76b1d0
      Brijesh Singh authored
      SEV-SNP guests are required to perform a GHCB GPA registration. Before
      using a GHCB GPA for a vCPU for the first time, a guest must register the
      vCPU's GHCB GPA. If the hypervisor can work with the guest-requested GPA,
      it must respond back with the same GPA; otherwise it returns -1.
      
      On VMEXIT, verify that the GHCB GPA matches with the registered value.
      If a mismatch is detected, then abort the guest.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240501085210.2213060-9-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c76b1d0
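
      An illustrative, self-contained C model of the registration and
      VMEXIT-time check described above; the struct and function names here
      are hypothetical, not the kernel code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define GHCB_GPA_INVALID UINT64_MAX

struct vcpu_ghcb {
    uint64_t registered_gpa;   /* value recorded at registration time */
};

/* Registration request: accept the guest's GPA or report failure (-1). */
static uint64_t ghcb_register(struct vcpu_ghcb *v, uint64_t requested_gpa, bool usable)
{
    if (!usable)
        return GHCB_GPA_INVALID;     /* hypervisor cannot use this GPA */
    v->registered_gpa = requested_gpa;
    return requested_gpa;            /* echo the same GPA back to the guest */
}

/* On VMEXIT, the GHCB GPA in use must match the registered one. */
static bool ghcb_gpa_is_valid(const struct vcpu_ghcb *v, uint64_t ghcb_gpa)
{
    return ghcb_gpa == v->registered_gpa;   /* mismatch => abort the guest */
}

int main(void)
{
    struct vcpu_ghcb v = { .registered_gpa = GHCB_GPA_INVALID };

    printf("registered: %d\n", ghcb_register(&v, 0x200000, true) == 0x200000);
    printf("vmexit ok:  %d\n", ghcb_gpa_is_valid(&v, 0x200000));
    printf("mismatch:   %d\n", !ghcb_gpa_is_valid(&v, 0x300000));
    return 0;
}
```
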
    • KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command · ad27ce15
      Brijesh Singh authored
      Add a KVM_SEV_SNP_LAUNCH_FINISH command to finalize the cryptographic
      launch digest which stores the measurement of the guest at launch time.
      Also extend the existing SNP firmware data structures to support
      disabling the use of Versioned Chip Endorsement Keys (VCEK) by guests as
      part of this command.
      
      While finalizing the launch flow, the code also issues the LAUNCH_UPDATE
      SNP firmware commands to encrypt/measure the initial VMSA pages for each
      configured vCPU, which requires setting the RMP entries for those pages
      to private, so also add handling to clean up the RMP entries for these
      pages when freeing vCPUs during shutdown.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Harald Hoyer <harald@profian.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-8-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad27ce15
    • KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command · dee5a47c
      Brijesh Singh authored
      A key aspect of launching an SNP guest is initializing it with a
      known/measured payload which is then encrypted into guest memory as
      pre-validated private pages and then measured into the cryptographic
      launch context created with KVM_SEV_SNP_LAUNCH_START so that the guest
      can attest itself after booting.
      
      Since all private pages are provided by guest_memfd, make use of the
      kvm_gmem_populate() interface to handle this. The general flow is that
      guest_memfd will handle allocating the pages associated with the GPA
      ranges being initialized by each particular call of
      KVM_SEV_SNP_LAUNCH_UPDATE, copying data from userspace into those pages,
      and then the post_populate callback will do the work of setting the
      RMP entries for these pages to private and issuing the SNP firmware
      calls to encrypt/measure them.
      
      For more information see the SEV-SNP specification.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-7-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dee5a47c
    • KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command · 136d8bc9
      Brijesh Singh authored
      KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
      The command initializes a cryptographic digest context used to construct
      the measurement of the guest. Other commands can then be used to
      load/encrypt data into the guest's initial launch image.
      
      For more information see the SEV-SNP specification.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-ID: <20240501085210.2213060-6-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      136d8bc9
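
      A hedged sketch of the overall launch sequence from a VMM's perspective,
      tying together the three KVM_SEV_SNP_LAUNCH_* commands above. The
      snp_launch_*() helpers are hypothetical stubs standing in for
      KVM_MEMORY_ENCRYPT_OP calls, and the policy/GPA/page-count values are
      placeholders, not meaningful defaults:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs: a real VMM would issue KVM_SEV_SNP_LAUNCH_START/UPDATE/
 * FINISH through KVM_MEMORY_ENCRYPT_OP using the structs in the KVM UAPI. */
static int snp_launch_start(int vm_fd, uint64_t policy)
{
    (void)vm_fd;
    printf("LAUNCH_START  policy=%#llx (creates the measurement context)\n",
           (unsigned long long)policy);
    return 0;
}

static int snp_launch_update(int vm_fd, uint64_t gpa, uint64_t npages)
{
    (void)vm_fd;
    printf("LAUNCH_UPDATE gpa=%#llx pages=%llu (encrypt+measure payload)\n",
           (unsigned long long)gpa, (unsigned long long)npages);
    return 0;
}

static int snp_launch_finish(int vm_fd)
{
    (void)vm_fd;
    printf("LAUNCH_FINISH (finalize launch digest, measure vCPU VMSAs)\n");
    return 0;
}

int main(void)
{
    int vm_fd = -1;   /* placeholder; a real VMM passes its KVM VM fd */

    /* Order matters: START, then one or more UPDATEs, then FINISH. */
    if (snp_launch_start(vm_fd, 0x30000) ||       /* policy value is illustrative */
        snp_launch_update(vm_fd, 0x100000, 256) ||
        snp_launch_finish(vm_fd))
        return 1;
    return 0;
}
```
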
    • KVM: SEV: Add initial SEV-SNP support · 1dfe571c
      Brijesh Singh authored
      SEV-SNP builds upon existing SEV and SEV-ES functionality while adding
      new hardware-based security protection. SEV-SNP adds strong memory
      encryption and integrity protection to help prevent malicious
      hypervisor-based attacks such as data replay, memory re-mapping, and
      more, to create an isolated execution environment.
      
      Define a new KVM_X86_SNP_VM type which makes use of these capabilities
      and extend the KVM_SEV_INIT2 ioctl to support it. Also add a basic
      helper to check whether SNP is enabled and set PFERR_PRIVATE_ACCESS for
      private #NPFs so they are handled appropriately by the KVM MMU.
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240501085210.2213060-5-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1dfe571c
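
      A hedged sketch of creating a VM of the new type and initializing it with
      KVM_SEV_INIT2. Constant and struct names follow my reading of the KVM
      UAPI introduced around this series (very recent kernel headers assumed),
      error handling is omitted, and this is a sketch rather than a reference
      implementation:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Every call here can fail; checks are omitted to keep the sketch short. */
int create_snp_vm(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    int sev_fd = open("/dev/sev", O_RDWR);

    /* The new VM type added by this commit. */
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SNP_VM);

    struct kvm_sev_init init = { 0 };           /* defaults; no extra VMSA features */
    struct kvm_sev_cmd cmd = {
        .id     = KVM_SEV_INIT2,
        .data   = (uint64_t)(uintptr_t)&init,
        .sev_fd = (uint32_t)sev_fd,
    };
    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

    return vm_fd;
}
```
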
    • KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y · a8e31983
      Michael Roth authored
      SEV-SNP relies on private memory support to run guests, so make sure to
      enable that support via the CONFIG_KVM_GENERIC_PRIVATE_MEM config
      option.
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240501085210.2213060-4-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a8e31983
    • KVM: MMU: Disable fast path if KVM_EXIT_MEMORY_FAULT is needed · b74d002d
      Michael Roth authored
      For hardware-protected VMs like SEV-SNP guests, certain conditions like
      attempting to perform a write to a page which is not in the state that
      the guest expects it to be in can result in a nested/extended #PF which
      can only be satisfied by the host performing an implicit page state
      change to transition the page into the expected shared/private state.
      This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event
      that gets forwarded to userspace to handle via
      KVM_SET_MEMORY_ATTRIBUTES.
      
      However, the fast_page_fault() code might misconstrue this situation as
      being the result of a write-protected access, and treat it as a spurious
      case when it sees that writes are already allowed for the sPTE. This
      results in the KVM MMU trying to resume the guest rather than taking any
      action to satisfy the real source of the #PF such as generating a
      KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs.
      
      Check for this condition and bail out of the fast path if it is
      detected.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b74d002d
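
      An illustrative model of the fast-path bail-out described above, with
      stand-in types rather than the KVM MMU's: the point is that a
      private/shared expectation mismatch must not be treated as a spurious
      write-protection fault, even if the SPTE already permits writes.

```c
#include <stdbool.h>
#include <stdio.h>

struct fault_model {
    bool is_private;      /* fault error code says the access was private */
    bool gfn_is_private;  /* what KVM's memory attributes say for this GFN */
    bool spte_writable;   /* the existing SPTE already permits writes */
};

static bool fast_path_allowed(const struct fault_model *f)
{
    /* Mismatch means an implicit page-state change is needed: bail out so the
     * slow path can generate KVM_EXIT_MEMORY_FAULT for userspace. */
    if (f->is_private != f->gfn_is_private)
        return false;
    return f->spte_writable;   /* otherwise the usual fast-path rules apply */
}

int main(void)
{
    struct fault_model mismatch = { .is_private = true,  .gfn_is_private = false, .spte_writable = true };
    struct fault_model normal   = { .is_private = false, .gfn_is_private = false, .spte_writable = true };

    printf("mismatch takes slow path:  %d\n", !fast_path_allowed(&mismatch));
    printf("normal stays on fast path: %d\n", fast_path_allowed(&normal));
    return 0;
}
```
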
    • Merge branch 'kvm-coco-hooks' into HEAD · 73232603
      Paolo Bonzini authored
      Common patches for the target-independent functionality and hooks
      that are needed by SEV-SNP and TDX.
      73232603
    • Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD · 7d41e24d
      Paolo Bonzini authored
      KVM x86 misc changes for 6.10:
      
       - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
         is unused by hardware, so that KVM can communicate its inability to map GPAs
         that set bits 51:48 due to lack of 5-level paging.  Guest firmware is
         expected to use the information to safely remap BARs in the uppermost GPA
         space, i.e. to avoid placing a BAR at a legal, but unmappable, GPA.
      
       - Use vfree() instead of kvfree() for allocations that always use vcalloc()
         or __vcalloc().
      
       - Don't completely ignore same-value writes to immutable feature MSRs, as
         doing so results in KVM failing to reject accesses to MSRs that aren't
         supposed to exist given the vCPU model and/or KVM configuration.
      
       - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
         KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
         ABSENT inhibit, even if userspace enables in-kernel local APIC).
      7d41e24d
    • Merge tag 'kvm-x86-mmu-6.10' of https://github.com/kvm-x86/linux into HEAD · 5a1c72e0
      Paolo Bonzini authored
      KVM x86 MMU changes for 6.10:
      
       - Process TDP MMU SPTEs that are zapped while holding mmu_lock for read
         after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows
         vCPU tasks to repopulate the zapped region while the zapper finishes tearing
         down the old, defunct page tables.
      
       - Fix a longstanding, likely benign-in-practice race where KVM could fail to
         detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is
         in the first page table being shadowed.
      5a1c72e0
    • Merge tag 'kvm-x86-selftests_utils-6.10' of https://github.com/kvm-x86/linux into HEAD · dee7ea42
      Paolo Bonzini authored
      KVM selftests treewide updates for 6.10:
      
       - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
         a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
         every test to #define _GNU_SOURCE is painful.
      
       - Provide a global pseudo-RNG instance for all tests, so that library code can
         generate random, but deterministic numbers.
      
       - Use the global pRNG to randomly force emulation of select writes from guest
         code on x86, e.g. to help validate KVM's emulation of locked accesses.
      
       - Rename kvm_util_base.h back to kvm_util.h, as the weird layer of indirection
         was added purely to avoid manually #including ucall_common.h in a handful of
         locations.
      
       - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
         handlers at VM creation, instead of forcing tests to manually trigger the
         related setup.
      dee7ea42
    • Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD · 31a6cd7f
      Paolo Bonzini authored
      KVM VMX changes for 6.10:
      
       - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
         L1, as per the SDM.
      
       - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
         used only when synthesizing a nested EPT violation, i.e. it's not the vCPU's
         "real" exit_qualification, which is tracked elsewhere.
      
       - Add a sanity check to assert that EPT Violations are the only sources of
         nested PML Full VM-Exits.
      31a6cd7f
    • Merge tag 'kvm-x86-selftests-6.10' of https://github.com/kvm-x86/linux into HEAD · 56f40708
      Paolo Bonzini authored
      KVM selftests cleanups and fixes for 6.10:
      
       - Enhance the demand paging test to allow for better reporting and stressing
         of UFFD performance.
      
       - Convert the steal time test to generate TAP-friendly output.
      
       - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
         time across two different clock domains.
      
       - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
      
       - Avoid unnecessary use of "sudo" in the NX hugepage test to play nice with
         running in a minimal userspace environment.
      
       - Allow skipping the RSEQ test's sanity check that the vCPU was able to
         complete a reasonable number of KVM_RUNs, as the assert can fail on a
         completely valid setup.  If the test is run on a large-ish system that is
         otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
         vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
         which results in the vCPU having very little net runtime before the next
         migration due to high wakeup latencies.
      56f40708
    • Merge tag 'kvm-x86-generic-6.10' of https://github.com/kvm-x86/linux into HEAD · f4bc1373
      Paolo Bonzini authored
      KVM cleanups for 6.10:
      
       - Misc cleanups extracted from the "exit on missing userspace mapping" series,
         which has been put on hold in anticipation of a "KVM Userfault" approach,
         which should provide a superset of functionality.
      
       - Remove kvm_make_all_cpus_request_except(), which got added to hack around an
         AVIC bug, and then became dead code when a more robust fix came along.
      
       - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
      f4bc1373
    • Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · e5f62e27
      Paolo Bonzini authored
      KVM/arm64 updates for Linux 6.10
      
      - Move a lot of state that was previously stored on a per vcpu
        basis into a per-CPU area, because it is only pertinent to the
        host while the vcpu is loaded. This results in better state
        tracking, and a smaller vcpu structure.
      
      - Add full handling of the ERET/ERETAA/ERETAB instructions in
        nested virtualisation. The last two instructions also require
        emulating part of the pointer authentication extension.
        As a result, the trap handling of pointer authentication has
        been greatly simplified.
      
      - Turn the global (and not very scalable) LPI translation cache
        into a per-ITS, scalable cache, making non-directly-injected
        LPIs much cheaper to make visible to the vcpu.
      
      - A batch of pKVM patches, mostly fixes and cleanups, as the
        upstreaming process seems to be resuming. Fingers crossed!
      
      - Allocate PPIs and SGIs outside of the vcpu structure, allowing
        for smaller EL2 mapping and some flexibility in implementing
        more or less than 32 private IRQs.
      
      - Purge stale mpidr_data if a vcpu is created after the MPIDR
        map has been created.
      
      - Preserve vcpu-specific ID registers across a vcpu reset.
      
      - Various minor cleanups and improvements.
      e5f62e27
  2. 10 May, 2024 13 commits
    • Merge tag 'loongarch-kvm-6.10' of... · 4232da23
      Paolo Bonzini authored
      Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
      
      LoongArch KVM changes for v6.10
      
      1. Add ParaVirt IPI support.
      2. Add software breakpoint support.
      3. Add mmio trace events support.
      4232da23
    • Merge branch 'kvm-sev-es-ghcbv2' into HEAD · bbe10a5c
      Paolo Bonzini authored
      While the main additions from GHCB protocol version 1 to version 2
      revolve mostly around SEV-SNP support, there are a number of changes
      applicable to SEV-ES guests as well. Pluck a handful of patches from the
      SNP hypervisor patchset for GHCB-related changes that are also applicable
      to SEV-ES.  A KVM_SEV_INIT2 field lets userspace control the maximum
      GHCB protocol version advertised to guests and manage compatibility
      across kernels/versions.
      bbe10a5c
    • Merge branch 'kvm-coco-pagefault-prep' into HEAD · f3650842
      Paolo Bonzini authored
      A combination of prep work for TDX and SNP, and a clean up of the
      page fault path to (hopefully) make it easier to follow the rules for
      private memory, noslot faults, writes to read-only slots, etc.
      f3650842
    • Merge branch 'kvm-vmx-ve' into HEAD · 1e21b538
      Paolo Bonzini authored
      Allow a non-zero value for non-present SPTE and removed SPTE,
      so that TDX can set the "suppress VE" bit.
      1e21b538
    • KVM: x86: Add hook for determining max NPT mapping level · f32fb328
      Michael Roth authored
      In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
      2MB mapping in the guest's nested page table depends on whether or not
      any subpages within the range have already been initialized as private
      in the RMP table. The existing mixed-attribute tracking in KVM is
      insufficient here, for instance:
      
      - gmem allocates 2MB page
      - guest issues PVALIDATE on 2MB page
      - guest later converts a subpage to shared
      - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
      - KVM MMU splits NPT mapping to 4K
      - guest later converts that shared page back to private
      
      At this point there are no mixed attributes, and KVM would normally
      allow for 2MB NPT mappings again, but this is actually not allowed
      because the RMP table mappings are 4K and cannot be promoted on the
      hypervisor side, so the NPT mappings must still be limited to 4K to
      match this.
      
      Add a hook to determine the max NPT mapping size in situations like
      this.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Message-ID: <20240501085210.2213060-3-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f32fb328
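
      An illustrative model of how such a hook caps the mapping level; the
      names are made up for the example and do not reflect the kernel hook's
      actual signature. Even when KVM's attribute tracking would allow 2M, the
      result is limited by what the RMP entries currently support.

```c
#include <stdio.h>

#define PG_LEVEL_4K 1
#define PG_LEVEL_2M 2

static int min_level(int a, int b) { return a < b ? a : b; }

/* attr_level: max level allowed by KVM's mixed-attribute tracking.
 * rmp_level:  max level the RMP entries for this range allow (4K once a 2M
 *             RMP entry has been PSMASHed, since the hypervisor cannot
 *             promote it back). */
static int max_npt_mapping_level(int attr_level, int rmp_level)
{
    return min_level(attr_level, rmp_level);
}

int main(void)
{
    /* The scenario from the commit message: attributes say 2M is fine, but
     * the RMP entries were smashed to 4K, so the NPT mapping stays 4K. */
    printf("level=%d\n", max_npt_mapping_level(PG_LEVEL_2M, PG_LEVEL_4K));
    return 0;
}
```
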
    • KVM: guest_memfd: Add hook for invalidating memory · a90764f0
      Michael Roth authored
      In some cases, like with SEV-SNP, guest memory needs to be updated in a
      platform-specific manner before it can be safely freed back to the host.
      Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
      allow for special handling of this sort when freeing memory in response
      to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
      ahead and define an arch-specific hook for x86 since it will be needed
      for handling memory used for SEV-SNP guests.
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20231230172351.574091-6-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a90764f0
    • KVM: guest_memfd: Add interface for populating gmem pages with user data · 1f6c06b1
      Paolo Bonzini authored
      During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
      prepare newly-allocated gmem pages prior to mapping them into the guest.
      In the case of SEV-SNP, this mainly involves setting the pages to
      private in the RMP table.
      
      However, for the GPA ranges comprising the initial guest payload, which
      are encrypted/measured prior to starting the guest, the gmem pages need
      to be accessed prior to setting them to private in the RMP table so they
      can be initialized with the userspace-provided data. Additionally, an
      SNP firmware call is needed afterward to encrypt them in-place and
      measure the contents into the guest's launch digest.
      
      While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
      this handling can be done in an open-coded/vendor-specific manner, this
      may expose more gmem-internal state/dependencies to external callers
      than necessary. Try to avoid this by implementing an interface that
      tries to handle as much of the common functionality inside gmem as
      possible, while also making it generic enough to potentially be
      usable/extensible for TDX as well.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1f6c06b1
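
      An illustrative model of the populate flow described above; the callback
      shape and names are simplified stand-ins, not the in-tree
      kvm_gmem_populate() signature:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef int (*post_populate_fn)(uint64_t gfn, void *page, void *opaque);

/* gmem allocates the page, copies the userspace data into it, then hands it
 * to the arch callback, which (for SNP) sets the RMP entry to private and
 * issues the firmware call to encrypt/measure the page in place. */
static int gmem_populate_one(uint64_t gfn, const void *src,
                             post_populate_fn post_populate, void *opaque)
{
    static unsigned char page[PAGE_SIZE];    /* stands in for the gmem folio */

    memcpy(page, src, PAGE_SIZE);            /* copy the initial payload */
    return post_populate(gfn, page, opaque); /* arch-specific: RMP + LAUNCH_UPDATE */
}

static int snp_post_populate(uint64_t gfn, void *page, void *opaque)
{
    (void)page; (void)opaque;
    printf("gfn %#lx: set RMP private, SNP LAUNCH_UPDATE\n", (unsigned long)gfn);
    return 0;
}

int main(void)
{
    unsigned char payload[PAGE_SIZE] = { 0 };
    return gmem_populate_one(0x100, payload, snp_post_populate, NULL);
}
```
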
    • KVM: guest_memfd: extract __kvm_gmem_get_pfn() · 17573fd9
      Paolo Bonzini authored
      In preparation for adding a function that walks a set of pages
      provided by userspace and populates them in a guest_memfd,
      add a version of kvm_gmem_get_pfn() that has a "bool prepare"
      argument and passes it down to kvm_gmem_get_folio().
      
      Populating guest memory has to call __kvm_gmem_get_pfn() repeatedly on
      the same file, so make the new function take a struct file *.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17573fd9
    • KVM: guest_memfd: Add hook for initializing memory · 3bb2531e
      Paolo Bonzini authored
      guest_memfd pages are generally expected to be in some arch-defined
      initial state prior to using them for guest memory. For SEV-SNP this
      initial state is 'private', or 'guest-owned', and requires additional
      operations to move these pages into a 'private' state by updating the
      corresponding entries in the RMP table.
      
      Allow for an arch-defined hook to handle updates of this sort, and go
      ahead and implement one for x86 so KVM implementations like AMD SVM can
      register a kvm_x86_ops callback to handle these updates for SEV-SNP
      guests.
      
      The preparation callback is always called when allocating/grabbing
      folios via gmem, and it is up to the architecture to keep track of
      whether or not the pages are already in the expected state (e.g. the RMP
      table in the case of SEV-SNP).
      
      In some cases, it is necessary to defer the preparation of the pages to
      handle things like in-place encryption of initial guest memory payloads
      before marking these pages as 'private'/'guest-owned'.  Add an argument
      (always true for now) to kvm_gmem_get_folio() that allows for the
      preparation callback to be bypassed.  To detect possible issues in
      the way userspace initializes memory, it is only possible to add an
      unprepared page if it is not already included in the filemap.
      
      Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3bb2531e
    • KVM: guest_memfd: limit overzealous WARN · fa30b0dc
      Paolo Bonzini authored
      Because kvm_gmem_get_pfn() is called from the page fault path without
      any of the slots_lock, filemap lock or mmu_lock taken, it is
      possible for it to race with kvm_gmem_unbind().  This is not a
      problem, as any PTE that is installed temporarily will be zapped
      before the guest has the occasion to run.
      
      However, it is not possible to have a complete unbind+bind
      racing with the page fault, because deleting the memslot
      will call synchronize_srcu_expedited() and wait for the
      page fault to be resolved.  Thus, we can still warn if
      the file is there and is not the one we expect.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa30b0dc
    • KVM: guest_memfd: pass error up from filemap_grab_folio · 70623723
      Paolo Bonzini authored
      Some SNP ioctls will require the page not to be in the pagecache, and as such they
      will want to return EEXIST to userspace.  Start by passing the error up from
      filemap_grab_folio.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70623723
    • KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode · 1d23040c
      Michael Roth authored
      truncate_inode_pages_range() may attempt to zero pages before truncating
      them, and this will occur before arch-specific invalidations can be
      triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For
      AMD SEV-SNP this would result in an RMP #PF being generated by the
      hardware, which is currently treated as fatal (and even if specifically
      allowed for, would not result in anything other than garbage being
      written to guest pages due to encryption). On Intel TDX this would also
      result in undesirable behavior.
      
      Set the AS_INACCESSIBLE flag to prevent the MM from attempting
      unexpected accesses of this sort during operations like truncation.
      
      This may also in some cases yield a decent performance improvement for
      guest_memfd userspace implementations that hole-punch ranges immediately
      after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
      the current implementation of truncate_inode_pages_range() always ends
      up zeroing an entire 4K range if it is backed by a 2M folio.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-6-michael.roth@amd.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1d23040c
    • mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory · c72ceafb
      Michael Roth authored
      filemap users like guest_memfd may use page cache pages to
      allocate/manage memory that is only intended to be accessed by guests
      via hardware protections like encryption. Writes to memory of this sort
      in common paths like truncation may cause unexpected behavior such as
      writing garbage instead of zeros when attempting to zero pages, or
      worse, triggering hardware protections that are considered fatal as far
      as the kernel is concerned.
      
      Introduce a new address_space flag, AS_INACCESSIBLE, and use this
      initially to prevent zeroing of pages during truncation, with the
      understanding that it is up to the owner of the mapping to handle this
      specially if needed.
      
      This is admittedly a rather blunt solution, but it seems like
      there are no other places that need to take the flag into
      account to keep its promise.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
      Cc: Matthew Wilcox <willy@infradead.org>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-5-michael.roth@amd.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c72ceafb
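
      An illustrative model of the rule the flag establishes for truncation;
      the flag bit, struct, and function here are stand-ins, not mm code. The
      owner of an inaccessible mapping is expected to do any required cleanup
      through its own invalidate/free hooks instead.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define AS_INACCESSIBLE_MODEL (1u << 0)   /* stand-in for the real flag bit */

struct mapping_model {
    unsigned int flags;
};

/* Partial-page truncation normally zeroes the affected range; skip that for
 * mappings the CPU must not touch (encrypted/guest-owned memory). */
static void truncate_partial_page(const struct mapping_model *m,
                                  unsigned char *page, size_t start, size_t len)
{
    if (m->flags & AS_INACCESSIBLE_MODEL)
        return;                     /* do not write to inaccessible memory */
    memset(page + start, 0, len);   /* ordinary page cache behaviour */
}

int main(void)
{
    unsigned char page[4096];
    memset(page, 0xaa, sizeof(page));

    struct mapping_model gmem = { .flags = AS_INACCESSIBLE_MODEL };
    truncate_partial_page(&gmem, page, 0, 512);
    printf("untouched: %d\n", page[0] == 0xaa);   /* zeroing was skipped */
    return 0;
}
```
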
  3. 09 May, 2024 8 commits