1. 12 May, 2024 8 commits
    • Paolo Bonzini's avatar
      Merge branch 'kvm-coco-hooks' into HEAD · 73232603
      Paolo Bonzini authored
      Common patches for the target-independent functionality and hooks
      that are needed by SEV-SNP and TDX.
      73232603
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD · 7d41e24d
      Paolo Bonzini authored
      KVM x86 misc changes for 6.10:
      
       - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
         is unused by hardware, so that KVM can communicate its inability to map GPAs
         that set bits 51:48 due to lack of 5-level paging.  Guest firmware is
         expected to use the information to safely remap BARs in the uppermost GPA
         space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
      
       - Use vfree() instead of kvfree() for allocations that always use vcalloc()
         or __vcalloc().
      
       - Don't completely ignore same-value writes to immutable feature MSRs, as
         doing so results in KVM failing to reject accesses to MSR that aren't
         supposed to exist given the vCPU model and/or KVM configuration.
      
       - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
         KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
         ABSENT inhibit, even if userspace enables in-kernel local APIC).
      7d41e24d
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-mmu-6.10' of https://github.com/kvm-x86/linux into HEAD · 5a1c72e0
      Paolo Bonzini authored
      KVM x86 MMU changes for 6.10:
      
       - Process TDP MMU SPTEs that are are zapped while holding mmu_lock for read
         after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows
         vCPU tasks to repopulate the zapped region while the zapper finishes tearing
         down the old, defunct page tables.
      
       - Fix a longstanding, likely benign-in-practice race where KVM could fail to
         detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is
         first page table being shadowed.
      5a1c72e0
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-selftests_utils-6.10' of https://github.com/kvm-x86/linux into HEAD · dee7ea42
      Paolo Bonzini authored
      KVM selftests treewide updates for 6.10:
      
       - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
         a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
         every test to #define _GNU_SOURCE is painful.
      
       - Provide a global psuedo-RNG instance for all tests, so that library code can
         generate random, but determinstic numbers.
      
       - Use the global pRNG to randomly force emulation of select writes from guest
         code on x86, e.g. to help validate KVM's emulation of locked accesses.
      
       - Rename kvm_util_base.h back to kvm_util.h, as the weird layer of indirection
         was added purely to avoid manually #including ucall_common.h in a handful of
         locations.
      
       - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
         handlers at VM creation, instead of forcing tests to manually trigger the
         related setup.
      dee7ea42
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD · 31a6cd7f
      Paolo Bonzini authored
      KVM VMX changes for 6.10:
      
       - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
         L1, as per the SDM.
      
       - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
         used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
         "real" exit_qualification, which is tracked elsewhere.
      
       - Add a sanity check to assert that EPT Violations are the only sources of
         nested PML Full VM-Exits.
      31a6cd7f
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-selftests-6.10' of https://github.com/kvm-x86/linux into HEAD · 56f40708
      Paolo Bonzini authored
      KVM selftests cleanups and fixes for 6.10:
      
       - Enhance the demand paging test to allow for better reporting and stressing
         of UFFD performance.
      
       - Convert the steal time test to generate TAP-friendly output.
      
       - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
         time across two different clock domains.
      
       - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
      
       - Avoid unnecessary use of "sudo" in the NX hugepage test to play nice with
         running in a minimal userspace environment.
      
       - Allow skipping the RSEQ test's sanity check that the vCPU was able to
         complete a reasonable number of KVM_RUNs, as the assert can fail on a
         completely valid setup.  If the test is run on a large-ish system that is
         otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
         vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
         which results in the vCPU having very little net runtime before the next
         migration due to high wakeup latencies.
      56f40708
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-generic-6.10' of https://github.com/kvm-x86/linux into HEAD · f4bc1373
      Paolo Bonzini authored
      KVM cleanups for 6.10:
      
       - Misc cleanups extracted from the "exit on missing userspace mapping" series,
         which has been put on hold in anticipation of a "KVM Userfault" approach,
         which should provide a superset of functionality.
      
       - Remove kvm_make_all_cpus_request_except(), which got added to hack around an
         AVIC bug, and then became dead code when a more robust fix came along.
      
       - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
      f4bc1373
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · e5f62e27
      Paolo Bonzini authored
      KVM/arm64 updates for Linux 6.10
      
      - Move a lot of state that was previously stored on a per vcpu
        basis into a per-CPU area, because it is only pertinent to the
        host while the vcpu is loaded. This results in better state
        tracking, and a smaller vcpu structure.
      
      - Add full handling of the ERET/ERETAA/ERETAB instructions in
        nested virtualisation. The last two instructions also require
        emulating part of the pointer authentication extension.
        As a result, the trap handling of pointer authentication has
        been greattly simplified.
      
      - Turn the global (and not very scalable) LPI translation cache
        into a per-ITS, scalable cache, making non directly injected
        LPIs much cheaper to make visible to the vcpu.
      
      - A batch of pKVM patches, mostly fixes and cleanups, as the
        upstreaming process seems to be resuming. Fingers crossed!
      
      - Allocate PPIs and SGIs outside of the vcpu structure, allowing
        for smaller EL2 mapping and some flexibility in implementing
        more or less than 32 private IRQs.
      
      - Purge stale mpidr_data if a vcpu is created after the MPIDR
        map has been created.
      
      - Preserve vcpu-specific ID registers across a vcpu reset.
      
      - Various minor cleanups and improvements.
      e5f62e27
  2. 10 May, 2024 13 commits
    • Paolo Bonzini's avatar
      Merge tag 'loongarch-kvm-6.10' of... · 4232da23
      Paolo Bonzini authored
      Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
      
      LoongArch KVM changes for v6.10
      
      1. Add ParaVirt IPI support.
      2. Add software breakpoint support.
      3. Add mmio trace events support.
      4232da23
    • Paolo Bonzini's avatar
      Merge branch 'kvm-sev-es-ghcbv2' into HEAD · bbe10a5c
      Paolo Bonzini authored
      While the main additions from GHCB protocol version 1 to version 2
      revolve mostly around SEV-SNP support, there are a number of changes
      applicable to SEV-ES guests as well. Pluck a handful patches from the
      SNP hypervisor patchset for GHCB-related changes that are also applicable
      to SEV-ES.  A KVM_SEV_INIT2 field lets userspace can control the maximum
      GHCB protocol version advertised to guests and manage compatibility
      across kernels/versions.
      bbe10a5c
    • Paolo Bonzini's avatar
      Merge branch 'kvm-coco-pagefault-prep' into HEAD · f3650842
      Paolo Bonzini authored
      A combination of prep work for TDX and SNP, and a clean up of the
      page fault path to (hopefully) make it easier to follow the rules for
      private memory, noslot faults, writes to read-only slots, etc.
      f3650842
    • Paolo Bonzini's avatar
      Merge branch 'kvm-vmx-ve' into HEAD · 1e21b538
      Paolo Bonzini authored
      Allow a non-zero value for non-present SPTE and removed SPTE,
      so that TDX can set the "suppress VE" bit.
      1e21b538
    • Michael Roth's avatar
      KVM: x86: Add hook for determining max NPT mapping level · f32fb328
      Michael Roth authored
      In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
      2MB mapping in the guest's nested page table depends on whether or not
      any subpages within the range have already been initialized as private
      in the RMP table. The existing mixed-attribute tracking in KVM is
      insufficient here, for instance:
      
      - gmem allocates 2MB page
      - guest issues PVALIDATE on 2MB page
      - guest later converts a subpage to shared
      - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
      - KVM MMU splits NPT mapping to 4K
      - guest later converts that shared page back to private
      
      At this point there are no mixed attributes, and KVM would normally
      allow for 2MB NPT mappings again, but this is actually not allowed
      because the RMP table mappings are 4K and cannot be promoted on the
      hypervisor side, so the NPT mappings must still be limited to 4K to
      match this.
      
      Add a hook to determine the max NPT mapping size in situations like
      this.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Reviewed-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      Message-ID: <20240501085210.2213060-3-michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f32fb328
    • Michael Roth's avatar
      KVM: guest_memfd: Add hook for invalidating memory · a90764f0
      Michael Roth authored
      In some cases, like with SEV-SNP, guest memory needs to be updated in a
      platform-specific manner before it can be safely freed back to the host.
      Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
      allow for special handling of this sort when freeing memory in response
      to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
      ahead and define an arch-specific hook for x86 since it will be needed
      for handling memory used for SEV-SNP guests.
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-Id: <20231230172351.574091-6-michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a90764f0
    • Paolo Bonzini's avatar
      KVM: guest_memfd: Add interface for populating gmem pages with user data · 1f6c06b1
      Paolo Bonzini authored
      During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
      prepare newly-allocated gmem pages prior to mapping them into the guest.
      In the case of SEV-SNP, this mainly involves setting the pages to
      private in the RMP table.
      
      However, for the GPA ranges comprising the initial guest payload, which
      are encrypted/measured prior to starting the guest, the gmem pages need
      to be accessed prior to setting them to private in the RMP table so they
      can be initialized with the userspace-provided data. Additionally, an
      SNP firmware call is needed afterward to encrypt them in-place and
      measure the contents into the guest's launch digest.
      
      While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
      this handling can be done in an open-coded/vendor-specific manner, this
      may expose more gmem-internal state/dependencies to external callers
      than necessary. Try to avoid this by implementing an interface that
      tries to handle as much of the common functionality inside gmem as
      possible, while also making it generic enough to potentially be
      usable/extensible for TDX as well.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Co-developed-by: default avatarMichael Roth <michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1f6c06b1
    • Paolo Bonzini's avatar
      KVM: guest_memfd: extract __kvm_gmem_get_pfn() · 17573fd9
      Paolo Bonzini authored
      In preparation for adding a function that walks a set of pages
      provided by userspace and populates them in a guest_memfd,
      add a version of kvm_gmem_get_pfn() that has a "bool prepare"
      argument and passes it down to kvm_gmem_get_folio().
      
      Populating guest memory has to call repeatedly __kvm_gmem_get_pfn()
      on the same file, so make the new function take struct file*.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      17573fd9
    • Paolo Bonzini's avatar
      KVM: guest_memfd: Add hook for initializing memory · 3bb2531e
      Paolo Bonzini authored
      guest_memfd pages are generally expected to be in some arch-defined
      initial state prior to using them for guest memory. For SEV-SNP this
      initial state is 'private', or 'guest-owned', and requires additional
      operations to move these pages into a 'private' state by updating the
      corresponding entries the RMP table.
      
      Allow for an arch-defined hook to handle updates of this sort, and go
      ahead and implement one for x86 so KVM implementations like AMD SVM can
      register a kvm_x86_ops callback to handle these updates for SEV-SNP
      guests.
      
      The preparation callback is always called when allocating/grabbing
      folios via gmem, and it is up to the architecture to keep track of
      whether or not the pages are already in the expected state (e.g. the RMP
      table in the case of SEV-SNP).
      
      In some cases, it is necessary to defer the preparation of the pages to
      handle things like in-place encryption of initial guest memory payloads
      before marking these pages as 'private'/'guest-owned'.  Add an argument
      (always true for now) to kvm_gmem_get_folio() that allows for the
      preparation callback to be bypassed.  To detect possible issues in
      the way userspace initializes memory, it is only possible to add an
      unprepared page if it is not already included in the filemap.
      
      Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/Co-developed-by: default avatarMichael Roth <michael.roth@amd.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3bb2531e
    • Paolo Bonzini's avatar
      KVM: guest_memfd: limit overzealous WARN · fa30b0dc
      Paolo Bonzini authored
      Because kvm_gmem_get_pfn() is called from the page fault path without
      any of the slots_lock, filemap lock or mmu_lock taken, it is
      possible for it to race with kvm_gmem_unbind().  This is not a
      problem, as any PTE that is installed temporarily will be zapped
      before the guest has the occasion to run.
      
      However, it is not possible to have a complete unbind+bind
      racing with the page fault, because deleting the memslot
      will call synchronize_srcu_expedited() and wait for the
      page fault to be resolved.  Thus, we can still warn if
      the file is there and is not the one we expect.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fa30b0dc
    • Paolo Bonzini's avatar
      KVM: guest_memfd: pass error up from filemap_grab_folio · 70623723
      Paolo Bonzini authored
      Some SNP ioctls will require the page not to be in the pagecache, and as such they
      will want to return EEXIST to userspace.  Start by passing the error up from
      filemap_grab_folio.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      70623723
    • Michael Roth's avatar
      KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode · 1d23040c
      Michael Roth authored
      truncate_inode_pages_range() may attempt to zero pages before truncating
      them, and this will occur before arch-specific invalidations can be
      triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For
      AMD SEV-SNP this would result in an RMP #PF being generated by the
      hardware, which is currently treated as fatal (and even if specifically
      allowed for, would not result in anything other than garbage being
      written to guest pages due to encryption). On Intel TDX this would also
      result in undesirable behavior.
      
      Set the AS_INACCESSIBLE flag to prevent the MM from attempting
      unexpected accesses of this sort during operations like truncation.
      
      This may also in some cases yield a decent performance improvement for
      guest_memfd userspace implementations that hole-punch ranges immediately
      after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
      the current implementation of truncate_inode_pages_range() always ends
      up zero'ing an entire 4K range if it is backing by a 2M folio.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-6-michael.roth@amd.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1d23040c
    • Michael Roth's avatar
      mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory · c72ceafb
      Michael Roth authored
      filemap users like guest_memfd may use page cache pages to
      allocate/manage memory that is only intended to be accessed by guests
      via hardware protections like encryption. Writes to memory of this sort
      in common paths like truncation may cause unexpected behavior such as
      writing garbage instead of zeros when attempting to zero pages, or
      worse, triggering hardware protections that are considered fatal as far
      as the kernel is concerned.
      
      Introduce a new address_space flag, AS_INACCESSIBLE, and use this
      initially to prevent zero'ing of pages during truncation, with the
      understanding that it is up to the owner of the mapping to handle this
      specially if needed.
      
      This is admittedly a rather blunt solution, but it seems like
      there are no other places that should take into account the
      flag to keep its promise.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
      Cc: Matthew Wilcox <willy@infradead.org>
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-5-michael.roth@amd.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c72ceafb
  3. 09 May, 2024 8 commits
  4. 08 May, 2024 4 commits
    • Marc Zyngier's avatar
      Merge branch kvm-arm64/misc-6.10 into kvmarm-master/next · e2815706
      Marc Zyngier authored
      * kvm-arm64/misc-6.10:
        : .
        : Misc fixes and updates targeting 6.10
        :
        : - Improve boot-time diagnostics when the sysreg tables
        :   are not correctly sorted
        :
        : - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy
        :
        : - Fix duplicate XNX field in the ID_AA64MMFR1_EL1
        :   writeable mask
        :
        : - Allocate PPIs and SGIs outside of the vcpu structure, allowing
        :   for smaller EL2 mapping and some flexibility in implementing
        :   more or less than 32 private IRQs.
        :
        : - Use bitmap_gather() instead of its open-coded equivalent
        :
        : - Make protected mode use hVHE if available
        :
        : - Purge stale mpidr_data if a vcpu is created after the MPIDR
        :   map has been created
        : .
        KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
        KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
        KVM: arm64: Fix hvhe/nvhe early alias parsing
        KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather()
        KVM: arm64: vgic: Allocate private interrupts on demand
        KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX
        KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist
        KVM: arm64: Improve out-of-order sysreg table diagnostics
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      e2815706
    • Oliver Upton's avatar
      KVM: arm64: Destroy mpidr_data for 'late' vCPU creation · ce5d2448
      Oliver Upton authored
      A particularly annoying userspace could create a vCPU after KVM has
      computed mpidr_data for the VM, either by racing against VGIC
      initialization or having a userspace irqchip.
      
      In any case, this means mpidr_data no longer fully describes the VM, and
      attempts to find the new vCPU with kvm_mpidr_to_vcpu() will fail. The
      fix is to discard mpidr_data altogether, as it is only a performance
      optimization and not required for correctness. In all likelihood KVM
      will recompute the mappings when KVM_RUN is called on the new vCPU.
      
      Note that reads of mpidr_data are not guarded by a lock; promote to RCU
      to cope with the possibility of mpidr_data being invalidated at runtime.
      
      Fixes: 54a8006d ("KVM: arm64: Fast-track kvm_mpidr_to_vcpu() when mpidr_data is available")
      Signed-off-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Link: https://lore.kernel.org/r/20240508071952.2035422-1-oliver.upton@linux.devSigned-off-by: default avatarMarc Zyngier <maz@kernel.org>
      ce5d2448
    • Will Deacon's avatar
      KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support · 5053c3f0
      Will Deacon authored
      The early command line parsing treats "kvm-arm.mode=protected" as an
      alias for "id_aa64mmfr1.vh=0", forcing the use of nVHE so that the host
      kernel runs at EL1 with the pKVM hypervisor at EL2.
      
      With the introduction of hVHE support in ad744e8c ("arm64: Allow
      arm64_sw.hvhe on command line"), the hypervisor can run using the EL2+0
      translation regime. This is interesting for unusual CPUs that have VH
      stuck to 1, but also because it opens the possibility of a hypervisor
      "userspace" in the distant future which could be used to isolate vCPU
      contexts in the hypervisor (see Marc's talk from KVM Forum 2022 [1]).
      
      Repaint the "kvm-arm.mode=protected" alias to map to "arm64_sw.hvhe=1",
      which will use hVHE on CPUs that support it and remain with nVHE
      otherwise.
      
      [1] https://www.youtube.com/watch?v=1F_Mf2j9eIoSigned-off-by: default avatarWill Deacon <will@kernel.org>
      Acked-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Link: https://lore.kernel.org/r/20240501163400.15838-3-will@kernel.orgSigned-off-by: default avatarMarc Zyngier <maz@kernel.org>
      5053c3f0
    • Will Deacon's avatar
      KVM: arm64: Fix hvhe/nvhe early alias parsing · 3c142f9d
      Will Deacon authored
      Booting a kernel with "arm64_sw.hvhe=1 kvm-arm.mode=nvhe" on the
      command-line results in KVM initialising using hVHE, whereas one might
      expect the latter option to override the former.
      
      Fix this by adding "arm64_sw.hvhe=0" to the alias expansion for
      "kvm-arm.mode=nvhe".
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Acked-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Link: https://lore.kernel.org/r/20240501163400.15838-2-will@kernel.orgSigned-off-by: default avatarMarc Zyngier <maz@kernel.org>
      3c142f9d
  5. 07 May, 2024 7 commits