1. 07 May, 2024 7 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: Use synthetic page fault error code to indicate private faults · b3d5dc62
      Sean Christopherson authored
      Add and use a synthetic, KVM-defined page fault error code to indicate
      whether a fault is to private vs. shared memory.  TDX and SNP have
      different mechanisms for reporting private vs. shared, and KVM's
      software-protected VMs have no mechanism at all.  Usurp an error code
      flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
      and friends.
      
      Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
      for TDX and software-protected VMs as appropriate, but that would require
      *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
      memory at the hardware layer, but don't utilize private memory at the
      KVM layer.
      
      Opportunistically add a comment to call out that the logic for software-
      protected VMs is (and was before this commit) broken for nested MMUs, i.e.
      for nested TDP, as the GPA is an L2 GPA.  Punt on trying to play nice with
      nested MMUs as there is a _lot_ of functionality that simply doesn't work
      for software-protected VMs, e.g. all of the paths where KVM accesses guest
      memory need to be updated to be aware of private vs. shared memory.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20240228024147.41573-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b3d5dc62
    • Sean Christopherson's avatar
      KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero · 7bdbb820
      Sean Christopherson authored
      WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
      as the error code for #PF is limited to 32 bits (and in practice, 16 bits
      on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
      (see kvm_vcpu_events.error_code), and is explicitly documented as being
      preserved for intecerpted #PF in both the APM:
      
        The error code saved in EXITINFO1 is the same as would be pushed onto
        the stack by a non-intercepted #PF exception in protected mode.
      
      and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
      32-bit field.
      
      Simply drop the upper bits if hardware provides garbage, as spurious
      information should do no harm (though in all likelihood hardware is buggy
      and the kernel is doomed).
      
      Handling all upper 32 bits in the #PF path will allow moving the sanity
      check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
      which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
      PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
      
      Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
      and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
      bit minimizes the probability of a collision with the "other" vendor,
      without needing to plumb more bits through microcode.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarKai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-7-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7bdbb820
    • Isaku Yamahata's avatar
      KVM: x86/mmu: Pass full 64-bit error code when handling page faults · c9710130
      Isaku Yamahata authored
      Plumb the full 64-bit error code throughout the page fault handling code
      so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
      will be used to determine whether or not a fault is private vs. shared.
      
      Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
      the behavior of permission_fault() when invoked in the page fault path, as
      KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
      
      Continue passing '0' from the async #PF worker, as guest_memfd and thus
      private memory doesn't support async page faults.
      Signed-off-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      [mdr: drop references/changes on rebase, update commit message]
      Signed-off-by: default avatarMichael Roth <michael.roth@amd.com>
      [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Message-ID: <20240228024147.41573-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c9710130
    • Sean Christopherson's avatar
      KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler · dee281e4
      Sean Christopherson authored
      Move the sanity check that hardware never sets bits that collide with KVM-
      define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
      i.e. make the sanity check #NPF specific.  The legacy #PF path already
      WARNs if _any_ of bits 63:32 are set, and the error code that comes from
      VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
      EXIT_QUALIFICATION into error code flags).
      
      Add a compile-time assert in the legacy #PF handler to make sure that KVM-
      define flags are covered by its existing sanity check on the upper bits.
      
      Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
      are removing the comment that defined it.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarKai Huang <kai.huang@intel.com>
      Reviewed-by: default avatarBinbin Wu <binbin.wu@linux.intel.com>
      Message-ID: <20240228024147.41573-8-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      dee281e4
    • Sean Christopherson's avatar
      KVM: x86: Define more SEV+ page fault error bits/flags for #NPF · 9b62e03e
      Sean Christopherson authored
      Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
      guests, as specified by the APM:
      
       * Bit 31 (RMP):   Set to 1 if the fault was caused due to an RMP check or a
                         VMPL check failure, 0 otherwise.
       * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
       * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
                         PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
       * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
                         check failure, 0 otherwise.
      
      Note, the APM is *extremely* misleading, and strongly implies that the
      above flags can _only_ be set for #NPF exits from SNP guests.  That is a
      lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
      flavor of SEV guest on SNP capable hardware.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9b62e03e
    • Sean Christopherson's avatar
      KVM: x86: Remove separate "bit" defines for page fault error code masks · 63b6206e
      Sean Christopherson authored
      Open code the bit number directly in the PFERR_* masks and drop the
      intermediate PFERR_*_BIT defines, as having to bounce through two macros
      just to see which flag corresponds to which bit is quite annoying, as is
      having to define two macros just to add recognition of a new flag.
      
      Use ternary operator to derive the bit in permission_fault(), the one
      function that actually needs the bit number as part of clever shifting
      to avoid conditional branches.  Generally the compiler is able to turn
      it into a conditional move, and if not it's not really a big deal.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240228024147.41573-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      63b6206e
    • Sean Christopherson's avatar
      KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation · d0bf8e6e
      Sean Christopherson authored
      Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
      triggers emulation of any kind, as KVM doesn't currently support emulating
      access to guest private memory.  Practically speaking, private faults and
      emulation are already mutually exclusive, but there are many flow that
      can result in KVM returning RET_PF_EMULATE, and adding one last check
      to harden against weird, unexpected combinations and/or KVM bugs is
      inexpensive.
      Suggested-by: default avatarYan Zhao <yan.y.zhao@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d0bf8e6e
  2. 19 Apr, 2024 1 commit
  3. 12 Apr, 2024 7 commits
    • Paolo Bonzini's avatar
      1ab157ce
    • Sean Christopherson's avatar
      KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument · 2325a21a
      Sean Christopherson authored
      TDX uses different ABI to get information about VM exit.  Pass intr_info to
      the NMI and INTR handlers instead of pulling it from vcpu_vmx in
      preparation for sharing the bulk of the handlers with TDX.
      
      When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
      exit qualification etc rather than the VMCS fields because VMM doesn't have
      access to the VMCS.  The eventual code will be
      
      VMX:
        - get exit reason, intr_info, exit_qualification, and etc from VMCS
        - call NMI/INTR handlers (common code)
      
      TDX:
        - get exit reason, intr_info, exit_qualification, and etc from guest
          registers
        - call NMI/INTR handlers (common code)
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2325a21a
    • Paolo Bonzini's avatar
      KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX · 5f18c642
      Paolo Bonzini authored
      KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
      to operate on VM.  TDX doesn't allow VMM to operate VMCS directly.
      Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
      indirectly operate those data structures.  This means we must have a TDX
      version of kvm_x86_ops.
      
      The existing global struct kvm_x86_ops already defines an interface which
      can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
      structure.  To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
      will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
      TDX at run time.
      
      To split the runtime switch, the VMX implementation, and the TDX
      implementation, add main.c, and move out the vmx_x86_ops hooks in
      preparation for adding TDX.  Use 'vt' for the naming scheme as a nod to
      VT-x and as a concatenation of VmxTdx.
      
      The eventually converted code will look like this:
      
      vmx.c:
        vmx_op() { ... }
        VMX initialization
      tdx.c:
        tdx_op() { ... }
        TDX initialization
      x86_ops.h:
        vmx_op();
        tdx_op();
      main.c:
        static vt_op() { if (tdx) tdx_op() else vmx_op() }
        static struct kvm_x86_ops vt_x86_ops = {
              .op = vt_op,
        initialization functions call both VMX and TDX initialization
      
      Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
      vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
      Co-developed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: default avatarBinbin Wu <binbin.wu@linux.intel.com>
      Reviewed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Reviewed-by: default avatarYuan Yao <yuan.yao@intel.com>
      Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5f18c642
    • Sean Christopherson's avatar
      KVM: x86: Split core of hypercall emulation to helper function · e913ef15
      Sean Christopherson authored
      By necessity, TDX will use a different register ABI for hypercalls.
      Break out the core functionality so that it may be reused for TDX.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e913ef15
    • Paolo Bonzini's avatar
      Merge branch 'kvm-sev-init2' into HEAD · f9cecb3c
      Paolo Bonzini authored
      The idea that no parameter would ever be necessary when enabling SEV or
      SEV-ES for a VM was decidedly optimistic.  The first source of variability
      that was encountered is the desired set of VMSA features, as that affects
      the measurement of the VM's initial state and cannot be changed
      arbitrarily by the hypervisor.
      
      This series adds all the APIs that are needed to customize the features,
      with room for future enhancements:
      
      - a new /dev/kvm device attribute to retrieve the set of supported
        features (right now, only debug swap)
      
      - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
        replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
      
      It then puts the new op to work by including the VMSA features as a field
      of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
      supported VMSA features for backwards compatibility; but I am considering
      also making them use zero as the feature mask, and will gladly adjust the
      patches if so requested.
      
      In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
      I could as well make SEV and SEV-ES use VM types.  This allows SEV-SNP
      to reuse the KVM_SEV_INIT2 ioctl.
      
      And while at it, KVM_SEV_INIT2 also includes two bugfixes.  First of all,
      SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
      reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
      once the VMSA has been encrypted...  which is how the API should have
      always behaved.  Second, they also synchronize the FPU and AVX state.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f9cecb3c
    • Paolo Bonzini's avatar
      Merge branch 'mm-delete-change-gpte' into HEAD · 531f5200
      Paolo Bonzini authored
      The .change_pte() MMU notifier callback was intended as an optimization
      and for this reason it was initially called without a surrounding
      mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
      implemented by KVM (which was also the original user of MMU notifiers)
      and the rules on when to call set_pte_at_notify() rather than set_pte_at()
      have always been pretty obscure.
      
      It may seem a miracle that it has never caused any hard to trigger
      bugs, but there's a good reason for that: KVM's implementation has
      been nonfunctional for a good part of its existence.  Already in
      2012, commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with
      invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
      .change_pte() callback to occur within an invalidate_range_start/end()
      pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
      .change_pte() has no hope of finding a sPTE to change.
      
      Therefore, all the code for .change_pte() can be removed from both KVM
      and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      531f5200
    • Paolo Bonzini's avatar
      mm: replace set_pte_at_notify() with just set_pte_at() · f7842747
      Paolo Bonzini authored
      With the demise of the .change_pte() MMU notifier callback, there is no
      notification happening in set_pte_at_notify().  It is a synonym of
      set_pte_at() and can be replaced with it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarPhilippe Mathieu-Daudé <philmd@linaro.org>
      Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f7842747
  4. 11 Apr, 2024 20 commits
  5. 06 Apr, 2024 5 commits
    • Borislav Petkov (AMD)'s avatar
      x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk · b377c66a
      Borislav Petkov (AMD) authored
      srso_alias_untrain_ret() is special code, even if it is a dummy
      which is called in the !SRSO case, so annotate it like its real
      counterpart, to address the following objtool splat:
      
        vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0
      
      Fixes: 4535e1a4 ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
      Signed-off-by: default avatarBorislav Petkov (AMD) <bp@alien8.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
      b377c66a
    • Ingo Molnar's avatar
      Merge branch 'linus' into x86/urgent, to pick up dependent commit · 5f2ca44e
      Ingo Molnar authored
      We want to fix:
      
        0e110732 ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")
      
      So merge in Linus's latest into x86/urgent to have it available.
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5f2ca44e
    • Linus Torvalds's avatar
      Merge tag 'firewire-fixes-6.9-rc2' of... · 6c6e47d6
      Linus Torvalds authored
      Merge tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
      
      Pull firewire fixes from Takashi Sakamoto:
       "The firewire-ohci kernel module has a parameter for verbose kernel
        logging. It is well-known that it logs the spurious IRQ for bus-reset
        event due to the unmasked register for IRQ event. This update fixes
        the issue"
      
      * tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
        firewire: ohci: mask bus reset interrupts between ISR and bottom half
      6c6e47d6
    • Adam Goldman's avatar
      firewire: ohci: mask bus reset interrupts between ISR and bottom half · 752e3c53
      Adam Goldman authored
      In the FireWire OHCI interrupt handler, if a bus reset interrupt has
      occurred, mask bus reset interrupts until bus_reset_work has serviced and
      cleared the interrupt.
      
      Normally, we always leave bus reset interrupts masked. We infer the bus
      reset from the self-ID interrupt that happens shortly thereafter. A
      scenario where we unmask bus reset interrupts was introduced in 2008 in
      a007bb85: If
      OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we
      will unmask bus reset interrupts so we can log them.
      
      irq_handler logs the bus reset interrupt. However, we can't clear the bus
      reset event flag in irq_handler, because we won't service the event until
      later. irq_handler exits with the event flag still set. If the
      corresponding interrupt is still unmasked, the first bus reset will
      usually freeze the system due to irq_handler being called again each
      time it exits. This freeze can be reproduced by loading firewire_ohci
      with "modprobe firewire_ohci debug=-1" (to enable all debugging output).
      Apparently there are also some cases where bus_reset_work will get called
      soon enough to clear the event, and operation will continue normally.
      
      This freeze was first reported a few months after a007bb85 was committed,
      but until now it was never fixed. The debug level could safely be set
      to -1 through sysfs after the module was loaded, but this would be
      ineffectual in logging bus reset interrupts since they were only
      unmasked during initialization.
      
      irq_handler will now leave the event flag set but mask bus reset
      interrupts, so irq_handler won't be called again and there will be no
      freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will
      unmask the interrupt after servicing the event, so future interrupts
      will be caught as desired.
      
      As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be
      enabled through sysfs in addition to during initial module loading.
      However, when enabled through sysfs, logging of bus reset interrupts will
      be effective only starting with the second bus reset, after
      bus_reset_work has executed.
      Signed-off-by: default avatarAdam Goldman <adamg@pobox.com>
      Signed-off-by: default avatarTakashi Sakamoto <o-takashi@sakamocchi.jp>
      752e3c53
    • Linus Torvalds's avatar
      Merge tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 104db052
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A few small driver specific fixes, the most important being the
        s3c64xx change which is likely to be hit during normal operation"
      
      * tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: mchp-pci1xxx: Fix a possible null pointer dereference in pci1xxx_spi_probe
        spi: spi-fsl-lpspi: remove redundant spi_controller_put call
        spi: s3c64xx: Use DMA mode from fifo size
      104db052