1. 23 Dec, 2022 12 commits
      KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed · 21a36ac6
      Sean Christopherson authored
      Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
      when adding a new shadow page in the TDP MMU.  To ensure the NX reclaim
      kthread can't see a not-yet-linked shadow page, the page fault path links
      the new page table prior to adding the page to possible_nx_huge_pages.
      
      If the page is zapped by a different task, e.g. because dirty logging is
      disabled, between linking the page and adding it to the list, KVM can end
      up triggering use-after-free by adding the zapped SP to the aforementioned
      list, as the zapped SP's memory is scheduled for removal via RCU callback.
      The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
      i.e. the below splat is just one possible signature.
      
        ------------[ cut here ]------------
        list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
        WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
        Modules linked in: kvm_intel
        CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #71
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__list_add_valid+0x79/0xa0
        RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
        RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
        R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
        R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
        FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         track_possible_nx_huge_page+0x53/0x80
         kvm_tdp_mmu_map+0x242/0x2c0
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
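
      A minimal userspace sketch of the re-check pattern described above.  The
      toy_sp and toy_mmu types, the toy spinlock, and the assumption that the
      zap path clears the flag while holding the same lock are illustrative
      stand-ins, not KVM's actual definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shadow page. */
struct toy_sp {
    bool nx_huge_page_disallowed;   /* cleared by the zapper, under the lock */
    struct toy_sp *next;            /* link in the tracking list */
};

/* Hypothetical stand-in for the per-VM MMU state. */
struct toy_mmu {
    atomic_flag lock;                        /* models tdp_mmu_pages_lock */
    struct toy_sp *possible_nx_huge_pages;   /* head of the tracking list */
};

static void toy_mmu_init(struct toy_mmu *mmu)
{
    atomic_flag_clear(&mmu->lock);
    mmu->possible_nx_huge_pages = NULL;
}

static void toy_lock(struct toy_mmu *mmu)
{
    while (atomic_flag_test_and_set(&mmu->lock))
        ;   /* spin until the lock is free */
}

static void toy_unlock(struct toy_mmu *mmu)
{
    atomic_flag_clear(&mmu->lock);
}

/* Zap path (assumed): clear the flag under the lock. */
static void toy_zap(struct toy_mmu *mmu, struct toy_sp *sp)
{
    toy_lock(mmu);
    sp->nx_huge_page_disallowed = false;
    toy_unlock(mmu);
}

/* Fault path: re-check the flag under the lock before adding to the list. */
static bool toy_track(struct toy_mmu *mmu, struct toy_sp *sp)
{
    bool tracked = false;

    toy_lock(mmu);
    if (sp->nx_huge_page_disallowed) {
        sp->next = mmu->possible_nx_huge_pages;
        mmu->possible_nx_huge_pages = sp;
        tracked = true;
    }
    toy_unlock(mmu);
    return tracked;
}
```

      In this model, a page zapped between linking and tracking is simply
      skipped by toy_track() instead of being added to the list.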
      
      Fixes: 61f94478 ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
      Reported-by: Greg Thelen <gthelen@google.com>
      Analyzed-by: David Matlack <dmatlack@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached · 80a3e4ae
      Sean Christopherson authored
      Map the leaf SPTE when handling a TDP MMU page fault if and only if the
      target level is reached.  A recent commit reworked the retry logic and
      incorrectly assumed that walking SPTEs would never "fail", as the loop
      either bails (retries) or installs parent SPs.  However, the iterator
      itself will bail early if it detects a frozen (REMOVED) SPTE when
      stepping down.  The TDP iterator also rereads the current SPTE before
      stepping down specifically to avoid walking into a part of the tree that
      is being removed, which means it's possible to terminate the loop without
      the guts of the loop observing the frozen SPTE, e.g. if a different task
      zaps a parent SPTE between the initial read and try_step_down()'s refresh.
      
      Mapping a leaf SPTE at the wrong level results in all kinds of badness as
      page table walkers interpret the SPTE as a page table, not a leaf, and
      walk into the weeds.
      
        ------------[ cut here ]------------
        WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
        Modules linked in: kvm_intel
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
        RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
        R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
        Invalid SPTE change: cannot replace a present leaf
        SPTE with another present leaf SPTE mapping a
        different PFN!
        as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
        RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
        RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
        RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
        R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
        R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_mmu_map+0x3b0/0x510
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        Modules linked in: kvm_intel
        ---[ end trace 0000000000000000 ]---
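
      A small sketch of the "map iff the target level is reached" rule
      described above.  toy_map() and its stop_level parameter are hypothetical
      and only model a walk that may end early; they are not the TDP MMU
      iterator.

```c
#include <assert.h>

enum toy_result { TOY_RETRY, TOY_MAPPED };

/*
 * Hypothetical model of the fault-path walk.  'stop_level' is the level at
 * which the iterator refuses to step down any further (e.g. because it saw
 * a frozen SPTE on its refresh); pass 0 if the walk is never interrupted.
 */
static enum toy_result toy_map(int root_level, int goal_level, int stop_level)
{
    int level = root_level;

    while (level > goal_level) {
        if (level == stop_level)
            break;      /* the walk ends without the loop body noticing */
        level--;
    }

    /* Only install the leaf if the walk actually reached the goal level. */
    if (level != goal_level)
        return TOY_RETRY;

    return TOY_MAPPED;
}
```

      Without the final level check, an interrupted walk would fall through
      and install a leaf at whatever level the walk stopped on.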
      
      Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
      Cc: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen · f5d16bb9
      Sean Christopherson authored
      Hoist the is_removed_spte() check above the "level == goal_level" check
      when walking SPTEs during a TDP MMU page fault to avoid attempting to map
      a leaf entry if said entry is frozen by a different task/vCPU.
      
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0
        Modules linked in: kvm_intel
        CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
        RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
        RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
        RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
        RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
        R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
        FS:  00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
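
      The ordering change can be sketched as a per-SPTE decision function.
      toy_visit() and the toy_action values are hypothetical names used only
      to illustrate checking "frozen" before "level == goal_level".

```c
#include <assert.h>
#include <stdbool.h>

enum toy_action { TOY_RETRY_FAULT, TOY_MAP_LEAF, TOY_STEP_DOWN };

/* Decide what to do with the SPTE currently under the (modelled) iterator. */
static enum toy_action toy_visit(bool spte_is_frozen, int level, int goal_level)
{
    /* Checked first: never touch an entry frozen by another task/vCPU. */
    if (spte_is_frozen)
        return TOY_RETRY_FAULT;

    if (level == goal_level)
        return TOY_MAP_LEAF;

    return TOY_STEP_DOWN;
}
```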
      
      Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
      Cc: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Robert Hoo <robert.hu@linux.intel.com>
      Message-Id: <20221213033030.83345-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: nVMX: Don't stuff secondary execution control if it's not supported · a0860d68
      Sean Christopherson authored
      When stuffing the allowed secondary execution controls for nested VMX in
      response to CPUID updates, don't set the allowed-1 bit for a feature that
      isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.
      
      WARN if KVM attempts to manipulate a feature that isn't supported.  All
      features that are currently stuffed are always advertised to L1 for
      nested VMX if they are supported in KVM's base configuration, and no
      additional features should ever be added to the CPUID-induced stuffing
      (updating VMX MSRs in response to CPUID updates is a long-standing KVM
      flaw that is slowly being fixed).
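
      A sketch of the rule above, under the assumption that an unsupported
      control is simply treated as disabled.  toy_stuff_control() and its
      parameters are hypothetical; the return value marks the case where the
      kernel would WARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Update the allowed-1 mask for one control.  'supported' models the
 * controls allowed by the base configuration.  Returns true if the caller
 * tried to manipulate an unsupported control.
 */
static bool toy_stuff_control(uint32_t supported, uint32_t *allowed1,
                              uint32_t control, bool enabled)
{
    bool unsupported = (supported & control) == 0;

    /* Never set the allowed-1 bit for a control that isn't supported. */
    if (unsupported)
        enabled = false;

    if (enabled)
        *allowed1 |= control;
    else
        *allowed1 &= ~control;

    return unsupported;
}
```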
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221213062306.667649-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1 · 31de69f4
      Sean Christopherson authored
      Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
      feature is supported in hardware and enabled in KVM's base, non-nested
      configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
      This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
      if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
      obviously allows L1 to enable the feature for L2.
      
      KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
      the allowed-1 control in a vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
      updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
      depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
      updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
      that the guest value must be a strict subset of the supported host value.
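
      The restriction in (b) can be illustrated with a small bit check.  This
      is a sketch, not vmx_restore_control_msr() itself: the helper name is
      hypothetical, only the allowed-1 half (the high 32 bits of a VMX control
      MSR) is modelled, and equal values are assumed to be acceptable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every allowed-1 bit in the guest value is also set in the host value. */
static bool toy_allowed1_is_subset(uint64_t host_msr, uint64_t guest_msr)
{
    uint32_t host_allowed1 = (uint32_t)(host_msr >> 32);
    uint32_t guest_allowed1 = (uint32_t)(guest_msr >> 32);

    return (guest_allowed1 & ~host_allowed1) == 0;
}
```

      If the supported value lacks a bit that was stuffed into the guest's
      value, restoring the saved MSR fails this kind of check, which is the
      migration failure described above.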
      
      Although no past commit explicitly enabled nested support for WAITPKG,
      doing so is safe and functionally correct from an architectural
      perspective as no additional KVM support is needed to virtualize TPAUSE,
      UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
      VM-Exits to L1 as necessary (commit bf653b78, "KVM: vmx: Introduce
      handle_unexpected_vmexit and handle WAITPKG vmexit").
      
      Note, KVM always keeps the host's MSR_IA32_UMWAIT_CONTROL resident in
      hardware, i.e. always runs both L1 and L2 with the host's power management
      settings for TPAUSE and UMWAIT.  See commit bf09fb6c ("KVM: VMX: Stop
      context switching MSR_IA32_UMWAIT_CONTROL") for more details.
      
      Fixes: e69e72fa ("KVM: x86: Add support for user wait instructions")
      Cc: stable@vger.kernel.org
      Reported-by: Aaron Lewis <aaronlewis@google.com>
      Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20221213062306.667649-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberate · 057b1875
      Sean Christopherson authored
      Explicitly drop the result of kvm_vcpu_write_guest() when writing the
      "launch state" as part of VMCLEAR emulation, and add a comment to call
      out that KVM's behavior is architecturally valid.  Intel's pseudocode
      effectively says that VMCLEAR is a nop if the target VMCS address isn't
      in memory, e.g. if the address points at MMIO.
      
      Add a FIXME to call out that suppressing failures on __copy_to_user() is
      wrong, as memory (a memslot) does exist in that case.  Punt the issue to
      the future as open coding kvm_vcpu_write_guest() just to make sure the
      guest dies with -EFAULT isn't worth the extra complexity.  The flaw will
      need to be addressed if KVM ever does something intelligent on uaccess
      failures, e.g. to support post-copy demand paging, but in that case KVM
      will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open
      code a core KVM helper.
      
      No functional change intended.
      
      Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1527765 ("Error handling issues")
      Fixes: 587d7e72 ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down")
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221220154224.526568-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: selftests: Zero out valid_bank_mask for "all" case in Hyper-V IPI test · 53800f88
      Sean Christopherson authored
      Zero out the valid_bank_mask when using the fast variant of
      HVCALL_SEND_IPI_EX to send IPIs to all vCPUs.  KVM requires the "var_cnt"
      and "valid_bank_mask" inputs to be consistent even when targeting all
      vCPUs.  See commit bd1ba573 ("KVM: x86: Get the number of Hyper-V
      sparse banks from the VARHEAD field").
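
      A sketch of the consistency requirement, under the assumption that
      "consistent" means var_cnt equals the number of set bits in
      valid_bank_mask.  Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count the set bits in a 64-bit mask. */
static unsigned int toy_popcount64(uint64_t v)
{
    unsigned int n = 0;

    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Assumed rule: one variable-sized input entry per valid sparse bank. */
static bool toy_sparse_inputs_consistent(unsigned int var_cnt,
                                         uint64_t valid_bank_mask)
{
    return var_cnt == toy_popcount64(valid_bank_mask);
}
```

      Under that assumption, an "all vCPUs" request that passes no banks
      (var_cnt of zero) needs a zeroed valid_bank_mask to pass the check.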
      
      Fixes: 99848924 ("KVM: selftests: Hyper-V PV IPI selftest")
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221219220416.395329-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Sanity check inputs to kvm_handle_memory_failure() · 77b1908e
      Sean Christopherson authored
      Add a sanity check in kvm_handle_memory_failure() to assert that a valid
      x86_exception structure is provided if the memory "failure" wants to
      propagate a fault into the guest.  If a memory failure happens during a
      direct guest physical memory access, e.g. for nested VMX, KVM hardcodes
      the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer
      (because the exception struct would just be filled with garbage).
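
      A sketch of the sanity check.  The toy_* names, the enum values, and the
      return codes are hypothetical and only model the rule "propagating a
      fault requires a valid exception structure".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the values are not the kernel's. */
enum toy_emul { TOY_PROPAGATE_FAULT, TOY_IO_NEEDED };

struct toy_exception {
    int vector;
};

/*
 * Returns 1 if a fault would be injected into the guest, 0 if the failure
 * is handed to userspace, and -1 if the caller violated the contract by
 * asking to propagate a fault without providing an exception.
 */
static int toy_handle_memory_failure(enum toy_emul r,
                                     const struct toy_exception *e)
{
    if (r == TOY_PROPAGATE_FAULT) {
        if (e == NULL)
            return -1;  /* sanity check: nothing valid to inject */
        return 1;
    }
    return 0;
}
```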
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221220153427.514032-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Simplify kvm_apic_hw_enabled · 3c649918
      Peng Hao authored
      kvm_apic_hw_enabled() only needs to return bool, as no caller uses the
      MSR_IA32_APICBASE_ENABLE value itself.
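
      A sketch of the bool-returning form.  TOY_APICBASE_ENABLE mirrors the
      architectural APIC global enable bit (bit 11 of IA32_APIC_BASE); the
      helper name and the plain integer argument are assumptions, since the
      real function takes KVM's local APIC structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_APICBASE_ENABLE (1ULL << 11)

/* Collapse the masked MSR value into a plain true/false answer. */
static bool toy_apic_hw_enabled(uint64_t apic_base)
{
    return (apic_base & TOY_APICBASE_ENABLE) != 0;
}
```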
      
      Signed-off-by: Peng Hao <flyingpeng@tencent.com>
      Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: hyper-v: Fix 'using uninitialized value' Coverity warning · 8b9e13d2
      Vitaly Kuznetsov authored
      In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables
      are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM
      halves' and increase 'data_offset' otherwise. Coverity discovered that in
      one case both variables are incremented unconditionally. This doesn't seem
      to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves'
      data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also
      takes into account 'hc->fast', but it is still worth fixing.
      
      To make things explicit, put 'data_offset' and 'consumed_xmm_halves' into
      'struct kvm_hv_hcall' as a union and use them at call sites. This allows
      removing the explicit 'data_offset'/'consumed_xmm_halves' parameters from
      kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries()
      helpers.
      
      Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and
      is not zeroed; consumers are supposed to initialize the appropriate field
      if needed.
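
      A sketch of the union layout and of consumers initializing only the
      field that matches 'fast'.  struct toy_hcall, its field types, and both
      helpers are hypothetical; only the two field names come from the text
      above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_hcall {
    bool fast;
    /* Mutually exclusive: which member is live depends on 'fast'. */
    union {
        unsigned int consumed_xmm_halves;   /* used when fast */
        unsigned int data_offset;           /* used when !fast, in bytes */
    };
};

/* The struct is not zeroed by its creator, so set the live member here. */
static void toy_hcall_start(struct toy_hcall *hc, bool fast)
{
    hc->fast = fast;
    if (fast)
        hc->consumed_xmm_halves = 0;
    else
        hc->data_offset = 0;
}

/* Account for 'entries' 64-bit values consumed from the input. */
static void toy_hcall_consume(struct toy_hcall *hc, unsigned int entries)
{
    if (hc->fast)
        hc->consumed_xmm_halves += entries;
    else
        hc->data_offset += entries * (unsigned int)sizeof(uint64_t);
}
```

      Because the two counters share storage, there is no second variable
      left to increment by mistake.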
      
      Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1527764 ("Uninitialized variables")
      Fixes: 26097086 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221208102700.959630-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race · fceb3a36
      Adamos Ttofari authored
      When scanning userspace I/OAPIC entries, intercept EOI for level-triggered
      IRQs if the current vCPU has a pending and/or in-service IRQ for the
      vector in its local APIC, even if the vCPU doesn't match the new entry's
      destination.  This fixes a race between userspace I/OAPIC reconfiguration
      and IRQ delivery that results in the vector's bit being left set in the
      remote IRR due to the eventual EOI not being forwarded to the userspace
      I/OAPIC.
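
      The interception rule above can be sketched as a predicate.
      toy_intercept_eoi() and its boolean inputs are hypothetical; they stand
      in for the per-entry checks made while scanning the routes.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the current vCPU should intercept EOI for a vector.
 * 'pending_or_in_service' models the vCPU's local APIC still having the
 * vector pending and/or in service from before the reconfiguration.
 */
static bool toy_intercept_eoi(bool level_triggered, bool dest_match,
                              bool pending_or_in_service)
{
    if (!level_triggered)
        return false;

    return dest_match || pending_or_in_service;
}
```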
      
      Commit 0fc5a36d ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC
      reconfigure race") fixed the in-kernel IOAPIC, but not the userspace
      IOAPIC configuration, which has a similar race.
      
      Fixes: 0fc5a36d ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race")
      Signed-off-by: Adamos Ttofari <attofari@amazon.de>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221208094415.12723-1-attofari@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/pmu: Prevent zero period event from being repeatedly released · 55c590ad
      Like Xu authored
      The current vPMU can reuse the same pmc->perf_event for the same
      hardware event via pmc_pause/resume_counter(), but this optimization
      does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1,
      in_tx_cp=1"), where event->attr.sample_period is legally zero at creation,
      thus making the perf call to perf_event_period() meaningless (no need to
      adjust sample period in this case), and instead causing such reusable
      perf_events to be repeatedly released and created.
      
      Avoid releasing zero sample_period events by checking is_sampling_event()
      to follow the previous enable/disable optimization.
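
      A sketch of the reuse decision.  The toy_* names are hypothetical, and
      'period_update_fails' models a failing perf_event_period() call; a
      sampling event is taken to be one whose sample_period is nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_event {
    uint64_t sample_period;
};

static bool toy_is_sampling_event(const struct toy_event *event)
{
    return event->sample_period != 0;
}

/* True if the existing event can be resumed instead of released. */
static bool toy_can_reuse(const struct toy_event *event, bool period_update_fails)
{
    /* Only sampling events need their period adjusted before reuse. */
    if (toy_is_sampling_event(event) && period_update_fails)
        return false;

    return true;
}
```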
      
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20221207071506.15733-2-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 14 Dec, 2022 1 commit
  3. 12 Dec, 2022 1 commit
      Merge remote-tracking branch 'kvm/queue' into HEAD · 9352e747
      Paolo Bonzini authored
      x86 Xen-for-KVM:
      
      * Allow the Xen runstate information to cross a page boundary
      
      * Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
      
      * Add support for 32-bit guests in SCHEDOP_poll
      
      x86 fixes:
      
      * One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
      
      * Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
        years back when eliminating unnecessary barriers when switching between
        vmcs01 and vmcs02.
      
      * Clean up the MSR filter docs.
      
      * Clean up vmread_error_trampoline() to make it more obvious that params
        must be passed on the stack, even for x86-64.
      
      * Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
        of the current guest CPUID.
      
      * Fudge around a race with TSC refinement that results in KVM incorrectly
        thinking a guest needs TSC scaling when running on a CPU with a
        constant TSC, but no hardware-enumerated TSC frequency.
      
      * Advertise (on AMD) that the SMM_CTL MSR is not supported
      
      * Remove unnecessary exports
      
      Selftests:
      
      * Fix an inverted check in the access tracking perf test, and restore
        support for asserting that there aren't too many idle pages when
        running on bare metal.
      
      * Fix an ordering issue in the AMX test introduced by recent conversions
        to use kvm_cpu_has(), and harden the code to guard against similar bugs
        in the future.  Anything that triggers caching of KVM's supported CPUID,
        kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
        the caching occurs before the test opts in via prctl().
      
      * Fix build errors that occur in certain setups (unsure exactly what is
        unique about the problematic setup) due to glibc overriding
        static_assert() to a variant that requires a custom message.
      
      * Introduce actual atomics for clear/set_bit() in selftests
      
      Documentation:
      
      * Remove deleted ioctls from documentation
      
      * Various fixes
  4. 09 Dec, 2022 3 commits
      KVM: selftests: Allocate ucall pool from MEM_REGION_DATA · 2afc1fbb
      Oliver Upton authored
      MEM_REGION_TEST_DATA is meant to hold data explicitly used by a
      selftest, not implicit allocations due to the selftests infrastructure.
      Allocate the ucall pool from MEM_REGION_DATA much like the rest of the
      selftests library allocations.
      
      Fixes: 426729b2 ("KVM: selftests: Add ucall pool based implementation")
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      Message-Id: <20221207214809.489070-5-oliver.upton@linux.dev>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: arm64: selftests: Align VA space allocator with TTBR0 · e8b9a055
      Oliver Upton authored
      An interesting feature of the Arm architecture is that the stage-1 MMU
      supports two distinct VA regions, controlled by TTBR{0,1}_EL1. As KVM
      selftests on arm64 only use TTBR0_EL1, the VA space is constrained to
      [0, 2^(va_bits-1)). This is different from other architectures that
      allow for addressing low and high regions of the VA space from a single
      page table.
      
      KVM selftests' VA space allocator presumes the valid address range is
      split between low and high memory based on the MSB, which of course is a
      poor match for arm64's TTBR0 region.
      
      Allow architectures to override the default VA space layout. Make use of
      the override to align vpages_valid with the behavior of TTBR0 on arm64.
      
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      Message-Id: <20221207214809.489070-4-oliver.upton@linux.dev>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · eb561891
      Paolo Bonzini authored
      KVM/arm64 updates for 6.2
      
      - Enable the per-vcpu dirty-ring tracking mechanism, together with an
        option to keep the good old dirty log around for pages that are
        dirtied by something other than a vcpu.
      
      - Switch to the relaxed parallel fault handling, using RCU to delay
        page table reclaim and giving better performance under load.
      
      - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
        option, which multi-process VMMs such as crosvm rely on.
      
      - Merge the pKVM shadow vcpu state tracking that allows the hypervisor
        to have its own view of a vcpu, keeping that state private.
      
      - Add support for the PMUv3p5 architecture revision, bringing support
        for 64bit counters on systems that support it, and fix the
        not-quite-compliant CHAIN-ed counter support for the machines that
        actually exist out there.
      
      - Fix a handful of minor issues around 52bit VA/PA support (64kB pages
        only) as a prefix of the oncoming support for 4kB and 16kB pages.
      
      - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
        stage-2 faults and access tracking. You name it, we got it, we
        probably broke it.
      
      - Pick a small set of documentation and spelling fixes, because no
        good merge window would be complete without those.
      
      As a side effect, this tag also drags:
      
      - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
        series
      
      - A shared branch with the arm64 tree that repaints all the system
        registers to match the ARM ARM's naming, and resulting in
        interesting conflicts
  5. 05 Dec, 2022 14 commits
      Merge remote-tracking branch 'arm64/for-next/sysregs' into kvmarm-master/next · 753d734f
      Marc Zyngier authored
      Merge arm64's sysreg repainting branch to avoid too many
      ugly conflicts...
      
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Merge branch kvm-arm64/misc-6.2 into kvmarm-master/next · 86f27d84
      Marc Zyngier authored
      * kvm-arm64/misc-6.2:
        : .
        : Misc fixes for 6.2:
        :
        : - Fix formatting for the pvtime documentation
        :
        : - Fix a comment in the VHE-specific Makefile
        : .
        KVM: arm64: Fix typo in comment
        KVM: arm64: Fix pvtime documentation
      
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Merge branch kvm-arm64/pmu-unchained into kvmarm-master/next · 118bc846
      Marc Zyngier authored
      * kvm-arm64/pmu-unchained:
        : .
        : PMUv3 fixes and improvements:
        :
        : - Make the CHAIN event handling strictly follow the architecture
        :
        : - Add support for PMUv3p5 (64bit counters all the way)
        :
        : - Various fixes and cleanups
        : .
        KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
        KVM: arm64: PMU: Sanitise PMCR_EL0.LP on first vcpu run
        KVM: arm64: PMU: Simplify PMCR_EL0 reset handling
        KVM: arm64: PMU: Replace version number '0' with ID_AA64DFR0_EL1_PMUVer_NI
        KVM: arm64: PMU: Make kvm_pmc the main data structure
        KVM: arm64: PMU: Simplify vcpu computation on perf overflow notification
        KVM: arm64: PMU: Allow PMUv3p5 to be exposed to the guest
        KVM: arm64: PMU: Implement PMUv3p5 long counter support
        KVM: arm64: PMU: Allow ID_DFR0_EL1.PerfMon to be set from userspace
        KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace
        KVM: arm64: PMU: Move the ID_AA64DFR0_EL1.PMUver limit to VM creation
        KVM: arm64: PMU: Do not let AArch32 change the counters' top 32 bits
        KVM: arm64: PMU: Simplify setting a counter to a specific value
        KVM: arm64: PMU: Add counter_index_to_*reg() helpers
        KVM: arm64: PMU: Only narrow counters that are not 64bit wide
        KVM: arm64: PMU: Narrow the overflow checking when required
        KVM: arm64: PMU: Distinguish between 64bit counter and 64bit overflow
        KVM: arm64: PMU: Always advertise the CHAIN event
        KVM: arm64: PMU: Align chained counter implementation with architecture pseudocode
        arm64: Add ID_DFR0_EL1.PerfMon values for PMUv3p7 and IMP_DEF
      
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Merge branch kvm-arm64/mte-map-shared into kvmarm-master/next · 382b5b87
      Marc Zyngier authored
      * kvm-arm64/mte-map-shared:
        : .
        : Update the MTE support to allow the VMM to use shared mappings
        : to back the memslots exposed to MTE-enabled guests.
        :
        : Patches courtesy of Catalin Marinas and Peter Collingbourne.
        : .
        : Fix a number of issues with MTE, such as races on the tags
        : being initialised vs the PG_mte_tagged flag as well as the
        : lack of support for VM_SHARED when KVM is involved.
        :
        : Patches from Catalin Marinas and Peter Collingbourne.
        : .
        Documentation: document the ABI changes for KVM_CAP_ARM_MTE
        KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled
        KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled
        arm64: mte: Lock a page for MTE tag initialisation
        mm: Add PG_arch_3 page flag
        KVM: arm64: Simplify the sanitise_mte_tags() logic
        arm64: mte: Fix/clarify the PG_mte_tagged semantics
        mm: Do not enable PG_arch_2 for all 64-bit architectures
      
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Merge branch kvm-arm64/pkvm-vcpu-state into kvmarm-master/next · cfa72993
      Marc Zyngier authored
      * kvm-arm64/pkvm-vcpu-state: (25 commits)
        : .
        : Large drop of pKVM patches from Will Deacon and co, adding
        : a private vm/vcpu state at EL2, managed independently from
        : the EL1 state. From the cover letter:
        :
        : "This is version six of the pKVM EL2 state series, extending the pKVM
        : hypervisor code so that it can dynamically instantiate and manage VM
        : data structures without the host being able to access them directly.
        : These structures consist of a hyp VM, a set of hyp vCPUs and the stage-2
        : page-table for the MMU. The pages used to hold the hypervisor structures
        : are returned to the host when the VM is destroyed."
        : .
        KVM: arm64: Use the pKVM hyp vCPU structure in handle___kvm_vcpu_run()
        KVM: arm64: Don't unnecessarily map host kernel sections at EL2
        KVM: arm64: Explicitly map 'kvm_vgic_global_state' at EL2
        KVM: arm64: Maintain a copy of 'kvm_arm_vmid_bits' at EL2
        KVM: arm64: Unmap 'kvm_arm_hyp_percpu_base' from the host
        KVM: arm64: Return guest memory from EL2 via dedicated teardown memcache
        KVM: arm64: Instantiate guest stage-2 page-tables at EL2
        KVM: arm64: Consolidate stage-2 initialisation into a single function
        KVM: arm64: Add generic hyp_memcache helpers
        KVM: arm64: Provide I-cache invalidation by virtual address at EL2
        KVM: arm64: Initialise hypervisor copies of host symbols unconditionally
        KVM: arm64: Add per-cpu fixmap infrastructure at EL2
        KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1
        KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
        KVM: arm64: Rename 'host_kvm' to 'host_mmu'
        KVM: arm64: Add hyp_spinlock_t static initializer
        KVM: arm64: Include asm/kvm_mmu.h in nvhe/mem_protect.h
        KVM: arm64: Add helpers to pin memory shared with the hypervisor at EL2
        KVM: arm64: Prevent the donation of no-map pages
        KVM: arm64: Implement do_donate() helper for donating memory
        ...
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/parallel-faults into kvmarm-master/next · fe8e3f44
      Marc Zyngier authored
      * kvm-arm64/parallel-faults:
        : .
        : Parallel stage-2 fault handling, courtesy of Oliver Upton.
        : From the cover letter:
        :
        : "Presently KVM only takes a read lock for stage 2 faults if it believes
        : the fault can be fixed by relaxing permissions on a PTE (write unprotect
        : for dirty logging). Otherwise, stage 2 faults grab the write lock, which
        : predictably can pile up all the vCPUs in a sufficiently large VM.
        :
        : Like the TDP MMU for x86, this series loosens the locking around
        : manipulations of the stage 2 page tables to allow parallel faults. RCU
        : and atomics are exploited to safely build/destroy the stage 2 page
        : tables in light of multiple software observers."
        : .
        KVM: arm64: Reject shared table walks in the hyp code
        KVM: arm64: Don't acquire RCU read lock for exclusive table walks
        KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref()
        KVM: arm64: Handle stage-2 faults in parallel
        KVM: arm64: Make table->block changes parallel-aware
        KVM: arm64: Make leaf->leaf PTE changes parallel-aware
        KVM: arm64: Make block->table PTE changes parallel-aware
        KVM: arm64: Split init and set for table PTE
        KVM: arm64: Atomically update stage 2 leaf attributes in parallel walks
        KVM: arm64: Protect stage-2 traversal with RCU
        KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
        KVM: arm64: Use an opaque type for pteps
        KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
        KVM: arm64: Don't pass kvm_pgtable through kvm_pgtable_walk_data
        KVM: arm64: Pass mm_ops through the visitor context
        KVM: arm64: Stash observed pte value in visitor context
        KVM: arm64: Combine visitor arguments into a context structure
      Signed-off-by: Marc Zyngier <maz@kernel.org>
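
      The cover letter quoted above mentions using atomics so that several
      fault handlers can update stage 2 page tables at once. A minimal sketch
      of that general idea, using C11 atomics in user space: an entry is only
      replaced if it still holds the value the walker observed earlier. The
      names demo_pte_t and demo_try_set_pte() are hypothetical and are not
      taken from the series; the kernel code uses its own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stage-2 page-table entry; not a KVM type. */
typedef _Atomic uint64_t demo_pte_t;

/*
 * Install new_pte only if the entry still holds the value this walker
 * observed earlier. Returns false when another walker changed the entry
 * first, in which case a caller would typically retry the fault.
 */
bool demo_try_set_pte(demo_pte_t *ptep, uint64_t observed, uint64_t new_pte)
{
	assert(ptep != NULL);
	return atomic_compare_exchange_strong(ptep, &observed, new_pte);
}
```

      With two walkers that both observed an empty entry, only the first
      call succeeds; the second sees a changed entry and leaves it alone.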
    • Merge branch kvm-arm64/dirty-ring into kvmarm-master/next · a937f37d
      Marc Zyngier authored
      * kvm-arm64/dirty-ring:
        : .
        : Add support for the "per-vcpu dirty-ring tracking with a bitmap
        : and sprinkles on top", courtesy of Gavin Shan.
        :
        : This branch drags the kvmarm-fixes-6.1-3 tag which was already
        : merged in 6.1-rc4 so that the branch is in a working state.
        : .
        KVM: Push dirty information unconditionally to backup bitmap
        KVM: selftests: Automate choosing dirty ring size in dirty_log_test
        KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
        KVM: selftests: Use host page size to map ring buffer in dirty_log_test
        KVM: arm64: Enable ring-based dirty memory tracking
        KVM: Support dirty ring in conjunction with bitmap
        KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
        KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/52bit-fixes into kvmarm-master/next · 3bbcc8cc
      Marc Zyngier authored
      * kvm-arm64/52bit-fixes:
        : .
        : 52bit PA fixes, courtesy of Ryan Roberts. From the cover letter:
        :
        : "I've been adding support for FEAT_LPA2 to KVM and as part of that work have been
        : testing various (84) configurations of HW, host and guest kernels on FVP. This
        : has thrown up a couple of pre-existing bugs, for which the fixes are provided."
        : .
        KVM: arm64: Fix benign bug with incorrect use of VA_BITS
        KVM: arm64: Fix PAR_TO_HPFAR() to work independently of PA_BITS.
        KVM: arm64: Fix kvm init failure when mode!=vhe and VA_BITS=52.
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: Fix benign bug with incorrect use of VA_BITS · 219072c0
      Ryan Roberts authored
      get_user_mapping_size() uses kvm's pgtable library to walk a user space
      page table created by the kernel, and in doing so, passes metadata
      that the library needs, including ia_bits, which defines the size of the
      input address.
      
      For the case where the kernel is compiled for 52 VA bits but runs on HW
      that does not support LVA, it will fall back to 48 VA bits at runtime.
      Therefore we must use vabits_actual rather than VA_BITS to get the true
      address size.
      
      This is benign in the current code base because the pgtable library only
      uses it for error checking.
      
      Fixes: 6011cf68 ("KVM: arm64: Walk userspace page tables to compute the THP mapping size")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20221205114031.3972780-1-ryan.roberts@arm.com
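
      To illustrate the difference between the build-time and runtime
      address widths described in this commit, here is a minimal user-space
      sketch. All DEMO_* and demo_* names are hypothetical; only VA_BITS,
      vabits_actual and ia_bits come from the commit message, and the range
      check is an assumed example of the kind of error checking an ia_bits
      value could feed, not the pgtable library's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical build-time setting: kernel compiled for 52 VA bits. */
#define DEMO_VA_BITS 52
/* Width the commit message says is used when the HW lacks LVA. */
#define DEMO_VA_BITS_FALLBACK 48

/* Stand-in for the runtime value that the kernel keeps in vabits_actual. */
unsigned int demo_vabits_actual(bool hw_has_lva)
{
	return hw_has_lva ? DEMO_VA_BITS : DEMO_VA_BITS_FALLBACK;
}

/* Assumed error check: does addr fit in an ia_bits-wide input address? */
bool demo_addr_in_range(uint64_t addr, unsigned int ia_bits)
{
	assert(ia_bits > 0 && ia_bits < 64);
	return addr < (UINT64_C(1) << ia_bits);
}
```

      On hardware without LVA, an address with bit 50 set passes the check
      when DEMO_VA_BITS is used but is rejected when the runtime width from
      demo_vabits_actual(false) is used instead.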
    • Merge branch kvm-arm64/selftest/access-tracking into kvmarm-master/next · b1d10ee1
      Marc Zyngier authored
      * kvm-arm64/selftest/access-tracking:
        : .
        : Small series to add support for arm64 to access_tracking_perf_test and
        : correct a couple bugs along the way.
        :
        : Patches courtesy of Oliver Upton.
        : .
        KVM: selftests: Build access_tracking_perf_test for arm64
        KVM: selftests: Have perf_test_util signal when to stop vCPUs
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/selftest/s2-faults into kvmarm-master/next · adde0476
      Marc Zyngier authored
      * kvm-arm64/selftest/s2-faults:
        : .
        : New KVM/arm64 selftests exercising various sorts of S2 faults, courtesy
        : of Ricardo Koller. From the cover letter:
        :
        : "This series adds a new aarch64 selftest for testing stage 2 fault handling
        : for various combinations of guest accesses (e.g., write, S1PTW), backing
        : sources (e.g., anon), and types of faults (e.g., read on hugetlbfs with a
        : hole, write on a readonly memslot). Each test tries a different combination
        : and then checks that the access results in the right behavior (e.g., uffd
        : faults with the right address and write/read flag). [...]"
        : .
        KVM: selftests: aarch64: Add mix of tests into page_fault_test
        KVM: selftests: aarch64: Add readonly memslot tests into page_fault_test
        KVM: selftests: aarch64: Add dirty logging tests into page_fault_test
        KVM: selftests: aarch64: Add userfaultfd tests into page_fault_test
        KVM: selftests: aarch64: Add aarch64/page_fault_test
        KVM: selftests: Use the right memslot for code, page-tables, and data allocations
        KVM: selftests: Fix alignment in virt_arch_pgd_alloc() and vm_vaddr_alloc()
        KVM: selftests: Add vm->memslots[] and enum kvm_mem_region_type
        KVM: selftests: Stash backing_src_type in struct userspace_mem_region
        tools: Copy bitfield.h from the kernel sources
        KVM: selftests: aarch64: Construct DEFAULT_MAIR_EL1 using sysreg.h macros
        KVM: selftests: Add missing close and munmap in __vm_mem_region_delete()
        KVM: selftests: aarch64: Add virt_get_pte_hva() library function
        KVM: selftests: Add a userfaultfd library
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/selftest/linked-bps into kvmarm-master/next · 02f6fdd4
      Marc Zyngier authored
      * kvm-arm64/selftest/linked-bps:
        : .
        : Additional selftests for the arm64 breakpoints/watchpoints,
        : courtesy of Reiji Watanabe. From the cover letter:
        :
        : "This series adds test cases for linked {break,watch}points to the
        : debug-exceptions test, and expands {break,watch}point tests to
        : use non-zero {break,watch}points (the current test always uses
        : {break,watch}point#0)."
        : .
        KVM: arm64: selftests: Test with every breakpoint/watchpoint
        KVM: arm64: selftests: Add a test case for a linked watchpoint
        KVM: arm64: selftests: Add a test case for a linked breakpoint
        KVM: arm64: selftests: Change debug_version() to take ID_AA64DFR0_EL1
        KVM: arm64: selftests: Stop unnecessary test stage tracking of debug-exceptions
        KVM: arm64: selftests: Add helpers to enable debug exceptions
        KVM: arm64: selftests: Remove the hard-coded {b,w}pn#0 from debug-exceptions
        KVM: arm64: selftests: Add write_dbg{b,w}{c,v}r helpers in debug-exceptions
        KVM: arm64: selftests: Use FIELD_GET() to extract ID register fields
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/selftest/memslot-fixes into kvmarm-master/next · f8faf02f
      Marc Zyngier authored
      * kvm-arm64/selftest/memslot-fixes:
        : .
        : KVM memslot selftest fixes for non-4kB page sizes, courtesy
        : of Gavin Shan. From the cover letter:
        :
        : "kvm/selftests/memslots_perf_test doesn't work with 64KB-page-size-host
        : and 4KB-page-size-guest on aarch64. In the implementation, the host and
        : guest page size have been hardcoded to 4KB. It's obviously not working
        : on aarch64 which supports 4KB, 16KB, 64KB individually on host and guest.
        :
        : This series tries to fix it. After the series is applied, the test runs
        : successfully with 64KB-page-size-host and 4KB-page-size-guest."
        : .
        KVM: selftests: memslot_perf_test: Report optimal memory slots
        KVM: selftests: memslot_perf_test: Consolidate memory
        KVM: selftests: memslot_perf_test: Support variable guest page size
        KVM: selftests: memslot_perf_test: Probe memory slots for once
        KVM: selftests: memslot_perf_test: Consolidate loop conditions in prepare_vm()
        KVM: selftests: memslot_perf_test: Use data->nslots in prepare_vm()
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow · 58ff6569
      Marc Zyngier authored
      Fix the bogus masking when computing the period of a 64bit counter
      with 32bit overflow. It really should be treated like a 32bit counter
      for the purpose of the period.
      Reported-by: Ricardo Koller <ricarkol@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/Y4jbosgHbUDI0WF4@google.com
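
      A minimal sketch of the masking this commit describes, assuming the
      period is the number of events left before the counter overflows. The
      function name and its boolean parameter are hypothetical and this is
      not the kernel's code; it only shows why a 64bit counter with 32bit
      overflow needs the 32bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Events remaining until overflow. When overflow is detected at 32 bits,
 * only the low 32 bits of the counter matter, even if the counter itself
 * is 64 bits wide.
 */
uint64_t demo_compute_period(uint64_t counter, bool overflow_64bit)
{
	uint64_t mask = overflow_64bit ? UINT64_MAX : UINT64_C(0xffffffff);

	return (UINT64_C(0) - counter) & mask;
}
```

      For a counter value of 0x1fffffff0 with 32bit overflow, the sketch
      yields a period of 0x10; applying the 64bit mask to the same value
      would instead give 0xfffffffe00000010.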
  6. 02 Dec, 2022 9 commits