1. 29 Apr, 2022 4 commits
    • Paolo Bonzini's avatar
      Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD · 71d7c575
      Paolo Bonzini authored
      Fixes for (relatively) old bugs, to be merged in both the -rc and next
      development trees.
      
      The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
      5.18 and commit c24a950e ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
      for SEV-ES", 2022-04-13).
      71d7c575
    • Mingwei Zhang's avatar
      KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang authored
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereference of the same
         page table entry. The TOCTOU problem because of that may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information of only 'present'
         entries, otherwise it may regard a non-present PXE entry as a present
         large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      44187235
    • Paolo Bonzini's avatar
      KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini authored
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d495f942
    • Sean Christopherson's avatar
      KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      86931ff7
  2. 13 Apr, 2022 21 commits
  3. 11 Apr, 2022 5 commits
  4. 09 Apr, 2022 4 commits
    • Heiko Stuebner's avatar
      RISC-V: KVM: include missing hwcap.h into vcpu_fp · 4054eee9
      Heiko Stuebner authored
      vcpu_fp uses the riscv_isa_extension mechanism which gets
      defined in hwcap.h but doesn't include that head file.
      
      While it seems to work in most cases, in certain conditions
      this can lead to build failures like
      
      ../arch/riscv/kvm/vcpu_fp.c: In function ‘kvm_riscv_vcpu_fp_reset’:
      ../arch/riscv/kvm/vcpu_fp.c:22:13: error: implicit declaration of function ‘riscv_isa_extension_available’ [-Werror=implicit-function-declaration]
         22 |         if (riscv_isa_extension_available(&isa, f) ||
            |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ../arch/riscv/kvm/vcpu_fp.c:22:49: error: ‘f’ undeclared (first use in this function)
         22 |         if (riscv_isa_extension_available(&isa, f) ||
      
      Fix this by simply including the necessary header.
      
      Fixes: 0a86512d ("RISC-V: KVM: Factor-out FP virtualization into separate
      sources")
      Signed-off-by: default avatarHeiko Stuebner <heiko@sntech.de>
      Signed-off-by: default avatarAnup Patel <anup@brainfault.org>
      4054eee9
    • Anup Patel's avatar
      KVM: selftests: riscv: Fix alignment of the guest_hang() function · ebdef0de
      Anup Patel authored
      The guest_hang() function is used as the default exception handler
      for various KVM selftests applications by setting it's address in
      the vstvec CSR. The vstvec CSR requires exception handler base address
      to be at least 4-byte aligned so this patch fixes alignment of the
      guest_hang() function.
      
      Fixes: 3e06cdf1 ("KVM: selftests: Add initial support for RISC-V
      64-bit")
      Signed-off-by: default avatarAnup Patel <apatel@ventanamicro.com>
      Tested-by: default avatarMayuresh Chitale <mchitale@ventanamicro.com>
      Signed-off-by: default avatarAnup Patel <anup@brainfault.org>
      ebdef0de
    • Anup Patel's avatar
      KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table · fac37253
      Anup Patel authored
      Supporting hardware updates of PTE A and D bits is optional for any
      RISC-V implementation so current software strategy is to always set
      these bits in both G-stage (hypervisor) and VS-stage (guest kernel).
      
      If PTE A and D bits are not set by software (hypervisor or guest)
      then RISC-V implementations not supporting hardware updates of these
      bits will cause traps even for perfectly valid PTEs.
      
      Based on above explanation, the VS-stage page table created by various
      KVM selftest applications is not correct because PTE A and D bits are
      not set. This patch fixes VS-stage page table programming of PTE A and
      D bits for KVM selftests.
      
      Fixes: 3e06cdf1 ("KVM: selftests: Add initial support for RISC-V
      64-bit")
      Signed-off-by: default avatarAnup Patel <apatel@ventanamicro.com>
      Tested-by: default avatarMayuresh Chitale <mchitale@ventanamicro.com>
      Signed-off-by: default avatarAnup Patel <anup@brainfault.org>
      fac37253
    • Anup Patel's avatar
      RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put() · 8c3ce496
      Anup Patel authored
      We might have RISC-V systems (such as QEMU) where VMID is not part
      of the TLB entry tag so these systems will have to flush all TLB
      entries upon any change in hgatp.VMID.
      
      Currently, we zero-out hgatp CSR in kvm_arch_vcpu_put() and we
      re-program hgatp CSR in kvm_arch_vcpu_load(). For above described
      systems, this will flush all TLB entries whenever VCPU exits to
      user-space hence reducing performance.
      
      This patch fixes above described performance issue by not clearing
      hgatp CSR in kvm_arch_vcpu_put().
      
      Fixes: 34bde9d8 ("RISC-V: KVM: Implement VCPU world-switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAnup Patel <apatel@ventanamicro.com>
      Signed-off-by: default avatarAnup Patel <anup@brainfault.org>
      8c3ce496
  5. 08 Apr, 2022 1 commit
  6. 07 Apr, 2022 4 commits
  7. 06 Apr, 2022 1 commit
    • Paolo Bonzini's avatar
      KVM: avoid NULL pointer dereference in kvm_dirty_ring_push · 5593473a
      Paolo Bonzini authored
      kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing
      ring->dirty_gfns and setting it to NULL.  Afterwards, it calls
      kvm_arch_vcpu_destroy().
      
      However, if closing the file descriptor races with KVM_RUN in such away
      that vcpu->arch.st.preempted == 0, the following call stack leads to a
      NULL pointer dereference in kvm_dirty_run_push():
      
       mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171
       kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline]
       kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618
       vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211
       vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985
       kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219
       kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
      
      The fix is to release the dirty page ring after kvm_arch_vcpu_destroy
      has run.
      Reported-by: default avatarQiuhao Li <qiuhao@sysec.org>
      Reported-by: default avatarGaoning Pan <pgn@zju.edu.cn>
      Reported-by: default avatarYongkang Jia <kangel@zju.edu.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5593473a