1. 03 May, 2022 3 commits
    • KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Sean Christopherson authored
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
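The hazard described in this commit can be modeled in user-space C11. The following is a minimal sketch, not KVM's code: the bit positions and the helper names `spte_xchg()`, `old_spte_was_dirty()`, and `needs_tlb_flush()` are hypothetical. It shows why an atomic exchange matters: the value returned is exactly what was in the slot at the instant of the write, so a Dirty or Writable bit set concurrently is observed rather than silently overwritten.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only; not KVM's real masks. */
#define SPTE_WRITABLE (1ULL << 1)
#define SPTE_DIRTY    (1ULL << 9)

/*
 * Store new_spte and return the value it replaced. Because the exchange is
 * atomic, bits set by another agent (a CPU page walk, another vCPU) just
 * before the store appear in the return value instead of being lost.
 */
static uint64_t spte_xchg(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	assert(sptep != NULL);
	return atomic_exchange(sptep, new_spte);
}

/* True if the replaced value shows the backing page must be marked dirty. */
static bool old_spte_was_dirty(uint64_t old_spte)
{
	return (old_spte & SPTE_DIRTY) != 0;
}

/* True if write access was removed, so a stale writable TLB entry may exist. */
static bool needs_tlb_flush(uint64_t old_spte, uint64_t new_spte)
{
	return (old_spte & SPTE_WRITABLE) && !(new_spte & SPTE_WRITABLE);
}
```

A caller would pass the returned old value to both helpers after every write to a leaf entry that may have volatile bits. A plain store offers no such old value, which is the gap the commit closes.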
    • KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson authored
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clearing
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_pte() is at best odd, and at worst a violation
      of KVM's loosely defined SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson authored
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal; the previous name,
      spte_can_locklessly_be_made_writable(), is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
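The rule this commit describes (WRITABLE can be set, but never cleared, outside of mmu_lock) can be written as a small predicate. This is a sketch under stated assumptions: the bit positions are hypothetical, A/D bits are assumed to be enabled, and `spte_is_volatile()` is an illustrative name rather than the kernel helper. The caller is assumed to have already checked that the entry is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only; not KVM's real masks. */
#define SPTE_PRESENT      (1ULL << 0)
#define SPTE_WRITABLE     (1ULL << 1)
#define SPTE_ACCESSED     (1ULL << 8)
#define SPTE_DIRTY        (1ULL << 9)
#define SPTE_MMU_WRITABLE (1ULL << 58)

/* Returns true if some bit of spte may change without the lock being held. */
static bool spte_is_volatile(uint64_t spte)
{
	/* The caller is expected to have done the present check. */
	assert(spte & SPTE_PRESENT);

	/* MMU-writable but not yet WRITABLE: WRITABLE may be set locklessly. */
	if (!(spte & SPTE_WRITABLE) && (spte & SPTE_MMU_WRITABLE))
		return true;

	/* Accessed not yet set: the CPU may set it on the next access. */
	if (!(spte & SPTE_ACCESSED))
		return true;

	/* Writable but clean: the CPU may set Dirty on the next write. */
	if ((spte & SPTE_WRITABLE) && !(spte & SPTE_DIRTY))
		return true;

	/* Fully writable, accessed, and dirty: nothing left to change. */
	return false;
}
```

The first branch tests `!WRITABLE && MMU_WRITABLE` rather than `MMU_WRITABLE` alone, which is the narrowing the commit message argues for.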
  2. 29 Apr, 2022 3 commits
    • KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang authored
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereferences of the same
         page table entry. The resulting TOCTOU problem may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information of only 'present'
         entries, otherwise it may regard a non-present PXE entry as a present
         large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
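The first two issues in the list above share one remedy: read each entry exactly once into a local snapshot, then make every decision on that snapshot. The sketch below illustrates the pattern in portable C. The names `read_entry_once()` and `entry_is_present_leaf()` are hypothetical, and the two flag bits are illustrative rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; treat the exact positions as an assumption. */
#define ENTRY_PRESENT (1ULL << 0)
#define ENTRY_LEAF    (1ULL << 7)

/*
 * Force a single load of the entry. The volatile access keeps the compiler
 * from re-reading memory that another CPU may be rewriting concurrently.
 */
static uint64_t read_entry_once(const uint64_t *entry)
{
	assert(entry != NULL);
	return *(const volatile uint64_t *)entry;
}

/*
 * Report a large mapping only when the snapshot is both present and a leaf.
 * A non-present entry is never reported, whatever its other bits say.
 */
static bool entry_is_present_leaf(const uint64_t *entry)
{
	uint64_t snapshot = read_entry_once(entry);

	return (snapshot & ENTRY_PRESENT) && (snapshot & ENTRY_LEAF);
}
```

Both checks use the same `snapshot`, so the entry cannot change between the present test and the leaf test. The third issue (IRQs and pointer lifetime) is not modeled here.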
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini authored
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
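The layout argument can be checked with `offsetof`. The struct below is a sketch modeled on the description in this commit message, with a hypothetical name and standard fixed-width types in place of the kernel's; it is not copied from the uAPI header. Two 32-bit members ahead of the 64-bit array put the array at offset 8 whether the ABI aligns 64-bit integers to 4 or to 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the system-event payload described above. */
struct system_event_sketch {
	uint32_t type;
	uint32_t ndata;           /* number of valid entries in data[] */
	union {
		uint64_t flags;   /* legacy name, overlaps data[0] */
		uint64_t data[16];
	};
};

/* Return the legacy flags value, or 0 when no data entry is valid. */
static uint64_t system_event_flags(const struct system_event_sketch *ev)
{
	assert(ev != NULL);
	return ev->ndata > 0 ? ev->data[0] : 0;
}
```

The anonymous union is what lets existing userspace keep compiling against `flags` while new code reads `data[0]` and checks `ndata` first.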
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
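The inclusive-versus-exclusive distinction in the last paragraph is easy to get wrong, so a small arithmetic sketch may help. It assumes 4 KiB pages; the helper names `max_gfn_inclusive()`, `memslot_fits()`, and `mmio_gfn_ok()` are hypothetical and only model the checks described in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Highest legal gfn (inclusive) for a given physical address width. */
static uint64_t max_gfn_inclusive(unsigned int maxphyaddr_bits)
{
	assert(maxphyaddr_bits > PAGE_SHIFT && maxphyaddr_bits <= 52);
	return (1ULL << (maxphyaddr_bits - PAGE_SHIFT)) - 1;
}

/* Memslot check: wants the exclusive bound, i.e. inclusive max plus one. */
static bool memslot_fits(uint64_t base_gfn, uint64_t npages,
			 unsigned int maxphyaddr_bits)
{
	uint64_t end_exclusive = max_gfn_inclusive(maxphyaddr_bits) + 1;

	/* Written as a subtraction so base_gfn + npages cannot overflow. */
	return npages <= end_exclusive && base_gfn <= end_exclusive - npages;
}

/* MMIO check: a single gfn, so the inclusive bound is used directly. */
static bool mmio_gfn_ok(uint64_t gfn, unsigned int maxphyaddr_bits)
{
	return gfn <= max_gfn_inclusive(maxphyaddr_bits);
}
```

With a 46-bit MAXPHYADDR, for instance, the highest legal gfn is 2^34 - 1; a memslot may end exactly at gfn 2^34 (exclusive), but an MMIO access to gfn 2^34 is rejected.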
  3. 11 Apr, 2022 5 commits
  4. 09 Apr, 2022 4 commits
    • RISC-V: KVM: include missing hwcap.h into vcpu_fp · 4054eee9
      Heiko Stuebner authored
      vcpu_fp uses the riscv_isa_extension mechanism which gets
      defined in hwcap.h but doesn't include that header file.
      
      While it seems to work in most cases, in certain conditions
      this can lead to build failures like
      
      ../arch/riscv/kvm/vcpu_fp.c: In function ‘kvm_riscv_vcpu_fp_reset’:
      ../arch/riscv/kvm/vcpu_fp.c:22:13: error: implicit declaration of function ‘riscv_isa_extension_available’ [-Werror=implicit-function-declaration]
         22 |         if (riscv_isa_extension_available(&isa, f) ||
            |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ../arch/riscv/kvm/vcpu_fp.c:22:49: error: ‘f’ undeclared (first use in this function)
         22 |         if (riscv_isa_extension_available(&isa, f) ||
      
      Fix this by simply including the necessary header.
      
      Fixes: 0a86512d ("RISC-V: KVM: Factor-out FP virtualization into separate sources")
      Signed-off-by: Heiko Stuebner <heiko@sntech.de>
      Signed-off-by: Anup Patel <anup@brainfault.org>
    • KVM: selftests: riscv: Fix alignment of the guest_hang() function · ebdef0de
      Anup Patel authored
      The guest_hang() function is used as the default exception handler
      for various KVM selftests applications by setting its address in
      the vstvec CSR. The vstvec CSR requires the exception handler base address
      to be at least 4-byte aligned, so this patch fixes the alignment of the
      guest_hang() function.
      
      Fixes: 3e06cdf1 ("KVM: selftests: Add initial support for RISC-V 64-bit")
      Signed-off-by: Anup Patel <apatel@ventanamicro.com>
      Tested-by: Mayuresh Chitale <mchitale@ventanamicro.com>
      Signed-off-by: Anup Patel <anup@brainfault.org>
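One way to guarantee the alignment described above is a function-level alignment attribute. The sketch below assumes GCC or Clang (the `aligned` attribute is a compiler extension). The handler name `guest_hang_sketch()` and the helpers are hypothetical, and the alignment value of 16 is an illustrative choice that satisfies the 4-byte minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the selftest handler. The real handler would
 * spin forever; this one returns so it is safe to call in a demo.
 */
static void __attribute__((aligned(16))) guest_hang_sketch(void)
{
}

/* The two low bits of the trap-vector CSR hold the mode, not the address. */
static bool is_valid_trap_base(uintptr_t addr)
{
	return (addr & 0x3) == 0;
}

/* Build a direct-mode (mode 0) trap-vector value from a handler address. */
static uintptr_t make_trap_vector_direct(uintptr_t base)
{
	assert(is_valid_trap_base(base));
	return base;
}
```

Without the attribute, a compiler targeting compressed instructions may place a function on a 2-byte boundary, which is the situation the alignment requirement rules out.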
    • KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table · fac37253
      Anup Patel authored
      Supporting hardware updates of PTE A and D bits is optional for any
      RISC-V implementation, so the current software strategy is to always set
      these bits in both G-stage (hypervisor) and VS-stage (guest kernel).
      
      If PTE A and D bits are not set by software (hypervisor or guest)
      then RISC-V implementations not supporting hardware updates of these
      bits will cause traps even for perfectly valid PTEs.
      
      Based on the above explanation, the VS-stage page table created by various
      KVM selftest applications is not correct because PTE A and D bits are
      not set. This patch fixes VS-stage page table programming of PTE A and
      D bits for KVM selftests.
      
      Fixes: 3e06cdf1 ("KVM: selftests: Add initial support for RISC-V 64-bit")
      Signed-off-by: Anup Patel <apatel@ventanamicro.com>
      Tested-by: Mayuresh Chitale <mchitale@ventanamicro.com>
      Signed-off-by: Anup Patel <anup@brainfault.org>
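The software strategy described here amounts to OR-ing the A and D bits into every leaf PTE at creation time. The sketch below uses the PTE flag positions from the RISC-V privileged specification (V=0, R=1, W=2, X=3, A=6, D=7, PPN starting at bit 10) and assumes 4 KiB pages; the helper name `make_leaf_pte()` is hypothetical and not taken from the selftest code.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_V (1ULL << 0) /* valid */
#define PTE_R (1ULL << 1) /* readable */
#define PTE_W (1ULL << 2) /* writable */
#define PTE_X (1ULL << 3) /* executable */
#define PTE_A (1ULL << 6) /* accessed */
#define PTE_D (1ULL << 7) /* dirty */

#define PAGE_SHIFT    12
#define PTE_PPN_SHIFT 10

/*
 * Build a leaf PTE with A and D pre-set, so an implementation that does not
 * update these bits in hardware never traps on an otherwise valid mapping.
 */
static uint64_t make_leaf_pte(uint64_t paddr, uint64_t perms)
{
	assert((paddr & ((1ULL << PAGE_SHIFT) - 1)) == 0);
	return ((paddr >> PAGE_SHIFT) << PTE_PPN_SHIFT) |
	       perms | PTE_A | PTE_D | PTE_V;
}
```

Pre-setting D on a read-only mapping is harmless, which is why the sketch sets both bits unconditionally.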
    • RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put() · 8c3ce496
      Anup Patel authored
      We might have RISC-V systems (such as QEMU) where VMID is not part
      of the TLB entry tag so these systems will have to flush all TLB
      entries upon any change in hgatp.VMID.
      
      Currently, we zero out the hgatp CSR in kvm_arch_vcpu_put() and
      re-program it in kvm_arch_vcpu_load(). On the systems described
      above, this flushes all TLB entries whenever a VCPU exits to
      user-space, reducing performance.
      
      This patch fixes the performance issue described above by not clearing
      the hgatp CSR in kvm_arch_vcpu_put().
      
      Fixes: 34bde9d8 ("RISC-V: KVM: Implement VCPU world-switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Anup Patel <apatel@ventanamicro.com>
      Signed-off-by: Anup Patel <anup@brainfault.org>
  5. 08 Apr, 2022 1 commit
  6. 07 Apr, 2022 4 commits
  7. 06 Apr, 2022 8 commits
  8. 05 Apr, 2022 4 commits
    • Documentation: kvm: Add missing line break in api.rst · c1be1ef1
      Bagas Sanjaya authored
      Add missing line break separator between literal block and description
      of KVM_EXIT_RISCV_SBI.
      
      This fixes:
      </path/to/linux>/Documentation/virt/kvm/api.rst:6118: WARNING: Literal block ends without a blank line; unexpected unindent.
      
      Fixes: da40d858 (RISC-V: KVM: Document RISC-V specific parts of KVM API, 2021-09-27)
      Cc: Anup Patel <anup.patel@wdc.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-riscv@lists.infradead.org
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Message-Id: <20220403065735.23859-1-bagasdotme@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove unnecessary flush_workqueue() · 3203a56a
      Lv Ruyi authored
      All currently pending work is completed by destroy_workqueue(), so
      it is unnecessary to flush the workqueue explicitly.
      
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220401083530.2407703-1-lv.ruyi@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson authored
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
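The fix amounts to replacing the "auto" sentinel with a concrete 0 or 1 before anything reads the value as a boolean. The sketch below models only that resolution step in plain C. The function name `resolve_auto_param()` and the `mitigation_needed` input are hypothetical and do not reflect how KVM decides whether the mitigation applies.

```c
#include <assert.h>
#include <stdbool.h>

#define PARAM_AUTO (-1) /* sentinel meaning "decide at load time" */

/*
 * Collapse a tri-state parameter (-1 = auto, 0 = off, 1 = on) into a value
 * that is safe to read back as a boolean. After this runs, the stored value
 * is always exactly 0 or 1.
 */
static int resolve_auto_param(int value, bool mitigation_needed)
{
	assert(value == PARAM_AUTO || value == 0 || value == 1);

	if (value == PARAM_AUTO)
		return mitigation_needed ? 1 : 0;
	return value;
}
```

An explicit setting is passed through unchanged; only the sentinel is rewritten, so a reader of the resolved value sees the outcome that will actually apply.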
    • KVM: SEV: Add cond_resched() to loop in sev_clflush_pages() · 00c22013
      Peter Gonda authored
      Add cond_resched() to avoid a warning from sev_clflush_pages() with a
      large number of pages.
      
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20220330164306.2376085-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 03 Apr, 2022 8 commits