1. 27 Mar, 2023 8 commits
    • KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode · 80962ec9
      Sean Christopherson authored
      Don't report an error code to L1 when synthesizing a nested VM-Exit and
      L2 is in Real Mode.  Per Intel's SDM, regarding the error code valid bit:
      
        This bit is always 0 if the VM exit occurred while the logical processor
        was in real-address mode (CR0.PE=0).
      
      The bug was introduced by a recent fix for AMD's Paged Real Mode, which
      moved the error code suppression from the common "queue exception" path
      to the "inject exception" path, but missed VMX's "synthesize VM-Exit"
      path.
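
      The Real Mode rule quoted above can be sketched as a small helper that builds the interruption-information value reported to L1. This is only an illustration: the helper and macro names are hypothetical and are not KVM's, and only the bit positions follow the SDM's layout for that field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of the VM-exit interruption-information field (Intel SDM). */
#define INTR_TYPE_HARD_EXCEPTION   (3u << 8)   /* bits 10:8: type */
#define INTR_INFO_ERROR_CODE_VALID (1u << 11)  /* bit 11: error code valid */
#define INTR_INFO_VALID            (1u << 31)  /* bit 31: field is valid */
#define CR0_PE                     (1ull << 0) /* CR0.PE: protected mode enable */

/* Hypothetical helper, not KVM's actual function. */
static uint32_t synth_exit_intr_info(uint8_t vector, bool has_error_code,
                                     uint64_t guest_cr0)
{
    uint32_t info = vector | INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID;

    /* In Real Mode (CR0.PE=0) the error code valid bit must stay 0. */
    if (has_error_code && (guest_cr0 & CR0_PE))
        info |= INTR_INFO_ERROR_CODE_VALID;

    return info;
}
```

      With these definitions, a #GP (vector 13) that carries an error code reports bit 11 only when CR0.PE is set in the guest's CR0.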
      
      Fixes: b97f0745 ("KVM: x86: determine if an exception has an error code only when injecting it.")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230322143300.2209476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection · 6c41468c
      Sean Christopherson authored
      When injecting an exception into a vCPU in Real Mode, suppress the error
      code by clearing the flag that tracks whether the error code is valid, not
      by clearing the error code itself.  The "typo" was introduced by a recent
      fix for SVM's funky Paged Real Mode.
      
      Opportunistically hoist the logic above the tracepoint so that the trace
      is coherent with respect to what is actually injected (this was also the
      behavior prior to the buggy commit).
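
      The difference between clearing the value and clearing the validity flag can be shown with a toy structure. The struct and function names below are hypothetical stand-ins, not KVM's own definitions; only the "has_error_code" and "error_code" field names come from the commit title.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued exception. */
struct pending_exception {
    uint8_t  vector;
    bool     has_error_code;
    uint32_t error_code;
};

/* The "typo": zeroes the value but still claims an error code is valid. */
static void suppress_error_code_buggy(struct pending_exception *ex,
                                      bool protected_mode)
{
    if (!protected_mode)
        ex->error_code = 0;
}

/* The intended behavior: clear the flag so no error code is delivered. */
static void suppress_error_code_fixed(struct pending_exception *ex,
                                      bool protected_mode)
{
    if (!protected_mode)
        ex->has_error_code = false;
}
```

      In the buggy variant a Real Mode guest would still be told that an error code (of value 0) accompanies the exception.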
      
      Fixes: b97f0745 ("KVM: x86: determine if an exception has an error code only when injecting it.")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230322143300.2209476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Suppress pending MMIO write exits if emulator detects exception · 0dc90226
      Sean Christopherson authored
      Clear vcpu->mmio_needed when injecting an exception from the emulator to
      squash a (legitimate) warning about vcpu->mmio_needed being true at the
      start of KVM_RUN without a callback being registered to complete the
      userspace MMIO exit.  Suppressing the MMIO write exit is inarguably wrong
      from an architectural perspective, but it is the least awful hack-a-fix
      due to shortcomings in KVM's uAPI, not to mention that KVM already
      suppresses MMIO writes in this scenario.
      
      Outside of REP string instructions, KVM doesn't provide a way to resume
      an instruction at the exact point where it was "interrupted" if said
      instruction partially completed before encountering an MMIO access.  For
      MMIO reads, KVM immediately exits to userspace upon detecting MMIO as
      userspace provides the to-be-read value in a buffer, and so KVM can safely
      (more or less) restart the instruction from the beginning.  When the
      emulator re-encounters the MMIO read, KVM will service the MMIO by getting
      the value from the buffer instead of exiting to userspace, i.e. KVM won't
      put the vCPU into an infinite loop.
      
      On an emulated MMIO write, KVM finishes the instruction before exiting to
      userspace, as exiting immediately would ultimately hang the vCPU due to
      the aforementioned shortcoming of KVM not being able to resume emulation
      in the middle of an instruction.
      
      For the vast majority of _emulated_ instructions, deferring the userspace
      exit doesn't cause problems as very few x86 instructions (again ignoring
      string operations) generate multiple writes.  But for instructions that
      generate multiple writes, e.g. PUSHA (multiple pushes onto the stack),
      deferring the exit effectively results in only the final write triggering
      an exit to userspace.  KVM does support multiple MMIO "fragments", but
      only for page splits; if an instruction performs multiple distinct MMIO
      writes, the number of fragments gets reset when the next MMIO write comes
      along and any previous MMIO writes are dropped.
      
      Circling back to the warning, if a deferred MMIO write coincides with an
      exception, e.g. in this case a #SS due to PUSHA underflowing the stack
      after queueing a write to an MMIO page on a previous push, KVM injects
      the exception and leaves the deferred MMIO pending without registering a
      callback, thus triggering the splat.
      
      Sweep the problem under the proverbial rug as dropping MMIO writes is not
      unique to the exception scenario (see above), i.e. instructions like PUSHA
      are fundamentally broken with respect to MMIO, and have been since KVM's
      inception.
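
      The dropped-write behavior and the fix described above can be modeled with a toy vCPU. This is a simplified sketch under stated assumptions: the struct, the fragment array, and both function names are hypothetical, and only the mmio_needed flag mirrors a name used in this message.

```c
#include <stdint.h>

#define TOY_MAX_FRAGMENTS 2

struct toy_fragment {
    uint64_t gpa;
    uint64_t data;
};

/* Hypothetical model of the per-vCPU deferred MMIO state. */
struct toy_vcpu {
    struct toy_fragment frags[TOY_MAX_FRAGMENTS];
    int nr_fragments;
    int mmio_needed;
};

/*
 * Each distinct emulated MMIO write restarts the fragment list, so an
 * earlier write from the same instruction is silently dropped.
 */
static void toy_queue_mmio_write(struct toy_vcpu *v, uint64_t gpa, uint64_t data)
{
    v->nr_fragments = 0;
    v->frags[v->nr_fragments].gpa = gpa;
    v->frags[v->nr_fragments].data = data;
    v->nr_fragments++;
    v->mmio_needed = 1;
}

/* The fix: injecting an exception from the emulator drops the pending exit. */
static void toy_inject_emulated_exception(struct toy_vcpu *v)
{
    v->mmio_needed = 0;
}
```

      Queueing two writes leaves only the second one pending, and injecting an exception afterwards leaves nothing for the next KVM_RUN to complete.
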
      Reported-by: zhangjianguo <zhangjianguo18@huawei.com>
      Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com
      Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230322141220.2206241-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking · fef8f2b9
      Dmytro Maluka authored
      KVM irqfd based emulation of level-triggered interrupts doesn't work
      quite correctly in some cases, particularly in the case of interrupts
      that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
      Such an interrupt is acked to the device in its threaded irq handler,
      i.e. later than it is acked to the interrupt controller (EOI at the end
      of hardirq), not earlier.
      
      Linux keeps such an interrupt masked until its threaded handler finishes,
      to prevent the EOI from re-asserting an unacknowledged interrupt.
      However, with KVM + vfio (or whatever is listening on the resamplefd)
      we always notify resamplefd at the EOI, so vfio prematurely unmasks the
      host physical IRQ, thus a new physical interrupt is fired in the host.
      This extra interrupt in the host is not a problem per se. The problem is
      that it is unconditionally queued for injection into the guest, so the
      guest sees an extra bogus interrupt. [*]
      
      At least two user-visible issues caused by those extra erroneous
      interrupts for a oneshot irq in the guest have been observed:
      
      1. System suspend aborted due to a pending wakeup interrupt from
         ChromeOS EC (drivers/platform/chrome/cros_ec.c).
      2. Annoying "invalid report id data" errors from ELAN0000 touchpad
         (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
         every time the touchpad is touched.
      
      The core issue here is that by the time the guest unmasks the IRQ,
      the physical IRQ line is no longer asserted (since the guest has
      acked the interrupt to the device in the meantime), yet we
      unconditionally inject the interrupt queued into the guest by the
      previous resampling. So to fix the issue, we need a way to detect that
      the IRQ is no longer pending, and cancel the queued interrupt in this
      case.
      
      With IOAPIC we are not able to probe the physical IRQ line state
      directly (at least not if the underlying physical interrupt controller
      is an IOAPIC too), so in this patch we use irqfd resampler for that.
      Namely, instead of injecting the queued interrupt, we just notify the
      resampler that this interrupt is done. If the IRQ line is actually
      already deasserted, we are done. If it is still asserted, a new
      interrupt will be shortly triggered through irqfd and injected into the
      guest.
      
      If there is no irqfd resampler registered for this IRQ, we cannot fix
      the issue, so we keep the existing behavior: immediately and
      unconditionally inject the queued interrupt.
      
      This patch fixes the issue for x86 IOAPIC only. In the long run, we can
      fix it for other irqchips and other architectures too, possibly taking
      advantage of reading the physical state of the IRQ line, which is
      possible with some other irqchips (e.g. with arm64 GIC, maybe even with
      the legacy x86 PIC).
      
      [*] In this description we assume that the interrupt is a physical host
          interrupt forwarded to the guest e.g. by vfio. Potentially the same
          issue may occur also with a purely virtual interrupt from an
          emulated device, e.g. if the guest handles this interrupt, again, as
          a oneshot interrupt.
      Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
      Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
      Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
      Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: irqfd: Make resampler_list an RCU list · d583fbd7
      Dmytro Maluka authored
      It is useful to be able to do read-only traversal of the list of all the
      registered irqfd resamplers without locking the resampler_lock mutex.
      In particular, we are going to traverse it to search for a resampler
      registered for the given irq of an irqchip, and that will be done with
      an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a
      mutex in this context. So turn this list into an RCU list.
      
      For protecting the read side, reuse kvm->irq_srcu which is already used
      for protecting a number of irq related things (kvm->irq_routing,
      irqfd->resampler->list, kvm->irq_ack_notifier_list,
      kvm->arch.mask_notifier_list).
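
      The idea of traversing the list without taking the mutex can be illustrated in userspace with C11 atomics. This is not RCU and not the kernel's list API: it is a minimal sketch with hypothetical names that shows only the publish/traverse ordering and deliberately omits removal and the grace period that kvm->irq_srcu provides.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified resampler entry. */
struct toy_resampler {
    int irq;
    struct toy_resampler *_Atomic next;
};

static struct toy_resampler *_Atomic toy_resampler_head;

/*
 * Writer side. Callers are assumed to be serialized, which is the role
 * the resampler_lock mutex plays in the description above.
 */
static void toy_resampler_add(struct toy_resampler *r)
{
    struct toy_resampler *old =
        atomic_load_explicit(&toy_resampler_head, memory_order_relaxed);

    atomic_store_explicit(&r->next, old, memory_order_relaxed);
    /* Publish only after the entry is fully initialized. */
    atomic_store_explicit(&toy_resampler_head, r, memory_order_release);
}

/* Reader side: no mutex, so it is usable where sleeping is not allowed. */
static struct toy_resampler *toy_resampler_find(int irq)
{
    struct toy_resampler *r =
        atomic_load_explicit(&toy_resampler_head, memory_order_acquire);

    while (r && r->irq != irq)
        r = atomic_load_explicit(&r->next, memory_order_acquire);
    return r;
}
```

      A real RCU list additionally lets writers unlink entries and free them only after all readers that might still see them have finished, which this sketch does not attempt.
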
      Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
      Message-Id: <20230322204344.50138-2-dmy@semihalf.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Flush Hyper-V TLB when required · e5c972c1
      Jeremi Piotrowski authored
      The Hyper-V "EnlightenedNptTlb" enlightenment is always enabled when KVM
      is running on top of Hyper-V and Hyper-V exposes support for it (which
      is always). On AMD CPUs this enlightenment results in ASID invalidations
      not flushing TLB entries derived from the NPT. To force the underlying
      (L0) hypervisor to rebuild its shadow page tables, an explicit hypercall
      is needed.
      
      The original KVM implementation of Hyper-V's "EnlightenedNptTlb" on SVM
      only added remote TLB flush hooks. This worked out fine for a while, as
      sufficient remote TLB flushes were being issued in KVM to mask the
      problem. Since v5.17, changes in the TDP code reduced the number of
      flushes and the out-of-sync TLB prevents guests from booting
      successfully.
      
      Split svm_flush_tlb_current() into separate callbacks for the 3 cases
      (guest/all/current), and issue the required Hyper-V hypercall when a
      Hyper-V TLB flush is needed. The most important case where the TLB flush
      was missing is when loading a new PGD, which is followed by what is now
      svm_flush_tlb_current().
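
      The split into three callbacks can be sketched as a toy model. All names below are hypothetical, and the choice of which cases issue the hypercall is an assumption made for illustration (here: "current" and "all" do, "guest" does not); the actual patch should be consulted for the real conditions.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU state for the model. */
struct toy_svm {
    bool hv_enlightened_npt_tlb;  /* running on Hyper-V with the enlightenment */
    int  asid_flushes;            /* ASID invalidations requested */
    int  hv_hypercalls;           /* explicit Hyper-V flush hypercalls issued */
};

static void toy_flush_asid(struct toy_svm *s)
{
    s->asid_flushes++;
}

/* With the enlightenment, an ASID flush alone leaves NPT-derived entries. */
static void toy_hv_flush_npt(struct toy_svm *s)
{
    if (s->hv_enlightened_npt_tlb)
        s->hv_hypercalls++;
}

/* Assumption: a guest-only flush does not need the hypercall. */
static void toy_flush_tlb_guest(struct toy_svm *s)
{
    toy_flush_asid(s);
}

static void toy_flush_tlb_current(struct toy_svm *s)
{
    toy_hv_flush_npt(s);
    toy_flush_asid(s);
}

static void toy_flush_tlb_all(struct toy_svm *s)
{
    toy_hv_flush_npt(s);
    toy_flush_asid(s);
}
```

      Keeping the three cases separate means the hypercall cost is paid only on the paths that need the NPT-derived entries rebuilt.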
      
      Cc: stable@vger.kernel.org # v5.17+
      Fixes: 1e0c7d40 ("KVM: SVM: hyper-v: Remote TLB flush for SVM")
      Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20230324145233.4585-1-jpiotrowski@linux.microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvm-riscv-fixes-6.3-1' of https://github.com/kvm-riscv/linux into HEAD · 9e347ba0
      Paolo Bonzini authored
      KVM/riscv fixes for 6.3, take #1
      
      - Fix VM hang in case of timer delta being zero
    • Merge tag 'kvmarm-fixes-6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · 8607daa2
      Paolo Bonzini authored
      KVM/arm64 fixes for 6.3, part #2
      
      Fixes for a rather interesting set of bugs relating to the MMU:
      
       - Read the MMU notifier seq before dropping the mmap lock to guard
         against reading a potentially stale VMA
      
       - Disable interrupts when walking user page tables to protect against
         the page table being freed
      
       - Read the MTE permissions for the VMA within the mmap lock critical
         section, avoiding the use of a potentially stale VMA pointer
      
      Additionally, some fixes targeting the vPMU:
      
       - Return the sum of the current perf event value and PMC snapshot for
         reads from userspace
      
       - Don't save the value of guest writes to PMCR_EL0.{C,P}, which could
         otherwise lead to userspace erroneously resetting the vPMU during VM
         save/restore
  2. 17 Mar, 2023 1 commit
    • riscv/kvm: Fix VM hang in case of timer delta being zero. · 6eff3804
      Rajnesh Kanwal authored
      When a VCPU is blocked due to WFI, we schedule the timer from
      `kvm_riscv_vcpu_timer_blocking()` to keep the timer interrupt
      ticking.

      But when delta_ns turns out to be zero, we never schedule the timer
      and the VCPU keeps sleeping indefinitely until there is some
      activity on the VM console.

      This is easily reproducible using kvmtool:
      ./lkvm-static run -c1 --console virtio -p "earlycon root=/dev/vda" \
               -k ./Image -d rootfs.ext4
      
      To observe this, add a print in kvm_riscv_vcpu_vstimer_expired() to
      check the interrupt delivery and run `top` or a similar auto-updating
      command in the guest. After some time, the print from the timer
      expiry routine stops and the `top` output stops updating.
      
      This change fixes this by making sure we schedule the timer even
      with delta_ns being zero to bring the VCPU out of sleep immediately.
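
      The before/after behavior can be sketched with a toy timer. The struct and function names are hypothetical and only model the control flow described above; only delta_ns is a name taken from this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the wake-up timer for a blocked VCPU. */
struct toy_timer {
    bool     armed;
    uint64_t expires_in_ns;
};

/* Before the fix: a zero delta leaves the timer unarmed, so nothing wakes the VCPU. */
static void toy_timer_blocking_buggy(struct toy_timer *t, uint64_t delta_ns)
{
    if (delta_ns) {
        t->armed = true;
        t->expires_in_ns = delta_ns;
    }
}

/* After the fix: always arm; a zero delta simply expires immediately. */
static void toy_timer_blocking_fixed(struct toy_timer *t, uint64_t delta_ns)
{
    t->armed = true;
    t->expires_in_ns = delta_ns;
}
```

      With a zero delta, only the fixed variant leaves a timer pending to bring the VCPU out of sleep.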
      
      Fixes: 8f5cb44b ("RISC-V: KVM: Support sstc extension")
      Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com>
      Reviewed-by: Atish Patra <atishp@rivosinc.com>
      Signed-off-by: Anup Patel <anup@brainfault.org>
  3. 16 Mar, 2023 2 commits
  4. 14 Mar, 2023 18 commits
  5. 13 Mar, 2023 2 commits
  6. 12 Mar, 2023 9 commits