Commits · 0bc273266112937cd6560f7d447b6aa9cfd9134a · Kirill Smelkov / linux

14 Jul, 2022 5 commits

KVM: x86: WARN only once if KVM leaves a dangling userspace I/O request · 0bc27326

Sean Christopherson authored Jul 11, 2022

Change a WARN_ON() to separate WARN_ON_ONCE() if KVM has an outstanding
PIO or MMIO request without an associated callback, i.e. if KVM queued a
userspace I/O exit but didn't actually exit to userspace before moving
on to something else. Warning on every KVM_RUN risks spamming the kernel
if KVM gets into a bad state. Opportunistically split the WARNs so that
it's easier to triage failures when a WARN fires.

Deliberately do not use KVM_BUG_ON(), i.e. don't kill the VM. While the
WARN is all but guaranteed to fire if and only if there's a KVM bug, a
dangling I/O request does not present a danger to KVM (that flag is truly
truly consumed only in a single emulator path), and any such bug is
unlikely to be fatal to the VM (KVM essentially failed to do something it
shouldn't have tried to do in the first place). In other words, note the
bug, but let the VM keep running.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-4-seanjc@google.comSigned-off-by: Sean Christopherson <seanjc@google.com>

0bc27326

KVM: x86: Set error code to segment selector on LLDT/LTR non-canonical #GP · 26262069

Sean Christopherson authored Jul 11, 2022

When injecting a #GP on LLDT/LTR due to a non-canonical LDT/TSS base, set
the error code to the selector. Intel SDM's says nothing about the #GP,
but AMD's APM explicitly states that both LLDT and LTR set the error code
to the selector, not zero.

Note, a non-canonical memory operand on LLDT/LTR does generate a #GP(0),
but the KVM code in question is specific to the base from the descriptor.

Fixes: e37a75a1 ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-3-seanjc@google.comSigned-off-by: Sean Christopherson <seanjc@google.com>

26262069

KVM: x86: Mark TSS busy during LTR emulation _after_ all fault checks · ec6e4d86

Sean Christopherson authored Jul 11, 2022

Wait to mark the TSS as busy during LTR emulation until after all fault
checks for the LTR have passed. Specifically, don't mark the TSS busy if
the new TSS base is non-canonical.

Opportunistically drop the one-off !seg_desc.PRESENT check for TR as the
only reason for the early check was to avoid marking a !PRESENT TSS as
busy, i.e. the common !PRESENT is now done before setting the busy bit.

Fixes: e37a75a1 ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR")
Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-2-seanjc@google.comSigned-off-by: Sean Christopherson <seanjc@google.com>

ec6e4d86

KVM: x86: Tweak name of MONITOR/MWAIT #UD quirk to make it #UD specific · 43bb9e00

Sean Christopherson authored Jul 11, 2022

Add a "UD" clause to KVM_X86_QUIRK_MWAIT_NEVER_FAULTS to make it clear
that the quirk only controls the #UD behavior of MONITOR/MWAIT. KVM
doesn't currently enforce fault checks when MONITOR/MWAIT are supported,
but that could change in the future. SVM also has a virtualization hole
in that it checks all faults before intercepts, and so "never faults" is
already a lie when running on SVM.

Fixes: bfbcc81b ("KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220711225753.1073989-4-seanjc@google.com

43bb9e00

KVM: selftests: Use "a" and "d" to set EAX/EDX for wrmsr_safe() · 14fd95bf

Vitaly Kuznetsov authored Jul 14, 2022

Do not use GCC's "A" constraint to load EAX:EDX in wrmsr_safe().  Per
GCC's documenation on x86-specific constraints, "A" will not actually
load a 64-bit value into EAX:EDX on x86-64.

  The a and d registers. This class is used for instructions that return
  double word results in the ax:dx register pair. Single word values will
  be allocated either in ax or dx. For example on i386 the following
  implements rdtsc:

  unsigned long long rdtsc (void)
  {
    unsigned long long tick;
    __asm__ __volatile__("rdtsc":"=A"(tick));
    return tick;
  }

  This is not correct on x86-64 as it would allocate tick in either ax or
  dx. You have to use the following variant instead:

  unsigned long long rdtsc (void)
  {
    unsigned int tickl, tickh;
    __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
    return ((unsigned long long)tickh << 32)|tickl;
  }

Because a u64 fits in a single 64-bit register, using "A" for selftests,
which are 64-bit only, results in GCC loading the value into either RAX
or RDX instead of splitting it across EAX:EDX.

E.g.:

  kvm_exit:             reason MSR_WRITE rip 0x402919 info 0 0
  kvm_msr:              msr_write 40000118 = 0x60000000001 (#GP)
...

With "A":

  48 8b 43 08          	mov    0x8(%rbx),%rax
  49 b9 ba da ca ba 0a 	movabs $0xabacadaba,%r9
  00 00 00
  4c 8d 15 07 00 00 00 	lea    0x7(%rip),%r10        # 402f44 <guest_msr+0x34>
  4c 8d 1d 06 00 00 00 	lea    0x6(%rip),%r11        # 402f4a <guest_msr+0x3a>
  0f 30                 wrmsr

With "a"/"d":

  48 8b 53 08             mov    0x8(%rbx),%rdx
  89 d0                   mov    %edx,%eax
  48 c1 ea 20             shr    $0x20,%rdx
  49 b9 ba da ca ba 0a    movabs $0xabacadaba,%r9
  00 00 00
  4c 8d 15 07 00 00 00    lea    0x7(%rip),%r10        # 402fc3 <guest_msr+0xb3>
  4c 8d 1d 06 00 00 00    lea    0x6(%rip),%r11        # 402fc9 <guest_msr+0xb9>
  0f 30                   wrmsr

Fixes: 3b23054c ("KVM: selftests: Add x86-64 support for exception fixup")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
[sean: use "& -1u", provide GCC blurb and link to documentation]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220714011115.3135828-1-seanjc@google.com

14fd95bf

12 Jul, 2022 5 commits

KVM: selftests: Provide valid inputs for MONITOR/MWAIT regs · b624ae35

Sean Christopherson authored Jul 11, 2022

Provide valid inputs for RAX, RCX, and RDX when testing whether or not
KVM injects a #UD on MONITOR/MWAIT.  SVM has a virtualization hole and
checks for _all_ faults before checking for intercepts, e.g. MONITOR with
an unsupported RCX will #GP before KVM gets a chance to intercept and
emulate.

Fixes: 2325d4dd ("KVM: selftests: Add MONITOR/MWAIT quirk test")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220711225753.1073989-3-seanjc@google.com

b624ae35

KVM: selftests: Test MONITOR and MWAIT, not just MONITOR for quirk · 874190fd

Sean Christopherson authored Jul 11, 2022

Fix a copy+paste error in monitor_mwait_test by switching one of the two
"monitor" instructions to an "mwait". The intent of the test is very
much to verify the quirk handles both MONITOR and MWAIT.

Fixes: 2325d4dd ("KVM: selftests: Add MONITOR/MWAIT quirk test")
Reported-by: Yuan Yao <yuan.yao@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220711225753.1073989-2-seanjc@google.com

874190fd

KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor, again · 79f772b9

Sean Christopherson authored Jun 14, 2022

Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper.  The wrapper is a
remnant of the original implementation and serves no purpose; remove it
(again) before it gains more users.

kvm_vcpu_get_idx() was removed in the not-too-distant past by commit
4eeef242 ("KVM: x86: Query vcpu->vcpu_idx directly and drop its
accessor"), but was unintentionally re-introduced by commit a54d8066
("KVM: Keep memslots in tree-based structures instead of array-based ones"),
likely due to a rebase goof.  The wrapper then managed to gain users in
KVM's Xen code.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220614225615.3843835-1-seanjc@google.com

79f772b9

KVM: x86/mmu: Replace UNMAPPED_GVA with INVALID_GPA for gva_to_gpa() · 6e1d2a3f

Hou Wenlong authored Jul 01, 2022

The result of gva_to_gpa() is physical address not virtual address,
it is odd that UNMAPPED_GVA macro is used as the result for physical
address. Replace UNMAPPED_GVA with INVALID_GPA and drop UNMAPPED_GVA
macro.

No functional change intended.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6104978956449467d3c68f1ad7f2c2f6d771d0ee.1656667239.git.houwenlong.hwl@antgroup.comSigned-off-by: Sean Christopherson <seanjc@google.com>

6e1d2a3f

KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1 · 156b9d76

Vitaly Kuznetsov authored Jul 12, 2022

Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
scaling for its guests, even when the feature is advertised, and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's VMCS. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d041b5ea ("KVM: nVMX: Enable nested TSC scaling")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220712135009.952805-1-vkuznets@redhat.comSigned-off-by: Sean Christopherson <seanjc@google.com>

156b9d76

08 Jul, 2022 7 commits

KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op() · 159e037d

Vitaly Kuznetsov authored Jul 08, 2022

'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left
uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally
not needed for APIC_DM_REMRD, they're still referenced by
__apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize
the structure to avoid consuming random stack memory.

Fixes: a183b638 ("KVM: x86: make apic_accept_irq tracepoint more generic")
Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220708125147.593975-1-vkuznets@redhat.comSigned-off-by: Sean Christopherson <seanjc@google.com>

159e037d

KVM: x86: Fix handling of APIC LVT updates when userspace changes MCG_CAP · f83894b2

Sean Christopherson authored Jul 08, 2022

Add a helper to update KVM's in-kernel local APIC in response to MCG_CAP
being changed by userspace to fix multiple bugs.  First and foremost,
KVM needs to check that there's an in-kernel APIC prior to dereferencing
vcpu->arch.apic.  Beyond that, any "new" LVT entries need to be masked,
and the APIC version register needs to be updated as it reports out the
number of LVT entries.

Fixes: 4b903561 ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.")
Reported-by: syzbot+8cdad6430c24f396f158@syzkaller.appspotmail.com
Cc: Siddh Raman Pant <code@siddh.me>
Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>

f83894b2

KVM: x86: Initialize number of APIC LVT entries during APIC creation · 03d84f96

Sean Christopherson authored Jul 08, 2022

Initialize the number of LVT entries during APIC creation, else the field
will be incorrectly left '0' if userspace never invokes KVM_X86_SETUP_MCE.

Add and use a helper to calculate the number of entries even though
MCG_CMCI_P is not set by default in vcpu->arch.mcg_cap. Relying on that
to always be true is unnecessarily risky, and subtle/confusing as well.

Fixes: 4b903561 ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>

03d84f96

Merge branch 'kvm-5.20-msr-eperm' · 4a627b0b

Sean Christopherson authored Jul 08, 2022

Merge a bug fix and cleanups for {g,s}et_msr_mce() using a base that
predates commit 281b5278 ("KVM: x86: Add emulation for
MSR_IA32_MCx_CTL2 MSRs."), which was written with the intention that it
be applied _after_ the bug fix and cleanups. The bug fix in particular
needs to be sent to stable trees; give them a stable hash to use.

4a627b0b

KVM: x86: Add helpers to identify CTL and STATUS MCi MSRs · 54ad60ba

Sean Christopherson authored May 12, 2022

Add helpers to identify CTL (control) and STATUS MCi MSR types instead of
open coding the checks using the offset. Using the offset is perfectly
safe, but unintuitive, as understanding what the code does requires
knowing that the offset calcuation will not affect the lower three bits.

Opportunistically comment the STATUS logic to save readers a trip to
Intel's SDM or AMD's APM to understand the "data != 0" check.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-4-seanjc@google.com

54ad60ba

KVM: x86: Use explicit case-statements for MCx banks in {g,s}et_msr_mce() · f5223a33

Sean Christopherson authored May 12, 2022

Use an explicit case statement to grab the full range of MCx bank MSRs
in {g,s}et_msr_mce(), and manually check only the "end" (the number of
banks configured by userspace may be less than the max).  The "default"
trick works, but is a bit odd now, and will be quite odd if/when support
for accessing MCx_CTL2 MSRs is added, which has near identical logic.

Hoist "offset" to function scope so as to avoid curly braces for the case
statement, and because MCx_CTL2 support will need the same variables.

Opportunstically clean up the comment about allowing bit 10 to be cleared
from bank 4.

No functional change intended.

Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-3-seanjc@google.com

f5223a33

KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS) · 2368048b

Sean Christopherson authored May 12, 2022

Return '1', not '-1', when handling an illegal WRMSR to a MCi_CTL or
MCi_STATUS MSR.  The behavior of "all zeros' or "all ones" for CTL MSRs
is architectural, as is the "only zeros" behavior for STATUS MSRs.  I.e.
the intent is to inject a #GP, not exit to userspace due to an unhandled
emulation case.  Returning '-1' gets interpreted as -EPERM up the stack
and effecitvely kills the guest.

Fixes: 890ca9ae ("KVM: Add MCE support")
Fixes: 9ffd986c ("KVM: X86: #GP when guest attempts to write MCi_STATUS register w/o 0")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-2-seanjc@google.com

2368048b

25 Jun, 2022 2 commits

KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity · b9b71f43

Sean Christopherson authored Jun 24, 2022

Buffer split_desc_cache, the cache used to allcoate rmap list entries,
only by the default cache capacity (currently 40), not by doubling the
minimum (513). Aliasing L2 GPAs to L1 GPAs is uncommon, thus eager page
splitting is unlikely to need 500+ entries. And because each object is a
non-trivial 128 bytes (see struct pte_list_desc), those extra ~500
entries means KVM is in all likelihood wasting ~64kb of memory per VM.

Link: https://lore.kernel.org/all/YrTDcrsn0%2F+alpzf@google.com
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220624171808.2845941-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b9b71f43

KVM: x86/mmu: Use "unsigned int", not "u32", for SPTEs' @access info · 72ae5822

Sean Christopherson authored Jun 24, 2022

Use an "unsigned int" for @access parameters instead of a "u32", mostly
to be consistent throughout KVM, but also because "u32" is misleading.
@access can actually squeeze into a u8, i.e. doesn't need 32 bits, but is
as an "unsigned int" because sp->role.access is an unsigned int.

No functional change intended.

Link: https://lore.kernel.org/all/YqyZxEfxXLsHGoZ%2F@google.com
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220624171808.2845941-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

72ae5822

24 Jun, 2022 21 commits

KVM: SEV-ES: reuse advance_sev_es_emulated_ins for OUT too · db209369

Paolo Bonzini authored Oct 22, 2021

complete_emulator_pio_in() only has to be called by
complete_sev_es_emulated_ins() now; therefore, all that the function does
now is adjust sev_pio_count and sev_pio_data.  Which is the same for
both IN and OUT.

No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

db209369

KVM: x86: de-underscorify __emulator_pio_in · f35cee4a

Paolo Bonzini authored Oct 22, 2021

Now all callers except emulator_pio_in_emulated are using
__emulator_pio_in/complete_emulator_pio_in explicitly.
Move the "either copy the result or attempt PIO" logic in
emulator_pio_in_emulated, and rename __emulator_pio_in to
just emulator_pio_in.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f35cee4a

KVM: x86: wean fast IN from emulator_pio_in · dc7a4bfd

Paolo Bonzini authored Oct 22, 2021

Use __emulator_pio_in() directly for fast PIO instead of bouncing through
emulator_pio_in() now that __emulator_pio_in() fills "val" when handling
in-kernel PIO.  vcpu->arch.pio.count is guaranteed to be '0', so this a
pure nop.

emulator_pio_in_emulated is now the last caller of emulator_pio_in.

No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

dc7a4bfd

KVM: x86: wean in-kernel PIO from vcpu->arch.pio* · 0c05e10b

Paolo Bonzini authored Jun 15, 2022

Make emulator_pio_in_out operate directly on the provided buffer
as long as PIO is handled inside KVM.

For input operations, this means that, in the case of in-kernel
PIO, __emulator_pio_in() does not have to be always followed
by complete_emulator_pio_in().  This affects emulator_pio_in() and
kvm_sev_es_ins(); for the latter, that is why the call moves from
advance_sev_es_emulated_ins() to complete_sev_es_emulated_ins().

For output, it means that vcpu->pio.count is never set unnecessarily
and there is no need to clear it; but also vcpu->pio.size must not
be used in kvm_sev_es_outs(), because it will not be updated for
in-kernel OUT.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0c05e10b

KVM: x86: move all vcpu->arch.pio* setup in emulator_pio_in_out() · 30d583fd

Paolo Bonzini authored Oct 22, 2021

For now, this is basically an excuse to add back the void* argument to
the function, while removing some knowledge of vcpu->arch.pio* from
its callers.  The WARN that vcpu->arch.pio.count is zero is also
extended to OUT operations.

The vcpu->arch.pio* fields still need to be filled even when the PIO is
handled in-kernel as __emulator_pio_in() is always followed by
complete_emulator_pio_in().  But after fixing that, it will be possible to
to only populate the vcpu->arch.pio* fields on userspace exits.

No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

30d583fd

KVM: x86: drop PIO from unregistered devices · 35ab3b77

Paolo Bonzini authored Jun 15, 2022

KVM protects the device list with SRCU, and therefore different calls
to kvm_io_bus_read()/kvm_io_bus_write() can very well see different
incarnations of kvm->buses.  If userspace unregisters a device while
vCPUs are running there is no well-defined result.  This patch applies
a safe fallback by returning early from emulator_pio_in_out().  This
corresponds to returning zeroes from IN, and dropping the writes on
the floor for OUT.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

35ab3b77

KVM: x86: inline kernel_pio into its sole caller · 0f87ac23

Paolo Bonzini authored Oct 22, 2021

The caller of kernel_pio already has arguments for most of what kernel_pio
fishes out of vcpu->arch.pio.  This is the first step towards ensuring that
vcpu->arch.pio.* is only used when exiting to userspace.

No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0f87ac23

KVM: x86: complete fast IN directly with complete_emulator_pio_in() · 7a6177d6

Paolo Bonzini authored Jun 15, 2022

Use complete_emulator_pio_in() directly when completing fast PIO, there's
no need to bounce through emulator_pio_in(): the comment about ECX
changing doesn't apply to fast PIO, which isn't used for string I/O.

No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7a6177d6

KVM: x86: nSVM: optimize svm_set_x2apic_msr_interception · 091abbf5

Maxim Levitsky authored May 19, 2022

- Avoid toggling the x2apic msr interception if it is already up to date.

- Avoid touching L0 msr bitmap when AVIC is inhibited on entry to
  the guest mode, because in this case the guest usually uses its
  own msr bitmap.

  Later on VM exit, the 1st optimization will allow KVM to skip
  touching the L0 msr bitmap as well.
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220519102709.24125-18-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

091abbf5

KVM: SVM: Add AVIC doorbell tracepoint · 39b6b8c3

Suravee Suthikulpanit authored May 19, 2022

Add a tracepoint to track number of doorbells being sent
to signal a running vCPU to process IRQ after being injected.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-17-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

39b6b8c3

KVM: SVM: Use target APIC ID to complete x2AVIC IRQs when possible · 8c9e639d

Suravee Suthikulpanit authored May 19, 2022

For x2AVIC, the index from incomplete IPI #vmexit info is invalid
for logical cluster mode. Only ICRH/ICRL values can be used
to determine the IPI destination APIC ID.

Since QEMU defines guest physical APIC ID to be the same as
vCPU ID, it can be used to quickly identify the target vCPU to deliver IPI,
and avoid the overhead from searching through all vCPUs to match the target
vCPU.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-16-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8c9e639d

KVM: x86: Warning APICv inconsistency only when vcpu APIC mode is valid · f8d8ac21

Suravee Suthikulpanit authored May 19, 2022

When launching a VM with x2APIC and specify more than 255 vCPUs,
the guest kernel can disable x2APIC (e.g. specify nox2apic kernel option).
The VM fallbacks to xAPIC mode, and disable the vCPU ID 255 and greater.

In this case, APICV is deactivated for the disabled vCPUs.
However, the current APICv consistency warning does not account for
this case, which results in a warning.

Therefore, modify warning logic to report only when vCPU APIC mode
is valid.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-15-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f8d8ac21

KVM: SVM: Introduce hybrid-AVIC mode · 0e311d33

Suravee Suthikulpanit authored May 19, 2022

Currently, AVIC is inhibited when booting a VM w/ x2APIC support.
because AVIC cannot virtualize x2APIC MSR register accesses.
However, the AVIC doorbell can be used to accelerate interrupt
injection into a running vCPU, while all guest accesses to x2APIC MSRs
will be intercepted and emulated by KVM.

With hybrid-AVIC support, the APICV_INHIBIT_REASON_X2APIC is
no longer enforced.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-14-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0e311d33

KVM: SVM: Do not throw warning when calling avic_vcpu_load on a running vcpu · c0caeee6

Suravee Suthikulpanit authored May 19, 2022

Originalliy, this WARN_ON is designed to detect when calling
avic_vcpu_load() on an already running vcpu in AVIC mode (i.e. the AVIC
is_running bit is set).

However, for x2AVIC, the vCPU can switch from xAPIC to x2APIC mode while in
running state, in which the avic_vcpu_load() will be called from
svm_refresh_apicv_exec_ctrl().

Therefore, remove this warning since it is no longer appropriate.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-13-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c0caeee6

KVM: SVM: Introduce logic to (de)activate x2AVIC mode · 4d1d7942

Suravee Suthikulpanit authored May 19, 2022

Introduce logic to (de)activate AVIC, which also allows
switching between AVIC to x2AVIC mode at runtime.

When an AVIC-enabled guest switches from APIC to x2APIC mode,
the SVM driver needs to perform the following steps:

1. Set the x2APIC mode bit for AVIC in VMCB along with the maximum
APIC ID support for each mode accodingly.

2. Disable x2APIC MSRs interception in order to allow the hardware
to virtualize x2APIC MSRs accesses.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-12-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4d1d7942

KVM: x86: nSVM: always intercept x2apic msrs · 7a8f7c1f

Maxim Levitsky authored May 19, 2022

As a preparation for x2avic, this patch ensures that x2apic msrs
are always intercepted for the nested guest.
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220519102709.24125-11-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7a8f7c1f

KVM: SVM: Refresh AVIC configuration when changing APIC mode · 05c4fe8c

Suravee Suthikulpanit authored May 19, 2022

AMD AVIC can support xAPIC and x2APIC virtualization,
which requires changing x2APIC bit VMCB and MSR intercepton
for x2APIC MSRs. Therefore, call avic_refresh_apicv_exec_ctrl()
to refresh configuration accordingly.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-10-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

05c4fe8c

KVM: x86: Deactivate APICv on vCPU with APIC disabled · 8fc9c7a3

Suravee Suthikulpanit authored May 19, 2022

APICv should be deactivated on vCPU that has APIC disabled.
Therefore, call kvm_vcpu_update_apicv() when changing
APIC mode, and add additional check for APIC disable mode
when determine APICV activation,
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-9-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8fc9c7a3

KVM: SVM: Adding support for configuring x2APIC MSRs interception · 5c127c85

Suravee Suthikulpanit authored May 19, 2022

When enabling x2APIC virtualization (x2AVIC), the interception of
x2APIC MSRs must be disabled to let the hardware virtualize guest
MSR accesses.

Current implementation keeps track of list of MSR interception state
in the svm_direct_access_msrs array. Therefore, extends the array to
include x2APIC MSRs.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-8-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5c127c85

KVM: SVM: Do not support updating APIC ID when in x2APIC mode · ab1b1dc1

Suravee Suthikulpanit authored May 19, 2022

In X2APIC mode, the Logical Destination Register is read-only,
which provides a fixed mapping between the logical and physical
APIC IDs. Therefore, there is no Logical APIC ID table in X2AVIC
and the processor uses the X2APIC ID in the backing page to create
a vCPU’s logical ID.

In addition, KVM does not support updating APIC ID in x2APIC mode,
which means AVIC does not need to handle this case.

Therefore, check x2APIC mode when handling physical and logical
APIC ID update, and when invalidating logical APIC ID table.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-7-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ab1b1dc1

KVM: SVM: Update avic_kick_target_vcpus to support 32-bit APIC ID · c514d3a3

Suravee Suthikulpanit authored May 19, 2022

In x2APIC mode, ICRH contains 32-bit destination APIC ID.
So, update the avic_kick_target_vcpus() accordingly.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-6-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c514d3a3