Commits · cbd858b17e379f727c6bf1eaedde7b1a44e75b40 · Kirill Smelkov / linux

24 Jun, 2022 23 commits

KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() · cbd858b1

David Matlack authored Jun 22, 2022

Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct
shadow pages from the context of a VM ioctl without access to a vCPU
pointer.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-14-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cbd858b1

KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() · 3cc736b3

David Matlack authored Jun 22, 2022

Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-13-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3cc736b3

KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() · 336081fb

David Matlack authored Jun 22, 2022

The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

336081fb

KVM: x86/mmu: Pass memory caches to allocate SPs separately · 2f8b1b53

David Matlack authored Jun 22, 2022

Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
will allocate the various pieces of memory for shadow pages as a
parameter, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2f8b1b53

KVM: x86/mmu: Move guest PT write-protection to account_shadowed() · be911771

David Matlack authored Jun 22, 2022

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids a extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

be911771

KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages · 87654643

David Matlack authored Jun 22, 2022

Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

87654643

KVM: x86/mmu: Consolidate shadow page allocation and initialization · c306aec8

David Matlack authored Jun 22, 2022

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c306aec8

KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions · 94c81364

David Matlack authored Jun 22, 2022

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page()
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requries a vcpu pointer for allocation and looking up the memslot but
  that will be removed in a future commit.

No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

94c81364

KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes · 7f497775

David Matlack authored Jun 22, 2022

The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7f497775

KVM: x86/mmu: Derive shadow MMU page role from parent · 2e65e842

David Matlack authored Jun 22, 2022

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2e65e842

KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root() · 86938ab6

David Matlack authored Jun 22, 2022

The "direct" argument is vcpu->arch.mmu->root_role.direct,
because unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  So just use that.
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

86938ab6

KVM: x86/mmu: Use a bool for direct · 27a59d57

David Matlack authored Jun 22, 2022

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

27a59d57

KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs · bb924ca6

David Matlack authored Jun 22, 2022

Commit fb58a9c3 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsynced or have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bb924ca6

KVM: selftests: Cache binary stats metadata for duration of test · 83f6e109

Ben Gardon authored Jun 13, 2022

In order to improve performance across multiple reads of VM stats, cache
the stats metadata in the VM struct.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

83f6e109

KVM: selftests: Test disabling NX hugepages on a VM · b774da3f

Ben Gardon authored Jun 13, 2022

Add an argument to the NX huge pages test to test disabling the feature
on a VM using the new capability.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-10-bgardon@google.com>
[Handle failure of sudo or setcap more gracefully. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b774da3f

KVM: selftests: Add NX huge pages test · 8448ec59

Ben Gardon authored Jun 13, 2022

There's currently no test coverage of NX hugepages in KVM selftests, so
add a basic test to ensure that the feature works as intended.

The test creates a VM with a data slot backed with huge pages. The
memory in the data slot is filled with op-codes for the return
instruction. The guest then executes a series of accesses on the memory,
some reads, some instruction fetches. After each operation, the guest
exits and the test performs some checks on the backing page counts to
ensure that NX page splitting an reclaim work as expected.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8448ec59

KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis · 084cc29f

Ben Gardon authored Jun 13, 2022

In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

084cc29f

KVM: x86: Fix errant brace in KVM capability handling · 1c4dc573

Ben Gardon authored Jun 13, 2022

The braces around the KVM_CAP_XSAVE2 block also surround the
KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply
move the curly brace back to where it belongs.

Fixes: ba7bb663 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1c4dc573

KVM: selftests: Read binary stat data in lib · ed6b53ec

Ben Gardon authored Jun 13, 2022

Move the code to read the binary stats data to the KVM selftests
library. It will be re-used by other tests to check KVM behavior.

Also opportunistically remove an unnecessary calculation with
"size_data" in stats_test.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ed6b53ec

KVM: selftests: Clean up coding style in binary stats test · 143e7eea

Sean Christopherson authored Jun 13, 2022

Fix a variety of code style violations and/or inconsistencies in the
binary stats test.  The 80 char limit is a soft limit and can and should
be ignored/violated if doing so improves the overall code readability.

Specifically, provide consistent indentation and don't split expressions
at arbitrary points just to honor the 80 char limit.

Opportunistically expand/add comments to call out the more subtle aspects
of the code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-5-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

143e7eea

KVM: selftests: Read binary stats desc in lib · 4d0a0594

Ben Gardon authored Jun 13, 2022

Move the code to read the binary stats descriptors to the KVM selftests
library. It will be re-used by other tests to check KVM behavior.

No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-4-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4d0a0594

KVM: selftests: Read binary stats header in lib · 32faa064

Ben Gardon authored Jun 13, 2022

Move the code to read the binary stats header to the KVM selftests
library. It will be re-used by other tests to check KVM behavior.

No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

32faa064

KVM: selftests: Remove dynamic memory allocation for stats header · fcd48a21

Ben Gardon authored Jun 13, 2022

There's no need to allocate dynamic memory for the stats header since
its size is known at compile time.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

fcd48a21

22 Jun, 2022 1 commit

KVM: selftests: Add MONITOR/MWAIT quirk test · 2325d4dd

Sean Christopherson authored Jun 08, 2022

Add a test to verify the "MONITOR/MWAIT never fault" quirk, and as a
bonus, also verify the related "MISC_ENABLES ignores ENABLE_MWAIT" quirk.

If the "never fault" quirk is enabled, MONITOR/MWAIT should always be
emulated as NOPs, even if they're reported as disabled in guest CPUID.
Use the MISC_ENABLES quirk to coerce KVM into toggling the MWAIT CPUID
enable, as KVM now disallows manually toggling CPUID bits after running
the vCPU.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2325d4dd

20 Jun, 2022 16 commits

KVM: selftests: Use exception fixup for #UD/#GP Hyper-V MSR/hcall tests · cc5851c6

Sean Christopherson authored Jun 08, 2022

Use exception fixup to verify VMCALL/RDMSR/WRMSR fault as expected in the
Hyper-V Features test.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cc5851c6

KVM: selftests: Mostly fix broken Hyper-V Features test · 9f88d062

Sean Christopherson authored Jun 08, 2022

Explicitly do all setup at every stage of the Hyper-V Features test, e.g.
set the MSR/hypercall, enable capabilities, etc...  Now that the VM is
recreated for every stage, values that are written into the VM's address
space, i.e. shared with the guest, are reset between sub-tests, as are
any capabilities, etc...

Fix the hypercall params as well, which were broken in the same rework.
The "hcall" struct/pointer needs to point at the hcall_params object, not
the set of hypercall pages.

The goofs were hidden by the test's dubious behavior of using '0' to
signal "done", i.e. the MSR test ran exactly one sub-test, and the
hypercall test was a gigantic nop.

Fixes: 6c118643 ("KVM: selftests: Avoid KVM_SET_CPUID2 after KVM_RUN in hyperv_features test")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9f88d062

KVM: selftests: Add x86-64 support for exception fixup · 3b23054c

Sean Christopherson authored Jun 08, 2022

Add x86-64 support for exception fixup on single instructions, without
forcing tests to install their own fault handlers. Use registers r9-r11
to flag the instruction as "safe" and pass fixup/vector information,
i.e. introduce yet another flavor of fixup (versus the kernel's in-memory
tables and KUT's per-CPU area) to take advantage of KVM sefltests being
64-bit only.

Using only registers avoids the need to allocate fixup tables, ensure
FS or GS base is valid for the guest, ensure memory is mapped into the
guest, etc..., and also reduces the potential for recursive faults due to
accessing memory.

Providing exception fixup trivializes tests that just want to verify that
an instruction faults, e.g. no need to track start/end using global
labels, no need to install a dedicated handler, etc...

Deliberately do not support #DE in exception fixup so that the fixup glue
doesn't need to account for a fault with vector == 0, i.e. the vector can
also indicate that a fault occurred. KVM injects #DE only for esoteric
emulation scenarios, i.e. there's very, very little value in testing #DE.
Force any test that wants to generate #DEs to install its own handler(s).

Use kvm_pv_test as a guinea pig for the new fixup, as it has a very
straightforward use case of wanting to verify that RDMSR and WRMSR fault.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3b23054c

KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior · bfbcc81b

Sean Christopherson authored Jun 08, 2022

Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
instructions a NOPs regardless of whether or not they are supported in
guest CPUID.  KVM's current behavior was likely motiviated by a certain
fruity operating system that expects MONITOR/MWAIT to be supported
unconditionally and blindly executes MONITOR/MWAIT without first checking
CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
that's effectively the default setup for any VMM that regurgitates
KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.

Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
behavior is actually desirable, as userspace VMMs that want to
unconditionally hide MONITOR/MWAIT from the guest can leave the
MISC_ENABLE quirk enabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bfbcc81b

KVM: x86: Ignore benign host writes to "unsupported" F15H_PERF_CTL MSRs · ff81a90f

Sean Christopherson authored Jun 11, 2022

Ignore host userspace writes of '0' to F15H_PERF_CTL MSRs KVM reports
in the MSR-to-save list, but the MSRs are ultimately unsupported. All
MSRs in said list must be writable by userspace, e.g. if userspace sends
the list back at KVM without filtering out the MSRs it doesn't need.

Note, reads of said MSRs already have the desired behavior.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ff81a90f

KVM: x86: Ignore benign host accesses to "unsupported" PEBS and BTS MSRs · 157fc497

Sean Christopherson authored Jun 11, 2022

Ignore host userspace reads and writes of '0' to PEBS and BTS MSRs that
KVM reports in the MSR-to-save list, but the MSRs are ultimately
unsupported. All MSRs in said list must be writable by userspace, e.g.
if userspace sends the list back at KVM without filtering out the MSRs it
doesn't need.

Fixes: 8183a538 ("KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS")
Fixes: 902caeb6 ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS")
Fixes: c59a1f10 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

157fc497

KVM: VMX: Use vcpu_get_perf_capabilities() to get guest-visible value · 3f7999b9

Sean Christopherson authored Jun 11, 2022

Use vcpu_get_perf_capabilities() when querying MSR_IA32_PERF_CAPABILITIES
from the guest's perspective, e.g. to update the vPMU and to determine
which MSRs exist. If userspace ignores MSR_IA32_PERF_CAPABILITIES but
clear X86_FEATURE_PDCM, the guest should see '0'.

Fixes: 902caeb6 ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS")
Fixes: c59a1f10 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3f7999b9

Revert "KVM: x86: always allow host-initiated writes to PMU MSRs" · 545feb96

Sean Christopherson authored Jun 11, 2022

Revert the hack to allow host-initiated accesses to all "PMU" MSRs,
as intel_is_valid_msr() returns true for _all_ MSRs, regardless of whether
or not it has a snowball's chance in hell of actually being a PMU MSR.

That mostly gets papered over by the actual get/set helpers only handling
MSRs that they knows about, except there's the minor detail that
kvm_pmu_{g,s}et_msr() eat reads and writes when the PMU is disabled.
I.e. KVM will happy allow reads and writes to _any_ MSR if the PMU is
disabled, either via module param or capability.

This reverts commit d1c88a40.

Fixes: d1c88a40 ("KVM: x86: always allow host-initiated writes to PMU MSRs")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

545feb96

Revert "KVM: x86/pmu: Accept 0 for absent PMU MSRs when host-initiated if !enable_pmu" · 5d4283df

Sean Christopherson authored Jun 11, 2022

Eating reads and writes to all "PMU" MSRs when there is no PMU is wildly
broken as it results in allowing accesses to _any_ MSR on Intel CPUs
as intel_is_valid_msr() returns true for all host_initiated accesses.

A revert of commit d1c88a40 ("KVM: x86: always allow host-initiated
writes to PMU MSRs") will soon follow.

This reverts commit 8e6a58e2.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5d4283df

KVM: VMX: Give host userspace full control of MSR_IA32_PERF_CAPABILITIES · 0f4a7185

Sean Christopherson authored Jun 11, 2022

Do not clear manipulate MSR_IA32_PERF_CAPABILITIES in intel_pmu_refresh(),
i.e. give userspace full control over capability/read-only MSRs.  KVM is
not a babysitter, it is userspace's responsiblity to provide a valid and
coherent vCPU model.

Attempting to "help" the guest by forcing a consistent model creates edge
cases, and ironicially leads to inconsistent behavior.

Example #1:  KVM doesn't do intel_pmu_refresh() when userspace writes
the MSR.

Example #2: KVM doesn't clear the bits when the PMU is disabled, or when
there's no architectural PMU.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0f4a7185

KVM: x86: Give host userspace full control of MSR_IA32_MISC_ENABLES · 9fc22296

Sean Christopherson authored Jun 11, 2022

Give userspace full control of the read-only bits in MISC_ENABLES, i.e.
do not modify bits on PMU refresh and do not preserve existing bits when
userspace writes MISC_ENABLES. With a few exceptions where KVM doesn't
expose the necessary controls to userspace _and_ there is a clear cut
association with CPUID, e.g. reserved CR4 bits, KVM does not own the vCPU
and should not manipulate the vCPU model on behalf of "dummy user space".

The argument that KVM is doing userspace a favor because "the order of
setting vPMU capabilities and MSR_IA32_MISC_ENABLE is not strictly
guaranteed" is specious, as attempting to configure MSRs on behalf of
userspace inevitably leads to edge cases precisely because KVM does not
prescribe a specific order of initialization.

Example #1: intel_pmu_refresh() consumes and modifies the vCPU's
MSR_IA32_PERF_CAPABILITIES, and so assumes userspace initializes config
MSRs before setting the guest CPUID model. If userspace sets CPUID
first, then KVM will mark PEBS as available when arch.perf_capabilities
is initialized with a non-zero PEBS format, thus creating a bad vCPU
model if userspace later disables PEBS by writing PERF_CAPABILITIES.

Example #2: intel_pmu_refresh() does not clear PERF_CAP_PEBS_MASK in
MSR_IA32_PERF_CAPABILITIES if there is no vPMU, making KVM inconsistent
in its desire to be consistent.

Example #3: intel_pmu_refresh() does not clear MSR_IA32_MISC_ENABLE_EMON
if KVM_SET_CPUID2 is called multiple times, first with a vPMU, then
without a vPMU. While slightly contrived, it's plausible a VMM could
reflect KVM's default vCPU and then operate on KVM's copy of CPUID to
later clear the vPMU settings, e.g. see KVM's selftests.

Example #4: Enumerating an Intel vCPU on an AMD host will not call into
intel_pmu_refresh() at any point, and so the BTS and PEBS "unavailable"
bits will be left clear, without any way for userspace to set them.

Keep the "R" behavior of the bit 7, "EMON available", for the guest.
Unlike the BTS and PEBS bits, which are fully "RO", the EMON bit can be
written with a different value, but that new value is ignored.

Cc: Like Xu <likexu@tencent.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Message-Id: <20220611005755.753273-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9fc22296

x86: kvm: remove NULL check before kfree · e20918f6

Dongliang Mu authored Jun 14, 2022

kfree can handle NULL pointer as its argument.
According to coccinelle isnullfree check, remove NULL check
before kfree operation.
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Message-Id: <20220614133458.147314-1-dzm91@hust.edu.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

e20918f6

KVM: Do not zero initialize 'pfn' in hva_to_pfn() · 943dfea8

Sean Christopherson authored Apr 29, 2022

Drop the unnecessary initialization of the local 'pfn' variable in
hva_to_pfn(). First and foremost, '0' is not an invalid pfn, it's a
perfectly valid pfn on most architectures. I.e. if hva_to_pfn() were to
return an "uninitializd" pfn, it would actually be interpeted as a legal
pfn by most callers.

Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn()
explicitly sets pfn to an error value (or returns an error value directly)
if a helper returns failure, and all helpers set the pfn on success.

The zeroing of 'pfn' was introduced by commit 2fc84311 ("KVM:
reorganize hva_to_pfn"), probably to avoid "uninitialized variable"
warnings on statements that return pfn. However, no compiler seems
to produce them, making the initialization unnecessary.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

943dfea8

KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() · 5d49f08c

Sean Christopherson authored Apr 29, 2022

Move the check that restricts mapping huge pages into the guest to pfns
that are backed by refcounted 'struct page' memory into the helper that
actually "requires" a 'struct page', host_pfn_mapping_level().  In
addition to deduplicating code, moving the check to the helper eliminates
the subtle requirement that the caller check that the incoming pfn is
backed by a refcounted struct page, and as an added bonus avoids an extra
pfn_to_page() lookup.

Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs
to stay where it is, as it guards against dereferencing a NULL memslot in
the kvm_slot_dirty_track_enabled() that follows.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5d49f08c

KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690

Sean Christopherson authored Apr 29, 2022

Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
to better reflect what KVM is actually checking, and to eliminate extra
pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
helpers in particular benefit from "refouncted" nomenclature, as it's not
all that obvious why KVM needs to get/put refcounts for some PG_reserved
pages (ZERO_PAGE and ZONE_DEVICE).

Add a comment to call out that the list of exceptions to PG_reserved is
all but guaranteed to be incomplete.  The list has mostly been compiled
by people throwing noodles at KVM and finding out they stick a little too
well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
didn't get freed.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b14b2690

KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() · 284dc493

Sean Christopherson authored Apr 29, 2022

Operate on a 'struct page' instead of a pfn when checking if a page is a
ZONE_DEVICE page, and rename the helper accordingly. Generally speaking,
KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
anything special for ZONE_DEVICE memory. Rather, KVM wants to treat
ZONE_DEVICE memory like regular memory, and the need to identify
ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
other words, KVM should only ever check for ZONE_DEVICE memory after KVM
has already verified that there is a struct page associated with the pfn.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

284dc493