Commits · 84478c829d0f474a1d6749207c53daacc305d4e1 · nexedi / linux

19 May, 2010 24 commits

KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID · 84478c82

Glauber Costa authored May 11, 2010

Until now, we have been using individual KVM_CAP entries to tell
userspace which cpuids we support. This is suboptimal, since it
introduces a delay between a feature arriving in the host and
becoming available to the guest.

A much better mechanism is to list para features in KVM_GET_SUPPORTED_CPUID.
This makes userspace automatically aware of what we provide. With the
KVM_CAP approach, every new cpuid bit added in the future needs yet another
capability, which creates complexity and delays feature adoption.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: x86: add new KVMCLOCK cpuid feature · 0e6ac58a

Glauber Costa authored May 11, 2010

This cpuid, KVM_CPUID_CLOCKSOURCE2, will indicate to the guest
that kvmclock is available through a new set of MSRs. The old ones
are deprecated.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: x86: change msr numbers for kvmclock · 11c6bffa

Glauber Costa authored May 11, 2010

Avi pointed out a while ago that the kvmclock MSRs fall into the Pentium
PMU range. So the idea here is to add new ones and, after a while,
deprecate the old ones.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

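
Note: the two kvmclock commits above could be pictured from the guest side roughly as follows. This is a minimal sketch, not the actual guest code. The feature bit positions and MSR numbers are quoted from memory of the kernel's kvm_para.h and should be checked against the tree; `struct kvmclock_msrs` and `pick_kvmclock_msrs()` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values (believed to match kvm_para.h; verify against your tree). */
#define KVM_FEATURE_CLOCKSOURCE   0   /* old MSR pair available */
#define KVM_FEATURE_CLOCKSOURCE2  3   /* new MSR pair available */

#define MSR_KVM_WALL_CLOCK        0x11
#define MSR_KVM_SYSTEM_TIME       0x12
#define MSR_KVM_WALL_CLOCK_NEW    0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW   0x4b564d01

/* Hypothetical container for the MSR pair the guest decides to use. */
struct kvmclock_msrs {
	uint32_t wall_clock;
	uint32_t system_time;
};

/*
 * 'features' stands for the feature word the guest read from the KVM
 * paravirtual cpuid leaf. Prefer the new MSRs, fall back to the old
 * ones, and report false when kvmclock is not offered at all.
 */
bool pick_kvmclock_msrs(uint32_t features, struct kvmclock_msrs *out)
{
	if (features & (1u << KVM_FEATURE_CLOCKSOURCE2)) {
		out->wall_clock = MSR_KVM_WALL_CLOCK_NEW;
		out->system_time = MSR_KVM_SYSTEM_TIME_NEW;
		return true;
	}
	if (features & (1u << KVM_FEATURE_CLOCKSOURCE)) {
		out->wall_clock = MSR_KVM_WALL_CLOCK;
		out->system_time = MSR_KVM_SYSTEM_TIME;
		return true;
	}
	return false;
}
```

Checking the new bit first means a host that advertises both sets steers the guest away from the range that collides with the PMU.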

x86, paravirt: Add a global synchronization point for pvclock · 489fb490

Glauber Costa authored May 11, 2010

In recent stress tests, it was found that pvclock-based systems
could seriously warp on SMP systems. Using Ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5 million warps a minute on some systems
(to be fair, it wasn't that bad on most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.

This happens even on some machines that report constant_tsc in their tsc flags,
especially on multi-socket ones.

Two reads of the same kernel timestamp at approximately the same time will
likely have their tsc timestamped on different occasions too. This means the
delta we calculate is unpredictable at best, and can be smaller on a cpu
that is legitimately reading the clock at a later point in time.

Some adjustments on the host could make this window less likely to happen,
but it still remains an intrinsic problem of the mechanism.

A while ago, I thought about using a shared variable anyway, to hold the clock's
last state, but gave up due to the high contention that locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.

We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.

OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.

After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Avi Kivity <avi@redhat.com>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

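
Note: the lock-free scheme described in this commit message can be sketched in portable C11 as below. This is an illustration under assumptions, not the kernel code: it uses `<stdatomic.h>` rather than the kernel's atomic64 helpers, and `last_value` and `monotonic_clamp()` are names chosen for the sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global: the highest clock value handed out so far. */
static _Atomic uint64_t last_value;

/*
 * Clamp a per-cpu clock reading against the global value so that no
 * caller ever observes time going backwards.
 */
uint64_t monotonic_clamp(uint64_t reading)
{
	uint64_t last = atomic_load(&last_value);

	do {
		/*
		 * Another cpu already published a later time: return that
		 * instead. It may be stale by now, but it is never lower
		 * than anything returned earlier.
		 */
		if (reading < last)
			return last;
		/*
		 * Otherwise try to publish our reading. On failure the
		 * compare-exchange reloads 'last' and we re-check.
		 */
	} while (!atomic_compare_exchange_weak(&last_value, &last, reading));

	return reading;
}
```

No lock is needed because the global only ever moves forward: a failed compare-exchange just means someone else advanced it, and the loop re-evaluates against the newer value.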

x86, paravirt: Enable pvclock flags in vcpu_time_info structure · 424c32f1

Glauber Costa authored May 11, 2010

This patch removes one padding byte and turns it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually determined via HV negotiation.
Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

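
Note: a rough picture of the flags mechanism described above. The structure layout follows the pvclock ABI as I understand it (32 bytes, with the flags byte taken from the former padding); the helper names `pvclock_set_valid_flags()` and `pvclock_effective_flags()` are hypothetical and only illustrate the masking idea.

```c
#include <stdint.h>

/* Assumed layout of the shared per-vcpu time structure. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;      /* formerly one of three padding bytes */
	uint8_t  pad[2];
} __attribute__((__packed__)); /* GCC/Clang extension */

/* The ABI size must not change when padding is turned into flags. */
_Static_assert(sizeof(struct pvclock_vcpu_time_info) == 32,
	       "pvclock ABI size changed");

/* Flags the guest has agreed to honour; none by default. */
static uint8_t valid_flags;

/* Hypothetical helper: called once the guest has negotiated with the HV. */
void pvclock_set_valid_flags(uint8_t flags)
{
	valid_flags = flags;
}

/* Hypothetical helper: flags from the host are masked on every read. */
uint8_t pvclock_effective_flags(const struct pvclock_vcpu_time_info *info)
{
	return info->flags & valid_flags;
}
```

Masking on read means a hypervisor that writes unexpected bits into the old padding byte cannot change the behaviour of a guest that never opted in.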

KVM: x86: Inject #GP with the right rip on efer writes · b69e8cae

Roedel, Joerg authored May 06, 2010

This patch fixes a bug in the KVM efer-msr write path. If a
guest writes to a reserved efer bit the set_efer function
injects the #GP directly. The architecture dependent wrmsr
function does not see this, assumes success and advances the
rip. This results in a #GP in the guest with the wrong rip.
This patch fixes this by reporting efer write errors back to
the architectural wrmsr function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

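
Note: the shape of this fix can be sketched as follows. Everything here is a simplified stand-in: `struct sketch_vcpu`, `sketch_set_efer()` and `handle_efer_write()` are hypothetical, and the reserved-bit mask is illustrative only (it treats everything except SCE, LME, LMA and NX as reserved).

```c
#include <stdint.h>

/* Illustrative mask: all bits except SCE(0), LME(8), LMA(10), NX(11). */
#define SKETCH_EFER_RESERVED (~(uint64_t)0xd01)
#define WRMSR_INSN_LEN 2 /* wrmsr is encoded as 0F 30 */

struct sketch_vcpu {
	uint64_t efer;
	uint64_t rip;
	int gp_pending;
};

/* Report failure to the caller instead of injecting #GP here. */
static int sketch_set_efer(struct sketch_vcpu *v, uint64_t val)
{
	if (val & SKETCH_EFER_RESERVED)
		return 1;
	v->efer = val;
	return 0;
}

/* Stand-in for the architecture-specific wrmsr exit handler. */
void handle_efer_write(struct sketch_vcpu *v, uint64_t val)
{
	if (sketch_set_efer(v, val)) {
		/* rip still points at the wrmsr, as the guest expects. */
		v->gp_pending = 1;
		return;
	}
	v->rip += WRMSR_INSN_LEN; /* skip the instruction only on success */
}
```

Because the error travels back to the handler that owns the rip update, the fault is delivered with rip pointing at the offending instruction rather than the one after it.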

KVM: SVM: Don't allow nested guest to VMMCALL into host · 0d945bd9

Joerg Roedel authored May 05, 2010

This patch disables the possibility for a l2-guest to do a
VMMCALL directly into the host. This would happen if the
l1-hypervisor doesn't intercept VMMCALL and the l2-guest
executes this instruction.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: x86: Fix exception reinjection forced to true · 3f0fd292

Joerg Roedel authored May 05, 2010

The recently merged patch that allows marking an exception
as reinjected has a bug: it always marks the exception as
reinjected. This breaks the nested-svm shadow-on-shadow
implementation.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: Fix wallclock version writing race · 9ed3c444

Avi Kivity authored May 04, 2010

Wallclock writing uses an unprotected global variable to hold the version;
this can cause one guest to interfere with another if both write their
wallclock at the same time.
Acked-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots · 8facbbff

Avi Kivity authored May 04, 2010

On svm, kvm_read_pdptr() may require reading guest memory, which can sleep.

Push the spinlock into mmu_alloc_roots(), and only take it after we've read
the pdptr.
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: VMX: enable VMXON check with SMX enabled (Intel TXT) · cafd6659

Shane Wang authored Apr 29, 2010

Per the documentation for the feature control MSR:

Bit 1 enables VMXON in SMX operation. If the bit is clear, execution
of VMXON in SMX operation causes a general-protection exception.
Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution
of VMXON outside SMX operation causes a general-protection exception.

This patch enables this check for VMXON with SMX in KVM.
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

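
Note: the bit semantics quoted above could be checked along these lines. This is a sketch under assumptions: the lock bit (bit 0) is not mentioned in the commit message and is included from my reading of the feature control MSR, the macro names mirror the kernel's naming convention, and `vmx_disabled_by_firmware()` is a hypothetical helper that takes the MSR value as a parameter instead of reading the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_CONTROL_LOCKED                     (1ull << 0) /* assumption */
#define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1ull << 1)
#define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1ull << 2)

/*
 * Decide whether VMXON would fault given the feature control MSR value
 * and whether the cpu is currently in SMX operation (e.g. after a TXT
 * launch). An unlocked MSR is treated as usable, since software could
 * still set the required bit itself.
 */
bool vmx_disabled_by_firmware(uint64_t feature_control, bool in_smx)
{
	uint64_t required = in_smx ? FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX
				   : FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;

	if (!(feature_control & FEATURE_CONTROL_LOCKED))
		return false;

	return !(feature_control & required);
}
```

The point of the patch is the `in_smx` distinction: checking only the outside-SMX bit gives the wrong answer on a machine that was launched through TXT.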

KVM: x86: properly update ready_for_interrupt_injection · f1d86e46

Marcelo Tosatti authored May 03, 2010

The recent changes to emulate string instructions without entering guest
mode exposed a bug where pending interrupts are not properly reflected
in ready_for_interrupt_injection.

The result is that userspace overwrites a previously queued interrupt
when irqchips are emulated in userspace.

Fix by always updating state before returning to userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: VMX: Atomically switch efer if EPT && !EFER.NX · 84ad33ef

Avi Kivity authored Apr 28, 2010

When EPT is enabled, we cannot emulate EFER.NX=0 through the shadow page
tables.  This causes accesses through ptes with bit 63 set to succeed instead
of failing a reserved bit check.
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: VMX: Add facility to atomically switch MSRs on guest entry/exit · 61d2ef2c

Avi Kivity authored Apr 28, 2010

Some guest msr values cannot be used on the host (for example, EFER.NX=0),
so we need to switch them atomically during guest entry or exit.

Add a facility to program the vmx msr autoload registers accordingly.
Signed-off-by: Avi Kivity <avi@redhat.com>

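
Note: a software-only sketch of what such a facility might look like. The 16-byte entry layout (index, reserved, value) follows the VMX MSR-load area format as I understand it; `struct msr_autoload`, `SKETCH_NR_AUTOLOAD_MSRS` and `add_atomic_switch_msr()` are illustrative names, and the real code would additionally program the VMCS with the area addresses and entry counts.

```c
#include <stdint.h>

#define MSR_EFER 0xc0000080u
#define SKETCH_NR_AUTOLOAD_MSRS 2 /* hypothetical capacity */

/* One entry of a VMX MSR-load area. */
struct vmx_msr_entry {
	uint32_t index;
	uint32_t reserved;
	uint64_t value;
};

/* Hypothetical bookkeeping for the guest-entry and host-exit lists. */
struct msr_autoload {
	unsigned int nr;
	struct vmx_msr_entry guest[SKETCH_NR_AUTOLOAD_MSRS];
	struct vmx_msr_entry host[SKETCH_NR_AUTOLOAD_MSRS];
};

/*
 * Register (or update) an MSR that the cpu should load with guest_val
 * on entry and host_val on exit. Returns -1 when the lists are full.
 */
int add_atomic_switch_msr(struct msr_autoload *m, uint32_t msr,
			  uint64_t guest_val, uint64_t host_val)
{
	unsigned int i;

	for (i = 0; i < m->nr; i++)
		if (m->guest[i].index == msr)
			break;

	if (i == m->nr) {
		if (m->nr == SKETCH_NR_AUTOLOAD_MSRS)
			return -1;
		m->nr++;
	}

	m->guest[i] = (struct vmx_msr_entry){ .index = msr, .value = guest_val };
	m->host[i] = (struct vmx_msr_entry){ .index = msr, .value = host_val };
	return 0;
}
```

Keeping the guest and host lists index-aligned makes it simple to update or remove one MSR in both directions at once.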

KVM: VMX: Add definitions for guest and host EFER autoswitch vmcs entries · 5dfa3d17

Avi Kivity authored Apr 28, 2010

Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: VMX: Add definition for msr autoload entry · 19b95dba

Avi Kivity authored Apr 28, 2010

Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: Let vcpu structure alignment be determined at runtime · 0ee75bea

Avi Kivity authored Apr 28, 2010

vmx and svm vcpus have different contents and therefore may have different
alignment requirements.  Let each specify its required alignment.
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: MMU: cleanup invlpg code · 884a0ff0

Xiao Guangrong authored Apr 28, 2010

Use is_last_spte() to clean up the invlpg code.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: MMU: move unsync/sync tracpoints to proper place · 5e1b3ddb

Xiao Guangrong authored Apr 28, 2010

Move the unsync/sync tracepoints to the proper place, so that
we can measure how long a page stays unsync.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: MMU: convert mmu tracepoints · 85f2067c

Xiao Guangrong authored Apr 28, 2010

Convert mmu tracepoints by using DECLARE_EVENT_CLASS
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: MMU: fix for calculating gpa in invlpg code · 22c9b2d1

Xiao Guangrong authored Apr 28, 2010

If the guest is 32-bit, we should use 'quadrant' to adjust gpa
offset
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: powerpc: use of kzalloc/kfree requires including slab.h · 329d20ba

Stephen Rothwell authored Apr 27, 2010

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: Fix mmu shrinker error · d35b8dd9

Gui Jianfeng authored Apr 27, 2010

kvm_mmu_remove_one_alloc_mmu_page() assumes kvm_mmu_zap_page() reclaims
only one sp, but that's not the case. This causes the mmu shrinker to return
a wrong number. This patch fixes the counting error.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: MMU: fix hashing for TDP and non-paging modes · 5a7388c2

Eric Northup authored Apr 26, 2010

For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().
Signed-off-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


17 May, 2010 16 commits

KVM: Minor MMU documentation edits · c4bd09b2

Avi Kivity authored Apr 26, 2010

Reported by Andrew Jones.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: Document KVM_GET_MP_STATE and KVM_SET_MP_STATE · b843f065

Avi Kivity authored Apr 25, 2010

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: MMU: fix sp->unsync type error in trace event definition · df2fb6e7

Gui Jianfeng authored Apr 22, 2010

sp->unsync is bool now, so update trace event declaration.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


KVM: SVM: Handle MCE intercepts always on host level · ff47a49b

Joerg Roedel authored Apr 22, 2010

This patch prevents MCE intercepts from being propagated
into the L1 guest if they happened in an L2 guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: x86: Allow marking an exception as reinjected · ce7ddec4

Joerg Roedel authored Apr 22, 2010

This patch adds logic to kvm/x86 which allows marking an
injected exception as reinjected. This allows removing an
ugly hack from svm_complete_interrupts that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because an exception reinjected into
the nested guest could cause a nested vmexit emulation. But
reinjected exceptions must not intercept. The downside of
the hack is that an exception that is injected could get
lost.
This patch fixes the problem and puts the code for it into
generic x86 files, because nested VMX will likely have the
same problem and could reuse the code.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: SVM: Report emulated SVM features to userspace · c2c63a49

Joerg Roedel authored Apr 22, 2010

This patch implements the reporting of the emulated SVM
features to userspace instead of the real hardware
capabilities. Every real hardware capability needs emulation
in nested svm so the old behavior was broken.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: x86: Add callback to let modules decide over some supported cpuid bits · d4330ef2

Joerg Roedel authored Apr 22, 2010

This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decision about some supported cpuid bits to the
architecture modules.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: SVM: Propagate nested entry failure into guest hypervisor · 228070b1

Joerg Roedel authored Apr 22, 2010

This patch implements propagation of a failed guest vmrun
back into the guest instead of killing the whole guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling · 2be4fc7a

Joerg Roedel authored Apr 22, 2010

This patch syncs cr0 and cr3 from the vmcb to the kvm state
before nested intercept handling is done. This allows the
vmexit path to be simplified.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: SVM: Make sure rip is synced to vmcb before nested vmexit · 2041a06a

Joerg Roedel authored Apr 22, 2010

This patch fixes a bug where a nested guest always went over
the same instruction because the rip was not advanced on a
nested vmexit.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: SVM: Fix nested nmi handling · 924584cc

Joerg Roedel authored Apr 22, 2010

The patch introducing nested nmi handling had a bug. The
check does not belong to enable_nmi_window but must be in
nmi_allowed. This patch fixes this.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


Merge remote branch 'tip/perf/core' · 8d3b9323

Avi Kivity authored Apr 23, 2010

Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: Remove test-before-set optimization for dirty bits · d1476937

Takuya Yoshikawa authored Apr 23, 2010

As Avi pointed out, the bit test in mark_page_dirty() was important
in the days of shadow paging, but EPT and NPT have since become
common and the chance of faulting a page more than once per iteration is
small. So let's remove the bit test to avoid the extra access.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

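
Note: the change described here amounts to the difference between the two helpers below. This is a plain C sketch on an `unsigned long` bitmap with hypothetical function names; it is not the kernel's bitmap API and makes no attempt at atomicity.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Old behaviour: read and test the bit first, write only if it is clear. */
void mark_dirty_test_first(unsigned long *bitmap, size_t page)
{
	unsigned long mask = 1UL << (page % SKETCH_BITS_PER_LONG);
	unsigned long *word = &bitmap[page / SKETCH_BITS_PER_LONG];

	if (!(*word & mask))
		*word |= mask;
}

/* New behaviour: set the bit unconditionally. */
void mark_dirty(unsigned long *bitmap, size_t page)
{
	bitmap[page / SKETCH_BITS_PER_LONG] |=
		1UL << (page % SKETCH_BITS_PER_LONG);
}
```

Both helpers leave the bitmap in the same state; the test only pays off when the same page is marked many times per logging iteration, which the message argues is now rare.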

KVM: Document mmu · 03909187

Avi Kivity authored Apr 21, 2010

Signed-off-by: Avi Kivity <avi@redhat.com>


KVM: VMX: free vpid when fail to create vcpu · cdbecfc3

Lai Jiangshan authored Apr 17, 2010

Fix a bug in the error path: free the allocated vpid when vcpu
creation fails.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

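
Note: the class of bug fixed here is a missing step in a goto-based unwind. The sketch below is a userspace illustration only; the tiny vpid allocator, `struct sketch_vcpu` and `create_vcpu()` (with its `fail_late` switch to force the error path) are all hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_NR_VPIDS 8
static bool vpid_used[SKETCH_NR_VPIDS]; /* slot 0 means "no vpid" */

static int allocate_vpid(void)
{
	for (int i = 1; i < SKETCH_NR_VPIDS; i++) {
		if (!vpid_used[i]) {
			vpid_used[i] = true;
			return i;
		}
	}
	return 0;
}

static void free_vpid(int vpid)
{
	if (vpid)
		vpid_used[vpid] = false;
}

struct sketch_vcpu {
	int vpid;
	void *guest_msrs;
};

/* fail_late simulates a failure after the vpid has been allocated. */
struct sketch_vcpu *create_vcpu(bool fail_late)
{
	struct sketch_vcpu *v = calloc(1, sizeof(*v));

	if (!v)
		return NULL;

	v->vpid = allocate_vpid();

	v->guest_msrs = fail_late ? NULL : malloc(64);
	if (!v->guest_msrs)
		goto err_free_vpid;

	return v;

err_free_vpid:
	free_vpid(v->vpid); /* the step that was missing: release the id */
	free(v);
	return NULL;
}
```

Without the `free_vpid()` call on the error path, every failed creation would permanently consume one id from a small, fixed pool.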

KVM: PPC: Enable native paired singles · b83d4a9c

Alexander Graf authored Apr 20, 2010

When we're on a paired single capable host, we can just always enable
paired singles and expose them to the guest directly.

This approach breaks when multiple VMs run and access PS concurrently,
but this should suffice until we get a proper framework for it in Linux.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
