Commit 5d26331c authored by Paolo Bonzini's avatar Paolo Bonzini Committed by Kamal Mostafa

KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo

commit 844a5fe2 upstream.

Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.

KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
restarts execution.  This will still cause a user write to fault, while
supervisor writes will succeed.  User reads will fault spuriously now,
and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
will be enabled and supervisor writes disabled, going back to the
originary situation where supervisor writes fault spuriously.

When SMEP is in effect, however, U=0 will enable kernel execution of
this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0.  If the guest has not enabled NX, the result is a continuous
stream of page faults due to the NX bit being reserved.

The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
control, so they do not use user-return notifiers for EFER---if they did,
EFER.NX would be forced to the same value as the host).

There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.

Cc: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: default avatarXiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: f6577a5fSigned-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
parent 0e972543
...@@ -358,7 +358,8 @@ In the first case there are two additional complications: ...@@ -358,7 +358,8 @@ In the first case there are two additional complications:
- if CR4.SMEP is enabled: since we've turned the page into a kernel page, - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
the kernel may now execute it. We handle this by also setting spte.nx. the kernel may now execute it. We handle this by also setting spte.nx.
If we get a user fetch or read fault, we'll change spte.u=1 and If we get a user fetch or read fault, we'll change spte.u=1 and
spte.nx=gpte.nx back. spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when
shadow paging is in use.
- if CR4.SMAP is disabled: since the page has been changed to a kernel - if CR4.SMAP is disabled: since the page has been changed to a kernel
page, it can not be reused when CR4.SMAP is enabled. We set page, it can not be reused when CR4.SMAP is enabled. We set
CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note, CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
......
...@@ -1718,26 +1718,31 @@ static void reload_tss(void) ...@@ -1718,26 +1718,31 @@ static void reload_tss(void)
static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset) static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
{ {
u64 guest_efer; u64 guest_efer = vmx->vcpu.arch.efer;
u64 ignore_bits; u64 ignore_bits = 0;
guest_efer = vmx->vcpu.arch.efer; if (!enable_ept) {
/*
* NX is needed to handle CR0.WP=1, CR4.SMEP=1. Testing
* host CPUID is more efficient than testing guest CPUID
* or CR4. Host SMEP is anyway a requirement for guest SMEP.
*/
if (boot_cpu_has(X86_FEATURE_SMEP))
guest_efer |= EFER_NX;
else if (!(guest_efer & EFER_NX))
ignore_bits |= EFER_NX;
}
/* /*
* NX is emulated; LMA and LME handled by hardware; SCE meaningless * LMA and LME handled by hardware; SCE meaningless outside long mode.
* outside long mode
*/ */
ignore_bits = EFER_NX | EFER_SCE; ignore_bits |= EFER_SCE;
#ifdef CONFIG_X86_64 #ifdef CONFIG_X86_64
ignore_bits |= EFER_LMA | EFER_LME; ignore_bits |= EFER_LMA | EFER_LME;
/* SCE is meaningful only in long mode on Intel */ /* SCE is meaningful only in long mode on Intel */
if (guest_efer & EFER_LMA) if (guest_efer & EFER_LMA)
ignore_bits &= ~(u64)EFER_SCE; ignore_bits &= ~(u64)EFER_SCE;
#endif #endif
guest_efer &= ~ignore_bits;
guest_efer |= host_efer & ignore_bits;
vmx->guest_msrs[efer_offset].data = guest_efer;
vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
clear_atomic_switch_msr(vmx, MSR_EFER); clear_atomic_switch_msr(vmx, MSR_EFER);
...@@ -1748,16 +1753,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset) ...@@ -1748,16 +1753,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
*/ */
if (cpu_has_load_ia32_efer || if (cpu_has_load_ia32_efer ||
(enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) { (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
guest_efer = vmx->vcpu.arch.efer;
if (!(guest_efer & EFER_LMA)) if (!(guest_efer & EFER_LMA))
guest_efer &= ~EFER_LME; guest_efer &= ~EFER_LME;
if (guest_efer != host_efer) if (guest_efer != host_efer)
add_atomic_switch_msr(vmx, MSR_EFER, add_atomic_switch_msr(vmx, MSR_EFER,
guest_efer, host_efer); guest_efer, host_efer);
return false; return false;
} } else {
guest_efer &= ~ignore_bits;
guest_efer |= host_efer & ignore_bits;
vmx->guest_msrs[efer_offset].data = guest_efer;
vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
return true; return true;
}
} }
static unsigned long segment_base(u16 selector) static unsigned long segment_base(u16 selector)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment