Commits · 7700d76ea98ff8024db70abffcfd96d9ea0ad21f · Kirill Smelkov / linux

08 Aug, 2018 40 commits

x86/bugs, kvm: Introduce boot-time control of L1TF mitigations · 7700d76e

Jiri Kosina authored Jul 13, 2018

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
	Provides all available mitigations for the L1TF vulnerability. Disables
	SMT and enables all mitigations in the hypervisors. SMT control via
	/sys/devices/system/cpu/smt/control is still possible after boot.
	Hypervisors will issue a warning when the first VM is started in
	a potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  full,force
	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
	command line option. sysfs control of SMT and the hypervisor flush
	control is disabled.

  flush
	Leaves SMT enabled and enables the conditional hypervisor mitigation.
	Hypervisors will issue a warning when the first VM is started in a
	potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  flush,nosmt
	Disables SMT and enables the conditional hypervisor mitigation. SMT
	control via /sys/devices/system/cpu/smt/control is still possible
	after boot. If SMT is reenabled or flushing disabled at runtime
	hypervisors will issue a warning.

  flush,nowarn
	Same as 'flush', but hypervisors will not warn when
	a VM is started in a potentially insecure configuration.

  off
	Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.
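
The option values above can be summarized as a small lookup table. The
following Python sketch is illustrative only: L1tfMode, L1TF_MODES,
L1TF_NO_WARN and parse_l1tf() are hypothetical names, not kernel code, and
the fields merely restate the descriptions given above.

```python
from typing import NamedTuple, Optional


class L1tfMode(NamedTuple):
    smt_off_at_boot: bool   # SMT is disabled during boot
    runtime_control: bool   # sysfs SMT / flush control remains available
    hypervisor: str         # "all", "conditional" or "off"


# Hypothetical lookup table; each row restates one option description.
L1TF_MODES = {
    "full":         L1tfMode(True,  True,  "all"),
    "full,force":   L1tfMode(True,  False, "all"),
    "flush":        L1tfMode(False, True,  "conditional"),
    "flush,nosmt":  L1tfMode(True,  True,  "conditional"),
    "flush,nowarn": L1tfMode(False, True,  "conditional"),
    "off":          L1tfMode(False, True,  "off"),
}

# Values for which hypervisors stay silent about insecure configurations.
L1TF_NO_WARN = {"flush,nowarn", "off"}


def parse_l1tf(value: Optional[str] = None) -> L1tfMode:
    """Map an 'l1tf=' value to its mode; a missing option means 'flush'."""
    if value is None:
        value = "flush"
    if value not in L1TF_MODES:
        raise ValueError(f"unknown l1tf value: {value!r}")
    return L1TF_MODES[value]
```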

Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force'	: Perform L1D flushes. No runtime control
    			  possible.

  - 'l1tf=full'
  - 'l1tf=flush'
  - 'l1tf=flush,nosmt'	: Perform L1D flushes and warn on VM start if
			  SMT has been runtime enabled or L1D flushing
			  has been run-time disabled.

  - 'l1tf=flush,nowarn'	: Perform L1D flushes and no warnings are emitted.

  - 'l1tf=off'		: L1D flushes are not performed and no warnings
			  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.
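
The KVM rules above can be read as a small decision function. This is a
sketch under stated assumptions, not the kernel implementation:
kvm_l1tf_policy() and its arguments are hypothetical, and the module
parameter values are taken from the 'vmentry_l1d_flush' description
elsewhere in this series.

```python
def kvm_l1tf_policy(l1tf, smt_active, vmentry_l1d_flush=None):
    """Return (flush_performed, warn_on_vm_start) for one VM start.

    l1tf              -- value of the 'l1tf=' command line option
    smt_active        -- whether SMT is currently enabled
    vmentry_l1d_flush -- optional KVM override: "always", "cond" or "never"
    """
    flush = l1tf != "off"
    # KVM may override the flush behavior unless 'full,force' was given.
    if vmentry_l1d_flush is not None and l1tf != "full,force":
        flush = vmentry_l1d_flush != "never"
    # 'flush,nowarn' and 'off' never warn; otherwise warn when the current
    # configuration is potentially insecure (SMT on or flushing off).
    quiet = l1tf in ("flush,nowarn", "off")
    warn = (not quiet) and (smt_active or not flush)
    return flush, warn
```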

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustments and adapt location of l1tf doc.]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early · 94353874

Thomas Gleixner authored Jul 13, 2018

The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.

That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.

With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

cpu/hotplug: Expose SMT control init function · 7b2ab6e5

Jiri Kosina authored Jul 13, 2018

The L1TF mitigation will gain a command line parameter which allows setting
a combination of hypervisor mitigation and SMT control.

Expose cpu_smt_disable() so the command line parser can tweak SMT settings.

[ tglx: Split out of larger patch and made it preserve an already existing
  	force off state ]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/kvm: Allow runtime control of L1D flush · 29ab360f

Thomas Gleixner authored Jul 13, 2018

All mitigation modes can be switched at run time with a static key now:

 - Use sysfs_streq() instead of strcmp() to handle the trailing new line
   from sysfs writes correctly.
 - Make the static key management handle multiple invocations properly.
 - Set the module parameter file to RW
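
Why a plain string compare is not enough for sysfs writes can be shown with
a simplified model. The helper below is a hypothetical Python approximation
of the idea behind sysfs_streq(), not its exact semantics: it only treats a
single trailing newline as the end of the string.

```python
def sysfs_streq_sketch(s1: str, s2: str) -> bool:
    """Compare two strings, ignoring one trailing newline on either side."""
    def strip_one_newline(s: str) -> str:
        return s[:-1] if s.endswith("\n") else s
    return strip_one_newline(s1) == strip_one_newline(s2)


# 'echo cond > .../vmentry_l1d_flush' delivers "cond\n" to the setter.
written = "cond\n"
plain_compare_matches = written == "cond"
tolerant_compare_matches = sysfs_streq_sketch(written, "cond")
```
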
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.954525119@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Reviewed and accepted fuzz in last hunk]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/kvm: Serialize L1D flush parameter setter · bf502aa8

Thomas Gleixner authored Jul 13, 2018

Writes to the parameter files are not serialized at the sysfs core
level, so local serialization is required.
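
As a rough illustration of local serialization, the sketch below guards a
setter with a lock so that two pieces of related state cannot get out of
step. FlushParamSketch is a hypothetical user-space stand-in written in
Python; it does not mirror the actual KVM code.

```python
import threading


class FlushParamSketch:
    """Hypothetical stand-in for a module parameter with derived state."""

    def __init__(self):
        self._lock = threading.Lock()
        self.mode = "cond"
        self.flush_enabled = True   # derived from mode

    def set(self, mode: str) -> None:
        if mode not in ("always", "cond", "never"):
            raise ValueError(mode)
        # Without the lock, two concurrent writers could interleave the two
        # assignments below and leave mode and flush_enabled inconsistent.
        with self._lock:
            self.mode = mode
            self.flush_enabled = mode != "never"
```
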
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.873642605@linutronix.de

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/kvm: Add static key for flush always · dc17b0f7

Thomas Gleixner authored Jul 13, 2018

Avoid the conditional in the L1D flush control path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.790914912@linutronix.de

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/kvm: Move l1tf setup function · dc47bec7

Thomas Gleixner authored Jul 13, 2018

In preparation of allowing run time control for L1D flushing, move the
setup code to the module parameter handler.

In case of pre module init parsing, just store the value and let vmx_init()
do the actual setup after running kvm_init() so that enable_ept has
the correct state.

During run-time invoke it directly from the parameter setter to prepare for
run-time control.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.694063239@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Accept/reviewed fuzz]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/l1tf: Handle EPT disabled state proper · ed61097d

Thomas Gleixner authored Jul 13, 2018

If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.

Invoke it after the hardware setup has been done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Adjusted to work around missing hyperv support in vmx_exit()]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/kvm: Drop L1TF MSR list approach · 7fea4047

Thomas Gleixner authored Jul 13, 2018

The VMX module parameter to control the L1D flush should become
writeable.

The MSR list is set up at VM init per guest VCPU, but the run time
switching is based on a static key which is global. Toggling the MSR list
at run time might be feasible, but for now drop this optimization and use
the regular MSR write to make run-time switching possible.

The default mitigation is the conditional flush anyway, so for extra
paranoid setups this will add some small overhead, but the extra code
executed is in the noise compared to the flush itself.

Aside of that the EPT disabled case is not handled correctly at the moment
and the MSR list magic is in the way for fixing that as well.

If it's really providing a significant advantage, then this needs to be
revisited after the code is correct and the control is writable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.516940445@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustment in one hunk. FIXME: Should be merged
      with the patch that adds this and possibly dropped completely.]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/litf: Introduce vmx status variable · 4757e268

Thomas Gleixner authored Jul 13, 2018

Store the effective mitigation of VMX in a status variable and use it to
report the VMX state in the l1tf sysfs file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustment in last hunk]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

cpu/hotplug: Online siblings when SMT control is turned on · eea468d2

Thomas Gleixner authored Jul 07, 2018

Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT
siblings. Writing 'on' merely enables the ability to online them, but does
not online them automatically.

Make 'on' more useful by onlining all offline siblings.
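
The intended behavior can be modelled in a few lines. This Python sketch is
purely illustrative: write_smt_control() and the CPU dictionary layout are
hypothetical, and CPU 72 is simply borrowed as an example sibling number.

```python
def write_smt_control(value, cpus):
    """cpus maps a CPU number to {"primary": bool, "online": bool}."""
    if value not in ("on", "off"):
        raise ValueError(value)
    for cpu in cpus.values():
        if cpu["primary"]:
            continue                        # primary threads are left alone
        # 'off' offlines every SMT sibling; with this change 'on' also
        # brings the offline siblings back instead of only permitting it.
        cpu["online"] = value == "on"
    return cpus


cpus = {
    0:  {"primary": True,  "online": True},
    72: {"primary": False, "online": True},
}
```
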
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: _cpu_up() only has 2 arguments]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required · 3c4f1d15

Konrad Rzeszutek Wilk authored Jun 28, 2018

If the L1D flush module parameter is set to 'always' and the IA32_FLUSH_CMD
MSR is available, optimize the VMENTER code with the MSR save list.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustments and ensure hunk #2 does not get
      applied into the wrong place.]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs · a844b017

Konrad Rzeszutek Wilk authored Jun 20, 2018

The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend
add_atomic_switch_msr() with an entry_only parameter to allow storing the
MSR only in the guest (ENTRY) MSR array.
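
The effect of an entry_only parameter can be sketched with two independent
lists. The Python model below is an assumption-laden illustration, not the
VMX code: MsrAutoloadSketch is hypothetical and the EFER values are made up.
Only the MSR numbers (IA32_FLUSH_CMD at 0x10B, IA32_EFER at 0xC0000080) and
the L1D_FLUSH command bit are real.

```python
IA32_EFER = 0xC0000080
IA32_FLUSH_CMD = 0x10B
L1D_FLUSH = 1 << 0          # command bit written to IA32_FLUSH_CMD


class MsrAutoloadSketch:
    """Hypothetical model of separate guest (ENTRY) and host (EXIT) lists."""

    def __init__(self):
        self.guest = {}     # MSRs loaded on VMENTER
        self.host = {}      # MSRs loaded on VMEXIT

    def add_atomic_switch_msr(self, msr, guest_val, host_val,
                              entry_only=False):
        self.guest[msr] = guest_val
        if entry_only:
            return          # nothing to restore on VMEXIT
        self.host[msr] = host_val


m = MsrAutoloadSketch()
m.add_atomic_switch_msr(IA32_EFER, 0x500, 0xD01)           # made-up values
m.add_atomic_switch_msr(IA32_FLUSH_CMD, L1D_FLUSH, 0, entry_only=True)
```

The two lists end up with different lengths, which is why the guest and host
counts have to be tracked separately.
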
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Seperate the VMX AUTOLOAD guest/host number accounting. · 62bb770c

Konrad Rzeszutek Wilk authored Jun 20, 2018

This allows loading a different number of MSRs depending on the context:
VMEXIT or VMENTER.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustments]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Add find_msr() helper function · 40914af6

Konrad Rzeszutek Wilk authored Jun 20, 2018

.. to help find the MSR on either the guest or host MSR list.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers · c48b1fb8

Konrad Rzeszutek Wilk authored Jun 20, 2018

There is no semantic change, but this change allows an unbalanced number of
MSRs to be loaded on VMEXIT and VMENTER, i.e. the number of MSRs to save or
restore on VMEXIT or VMENTER may be different.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Drop 2 hunks modifying nested which does not exist, yet]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Add L1D flush logic · 57880666

Paolo Bonzini authored Jul 02, 2018

Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.

The flag is set:
 - Always, if the flush module parameter is 'always'

 - Conditionally at:
   - Entry to vcpu_run(), i.e. after executing user space

   - From the sched_in notifier, i.e. when switching to a vCPU thread.

   - From vmexit handlers which are considered unsafe, i.e. where
     sensitive data can be brought into L1D:

     - The emulator, which could be a good target for other speculative
       execution-based threats,

     - The MMU, which can bring host page tables in the L1 cache.

     - External interrupts

     - Nested operations that require the MMU (see above). That is
       vmptrld, vmptrst, vmclear, vmwrite, vmread.

     - When handling invept, invvpid
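
The flag handling described above can be sketched as a tiny state machine.
This Python model is hypothetical (VcpuSketch, mark_unsafe() and vmenter()
are illustration names); it only mirrors the rule that a flush happens when
the static key is enabled and l1tf_flush_l1d is set, and that the flag stays
set in 'always' mode.

```python
class VcpuSketch:
    def __init__(self, mode: str):
        self.mode = mode                          # "always", "cond", "never"
        self.key_enabled = mode != "never"        # models the static key
        self.l1tf_flush_l1d = mode == "always"
        self.flushes = 0

    def mark_unsafe(self):
        """Called from paths that may pull sensitive data into L1D."""
        self.l1tf_flush_l1d = True

    def vmenter(self):
        if not self.key_enabled:
            return
        if self.l1tf_flush_l1d:
            self.flushes += 1                     # stands in for the flush
        # Consume the flag; in 'always' mode it is immediately set again.
        self.l1tf_flush_l1d = self.mode == "always"


cond = VcpuSketch("cond")
cond.vmenter()          # nothing unsafe happened: no flush
cond.mark_unsafe()      # e.g. an MMU or emulator exit
cond.vmenter()          # flush once
cond.vmenter()          # flag was consumed: no flush
```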

[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Moved change to kvm/mmu.c(kvm_handle_page_fault) into kvm/vmx.c
      before calling kvm_mmu_page_fault(). Left kvm/svm.c unmodified
      as AMD is not said to be affected.]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Add L1D MSR based flush · 37329e72

Paolo Bonzini authored Jul 02, 2018

336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.

The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."

Use it instead of the loop based L1 flush algorithm.

A copy of this document is available at
   https://bugzilla.kernel.org/show_bug.cgi?id=199511

[ tglx: Avoid allocating pages when the MSR is available ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Add L1D flush algorithm · 3f457609

Paolo Bonzini authored Jul 02, 2018

To mitigate the L1 Terminal Fault vulnerability it's required to flush L1D
on VMENTER to prevent rogue guests from snooping host memory.

CPUs will have a new control MSR via a microcode update to flush L1D with a
single MSR write, but in the absence of microcode a fallback to a software
based flush algorithm is required.

Add a software flush loop which is based on code from Intel.

[ tglx: Split out from combo patch ]
[ bpetkov: Polish the asm code ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adaptions]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM/VMX: Add module argument for L1TF mitigation · 4fa3dc82

Konrad Rzeszutek Wilk authored Jul 02, 2018

Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:

 - "always" 	L1D cache flush on every VMENTER.
 - "cond"	Conditional L1D cache flush, explained below
 - "never"	Disable the L1D cache flush mitigation

"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might be exploited.

[ tglx: Split out from a larger patch ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustments]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present. · e93d06c7

Konrad Rzeszutek Wilk authored Jun 20, 2018

If the L1TF CPU bug is present we allow the KVM module to be loaded as
the majority of users that use Linux and KVM have trusted guests and do not
want a broken setup.

Cloud vendors are the ones that are uncomfortable with CVE-2018-3620 and
as such they are the ones that should set nosmt to one.

Setting 'nosmt' means that the system administrator also needs to
disable SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4a ("cpu/hotplug: Provide knobs to control SMT").

Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646

[smb: Added vm_init function to vmx.c, squashed v4, re-
      arranged for v6]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks · 8b27d911

Suravee Suthikulpanit authored May 04, 2016

Add function pointers to struct kvm_x86_ops so that the processor-specific
layer can provide hooks for when KVM initializes and destroys a VM.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

CVE-2018-3620
CVE-2018-3646

(backported from commit 03543133 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

cpu/hotplug: Boot HT siblings at least once · ffa6cdc8

Thomas Gleixner authored Jun 29, 2018

Due to the way Machine Check Exceptions work on X86 hyperthreads it's
required to boot up _all_ logical cores at least once in order to set the
CR4.MCE bit.

So instead of ignoring the sibling threads right away, let them boot up
once so they can configure themselves. After they came out of the initial
boot stage check whether it's a "secondary" sibling and cancel the operation
which puts the CPU back into offline state.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>

CVE-2018-3620
CVE-2018-3646

[smb: Heavily modified to get around backporting all of the new
      hotplug state machine code.]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

UBUNTU: SAUCE: x86/mce: register mce notifier earlier · 8ce4509c

Stefan Bader authored Jul 11, 2018

Currently the hotplug notifier is registered in device init. This is
too late to handle events at early SMP boot stage. Later upstream code
(around 4.10) changes this, but relies on the work done to make the
whole CPU hotplug code a state machine.

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

Revert "x86/apic: Ignore secondary threads if nosmt=force" · 90128378

Thomas Gleixner authored Jun 29, 2018

Dave Hansen reported that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.

The reason is that Machine Check Exceptions are broadcast to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:

    Because the logical processors within a physical package are tightly
    coupled with respect to shared hardware resources, both logical
    processors are notified of machine check errors that occur within a
    given physical processor. If machine-check exceptions are enabled when
    a fatal error is reported, all the logical processors within a physical
    package are dispatched to the machine-check exception handler. If
    machine-check exceptions are disabled, the logical processors enter the
    shutdown state and assert the IERR# signal. When enabling machine-check
    exceptions, the MCE flag in control register CR4 should be set for each
    logical processor.

Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.

This thoughtfully engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:

MCE is enabled on the boot CPU:

[    0.244017] mce: CPU supports 22 MCE banks

The corresponding sibling #72 boots:

[    1.008005] .... node  #0, CPUs:    #72

That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shut down. At least it's a
known safe state.
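
The size of that window follows directly from the two log timestamps quoted
above. A minimal Python calculation (variable names are illustrative):

```python
mce_enabled_at = 0.244017    # "mce: CPU supports 22 MCE banks"
sibling_boot_at = 1.008005   # "node  #0, CPUs:    #72"

# Time during which core 0 has MCE enabled on only one of its two threads.
window_ms = (sibling_boot_at - mce_enabled_at) * 1000
```

The result is roughly 764 ms, in line with the "about 750ms" stated for the
HSW-EX system.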

It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.

Adjust the nosmt kernel parameter documentation as well.

Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>

CVE-2018-3620
CVE-2018-3646

[smb: Adjust doc path, minor context adjustments]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/speculation/l1tf: Fix up pte->pfn conversion for PAE · 2e165c34

Michal Hocko authored Jun 27, 2018

Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for
CONFIG_PAE because phys_addr_t is wider than unsigned long and so the
pte_val resp. shift left would get truncated. Fix this up by using proper
types.
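
The truncation can be reproduced with plain integer arithmetic. The Python
sketch below emulates a 32-bit unsigned long by masking; the helper names
and the example PFN are hypothetical, and PAGE_SHIFT is the usual x86 value
of 12.

```python
PAGE_SHIFT = 12
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def pfn_to_phys_ulong32(pfn: int) -> int:
    """Shift performed in a 32-bit unsigned long: high bits are lost."""
    return (pfn << PAGE_SHIFT) & MASK32


def pfn_to_phys_u64(pfn: int) -> int:
    """Shift performed in a 64-bit type, as wide as phys_addr_t under PAE."""
    return (pfn << PAGE_SHIFT) & MASK64


pfn = 0x123456                         # a page frame above 4 GiB
truncated = pfn_to_phys_ulong32(pfn)   # 0x23456000, wrong
correct = pfn_to_phys_u64(pfn)         # 0x123456000
```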

Fixes: 6b28baca ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation")
Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>

CVE-2018-3620
CVE-2018-3646

[smb: Drop change to pfn_pud which does not exist]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/speculation/l1tf: Protect PAE swap entries against L1TF · a7554af7

Vlastimil Babka authored Jun 22, 2018

The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, also the swap offset influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.

By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
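
The 4GB and 128GB boundaries mentioned above follow from the bit positions
of the old layout. A small Python check, using a hypothetical helper that
builds the pre-patch PTE value (type in bits 32-36, offset from bit 37, low
word zero):

```python
GIB = 1 << 30
TYPE_SHIFT = 32      # swap type occupies bits 32-36 (5 bits)
OFFSET_SHIFT = 37    # swap offset occupies bits 37-63


def old_pae_swap_pte(swp_type: int, offset: int) -> int:
    """Pre-patch layout: entry lives in the high word, low word is zero."""
    if not 0 <= swp_type < 32:
        raise ValueError("swap type must fit in 5 bits")
    return (swp_type << TYPE_SHIFT) | (offset << OFFSET_SHIFT)


lowest = old_pae_swap_pte(1, 0)         # smallest non-zero pattern
highest_type = old_pae_swap_pte(31, 0)  # largest pattern from type alone
first_offset = old_pae_swap_pte(0, 1)   # smallest pattern from the offset
```

The swap type alone therefore produces addresses from 4 GiB up to just
below 128 GiB, and any non-zero offset starts at 128 GiB.
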
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>

CVE-2018-3620
CVE-2018-3646

[smb: Minor context adjustments]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpufeatures: Add detection of L1D cache flush support. · 3f8e8539

Konrad Rzeszutek Wilk authored Jun 20, 2018

336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set.
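
A minimal sketch of the bit test, assuming the EDX value of CPUID leaf 7 has
already been obtained by other means (has_l1d_flush() is a hypothetical
helper, not the kernel's feature-flag code):

```python
L1D_FLUSH_BIT = 28   # CPUID.7.EDX[28]


def has_l1d_flush(cpuid7_edx: int) -> bool:
    """Return True if the IA32_FLUSH_CMD MSR is advertised."""
    return bool((cpuid7_edx >> L1D_FLUSH_BIT) & 1)
```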

This new MSR "gives software a way to invalidate structures with finer
granularity than other architectural methods like WBINVD."

A copy of this document is available at
  https://bugzilla.kernel.org/show_bug.cgi?id=199511
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings · 35283a3e

Borislav Petkov authored Jun 22, 2018

The TOPOEXT reenablement is a workaround for broken BIOSen which didn't
enable the CPUID bit. amd_get_topology_early(), however, relies on
that bit being set so that it can read out the CPUID leaf and set
smp_num_siblings properly.

Move the reenablement up to early_init_amd(). While at it, simplify
amd_get_topology_early().
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/speculation/l1tf: Extend 64bit swap file size limit · c6c44854

Vlastimil Babka authored Jun 21, 2018

The previous patch has limited swap file size so that large offsets cannot
clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.

It assumed that offsets are encoded starting with bit 12, same as pfn. But
on x86_64, offsets are encoded starting with bit 9.

Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
and 256TB with 46bit MAX_PA.
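
The numbers can be verified with a short calculation. In this Python sketch
max_swap_bytes() is a hypothetical helper: it counts how many page-sized
swap offsets fit below MAX_PA/2 when the offset field starts at a given bit.

```python
PAGE_SHIFT = 12
TIB = 1 << 40


def max_swap_bytes(max_pa_bits: int, offset_first_bit: int) -> int:
    half_pa = 1 << (max_pa_bits - 1)        # MAX_PA / 2
    pages = half_pa >> offset_first_bit     # encodable swap offsets
    return pages << PAGE_SHIFT              # each offset is one 4 KiB page


old_42 = max_swap_bytes(42, 12)   # previous assumption: offset at bit 12
new_42 = max_swap_bytes(42, 9)    # x86_64: offset at bit 9
new_46 = max_swap_bytes(46, 9)
```

Moving the starting bit from 12 to 9 gains a factor of 8, i.e. 2 TiB becomes
16 TiB for a 42bit MAX_PA.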

Fixes: 377eeaa8 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/apic: Ignore secondary threads if nosmt=force · 60709ca7

Thomas Gleixner authored Jun 05, 2018

nosmt on the kernel command line merely prevents the onlining of the
secondary SMT siblings.

nosmt=force makes the APIC detection code ignore the secondary SMT siblings
completely, so they do not even show up as possible CPUs. That reduces the
amount of memory allocations for per cpu variables and saves other
resources from being allocated too large.

This is not fully equivalent to disabling SMT in the BIOS because the low
level SMT enabling in the BIOS can result in partitioning of resources
between the siblings, which is not undone by just ignoring them. Some CPUs
can use the full resources when their sibling is not onlined, but this is
depending on the CPU family and model and it's not well documented whether
this applies to all partitioned resources. That means depending on the
workload disabling SMT in the BIOS might result in better performance.

Linus' analysis of the Intel manual:

  The intel optimization manual is not very clear on what the partitioning
  rules are.

  I find:

    "In general, the buffers for staging instructions between major pipe
     stages are partitioned. These buffers include µop queues after the
     execution trace cache, the queues after the register rename stage, the
     reorder buffer which stages instructions for retirement, and the load
     and store buffers.

     In the case of load and store buffers, partitioning also provided an
     easier implementation to maintain memory ordering for each logical
     processor and detect memory ordering violations"

  but some of that partitioning may be relaxed if the HT thread is "not
  active":

    "In Intel microarchitecture code name Sandy Bridge, the micro-op queue
     is statically partitioned to provide 28 entries for each logical
     processor, irrespective of software executing in single thread or
     multiple threads. If one logical processor is not active in Intel
     microarchitecture code name Ivy Bridge, then a single thread executing
     on that processor core can use the 56 entries in the micro-op queue"

  but I do not know what "not active" means, and how dynamic it is. Some of
  that partitioning may be entirely static and depend on the early BIOS
  disabling of HT, and even if we park the cores, the resources will just be
  wasted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu/AMD: Evaluate smp_num_siblings early · c04c82ff

Thomas Gleixner authored Jun 06, 2018

To support force disabling of SMT it's required to know the number of
thread siblings early. amd_get_topology() cannot be called before the APIC
driver is selected, so split out the part which initializes
smp_num_siblings and invoke it from early_init_amd().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu/intel: Evaluate smp_num_siblings early · 1cdba5ac

Thomas Gleixner authored Jun 06, 2018

Make use of the new early detection function to initialize smp_num_siblings
on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
required for force disabling SMT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu/topology: Provide detect_extended_topology_early() · a5ea4c08

Thomas Gleixner authored Jun 06, 2018

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_extended_topology() cannot be called before
the APIC driver is selected, so split out the part which initializes
smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu/common: Provide detect_ht_early() · ba3ba638

Thomas Gleixner authored Jun 06, 2018

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_ht() cannot be called before the APIC driver
is selected, so split out the part which initializes smp_num_siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu/AMD: Remove the pointless detect_ht() call · 2aa42f41

Thomas Gleixner authored Jun 06, 2018

Real 32bit AMD CPUs do not have SMT and the only value of the call was to
reach the magic printout which got removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/cpu: Remove the pointless CPU printout · f3917fb6

Thomas Gleixner authored Jun 06, 2018

The value of this printout is dubious at best and there is no point in
having it in two different places along with convoluted ways to reach it.

Remove it completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/CPU: Modify detect_extended_topology() to return result · 3565ee4a

Suravee Suthikulpanit authored Apr 27, 2018

The current implementation of detect_extended_topology() does not report
whether it successfully detected the CPUID function 0xB information.
Therefore, modify the function to return success or error codes. This will
be used by subsequent patches.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1524865681-112110-2-git-send-email-suravee.suthikulpanit@amd.com

CVE-2018-3620
CVE-2018-3646

(cherry picked from commit 4779a53c61a198a46525df708071d29d6b14c813)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

cpu/hotplug: Provide knobs to control SMT · df8ddbcb

Thomas Gleixner authored May 29, 2018

Provide a command line and a sysfs knob to control SMT.

The command line options are:

 'nosmt':	Enumerate secondary threads, but do not online them

 'nosmt=force': Ignore secondary threads completely during enumeration
 		via MP table and ACPI/MADT.

The sysfs control file has the following states (read/write):

 'on':		 SMT is enabled. Secondary threads can be freely onlined
 'off':		 SMT is disabled. Secondary threads, even if enumerated,
 		 cannot be onlined
 'forceoff':	 SMT is permanently disabled. Writes to the control
 		 file are rejected.
 'notsupported': SMT is not supported by the CPU

The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.

The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.

When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.

When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.

When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.

When the control status is 'notsupported' then writes to the control file
are rejected.
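
The state transitions described above can be modelled in a few lines of
user-space C. This is only a sketch of the documented semantics, not the
kernel implementation: the enum, the function names and the choice to reject
unknown strings (including a write of 'notsupported') are assumptions made
for this example.

```c
#include <assert.h>
#include <string.h>

enum model_smt_state {
    MODEL_SMT_ON,
    MODEL_SMT_OFF,
    MODEL_SMT_FORCEOFF,
    MODEL_SMT_NOTSUPPORTED,
};

/* Initial control state for a command line option; NULL means no option. */
static enum model_smt_state model_smt_boot_state(const char *opt)
{
    if (opt && strcmp(opt, "nosmt=force") == 0)
        return MODEL_SMT_FORCEOFF;
    if (opt && strcmp(opt, "nosmt") == 0)
        return MODEL_SMT_OFF;
    return MODEL_SMT_ON;
}

/* Secondary threads may only be onlined while the control is 'on'. */
static int model_smt_may_online_secondary(enum model_smt_state state)
{
    return state == MODEL_SMT_ON;
}

/* Models a write to the control file: 0 on success, -1 if rejected. */
static int model_smt_control_write(enum model_smt_state *state,
                                   const char *buf)
{
    assert(state != NULL && buf != NULL);

    /* 'forceoff' and 'notsupported' reject all later writes. */
    if (*state == MODEL_SMT_FORCEOFF || *state == MODEL_SMT_NOTSUPPORTED)
        return -1;

    if (strcmp(buf, "on") == 0)
        *state = MODEL_SMT_ON;
    else if (strcmp(buf, "off") == 0)
        *state = MODEL_SMT_OFF;
    else if (strcmp(buf, "forceoff") == 0)
        *state = MODEL_SMT_FORCEOFF;
    else
        return -1; /* assumption: anything else is rejected */

    return 0;
}
```

The model leaves out the offlining of already online secondary threads on an
'on' -> 'off' transition; it only tracks which writes are accepted and
whether onlining is permitted afterwards.
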
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646

[smb: Modified a lot to cope with old hotplug code]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

x86/topology: Add topology_max_smt_threads() · 09136388

Andi Kleen authored May 19, 2016

For SMT specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.

Add a global variable that tracks the maximum number of SMT threads on
any CPU, updated on CPU online/offline, and use it for
topology_max_smt_threads().

The single call is easier to use than a loop.

Not exported to user space because user space already can use
the existing sibling interfaces to find this out.
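
To illustrate the trade-off, the sketch below keeps a cached maximum that is
refreshed whenever a thread goes online or offline, so that readers need a
single call instead of a loop. It is a simplified user-space model: the
fixed core count, the per-core counters and all model_* names are
hypothetical, and the model recomputes the maximum on every change, which
the kernel code does not necessarily do.

```c
#include <assert.h>

#define MODEL_NR_CORES 4

/* Online hardware threads per core (stand-in for sibling masks). */
static int model_threads_online[MODEL_NR_CORES];

/* Cached maximum over all cores, refreshed on online/offline. */
static int model_max_smt_threads;

/* The loop runs once per hotplug event rather than once per query. */
static void model_recompute_max(void)
{
    int core, max = 0;

    for (core = 0; core < MODEL_NR_CORES; core++) {
        if (model_threads_online[core] > max)
            max = model_threads_online[core];
    }
    model_max_smt_threads = max;
}

static void model_thread_online(int core)
{
    assert(core >= 0 && core < MODEL_NR_CORES);
    model_threads_online[core]++;
    model_recompute_max();
}

static void model_thread_offline(int core)
{
    assert(core >= 0 && core < MODEL_NR_CORES);
    assert(model_threads_online[core] > 0);
    model_threads_online[core]--;
    model_recompute_max();
}

/* Single-call query: a value above 1 means SMT is active somewhere. */
static int model_topology_max_smt_threads(void)
{
    return model_max_smt_threads;
}
```
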
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1463703002-19686-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

CVE-2018-3620
CVE-2018-3646

(cherry picked from commit 70b8301f)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
