    Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 88afbb21
    Linus Torvalds authored
    Pull x86 core updates from Thomas Gleixner:
     "A set of fixes for kexec(), reboot and shutdown issues:
    
       - Ensure that the WBINVD in stop_this_cpu() has been completed before
         the control CPU proceeds.
    
         stop_this_cpu() is used for kexec(), reboot and shutdown to park
         the APs in a HLT loop.
    
         The control CPU sends an IPI to the APs and waits for their CPU
         online bits to be cleared. Once they all are marked "offline" it
         proceeds.
    
         But stop_this_cpu() clears the CPU online bit before issuing
         WBINVD, which means there is no guarantee that the AP has reached
         the HLT loop.
    
         This was reported to cause intermittent reboot/shutdown failures
         due to some dubious interaction with the firmware.
    
         This is not only a problem of WBINVD. The code to actually "stop"
         the CPU which runs between clearing the online bit and reaching the
         HLT loop can cause large enough delays on its own (think
         virtualization). That's especially dangerous for kexec() as kexec()
         expects that all APs are in a safe state and not executing code
         while the boot CPU jumps to the new kernel. There are more issues
         vs kexec() which are addressed separately.
    
         Cure this by implementing an explicit synchronization point right
         before the AP reaches HLT. This guarantees that the AP has
         completed the full stop procedure.
    
       - Fix the condition for WBINVD in stop_this_cpu().
    
         The WBINVD in stop_this_cpu() is required for ensuring that when
         switching to or from memory encryption no dirty data is left in the
         cache lines which might cause a write back in the wrong mode later.
    
         This checks CPUID directly because the feature bit might have been
         cleared due to a command line option.
    
         But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
         Intel CPUs return the content of the highest supported leaf when a
         non-existing leaf is read, while AMD CPUs return all zeros for
         unsupported leaves.
    
         So the result of the test on Intel CPUs is a lottery and on AMD it's
         just correct by chance.
    
         While harmless it's incorrect and causes the conditional wbinvd()
         to be issued where not required, which caused the above issue to be
         unearthed.
    
       - Make kexec() robust against AP code execution
    
         Ashok observed triple faults when doing kexec() on a system which
         had been booted with "nosmt".
    
         It turned out that the SMT siblings which had been brought up
         partially are parked in mwait_play_dead() to enable power savings.
    
         mwait_play_dead() is monitoring the thread flags of the AP's idle
         task, which has been chosen as it's unlikely to be written to.
    
         But kexec() can overwrite the previous kernel text and data
         including page tables etc. When it overwrites the cache lines
         monitored by an AP, that AP resumes execution after the MWAIT on
         possibly overwritten text, stack and page tables, which can
         easily end up in a triple fault.
    
         Make this more robust in several steps:
    
          1) Use an explicit per CPU cache line for monitoring.
    
          2) Write a command to these cache lines to kick APs out of MWAIT
             before proceeding with kexec(), shutdown or reboot.
    
             The APs confirm the wakeup by writing status back and then
             enter a HLT loop.
    
          3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
             APs in INIT state.
    
             HLT is not a guarantee that an AP won't wake up and resume
             execution. HLT is woken up by NMI and SMI. SMI puts the CPU
             back into HLT (+/- firmware bugs), but NMI is delivered to the
             CPU which executes the NMI handler. Same issue as the MWAIT
             scenario described above.
    
             Sending an INIT/INIT sequence to the APs puts them into the
             wait-for-STARTUP state, which is safe against NMI.
    
         There is still an issue remaining which can't be fixed: #MCE
    
         If the AP sits in HLT and receives a broadcast #MCE it will try to
         handle it with the obvious consequences.
    
         INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
         #MCE to shut down the machine.
    
         So there is a choice between fire (HLT) and frying pan (INIT).
         Frying pan has been chosen as it's at least preventing the NMI
         issue.
    
         On systems which are not using INIT/INIT/STARTUP there is not much
         which can be done right now, but at least the obvious and easy to
         trigger MWAIT issue has been addressed"
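
The explicit synchronization point from the first item can be sketched in user-space C. This is only an illustration, not the kernel code: the masks `cpus_online` and `cpus_to_stop` and all function names are hypothetical, and C11 atomics stand in for the kernel's cpumask operations. The point it shows is that the control CPU waits on a mask that each AP clears as its very last step before HLT, instead of waiting on the online bit, which is cleared much earlier.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4u

/* Hypothetical bitmasks, one bit per CPU; all four CPUs start online. */
static atomic_uint cpus_online = (1u << NR_CPUS) - 1u;
static atomic_uint cpus_to_stop;

/* Control CPU: record which CPUs must confirm the stop. */
static void control_request_stop(unsigned int self)
{
    atomic_store(&cpus_to_stop, atomic_load(&cpus_online) & ~(1u << self));
    /* ... the stop IPI would be sent here ... */
}

/* AP, early in the stop path: mark itself offline. */
static void ap_clear_online(unsigned int cpu)
{
    atomic_fetch_and(&cpus_online, ~(1u << cpu));
}

/* AP, last step before the HLT loop: the explicit synchronization point. */
static void ap_reach_sync_point(unsigned int cpu)
{
    atomic_fetch_and(&cpus_to_stop, ~(1u << cpu));
    /* ... the HLT loop would follow here ... */
}

/* Control CPU: true only once every AP has passed the sync point. */
static bool control_may_proceed(void)
{
    return atomic_load(&cpus_to_stop) == 0u;
}
```

In this sketch an AP that has called `ap_clear_online()` but not yet `ap_reach_sync_point()` still blocks the control CPU, which is the window the original online-bit check missed.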
    
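The CPUID problem from the second item comes down to checking the highest supported extended leaf before reading leaf 0x8000001f. The sketch below is a hedged illustration rather than the actual patch: the accessor type `cpuid_eax_fn`, the fake CPUs and the function names are invented for the example, and it assumes that EAX bit 0 of leaf 0x8000001f reports memory encryption support and that leaf 0x80000000 returns the highest extended leaf in EAX.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT_BASE_LEAF 0x80000000u /* EAX: highest supported extended leaf */
#define SME_LEAF      0x8000001fu /* leaf named in the text above */
#define SME_BIT       (1u << 0)   /* assumption: EAX bit 0 = SME supported */

/* Hypothetical CPUID accessor: returns EAX for the requested leaf. */
typedef uint32_t (*cpuid_eax_fn)(uint32_t leaf);

/* Old behavior: reads the leaf unconditionally. */
static bool wbinvd_required_unchecked(cpuid_eax_fn cpuid_eax)
{
    return (cpuid_eax(SME_LEAF) & SME_BIT) != 0;
}

/* Fixed behavior: only trust the leaf if the CPU implements it. */
static bool wbinvd_required(cpuid_eax_fn cpuid_eax)
{
    if (cpuid_eax(EXT_BASE_LEAF) < SME_LEAF)
        return false;
    return (cpuid_eax(SME_LEAF) & SME_BIT) != 0;
}

/* Fake CPU without the leaf; unsupported leaves return unrelated data. */
static uint32_t fake_cpu_without_leaf(uint32_t leaf)
{
    if (leaf == EXT_BASE_LEAF)
        return 0x80000008u;
    return 0xffffffffu;
}

/* Fake CPU that implements the leaf and sets the SME bit. */
static uint32_t fake_cpu_with_sme(uint32_t leaf)
{
    if (leaf == EXT_BASE_LEAF)
        return SME_LEAF;
    if (leaf == SME_LEAF)
        return SME_BIT;
    return 0u;
}
```

With the fake CPU that lacks the leaf, the unchecked variant reports that WBINVD is required purely because of the unrelated data it read, while the checked variant does not.
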
    * tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/smp: Put CPUs into INIT on shutdown if possible
      x86/smp: Split sending INIT IPI out into a helper function
      x86/smp: Cure kexec() vs. mwait_play_dead() breakage
      x86/smp: Use dedicated cache-line for mwait_play_dead()
      x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
      x86/smp: Dont access non-existing CPUID leaf
      x86/smp: Make stop_other_cpus() more robust
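
Steps 1) and 2) of the MWAIT rework can be modeled as a small per-CPU mailbox. This is a simplified user-space sketch under stated assumptions, not the kernel implementation: the struct layout, the command values, the 64-byte cache line size and every function name are hypothetical, and the actual MONITOR/MWAIT wait is replaced by a polling helper.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS    4u
#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Hypothetical command/status values. */
enum { CMD_NONE = 0, CMD_WAIT = 1, CMD_KEXEC_HLT = 2 };

/* One dedicated cache line per CPU, so nothing else shares the monitored line. */
struct mwait_mailbox {
    alignas(CACHE_LINE) atomic_uint control; /* written by the control CPU */
    atomic_uint status;                      /* written by the AP */
};

static struct mwait_mailbox mailbox[NR_CPUS];

/* AP: announce that it is parked and (conceptually) monitoring 'control'. */
static void ap_park(unsigned int cpu)
{
    atomic_store(&mailbox[cpu].status, CMD_WAIT);
    atomic_store(&mailbox[cpu].control, CMD_WAIT);
}

/* AP: called after each wakeup; confirms the command and returns true
 * when the AP should leave the wait and enter the HLT loop. */
static bool ap_wakeup_check(unsigned int cpu)
{
    if (atomic_load(&mailbox[cpu].control) != CMD_KEXEC_HLT)
        return false;
    atomic_store(&mailbox[cpu].status, CMD_KEXEC_HLT);
    return true;
}

/* Control CPU: kick a parked AP; returns false if the AP is not parked. */
static bool control_kick(unsigned int cpu)
{
    if (atomic_load(&mailbox[cpu].status) != CMD_WAIT)
        return false;
    atomic_store(&mailbox[cpu].control, CMD_KEXEC_HLT);
    return true;
}

/* Control CPU: has the AP written its confirmation back? */
static bool control_confirmed(unsigned int cpu)
{
    return atomic_load(&mailbox[cpu].status) == CMD_KEXEC_HLT;
}
```

The control CPU would call `control_kick()` for each AP and wait for `control_confirmed()` before overwriting any memory, so no AP is left monitoring a line that kexec() may later touch.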