1. 13 Jul, 2018 2 commits
  2. 09 Jul, 2018 1 commit
  3. 04 Jul, 2018 10 commits
  4. 02 Jul, 2018 2 commits
    • Thomas Gleixner's avatar
      cpu/hotplug: Boot HT siblings at least once · 0cc3cd21
      Thomas Gleixner authored
      Due to the way Machine Check Exceptions work on X86 hyperthreads it's
      required to boot up _all_ logical cores at least once in order to set the
      CR4.MCE bit.
      
      So instead of ignoring the sibling threads right away, let them boot up
      once so they can configure themselves. After they came out of the initial
      boot stage check whether its a "secondary" sibling and cancel the operation
      which puts the CPU back into offline state.
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarTony Luck <tony.luck@intel.com>
      0cc3cd21
    • Thomas Gleixner's avatar
      Revert "x86/apic: Ignore secondary threads if nosmt=force" · 506a66f3
      Thomas Gleixner authored
      Dave Hansen reported, that it's outright dangerous to keep SMT siblings
      disabled completely so they are stuck in the BIOS and wait for SIPI.
      
      The reason is that Machine Check Exceptions are broadcasted to siblings and
      the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
      logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
      reboots the machine. The MCE chapter in the SDM contains the following
      blurb:
      
          Because the logical processors within a physical package are tightly
          coupled with respect to shared hardware resources, both logical
          processors are notified of machine check errors that occur within a
          given physical processor. If machine-check exceptions are enabled when
          a fatal error is reported, all the logical processors within a physical
          package are dispatched to the machine-check exception handler. If
          machine-check exceptions are disabled, the logical processors enter the
          shutdown state and assert the IERR# signal. When enabling machine-check
          exceptions, the MCE flag in control register CR4 should be set for each
          logical processor.
      
      Reverting the commit which ignores siblings at enumeration time solves only
      half of the problem. The core cpuhotplug logic needs to be adjusted as
      well.
      
      This thoughtful engineered mechanism also turns the boot process on all
      Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
      before the secondary CPUs are brought up. Depending on the number of
      physical cores the window in which this situation can happen is smaller or
      larger. On a HSW-EX it's about 750ms:
      
      MCE is enabled on the boot CPU:
      
      [    0.244017] mce: CPU supports 22 MCE banks
      
      The corresponding sibling #72 boots:
      
      [    1.008005] .... node  #0, CPUs:    #72
      
      That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
      between these two points the machine is going to shutdown. At least it's a
      known safe state.
      
      It's obvious that the early boot can be hit by an MCE as well and then runs
      into the same situation because MCEs are not yet enabled on the boot CPU.
      But after enabling them on the boot CPU, it does not make any sense to
      prevent the kernel from recovering.
      
      Adjust the nosmt kernel parameter documentation as well.
      
      Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarTony Luck <tony.luck@intel.com>
      506a66f3
  5. 29 Jun, 2018 1 commit
  6. 27 Jun, 2018 1 commit
    • Vlastimil Babka's avatar
      x86/speculation/l1tf: Protect PAE swap entries against L1TF · 0d0f6249
      Vlastimil Babka authored
      The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
      offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
      offset. The lower word is zeroed, thus systems with less than 4GB memory are
      safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
      to L1TF; with even more memory, also the swap offfset influences the address.
      This might be a problem with 32bit PAE guests running on large 64bit hosts.
      
      By continuing to keep the whole swap entry in either high or low 32bit word of
      PTE we would limit the swap size too much. Thus this patch uses the whole PAE
      PTE with the same layout as the 64bit version does. The macros just become a
      bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      0d0f6249
  7. 22 Jun, 2018 1 commit
  8. 21 Jun, 2018 17 commits
  9. 20 Jun, 2018 5 commits
    • Andi Kleen's avatar
      x86/speculation/l1tf: Limit swap file size to MAX_PA/2 · 377eeaa8
      Andi Kleen authored
      For the L1TF workaround its necessary to limit the swap file size to below
      MAX_PA/2, so that the higher bits of the swap offset inverted never point
      to valid memory.
      
      Add a mechanism for the architecture to override the swap file size check
      in swapfile.c and add a x86 specific max swapfile check function that
      enforces that limit.
      
      The check is only enabled if the CPU is vulnerable to L1TF.
      
      In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
      with 46bit PA it is 32TB. The limit is only per individual swap file, so
      it's always possible to exceed these limits with multiple swap files or
      partitions.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      
      
      377eeaa8
    • Andi Kleen's avatar
      x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · 42e4089c
      Andi Kleen authored
      For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure to point to not point an unmapped entry to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such an high mapping was mapped to unprivileged users
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      
      Valid page mappings are allowed because the system is then unsafe anyways.
      
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
      the check for root.
      
      For mmaps this is straight forward and can be handled in vm_insert_pfn and
      in remap_pfn_range().
      
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed a lot of state has been changed and it would be difficult to undo
      on an error. Since this is a uncommon case use a separate early page talk
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non MMIO and non PROT_NONE there are no changes.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      
      42e4089c
    • Andi Kleen's avatar
      x86/speculation/l1tf: Add sysfs reporting for l1tf · 17dbca11
      Andi Kleen authored
      L1TF core kernel workarounds are cheap and normally always enabled, However
      they still should be reported in sysfs if the system is vulnerable or
      mitigated. Add the necessary CPU feature/bug bits.
      
      - Extend the existing checks for Meltdowns to determine if the system is
        vulnerable. All CPUs which are not vulnerable to Meltdown are also not
        vulnerable to L1TF
      
      - Check for 32bit non PAE and emit a warning as there is no practical way
        for mitigation due to the limited physical address bits
      
      - If the system has more than MAX_PA/2 physical memory the invert page
        workarounds don't protect the system against the L1TF attack anymore,
        because an inverted physical address will also point to valid
        memory. Print a warning in this case and report that the system is
        vulnerable.
      
      Add a function which returns the PFN limit for the L1TF mitigation, which
      will be used in follow up patches for sanity and range checks.
      
      [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      
      17dbca11
    • Andi Kleen's avatar
      x86/speculation/l1tf: Make sure the first page is always reserved · 10a70416
      Andi Kleen authored
      The L1TF workaround doesn't make any attempt to mitigate speculate accesses
      to the first physical page for zeroed PTEs. Normally it only contains some
      data from the early real mode BIOS.
      
      It's not entirely clear that the first page is reserved in all
      configurations, so add an extra reservation call to make sure it is really
      reserved. In most configurations (e.g.  with the standard reservations)
      it's likely a nop.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      
      10a70416
    • Andi Kleen's avatar
      x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation · 6b28baca
      Andi Kleen authored
      When PTEs are set to PROT_NONE the kernel just clears the Present bit and
      preserves the PFN, which creates attack surface for L1TF speculation
      speculation attacks.
      
      This is important inside guests, because L1TF speculation bypasses physical
      page remapping. While the host has its own migitations preventing leaking
      data from other VMs into the guest, this would still risk leaking the wrong
      page inside the current guest.
      
      This uses the same technique as Linus' swap entry patch: while an entry is
      is in PROTNONE state invert the complete PFN part part of it. This ensures
      that the the highest bit will point to non existing memory.
      
      The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
      pte/pmd/pud_pfn undo it.
      
      This assume that no code path touches the PFN part of a PTE directly
      without using these primitives.
      
      This doesn't handle the case that MMIO is on the top of the CPU physical
      memory. If such an MMIO region was exposed by an unpriviledged driver for
      mmap it would be possible to attack some real memory.  However this
      situation is all rather unlikely.
      
      For 32bit non PAE the inversion is not done because there are really not
      enough bits to protect anything.
      
      Q: Why does the guest need to be protected when the HyperVisor already has
         L1TF mitigations?
      
      A: Here's an example:
      
         Physical pages 1 2 get mapped into a guest as
         GPA 1 -> PA 2
         GPA 2 -> PA 1
         through EPT.
      
         The L1TF speculation ignores the EPT remapping.
      
         Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
         they belong to different users and should be isolated.
      
         A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
         gets read access to the underlying physical page. Which in this case
         points to PA 2, so it can read process B's data, if it happened to be in
         L1, so isolation inside the guest is broken.
      
         There's nothing the hypervisor can do about this. This mitigation has to
         be done in the guest itself.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      
      
      6b28baca