1. 15 Aug, 2018 40 commits
    • x86/speculation/l1tf: Extend 64bit swap file size limit · c4b998c8
      Vlastimil Babka authored
      commit 1a7ed1ba upstream
      
      The previous patch has limited swap file size so that large offsets cannot
      clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.
      
      It assumed that offsets are encoded starting with bit 12, same as pfn. But
      on x86_64, offsets are encoded starting with bit 9.
      
      Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
      and 256TB with 46bit MAX_PA.
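
A quick way to check these figures is the sketch below. It assumes 4 KiB pages and treats the limit as the largest swap file whose encoded offset stays below bit MAX_PA-1 of the PTE. `swap_limit_bytes` and its parameters are illustrative names, not kernel functions.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
TIB = 1 << 40


def swap_limit_bytes(max_pa_bits, first_offset_bit):
    """Largest swap file whose encoded offset stays below bit (max_pa_bits - 1).

    first_offset_bit is the PTE bit where the swap offset starts:
    12 if it were aligned like a PFN, 9 on x86_64 according to the text.
    """
    usable_offset_bits = (max_pa_bits - 1) - first_offset_bit
    pages = 1 << usable_offset_bits
    return pages << PAGE_SHIFT


for max_pa in (42, 46):
    old = swap_limit_bytes(max_pa, 12) // TIB
    new = swap_limit_bytes(max_pa, 9) // TIB
    print(f"MAX_PA={max_pa}: {old} TiB -> {new} TiB")
```

Moving the assumed start of the offset from bit 12 to bit 9 frees three more offset bits, which is where the factor of eight comes from.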
      
      Fixes: 377eeaa8 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic: Ignore secondary threads if nosmt=force · 4a818f2c
      Thomas Gleixner authored
      commit 2207def7 upstream
      
      nosmt on the kernel command line merely prevents the onlining of the
      secondary SMT siblings.
      
      nosmt=force makes the APIC detection code ignore the secondary SMT siblings
      completely, so they do not even show up as possible CPUs. That reduces the
      amount of memory allocated for per-CPU variables and keeps other
      resources from being sized larger than needed.
      
      This is not fully equivalent to disabling SMT in the BIOS because the low
      level SMT enabling in the BIOS can result in partitioning of resources
      between the siblings, which is not undone by just ignoring them. Some CPUs
      can use the full resources when their sibling is not onlined, but this
      depends on the CPU family and model and it's not well documented whether
      this applies to all partitioned resources. That means that, depending on
      the workload, disabling SMT in the BIOS might result in better performance.
      
      Linus' analysis of the Intel manual:
      
        The Intel optimization manual is not very clear on what the partitioning
        rules are.
      
        I find:
      
          "In general, the buffers for staging instructions between major pipe
           stages are partitioned. These buffers include µop queues after the
           execution trace cache, the queues after the register rename stage, the
           reorder buffer which stages instructions for retirement, and the load
           and store buffers.
      
           In the case of load and store buffers, partitioning also provided an
           easier implementation to maintain memory ordering for each logical
           processor and detect memory ordering violations"
      
        but some of that partitioning may be relaxed if the HT thread is "not
        active":
      
          "In Intel microarchitecture code name Sandy Bridge, the micro-op queue
           is statically partitioned to provide 28 entries for each logical
           processor, irrespective of software executing in single thread or
           multiple threads. If one logical processor is not active in Intel
           microarchitecture code name Ivy Bridge, then a single thread executing
           on that processor core can use the 56 entries in the micro-op queue"
      
        but I do not know what "not active" means, and how dynamic it is. Some of
        that partitioning may be entirely static and depend on the early BIOS
        disabling of HT, and even if we park the cores, the resources will just be
        wasted.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/AMD: Evaluate smp_num_siblings early · ae76eb11
      Thomas Gleixner authored
      commit 1e1d7e25 upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. amd_get_topology() cannot be called before the APIC
      driver is selected, so split out the part which initializes
      smp_num_siblings and invoke it from amd_early_init().
      
      [dwmw2: Backport to 4.9]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info · 112d2430
      Borislav Petkov authored
      commit 119bff8a upstream
      
      Old code used to check whether CPUID ext max level is >= 0x80000008 because
      that last leaf contains the number of cores of the physical CPU.  The three
      functions called there now do not depend on that leaf anymore so the check
      can go.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/intel: Evaluate smp_num_siblings early · 0ee6f3b2
      Thomas Gleixner authored
      commit 1910ad56 upstream
      
      Make use of the new early detection function to initialize smp_num_siblings
      on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
      required for force disabling SMT.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/topology: Provide detect_extended_topology_early() · 3b4f20ad
      Thomas Gleixner authored
      commit 95f3d39c upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. detect_extended_topology() cannot be called before
      the APIC driver is selected, so split out the part which initializes
      smp_num_siblings.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/common: Provide detect_ht_early() · 691997bf
      Thomas Gleixner authored
      commit 545401f4 upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. detect_ht() cannot be called before the APIC driver
      is selected, so split out the part which initializes smp_num_siblings.
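
As a rough illustration of the split-out part, the sketch below extracts a sibling count from a CPUID leaf 1 EBX value. It assumes the count sits in EBX bits 23:16 (the legacy logical processor count field). The function name and the sample register values are made up for the example, and any feature checks done before reading the leaf are left out.

```python
def siblings_from_cpuid1_ebx(ebx):
    """Return the logical processor count encoded in CPUID.1:EBX[23:16]."""
    return (ebx >> 16) & 0xFF


# Hypothetical EBX value with the count field set to 2.
sample_ebx = 0x00020800
print(siblings_from_cpuid1_ebx(sample_ebx))
```

Because this only needs a CPUID read, it can run on the boot CPU before any APIC driver has been selected.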
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/AMD: Remove the pointless detect_ht() call · a6d2fa5d
      Thomas Gleixner authored
      commit 44ca36de upstream
      
      Real 32bit AMD CPUs do not have SMT and the only value of the call was to
      reach the magic printout which got removed.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu: Remove the pointless CPU printout · e0439285
      Thomas Gleixner authored
      commit 55e6d279 upstream
      
      The value of this printout is dubious at best and there is no point in
      having it in two different places along with convoluted ways to reach it.
      
      Remove it completely.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Provide knobs to control SMT · f37486c0
      Thomas Gleixner authored
      commit 05736e4a upstream
      
      Provide a command line and a sysfs knob to control SMT.
      
      The command line options are:
      
       'nosmt':	Enumerate secondary threads, but do not online them
      
       'nosmt=force': Ignore secondary threads completely during enumeration
       		via MP table and ACPI/MADT.
      
      The sysfs control file has the following states (read/write):
      
       'on':		 SMT is enabled. Secondary threads can be freely onlined
       'off':		 SMT is disabled. Secondary threads, even if enumerated
       		 cannot be onlined
       'forceoff':	 SMT is permanently disabled. Writes to the control
       		 file are rejected.
       'notsupported': SMT is not supported by the CPU
      
      The command line option 'nosmt' sets the sysfs control to 'off'. This
      can be changed to 'on' to reenable SMT during runtime.
      
      The command line option 'nosmt=force' sets the sysfs control to
      'forceoff'. This cannot be changed during runtime.
      
      When SMT is 'on' and the control file is changed to 'off' then all online
      secondary threads are offlined and attempts to online a secondary thread
      later on are rejected.
      
      When SMT is 'off' and the control file is changed to 'on' then secondary
      threads can be onlined again. The 'off' -> 'on' transition does not
      automatically online the secondary threads.
      
      When the control file is set to 'forceoff', the behaviour is the same as
      setting it to 'off', but the operation is irreversible and later writes to
      the control file are rejected.
      
      When the control status is 'notsupported' then writes to the control file
      are rejected.
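
The state rules above can be summarised as a small toy model. This is a sketch only: the class, its method names, and the use of `PermissionError` for rejected operations are assumptions made for illustration and do not mirror the kernel implementation or its error codes.

```python
class SmtControl:
    """Toy model of the SMT control file semantics described above."""

    def __init__(self, cmdline="", smt_supported=True, secondary_cpus=()):
        if not smt_supported:
            self.state = "notsupported"
        elif cmdline == "nosmt=force":
            self.state = "forceoff"
        elif cmdline == "nosmt":
            self.state = "off"
        else:
            self.state = "on"
        self.secondary = set(secondary_cpus)
        # Secondary threads are only online at boot when SMT is 'on'.
        self.online = set(self.secondary) if self.state == "on" else set()

    def write(self, value):
        """Model a write to the control file."""
        if value not in ("on", "off", "forceoff"):
            raise ValueError(value)
        if self.state in ("forceoff", "notsupported"):
            raise PermissionError("writes are rejected in state " + self.state)
        if value in ("off", "forceoff"):
            # Turning SMT off takes all online secondary threads offline.
            self.online.clear()
        # 'off' -> 'on' only permits onlining again; nothing is onlined here.
        self.state = value

    def online_cpu(self, cpu):
        """Model an attempt to online a secondary thread."""
        if cpu in self.secondary and self.state != "on":
            raise PermissionError("SMT is " + self.state)
        self.online.add(cpu)


ctl = SmtControl(cmdline="nosmt", secondary_cpus={1, 3})
ctl.write("on")      # allowed: 'nosmt' can be undone at runtime
ctl.online_cpu(1)    # secondary threads must be onlined explicitly
print(ctl.state, sorted(ctl.online))
```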
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Split do_cpu_down() · 373b8def
      Thomas Gleixner authored
      commit cc1fe215 upstream
      
      Split out the inner workings of do_cpu_down() to allow reuse of that
      function for the upcoming SMT disabling mechanism.
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Make bringup/teardown of smp threads symmetric · 9333575f
      Thomas Gleixner authored
      commit c4de6569 upstream
      
      The asymmetry caused a warning to trigger if the bootup was stopped in state
      CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
      now be invoked on already or still parked threads. But there is still no
      reason to have this be asymmetric.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/topology: Provide topology_smt_supported() · 16fd33cd
      Thomas Gleixner authored
      commit f048c399 upstream
      
      Provide information on whether SMT is supported by the CPUs. Preparatory
      patch for the SMT control mechanism.
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/smp: Provide topology_is_primary_thread() · 7b69a96e
      Thomas Gleixner authored
      commit 6a4d2657 upstream
      
      If the CPU supports SMT then the primary thread can be found by
      checking the lower APIC ID bits for zero. smp_num_siblings is used to build
      the mask for the APIC ID bits which need to be taken into account.
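
A minimal sketch of that check, assuming smp_num_siblings is a power of two so that the thread number occupies the lowest log2(smp_num_siblings) bits of the APIC ID. The function name is illustrative, and `int.bit_length()` stands in for a find-last-set helper.

```python
def is_primary_thread(apicid, smp_num_siblings):
    """Return True if the low (SMT) bits of the APIC ID are all zero."""
    if smp_num_siblings == 1:
        # No SMT: every CPU is its own primary thread.
        return True
    # Number of low APIC ID bits that select the thread within a core.
    smt_bits = smp_num_siblings.bit_length() - 1
    mask = (1 << smt_bits) - 1
    return (apicid & mask) == 0


# With two siblings per core, even APIC IDs are primary threads.
print([apicid for apicid in range(8) if is_primary_thread(apicid, 2)])
```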
      
      This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
      than the initial APIC ID in CPUID. But according to AMD the lower bits have
      to be consistent. Intel gave a tentative confirmation as well.
      
      Preparatory patch to support disabling SMT at boot/runtime.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/bugs: Move the l1tf function and define pr_fmt properly · 1ac1dc14
      Konrad Rzeszutek Wilk authored
      commit 56563f53 upstream
      
      The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt
      which was defined as "Spectre V2 : ".
      
      Move the function to be past SSBD and also define the pr_fmt.
      
      Fixes: 17dbca11 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Limit swap file size to MAX_PA/2 · e3923475
      Andi Kleen authored
      commit 377eeaa8 upstream
      
      For the L1TF workaround it's necessary to limit the swap file size to below
      MAX_PA/2, so that the higher bits of the inverted swap offset never point
      to valid memory.
      
      Add a mechanism for the architecture to override the swap file size check
      in swapfile.c and add an x86 specific max swapfile check function that
      enforces that limit.
      
      The check is only enabled if the CPU is vulnerable to L1TF.
      
      In VMs with 42bit MAX_PA the typical limit is 2TB now; on a native system
      with 46bit PA it is 32TB. The limit is only per individual swap file, so
      it's always possible to exceed these limits with multiple swap files or
      partitions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · 7c5b42f8
      Andi Kleen authored
      commit 42e4089c upstream
      
      For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure that an unmapped entry does not point to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such a high mapping was mapped to unprivileged users
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      
      Valid page mappings are allowed because the system is then unsafe anyways.
      
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and the check
      is skipped for root.
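
The rule can be sketched as a single predicate. This is an illustration under stated assumptions: 4 KiB pages, a caller that already knows whether the PFN is ordinary RAM and whether the task is privileged, and hypothetical function and parameter names.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def is_high_mmio_pfn(pfn, max_pa_bits):
    """True if the MAX_PA-1 physical address bit is set for this PFN."""
    return bool((pfn >> (max_pa_bits - 1 - PAGE_SHIFT)) & 1)


def prot_none_mapping_allowed(pfn, prot_none, is_ram, is_root, max_pa_bits):
    """Decide whether a mapping request passes the high MMIO check."""
    if not prot_none:
        return True   # only PROT_NONE mappings are restricted
    if is_ram:
        return True   # valid page mappings are allowed
    if is_root:
        return True   # the check is skipped for root
    return not is_high_mmio_pfn(pfn, max_pa_bits)


high_pfn = 1 << (46 - 1 - PAGE_SHIFT)  # lowest PFN with bit 45 set
print(prot_none_mapping_allowed(high_pfn, True, False, False, 46))
```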
      
      For mmaps this is straightforward and can be handled in vm_insert_pfn and
      in remap_pfn_range().
      
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed a lot of state has been changed and it would be difficult to undo
      on an error. Since this is an uncommon case, use a separate early page table
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non MMIO and non PROT_NONE there are no changes.
      
      [dwmw2: Backport to 4.9]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Add sysfs reporting for l1tf · 432e99b3
      Andi Kleen authored
      commit 17dbca11 upstream
      
      L1TF core kernel workarounds are cheap and normally always enabled. However,
      they still should be reported in sysfs if the system is vulnerable or
      mitigated. Add the necessary CPU feature/bug bits.
      
      - Extend the existing checks for Meltdown to determine if the system is
        vulnerable. All CPUs which are not vulnerable to Meltdown are also not
        vulnerable to L1TF
      
      - Check for 32bit non PAE and emit a warning as there is no practical way
        for mitigation due to the limited physical address bits
      
      - If the system has more than MAX_PA/2 physical memory the invert page
        workarounds don't protect the system against the L1TF attack anymore,
        because an inverted physical address will also point to valid
        memory. Print a warning in this case and report that the system is
        vulnerable.
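
The MAX_PA/2 condition in the last item can be illustrated with plain address arithmetic. The sketch models the inversion on whole physical addresses within MAX_PA bits, which is a simplification, and both function names are hypothetical.

```python
GIB = 1 << 30


def inverted_address(phys_addr, max_pa_bits):
    """Invert an address within the CPU's physical address width."""
    return ~phys_addr & ((1 << max_pa_bits) - 1)


def inversion_protects(ram_end, max_pa_bits):
    """True if all RAM lies below MAX_PA/2 (ram_end is one past the last byte)."""
    return ram_end <= (1 << (max_pa_bits - 1))


max_pa = 36  # 64 GiB of physical address space in this example
# 16 GiB of RAM: every inverted RAM address lands above the end of RAM.
print(inversion_protects(16 * GIB, max_pa))
# 48 GiB of RAM: the address at 40 GiB inverts to just below 24 GiB,
# which is valid memory again.
print(inversion_protects(48 * GIB, max_pa),
      inverted_address(40 * GIB, max_pa) < 48 * GIB)
```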
      
      Add a function which returns the PFN limit for the L1TF mitigation, which
      will be used in follow up patches for sanity and range checks.
      
      [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
      [ dwmw2: Backport to 4.9 (cpufeatures.h, E820) ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Make sure the first page is always reserved · 5b2ec92f
      Andi Kleen authored
      commit 10a70416 upstream
      
      The L1TF workaround doesn't make any attempt to mitigate speculative accesses
      to the first physical page for zeroed PTEs. Normally it only contains some
      data from the early real mode BIOS.
      
      It's not entirely clear that the first page is reserved in all
      configurations, so add an extra reservation call to make sure it is really
      reserved. In most configurations (e.g.  with the standard reservations)
      it's likely a nop.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation · 33182fe9
      Andi Kleen authored
      commit 6b28baca upstream
      
      When PTEs are set to PROT_NONE the kernel just clears the Present bit and
      preserves the PFN, which creates attack surface for L1TF speculation
      attacks.
      
      This is important inside guests, because L1TF speculation bypasses physical
      page remapping. While the host has its own mitigations preventing leaking
      data from other VMs into the guest, this would still risk leaking the wrong
      page inside the current guest.
      
      This uses the same technique as Linus' swap entry patch: while an entry is
      in PROTNONE state, invert the complete PFN part of it. This ensures
      that the highest bit will point to non existing memory.
      
      The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
      pte/pmd/pud_pfn undo it.
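
A minimal sketch of the invert and undo pair on a 64bit PTE value. The bit positions (Present at bit 0, PROTNONE at bit 8, a 52 bit physical mask) and all function names are assumptions for illustration; the real primitives are the pte/pmd/pud helpers named above.

```python
PAGE_SHIFT = 12
PHYS_BITS = 52                     # assumed architectural maximum
PFN_MASK = ((1 << PHYS_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)
PRESENT = 1 << 0                   # assumed bit position
PROTNONE = 1 << 8                  # assumed bit position


def needs_invert(val):
    """PROTNONE set while Present is clear marks a PROT_NONE entry."""
    return (val & (PRESENT | PROTNONE)) == PROTNONE


def make_pte(pfn, flags):
    """Build a PTE, inverting the PFN field for PROT_NONE entries."""
    val = (pfn << PAGE_SHIFT) & PFN_MASK
    if needs_invert(flags):
        val ^= PFN_MASK
    return val | flags


def pte_pfn(pte):
    """Recover the PFN, undoing the inversion where it was applied."""
    val = pte
    if needs_invert(pte):
        val ^= PFN_MASK
    return (val & PFN_MASK) >> PAGE_SHIFT


pte = make_pte(0x1234, PROTNONE)
print(hex(pte), hex(pte_pfn(pte)))
```

For a low PFN the stored value has the top PFN bit set, so a speculative load that treats it as a physical address lands outside populated memory, while the round trip through `pte_pfn` still yields the original PFN.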
      
      This assumes that no code path touches the PFN part of a PTE directly
      without using these primitives.
      
      This doesn't handle the case that MMIO is at the top of the CPU physical
      memory. If such an MMIO region was exposed by an unprivileged driver for
      mmap it would be possible to attack some real memory. However this
      situation is all rather unlikely.
      
      For 32bit non PAE the inversion is not done because there are really not
      enough bits to protect anything.
      
      Q: Why does the guest need to be protected when the hypervisor already has
         L1TF mitigations?
      
      A: Here's an example:
      
         Physical pages 1 2 get mapped into a guest as
         GPA 1 -> PA 2
         GPA 2 -> PA 1
         through EPT.
      
         The L1TF speculation ignores the EPT remapping.
      
         Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
         they belong to different users and should be isolated.
      
         A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
         gets read access to the underlying physical page. Which in this case
         points to PA 2, so it can read process B's data, if it happened to be in
         L1, so isolation inside the guest is broken.
      
         There's nothing the hypervisor can do about this. This mitigation has to
         be done in the guest itself.
      
      [ tglx: Massaged changelog ]
      [ dwmw2: backported to 4.9 ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Protect swap entries against L1TF · 60712274
      Linus Torvalds authored
      commit 2f22b4cd upstream
      
      With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting
      side effects allow reading the memory the PTE is pointing to, if its
      values are still in the L1 cache.
      
      For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
      them.
      
      To protect against L1TF it must be ensured that the swap entry is not
      pointing to valid memory, which requires setting higher bits (between bit
      36 and bit 45) that are inside the CPU's physical address space, but outside
      any real memory.
      
      To do this, invert the offset to make sure the higher bits are always set,
      as long as the swap file is not too big.
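
The inversion can be sketched on a 64bit value as follows. It assumes the swap entry layout of the preparatory change (offset in bits 9-58, type in bits 59-63); the function names are illustrative and Python integers are masked to 64 bits by hand.

```python
MASK64 = (1 << 64) - 1
SWP_TYPE_BITS = 5
SWP_OFFSET_FIRST_BIT = 9
SWP_OFFSET_SHIFT = SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS  # 14


def swp_entry(swp_type, offset):
    """Encode type and inverted offset into a non-present PTE value."""
    inverted = ~offset & MASK64
    # Shift left to drop the top 14 bits, then right by 5 so the
    # inverted offset ends up in bits 9-58.
    offset_field = ((inverted << SWP_OFFSET_SHIFT) & MASK64) >> SWP_TYPE_BITS
    return offset_field | (swp_type << (64 - SWP_TYPE_BITS))


def swp_type(entry):
    return entry >> (64 - SWP_TYPE_BITS)


def swp_offset(entry):
    """Undo the inversion and extract the offset again."""
    return (((~entry & MASK64) << SWP_TYPE_BITS) & MASK64) >> SWP_OFFSET_SHIFT


entry = swp_entry(3, 12345)
print(hex(entry), swp_type(entry), swp_offset(entry))
```

For a small offset the unused high offset bits are zero, so after inversion bits 36-45 of the PTE are all set.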
      
      Note there is no workaround for 32bit !PAE, or on systems which have more
      than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
      real systems.
      
      [AK: updated description and minor tweaks. Split out from the original
           patch ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Change order of offset/type in swap entry · 2c9b57e4
      Linus Torvalds authored
      commit bcd11afa upstream
      
      If pages are swapped out, the swap entry is stored in the corresponding
      PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
      on PTE entries which have the Present bit cleared and would treat the swap
      entry as a physical address (PFN). To mitigate that the upper bits of the PTE
      must be set so the PTE points to non existent memory.
      
      The swap entry stores the type and the offset of a swapped out page in the
      PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
      ignores the bits beyond the physical address space limit, so to make the
      mitigation effective it's required to start 'offset' at the lowest possible
      bit so that even large swap offsets do not reach into the physical address
      space limit bits.
      
      Move offset to bit 9-58 and type to bit 59-63 which are the bits that
      hardware generally doesn't care about.
      
      That, in turn, means that if you are on a desktop chip with only 40 bits of
      physical addressing, now that the offset starts at bit 9, there need to be
      30 bits of offset actually *in use* until bit 39 ends up being set, which
      means when inverted it will again point into existing memory.
      
      So that's 4 terabytes of swap space (because the offset is counted in pages,
      so 30 bits of offset is 42 bits of actual coverage). With bigger physical
      addressing, that obviously grows further, until the limit of the offset is
      hit (at 50 bits of offset - 62 bits of actual swap file coverage).
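
The numbers above can be reproduced with the new layout before any inversion is applied. The sketch assumes 4 KiB pages; the helper names are hypothetical.

```python
SWP_OFFSET_FIRST_BIT = 9
SWP_TYPE_SHIFT = 59
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def swp_entry_plain(swp_type, offset):
    """New layout without inversion: offset in bits 9-58, type in 59-63."""
    return (offset << SWP_OFFSET_FIRST_BIT) | (swp_type << SWP_TYPE_SHIFT)


def first_offset_reaching(bit):
    """Smallest swap offset (in pages) whose encoding sets the given PTE bit."""
    return 1 << (bit - SWP_OFFSET_FIRST_BIT)


phys_bits = 40
offset = first_offset_reaching(phys_bits - 1)
offset_bits = offset.bit_length() - 1        # offset bits in use below bit 39
coverage_tib = (offset << PAGE_SHIFT) >> 40  # swap space covered, in TiB
print(offset_bits, coverage_tib)             # 30 4
```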
      
      This is a preparatory change for the actual swap entry inversion to protect
      against L1TF.
      
      [ AK: Updated description and minor tweaks. Split into two parts ]
      [ tglx: Massaged changelog ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 · 1a4922e0
      Naoya Horiguchi authored
      commit eee4818b upstream
      
      _PAGE_PSE is used to distinguish between a truly non-present
      (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
      should be treated as present.
      
      But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
      cause confusion between one of those PMDs undergoing a THP split, and a
      soft-dirty PMD.  Dropping _PAGE_PSE check in pmd_present() does not work
      well, because it can hurt optimization of tlb handling in thp split.
      
      Thus, we need to move the bit.
      
      In the current kernel, bits 1-4 are not used in non-present format since
      commit 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work
      around erratum").  So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.  Bit 7
      is used as reserved (always clear), so please don't use it for other
      purposes.
      
      [dwmw2: Pulled in to 4.9 backport to support L1TF changes]
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT · bbd07cbb
      Andi Kleen authored
      commit 50896e18 upstream
      
      L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
      speculates on PTE entries which do not have the PRESENT bit set, if the
      content of the resulting physical address is available in the L1D cache.
      
      The OS side mitigation makes sure that a !PRESENT PTE entry points to a
      physical address outside the actually existing and cacheable memory
      space. This is achieved by inverting the upper bits of the PTE. Due to the
      address space limitations this only works for 64bit and 32bit PAE kernels,
      but not for 32bit non PAE.
      
      This mitigation applies to both host and guest kernels, but in case of a
      64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
      the PAE address space (44bit) is not enough if the host has more than 43
      bits of populated memory address space, because the speculation treats the
      PTE content as a physical host address bypassing EPT.
      
      The host (hypervisor) protects itself against the guest by flushing L1D as
      needed, but pages inside the guest are not protected against attacks from
      other processes inside the same guest.
      
      For the guest the inverted PTE mask has to match the host to provide the
      full protection for all pages the host could possibly map into the
      guest. The host's populated address space is not known to the guest, so the
      mask must cover the possible maximal host address space, i.e. 52 bit.
      
      On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
      is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
      the mask to be below what the host could possibly use for physical pages.
      
      The L1TF PROT_NONE protection code uses the PTE masks to determine which
      bits to invert to make sure the higher bits are set for unmapped entries to
      prevent L1TF speculation attacks against EPT inside guests.
      
      In order to invert all bits that could be used by the host, increase
      __PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
      
      The real limit for a 32bit PAE kernel is still 44 bits because all Linux
      PTEs are created from unsigned long PFNs, so they cannot be higher than 44
      bits on a 32bit kernel. So these extra PFN bits should never be set. The
      only users of this macro are using it to look at PTEs, so it's safe.
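      
      The effect of the wider mask can be sketched in userspace as follows.
      This is an illustration under stated assumptions, not the kernel code:
      the DEMO_ names are hypothetical, the address field is taken to be
      bits [12, shift) of a 64-bit entry, and the inversion is modeled as a
      plain XOR with that mask.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u  /* 4 KiB pages */

/* Mask of PTE bits [DEMO_PAGE_SHIFT, phys_shift): the physical address field. */
uint64_t demo_pfn_mask(unsigned int phys_shift)
{
	assert(phys_shift > DEMO_PAGE_SHIFT && phys_shift < 64);
	return ((UINT64_C(1) << phys_shift) - 1) &
	       ~((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1);
}

/* Flip every address bit of a !PRESENT entry that the mask covers. */
uint64_t demo_invert_nonpresent(uint64_t pte, unsigned int phys_shift)
{
	return pte ^ demo_pfn_mask(phys_shift);
}
```

      With a 44 bit mask, bits 44..51 of an inverted entry stay clear, so the
      entry can still name an address the host has populated. With a 52 bit
      mask those bits are set as well.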
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbd07cbb
    • Nick Desaulniers's avatar
      x86/irqflags: Provide a declaration for native_save_fl · 329d8156
      Nick Desaulniers authored
      commit 208cbb32 upstream.
      
      It was reported that the commit d0a8d937 is causing users of gcc < 4.9
      to observe -Werror=missing-prototypes errors.
      
      Indeed, it seems that:
      extern inline unsigned long native_save_fl(void) { return 0; }
      
      compiled with -Werror=missing-prototypes produces this warning in gcc <
      4.9, but not gcc >= 4.9.
      
      Fixes: d0a8d937 ("x86/paravirt: Make native_save_fl() extern inline").
      Reported-by: David Laight <david.laight@aculab.com>
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: kstewart@linuxfoundation.org
      Cc: gregkh@linuxfoundation.org
      Cc: boris.ostrovsky@oracle.com
      Cc: astrachan@google.com
      Cc: mka@chromium.org
      Cc: arnd@arndb.de
      Cc: tstellar@redhat.com
      Cc: sedat.dilek@gmail.com
      Cc: David.Laight@aculab.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      329d8156
    • Masami Hiramatsu's avatar
      kprobes/x86: Fix %p uses in error messages · a92daabd
      Masami Hiramatsu authored
      commit 0ea06330 upstream.
      
      Remove all %p uses in error messages in kprobes/x86.
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jon Medhurst <tixy@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Richter <tmricht@linux.ibm.com>
      Cc: Tobin C . Harding <me@tobin.cc>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: acme@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: brueckner@linux.vnet.ibm.com
      Cc: linux-arch@vger.kernel.org
      Cc: rostedt@goodmis.org
      Cc: schwidefsky@de.ibm.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a92daabd
    • Jiri Kosina's avatar
      x86/speculation: Protect against userspace-userspace spectreRSB · 6455f41d
      Jiri Kosina authored
      commit fdf82a78 upstream.
      
      The article "Spectre Returns! Speculation Attacks using the Return Stack
      Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
      making use solely of the RSB contents even on CPUs that don't fall back to
      BTB on RSB underflow (Skylake+).
      
      Mitigate userspace-userspace attacks by unconditionally filling the RSB on
      context switch when the generic spectrev2 mitigation has been enabled.
      
      [1] https://arxiv.org/pdf/1807.07940.pdf
      
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1807261308190.997@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6455f41d
    • Peter Zijlstra's avatar
      x86/paravirt: Fix spectre-v2 mitigations for paravirt guests · 640fe070
      Peter Zijlstra authored
      commit 5800dc5c upstream.
      
      Nadav reported that on guests we're failing to rewrite the indirect
      calls to CALLEE_SAVE paravirt functions. In particular the
      pv_queued_spin_unlock() call is left unpatched and that is all over the
      place. This obviously wrecks Spectre-v2 mitigation (for paravirt
      guests) which relies on not actually having indirect calls around.
      
      The reason is an incorrect clobber test in paravirt_patch_call(); this
      function rewrites an indirect call with a direct call to the _SAME_
      function, there is no possible way the clobbers can be different
      because of this.
      
      Therefore remove this clobber check. Also put WARNs on the other patch
      failure case (not enough room for the instruction) which I've not seen
      trigger in my (limited) testing.
      
      Three live kernel image disassemblies for lock_sock_nested (as a small
      function that illustrates the problem nicely). PRE is the current
      situation for guests, POST is with this patch applied and NATIVE is with
      or without the patch for !guests.
      
      PRE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  *0xffffffff822299e8
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      POST:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
         0xffffffff817be9a5 <+53>:    xchg   %ax,%ax
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063aa0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      NATIVE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    movb   $0x0,(%rdi)
         0xffffffff817be9a3 <+51>:    nopl   0x0(%rax)
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      
      Fixes: 63f70270 ("[PATCH] i386: PARAVIRT: add common patching machinery")
      Fixes: 3010a066 ("x86/paravirt, objtool: Annotate indirect calls")
      Reported-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      640fe070
    • Oleksij Rempel's avatar
      ARM: dts: imx6sx: fix irq for pcie bridge · 16aeb3f1
      Oleksij Rempel authored
      commit 1bcfe056 upstream.
      
      Use the correct IRQ line for the MSI controller in the PCIe host
      controller. Apparently a different IRQ line is used compared to other
      i.MX6 variants. Without this change MSI IRQs aren't properly propagated
      to the upstream interrupt controller.
      
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
      Fixes: b1d17f68 ("ARM: dts: imx: add initial imx6sx device tree source")
      Signed-off-by: Shawn Guo <shawnguo@kernel.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      16aeb3f1
    • Michael Mera's avatar
      IB/ocrdma: fix out of bounds access to local buffer · 27250cf8
      Michael Mera authored
      commit 062d0f22 upstream.
      
      On a write to the debugfs file 'resource_stats', the local buffer
      'tmp_str' is written at index 'count-1', where 'count' is the size of
      the write and can therefore be 0.
      
      This patch filters odd values for the write size/position to avoid this
      type of problem.
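      
      A minimal userspace sketch of this kind of filtering, assuming a
      fixed-size local buffer that is terminated at index count-1. The
      function name, the buffer size and the exact set of rejected values
      are assumptions made for illustration, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_LEN 32  /* hypothetical size of the local buffer */

/*
 * Copy a write of `count` bytes at offset `pos` into `dst` (which must hold
 * DEMO_BUF_LEN bytes) and terminate it at index count-1.
 * Returns 0 on success, -1 if the size/position is rejected.
 */
int demo_copy_and_terminate(char *dst, const char *src, size_t count, long pos)
{
	assert(dst != NULL && src != NULL);

	/* count == 0 would index dst[-1]; count > DEMO_BUF_LEN would overflow. */
	if (pos != 0 || count == 0 || count > DEMO_BUF_LEN)
		return -1;

	memcpy(dst, src, count);
	dst[count - 1] = '\0';  /* overwrites the trailing newline, if any */
	return 0;
}
```

      The check runs before any index is computed, so a zero-length write
      never reaches the dst[count - 1] store.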
      
      Signed-off-by: Michael Mera <dev@michaelmera.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      27250cf8
    • Fabio Estevam's avatar
      mtd: nand: qcom: Add a NULL check for devm_kasprintf() · 5ee45fc9
      Fabio Estevam authored
      commit 069f0534 upstream.
      
      devm_kasprintf() may fail, so add a NULL check and propagate an error
      on failure.
      
      Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
      Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ee45fc9
    • Jack Morgenstein's avatar
      IB/mlx4: Mark user MR as writable if actual virtual memory is writable · e2ba7bf1
      Jack Morgenstein authored
      commit d8f9cc32 upstream.
      
      To allow rereg_user_mr to modify the MR from read-only to writable without
      using get_user_pages again, we needed to define the initial MR as writable.
      However, this was originally done unconditionally, without taking into
      account the writability of the underlying virtual memory.
      
      As a result, any attempt to register a read-only MR over read-only
      virtual memory failed.
      
      To fix this, do not add the writable flag bit when the user virtual memory
      is not writable (e.g. const memory).
      
      However, when the underlying memory is NOT writable (and we therefore
      do not define the initial MR as writable), the IB core adds a
      "force writable" flag to its user-pages request. If this succeeds,
      the reg_user_mr caller gets a writable copy of the original pages.
      
      If the user-space caller then does a rereg_user_mr operation to enable
      writability, this will succeed. This should not be allowed, since
      the original virtual memory was not writable.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 9376932d ("IB/mlx4_ib: Add support for user MR re-registration")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2ba7bf1
    • Jack Morgenstein's avatar
      IB/core: Make testing MR flags for writability a static inline function · 11410f99
      Jack Morgenstein authored
      commit 08bb558a upstream.
      
      Make the MR writability flags check, which is performed in umem.c,
      a static inline function in file ib_verbs.h.
      
      This allows the function to be used by low-level infiniband drivers.
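      
      As a rough sketch of what such a helper might look like, assuming the
      check amounts to testing a set of write-implying access bits. The
      DEMO_ flag names, their values and the choice of which flags imply
      writability are assumptions of this example and are not copied from
      ib_verbs.h.

```c
#include <stdbool.h>

/* Illustrative access flags; names and values are assumed for this sketch. */
enum demo_access_flags {
	DEMO_ACCESS_LOCAL_WRITE   = 1 << 0,
	DEMO_ACCESS_REMOTE_WRITE  = 1 << 1,
	DEMO_ACCESS_REMOTE_READ   = 1 << 2,
	DEMO_ACCESS_REMOTE_ATOMIC = 1 << 3,
	DEMO_ACCESS_MW_BIND       = 1 << 4,
};

/*
 * Header-style helper: true if any requested access right would let the
 * device write to the registered memory.
 */
static inline bool demo_access_writable(int access_flags)
{
	return (access_flags & (DEMO_ACCESS_LOCAL_WRITE |
				DEMO_ACCESS_REMOTE_WRITE |
				DEMO_ACCESS_REMOTE_ATOMIC |
				DEMO_ACCESS_MW_BIND)) != 0;
}
```

      Keeping the test in a header as a static inline lets both the core and
      a low-level driver apply the same definition of "writable" without an
      exported symbol.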
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      11410f99
    • Eric W. Biederman's avatar
      proc: Fix proc_sys_prune_dcache to hold a sb reference · a3a7b992
      Eric W. Biederman authored
      commit 2fd1d2c4 upstream.
      
      Andrei Vagin writes:
      FYI: This bug has been reproduced on 4.11.7
      > BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
      > ------------[ cut here ]------------
      > WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
      > CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
      > Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
      > Workqueue: events proc_cleanup_work
      > Call Trace:
      >  dump_stack+0x63/0x86
      >  __warn+0xcb/0xf0
      >  warn_slowpath_null+0x1d/0x20
      >  umount_check+0x6e/0x80
      >  d_walk+0xc6/0x270
      >  ? dentry_free+0x80/0x80
      >  do_one_tree+0x26/0x40
      >  shrink_dcache_for_umount+0x2d/0x90
      >  generic_shutdown_super+0x1f/0xf0
      >  kill_anon_super+0x12/0x20
      >  proc_kill_sb+0x40/0x50
      >  deactivate_locked_super+0x43/0x70
      >  deactivate_super+0x5a/0x60
      >  cleanup_mnt+0x3f/0x90
      >  mntput_no_expire+0x13b/0x190
      >  kern_unmount+0x3e/0x50
      >  pid_ns_release_proc+0x15/0x20
      >  proc_cleanup_work+0x15/0x20
      >  process_one_work+0x197/0x450
      >  worker_thread+0x4e/0x4a0
      >  kthread+0x109/0x140
      >  ? process_one_work+0x450/0x450
      >  ? kthread_park+0x90/0x90
      >  ret_from_fork+0x2c/0x40
      > ---[ end trace e1c109611e5d0b41 ]---
      > VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
      > BUG: unable to handle kernel NULL pointer dereference at           (null)
      > IP: _raw_spin_lock+0xc/0x30
      > PGD 0
      
      Fix this by taking a reference to the super block in proc_sys_prune_dcache.
      
      The superblock reference is the core of the fix; however, the sysctl_inodes
      list is also converted to a hlist so that hlist_del_init_rcu may be used.  This
      allows proc_sys_prune_dcache to remove inodes from the sysctl_inodes list, while
      not causing problems for proc_sys_evict_inode if it later chooses to
      remove the inode from the sysctl_inodes list.  Removing inodes from the
      sysctl_inodes list allows proc_sys_prune_dcache to have a progress
      guarantee, while still being able to drop all locks.  The fact that
      head->unregistering is set in start_unregistering ensures that no more
      inodes will be added to the sysctl_inodes list.
      
      Previously the code did a dance where it delayed calling iput until the
      next entry in the list was being considered to ensure the inode remained on
      the sysctl_inodes list until the next entry was walked to.  The structure
      of the loop in this patch does not need that so is much easier to
      understand and maintain.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Tested-by: Andrei Vagin <avagin@openvz.org>
      Fixes: ace0c791 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
      Fixes: d6cffbbe ("proc/sysctl: prune stale dentries during unregistering")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a3a7b992
    • Eric W. Biederman's avatar
      proc/sysctl: Don't grab i_lock under sysctl_lock. · 631f93a6
      Eric W. Biederman authored
      commit ace0c791 upstream.
      
      Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
      > This patch has locking problem. I've got lockdep splat under LTP.
      >
      > [ 6633.115456] ======================================================
      > [ 6633.115502] [ INFO: possible circular locking dependency detected ]
      > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
      > [ 6633.115584] -------------------------------------------------------
      > [ 6633.115627] ksm02/284980 is trying to acquire lock:
      > [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
      > [ 6633.115834] but task is already holding lock:
      > [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
      > [ 6633.116026] which lock already depends on the new lock.
      > [ 6633.116026]
      > [ 6633.116080]
      > [ 6633.116080] the existing dependency chain (in reverse order) is:
      > [ 6633.116117]
      > -> #2 (sysctl_lock){+.+...}:
      > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
      > -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
      >
      > d_lock nests inside i_lock
      > sysctl_lock nests inside d_lock in d_compare
      >
      > This patch adds i_lock nesting inside sysctl_lock.
      
      Al Viro <viro@ZenIV.linux.org.uk> replied:
      > Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
      > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
      > drop sysctl_lock() before it and regain after.  Make sure that no inodes
      > are added to the list once ->unregistering has been set and use RCU list
      > primitives for modifying the inode list, with sysctl_lock still used to
      > serialize its modifications.
      >
      > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
      > igrab() is safe there.  Since we don't drop inode reference until after we'd
      > passed beyond it in the list, list_for_each_entry_rcu() should be fine.
      
      I agree with Al Viro's analysis of the situation.
      
      Fixes: d6cffbbe ("proc/sysctl: prune stale dentries during unregistering")
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      631f93a6
    • Konstantin Khlebnikov's avatar
      proc/sysctl: prune stale dentries during unregistering · b96e215e
      Konstantin Khlebnikov authored
      commit d6cffbbe upstream.
      
      Currently, unregistering a sysctl table does not prune its dentries.
      Stale dentries can slow down sysctl operations significantly.
      
      For example, command:
      
       # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
       creates millions of stale dentries around sysctls of loopback interface:
      
       # sysctl fs.dentry-state
       fs.dentry-state = 25812579  24724135        45      0       0       0
      
       All of them have matching names, thus lookup has to scan through the whole
       hash chain and call d_compare (proc_sys_compare), which checks them
       under a system-wide spinlock (sysctl_lock).
      
       # time sysctl -a > /dev/null
       real    1m12.806s
       user    0m0.016s
       sys     1m12.400s
      
      Currently only memory reclaimer could remove this garbage.
      But without significant memory pressure this never happens.
      
      This patch collects sysctl inodes into a list on the sysctl table header
      and prunes all their dentries once that table unregisters.
      
      Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
      > On 10.02.2017 10:47, Al Viro wrote:
      >> how about the matching stats *after* that patch?
      >
      > dcache size doesn't grow endlessly, so stats are fine
      >
      > # sysctl fs.dentry-state
      > fs.dentry-state = 92712	58376	45	0	0	0
      >
      > # time sysctl -a &>/dev/null
      >
      > real	0m0.013s
      > user	0m0.004s
      > sys	0m0.008s
      
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b96e215e
    • Al Viro's avatar
      fix __legitimize_mnt()/mntput() race · e31578c6
      Al Viro authored
      commit 119e1ef8 upstream.
      
      __legitimize_mnt() has two problems - one is that in case of success
      the check of mount_lock is not ordered wrt preceding increment of
      refcount, making it possible to have successful __legitimize_mnt()
      on one CPU just before the otherwise final mntput() on another,
      with __legitimize_mnt() not seeing mntput() taking the lock and
      mntput() not seeing the increment done by __legitimize_mnt().
      Solved by a pair of barriers.
      
      Another is that failure of __legitimize_mnt() on the second
      read_seqretry() leaves us with reference that'll need to be
      dropped by caller; however, if that races with final mntput()
      we can end up with caller dropping rcu_read_lock() and doing
      mntput() to release that reference - with the first mntput()
      having freed the damn thing just as rcu_read_lock() had been
      dropped.  Solution: in "do mntput() yourself" failure case
      grab mount_lock, check if MNT_DOOMED has been set by racing
      final mntput() that has missed our increment and if it has -
      undo the increment and treat that as "failure, caller doesn't
      need to drop anything" case.
      
      It's not easy to hit - the final mntput() has to come right
      after the first read_seqretry() in __legitimize_mnt() *and*
      manage to miss the increment done by __legitimize_mnt() before
      the second read_seqretry() in there.  The things that are almost
      impossible to hit on bare hardware are not impossible on SMP
      KVM, though...
      
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e31578c6
    • Al Viro's avatar
      fix mntput/mntput race · 87a2d84d
      Al Viro authored
      commit 9ea0a46c upstream.
      
      mntput_no_expire() does the calculation of total refcount under mount_lock;
      unfortunately, the decrement (as well as all increments) are done outside
      of it, leading to false positives in the "are we dropping the last reference"
      test.  Consider the following situation:
      	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
      of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
      mntput(mnt) (called from __fput()) drops one reference, decrementing component
      	* After it has looked at component #0, the process on CPU 0 does
      mntget(), incrementing component #0, gets preempted and gets to run again -
      on CPU 69.  There it does mntput(), which drops the reference (component #69)
      and proceeds to spin on mount_lock.
      	* On CPU 42 our first mntput() finishes counting.  It observes the
      decrement of component #69, but not the increment of component #0.  As the
      result, the total it gets is not 1 as it should've been - it's 0.  At which
      point we decide that vfsmount needs to be killed and proceed to free it and
      shut the filesystem down.  However, there's still another opened file
      on that filesystem, with reference to (now freed) vfsmount, etc. and we are
      screwed.
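      
      The interleaving above can be replayed deterministically in userspace.
      The sketch below is only a model: the DEMO_ names are hypothetical,
      the per-CPU components are a plain array, the "concurrent" update is
      injected through a callback after a chosen component has been read,
      and it assumes the first mntput() decremented component #42 and that
      both original references were taken on CPU 0.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NCPUS 70

/*
 * Sum the per-CPU components in index order, the way a non-atomic counting
 * loop would. If `hook` is non-NULL it runs right after component
 * `hook_after` has been read, modeling updates made by another task while
 * the loop is in progress.
 */
long demo_racy_sum(long *percpu, int ncpus, int hook_after,
		   void (*hook)(long *percpu))
{
	long total = 0;

	assert(ncpus > 0);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		total += percpu[cpu];
		if (hook != NULL && cpu == hook_after)
			hook(percpu);
	}
	return total;
}

/* Another task: mntget() on CPU 0, migrate, then mntput() on CPU 69. */
void demo_concurrent_get_put(long *percpu)
{
	percpu[0] += 1;   /* increment lands in an already-read component */
	percpu[69] -= 1;  /* decrement lands in a not-yet-read component */
}
```

      Starting from component #0 = 2 and component #42 = -1, an undisturbed
      sum gives 1, while a sum that runs demo_concurrent_get_put() after
      reading component #0 gives 0, although the true total is still 1.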
      
      It's not a wide race, but it can be reproduced with artificial slowdown of
      the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
      
      Fix consists of moving the refcount decrement under mount_lock; the tricky
      part is that we want (and can) keep the fast case (i.e. mount that still
      has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
      mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
      before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
      a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
      be dropped.
      
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Jann Horn <jannh@google.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      87a2d84d
    • Al Viro's avatar
      make sure that __dentry_kill() always invalidates d_seq, unhashed or not · 59199c04
      Al Viro authored
      commit 4c0d7cd5 upstream.
      
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of already unhashed
      dentry does *not* touch ->d_seq at all.  Unhashing does, though,
      so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      it.
      
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
      
      Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      59199c04
    • Al Viro's avatar
      root dentries need RCU-delayed freeing · cfac7df7
      Al Viro authored
      commit 90bad5e0 upstream.
      
      Since mountpoint crossing can happen without leaving lazy mode,
      root dentries do need the same protection against having their
      memory freed without RCU delay as everything else in the tree.
      
      It's partially hidden by RCU delay between detaching from the
      mount tree and dropping the vfsmount reference, but the starting
      point of pathwalk can be on an already detached mount, in which
      case umount-caused RCU delay has already passed by the time the
      lazy pathwalk grabs rcu_read_lock().  If the starting point
      happens to be at the root of that vfsmount *and* that vfsmount
      covers the entire filesystem, we get trouble.
      
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cfac7df7