15 Aug, 2018 (40 commits)
    • x86/speculation/l1tf: Fix up pte->pfn conversion for PAE · 49221fc7
      Michal Hocko authored
      commit e14d7dfb upstream
      
      Jan has noticed that pte_pfn() and friends, respectively pfn_pte(), are
      incorrect for CONFIG_PAE because phys_addr_t is wider than unsigned long,
      so the pte_val() result, respectively the left shift, would get truncated.
      Fix this up by using proper types.
      
      Fixes: 6b28baca ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation")
      Reported-by: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      49221fc7
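
      A minimal sketch of the truncation and the fix, assuming simplified
      stand-in types and an example mask (not the actual kernel macros):

        /* With CONFIG_PAE the PTE is 64-bit while unsigned long is 32-bit,
         * so the PFN math must happen in the wider type. Illustrative only. */
        typedef unsigned long long pteval_t;
        typedef unsigned long long phys_addr_t;
        #define PAGE_SHIFT   12
        #define PTE_PFN_MASK 0x0000fffffffff000ULL   /* example mask */

        static inline unsigned long pte_pfn_sketch(pteval_t pte)
        {
                /* mask and shift in 64-bit space before narrowing */
                return (unsigned long)((pte & PTE_PFN_MASK) >> PAGE_SHIFT);
        }

        static inline pteval_t pfn_pte_sketch(unsigned long pfn, pteval_t prot)
        {
                /* widen the PFN first so the left shift cannot truncate */
                return ((phys_addr_t)pfn << PAGE_SHIFT) | prot;
        }
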
    • x86/speculation/l1tf: Protect PAE swap entries against L1TF · b9cdf143
      Vlastimil Babka authored
      commit 0d0f6249 upstream
      
      The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
      offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
      offset. The lower word is zeroed, thus systems with less than 4GB memory are
      safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
      to L1TF; with even more memory, the swap offset also influences the address.
      This might be a problem for 32bit PAE guests running on large 64bit hosts.
      
      By continuing to keep the whole swap entry in either high or low 32bit word of
      PTE we would limit the swap size too much. Thus this patch uses the whole PAE
      PTE with the same layout as the 64bit version does. The macros just become a
      bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9cdf143
    • x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings · 2a82e5e5
      Borislav Petkov authored
      commit 7ce2f039 upstream
      
      The TOPOEXT reenablement is a workaround for broken BIOSen which didn't
      enable the CPUID bit. amd_get_topology_early(), however, relies on
      that bit being set so that it can read out the CPUID leaf and set
      smp_num_siblings properly.
      
      Move the reenablement up to early_init_amd(). While at it, simplify
      amd_get_topology_early().
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a82e5e5
    • x86/cpufeatures: Add detection of L1D cache flush support. · 62f43866
      Konrad Rzeszutek Wilk authored
      commit 11e34e64 upstream
      
      336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
      (IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set.
      
      This new MSR "gives software a way to invalidate structures with finer
      granularity than other architectural methods like WBINVD."
      
      A copy of this document is available at
        https://bugzilla.kernel.org/show_bug.cgi?id=199511
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62f43866
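
      A hedged userspace sketch of the detection the commit describes, using
      GCC's cpuid.h helper to check CPUID.(EAX=7,ECX=0):EDX[28] (the kernel
      does this through its own cpufeature machinery):

        #include <cpuid.h>
        #include <stdio.h>

        /* EDX bit 28 of CPUID leaf 7 advertises IA32_FLUSH_CMD support */
        static int has_l1d_flush(void)
        {
                unsigned int eax, ebx, ecx, edx;

                if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                        return 0;
                return !!(edx & (1u << 28));
        }

        int main(void)
        {
                printf("L1D flush MSR %s\n",
                       has_l1d_flush() ? "present" : "absent");
                return 0;
        }
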
    • x86/speculation/l1tf: Extend 64bit swap file size limit · c6374a62
      Vlastimil Babka authored
      commit 1a7ed1ba upstream
      
      The previous patch has limited swap file size so that large offsets cannot
      clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.
      
      It assumed that offsets are encoded starting with bit 12, same as pfn. But
      on x86_64, offsets are encoded starting with bit 9.
      
      Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
      and 256TB with 46bit MAX_PA.
      
      Fixes: 377eeaa8 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6374a62
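
      A small numeric sketch of the raised limit, assuming the figures above
      (illustrative arithmetic, not the kernel's actual helper):

        #include <stdio.h>

        /* offsets encoded from bit 9 instead of bit 12 gain 3 bits of limit */
        static unsigned long long swap_limit_tb(unsigned int max_pa_bits)
        {
                unsigned long long pages = 1ULL << (max_pa_bits - 1 - 12 + 3);

                return (pages << 12) >> 40;     /* pages -> bytes -> TB */
        }

        int main(void)
        {
                printf("42bit MAX_PA: %llu TB\n", swap_limit_tb(42));  /* 16  */
                printf("46bit MAX_PA: %llu TB\n", swap_limit_tb(46));  /* 256 */
                return 0;
        }
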
    • x86/apic: Ignore secondary threads if nosmt=force · f02f2ad9
      Thomas Gleixner authored
      commit 2207def7 upstream
      
      nosmt on the kernel command line merely prevents the onlining of the
      secondary SMT siblings.
      
      nosmt=force makes the APIC detection code ignore the secondary SMT siblings
      completely, so they do not even show up as possible CPUs. That reduces the
      amount of memory allocated for per-cpu variables and prevents other
      resources from being sized too large.
      
      This is not fully equivalent to disabling SMT in the BIOS because the low
      level SMT enabling in the BIOS can result in partitioning of resources
      between the siblings, which is not undone by just ignoring them. Some CPUs
      can use the full resources when their sibling is not onlined, but this
      depends on the CPU family and model and it is not well documented whether
      this applies to all partitioned resources. That means that, depending on
      the workload, disabling SMT in the BIOS might result in better performance.
      
      Linus' analysis of the Intel manual:
      
        The intel optimization manual is not very clear on what the partitioning
        rules are.
      
        I find:
      
          "In general, the buffers for staging instructions between major pipe
           stages  are partitioned. These buffers include µop queues after the
           execution trace cache, the queues after the register rename stage, the
           reorder buffer which stages instructions for retirement, and the load
           and store buffers.
      
           In the case of load and store buffers, partitioning also provided an
           easier implementation to maintain memory ordering for each logical
           processor and detect memory ordering violations"
      
        but some of that partitioning may be relaxed if the HT thread is "not
        active":
      
          "In Intel microarchitecture code name Sandy Bridge, the micro-op queue
           is statically partitioned to provide 28 entries for each logical
           processor,  irrespective of software executing in single thread or
           multiple threads. If one logical processor is not active in Intel
           microarchitecture code name Ivy Bridge, then a single thread executing
           on that processor  core can use the 56 entries in the micro-op queue"
      
        but I do not know what "not active" means, and how dynamic it is. Some of
        that partitioning may be entirely static and depend on the early BIOS
        disabling of HT, and even if we park the cores, the resources will just be
        wasted.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f02f2ad9
    • x86/cpu/AMD: Evaluate smp_num_siblings early · 63f2c9b0
      Thomas Gleixner authored
      commit 1e1d7e25 upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. amd_get_topology() cannot be called before the APIC
      driver is selected, so split out the part which initializes
      smp_num_siblings and invoke it from amd_early_init().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      63f2c9b0
    • x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info · 35c67d5b
      Borislav Petkov authored
      commit 119bff8a upstream
      
      Old code used to check whether CPUID ext max level is >= 0x80000008 because
      that last leaf contains the number of cores of the physical CPU.  The three
      functions called there now do not depend on that leaf anymore so the check
      can go.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      35c67d5b
    • x86/cpu/intel: Evaluate smp_num_siblings early · d20d8f7f
      Thomas Gleixner authored
      commit 1910ad56 upstream
      
      Make use of the new early detection function to initialize smp_num_siblings
      on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
      required for force disabling SMT.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d20d8f7f
    • x86/cpu/topology: Provide detect_extended_topology_early() · 9552b7df
      Thomas Gleixner authored
      commit 95f3d39c upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. detect_extended_topology() cannot be called before
      the APIC driver is selected, so split out the part which initializes
      smp_num_siblings.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9552b7df
    • x86/cpu/common: Provide detect_ht_early() · bfa4f8ae
      Thomas Gleixner authored
      commit 545401f4 upstream
      
      To support force disabling of SMT it's required to know the number of
      thread siblings early. detect_ht() cannot be called before the APIC driver
      is selected, so split out the part which initializes smp_num_siblings.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bfa4f8ae
    • x86/cpu/AMD: Remove the pointless detect_ht() call · 33c8be23
      Thomas Gleixner authored
      commit 44ca36de upstream
      
      Real 32bit AMD CPUs do not have SMT and the only value of the call was to
      reach the magic printout which got removed.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      33c8be23
    • x86/cpu: Remove the pointless CPU printout · 728ac482
      Thomas Gleixner authored
      commit 55e6d279 upstream
      
      The value of this printout is dubious at best and there is no point in
      having it in two different places along with convoluted ways to reach it.
      
      Remove it completely.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      728ac482
    • cpu/hotplug: Provide knobs to control SMT · c5ac43ee
      Thomas Gleixner authored
      commit 05736e4a upstream
      
      Provide a command line and a sysfs knob to control SMT.
      
      The command line options are:
      
       'nosmt':	Enumerate secondary threads, but do not online them
      
       'nosmt=force': Ignore secondary threads completely during enumeration
       		via MP table and ACPI/MADT.
      
      The sysfs control file has the following states (read/write):
      
       'on':		 SMT is enabled. Secondary threads can be freely onlined
       'off':		 SMT is disabled. Secondary threads, even if enumerated,
       		 cannot be onlined
       'forceoff':	 SMT is permanently disabled. Writes to the control
       		 file are rejected.
       'notsupported': SMT is not supported by the CPU
      
      The command line option 'nosmt' sets the sysfs control to 'off'. This
      can be changed to 'on' to reenable SMT during runtime.
      
      The command line option 'nosmt=force' sets the sysfs control to
      'forceoff'. This cannot be changed during runtime.
      
      When SMT is 'on' and the control file is changed to 'off' then all online
      secondary threads are offlined and attempts to online a secondary thread
      later on are rejected.
      
      When SMT is 'off' and the control file is changed to 'on' then secondary
      threads can be onlined again. The 'off' -> 'on' transition does not
      automatically online the secondary threads.
      
      When the control file is set to 'forceoff', the behaviour is the same as
      setting it to 'off', but the operation is irreversible and later writes to
      the control file are rejected.
      
      When the control status is 'notsupported' then writes to the control file
      are rejected.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5ac43ee
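
      A hedged userspace sketch of driving the control file described above
      (path as given in the commit; minimal error handling):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/sys/devices/system/cpu/smt/control", O_WRONLY);

                if (fd < 0) {
                        perror("open");   /* missing file: kernel too old */
                        return 1;
                }
                /* write "on" to re-enable; rejected in 'forceoff' state */
                if (write(fd, "off", 3) != 3)
                        perror("write");
                close(fd);
                return 0;
        }
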
    • cpu/hotplug: Split do_cpu_down() · 6beba29c
      Thomas Gleixner authored
      commit cc1fe215 upstream
      
      Split out the inner workings of do_cpu_down() to allow reuse of that
      function for the upcoming SMT disabling mechanism.
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6beba29c
    • cpu/hotplug: Make bringup/teardown of smp threads symmetric · 26a6dcc7
      Thomas Gleixner authored
      commit c4de6569 upstream
      
      The asymmetry caused a warning to trigger if the bootup was stopped in state
      CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
      now be invoked on already or still parked threads. But there is still no
      reason to have this be asymmetric.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      26a6dcc7
    • x86/topology: Provide topology_smt_supported() · 8eb28605
      Thomas Gleixner authored
      commit f048c399 upstream
      
      Provide information on whether SMT is supported by the CPUs. Preparatory
      patch for the SMT control mechanism.
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8eb28605
    • x86/smp: Provide topology_is_primary_thread() · b84b9184
      Thomas Gleixner authored
      commit 6a4d2657 upstream
      
      If the CPU is supporting SMT then the primary thread can be found by
      checking the lower APIC ID bits for zero. smp_num_siblings is used to build
      the mask for the APIC ID bits which need to be taken into account.
      
      This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
      from the initial APIC ID in CPUID. But according to AMD the lower bits have
      to be consistent. Intel gave a tentative confirmation as well.
      
      Preparatory patch to support disabling SMT at boot/runtime.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b84b9184
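
      A hedged sketch of the test described above: build a mask from
      smp_num_siblings and check that the lower APIC ID bits are zero
      (simplified from the kernel's helper):

        #include <stdio.h>

        static unsigned int order(unsigned int n)  /* smallest k: 2^k >= n */
        {
                unsigned int k = 0;

                while ((1u << k) < n)
                        k++;
                return k;
        }

        static int is_primary_thread(unsigned int apicid, unsigned int siblings)
        {
                unsigned int mask = (1u << order(siblings)) - 1;

                return (apicid & mask) == 0;
        }

        int main(void)
        {
                /* 2-way SMT: even APIC IDs are the primary threads */
                printf("%d %d\n", is_primary_thread(4, 2),
                                  is_primary_thread(5, 2));
                return 0;
        }
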
    • sched/smt: Update sched_smt_present at runtime · fc890e9b
      Peter Zijlstra authored
      commit ba2591a5 upstream
      
      The static key sched_smt_present is only updated at boot time when SMT
      siblings have been detected. Booting with maxcpus=1 and bringing the
      siblings online after boot rebuilds the scheduling domains correctly but
      does not update the static key, so the SMT code is not enabled.
      
      Let the key be updated in the scheduler CPU hotplug code to fix this.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc890e9b
    • x86/bugs: Move the l1tf function and define pr_fmt properly · 202f9056
      Konrad Rzeszutek Wilk authored
      commit 56563f53 upstream
      
      The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt
      which was defined as "Spectre V2 : ".
      
      Move the function to be past SSBD and also define the pr_fmt.
      
      Fixes: 17dbca11 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      202f9056
    • x86/speculation/l1tf: Limit swap file size to MAX_PA/2 · 4bb1a8d8
      Andi Kleen authored
      commit 377eeaa8 upstream
      
      For the L1TF workaround it's necessary to limit the swap file size to below
      MAX_PA/2, so that the higher bits of the inverted swap offset never point
      to valid memory.

      Add a mechanism for the architecture to override the swap file size check
      in swapfile.c and add an x86 specific max swapfile check function that
      enforces that limit.
      
      The check is only enabled if the CPU is vulnerable to L1TF.
      
      In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
      with 46bit PA it is 32TB. The limit is only per individual swap file, so
      it's always possible to exceed these limits with multiple swap files or
      partitions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bb1a8d8
    • x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · a4116334
      Andi Kleen authored
      commit 42e4089c upstream
      
      For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure that an unmapped entry does not point to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such a high mapping was mapped to unprivileged users,
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

      Valid page mappings are allowed because the system is then unsafe anyway.

      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact, this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and the check
      is also skipped for root.
      
      For mmaps this is straightforward and can be handled in vm_insert_pfn() and
      in remap_pfn_range().

      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed, a lot of state has been changed and it would be difficult to undo
      on an error. Since this is an uncommon case, use a separate early page
      table walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non-MMIO and non-PROT_NONE mappings there are no changes.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4116334
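
      A hedged sketch of the gate described above, with an illustrative helper
      name (the real check lives in the x86 mm code and uses the CPU's
      reported physical address bits):

        #include <stdio.h>

        /* reject only if the MAX_PA-1 bit of the PFN is set and the caller
         * is not root; illustrative, not the kernel's function name */
        static int l1tf_forbids_protnone(unsigned long long pfn,
                                         unsigned int pa_bits, int is_root)
        {
                unsigned long long high_pfn = 1ULL << (pa_bits - 1 - 12);

                return !is_root && pfn >= high_pfn;
        }

        int main(void)
        {
                /* 40-bit MAX_PA: PFNs at or above 2^27 count as high MMIO */
                printf("%d\n", l1tf_forbids_protnone(1ULL << 27, 40, 0));
                return 0;
        }
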
    • x86/speculation/l1tf: Add sysfs reporting for l1tf · 3d98de69
      Andi Kleen authored
      commit 17dbca11 upstream
      
      L1TF core kernel workarounds are cheap and normally always enabled. However,
      they should still be reported in sysfs if the system is vulnerable or
      mitigated. Add the necessary CPU feature/bug bits.
      
      - Extend the existing checks for Meltdown to determine if the system is
        vulnerable. All CPUs which are not vulnerable to Meltdown are also not
        vulnerable to L1TF
      
      - Check for 32bit non PAE and emit a warning as there is no practical way
        for mitigation due to the limited physical address bits
      
      - If the system has more than MAX_PA/2 physical memory the invert page
        workarounds don't protect the system against the L1TF attack anymore,
        because an inverted physical address will also point to valid
        memory. Print a warning in this case and report that the system is
        vulnerable.
      
      Add a function which returns the PFN limit for the L1TF mitigation, which
      will be used in follow up patches for sanity and range checks.
      
      [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d98de69
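
      A hedged sketch of consuming the new report from userspace (the file
      sits in the existing vulnerabilities directory; the exact text varies
      by kernel and CPU):

        #include <stdio.h>

        int main(void)
        {
                char buf[128];
                FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf",
                                "r");

                if (!f) {
                        perror("fopen");   /* kernel without the file */
                        return 1;
                }
                if (fgets(buf, sizeof(buf), f))
                        fputs(buf, stdout);  /* e.g. "Mitigation: PTE Inversion" */
                fclose(f);
                return 0;
        }
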
    • x86/speculation/l1tf: Make sure the first page is always reserved · aff6fe17
      Andi Kleen authored
      commit 10a70416 upstream
      
      The L1TF workaround doesn't make any attempt to mitigate speculative
      accesses to the first physical page for zeroed PTEs. Normally it only
      contains some data from the early real mode BIOS.
      
      It's not entirely clear that the first page is reserved in all
      configurations, so add an extra reservation call to make sure it is really
      reserved. In most configurations (e.g.  with the standard reservations)
      it's likely a nop.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aff6fe17
    • x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation · 8c35b2fc
      Andi Kleen authored
      commit 6b28baca upstream
      
      When PTEs are set to PROT_NONE the kernel just clears the Present bit and
      preserves the PFN, which creates attack surface for L1TF speculation
      attacks.

      This is important inside guests, because L1TF speculation bypasses physical
      page remapping. While the host has its own mitigations preventing leaking
      data from other VMs into the guest, this would still risk leaking the wrong
      page inside the current guest.

      This uses the same technique as Linus' swap entry patch: while an entry
      is in PROTNONE state, invert the complete PFN part of it. This ensures
      that the highest bit will point to non-existent memory.
      
      The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
      pte/pmd/pud_pfn undo it.

      This assumes that no code path touches the PFN part of a PTE directly
      without using these primitives.
      
      This doesn't handle the case that MMIO is at the top of the CPU physical
      memory. If such an MMIO region was exposed by an unprivileged driver for
      mmap it would be possible to attack some real memory.  However this
      situation is all rather unlikely.
      
      For 32bit non PAE the inversion is not done because there are really not
      enough bits to protect anything.
      
      Q: Why does the guest need to be protected when the HyperVisor already has
         L1TF mitigations?
      
      A: Here's an example:
      
         Physical pages 1 2 get mapped into a guest as
         GPA 1 -> PA 2
         GPA 2 -> PA 1
         through EPT.
      
         The L1TF speculation ignores the EPT remapping.
      
         Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
         they belong to different users and should be isolated.
      
         A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
         gets read access to the underlying physical page. Which in this case
         points to PA 2, so it can read process B's data, if it happened to be in
         L1, so isolation inside the guest is broken.
      
         There's nothing the hypervisor can do about this. This mitigation has to
         be done in the guest itself.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c35b2fc
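
      A hedged sketch of the inversion, assuming simplified flag values and
      an example PFN mask (the kernel folds this into the pte/pmd helpers
      listed above):

        typedef unsigned long long pteval_t;

        #define PTE_PFN_MASK   0x000ffffffffff000ULL  /* example mask       */
        #define _PAGE_PRESENT  0x001ULL
        #define _PAGE_PROTNONE 0x100ULL               /* illustrative value */

        /* the PFN bits are flipped only while the entry is PROT_NONE */
        static pteval_t protnone_mask(pteval_t v)
        {
                return (v & (_PAGE_PRESENT | _PAGE_PROTNONE)) == _PAGE_PROTNONE
                        ? PTE_PFN_MASK : 0;
        }

        static pteval_t flip_protnone_pfn(pteval_t v)
        {
                /* applying this twice restores the original PFN */
                return v ^ protnone_mask(v);
        }
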
    • x86/speculation/l1tf: Protect swap entries against L1TF · 83ef7e8c
      Linus Torvalds authored
      commit 2f22b4cd upstream
      
      With L1 terminal fault the CPU speculates into unmapped PTEs, and the
      resulting side effects allow reading the memory the PTE is pointing to,
      if its values are still in the L1 cache.
      
      For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
      them.
      
      To protect against L1TF it must be ensured that the swap entry is not
      pointing to valid memory, which requires setting higher bits (between bit
      36 and bit 45) that are inside the CPUs physical address space, but outside
      any real memory.
      
      To do this invert the offset to make sure the higher bits are always set,
      as long as the swap file is not too big.
      
      Note there is no workaround for 32bit !PAE, or on systems which have more
      than MAX_PA/2 worth of memory. The latter case is very unlikely to happen
      on real systems.
      
      [ AK: Updated description and minor tweaks. Split out from the original
        patch ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      83ef7e8c
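
      A hedged sketch of the inverted encoding, assuming the bits 9-58/59-63
      layout from the companion patch (simplified; the real macros also mask
      against the PFN field):

        #include <stdio.h>

        #define OFF_BITS 50
        #define OFF_MASK ((1ULL << OFF_BITS) - 1)

        static unsigned long long swp_pte(unsigned int type,
                                          unsigned long long offset)
        {
                /* store ~offset so the high PA bits end up set for any
                 * offset that is not absurdly large */
                return ((unsigned long long)type << 59) |
                       ((~offset & OFF_MASK) << 9);
        }

        static unsigned long long swp_offset(unsigned long long pte)
        {
                return ~(pte >> 9) & OFF_MASK;   /* undo the inversion */
        }

        int main(void)
        {
                printf("offset back: %#llx\n", swp_offset(swp_pte(1, 0x1234)));
                return 0;
        }
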
    • x86/speculation/l1tf: Change order of offset/type in swap entry · 39991a7a
      Linus Torvalds authored
      commit bcd11afa upstream
      
      If pages are swapped out, the swap entry is stored in the corresponding
      PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
      on PTE entries which have the present bit set and would treat the swap
      entry as a physical address (PFN). To mitigate that, the upper bits of the
      PTE must be set so the PTE points to non-existent memory.

      The swap entry stores the type and the offset of a swapped out page in the
      PTE. type is stored in bits 9-13 and offset in bits 14-63. The hardware
      ignores the bits beyond the physical address space limit, so to make the
      mitigation effective it's required to start 'offset' at the lowest possible
      bit so that even large swap offsets do not reach into the physical address
      space limit bits.
      
      Move offset to bits 9-58 and type to bits 59-63, which are the bits that
      the hardware generally doesn't care about.
      
      That, in turn, means that if you are on a desktop chip with only 40 bits of
      physical addressing, now that the offset starts at bit 9, there need to be
      30 bits of offset actually *in use* until bit 39 ends up being set, which
      means when inverted it will again point into existing memory.
      
      So that's 4 terabytes of swap space (because the offset is counted in
      pages, so 30 bits of offset is 42 bits of actual coverage). With bigger
      physical addressing, that obviously grows further, until the limit of the
      offset is hit (at 50 bits of offset - 62 bits of actual swap file coverage).
      
      This is a preparatory change for the actual swap entry inversion to protect
      against L1TF.
      
      [ AK: Updated description and minor tweaks. Split into two parts ]
      [ tglx: Massaged changelog ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39991a7a
    • x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT · aefe1861
      Andi Kleen authored
      commit 50896e18 upstream
      
      L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
      speculates on PTE entries which do not have the PRESENT bit set, if the
      content of the resulting physical address is available in the L1D cache.
      
      The OS side mitigation makes sure that a !PRESENT PTE entry points to a
      physical address outside the actually existing and cacheable memory
      space. This is achieved by inverting the upper bits of the PTE. Due to the
      address space limitations this only works for 64bit and 32bit PAE kernels,
      but not for 32bit non PAE.
      
      This mitigation applies to both host and guest kernels, but in case of a
      64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
      the PAE address space (44bit) is not enough if the host has more than 43
      bits of populated memory address space, because the speculation treats the
      PTE content as a physical host address bypassing EPT.
      
      The host (hypervisor) protects itself against the guest by flushing L1D as
      needed, but pages inside the guest are not protected against attacks from
      other processes inside the same guest.
      
      For the guest the inverted PTE mask has to match the host to provide the
      full protection for all pages the host could possibly map into the
      guest. The host's populated address space is not known to the guest, so
      the mask must cover the possible maximal host address space, i.e. 52 bit.
      
      On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
      is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
      the mask to be below what the host could possibly use for physical pages.
      
      The L1TF PROT_NONE protection code uses the PTE masks to determine which
      bits to invert to make sure the higher bits are set for unmapped entries to
      prevent L1TF speculation attacks against EPT inside guests.
      
      In order to invert all bits that could be used by the host, increase
      __PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
      
      The real limit for a 32bit PAE kernel is still 44 bits because all Linux
      PTEs are created from unsigned long PFNs, so they cannot be higher than 44
      bits on a 32bit kernel. So these extra PFN bits should never be set. The
      only users of this macro are using it to look at PTEs, so it's safe.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aefe1861
    • x86/irqflags: Provide a declaration for native_save_fl · 688e5a20
      Nick Desaulniers authored
      commit 208cbb32 upstream.
      
      It was reported that the commit d0a8d937 is causing users of gcc < 4.9
      to observe -Werror=missing-prototypes errors.
      
      Indeed, it seems that:
      extern inline unsigned long native_save_fl(void) { return 0; }
      
      compiled with -Werror=missing-prototypes produces this warning in gcc <
      4.9, but not gcc >= 4.9.
      
      Fixes: d0a8d937 ("x86/paravirt: Make native_save_fl() extern inline").
      Reported-by: David Laight <david.laight@aculab.com>
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: kstewart@linuxfoundation.org
      Cc: gregkh@linuxfoundation.org
      Cc: boris.ostrovsky@oracle.com
      Cc: astrachan@google.com
      Cc: mka@chromium.org
      Cc: arnd@arndb.de
      Cc: tstellar@redhat.com
      Cc: sedat.dilek@gmail.com
      Cc: David.Laight@aculab.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      688e5a20
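
      A hedged sketch of the pattern the commit adds: a plain extern
      declaration ahead of the extern-inline definition satisfies
      -Wmissing-prototypes on older GCC (body modeled on the usual
      pushf/pop implementation; simplified):

        extern unsigned long native_save_fl(void);  /* the added declaration */

        extern inline unsigned long native_save_fl(void)
        {
                unsigned long flags;

                /* read the flags register, as the kernel helper does */
                asm volatile("pushf ; pop %0"
                             : "=rm" (flags)
                             : /* no input */
                             : "memory");
                return flags;
        }
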
    • kprobes/x86: Fix %p uses in error messages · bf6ac84a
      Masami Hiramatsu authored
      commit 0ea06330 upstream.
      
      Remove all %p uses in error messages in kprobes/x86.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jon Medhurst <tixy@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Richter <tmricht@linux.ibm.com>
      Cc: Tobin C . Harding <me@tobin.cc>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: acme@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: brueckner@linux.vnet.ibm.com
      Cc: linux-arch@vger.kernel.org
      Cc: rostedt@goodmis.org
      Cc: schwidefsky@de.ibm.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf6ac84a
    • x86/speculation: Protect against userspace-userspace spectreRSB · f374b559
      Jiri Kosina authored
      commit fdf82a78 upstream.
      
      The article "Spectre Returns! Speculation Attacks using the Return Stack
      Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
      making use solely of the RSB contents even on CPUs that don't fallback to
      BTB on RSB underflow (Skylake+).
      
      Mitigate userspace-userspace attacks by always unconditionally filling RSB on
      context switch when the generic spectrev2 mitigation has been enabled.
      
      [1] https://arxiv.org/pdf/1807.07940.pdf
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1807261308190.997@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f374b559
    • x86/paravirt: Fix spectre-v2 mitigations for paravirt guests · f7a0871d
      Peter Zijlstra authored
      commit 5800dc5c upstream.
      
      Nadav reported that on guests we're failing to rewrite the indirect
      calls to CALLEE_SAVE paravirt functions. In particular the
      pv_queued_spin_unlock() call is left unpatched and that is all over the
      place. This obviously wrecks Spectre-v2 mitigation (for paravirt
      guests) which relies on not actually having indirect calls around.
      
      The reason is an incorrect clobber test in paravirt_patch_call(); this
      function rewrites an indirect call with a direct call to the _SAME_
      function, so there is no possible way the clobbers can be different
      because of this.
      
      Therefore remove this clobber check. Also put WARNs on the other patch
      failure case (not enough room for the instruction) which I've not seen
      trigger in my (limited) testing.
      
      Three live kernel image disassemblies for lock_sock_nested (as a small
      function that illustrates the problem nicely). PRE is the current
      situation for guests, POST is with this patch applied and NATIVE is with
      or without the patch for !guests.
      
      PRE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  *0xffffffff822299e8
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      POST:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
         0xffffffff817be9a5 <+53>:    xchg   %ax,%ax
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063aa0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      NATIVE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    movb   $0x0,(%rdi)
         0xffffffff817be9a3 <+51>:    nopl   0x0(%rax)
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      
      Fixes: 63f70270 ("[PATCH] i386: PARAVIRT: add common patching machinery")
      Fixes: 3010a066 ("x86/paravirt, objtool: Annotate indirect calls")
      Reported-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7a0871d
    • ARM: dts: imx6sx: fix irq for pcie bridge · 93bf5456
      Oleksij Rempel authored
      commit 1bcfe056 upstream.
      
      Use the correct IRQ line for the MSI controller in the PCIe host
      controller. Apparently a different IRQ line is used compared to other
      i.MX6 variants. Without this change MSI IRQs aren't properly propagated
      to the upstream interrupt controller.
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
      Fixes: b1d17f68 ("ARM: dts: imx: add initial imx6sx device tree source")
      Signed-off-by: Shawn Guo <shawnguo@kernel.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      93bf5456
    • Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops · 4a53c4e8
      Lukas Wunner authored
      commit d73e1728 upstream.
      
      John Stultz reports a boot time crash with the HiKey board (which uses
      hci_serdev) occurring in hci_uart_tx_wakeup().  That function is
      contained in hci_ldisc.c, but also called from the newer hci_serdev.c.
      It acquires the proto_lock in struct hci_uart and it turns out that we
      forgot to init the lock in the serdev code path, thus causing the crash.
      
      John bisected the crash to commit 67d2f878 ("Bluetooth: hci_ldisc:
      Allow sleeping while proto locks are held"), but the issue was present
      before and the commit merely exposed it.  (Perhaps by luck, the crash
      did not occur with rwlocks.)
      
      Init the proto_lock in the serdev code path to avoid the oops.
      
      Stack trace for posterity:
      
      Unable to handle kernel read from unreadable memory at 406f127000
      [000000406f127000] user address but active_mm is swapper
      Internal error: Oops: 96000005 [#1] PREEMPT SMP
      Hardware name: HiKey Development Board (DT)
      Call trace:
       hci_uart_tx_wakeup+0x38/0x148
       hci_uart_send_frame+0x28/0x38
       hci_send_frame+0x64/0xc0
       hci_cmd_work+0x98/0x110
       process_one_work+0x134/0x330
       worker_thread+0x130/0x468
       kthread+0xf8/0x128
       ret_from_fork+0x10/0x18
      
      Link: https://lkml.org/lkml/2017/11/15/908
      Reported-and-tested-by: John Stultz <john.stultz@linaro.org>
      Cc: Ronald Tschalär <ronald@innovation.ch>
      Cc: Rob Herring <rob.herring@linaro.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a53c4e8
    • Bluetooth: hci_ldisc: Allow sleeping while proto locks are held. · f6ec33f6
      Ronald Tschalär authored
      commit 67d2f878 upstream.
      
      Commit dec2c928 ("Bluetooth: hci_ldisc:
      Use rwlocking to avoid closing proto races") introduced locks in
      hci_ldisc that are held while calling the proto functions. These locks
      are rwlocks, and hence do not allow sleeping while they are held.
      However, the proto functions that hci_bcm registers use mutexes and
      hence need to be able to sleep.
      
      In more detail: hci_uart_tty_receive() and hci_uart_dequeue() both
      acquire the rwlock, after which they call proto->recv() and
      proto->dequeue(), respectively. In the case of hci_bcm these point to
      bcm_recv() and bcm_dequeue(). The latter both acquire the
      bcm_device_lock, which is a mutex, so doing so results in a call to
      might_sleep(). But since we're holding a rwlock in hci_ldisc, that
      results in the following BUG (this for the dequeue case - a similar
      one for the receive case is omitted for brevity):
      
        BUG: sleeping function called from invalid context at kernel/locking/mutex.c
        in_atomic(): 1, irqs_disabled(): 0, pid: 7303, name: kworker/7:3
        INFO: lockdep is turned off.
        CPU: 7 PID: 7303 Comm: kworker/7:3 Tainted: G        W  OE   4.13.2+ #17
        Hardware name: Apple Inc. MacBookPro13,3/Mac-A5C67F76ED83108C, BIOS MBP133.8
        Workqueue: events hci_uart_write_work [hci_uart]
        Call Trace:
         dump_stack+0x8e/0xd6
         ___might_sleep+0x164/0x250
         __might_sleep+0x4a/0x80
         __mutex_lock+0x59/0xa00
         ? lock_acquire+0xa3/0x1f0
         ? lock_acquire+0xa3/0x1f0
         ? hci_uart_write_work+0xd3/0x160 [hci_uart]
         mutex_lock_nested+0x1b/0x20
         ? mutex_lock_nested+0x1b/0x20
         bcm_dequeue+0x21/0xc0 [hci_uart]
         hci_uart_write_work+0xe6/0x160 [hci_uart]
         process_one_work+0x253/0x6a0
         worker_thread+0x4d/0x3b0
         kthread+0x133/0x150
      
      We can't replace the mutex in hci_bcm, because there are other calls
      there that might sleep. Therefore this replaces the rwlocks in
      hci_ldisc with rw_semaphores (which allow sleeping). This is a safer
      approach anyway as it reduces the restrictions on the proto callbacks.
      Also, because acquiring the write-lock is very rare compared to acquiring
      the read-lock, the percpu variant of rw_semaphore is used.
      
      Lastly, because hci_uart_tx_wakeup() may be called from an IRQ context,
      we can't block (sleep) while trying to acquire the read lock there, so we
      use the trylock variant.
      Signed-off-by: Ronald Tschalär <ronald@innovation.ch>
      Reviewed-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6ec33f6
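
      A hedged userspace analogue of the trylock pattern described above,
      with a pthread rwlock standing in for the percpu rw_semaphore:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_rwlock_t proto_lock = PTHREAD_RWLOCK_INITIALIZER;

        /* may run in a context that must not sleep: only try the lock */
        static void tx_wakeup(void)
        {
                if (pthread_rwlock_tryrdlock(&proto_lock) != 0)
                        return;   /* writer active, work is retried later */
                puts("send frame");
                pthread_rwlock_unlock(&proto_lock);
        }

        int main(void)
        {
                tx_wakeup();
                return 0;
        }
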
    • phy: phy-mtk-tphy: use auto instead of force to bypass utmi signals · 4290940d
      Chunfeng Yun authored
      commit 00c0092c upstream.
      
      When the system is running, if the usb2 phy is forced to bypass utmi
      signals, all PLLs will be turned off and it can't detect device connection
      anymore. So replace force mode with auto mode, which can bypass utmi
      signals automatically if no device is attached, for the normal flow.
      But keep the force mode to fix the RX sensitivity degradation issue.
      Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
      Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4290940d
    • mtd: nand: qcom: Add a NULL check for devm_kasprintf() · 2424869e
      Fabio Estevam authored
      commit 069f0534 upstream.
      
      devm_kasprintf() may fail, so we should add a NULL check
      and propagate an error on failure.
      Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
      Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2424869e
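
      A hedged userspace analogue of the pattern, with asprintf() standing in
      for devm_kasprintf() (both return a formatted allocation that can fail;
      names here are arbitrary):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>

        /* format a name, propagating allocation failure to the caller */
        static int make_name(char **out, const char *base, int id)
        {
                if (asprintf(out, "%s%d", base, id) < 0) {
                        *out = NULL;
                        return -1;   /* like returning -ENOMEM in the driver */
                }
                return 0;
        }

        int main(void)
        {
                char *name;

                if (make_name(&name, "nandc", 0) == 0) {
                        puts(name);
                        free(name);
                }
                return 0;
        }
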
    • fix __legitimize_mnt()/mntput() race · e5751c84
      Al Viro authored
      commit 119e1ef8 upstream.
      
      __legitimize_mnt() has two problems - one is that in case of success
      the check of mount_lock is not ordered wrt the preceding increment of
      the refcount, making it possible to have a successful __legitimize_mnt()
      on one CPU just before the otherwise final mntput() on another,
      with __legitimize_mnt() not seeing mntput() taking the lock and
      mntput() not seeing the increment done by __legitimize_mnt().
      Solved by a pair of barriers.
      
      Another is that failure of __legitimize_mnt() on the second
      read_seqretry() leaves us with reference that'll need to be
      dropped by caller; however, if that races with final mntput()
      we can end up with caller dropping rcu_read_lock() and doing
      mntput() to release that reference - with the first mntput()
      having freed the damn thing just as rcu_read_lock() had been
      dropped.  Solution: in "do mntput() yourself" failure case
      grab mount_lock, check if MNT_DOOMED has been set by racing
      final mntput() that has missed our increment and if it has -
      undo the increment and treat that as "failure, caller doesn't
      need to drop anything" case.
      
      It's not easy to hit - the final mntput() has to come right
      after the first read_seqretry() in __legitimize_mnt() *and*
      manage to miss the increment done by __legitimize_mnt() before
      the second read_seqretry() in there.  The things that are almost
      impossible to hit on bare hardware are not impossible on SMP
      KVM, though...
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5751c84
    • fix mntput/mntput race · d9d46a22
      Al Viro authored
      commit 9ea0a46c upstream.
      
      mntput_no_expire() does the calculation of total refcount under mount_lock;
      unfortunately, the decrement (as well as all increments) are done outside
      of it, leading to false positives in the "are we dropping the last reference"
      test.  Consider the following situation:
      	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
      of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
      mntput(mnt) (called from __fput()) drops one reference, decrementing
      component #42.
      	* After it has looked at component #0, the process on CPU 0 does
      mntget(), incrementing component #0, gets preempted and gets to run again -
      on CPU 69.  There it does mntput(), which drops the reference (component #69)
      and proceeds to spin on mount_lock.
      	* On CPU 42 our first mntput() finishes counting.  It observes the
      decrement of component #69, but not the increment of component #0.  As the
      result, the total it gets is not 1 as it should've been - it's 0.  At which
      point we decide that vfsmount needs to be killed and proceed to free it and
      shut the filesystem down.  However, there's still another opened file
      on that filesystem, with reference to (now freed) vfsmount, etc. and we are
      screwed.
      
      It's not a wide race, but it can be reproduced with artificial slowdown of
      the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
      
      Fix consists of moving the refcount decrement under mount_lock; the tricky
      part is that we want (and can) keep the fast case (i.e. mount that still
      has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
      mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
      before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
      a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
      be dropped.
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Jann Horn <jannh@google.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d46a22
    • make sure that __dentry_kill() always invalidates d_seq, unhashed or not · d5426a38
      Al Viro authored
      commit 4c0d7cd5 upstream.
      
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of an already unhashed
      dentry does *not* touch ->d_seq at all.  Unhashing does, though,
      so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      it.
      
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
      Reported-by: default avatar"Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5426a38