1. 21 Dec, 2018 16 commits
    • powerpc/tm: Unset MSR[TS] if not recheckpointing · 6f5b9f01
      Breno Leitao authored
      There is a TM Bad Thing bug that can be caused when you return from a
      signal context in a suspended transaction but with ucontext MSR[TS] unset.
      
      In that situation, regs->msr[TS] is forced to be set at syscall entry
      (since the CPU state is transactional), and treclaim() is called to flush
      the transaction state, based on the live (mfmsr) MSR state.
      
      Since the user context MSR[TS] is not set, restore_tm_sigcontexts() is not
      called and no recheckpoint is executed, so the CPU state remains
      non-transactional. When rfid is then executed, SRR1 has MSR[TS] set while
      the CPU state is non-transactional, causing the TM Bad Thing with the
      following trace:
      
      	[   33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
      	cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
      	    pc: c00000000000c47c: fast_exception_return+0xac/0xb4
      	    lr: 00003fff865f442c
      	    sp: 3fffd9dce3e0
      	   msr: 8000000102a03031
      	  current = 0xc00000041f68b700
      	  paca    = 0xc00000000fb84800   softe: 0        irq_happened: 0x01
      	    pid   = 1721, comm = tm-signal-sigre
      	Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
      	WARNING: exception is not recoverable, can't continue
      
      The same problem happens in the 32-bit signal handler, and the fix is very
      similar: if tm_recheckpoint() is not executed, then regs->msr[TS] should be
      zeroed.
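
      The rule described above can be illustrated with a small user-space
      model. This is only a sketch, not the kernel code: struct model_regs and
      model_sigreturn() are hypothetical names, and the bit positions are
      assumed to match the MSR[TS] layout (bits 33 and 34) defined in
      arch/powerpc/include/asm/reg.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MSR[TS] field: suspended and transactional bits (assumed layout). */
#define MSR_TS_S    (1ULL << 33)
#define MSR_TS_T    (1ULL << 34)
#define MSR_TS_MASK (MSR_TS_S | MSR_TS_T)

/* Hypothetical stand-in for the saved user register state. */
struct model_regs {
	uint64_t msr;
};

/*
 * Model of the end of sigreturn: if the user context does not request a
 * transactional state, no recheckpoint happens, so the TS bits that were
 * forced at syscall entry must be cleared before returning to userspace.
 * Returns true if a recheckpoint would be performed.
 */
static inline bool model_sigreturn(struct model_regs *regs, uint64_t uc_msr)
{
	bool recheckpoint = (uc_msr & MSR_TS_MASK) != 0;

	if (recheckpoint) {
		/* Restore path: take the TS bits from the user context. */
		regs->msr = (regs->msr & ~MSR_TS_MASK) | (uc_msr & MSR_TS_MASK);
	} else {
		/* No recheckpoint: regs->msr[TS] must not stay set. */
		regs->msr &= ~MSR_TS_MASK;
	}
	return recheckpoint;
}
```

      In this model, SRR1 would be loaded from regs->msr, so clearing the TS
      bits keeps it consistent with a CPU that is no longer transactional.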
      
      This patch also fixes a sparse warning related to lack of indentation when
      CONFIG_PPC_TRANSACTIONAL_MEM is set.
      
      Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context")
      CC: Stable <stable@vger.kernel.org>	# 3.10+
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Tested-by: Michal Suchánek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Print scratch value · 11be3958
      Breno Leitao authored
      A TM Bad Thing exception is usually raised for one of three reasons:
      a) touching SPRs in an active transaction; b) using a TM instruction with
      the facility disabled; or c) setting a wrong MSR/SRR1 at RFID.
      
      The first two cases are easy to identify by looking at the instructions.
      The last case is harder, because the MSR is masked after RFID, so it is
      very useful to look at the previous MSR (SRR1) before RFID as well as the
      current, masked MSR.
      
      Since the MSR is saved in the paca just before RFID, this patch prints it
      if a TM Bad Thing happens, helping to understand which invalid TM
      transition caused the exception.
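
      When comparing the saved SRR1 with the current MSR, the fields of
      interest are MSR[TM] and MSR[TS]. A minimal decoder sketch follows. The
      helper names are hypothetical and not part of this patch, and the bit
      positions (TM = bit 32, TS = bits 33-34) are assumed to match
      arch/powerpc/include/asm/reg.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_TM      (1ULL << 32)   /* TM facility available */
#define MSR_TS_S    (1ULL << 33)   /* transaction suspended */
#define MSR_TS_T    (1ULL << 34)   /* transactional */
#define MSR_TS_MASK (MSR_TS_S | MSR_TS_T)

/* True if the TM facility bit is set in the given MSR value. */
static inline bool msr_tm_available(uint64_t msr)
{
	return (msr & MSR_TM) != 0;
}

/* Return a short name for the MSR[TS] state encoded in an MSR value. */
static inline const char *msr_ts_name(uint64_t msr)
{
	switch (msr & MSR_TS_MASK) {
	case 0:
		return "non-transactional";
	case MSR_TS_S:
		return "suspended";
	case MSR_TS_T:
		return "transactional";
	default:
		return "reserved";
	}
}
```

      Decoding both the printed scratch value and the current MSR this way
      shows which TM state transition the RFID attempted.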
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Save MSR to PACA before RFID · 63a0d6b0
      Breno Leitao authored
      As at other exit points, move SRR1 (MSR) into paca->tm_scratch so that,
      if there is a TM Bad Thing in RFID, it is easy to see which SRR1 value
      was being used.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Set MSR[TS] just prior to recheckpoint · e1c3743e
      Breno Leitao authored
      On a signal handler return, the user could set a context with MSR[TS] bits
      set, and these bits would be copied to task regs->msr.
      
      In restore_tm_sigcontexts(), after the current task's regs->msr[TS] bits
      are set, several __get_user() calls are made and then a recheckpoint is
      executed.
      
      This is a problem since a page fault (in kernel space) could happen when
      calling __get_user(). If it happens, the process MSR[TS] bits are
      already set, but recheckpoint has not been executed, and the SPRs are
      still invalid.
      
      The page fault can cause the current process to be de-scheduled with
      MSR[TS] active and without tm_recheckpoint() having been called and, more
      importantly, without the TEXASR[FS] bit set.
      
      When the process is scheduled back in, it tries to reclaim, which is
      aborted because the CPU is not in the suspended state, and then
      recheckpoints. This recheckpoint restores thread->texasr into the TEXASR
      SPR; since TEXASR might not have the FS bit set (it might be zero), this
      hits a BUG_ON().
      
      	kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
      	cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
      	    pc: c000000000054550: restore_gprs+0xb0/0x180
      	    lr: 0000000000000000
      	    sp: c00000041f157950
      	   msr: 8000000100021033
      	  current = 0xc00000041f143000
      	  paca    = 0xc00000000fb86300	 softe: 0	 irq_happened: 0x01
      	    pid   = 1021, comm = kworker/11:1
      	kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
      	Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
      	enter ? for help
      	[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
      	[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
      	[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
      	[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
      	[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
      	[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
      	[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
      
      This patch simply delays setting MSR[TS], so that a page fault in the
      __get_user() section does not see regs->msr[TS] set while the TM
      structures are still invalid, thus avoiding TM operations for in-kernel
      exceptions and a possible process reschedule.
      
      With this patch, MSR[TS] is only set just before recheckpointing and
      setting TEXASR[FS] = 1, thus avoiding an interrupt with the TM registers
      in an invalid state.
      
      In addition, if CONFIG_PREEMPT is set, there might be a preemption just
      after setting MSR[TS] and before tm_recheckpoint(). This block must
      therefore be atomic from a preemption perspective, so the code is
      wrapped in preempt_disable()/preempt_enable().
      
      It is not possible to move tm_recheckpoint() earlier, because the
      checkpointed registers must first be fetched from userspace with
      __get_user(). The only way to avoid this undesired behavior is therefore
      to delay setting MSR[TS].
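
      The ordering argued for above can be sketched as a user-space model.
      Everything here is a hypothetical stand-in rather than the kernel code:
      model_get_user() plays the role of __get_user() (a NULL source models a
      fault), the preempt_count field models preempt_disable()/enable(), and
      the texasr_fs flag models setting TEXASR[FS] before tm_recheckpoint().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_TS_S    (1ULL << 33)
#define MSR_TS_T    (1ULL << 34)
#define MSR_TS_MASK (MSR_TS_S | MSR_TS_T)

/* Hypothetical per-task state tracked by the model. */
struct model_task {
	uint64_t msr;
	bool     texasr_fs;
	int      preempt_count;
};

/* Stand-in for __get_user(): returns nonzero when the access faults. */
static inline int model_get_user(uint64_t *dst, const uint64_t *src)
{
	if (src == NULL)
		return -1;
	*dst = *src;
	return 0;
}

static inline int model_restore_tm(struct model_task *t, const uint64_t *uc_msr)
{
	uint64_t msr;

	/* 1. All user accesses come first: a fault leaves MSR[TS] clear. */
	if (model_get_user(&msr, uc_msr))
		return -1;

	/* 2. Only now set MSR[TS], with preemption disabled. */
	t->preempt_count++;		/* models preempt_disable() */
	t->msr = (t->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);

	/* 3. Set TEXASR[FS] and recheckpoint (both modeled by a flag). */
	t->texasr_fs = true;
	t->preempt_count--;		/* models preempt_enable() */
	return 0;
}
```

      The point of the model is that no step that can fault or sleep sits
      between setting MSR[TS] and completing the recheckpoint.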
      
      The 32-bit signal handler seems to be safe from this issue, but it
      might be exposed to the preemption issue, so preemption is disabled in
      that chunk of code as well.
      
      Changes from v2:
       * Run the critical section with preempt_disable.
      
      Fixes: 87b4e539 ("powerpc/tm: Fix return of active 64bit signals")
      Cc: stable@vger.kernel.org (v3.9+)
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/fadump: Do not allow hot-remove memory from fadump reserved area. · 0db6896f
      Mahesh Salgaonkar authored
      For fadump to work successfully, there should not be any holes in the
      reserved memory ranges where the kernel has asked firmware to move the
      content of old kernel memory in the event of a crash. Now that fadump
      uses CMA for the reserved area, this memory is not protected from
      hot-remove operations unless it is CMA allocated. Hence, the fadump
      service can fail to re-register after a hot-remove operation if the
      hot-removed memory belongs to the fadump reserved region. To avoid this,
      make sure that memory from the fadump reserved area is not hot-removable
      while fadump is registered.
      
      However, if the user still wants to remove that memory, they can do so by
      manually stopping the fadump service before the hot-remove operation.
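
      The policy described above amounts to a range-overlap check gated on the
      registration state. The following is a minimal sketch under that
      assumption. struct model_fadump and both helpers are hypothetical names
      and do not reflect the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the fadump reservation. */
struct model_fadump {
	bool     registered;
	uint64_t base;
	uint64_t size;
};

/* True if [start, start + len) overlaps the reserved area. */
static inline bool model_in_reserved(const struct model_fadump *fd,
				     uint64_t start, uint64_t len)
{
	return start < fd->base + fd->size && fd->base < start + len;
}

/* A range may be hot-removed unless fadump is registered and it overlaps. */
static inline bool model_can_hot_remove(const struct model_fadump *fd,
					uint64_t start, uint64_t len)
{
	return !(fd->registered && model_in_reserved(fd, start, len));
}
```

      Stopping the fadump service corresponds to clearing the registered flag
      in this model, after which the same range becomes removable.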
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/fadump: Throw proper error message on fadump registration failure · f86593be
      Mahesh Salgaonkar authored
      fadump fails to register when there are holes in the reserved memory
      area. This can happen if the user has hot-removed memory that falls in
      the fadump reserved memory area. Print a meaningful error message to the
      user in that case.
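
      Detecting a hole means checking that the reserved range is fully covered
      by present memory. The sketch below shows one way to do that. It is not
      the kernel's is_reserved_memory_area_contiguous(). The names are
      hypothetical, and the input array is assumed to be sorted by start
      address and non-overlapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A present memory range, half-open: [start, end). */
struct model_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true if [start, end) is completely covered by the given ranges,
 * i.e. there is no hole inside it.
 */
static inline bool model_area_contiguous(const struct model_range *mem,
					 size_t n, uint64_t start, uint64_t end)
{
	uint64_t covered = start;	/* [start, covered) is known present */
	size_t i;

	for (i = 0; i < n && covered < end; i++) {
		if (mem[i].end <= covered)
			continue;	/* range lies entirely before */
		if (mem[i].start > covered)
			return false;	/* hole begins at 'covered' */
		covered = mem[i].end;
	}
	return covered >= end;
}
```

      A false result corresponds to the case in which registration fails and
      the error message is printed.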
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      [mpe: is_reserved_memory_area_contiguous() returns bool, unsplit string]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/fadump: Reservationless firmware assisted dump · a4e92ce8
      Mahesh Salgaonkar authored
      One of the primary issues with Firmware Assisted Dump (fadump) on Power
      is that it needs a large amount of memory to be reserved. On large
      systems with terabytes of memory, this reservation can be quite
      significant.
      
      In some cases, fadump fails if the memory reserved is insufficient, or
      if the reserved memory was DLPAR hot-removed.
      
      In the normal case, post reboot, the preserved memory is filtered to
      extract only relevant areas of interest using the makedumpfile tool.
      While the tool provides flexibility to determine what needs to be part
      of the dump and what memory to filter out, all supported distributions
      default this to "Capture only kernel data and nothing else".
      
      We take advantage of this default and the Linux kernel's Contiguous
      Memory Allocator (CMA) to fundamentally change the memory reservation
      model for fadump.
      
      Instead of setting aside a significant chunk of memory that nobody can
      use, this patch uses CMA to reserve a significant chunk of memory that
      the kernel is prevented from using (due to MIGRATE_CMA) but that
      applications are free to use. With this, fadump is still able to
      capture all of the kernel memory and most of the user space memory,
      except the user pages that were present in the CMA region.
      
      Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           7557         193        6822          12         541        6725
      Swap:          4095           0        4095
      
      With this patch:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           8133         194        7464          12         475        7338
      Swap:          4095           0        4095
      
      Changes made here are completely transparent to how fadump has
      traditionally worked.
      
      Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
      CMA and its usage.
      
      TODO:
      - Handle case where CMA reservation spans nodes.
      Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Move opal_power_control_init() call in opal_init(). · 08fb726d
      Mahesh Salgaonkar authored
      opal_power_control_init() depends on the OPAL message notifier being
      initialized, which is done in opal_init()->opal_message_init(). But both
      of these initializations are called through machine initcalls, so it all
      depends on the order in which they are called. So far they have been
      called in the correct order (maybe we got lucky) and no issue has been
      seen. But it is clearer to control the initialization order explicitly
      by moving opal_power_control_init() into opal_init().
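
      The dependency can be made concrete with a toy model. The functions
      below are hypothetical stand-ins that only track readiness flags. In
      particular, the early-call failure of the power control init is an
      assumption of the model, not documented behavior of the real function.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_msg_notifier_ready;
static bool model_power_control_ready;

/* Models opal_message_init(): makes the message notifier usable. */
static inline void model_opal_message_init(void)
{
	model_msg_notifier_ready = true;
}

/* Depends on the message notifier; in this model it fails if run early. */
static inline int model_opal_power_control_init(void)
{
	if (!model_msg_notifier_ready)
		return -1;
	model_power_control_ready = true;
	return 0;
}

/* Explicit ordering: the dependency is visible in a single place. */
static inline int model_opal_init(void)
{
	model_opal_message_init();
	return model_opal_power_control_init();
}
```

      With the call made from the parent init function, the order no longer
      depends on how the initcalls happen to be sequenced.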
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/4xx: Delete an unnecessary return statement in two functions · ae6263cc
      Markus Elfring authored
      The script "checkpatch.pl" reported the following.
      
      WARNING: void function return statements are not generally useful
      
      Thus remove such a statement in the affected functions.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/4xx: Delete error message for a ENOMEM in two functions · a8d5dada
      Markus Elfring authored
      Omit an extra message for a memory allocation failure in these
      functions.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/4xx: Use seq_putc() in ocm_debugfs_show() · 52930bc6
      Markus Elfring authored
      A single character (line break) should be put into a sequence.
      Thus use the corresponding function "seq_putc".
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/4xx: Combine four seq_printf() calls into two in ocm_debugfs_show() · b52106a0
      Markus Elfring authored
      Some data were printed into a sequence by four separate function calls.
      Print the same data with two function calls instead.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/8xx: Allow pinning IMMR TLB when using early debug console · 96d19d70
      Christophe Leroy authored
      CONFIG_EARLY_DEBUG_CPM requires the IMMR area TLB to be pinned;
      otherwise it doesn't survive MMU_init, and the boot fails.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Remove PCI_MSI ifdef checks · 5f639e5f
      Oliver O'Halloran authored
      CONFIG_PCI_MSI was made mandatory by commit a311e738
      ("powerpc/powernv: Make PCI non-optional") so the #ifdef
      checks around CONFIG_PCI_MSI here can be removed entirely.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Reviewed-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved" · a0837876
      Alexandre Belloni authored
      Fix a spelling mistake in a register description.
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Powerpc/perf: Wire up PMI throttling · 0c9108b0
      Ravi Bangoria authored
      Commit 14c63f17 ("perf: Drop sample rate when sampling is too
      slow") introduced a way to throttle PMU interrupts if we're spending
      too much time just processing those. Wire up powerpc PMI handler to
      use this infrastructure.
      
      We have throttling of the *rate* of interrupts, but this adds
      throttling based on the *time taken* to process the interrupts.
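
      Time-based throttling can be pictured as timing each interrupt handler
      run and feeding the duration into an accounting function. The sketch
      below is a simplified user-space model with hypothetical names. The
      averaging formula and the single "throttled" flag are assumptions made
      for illustration and do not reproduce the perf core's actual policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttling state. */
struct model_throttle {
	uint64_t allowed_ns;	/* time budget per interrupt */
	uint64_t avg_ns;	/* running average of handler time */
	bool     throttled;
};

/* Feed one measured handler duration into the model. */
static inline void model_sample_event_took(struct model_throttle *t,
					   uint64_t len_ns)
{
	/* Simple decaying average: new = (7 * old + sample) / 8. */
	t->avg_ns = (7 * t->avg_ns + len_ns) / 8;
	if (t->avg_ns > t->allowed_ns)
		t->throttled = true;
}

/* Wrap one handler run: timestamps taken before and after it. */
static inline void model_pmi(struct model_throttle *t,
			     uint64_t start_ns, uint64_t end_ns)
{
	model_sample_event_took(t, end_ns - start_ns);
}
```

      The wrapper only needs a timestamp before and after the handler, which
      is why wiring an architecture's PMI handler into this kind of scheme is
      a small change.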
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 20 Dec, 2018 24 commits