1. 27 Apr, 2017 2 commits
    • Aneesh Kumar K.V's avatar
      powerpc/mm/radix: Optimise tlbiel flush all case · a5998fcb
      Aneesh Kumar K.V authored
      _tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of
      either RIC_FLUSH_TLB or RIC_FLUSH_ALL.
      
      RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC).
      
      To flush the whole TLB, we have to iterate over each set (congruence class) of
      the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not
      incorrect but it means we flush the PWC 128 times, when once would suffice.
      
      Fix it by doing the first flush with the ric value we're passed, and then if it
      was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have
      just flushed the PWC and don't need to do it again.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Split out of combined patch, tweak logic, rewrite change log]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a5998fcb
    • Aneesh Kumar K.V's avatar
      powerpc/mm/radix: Optimise Page Walk Cache flush · cf4f08be
      Aneesh Kumar K.V authored
      Currently we implement flushing of the page walk cache (PWC) by calling
      _tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to
      only flush the PWC.
      
      But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is
      not necessary when we're just flushing the PWC.
      
      In fact the set argument is ignored for a PWC flush, so essentially we're just
      flushing the PWC 127 extra times for no benefit.
      
      Fix it by adding tlbiel_pwc() which just does a single flush of the PWC.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Split out of combined patch, drop _ in name, rewrite change log]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      cf4f08be
  2. 26 Apr, 2017 3 commits
    • Michael Ellerman's avatar
      powerpc/powernv: Fix oops on P9 DD1 in cause_ipi() · 45b21cfe
      Michael Ellerman authored
      Recently we merged the native xive support for Power9, and then separately some
      reworks for doorbell IPI support. In isolation both series were OK, but the
      merged result had a bug in one case.
      
      On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
      falls back to the interrupt controller. However the fallback is implemented by
      calling icp_ops->cause_ipi. But now that xive support is merged we might be
      using xive, in which case icp_ops is not initialised, it's a xics specific
      structure. This leads to an oops such as:
      
        Unable to handle kernel paging request for data at address 0x00000028
        Oops: Kernel access of bad area, sig: 11 [#1]
        NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
        LR smp_muxed_ipi_message_pass+0x54/0x70
      
      To fix it, rather than using icp_ops which might be NULL, have both xics and
      xive set smp_ops->cause_ipi, and then in the powernv code we save that as
      ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON()
      to check if somehow smp_ops->cause_ipi is NULL.
      
      Fixes: b866cc21 ("powerpc: Change the doorbell IPI calling convention")
      Tested-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      45b21cfe
    • Michael Ellerman's avatar
      powerpc/powernv: Fix missing attr initialisation in opal_export_attrs() · 83c49190
      Michael Ellerman authored
      In opal_export_attrs() we dynamically allocate some bin_attributes. They're
      allocated with kmalloc() and although we initialise most of the fields, we don't
      initialise write() or mmap(), and in particular we don't initialise the lockdep
      related fields in the embedded struct attribute.
      
      This leads to a lockdep warning at boot:
      
        BUG: key c0000000f11906d8 not in .data!
        WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0
        ...
        Call Trace:
          lockdep_init_map+0x288/0x2a0 (unreliable)
          __kernfs_create_file+0x8c/0x170
          sysfs_add_file_mode_ns+0xc8/0x240
          __machine_initcall_powernv_opal_init+0x60c/0x684
          do_one_initcall+0x60/0x1c0
          kernel_init_freeable+0x2f4/0x3d4
          kernel_init+0x24/0x160
          ret_from_kernel_thread+0x5c/0xb0
      
      Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
      mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
      fields.
      
      Fixes: 11fe909d ("powerpc/powernv: Add OPAL exports attributes to sysfs")
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      83c49190
    • Michael Ellerman's avatar
      powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd() · b409946b
      Michael Ellerman authored
      The recent patch to add runtime configuration of the ASLR limits added a bug in
      arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits,
      leading to undefined behaviour.
      
      In practice it exhibits as every process seg faulting instantly, presumably
      because the rnd value hasn't been restricited by the modulus at all. We didn't
      notice because it only happens under certain kernel configurations and if the
      number of bits is actually set to a large value.
      
      Fix it by switching to unsigned long.
      
      Fixes: 9fea59bd ("powerpc/mm: Add support for runtime configuration of ASLR limits")
      Reported-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      b409946b
  3. 24 Apr, 2017 9 commits
    • David Gibson's avatar
      powerpc/mm: Ensure IRQs are off in switch_mm() · 9765ad13
      David Gibson authored
      powerpc expects IRQs to already be (soft) disabled when switch_mm() is
      called, as made clear in the commit message of 9c1e1052 ("powerpc: Allow
      perf_counters to access user memory at interrupt time").
      
      Aside from any race conditions that might exist between switch_mm() and an IRQ,
      there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
      followed at some point by an IRQ enable then interrupts will remain disabled
      until we return to userspace.
      
      It is true that when switch_mm() is called from the scheduler IRQs are off, but
      not when it's called by use_mm(). Looking closer we see that last year in commit
      f98db601 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
      this was made more explicit by the addition of switch_mm_irqs_off() which is now
      called by the scheduler, vs switch_mm() which is used by use_mm().
      
      Arguably it is a bug in use_mm() to call switch_mm() in a different context than
      it expects, but fixing that will take time.
      
      This was discovered recently when vhost started throwing warnings such as:
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:578
        in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
        no locks held by vhost-10760/10768.
        irq event stamp: 10
        hardirqs last  enabled at (9):  _raw_spin_unlock_irq+0x40/0x80
        hardirqs last disabled at (10): switch_slb+0x2e4/0x490
        softirqs last  enabled at (0):  copy_process+0x5e8/0x1260
        softirqs last disabled at (0):  (null)
        Call Trace:
          show_stack+0x88/0x390 (unreliable)
          dump_stack+0x30/0x44
          __might_sleep+0x1c4/0x2d0
          mutex_lock_nested+0x74/0x5c0
          cgroup_attach_task_all+0x5c/0x180
          vhost_attach_cgroups_work+0x58/0x80 [vhost]
          vhost_worker+0x24c/0x3d0 [vhost]
          kthread+0xec/0x100
          ret_from_kernel_thread+0x5c/0xd4
      
      Prior to commit 04b96e55 ("vhost: lockless enqueuing") (Aug 2016) the
      vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
      which had the effect of reenabling IRQs. Since that commit removed the locking
      in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
      off causing the warnings.
      
      This patch addresses the problem by making the powerpc code mirror the x86 code,
      ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
      defining switch_mm_irqs_off().
      
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Flesh out/rewrite change log, add stable]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9765ad13
    • Tyrel Datwyler's avatar
      powerpc/sysfs: Fix reference leak of cpu device_nodes present at boot · e76ca277
      Tyrel Datwyler authored
      For CPUs present at boot each logical CPU acquires a reference to the
      associated device node of the core. This happens in register_cpu() which
      is called by topology_init(). The result of this is that we end up with
      a reference held by each thread of the core. However, these references
      are never freed if the CPU core is DLPAR removed.
      
      This patch fixes the reference leaks by acquiring and releasing the references
      in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric
      reference counting is observed with both CPUs present at boot, and those DLPAR
      added after boot.
      
      Fixes: f86e4718 ("driver/core: cpu: initialize of_node in cpu's device struture")
      Cc: stable@vger.kernel.org # v3.12+
      Signed-off-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e76ca277
    • Tyrel Datwyler's avatar
      powerpc/pseries: Fix of_node_put() underflow during DLPAR remove · 68baf692
      Tyrel Datwyler authored
      Historically struct device_node references were tracked using a kref embedded as
      a struct field. Commit 75b57ecf ("of: Make device nodes kobjects so they
      show up in sysfs") (Mar 2014) refactored device_nodes to be kobjects such that
      the device tree could by more simply exposed to userspace using sysfs.
      
      Commit 0829f6d1 ("of: device_node kobject lifecycle fixes") (Mar 2014)
      followed up these changes to better control the kobject lifecycle and in
      particular the referecne counting via of_node_get(), of_node_put(), and
      of_node_init().
      
      A result of this second commit was that it introduced an of_node_put() call when
      a dynamic node is detached, in of_node_remove(), that removes the initial kobj
      reference created by of_node_init().
      
      Traditionally as the original dynamic device node user the pseries code had
      assumed responsibilty for releasing this final reference in its platform
      specific DLPAR detach code.
      
      This patch fixes a refcount underflow introduced by commit 0829f6d1, and
      recently exposed by the upstreaming of the recount API.
      
      Messages like the following are no longer seen in the kernel log with this
      patch following DLPAR remove operations of cpus and pci devices.
      
        rpadlpar_io: slot PHB 72 removed
        refcount_t: underflow; use-after-free.
        ------------[ cut here ]------------
        WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110
      
      Fixes: 0829f6d1 ("of: device_node kobject lifecycle fixes")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      [mpe: Make change log commit references more verbose]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      68baf692
    • Michael Ellerman's avatar
      powerpc/xmon: Deindent the SLB dumping logic · 85673646
      Michael Ellerman authored
      Currently the code that dumps SLB entries uses a double-nested if. This
      means the actual dumping logic is a bit squashed. Deindent it by using
      continue.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: default avatarRashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      85673646
    • Michael Ellerman's avatar
      Merge branch 'topic/kprobes' into next · 9fc84914
      Michael Ellerman authored
      Although most of these kprobes patches are powerpc specific, there's a couple
      that touch generic code (with Acks). At the moment there's one conflict with
      acme's tree, but it's not too bad. Still just in case some other conflicts show
      up, we've put these in a topic branch so another tree could merge some or all of
      it if necessary.
      9fc84914
    • Naveen N. Rao's avatar
      powerpc/kprobes: Prefer ftrace when probing function entry · 24bd909e
      Naveen N. Rao authored
      KPROBES_ON_FTRACE avoids much of the overhead of regular kprobes as it
      eliminates the need for a trap, as well as the need to emulate or single-step
      instructions.
      
      Though OPTPROBES provides us with similar performance, we have limited
      optprobes trampoline slots. As such, when asked to probe at a function
      entry, default to using the ftrace infrastructure.
      
      With:
        # cd /sys/kernel/debug/tracing
        # echo 'p _do_fork' > kprobe_events
      
      before patch:
        # cat ../kprobes/list
        c0000000000daf08  k  _do_fork+0x8    [DISABLED]
        c000000000044fc0  k  kretprobe_trampoline+0x0    [OPTIMIZED]
      
      and after patch:
        # cat ../kprobes/list
        c0000000000d074c  k  _do_fork+0xc    [DISABLED][FTRACE]
        c0000000000412b0  k  kretprobe_trampoline+0x0    [OPTIMIZED]
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      24bd909e
    • Naveen N. Rao's avatar
      powerpc: Introduce a new helper to obtain function entry points · 1b32cd17
      Naveen N. Rao authored
      kprobe_lookup_name() is specific to the kprobe subsystem and may not always
      return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE).
      For looking up function entry points, introduce a separate helper and use it
      in optprobes.c
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      1b32cd17
    • Naveen N. Rao's avatar
      powerpc/kprobes: Add support for KPROBES_ON_FTRACE · ead514d5
      Naveen N. Rao authored
      Allow kprobes to be placed on ftrace _mcount() call sites. This optimization
      avoids the use of a trap, by riding on ftrace infrastructure.
      
      This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL,
      which is only currently enabled on powerpc64le with newer toolchains.
      
      Based on the x86 code by Masami.
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ead514d5
    • Naveen N. Rao's avatar
      powerpc/ftrace: Restore LR from pt_regs · 2f59be5b
      Naveen N. Rao authored
      Pass the real LR to the ftrace handler. This is needed for KPROBES_ON_FTRACE for
      the pre handlers.
      
      Also, with KPROBES_ON_FTRACE, the link register may be updated by the pre
      handlers or by a registed kretprobe. Honor updated LR by restoring it from
      pt_regs, rather than from the stack save area.
      
      Live patch and function graph continue to work fine with this change.
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      2f59be5b
  4. 23 Apr, 2017 14 commits
  5. 21 Apr, 2017 2 commits
    • Michael Ellerman's avatar
      powerpc/mm: Add support for runtime configuration of ASLR limits · 9fea59bd
      Michael Ellerman authored
      Add powerpc support for mmap_rnd_bits and mmap_rnd_compat_bits, which are two
      sysctls that allow a user to configure the number of bits of randomness used for
      ASLR.
      
      Because of the way the Kconfig for ARCH_MMAP_RND_BITS is defined, we have to
      construct at least the MIN value in Kconfig, vs in a header which would be more
      natural. Given that we just go ahead and do it all in Kconfig.
      
      At least according to the code (the documentation makes no mention of it), the
      value is defined as the number of bits of randomisation *of the page*, not the
      address. This makes some sense, with larger page sizes more of the low bits are
      forced to zero, which would reduce the randomisation if we didn't take the
      PAGE_SIZE into account. However it does mean the min/max values have to change
      depending on the PAGE_SIZE in order to actually limit the amount of address
      space consumed by the randomisation.
      
      The result of that is that we have to define the default values based on both
      32-bit vs 64-bit, but also the configured PAGE_SIZE. Furthermore now that we
      have 128TB address space support on Book3S, we also have to take that into
      account.
      
      Finally we can wire up the value in arch_mmap_rnd().
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarBhupesh Sharma <bhsharma@redhat.com>
      Tested-by: default avatarBhupesh Sharma <bhsharma@redhat.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      9fea59bd
    • Oliver O'Halloran's avatar
      powerpc/mm: Wire up ioremap_cache() · f855b2f5
      Oliver O'Halloran authored
      The default implementation of ioremap_cache() is aliased to ioremap().
      On powerpc ioremap() creates cache-inhibited mappings by default which
      is almost certainly not what you wanted.
      Signed-off-by: default avatarOliver O'Halloran <oohall@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f855b2f5
  6. 20 Apr, 2017 8 commits
  7. 19 Apr, 2017 2 commits