Commits · a9dadc1c512807f955f0799e85830b420da47932 · nexedi / linux

24 Aug, 2017 1 commit

powerpc/xive: Fix the size of the cpumask used in xive_find_target_in_mask() · a9dadc1c

Cédric Le Goater authored Aug 08, 2017

When called from xive_irq_startup(), the size of the cpumask can be
larger than nr_cpu_ids. This can result in a WARN_ON such as:

  WARNING: CPU: 10 PID: 1 at ../arch/powerpc/sysdev/xive/common.c:476 xive_find_target_in_mask+0x110/0x2f0
  ...
  NIP [c00000000008a310] xive_find_target_in_mask+0x110/0x2f0
  LR [c00000000008a2e4] xive_find_target_in_mask+0xe4/0x2f0
  Call Trace:
    xive_find_target_in_mask+0x74/0x2f0 (unreliable)
    xive_pick_irq_target.isra.1+0x200/0x230
    xive_irq_startup+0x60/0x180
    irq_startup+0x70/0xd0
    __setup_irq+0x7bc/0x880
    request_threaded_irq+0x14c/0x2c0
    request_event_sources_irqs+0x100/0x180
    __machine_initcall_pseries_init_ras_IRQ+0x104/0x134
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x290/0x374
    kernel_init+0x24/0x170
    ret_from_kernel_thread+0x5c/0x74

This happens because we're being called with our affinity mask set to
irq_default_affinity. That in turn was populated using
cpumask_setall(), which sets NR_CPUs worth of bits, not nr_cpu_ids
worth. Finally cpumask_weight() will return > nr_cpu_ids when passed a
mask which has > nr_cpu_ids bits set.

Fix it by limiting the value returned by cpumask_weight().
Signed-off-by: Cédric Le Goater <clg@kaod.org>
[mpe: Add change log details on actual cause]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

a9dadc1c

23 Aug, 2017 19 commits

powerpc/64: Optimise set/clear of CTRL[RUN] (runlatch) · d1d0d5ff

Nicholas Piggin authored Aug 12, 2017

On modern CPUs the CTRL register is read-only except bit 63 which is
the run latch control. This means it can be updated with a mtspr
rather than mfspr/mtspr.

To accomodate older CPUs (Cell at least), where there are other bits
in the register, we still do a read/modify/write on pre 2.06 CPUs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Update change log to mention 2.06 workaround]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

d1d0d5ff

powerpc/64s: Remove spurious IRQ reason in IRQ replay · 7b76a1f5

Nicholas Piggin authored Aug 12, 2017

HVI interrupts have always used 0x500, so remove the dead branch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

7b76a1f5

powerpc/64: Remove redundant instruction in interrupt replay · ccd5eb83
Nicholas Piggin authored Aug 12, 2017
```
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
```
ccd5eb83

powerpc/64s: Use the HV handler for external IRQ replay in HV mode on POWER9 · e6c1203d

Nicholas Piggin authored Aug 12, 2017

POWER9 host external interrupts use the h_virt_irq_common handler, so
use that to replay them rather than using the hardware_interrupt_common
handler. Both call do_IRQ, but using the correct handler reduces
i-cache footprint.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

e6c1203d

powerpc/64s: Merge HV and non-HV paths for doorbell IRQ replay · d6f73fc6

Nicholas Piggin authored Aug 12, 2017

This results in smaller code, and fewer branches. This relies on the
fact that both the 0xe80 and 0xa00 handlers call the same upper level
code, namely doorbell_exception().
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Mention we rely on the implementation of the 0xe80/0xa00 handlers]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

d6f73fc6

powerpc/64: Cleanup __check_irq_replay() · 6f881eae

Nicholas Piggin authored Aug 12, 2017

Move the clearing of irq_happened bits into the condition where they
were found to be set. This reduces instruction count slightly, and
reduces stores into irq_happened.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6f881eae

powerpc/64s: masked_interrupt() returns to kernel so avoid restoring r13 · c05f0be8

Nicholas Piggin authored Aug 12, 2017

Places in the kernel where r13 is not the PACA pointer must have
maskable interrupts disabled, so r13 does not have to be restored when
returning from a soft-masked interrupt. We should never have
interrupts soft disabled when we're in user space.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

c05f0be8

powerpc/64s: Optimise clearing of MSR_EE in masked_[H]interrupt() · 6e9a2f6e

Nicholas Piggin authored Aug 12, 2017

MSR_EE is always enabled in SRR1 for masked interrupts, so we can use
xor to clear it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6e9a2f6e

powerpc/64s: Avoid a branch in masked_[H]interrupt() · e0c827c0

Nicholas Piggin authored Aug 12, 2017

Interrupts which do not require EE to be cleared can all be tested
with a single bitwise test.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

e0c827c0

powerpc/mm: Make switch_mm_irqs_off() out of line · 3a2df379

Benjamin Herrenschmidt authored Jul 24, 2017

It's too big to be inline, there is no reason to keep it
that way.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework to incorporate the comment changes via fixes branch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

3a2df379

powerpc/mm: Optimize detection of thread local mm's · a619e59c

Benjamin Herrenschmidt authored Jul 24, 2017

Instead of comparing the whole CPU mask every time, let's
keep a counter of how many bits are set in the mask. Thus
testing for a local mm only requires testing if that counter
is 1 and the current CPU bit is set in the mask.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

a619e59c

powerpc/mm: Use mm_is_thread_local() instread of open-coding · b426e4bd

Benjamin Herrenschmidt authored Jul 24, 2017

We open-code testing for the mm being local to the current CPU
in a few places. Use our existing helper instead.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

b426e4bd

powerpc/mm: Avoid double irq save/restore in activate_mm · 058ccc34

Benjamin Herrenschmidt authored Jul 24, 2017

It calls switch_mm() which already does the irq save/restore
these days.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

058ccc34

powerpc/mm: Move pgdir setting into a helper · 43ed84a8

Benjamin Herrenschmidt authored Jul 24, 2017

Makes switch_mm_irqs_off() a bit more readable
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

43ed84a8

powerpc/64s: Fix replay interrupt return label name · 3e23a12b

Michael Ellerman authored Aug 22, 2017

In __replay_interrupt() we take the address of a local label so we can
return to it later. However the assembler turns the local label into a
symbol with a name like ".L1^B42" - where "^B" is literally "\002".
This does not make for pleasant stack traces. Fix it by giving the
label a sensible name.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

3e23a12b

powerpc: pseries: remove dlpar_attach_node dependency on full path · 215ee763

Rob Herring authored Aug 21, 2017

In preparation to stop storing the full node path in full_name, remove the
dependency on full_name from dlpar_attach_node(). Callers of
dlpar_attach_node() already have the parent device_node, so just pass the
parent node into dlpar_attach_node instead of the path. This avoids doing
a lookup of the parent node by the path.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

215ee763

powerpc: Convert to using %pOF instead of full_name · b7c670d6

Rob Herring authored Aug 21, 2017

Now that we have a custom printf format specifier, convert users of
full_name to use %pOF instead. This is preparation to remove storing
of the full path string for each node.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Scott Wood <oss@buserror.net>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

b7c670d6

powerpc/vio: Use device_type to detect family · bcf21e3a

Michael Ellerman authored Aug 22, 2017

Currently in the vio.c code we use a comparision against the parent
device node's full path to decide if the device is a PFO or VIO family
device.

Both the ibm,platform-facilities and vdevice nodes are defined by PAPR,
and must have a matching device_type. So instead of using the path we
can instead compare the device_type.

I've checked Qemu and kvmtool both do this correctly, and all the
PowerVM systems I have access to do also. So it seems to be safe.

This removes the dependency on full_name, which is being removed
upstream.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

bcf21e3a

Merge branch 'fixes' into next · 15c659ff

Michael Ellerman authored Aug 23, 2017

There's a non-trivial dependency between some commits we want to put in
next and the KVM prefetch work around that went into fixes. So merge
fixes into next.

15c659ff

21 Aug, 2017 1 commit

macintosh/rack-meter: Make of_device_ids const · 516fa8d0

Arvind Yadav authored Jun 19, 2017

of_device_ids are not supposed to change at runtime. All functions
working with of_device_ids provided by <linux/of.h> work with const
of_device_ids. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
    407	    576	      0	    983	    3d7	drivers/macintosh/rack-meter.o

File size after constify rackmeter_match.
   text	   data	    bss	    dec	    hex	filename
    807	    176	      0	    983	    3d7	drivers/macintosh/rack-meter.o
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

516fa8d0

18 Aug, 2017 1 commit

powerpc/mm: Ensure cpumask update is ordered · 1a92a80a

Benjamin Herrenschmidt authored Jul 24, 2017

There is no guarantee that the various isync's involved with
the context switch will order the update of the CPU mask with
the first TLB entry for the new context being loaded by the HW.

Be safe here and add a memory barrier to order any subsequent
load/store which may bring entries into the TLB.

The corresponding barrier on the other side already exists as
pte updates use pte_xchg() which uses __cmpxchg_u64 which has
a sync after the atomic operation.

Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add comments in the code]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

1a92a80a

17 Aug, 2017 11 commits

powerpc/mm/cxl: Add the fault handling cpu to mm cpumask · 0f4bc093

Aneesh Kumar K.V authored Jul 27, 2017

We use mm cpumask for serializing against lockless page table walk.
Anybody who is doing a lockless page table walk is expected to disable
irq and only cpus in mm cpumask is expected do the lockless walk. This
ensure that a THP split can send IPI to only cpus in the mm cpumask,
to make sure there are no parallel lockless page table walk.

Add the CAPI fault handling cpu to the mm cpumask so that we can do
the lockless page table walk while inserting hash page table entries.
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

0f4bc093

powerpc/mm: Don't send IPI to all cpus on THP updates · fa4531f7

Aneesh Kumar K.V authored Jul 27, 2017

Now that we made sure that lockless walk of linux page table is mostly
limitted to current task(current->mm->pgdir) we can update the THP
update sequence to only send IPI to CPUs on which this task has run.
This helps in reducing the IPI overload on systems with large number
of CPUs.

WRT kvm even though kvm is walking page table with vpc->arch.pgdir,
it is done only on secondary CPUs and in that case we have primary CPU
added to task's mm cpumask. Sending an IPI to primary will force the
secondary to do a vm exit and hence this mm cpumask usage is safe
here.

WRT CAPI, we still end up walking linux page table with capi context
MM. For now the pte lookup serialization sends an IPI to all CPUs in
CPI is in use. We can further improve this by adding the CAPI
interrupt handling CPU to task mm cpumask. That will be done in a
later patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

fa4531f7

Merge branch 'topic/ppc-kvm' into next · 8434f089

Michael Ellerman authored Aug 17, 2017

Bring in the commit to rename find_linux_pte_or_hugepte() which touches
arch and KVM code, and might need to be merged with the kvmppc tree to
avoid conflicts.

8434f089

powerpc/mm: Rename find_linux_pte_or_hugepte() · 94171b19

Aneesh Kumar K.V authored Jul 27, 2017

Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.

For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

94171b19

powerpc/bpf: Use memset32() to pre-fill traps in BPF page(s) · 6acdc9a6

Naveen N. Rao authored Mar 28, 2017

Use the newly introduced memset32() to pre-fill BPF page(s) with trap
instructions.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6acdc9a6

powerpc/string: Implement optimized memset variants · 694fc88c

Naveen N. Rao authored Mar 28, 2017

Based on Matthew Wilcox's patches for other architectures.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

694fc88c

block/ps3vram: Check return of ps3vram_cache_init · 00e7c259

Geoff Levand authored Aug 07, 2017

Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: Jim Paris <jim@jtan.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

00e7c259

block/ps3vram: Delete an error message for a failed memory allocation in ps3vram_cache_init() · fd1335e0

Markus Elfring authored Aug 07, 2017

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdfSigned-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Jim Paris <jim@jtan.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

fd1335e0

selftests/powerpc: Improve tm-resched-dscr · 99597ced

Sam Bobroff authored Aug 17, 2017

The tm-resched-dscr self test can, in some situations, run for
several minutes before being successfully interrupted by the context
switch it needs in order to perform the test. This often seems to
occur when the test is being run in a virtual machine.

Improve the test by running it under eat_cpu() to guarantee
contention for the CPU and increase the chance of a context switch.

In practice this seems to reduce the test time, in some cases, from
more than two minutes to under a second.

Also remove the "progress dots" so that if the test does run for a
long time, it doesn't produce large amounts of unnecessary output.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

99597ced

powerpc/perf: Fix usage of nest_imc_refc · 711bd207

Madhavan Srinivasan authored Aug 16, 2017

nest_imc_refc is a reference count struct, used to track number of
active perf sessions using the nest units.

Currently the code accesses nest_imc_refc using node_id, which is
incorrect, the array is indexed by node number. Meaning in the case of
sparse node ids we index off the end of the array.

Fix it to use get_nest_pmu_ref() which uses the existing per-cpu
variable local_nest_imc_refc.

Fixes: 885dcd70 ('powerpc/perf: Add nest IMC PMU support')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

711bd207

powerpc: Add const to bin_attribute structures · 8bfa42ab

Bhumika Goyal authored Aug 02, 2017

Declare bin_attribute structures as const as they are only passed as an
argument to the function sysfs_create_bin_file. This argument is of
type const, so declare the structure as const.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

8bfa42ab

16 Aug, 2017 7 commits

powerpc: Remove more redundant VSX save/tests · 96c79b6b

Benjamin Herrenschmidt authored Aug 16, 2017

__giveup_vsx/save_vsx are completely equivalent to testing MSR_FP
and MSR_VEC and calling the corresponding giveup/save function so
just remove the spurious VSX cases. Also add WARN_ONs checking that
we never have VSX enabled without the two other.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

96c79b6b

powerpc: Remove redundant clear of MSR_VSX in __giveup_vsx() · dc801081

Benjamin Herrenschmidt authored Aug 16, 2017

__giveup_fpu() already does it and we cannot have MSR_VSX set
without having MSR_FP also set.

This also adds a warning to check we indeed do
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

dc801081

powerpc: Remove redundant FP/Altivec giveup code · 746874d3

Benjamin Herrenschmidt authored Aug 16, 2017

__giveup_vsx() already calls those two functions.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

746874d3

powerpc: Fix missing newline before { · 6a303833

Benjamin Herrenschmidt authored Aug 16, 2017

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6a303833

powerpc/topology: Remove the unused parent_node() macro · 7baebe54

Dou Liyang authored Jul 26, 2017

Commit a7be6e5a ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in POWERPC platform is unnecessary.

Remove it for cleanup.
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

7baebe54

powerpc: Fix VSX enabling/flushing to also test MSR_FP and MSR_VEC · 5a69aec9

Benjamin Herrenschmidt authored Aug 16, 2017

VSX uses a combination of the old vector registers, the old FP
registers and new "second halves" of the FP registers.

Thus when we need to see the VSX state in the thread struct
(flush_vsx_to_thread()) or when we'll use the VSX in the kernel
(enable_kernel_vsx()) we need to ensure they are all flushed into
the thread struct if either of them is individually enabled.

Unfortunately we only tested if the whole VSX was enabled, not if they
were individually enabled.

Fixes: 72cd7b44 ("powerpc: Uncomment and make enable_kernel_vsx() routine available")
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

5a69aec9

powerpc/mm/hugetlb: Allow runtime allocation of 16G. · 4ae279c2

Aneesh Kumar K.V authored Jul 28, 2017

Now that we have GIGANTIC_PAGE enabled on powerpc, use this for 16G hugepages
with hash translation mode. Depending on the total system memory we have, we may
be able to allocate 16G hugepages runtime. This also remove the hugetlb setup
difference between hash/radix translation mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

4ae279c2