- 01 Jul, 2019 3 commits
-
-
Michael Neuling authored
commit 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") added an option to turn off Linux native XIVE usage via the xive=off kernel command line option. This documents this option. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Acked-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
Merge our fixes branch into next, this brings in a number of commits that fix bugs we don't want to hit in next, in particular the fix for CVE-2019-12817.
-
Michael Ellerman authored
This merges the commits that were the fix for CVE-2019-12817, which was developed under embargo. They have already been merged by Linus Merge them into fixes now so that this branch contains all the fixes for this release.
-
- 25 Jun, 2019 1 commit
-
-
Nicholas Piggin authored
The early machine check runs in real mode, so locking is unnecessary. Worse, the windup does not restore AMR, so this can result in a false KUAP fault after a recoverable machine check hits inside a user copy operation. Fix this similarly to HMI by just avoiding the kuap lock in the early machine check handler (it will be set by the late handler that runs in virtual mode if that runs). If the virtual mode handler is reached, it will lock and restore the AMR. Fixes: 890274c2 ("powerpc/64s: Implement KUAP for Radix MMU") Cc: Russell Currey <ruscur@russell.cc> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 20 Jun, 2019 4 commits
-
-
Suraj Jitindar Singh authored
If we enter an L1 guest with a pending decrementer exception then this is cleared on guest exit if the guest has writtien a positive value into the decrementer (indicating that it handled the decrementer exception) since there is no other way to detect that the guest has handled the pending exception and that it should be dequeued. In the event that the L1 guest tries to run a nested (L2) guest immediately after this and the L2 guest decrementer is negative (which is loaded by L1 before making the H_ENTER_NESTED hcall), then the pending decrementer exception isn't cleared and the L2 entry is blocked since L1 has a pending exception, even though L1 may have already handled the exception and written a positive value for it's decrementer. This results in a loop of L1 trying to enter the L2 guest and L0 blocking the entry since L1 has an interrupt pending with the outcome being that L2 never gets to run and hangs. Fix this by clearing any pending decrementer exceptions when L1 makes the H_ENTER_NESTED hcall since it won't do this if it's decrementer has gone negative, and anyway it's decrementer has been communicated to L0 in the hdec_expires field and L0 will return control to L1 when this goes negative by delivering an H_DECREMENTER exception. Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Suraj Jitindar Singh authored
On POWER9 the decrementer can operate in large decrementer mode where the decrementer is 56 bits and signed extended to 64 bits. When not operating in this mode the decrementer behaves as a 32 bit decrementer which is NOT signed extended (as on POWER8). Currently when reading a guest decrementer value we don't take into account whether the large decrementer is enabled or not, and this means the value will be incorrect when the guest is not using the large decrementer. Fix this by sign extending the value read when the guest isn't using the large decrementer. Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Suraj Jitindar Singh authored
When a guest vcpu moves from one physical thread to another it is necessary for the host to perform a tlb flush on the previous core if another vcpu from the same guest is going to run there. This is because the guest may use the local form of the tlb invalidation instruction meaning stale tlb entries would persist where it previously ran. This is handled on guest entry in kvmppc_check_need_tlb_flush() which calls flush_guest_tlb() to perform the tlb flush. Previously the generic radix__local_flush_tlb_lpid_guest() function was used, however the functionality was reimplemented in flush_guest_tlb() to avoid the trace_tlbie() call as the flushing may be done in real mode. The reimplementation in flush_guest_tlb() was missing an erat invalidation after flushing the tlb. This lead to observable memory corruption in the guest due to the caching of stale translations. Fix this by adding the erat invalidation. Fixes: 70ea13f6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
When the firmware does PCI BAR resource allocation, it passes the assigned addresses and flags (prefetch/64bit/...) via the "reg" property of a PCI device device tree node so the kernel does not need to do resource allocation. The flags are stored in resource::flags - the lower byte stores PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc. Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated, such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc. When parsing the "reg" property, we copy the prefetch flag but we skip on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync. The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions: 1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch() or by passing "/chosen/linux,pci-probe-only"); 2. we request resource alignment (by passing pci=resource_alignment= via the kernel cmd line to request PAGE_SIZE alignment or defining ppc_md.pcibios_default_alignment which returns anything but 0). Note that the alignment requests are ignored if PCI_PROBE_ONLY is enabled. With 1) and 2), the generic PCI code in the kernel unconditionally decides to: - reassign the BARs in pci_specified_resource_alignment() (works fine) - write new BARs to the device - this fails for 64bit BARs as the generic code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping in the hypervisor. This fixes the issue by copying the flag. This is useful if we want to enforce certain BAR alignment per platform as handling subpage sized BARs is proven to cause problems with hotplug (SLOF already aligns BARs to 64k). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Shawn Anastasio <shawn@anastas.io> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 19 Jun, 2019 13 commits
-
-
Christoph Hellwig authored
With the strict dma mask checking introduced with the switch to the generic DMA direct code common wifi chips on 32-bit powerbooks stopped working. Add a 30-bit ZONE_DMA to the 32-bit pmac builds to allow them to reliably allocate dma coherent memory. Fixes: 65a21b71 ("powerpc/dma: remove dma_nommu_dma_supported") Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Acked-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This sets the HAVE_ARCH_HUGE_VMAP option, and defines the required page table functions. This enables huge (2MB and 1GB) ioremap mappings. I don't have a benchmark for this change, but huge vmap will be used by a later core kernel change to enable huge vmalloc memory mappings. This improves cached `git diff` performance by about 5% on a 2-node POWER9 with 32MB size dentry cache hash. Profiling git diff dTLB misses with a vanilla kernel: 81.75% git [kernel.vmlinux] [k] __d_lookup_rcu 7.21% git [kernel.vmlinux] [k] strncpy_from_user 1.77% git [kernel.vmlinux] [k] find_get_entry 1.59% git [kernel.vmlinux] [k] kmem_cache_free 40,168 dTLB-miss 0.100342754 seconds time elapsed With powerpc huge vmalloc: 2,987 dTLB-miss 0.095933138 seconds time elapsed Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Radix can use ioremap_page_range for ioremap, after slab is available. This makes it possible to enable huge ioremap mapping support. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
__ioremap_at error handling is wonky, it requires caller to clean up after it. Implement a helper that does the map and error cleanup and remove the requirement from the caller. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Anju T Sudhakar authored
Nest and core IMC (In-Memory Collection counters) assigns a particular cpu as the designated target for counter data collection. During system boot, the first online cpu in a chip gets assigned as the designated cpu for that chip(for nest-imc) and the first online cpu in a core gets assigned as the designated cpu for that core(for core-imc). If the designated cpu goes offline, the next online cpu from the same chip(for nest-imc)/core(for core-imc) is assigned as the next target, and the event context is migrated to the target cpu. Currently, cpumask_any_but() function is used to find the target cpu. Though this function is expected to return a `random` cpu, this always returns the next online cpu. If all cpus in a chip/core is offlined in a sequential manner, starting from the first cpu, the event migration has to happen for all the cpus which goes offline. Since the migration process involves a grace period, the total time taken to offline all the cpus will be significantly high. Example: In a system which has 2 sockets, with NUMA node0 CPU(s): 0-87 NUMA node8 CPU(s): 88-175 Time taken to offline cpu 88-175: real 2m56.099s user 0m0.191s sys 0m0.000s Use cpumask_last() to choose the target cpu, when the designated cpu goes online, so the migration will happen only when the last_cpu in the mask goes offline. This way the time taken to offline all cpus in a chip/core can be reduced. With the patch: Time taken to offline cpu 88-175: real 0m12.207s user 0m0.171s sys 0m0.000s Offlining all cpus in reverse order is also taken care because, cpumask_any_but() is used to find the designated cpu if the last cpu in the mask goes offline. Since cpumask_any_but() always return the first cpu in the mask, that becomes the designated cpu and migration will happen only when the first_cpu in the mask goes offline. Example: With the patch, Time taken to offline cpu from 175-88: real 0m9.330s user 0m0.110s sys 0m0.000s Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Shaokun Zhang authored
pr_info shows SPR and timebase as a decimal value with a '0x' prefix, which is somewhat misleading. Fix it to print hexadecimal, as was intended. Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C") Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nathan Lynch authored
A couple of bugs in queue_hotplug_event(): 1. Unchecked kmalloc result which could lead to an oops. 2. Use of GFP_KERNEL allocations in interrupt context (this code's only caller is ras_hotplug_interrupt()). Use kmemdup to avoid open-coding the allocation+copy and check for failure; use GFP_ATOMIC for both allocations. Ultimately it probably would be better to avoid or reduce allocations in this path if possible. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Ravi Bangoria authored
powerpc hardware triggers watchpoint before executing the instruction. To make trigger-after-execute behavior, kernel emulates the instruction. If the instruction is 'load something into non-volatile register', exception handler should restore emulated register state while returning back, otherwise there will be register state corruption. eg, adding a watchpoint on a list can corrput the list: # cat /proc/kallsyms | grep kthread_create_list c00000000121c8b8 d kthread_create_list Add watchpoint on kthread_create_list->prev: # perf record -e mem:0xc00000000121c8c0 Run some workload such that new kthread gets invoked. eg, I just logged out from console: list_add corruption. next->prev should be prev (c000000001214e00), \ but was c00000000121c8b8. (next=c00000000121c8b8). WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0 CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69 ... NIP __list_add_valid+0xb4/0xc0 LR __list_add_valid+0xb0/0xc0 Call Trace: __list_add_valid+0xb0/0xc0 (unreliable) __kthread_create_on_node+0xe0/0x260 kthread_create_on_node+0x34/0x50 create_worker+0xe8/0x260 worker_thread+0x444/0x560 kthread+0x160/0x1a0 ret_from_kernel_thread+0x5c/0x70 List corruption happened because it uses 'load into non-volatile register' instruction: Snippet from __kthread_create_on_node: c000000000136be8: addis r29,r2,-19 c000000000136bec: ld r29,31424(r29) if (!__list_add_valid(new, prev, next)) c000000000136bf0: mr r3,r30 c000000000136bf4: mr r5,r28 c000000000136bf8: mr r4,r29 c000000000136bfc: bl c00000000059a2f8 <__list_add_valid+0x8> Register state from WARN_ON(): GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075 GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000 GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00 GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370 GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000 GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628 GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0 Watchpoint hit at 0xc000000000136bec. addis r29,r2,-19 => r29 = 0xc000000001344e00 + (-19 << 16) => r29 = 0xc000000001214e00 ld r29,31424(r29) => r29 = *(0xc000000001214e00 + 31424) => r29 = *(0xc00000000121c8c0) 0xc00000000121c8c0 is where we placed a watchpoint and thus this instruction was emulated by emulate_step. But because handle_dabr_fault did not restore emulated register state, r29 still contains stale value in above register state. Fixes: 5aae8a53 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors") Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: stable@vger.kernel.org # 2.6.36+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Greg Kroah-Hartman authored
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Because there's no need to check, also make the return value of the local debugfs_create_io_x64() call void, as no one ever did anything with the return value (as they did not need to.) And make the cxl_debugfs_* calls return void as no one was even checking their return value at all. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Geert Uytterhoeven authored
Flexible array members should be denoted using [] instead of [0], else gcc will not warn when they are no longer at the end of the structure. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Andreas Schwab authored
Move a misplaced paren that makes the condition always true. Fixes: 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
Previously, only IBAT1 and IBAT2 were used to map kernel linear mem. Since commit 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX"), we may have all 8 BATs used for mapping kernel text. But the suspend/restore functions only save/restore BATs 0 to 3, and clears BATs 4 to 7. Make suspend and restore functions respectively save and reload the 8 BATs on CPUs having MMU_FTR_USE_HIGH_BATS feature. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Gustavo Romero authored
In some cases, compiler can allocate the same register for operand 'res' and 'vecoutptr', resulting in segfault at 'stxvd2x 40,0,%[vecoutptr]' because base register will contain 1, yielding a false-positive. This is because output 'res' must be marked as an earlyclobber operand so it may not overlap an input operand ('vecoutptr'). Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com> Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 18 Jun, 2019 2 commits
-
-
Suraj Jitindar Singh authored
The hcall H_SET_DAWR is used by a guest to set the data address watchpoint register (DAWR). This hcall is handled in the host in kvmppc_h_set_dawr() which can be called in either real mode on the guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S, or in virtual mode when called from kvmppc_pseries_do_hcall() in book3s_hv.c. The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in the vcpu struct accordingly and then also writes the respective values into the DAWR and DAWRX registers directly. It is necessary to write the registers directly here when calling the function in real mode since the path to re-enter the guest won't do this. However when in virtual mode the host DAWR and DAWRX values have already been restored, and so writing the registers would overwrite these. Additionally there is no reason to write the guest values here as these will be read from the vcpu struct and written to the registers appropriately the next time the vcpu is run. This also avoids the case when handling h_set_dawr for a nested guest where the guest hypervisor isn't able to write the DAWR and DAWRX registers directly and must rely on the real hypervisor to do this for it when it calls H_ENTER_NESTED. Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Neuling authored
Commit c1fe190c ("powerpc: Add force enable of DAWR on P9 option") screwed up some assembler and corrupted a pointer in r3. This resulted in crashes like the below: BUG: Kernel NULL pointer dereference at 0x000013bf Faulting instruction address: 0xc00000000010b044 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3 NIP: c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4 REGS: c00000179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 42244842 XER: 00000000 CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0 GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0 ... NIP kvmppc_h_set_dabr+0x50/0x68 LR kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv] Call Trace: 0xc0000017f11c0000 (unreliable) kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] kvmppc_vcpu_run+0x34/0x48 [kvm] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] kvm_vcpu_ioctl+0x460/0x850 [kvm] do_vfs_ioctl+0xe4/0xb40 ksys_ioctl+0xc4/0x110 sys_ioctl+0x28/0x80 system_call+0x5c/0x70 Instruction dump: 4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff 4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6 Fix the bug by only changing r3 when we are returning immediately. Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 15 Jun, 2019 7 commits
-
-
Christophe Leroy authored
Build failure was introduced by the commit identified below, due to missed macro expension leading to wrong called function's name. arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall': arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to `kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1' Makefile:1052: recipe for target 'vmlinux' failed The called function should be kvmppc_handler_8_0x01B(). This patch fixes it. Reported-by: Paul Mackerras <paulus@ozlabs.org> Fixes: 1a4b739b ("powerpc/32: implement fast entry for syscalls on BOOKE") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
Otherwise, the following warning is encountered: WARNING: vmlinux.o(.text+0x3dc6): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup() The function start_here_multiplatform() references the function __init .early_setup(). This is often because start_here_multiplatform lacks a __init annotation or the annotation of .early_setup is wrong. Fixes: 56c46bba ("powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX") Cc: Russell Currey <ruscur@russell.cc> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
Use r10 instead of r9 to calculate CPU offset as r9 contains the value from SRR1 which is used later. Fixes: 1a4b739b ("powerpc/32: implement fast entry for syscalls on BOOKE") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
The patch referenced below moved the loading of segment registers out of load_up_mmu() in order to do it earlier in the boot sequence. However, the secondary CPU still needs it to be done when loading up the MMU. Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: 215b8237 ("powerpc/32s: set up an early static hash table for KASAN") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nathan Lynch authored
It's common for the platform to replace the cache device nodes after a migration. Since the cacheinfo code is never informed about this, it never drops its references to the source system's cache nodes, causing it to wind up in an inconsistent state resulting in warnings and oopses as soon as CPU online/offline occurs after the migration, e.g. cache for /cpus/l3-cache@3113(Unified) refers to cache for /cpus/l2-cache@200d(Unified) WARNING: CPU: 15 PID: 86 at arch/powerpc/kernel/cacheinfo.c:176 release_cache+0x1bc/0x1d0 [...] NIP release_cache+0x1bc/0x1d0 LR release_cache+0x1b8/0x1d0 Call Trace: release_cache+0x1b8/0x1d0 (unreliable) cacheinfo_cpu_offline+0x1c4/0x2c0 unregister_cpu_online+0x1b8/0x260 cpuhp_invoke_callback+0x114/0xf40 cpuhp_thread_fun+0x270/0x310 smpboot_thread_fn+0x2c8/0x390 kthread+0x1b8/0x1c0 ret_from_kernel_thread+0x5c/0x68 Using device tree notifiers won't work since we want to rebuild the hierarchy only after all the removals and additions have occurred and the device tree is in a consistent state. Call cacheinfo_teardown() before processing device tree updates, and rebuild the hierarchy afterward. Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nathan Lynch authored
CPU online/offline code paths are sensitive to parts of the device tree (various cpu node properties, cache nodes) that can be changed as a result of a migration. Prevent CPU hotplug while the device tree potentially is inconsistent. Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nathan Lynch authored
Allow external callers to force the cacheinfo code to release all its references to cache nodes, e.g. before processing device tree updates post-migration, and to rebuild the hierarchy afterward. CPU online/offline must be blocked by callers; enforce this. Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 14 Jun, 2019 3 commits
-
-
Nathan Lynch authored
During post-migration device tree updates, we can oops in pseries_update_drconf_memory() if the source device tree has an ibm,dynamic-memory-v2 property and the destination has a ibm,dynamic_memory (v1) property. The notifier processes an "update" for the ibm,dynamic-memory property but it's really an add in this scenario. So make sure the old property object is there before dereferencing it. Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Daniel Axtens authored
While developing KASAN for 64-bit book3s, I hit the following stack over-read. It occurs because the hypercall to put characters onto the terminal takes 2 longs (128 bits/16 bytes) of characters at a time, and so hvc_put_chars() would unconditionally copy 16 bytes from the argument buffer, regardless of supplied length. However, udbg_hvc_putc() can call hvc_put_chars() with a single-byte buffer, leading to the error. ================================================================== BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110 Read of size 8 at addr c0000000023e7a90 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113 Call Trace: dump_stack+0x104/0x154 (unreliable) print_address_description+0xa0/0x30c __kasan_report+0x20c/0x224 kasan_report+0x18/0x30 __asan_report_load8_noabort+0x24/0x40 hvc_put_chars+0xdc/0x110 hvterm_raw_put_chars+0x9c/0x110 udbg_hvc_putc+0x154/0x200 udbg_write+0xf0/0x240 console_unlock+0x868/0xd30 register_console+0x970/0xe90 register_early_udbg_console+0xf8/0x114 setup_arch+0x108/0x790 start_kernel+0x104/0x784 start_here_common+0x1c/0x534 Memory state around the buggy address: c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 >c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 ^ c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Document that a 16-byte buffer is requred, and provide it in udbg. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Masahiro Yamada authored
Linux kernel tolerates C++ style comments these days. Actually, the SPDX License tags for .c files start with //. On the other hand, uapi headers are written in more strict C, where the C++ comment style is forbidden. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Frederic Barrat <fbarrat@linux.ibm.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 13 Jun, 2019 2 commits
-
-
Michael Ellerman authored
This merges a fix for a bug in our context id handling on 64-bit hash CPUs. The fix was written against v5.1 to ease backporting to stable releases. Here we are merging it up to a v5.2-rc2 base, which involves a bit of manual resolution. It also adds a test case for the bug. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
This tests that when a process with a mapping above 512TB forks we correctly separate the parent and child address spaces. This exercises the bug in the context id handling fixed in the previous commit. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 12 Jun, 2019 1 commit
-
-
Michael Ellerman authored
When using the Hash Page Table (HPT) MMU, userspace memory mappings are managed at two levels. Firstly in the Linux page tables, much like other architectures, and secondly in the SLB (Segment Lookaside Buffer) and HPT. It's the SLB and HPT that are actually used by the hardware to do translations. As part of the series adding support for 4PB user virtual address space using the hash MMU, we added support for allocating multiple "context ids" per process, one for each 512TB chunk of address space. These are tracked in an array called extended_id in the mm_context_t of a process that has done a mapping above 512TB. If such a process forks (ie. clone(2) without CLONE_VM set) it's mm is copied, including the mm_context_t, and then init_new_context() is called to reinitialise parts of the mm_context_t as appropriate to separate the address spaces of the two processes. The key step in ensuring the two processes have separate address spaces is to allocate a new context id for the process, this is done at the beginning of hash__init_new_context(). If we didn't allocate a new context id then the two processes would share mappings as far as the SLB and HPT are concerned, even though their Linux page tables would be separate. For mappings above 512TB, which use the extended_id array, we neglected to allocate new context ids on fork, meaning the parent and child use the same ids and therefore share those mappings even though they're supposed to be separate. This can lead to the parent seeing writes done by the child, which is essentially memory corruption. There is an additional exposure which is that if the child process exits, all its context ids are freed, including the context ids that are still in use by the parent for mappings above 512TB. One or more of those ids can then be reallocated to a third process, that process can then read/write to the parent's mappings above 512TB. Additionally if the freed id is used for the third process's primary context id, then the parent is able to read/write to the third process's mappings *below* 512TB. All of these are fundamental failures to enforce separation between processes. The only mitigating factor is that the bug only occurs if a process creates mappings above 512TB, and most applications still do not create such mappings. Only machines using the hash page table MMU are affected, eg. PowerPC 970 (G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines (powernv) use the Radix MMU and are not affected, unless the machine has been explicitly booted in HPT mode (using disable_radix on the kernel command line). KVM guests on Power9 may be affected if the host or guest is configured to use the HPT MMU. LPARs under PowerVM on Power9 are affected as they always use the HPT MMU. Kernels built with PAGE_SIZE=4K are not affected. The fix is relatively simple, we need to reallocate context ids for all extended mappings on fork. Fixes: f384796c ("powerpc/mm: Add support for handling > 512TB address in SLB miss") Cc: stable@vger.kernel.org # v4.17+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 07 Jun, 2019 4 commits
-
-
Christophe Leroy authored
When booting through OF, setup_disp_bat() does nothing because disp_BAT are not set. By change, it used to work because BOOTX buffer is mapped 1:1 at address 0x81000000 by the bootloader, and btext_setup_display() sets virt addr same as phys addr. But since commit 215b8237 ("powerpc/32s: set up an early static hash table for KASAN."), a temporary page table overrides the bootloader mapping. This 0x81000000 is also problematic with the newly implemented Kernel Userspace Access Protection (KUAP) because it is within user address space. This patch fixes those issues by properly setting disp_BAT through a call to btext_prepare_BAT(), allowing setup_disp_bat() to properly setup BAT3 for early bootx screen buffer access. Reported-by: Mathieu Malaterre <malat@debian.org> Fixes: 215b8237 ("powerpc/32s: set up an early static hash table for KASAN.") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Tested-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The change to pmdp_invalidate() to mark the pmd with _PAGE_INVALID broke the synchronisation against lock free lookups, __find_linux_pte()'s pmd_none() check no longer returns true for such cases. Fix this by adding a check for this condition as well. Fixes: da7ad366 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit") Cc: stable@vger.kernel.org # v4.20+ Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Commit 1b2443a5 ("powerpc/book3s64: Avoid multiple endian conversion in pte helpers") changed the actual bitwise tests in pte_access_permitted by using pte_write() and pte_present() helpers rather than raw bitwise testing _PAGE_WRITE and _PAGE_PRESENT bits. The pte_present() change now returns true for PTEs which are !_PAGE_PRESENT and _PAGE_INVALID, which is the combination used by pmdp_invalidate() to synchronize access from lock-free lookups. pte_access_permitted() is used by pmd_access_permitted(), so allowing GUP lock free access to proceed with such PTEs breaks this synchronisation. This bug has been observed on a host using the hash page table MMU, with random crashes and corruption in guests, usually together with bad PMD messages in the host. Fix this by adding an explicit check in pmd_access_permitted(), and documenting the condition explicitly. The pte_write() change should be okay, and would prevent GUP from falling back to the slow path when encountering savedwrite PTEs, which matches what x86 (that does not implement savedwrite) does. Fixes: 1b2443a5 ("powerpc/book3s64: Avoid multiple endian conversion in pte helpers") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
In the old days, _PAGE_EXEC didn't exist on 6xx aka book3s/32. Therefore, allthough __mapin_ram_chunk() was already mapping kernel text with PAGE_KERNEL_TEXT and the rest with PAGE_KERNEL, the entire memory was executable. Part of the memory (first 512kbytes) was mapped with BATs instead of page table, but it was also entirely mapped as executable. In commit 385e89d5 ("powerpc/mm: add exec protection on powerpc 603"), we started adding exec protection to some 6xx, namely the 603, for pages mapped via pagetables. Then, in commit 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX"), the exec protection was extended to BAT mapped memory, so that really only the kernel text could be executed. The problem here is that kexec is based on copying some code into upper part of memory then executing it from there in order to install a fresh new kernel at its definitive location. However, the code is position independant and first part of it is just there to deactivate the MMU and jump to the second part. So it is possible to run this first part inplace instead of running the copy. Once the MMU is off, there is no protection anymore and the second part of the code will just run as before. Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Fixes: 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-