- 03 Jun, 2018 40 commits
-
-
Nicholas Piggin authored
The stores to update the SLB shadow area must be made as they appear in the C code, so that the hypervisor does not see an entry with mismatched vsid and esid. Use WRITE_ONCE for this. GCC has been observed to elide the first store to esid in the update, which means that if the hypervisor interrupts the guest after storing to vsid, it could see an entry with old esid and new vsid, which may possibly result in memory corruption. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
When a single-threaded process has a non-local mm_cpumask, try to use that point to flush the TLBs out of other CPUs in the cpumask. An IPI is used for clearing remote CPUs for a few reasons: - An IPI can end lazy TLB use of the mm, which is required to prevent TLB entries being created on the remote CPU. The alternative is to drop lazy TLB switching completely, which costs 7.5% in a context switch ping-pong test betwee a process and kernel idle thread. - An IPI can have remote CPUs flush the entire PID, but the local CPU can flush a specific VA. tlbie would require over-flushing of the local CPU (where the process is running). - A single threaded process that is migrated to a different CPU is likely to have a relatively small mm_cpumask, so IPI is reasonable. No other thread can concurrently switch to this mm, because it must have been given a reference to mm_users by the current thread before it can use_mm. mm_users can be asynchronously incremented (by mm_activate or mmget_not_zero), but those users must use remote mm access and can't use_mm or access user address space. Existing code makes the this assumption already, for example sparc64 has reset mm_cpumask using this condition since the start of history, see arch/sparc/kernel/smp_64.c. This reduces tlbies for a kernel compile workload from 0.90M to 0.12M, tlbiels are increased significantly due to the PID flushing for the cleaning up remote CPUs, and increased local flushes (PID flushes take 128 tlbiels vs 1 tlbie). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Implementing pte_update with pte_xchg (which uses cmpxchg) is inefficient. A single larx/stcx. works fine, no need for the less efficient cmpxchg sequence. Then remove the memory barriers from the operation. There is a requirement for TLB flushing to load mm_cpumask after the store that reduces pte permissions, which is moved into the TLB flush code. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The ISA suggests ptesync after setting a pte, to prevent a table walk initiated by a subsequent access from missing that store and causing a spurious fault. This is an architectual allowance that allows an implementation's page table walker to be incoherent with the store queue. However there is no correctness problem in taking a spurious fault in userspace -- the kernel copes with these at any time, so the updated pte will be found eventually. Spurious kernel faults on vmap memory must be avoided, so a ptesync is put into flush_cache_vmap. On POWER9 so far I have not found a measurable window where this can result in more minor faults, so as an optimisation, remove the costly ptesync from pte updates. If an implementation benefits from ptesync, it would be better to add it back in update_mmu_cache, so it's not done for things like fork(2). fork --fork --exec benchmark improved 5.2% (12400->13100). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Prefetch the faulting address in update_mmu_cache to give the page table walker perhaps 100 cycles head start as locks are dropped and the interrupt completed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This matches other architectures, when we know there will be no further accesses to the address (e.g., for teardown), page table entries can be cleared non-atomically. The comments about NMMU are bogus: all MMU notifiers (including NMMU) are released at this point, with their TLBs flushed. An NMMU access at this point would be a bug. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
In the case of a spurious fault (which can happen due to a race with another thread that changes the page table), the default Linux mm code calls flush_tlb_page for that address. This is not required because the pte will be re-fetched. Hash does not wire this up to a hardware TLB flush for this reason. This patch avoids the flush for radix. >From Power ISA v3.0B, p.1090: Setting a Reference or Change Bit or Upgrading Access Authority (PTE Subject to Atomic Hardware Updates) If the only change being made to a valid PTE that is subject to atomic hardware updates is to set the Refer- ence or Change bit to 1 or to add access authorities, a simpler sequence suffices because the translation hardware will refetch the PTE if an access is attempted for which the only problems were reference and/or change bits needing to be set or insufficient access authority. The nest MMU on POWER9 does not re-fetch the PTE after such an access attempt before faulting, so address spaces with a coprocessor attached will continue to flush in these cases. This reduces tlbies for a kernel compile workload from 0.95M to 0.90M. fork --fork --exec benchmark improved 0.5% (12300->12400). Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Radix flushes the TLB when updating ptes to increase permissiveness of protection (increase access authority). Book3S does not require TLB flushing in this case, and it is not done on hash. This patch avoids the flush for radix. >From Power ISA v3.0B, p.1090: Setting a Reference or Change Bit or Upgrading Access Authority (PTE Subject to Atomic Hardware Updates) If the only change being made to a valid PTE that is subject to atomic hardware updates is to set the Reference or Change bit to 1 or to add access authorities, a simpler sequence suffices because the translation hardware will refetch the PTE if an access is attempted for which the only problems were reference and/or change bits needing to be set or insufficient access authority. The nest MMU on POWER9 does not re-fetch the PTE after such an access attempt before faulting, so address spaces with a coprocessor attached will continue to flush in these cases. This reduces tlbies for a kernel compile workload from 1.28M to 0.95M, tlbiels from 20.17M 19.68M. fork --fork --exec benchmark improved 2.77% (12000->12300). Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
When relaxing access (read -> read_write update), pte needs to be marked invalid to handle a nest MMU bug. We also need to do a tlb flush after the pte is marked invalid before updating the pte with new access bits. We also move tlb flush to platform specific __ptep_set_access_flags. This will help us to gerid of unnecessary tlb flush on BOOK3S 64 later. We don't do that in this patch. This also helps in avoiding multiple tlbies with coprocessor attached. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
In later patch, we use the vma and psize to do tlb flush. Do the prototype update in separate patch to make the review easy. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
In later patch we will update them which require them to be moved to pgtable-radix.c. Keeping the function in radix.h results in compile warning as below. ./arch/powerpc/include/asm/book3s/64/radix.h: In function ‘radix__ptep_set_access_flags’: ./arch/powerpc/include/asm/book3s/64/radix.h:196:28: error: dereferencing pointer to incomplete type ‘struct vm_area_struct’ struct mm_struct *mm = vma->vm_mm; ^~ ./arch/powerpc/include/asm/book3s/64/radix.h:204:6: error: implicit declaration of function ‘atomic_read’; did you mean ‘__atomic_load’? [-Werror=implicit-function-declaration] atomic_read(&mm->context.copros) > 0) { ^~~~~~~~~~~ __atomic_load ./arch/powerpc/include/asm/book3s/64/radix.h:204:21: error: dereferencing pointer to incomplete type ‘struct mm_struct’ atomic_read(&mm->context.copros) > 0) { Instead of fixing header dependencies, we move the function to pgtable-radix.c Also the function is now large to be a static inline . Doing the move in separate patch helps in review. No functional change in this patch. Only code movement. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
In a later patch, we want to update __ptep_set_access_flags take page size arg. This makes ptep_set_access_flags only work with mmu_virtual_psize. To simplify the code make huge_ptep_set_access_flags directly call __ptep_set_access_flags so that we can compute the hugetlb page size in hugetlb function. Now that ptep_set_access_flags won't be called for hugetlb remove the is_vm_hugetlb_page() check and add the assert of pte lock unconditionally. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
In order for a userspace AFU driver to call the POWER9 specific OCXL_IOCTL_ENABLE_P9_WAIT, it needs to verify that it can actually make that call. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
In order to successfully issue as_notify, an AFU needs to know the TID to notify, which in turn means that this information should be available in userspace so it can be communicated to the AFU. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
The function removes the process element from NPU cache. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
The current implementation of TID allocation, using a global IDR, may result in an errant process starving the system of available TIDs. Instead, use task_pid_nr(), as mentioned by the original author. The scenario described which prevented it's use is not applicable, as set_thread_tidr can only be called after the task struct has been populated. In the unlikely event that 2 threads share the TID and are waiting, all potential outcomes have been determined safe. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
Switch the use of TIDR on it's CPU feature, rather than assuming it is available based on architecture. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alastair D'Silva authored
This patch adds a CPU feature bit to show whether the CPU has the TIDR register available, enabling as_notify/wait in userspace. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Using irq_work for processing OPAL event interrupts is not necessary. irq_work is typically used to schedule work from NMI context, a softirq may be more appropriate. However OPAL events are not particularly performance or latency critical, so they can all be invoked by kopald. This patch removes the irq_work queueing, and instead wakes up kopald when there is an event to be processed. kopald processes interrupts individually, enabling irqs and calling cond_resched between each one to minimise latencies. Event handlers themselves should still use threaded handlers, workqueues, etc. as necessary to avoid high interrupts-off latencies within any single interrupt. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Although it is often possible to recover a CPU that was interrupted from OPAL with a system reset NMI, it's undesirable to interrupt them for a few reasons. Firstly because dump/debug code itself needs to call firmware, so it could hang on a lock or possibly corrupt a per-cpu data structure if it or another CPU was interrupted from OPAL. Secondly, the kexec crash dump code will not return from interrupt to unwind the OPAL call. Call OPAL_QUIESCE with QUIESCE_HOLD before sending an NMI IPI to another CPU, which wait for it to leave firmware (or time out) to avoid this problem in normal conditions. Firmware bugs may still result in a timeout and interrupting OPAL, but that is the best option (stops the CPU, and possibly allows firmware to be debugged). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
When the soft enabled flag was changed to a soft disable mask, xmon and register dump code was not updated to reflect that, which is confusing ('SOFTE: 1' previously meant interrupts were soft enabled, currently it means the opposite, the general interrupt type has been disabled). Fix this by using the name irqmask, and printing it in hex. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
When soft enabled was changed to irq disabled mask, this test missed being converted (although the equivalent book3s test was converted). The PMU drivers consider it an NMI when they take a PMI while general interrupts are disabled. This change restores that behaviour. Fixes: 01417c6c ("powerpc/64: Change soft_enabled from flag to bitmask") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
These are not local timer interrupts but IPIs. It's good to be able to see how timer offloading is behaving, so split these out into their own category. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Large decrementers (e.g., POWER9) can take a very long time to wrap, so when the timer iterrupt handler sets the decrementer to max so as to avoid taking another decrementer interrupt when hard enabling interrupts before running timers, it effectively disables the soft NMI coverage for timer interrupts. Fix this by using the traditional 31-bit value instead, which wraps after a few seconds. masked interrupt code does the same thing, and in normal operation neither of these paths would ever wrap even the 31 bit value. Note: the SMP watchdog should catch timer interrupt lockups, but it is preferable for the local soft-NMI to catch them, mainly to avoid the IPI. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The broadcast tick recipient can call tick_receive_broadcast rather than re-running the full timer interrupt. It does not have to check for the next event time, because the sender already determined the timer has expired. It does not have to test irq_work_pending, because that's a direct decrementer interrupt and does not go through the clock events subsystem. And it does not have to read PURR because that was removed with the previous patch. This results in no code size change, but both the decrementer and broadcast path lengths are reduced. Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
For SPLPAR, lparcfg provides a sum of PURR registers for all CPUs. Currently this is done by reading PURR in context switch and timer interrupt, and storing that into a per-CPU variable. These are summed to provide the value. This does not work with all timer schemes (e.g., NO_HZ_FULL), and it is sub-optimal for performance because it reads the PURR register on every context switch, although that's been difficult to distinguish from noise in the contxt_switch microbenchmark. This patch implements the sum by calling a function on each CPU, to read and add PURR values of each CPU. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
These fields are only written to. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Book3S minimum supported ISA version now requires mtmsrd L=1. This instruction does not require bits other than RI and EE to be supplied, so __hard_irq_enable() and __hard_irq_disable() does not have to read the kernel_msr from paca. Interrupt entry code already relies on L=1 support. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This check does not catch IRQ soft mask bugs, but this option is slightly more suitable than TRACE_IRQFLAGS. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
irq_work_raise should not cause a decrementer exception unless it is called from NMI context. Doing so often just results in an immediate masked decrementer interrupt: <...>-550 90d... 4us : update_curr_rt <-dequeue_task_rt <...>-550 90d... 5us : dbs_update_util_handler <-update_curr_rt <...>-550 90d... 6us : arch_irq_work_raise <-irq_work_queue <...>-550 90d... 7us : soft_nmi_interrupt <-soft_nmi_common <...>-550 90d... 7us : printk_nmi_enter <-soft_nmi_interrupt <...>-550 90d.Z. 8us : rcu_nmi_enter <-soft_nmi_interrupt <...>-550 90d.Z. 9us : rcu_nmi_exit <-soft_nmi_interrupt <...>-550 90d... 9us : printk_nmi_exit <-soft_nmi_interrupt <...>-550 90d... 10us : cpuacct_charge <-update_curr_rt The soft_nmi_interrupt here is the call into the watchdog, due to the decrementer interrupt firing with irqs soft-disabled. This is harmless, but sub-optimal. When it's not called from NMI context or with interrupts enabled, mark the decrementer pending in the irq_happened mask directly, rather than having the masked decrementer interupt handler do it. This will be replayed at the next local_irq_enable. See the comment for details. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
When IODA2 creates a PE, it creates an IOMMU table with it_ops::free set to pnv_ioda2_table_free() which calls pnv_pci_ioda2_table_free_pages(). Since iommu_tce_table_put() calls it_ops::free when the last reference to the table is released, explicit call to pnv_pci_ioda2_table_free_pages() is not needed so let's remove it. This should fix double free in the case of PCI hotuplug as pnv_pci_ioda2_table_free_pages() does not reset neither iommu_table::it_base nor ::it_size. This was not exposed by SRIOV as it uses different code path via pnv_pcibios_sriov_disable(). IODA1 does not inialize it_ops::free so it does not have this issue. Fixes: c5f7700b ("powerpc/powernv: Dynamically release PE") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Yisheng Xie authored
match_string() returns the index of an array for a matching string, which can be used instead of open coded variant. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
GCC 8.1 emits warnings such as the following. As arch/powerpc code is built with -Werror, this breaks the build with GCC 8.1. In file included from arch/powerpc/kernel/pci_64.c:23: ./include/linux/syscalls.h:233:18: error: 'sys_pciconfig_iobase' alias between functions of incompatible types 'long int(long int, long unsigned int, long unsigned int)' and 'long int(long int, long int, long int)' [-Werror=attribute-alias] asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \ ^~~ ./include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx' __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) This patch inhibits those warnings. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Trim change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
GCC 8.1 warns about possible string truncation: arch/powerpc/kernel/nvram_64.c:1042:2: error: 'strncpy' specified bound 12 equals destination size [-Werror=stringop-truncation] strncpy(new_part->header.name, name, 12); arch/powerpc/platforms/ps3/repository.c:106:2: error: 'strncpy' output truncated before terminating nul copying 8 bytes from a string of the same length [-Werror=stringop-truncation] strncpy((char *)&n, text, 8); Fix it by using memcpy(). To make that safe we need to ensure the destination is pre-zeroed. Use kzalloc() in the nvram code and initialise the u64 to zero in the ps3 code. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Use kzalloc() in the nvram code, flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
This is a branch with a mixture of mm, x86 and powerpc commits all relating to some minor cross-arch pkeys consolidation. The x86/mm changes have been reviewed by Ingo & Dave Hansen and the tree has been in linux-next for some weeks without issue.
-
Michael Ellerman authored
We ended up with an ugly conflict between fixes and next in ftrace.h involving multiple nested ifdefs, and the automatic resolution is wrong. So merge fixes into next so we can fix it up.
-
Michael Ellerman authored
Merge in some commits we're sharing with the kbuild tree.
-