- 13 May, 2024 1 commit
-
Inochi Amaoto authored
CONFIG_CLK_SOPHGO_CV1800 is required when booting the minimum system for CV1800 series boards.

Signed-off-by: Inochi Amaoto <inochiama@outlook.com>
Link: https://lore.kernel.org/r/IA1PR20MB49537E8B2D1FAAA7D5B8BDA2BB052@IA1PR20MB4953.namprd20.prod.outlook.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
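Since this is a defconfig change, the sketch below shows its likely form; the file path is an assumption, only the option name comes from the message:

```
# arch/riscv/configs/defconfig (hypothetical placement)
CONFIG_CLK_SOPHGO_CV1800=y
```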
-
- 30 Apr, 2024 8 commits
-
Palmer Dabbelt authored
Samuel Holland <samuel.holland@sifive.com> says:

This series converts uniprocessor kernel builds to use the same TLB flushing code as SMP builds, to take advantage of batching and existing range- and ASID-based TLB flush optimizations. It optimizes out IPIs and SBI calls based on the online CPU count, which also covers the scenario where SMP was enabled at build time but only one CPU is present/online. A final optimization is to use single-ASID flushes wherever possible, to avoid unnecessary TLB misses for kernel mappings.

This series has a semantic conflict with the AIA patches that are in linux-next due to the removal of the third parameter of riscv_ipi_set_virq_range(), which is called from imsic_ipi_domain_init() in drivers/irqchip/irq-riscv-imsic-early.c. The resolution is to remove the extra argument from the call site.

Here are some numbers from D1 which show the performance impact:

v6.9-rc1:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Execl Throughput                                 43.0        198.5     46.2
File Copy 1024 bufsize 2000 maxblocks          3960.0      73934.4    186.7
File Copy 256 bufsize 500 maxblocks            1655.0      20242.6    122.3
File Copy 4096 bufsize 8000 maxblocks          5800.0     197706.4    340.9
Pipe Throughput                               12440.0     176974.2    142.3
Pipe-based Context Switching                   4000.0      23626.8     59.1
Process Creation                                126.0        449.9     35.7
Shell Scripts (1 concurrent)                     42.4        544.4    128.4
Shell Scripts (16 concurrent)                     ---         35.3      ---
Shell Scripts (8 concurrent)                      6.0         71.6    119.3
System Call Overhead                          15000.0     248072.6    165.4
                                                                   ========
System Benchmarks Index Score (Partial Only)                          110.6

v6.9-rc1 + this patch series:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Execl Throughput                                 43.0        196.8     45.8
File Copy 1024 bufsize 2000 maxblocks          3960.0      71782.2    181.3
File Copy 256 bufsize 500 maxblocks            1655.0      21269.4    128.5
File Copy 4096 bufsize 8000 maxblocks          5800.0     199424.0    343.8
Pipe Throughput                               12440.0     196468.6    157.9
Pipe-based Context Switching                   4000.0      24261.8     60.7
Process Creation                                126.0        459.0     36.4
Shell Scripts (1 concurrent)                     42.4        543.8    128.2
Shell Scripts (16 concurrent)                     ---         35.5      ---
Shell Scripts (8 concurrent)                      6.0         71.7    119.6
System Call Overhead                          15000.0     259415.2    172.9
                                                                   ========
System Benchmarks Index Score (Partial Only)                          113.0

* b4-shazam-lts:
  riscv: mm: Always use an ASID to flush mm contexts
  riscv: mm: Preserve global TLB entries when switching contexts
  riscv: mm: Make asid_bits a local variable
  riscv: mm: Use a fixed layout for the MM context ID
  riscv: mm: Introduce cntx2asid/cntx2version helper macros
  riscv: Avoid TLB flush loops when affected by SiFive CIP-1200
  riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma
  riscv: mm: Combine the SMP and UP TLB flush code
  riscv: Only send remote fences when some other CPU is online
  riscv: mm: Broadcast kernel TLB flushes only when needed
  riscv: Use IPIs for remote cache/TLB flushes by default
  riscv: Factor out page table TLB synchronization
  riscv: Flush the instruction cache during SMP bringup

Link: https://lore.kernel.org/r/20240327045035.368512-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
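The described conflict resolution amounts to a one-line change at the call site. A hypothetical sketch (the surviving argument names are placeholders; only the removal of the third argument comes from the message above):

```diff
 /* drivers/irqchip/irq-riscv-imsic-early.c, inside imsic_ipi_domain_init() */
-	riscv_ipi_set_virq_range(virq, nr_ipis, true);
+	riscv_ipi_set_virq_range(virq, nr_ipis);
```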
-
Jisheng Zhang authored
Currently, riscv linux requires at least IMA, so all platforms have a multiplier. And I assume the 'mul' efficiency is comparable or better than a sequence of five or so register-dependent arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get slightly nicer codegen. Refer to commit f9b41929 ("[PATCH] bitops: hweight() speedup") for more details.

In a simple benchmark test calling hweight64() in a loop, it got:

  about 14% performance improvement on JH7110, tested on Milkv Mars.

  about 23% performance improvement on TH1520 and SG2042, tested on Sipeed LPI4A and the SG2042 platform.

  a slight performance drop on CV1800B, tested on Milkv Duo. Among all riscv platforms in my hands, this is the only one which sees a slight performance drop, meaning its 'mul' isn't quick enough. However, the same situation exists on x86: for example, the P4 does not have fast integer multiplies, as noted in the commit above, yet x86 still selects ARCH_HAS_FAST_MULTIPLIER. So let's select ARCH_HAS_FAST_MULTIPLIER, which benefits almost all riscv platforms.

Samuel also provided some performance numbers:
On Unmatched: 20% speedup for __sw_hweight32 and 30% speedup for __sw_hweight64.
On D1: 8% speedup for __sw_hweight32 and 8% slowdown for __sw_hweight64.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240325105823.1483-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
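For context, this is the multiply-based fast path the option enables in the generic hweight code; a sketch following lib/hweight.c's 64-bit flow, assuming the upstream constants are unchanged:

```c
/* Sketch of the CONFIG_ARCH_HAS_FAST_MULTIPLIER path in __sw_hweight64() */
unsigned long sw_hweight64_fast(unsigned long w)
{
	w -= (w >> 1) & 0x5555555555555555ul;        /* count bit pairs    */
	w  = (w & 0x3333333333333333ul) +
	     ((w >> 2) & 0x3333333333333333ul);      /* nibble counts      */
	w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0ful;  /* per-byte counts    */
	return (w * 0x0101010101010101ul) >> 56;     /* one mul sums bytes */
}
```

The final multiply sums all eight byte counts into the top byte, replacing the shift-and-add ladder that slow-multiplier CPUs would otherwise use.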
-
Palmer Dabbelt authored
Jisheng Zhang <jszhang@kernel.org> says:

This series selects ARCH_USE_CMPXCHG_LOCKREF to enable the cmpxchg-based lockless lockref implementation for riscv, then implements arch_cmpxchg64_{relaxed|acquire|release}.

After patch 1: using Linus' test case [1] on the TH1520 platform, I see an 11.2% improvement. On the JH7110 platform, I see a 12.0% improvement.

After patch 2: on both TH1520 and JH7110 platforms, I didn't see an obvious performance improvement with Linus' test case [1]. IMHO, this may be related to the fence and lr.d/sc.d hw implementations. In theory, lr/sc without fence could give a performance improvement over lr/sc plus fence, so the code is added here to leave room for performance improvement on newer HW platforms.

* b4-shazam-merge:
  riscv: cmpxchg: implement arch_cmpxchg64_{relaxed|acquire|release}
  riscv: select ARCH_USE_CMPXCHG_LOCKREF

Link: http://marc.info/?l=linux-fsdevel&m=137782380714721&w=4 [1]
Link: https://lore.kernel.org/r/20240325111038.1700-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Jisheng Zhang authored
After commit f51f7a0f ("riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent"), non-coherent platforms with less than 4GB of memory rely on users passing the "swiotlb=mmnn,force" kernel parameter to enable DMA bouncing for unaligned kmalloc() buffers. Now let's go further: if no bouncing is needed for ZONE_DMA, let the kernel automatically allocate 1MB of swiotlb buffer per 1GB of RAM for kmalloc() bouncing on non-coherent platforms, so that "swiotlb=mmnn,force" no longer needs to be passed. The "1MB swiotlb buffer per 1GB of RAM" math is taken from arm64. Users can still force a smaller swiotlb buffer by passing "swiotlb=mmnn".

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240325110036.1564-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
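A minimal sketch of the sizing logic described above, modeled on the arm64 code the message cites (1MB per 1GB is a right shift by 10); the surrounding condition and the `swiotlb` flag are assumptions about the patch's placement in mem_init():

```c
/* Hypothetical sketch: automatic swiotlb sizing for kmalloc() bouncing */
if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && !swiotlb) {
	/* 1MB of swiotlb per 1GB of RAM, capped at the default size */
	unsigned long size = memblock_phys_mem_size() >> 10;

	swiotlb_adjust_size(min(swiotlb_size_or_default(), size));
	swiotlb = true;
}
swiotlb_init(swiotlb, SWIOTLB_VERBOSE);
```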
-
Dawei Li authored
pgtable_l{4,5}_enabled are read-only after initialization, so explicitly annotate them with __ro_after_init.

Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240320064712.442579-3-dawei.li@shingroup.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
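The change is just the annotation on the two definitions; a sketch, with placeholder initializers since the real ones are not shown in the message:

```c
/* Written only during early boot, then effectively constant. */
bool pgtable_l4_enabled __ro_after_init = true;	/* placeholder initializer */
bool pgtable_l5_enabled __ro_after_init = true;	/* placeholder initializer */
```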
-
Dawei Li authored
IS_ENABLED(CONFIG_64BIT) in the initialization of pgtable_l{4,5}_enabled is redundant; remove it.

Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240320064712.442579-2-dawei.li@shingroup.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Palmer Dabbelt authored
Charlie Jenkins <charlie@rivosinc.com> says:

Improve the performance of icache flushing by creating a new prctl flag PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow for future expansions such as with the proposed J extension [1]. Documentation is also provided to explain the use case. A patch was sent to add PR_RISCV_SET_ICACHE_FLUSH_CTX to man-pages [2].

[1] https://github.com/riscv/riscv-j-extension
[2] https://lore.kernel.org/linux-man/20240124-fencei_prctl-v1-1-0bddafcef331@rivosinc.com

* b4-shazam-merge:
  cpumask: Add assign cpu
  documentation: Document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl
  riscv: Include riscv_set_icache_flush_ctx prctl
  riscv: Remove unnecessary irqflags processor.h include

Link: https://lore.kernel.org/r/20240312-fencei-v13-0-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Palmer Dabbelt authored
Alexandre Ghiti <alexghiti@rivosinc.com> says:

Patch 1 removes a useless memory barrier, and patch 2 actually fixes the issue with IPIs in the patching code.

* b4-shazam-merge:
  riscv: Fix text patching when IPI are used
  riscv: Remove superfluous smp_mb()

Link: https://lore.kernel.org/r/20240229121056.203419-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
- 29 Apr, 2024 13 commits
-
Samuel Holland authored
Even if multiple ASIDs are not supported, using the single-ASID variant of the sfence.vma instruction preserves TLB entries for global (kernel) pages. So it is always more efficient to use the single-ASID code path.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-14-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
If the CPU does not support multiple ASIDs, all MM contexts use ASID 0. In this case, it is still beneficial to flush the TLB by ASID, as the single-ASID variant of the sfence.vma instruction preserves TLB entries for global (kernel) pages. This optimization is recommended by the RISC-V privileged specification:

  If the implementation does not provide ASIDs, or software chooses to
  always use ASID 0, then after every satp write, software should
  execute SFENCE.VMA with rs1=x0. In the common case that no global
  translations have been modified, rs2 should be set to a register
  other than x0 but which contains the value zero, so that global
  translations are not flushed.

It is not possible to apply this optimization when using the ASID allocator, because that code must flush the TLB for all ASIDs at once when incrementing the version number.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-13-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
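A sketch of the flush helper the quoted recommendation implies: rs1=x0 flushes all addresses for the given ASID, while passing the (zero-valued) ASID in a register other than x0 spares global entries. In the tree this is wrapped in an errata ALTERNATIVE (see the CIP-1200 patches above), omitted here:

```c
static inline void local_flush_tlb_all_asid(unsigned long asid)
{
	/*
	 * rs2 is a real register holding the ASID; with ASID 0 this still
	 * preserves global (kernel) TLB entries, per the quoted spec text.
	 */
	__asm__ __volatile__ ("sfence.vma x0, %0" : : "r" (asid) : "memory");
}
```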
-
Samuel Holland authored
This variable is only used inside asids_init().

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-12-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
Currently, the size of the ASID field in the MM context ID dynamically depends on the number of hardware-supported ASID bits. This requires reading a global variable to extract either field from the context ID. Instead, allocate the maximum possible number of bits to the ASID field, so the layout of the context ID is known at compile-time.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-11-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
When using the ASID allocator, the MM context ID contains two values: the ASID in the lower bits, and the allocator version number in the remaining bits. Use macros to make this separation more obvious.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-10-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
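A sketch of the helpers named in the title, assuming the ASID occupies the low bits under SATP_ASID_MASK as described:

```c
/* Lower bits: ASID; remaining upper bits: allocator version number. */
#define cntx2asid(cntx)		((cntx) & SATP_ASID_MASK)
#define cntx2version(cntx)	((cntx) & ~SATP_ASID_MASK)
```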
-
Samuel Holland authored
Implementations affected by SiFive errata CIP-1200 have a bug which forces the kernel to always use the global variant of the sfence.vma instruction. When affected by this errata, do not attempt to flush a range of addresses, as each iteration of the loop would actually flush the whole TLB; instead, minimize the overall number of sfence.vma instructions.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240327045035.368512-9-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
commit 3f1e7829 ("riscv: add ASID-based tlbflushing methods") added calls to the sfence.vma instruction with rs2 != x0. These single-ASID instruction variants are also affected by SiFive errata CIP-1200.

Until now, the errata workaround was not needed for the single-ASID sfence.vma variants, because they were only used when the ASID allocator was enabled, and the affected SiFive platforms do not support multiple ASIDs. However, we are going to start using those sfence.vma variants regardless of ASID support, so now we need alternatives covering them.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-8-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
In SMP configurations, all TLB flushing narrower than flush_tlb_all() goes through __flush_tlb_range(). Do the same in UP configurations. This allows UP configurations to take advantage of recent improvements to the code in tlbflush.c, such as support for huge pages and flushing multiple-page ranges.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240327045035.368512-7-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
If no other CPU is online, a local cache or TLB flush is sufficient. These checks can be constant-folded when SMP is disabled.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240327045035.368512-6-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
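The shape of the check, as a minimal sketch (the helper names stand in for the patch's actual call sites in cacheflush.c/tlbflush.c):

```c
/*
 * Sketch: a remote fence is only worth issuing when another CPU could
 * observe stale state. num_online_cpus() constant-folds to 1 on !SMP,
 * so the remote branch compiles away entirely in UP builds.
 */
static void flush_icache_all_sketch(void)
{
	local_flush_icache_all();

	if (num_online_cpus() > 1)
		on_each_cpu(ipi_remote_fence_i, NULL, 1);
}
```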
-
Samuel Holland authored
__flush_tlb_range() avoids broadcasting TLB flushes when an mm context is only active on the local CPU. Apply this same optimization to TLB flushes of kernel memory when only one CPU is online. This check can be constant-folded when SMP is disabled.

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-5-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
An IPI backend is always required in an SMP configuration, but an SBI implementation is not. For example, SBI will be unavailable when the kernel runs in M mode. For this reason, consider IPI delivery of cache and TLB flushes to be the base case, and any other implementation (such as the SBI remote fence extension) to be an optimization.

Generally, if IPIs can be delivered without firmware assistance, they are assumed to be faster than SBI calls due to the SBI context switch overhead. However, when SBI is used as the IPI backend, then the context switch cost must be paid anyway, and performing the cache/TLB flush directly in the SBI implementation is more efficient than injecting an interrupt to S-mode. This is the only existing scenario where riscv_ipi_set_virq_range() is called with use_for_rfence set to false.

sbi_ipi_init() already checks riscv_ipi_have_virq_range(), so it only calls riscv_ipi_set_virq_range() when no other IPI device is available. This allows moving the static key and dropping the use_for_rfence parameter, which decouples the static key from the irqchip driver probe order.

Furthermore, the static branch only makes sense when CONFIG_RISCV_SBI is enabled; otherwise, IPIs must be used. Add a fallback definition of riscv_use_sbi_for_rfence() which handles this case and removes the need to check CONFIG_RISCV_SBI elsewhere, such as in cacheflush.c.

Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240327045035.368512-4-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
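A sketch of the fallback described in the last paragraph; the static key name is an assumption, the helper name comes from the message:

```c
#ifdef CONFIG_RISCV_SBI
DECLARE_STATIC_KEY_FALSE(riscv_sbi_for_rfence);
#define riscv_use_sbi_for_rfence() \
	static_branch_unlikely(&riscv_sbi_for_rfence)
#else
/*
 * No SBI: remote fences always go over IPIs, and callers need no
 * CONFIG_RISCV_SBI checks of their own.
 */
static inline bool riscv_use_sbi_for_rfence(void) { return false; }
#endif
```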
-
Samuel Holland authored
The logic is the same for all page table levels. See commit 69be3fb1 ("riscv: enable MMU_GATHER_RCU_TABLE_FREE for SMP && MMU").

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240327045035.368512-3-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
Instruction cache flush IPIs are sent only to CPUs in cpu_online_mask, so they will not target a CPU until it calls set_cpu_online() earlier in smp_callin(). As a result, if instruction memory is modified between the CPU coming out of reset and that point, then its instruction cache may contain stale data. Therefore, the instruction cache must be flushed after the set_cpu_online() synchronization point.

Fixes: 08f051ed ("RISC-V: Flush I$ when making a dirty page executable")
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
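The ordering fix, sketched against smp_callin() (surrounding code elided; the variable name is an assumption):

```c
asmlinkage __visible void smp_callin(void)
{
	/* ... */
	set_cpu_online(curr_cpuid, true);

	/*
	 * Remote I$ flush IPIs only target CPUs in cpu_online_mask, so any
	 * text modified before this point was never flushed on this CPU.
	 * Flush locally now that this CPU is visible to those IPIs.
	 */
	local_flush_icache_all();
	/* ... */
}
```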
-
- 28 Apr, 2024 6 commits
-
Clément Léger authored
Export the Zihintpause ISA extension through hwprobe, which allows using "pause" instructions. Some userspace applications (OpenJDK, for instance) use this to handle locking back-off.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240221083108.1235311-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
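From userspace, the new bit can be probed roughly like this; the constant names follow the hwprobe convention, so verify them against your uapi headers:

```c
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns nonzero if the kernel reports Zihintpause on all CPUs. */
int have_zihintpause(void)
{
	struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0))
		return 0;
	return !!(pair.value & RISCV_HWPROBE_EXT_ZIHINTPAUSE);
}
```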
-
Clément Léger authored
While reworking the code to fix sparse errors, it became apparent that the RISCV_M_MODE-specific code could actually be removed and the normal-mode code used instead. Even though RISCV_M_MODE can do direct user memory access, using the user uaccess helpers is also going to work. Since there is no need anymore for specific accessors (load_u8()/store_u8()), we can directly use memcpy()/copy_{to/from}_user() and get rid of the copy loop entirely. __read_insn() is also fixed to use an unsigned long instead of a pointer which was cast in the __user address space; the insn_addr parameter is now cast from unsigned long to the correct address space directly.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240206154104.896809-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
While the processor is executing kernel code, the value of the scratch CSR is always zero, so there is no need to save the value. Continue to write the CSR during the resume flow, so we do not rely on firmware to initialize it.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240312195641.1830521-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Palmer Dabbelt authored
Samuel Holland <samuel.holland@sifive.com> says:

This series aims to improve support for NOMMU, specifically by making it easier to test NOMMU kernels in QEMU and on various widely-available hardware (errata permitting). After all, everything supports Svbare... After applying this series, a NOMMU kernel based on defconfig (changing only the three options below*) boots to userspace on QEMU when passed as -kernel.

  # CONFIG_RISCV_M_MODE is not set
  # CONFIG_MMU is not set
  CONFIG_NONPORTABLE=y

*if you are using LLD, you must also disable BPF_SYSCALL and KALLSYMS, because LLD bails on out-of-range references to undefined weak symbols.

* b4-shazam-merge:
  riscv: Allow NOMMU kernels to run in S-mode
  riscv: Remove MMU dependency from Zbb and Zicboz
  riscv: Fix loading 64-bit NOMMU kernels past the start of RAM
  riscv: Fix TASK_SIZE on 64-bit NOMMU

Link: https://lore.kernel.org/r/20240227003630.3634533-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Miguel Ojeda authored
The Rust modules work on 64-bit RISC-V, with no twiddling required. Select HAVE_RUST and provide the required flags to kbuild so that the modules can be used. The Makefile and Kconfig changes are lifted from work done by Miguel in the Rust-for-Linux tree, hence his authorship. Following the rabbit hole, the Makefile changes originated in a script, created based on config files originally added by Gary, hence his co-authorship.

32-bit is broken in core Rust code, so support is limited to 64-bit:

  ld.lld: error: undefined symbol: __udivdi3

As 64-bit RISC-V is now supported, add it to the arch support table.

Co-developed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Co-developed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240409-silencer-book-ce1320f06aab@spud
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Palmer Dabbelt authored
Leonardo Bras <leobras@redhat.com> says:

While studying riscv's cmpxchg.h file, I got really interested in understanding how RISCV asm implemented the different versions of {cmp,}xchg. When I understood the pattern, it made sense for me to remove the duplications and create macros to make it easier to understand what exactly changes between the versions: instruction suffixes & barriers. I also did the same kind of work on atomic.c.

After that, I noted both cmpxchg and xchg only accept variables of size 4 and 8, compared to x86 and arm64 which do 1, 2, 4, 8. Now that deduplication is done, it is quite direct to implement them for variable sizes 1 and 2, so I did it. Then Guo Ren already presented me some possible users :)

I compared the generated asm on a test.c that contained usage of every changed function, and could not detect any change in patches 1 + 2 + 3 compared with upstream. Patches 4 & 5 were compile-tested, merged with guoren/qspinlock_v11 and booted just fine with qemu -machine virt -append "qspinlock". (tree: https://gitlab.com/LeoBras/linux/-/commits/guo_qspinlock_v11)

Latest tests happened based on this tree: https://github.com/guoren83/linux/tree/qspinlock_v12

* b4-shazam-lts:
  riscv/cmpxchg: Implement xchg for variables of size 1 and 2
  riscv/cmpxchg: Implement cmpxchg for variables of size 1 and 2
  riscv/atomic.h : Deduplicate arch_atomic.*
  riscv/cmpxchg: Deduplicate cmpxchg() asm and macros
  riscv/cmpxchg: Deduplicate xchg() asm functions

Link: https://lore.kernel.org/r/20240103163203.72768-2-leobras@redhat.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
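A hedged sketch of the size-1 technique the last two patches use: operate on the aligned 32-bit word containing the byte, and mask the new value in inside the lr.w/sc.w loop. The function name and exact asm are illustrative, not the merged macros:

```c
static inline u8 xchg_u8_sketch(volatile u8 *ptr, u8 new)
{
	volatile u32 *p = (volatile u32 *)((unsigned long)ptr & ~0x3UL);
	unsigned int shift = ((unsigned long)ptr & 0x3) * 8; /* little-endian */
	u32 mask = 0xffu << shift;
	u32 old, tmp;

	__asm__ __volatile__ (
	"0:	lr.w	%0, %2\n"	/* load-reserve the whole word   */
	"	and	%1, %0, %z4\n"	/* clear the target byte         */
	"	or	%1, %1, %z3\n"	/* merge in the shifted new byte */
	"	sc.w	%1, %1, %2\n"	/* store-conditional, status->%1 */
	"	bnez	%1, 0b\n"	/* retry if the reservation broke */
	: "=&r" (old), "=&r" (tmp), "+A" (*p)
	: "rJ" ((u32)new << shift), "rJ" (~mask)
	: "memory");

	return (old & mask) >> shift;
}
```

The size-2 variant differs only in the mask width, and cmpxchg adds a compare of the masked old value before the store.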
-
- 24 Apr, 2024 2 commits
-
Jisheng Zhang authored
After selecting ARCH_USE_CMPXCHG_LOCKREF, one straightforward further optimization is implementing arch_cmpxchg64_relaxed(), because the lockref code does not need the cmpxchg to have barrier semantics. At the same time, implement arch_cmpxchg64_acquire() and arch_cmpxchg64_release() as well.

However, on both TH1520 and JH7110 platforms, I didn't see an obvious performance improvement with Linus' test case [1]. IMHO, this may be related to the fence and lr.d/sc.d hw implementations. In theory, lr/sc without fence could give a performance improvement over lr/sc plus fence, so add the code here to leave room for performance improvement on newer HW platforms.

Link: http://marc.info/?l=linux-fsdevel&m=137782380714721&w=4 [1]
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240325111038.1700-3-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
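A hedged sketch of the relaxed variant's lr.d/sc.d loop, with no fences or .aq/.rl annotations (the acquire/release variants add them); the macro plumbing from the deduplication series is elided:

```c
static inline u64 cmpxchg64_relaxed_sketch(volatile u64 *ptr, u64 old, u64 new)
{
	u64 ret;
	unsigned long rc;

	__asm__ __volatile__ (
	"0:	lr.d	%0, %2\n"	/* load-reserve current value   */
	"	bne	%0, %z3, 1f\n"	/* bail if it isn't 'old'       */
	"	sc.d	%1, %z4, %2\n"	/* try to store 'new'           */
	"	bnez	%1, 0b\n"	/* retry on lost reservation    */
	"1:\n"
	: "=&r" (ret), "=&r" (rc), "+A" (*ptr)
	: "rJ" (old), "rJ" (new)
	: "memory");

	return ret;
}
```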
-
Jisheng Zhang authored
Select ARCH_USE_CMPXCHG_LOCKREF to enable the cmpxchg-based lockless lockref implementation for riscv. Using Linus' test case [1] on the TH1520 platform, I see an 11.2% improvement; on the JH7110 platform, I see a 12.0% improvement.

Link: http://marc.info/?l=linux-fsdevel&m=137782380714721&w=4 [1]
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240325111038.1700-2-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
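The Kconfig side is a one-line select; a sketch, where the 64BIT guard is an assumption (lockref wants a 64-bit-wide cmpxchg):

```
config RISCV
	...
	select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
```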
-
- 18 Apr, 2024 3 commits
-
Charlie Jenkins authored
Standardize an assign_cpu function for cpumasks.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240312-fencei-v13-4-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
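A sketch of the standardized helper; the implementation details are assumed from the description, not quoted from the patch:

```c
/* Set or clear @cpu in @dstp based on a boolean value. */
static __always_inline void __assign_cpu(unsigned int cpu,
					 struct cpumask *dstp, bool bool_v)
{
	if (bool_v)
		cpumask_set_cpu(cpu, dstp);
	else
		cpumask_clear_cpu(cpu, dstp);
}

#define assign_cpu(cpu, mask, val)	__assign_cpu((cpu), (mask), !!(val))
```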
-
Charlie Jenkins authored
Provide documentation that explains how to properly do CMODX in riscv.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240312-fencei-v13-3-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Charlie Jenkins authored
Support a new prctl with key PR_RISCV_SET_ICACHE_FLUSH_CTX to enable optimization of cross-modifying code. This prctl enables userspace code to use icache-flushing instructions such as fence.i with the guarantee that the icache will continue to be clean after thread migration.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240312-fencei-v13-2-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
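Usage from userspace looks roughly like this; the constant names come from the prctl added by this series, but double-check them against your headers:

```c
#include <sys/prctl.h>

/* Opt in: the kernel guarantees a clean icache after migration
 * for every thread in this process. */
prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_ON,
      PR_RISCV_SCOPE_PER_PROCESS);

/* ... write instructions, fence.i on the writing thread, execute ... */

prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_OFF,
      PR_RISCV_SCOPE_PER_PROCESS);
```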
-
- 17 Apr, 2024 3 commits
-
Charlie Jenkins authored
This include is not used and can lead to circular dependencies.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240312-fencei-v13-1-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Alexandre Ghiti authored
For now, we use stop_machine() to patch the text, and when we use IPIs for remote icache flushes (which are emitted in patch_text_nosync()), the system hangs. So instead, make sure every CPU executes the stop_machine() patching function and emits a local icache flush there.

Co-developed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240229121056.203419-3-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
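A hedged sketch of the stop_machine() callback after this fix; the struct layout and field names are assumptions, the point is that the waiting CPUs flush their own icache instead of being sent flush IPIs:

```c
static int patch_text_cb(void *data)
{
	struct patch_insn *patch = data;	/* hypothetical layout */
	int ret = 0;

	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
		/* Last CPU in: apply the patch, then release the others. */
		ret = patch_text_nosync(patch->addr, &patch->insn,
					sizeof(patch->insn));
		atomic_inc(&patch->cpu_count);
	} else {
		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
			cpu_relax();
		smp_mb();
		/* Each CPU flushes locally; no cross-CPU flush IPIs needed. */
		local_flush_icache_all();
	}

	return ret;
}
```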
-
Alexandre Ghiti authored
This memory barrier is not needed and not documented, so simply remove it.

Suggested-by: Andrea Parri <andrea@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240229121056.203419-2-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
- 09 Apr, 2024 4 commits
-
Samuel Holland authored
For ease of testing, it is convenient to run NOMMU kernels in supervisor mode. The only required change is to offset the kernel load address, since the beginning of RAM is usually reserved for M-mode firmware.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240227003630.3634533-5-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
The Zbb and Zicboz ISA extensions have no dependency on an MMU and are equally useful on NOMMU kernels.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240227003630.3634533-4-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
commit 3335068f ("riscv: Use PUD/P4D/PGD pages for the linear mapping") added logic to allow using RAM below the kernel load address. However, this does not work for NOMMU, where PAGE_OFFSET is fixed to the kernel load address. Since that range of memory corresponds to PFNs below ARCH_PFN_OFFSET, mm initialization runs off the beginning of mem_map and corrupts adjacent kernel memory. Fix this by restoring the previous behavior for NOMMU kernels.

Fixes: 3335068f ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240227003630.3634533-3-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Samuel Holland authored
On NOMMU, userspace memory can come from anywhere in physical RAM. The current definition of TASK_SIZE is wrong if any RAM exists above 4G, causing spurious failures in the userspace access routines.

Fixes: 6bd33e1e ("riscv: add nommu support")
Fixes: c3f896dc ("mm: switch the test_vmalloc module to use __vmalloc_node")
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Bo Gan <ganboing@gmail.com>
Link: https://lore.kernel.org/r/20240227003630.3634533-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-