- 01 Aug, 2023 9 commits
-
-
git://git.kernel.dk/linuxDarrick J. Wong authored
Improve iomap/xfs async dio write performance iomap always punts async dio write completions to a workqueue, which has a cost in terms of efficiency (now you need an unrelated worker to process it) and latency (now you're bouncing a completion through an async worker, which is a classic slowdown scenario). io_uring handles IRQ completions via task_work, and for writes that don't need to do extra IO at completion time, we can safely complete them inline from that. This patchset adds IOCB_DIO_CALLER_COMP, which an IO issuer can set to inform the completion side that any extra work that needs doing for that completion can be punted to a safe task context. The iomap dio completion will happen in hard/soft irq context, and we need a saner context to process these completions. IOCB_DIO_CALLER_COMP is added, which can be set in a struct kiocb->ki_flags by the issuer. If the completion side of the iocb handling understands this flag, it can choose to set a kiocb->dio_complete() handler and just call ki_complete from IRQ context. The issuer must then ensure that this callback is processed from a task. io_uring punts IRQ completions to task_work already, so it's trivial wire it up to run more of the completion before posting a CQE. This is good for up to a 37% improvement in throughput/latency for low queue depth IO, patch 5 has the details. If we need to do real work at completion time, iomap will clear the IOMAP_DIO_CALLER_COMP flag. This work came about when Andres tested low queue depth dio writes for postgres and compared it to doing sync dio writes, showing that the async processing slows us down a lot. * tag 'xfs-async-dio.6-2023-08-01' of git://git.kernel.dk/linux: iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Jens Axboe authored
If IOCB_DIO_CALLER_COMP is set, utilize that to set kiocb->dio_complete handler and data for that callback. Rather than punt the completion to a workqueue, we pass back the handler and data to the issuer and will get a callback from a safe task context. Using the following fio job to randomly dio write 4k blocks at queue depths of 1..16: fio --name=dio-write --filename=/data1/file --time_based=1 \ --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \ --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth shows the following results before and after this patch: Stock Patched Diff ======================================= QD1 155K 162K + 4.5% QD2 290K 313K + 7.9% QD4 533K 597K +12.0% QD8 604K 827K +36.9% QD16 615K 845K +37.4% which shows nice wins all around. If we factored in per-IOP efficiency, the wins look even nicer. This becomes apparent as queue depth rises, as the offloaded workqueue completions runs out of steam. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If the filesystem dio handler understands IOCB_DIO_CALLER_COMP, we'll get a kiocb->ki_complete() callback with kiocb->dio_complete set. In that case, rather than complete the IO directly through task_work, queue up an intermediate task_work handler that first processes this callback and then immediately completes the request. For XFS, this avoids a punt through a workqueue, which is a lot less efficient and adds latency to lower queue depth (or sync) O_DIRECT writes. Only do this for non-polled IO, as polled IO doesn't need this kind of deferral as it always completes within the task itself. This then avoids a check for deferral in the polled IO completion handler. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Async dio completions generally happen from hard/soft IRQ context, which means that users like iomap may need to defer some of the completion handling to a workqueue. This is less efficient than having the original issuer handle it, like we do for sync IO, and it adds latency to the completions. Add IOCB_DIO_CALLER_COMP, which the issuer can set if it is able to safely punt these completions to a safe context. If the dio handler is aware of this flag, assign a callback handler in kiocb->dio_complete and associated data io kiocb->private. The issuer will then call this handler with that data from task context. No functional changes in this patch. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Rather than gate whether or not we need to punt a dio completion to a workqueue on whether the IO is a write or not, add an explicit flag for it. For now we treat them the same, reads always set the flags and async writes do not. No functional changes in this patch. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
iocb->private is only used for polled IO, where the completer will find the bio to poll through that field. Assign it when we're submitting a polled bio, and get rid of the dio->poll_bio indirection. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Whether we have a write back cache and are using FUA or don't have a write back cache at all is the same situation. Treat them the same. This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have two cases that provide stable writes: 1) Volatile write cache with FUA writes 2) Normal write without a volatile write cache Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and update some of the FUA comments as well. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
IOMAP_DIO_DIRTY shifts by 31 bits, which makes UBSAN unhappy. Clean up all the defines by making the shifted value an unsigned value. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reported-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Make the logic a bit easier to follow: 1) Add a release_bio out path, as everybody needs to touch that, and have our bio ref check jump there if it's non-zero. 2) Add a kiocb local variable. 3) Add comments for each of the three conditions (sync, inline, or async workqueue punt). No functional changes in this patch. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Jul, 2023 9 commits
-
-
Darrick J. Wong authored
Merge tag 'iomap-per-block-dirty-tracking' of https://github.com/riteshharjani/linux into iomap-6.6-merge iomap: Add per-block dirty state tracking to iomap iomap today only tracks per-block update state bitmap, this series extends the support by adding per-block dirty state bitmap tracking to iomap buffered I/O path. This helps in reducing the write amplification and improve write performance for large folio writes and for platforms with higher pagesize compared to blocksize. We have seen ~83% performance improvement with these patches using database benchmarking tests, with XFS on 64k pagesize. fio benchmark (as shown in the last patch which adds dirty tracking support) showed close to 16x performance improvement when tested with 64K pagesize on 4k blocksize XFS using nvme on Power. * tag 'iomap-per-block-dirty-tracking' of https://github.com/riteshharjani/linux: iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others [djwong: also yay to less write amplification!] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
When filesystem blocksize is less than folio size (either with mapping_large_folio_support() or with blocksize < pagesize) and when the folio is uptodate in pagecache, then even a byte write can cause an entire folio to be written to disk during writeback. This happens because we currently don't have a mechanism to track per-block dirty state within struct iomap_folio_state. We currently only track uptodate state. This patch implements support for tracking per-block dirty state in iomap_folio_state->state bitmap. This should help improve the filesystem write performance and help reduce write amplification. Performance testing of below fio workload reveals ~16x performance improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. 1. <test_randwrite.fio> [global] ioengine=psync rw=randwrite overwrite=1 pre_read=1 direct=0 bs=4k size=1G dir=./ numjobs=8 fdatasync=1 runtime=60 iodepth=64 group_reporting=1 [fio-run] 2. Also our internal performance team reported that this patch improves their database workload performance by around ~83% (with XFS on Power) Reported-by: Aravinda Herle <araherle@in.ibm.com> Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
We dont need to allocate an ifs in ->write_begin() for writes where the position and length completely overlap with the given folio. Therefore, such cases are skipped. Currently when the folio is uptodate, we only allocate ifs at writeback time (in iomap_writepage_map()). This is ok until now, but when we are going to add support for per-block dirty state bitmap in ifs, this could cause some performance degradation. The reason is that if we don't allocate ifs during ->write_begin(), then we will never mark the necessary dirty bits in ->write_end() call. And we will have to mark all the bits as dirty at the writeback time, that could cause the same write amplification and performance problems as it is now. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Ritesh Harjani (IBM) authored
This patch factors iomap_write_delalloc_punch() function out. This function is resposible for actual punch out operation. The reason for doing this is, to avoid deep indentation when we bring punch-out of individual non-dirty blocks within a dirty folio in a later patch (which adds per-block dirty status handling to iomap) to avoid delalloc block leak. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Ritesh Harjani (IBM) authored
It makes it much easier if we have iomap_punch_t typedef for "punch" function pointer in all delalloc related punch, scan and release functions. It will be useful in later patches when we will factor out iomap_write_delalloc_punch() function. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
folio_next_index() returns an unsigned long value which left shifted by PAGE_SHIFT could possibly cause an overflow on 32-bit system. Instead use folio_pos(folio) + folio_size(folio), which does this correctly. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
This patch adds two of the helper routines ifs_is_fully_uptodate() and ifs_block_is_uptodate() for managing uptodate state of "ifs" state bitmap. In later patches ifs state bitmap array will also handle dirty state of all blocks of a folio. Hence this patch adds some helper routines for handling uptodate state of the ifs state bitmap. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
iomap_folio_state (ifs) can be derived directly from the folio, making it unnecessary to pass "ifs" as an argument to iomap_set_range_uptodate(). This patch eliminates "ifs" argument from iomap_set_range_uptodate() function. Also, the definition of iomap_set_range_uptodate() and ifs_set_range_uptodate() functions are moved above ifs_alloc(). In upcoming patches, we plan to introduce additional helper routines for handling dirty state, with the intention of consolidating all of "ifs" state handling routines at one place. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Ritesh Harjani (IBM) authored
struct iomap_page actually tracks per-block state of a folio. Hence it make sense to rename some of these function names and data structures for e.g. 1. struct iomap_page (iop) -> struct iomap_folio_state (ifs) 2. iomap_page_create() -> ifs_alloc() 3. iomap_page_release() -> ifs_free() 4. iomap_iop_set_range_uptodate() -> ifs_set_range_uptodate() 5. to_iomap_page() -> folio->private Since in later patches we are also going to add per-block dirty state tracking to iomap_folio_state. Hence this patch also renames "uptodate" & "uptodate_lock" members of iomap_folio_state to "state" and"state_lock". We don't really need to_iomap_page() function, instead directly open code it as folio->private; Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
- 24 Jul, 2023 11 commits
-
-
Darrick J. Wong authored
Merge tag 'large-folio-writes' of git://git.infradead.org/users/willy/pagecache into iomap-6.6-merge Create large folios in iomap buffered write path Commit ebb7fb15 limited the length of ioend chains to 4096 entries to improve worst-case latency. Unfortunately, this had the effect of limiting the performance of: fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \ -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \ -numjobs=4 -directory=/mnt/test https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/ The problem ends up being lock contention on the i_pages spinlock as we clear the writeback bit on each folio (and propagate that up through the tree). By using larger folios, we decrease the number of folios to be processed by a factor of 256 for this benchmark, eliminating the lock contention. Creating large folios in the buffered write path is also the right thing to do. It's a project that has been on the back burner for years, it just hasn't been important enough to do before now. * tag 'large-folio-writes' of git://git.infradead.org/users/willy/pagecache: iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef iomap: Remove unnecessary test from iomap_release_folio() doc: Correct the description of ->release_folio iomap: Remove large folio handling in iomap_invalidate_folio() iov_iter: Add copy_folio_from_iter_atomic() iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() iov_iter: Map the page later in copy_page_from_iter_atomic() [djwong: yay amortizations!] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
If we have a large folio, we can copy in larger chunks than PAGE_SIZE. Start at the maximum page cache size and shrink by half every time we hit the "we are short on memory" problem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
Use the size of the write as a hint for the size of the folio to create. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
Allow callers of __filemap_get_folio() to specify a preferred folio order in the FGP flags. This is only honoured in the FGP_CREATE path; if there is already a folio in the page cache that covers the index, we will return it, no matter what its order is. No create-around is attempted; we will only create folios which start at the specified index. Unmodified callers will continue to allocate order 0 folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
Similarly to gfp_t, define fgf_t as its own type to prevent various misuses and confusion. Leave the flags as FGP_* for now to reduce the size of this patch; they will be converted to FGF_* later. Move the documentation to the definition of the type insted of burying it in the __filemap_get_folio() documentation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Matthew Wilcox (Oracle) authored
The check for the folio being under writeback is unnecessary; the caller has checked this and the folio is locked, so the folio cannot be under writeback at this point. The comment is somewhat misleading in that it talks about one specific situation in which we can see a dirty folio. There are others, so change the comment to explain why we can't release the iomap_page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
The filesystem ->release_folio method is called under more circumstances now than when the documentation was written. The second sentence describing the interpretation of the return value is the wrong polarity (false indicates failure, not success). And the third sentence is also wrong (the kernel calls try_to_free_buffers() instead). So replace the entire paragraph with a detailed description of what the state of the folio may be, the meaning of the gfp parameter, why the method is being called and what the filesystem is expected to do. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
We do not need to release the iomap_page in iomap_invalidate_folio() to allow the folio to be split. The splitting code will call ->release_folio() if there is still per-fs private data attached to the folio. At that point, we will check if the folio is still dirty and decline to release the iomap_page. It is possible to trigger the warning in perfectly legitimate circumstances (eg if a disk read fails, we do a partial write to the folio, then we truncate the folio), which will cause those writes to be lost. Fixes: 60d82310 ("iomap: Support large folios in invalidatepage") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Matthew Wilcox (Oracle) authored
Add a folio wrapper around copy_page_from_iter_atomic(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
Matthew Wilcox (Oracle) authored
copy_page_from_iter_atomic() already handles !highmem compound pages correctly, but if we are passed a highmem compound page, each base page needs to be mapped & unmapped individually. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Matthew Wilcox (Oracle) authored
Remove a couple of calls to kunmap_atomic() in the rare error cases. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-
- 23 Jul, 2023 11 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds authored
Pull tracing fixes from Steven Rostedt: - Swapping the ring buffer for snapshotting (for things like irqsoff) can crash if the ring buffer is being resized. Disable swapping when this happens. The missed swap will be reported to the tracer - Report error if the histogram fails to be created due to an error in adding a histogram variable, in event_hist_trigger_parse() - Remove unused declaration of tracing_map_set_field_descr() * tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/histograms: Return an error if we fail to add histogram to hist_vars list ring-buffer: Do not swap cpu_buffer during resize process tracing: Remove unused extern declaration tracing_map_set_field_descr()
-
Linus Torvalds authored
Merge tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix stale help text in gconfig - Support *.S files in compile_commands.json - Flatten KBUILD_CFLAGS - Fix external module builds with Rust so that temporary files are created in the modules directories instead of the kernel tree * tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: rust: avoid creating temporary files kbuild: flatten KBUILD_CFLAGS gen_compile_commands: add assembly files to compilation database kconfig: gconfig: correct program name in help text kconfig: gconfig: drop the Show Debug Info help text
-
Miguel Ojeda authored
`rustc` outputs by default the temporary files (i.e. the ones saved by `-Csave-temps`, such as `*.rcgu*` files) in the current working directory when `-o` and `--out-dir` are not given (even if `--emit=x=path` is given, i.e. it does not use those for temporaries). Since out-of-tree modules are compiled from the `linux` tree, `rustc` then tries to create them there, which may not be accessible. Thus pass `--out-dir` explicitly, even if it is just for the temporary files. Similarly, do so for Rust host programs too. Reported-by: Raphael Nestler <raphael.nestler@gmail.com> Closes: https://github.com/Rust-for-Linux/linux/issues/1015Reported-by: Andrea Righi <andrea.righi@canonical.com> Tested-by: Raphael Nestler <raphael.nestler@gmail.com> # non-hostprogs Tested-by: Andrea Righi <andrea.righi@canonical.com> # non-hostprogs Fixes: 295d8398 ("kbuild: specify output names separately for each emission type from rustc") Cc: stable@vger.kernel.org Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
-
git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds authored
Pull kvm fixes from Paolo Bonzini: "ARM: - Avoid pKVM finalization if KVM initialization fails - Add missing BTI instructions in the hypervisor, fixing an early boot failure on BTI systems - Handle MMU notifiers correctly for non hugepage-aligned memslots - Work around a bug in the architecture where hypervisor timer controls have UNKNOWN behavior under nested virt - Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG in cpu hotplug resulting from per-CPU accessor sanity checking - Make WFI emulation on GICv4 systems robust w.r.t. preemption, consistently requesting a doorbell interrupt on vcpu_put() - Uphold RES0 sysreg behavior when emulating older PMU versions - Avoid macro expansion when initializing PMU register names, ensuring the tracepoints pretty-print the sysreg s390: - Two fixes for asynchronous destroy x86 fixes will come early next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: s390: pv: fix index value of replaced ASCE KVM: s390: pv: simplify shutdown and fix race KVM: arm64: Fix the name of sys_reg_desc related to PMU KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption KVM: arm64: Add missing BTI instructions KVM: arm64: Correctly handle page aging notifiers for unaligned memslot KVM: arm64: Disable preemption in kvm_arch_hardware_enable() KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
-
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds authored
Pull ext4 fixes from Ted Ts'o: "Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's checkpoint code" * tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix rbtree traversal bug in ext4_mb_use_preallocated ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail() ext4: correct inline offset when handling xattrs in inode body jbd2: remove __journal_try_to_free_buffer() jbd2: fix a race when checking checkpoint buffer busy jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint jbd2: remove journal_clean_one_cp_list() jbd2: remove t_checkpoint_io_list jbd2: recheck chechpointing non-dirty buffer
-
git://git.samba.org/sfrench/cifs-2.6Linus Torvalds authored
Pull smb client fix from Steve French: "Add minor debugging improvement. The change improves ability to read a network trace to debug problems on encrypted connections which are very common (e.g. using wireshark or tcpdump). That works today with tools like 'smbinfo keys /mnt/file' but requires passing in a filename on the mount (see e.g. [1]), but it often makes more sense to just pass in the mount point path (ie a directory not a filename). So this fix was needed to debug some types of problems (an obvious example is on an encrypted connection failing operations on an empty share or with no files in the root of the directory) - so you can simply pass in the 'smbinfo keys <mntpoint>' and get the information that wireshark needs" Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1] * tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number for cifs.ko cifs: allow dumping keys for directories too
-
Paolo Bonzini authored
Merge tag 'kvm-s390-master-6.5-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD Two fixes for asynchronous destroy
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.5, part #1 - Avoid pKVM finalization if KVM initialization fails - Add missing BTI instructions in the hypervisor, fixing an early boot failure on BTI systems - Handle MMU notifiers correctly for non hugepage-aligned memslots - Work around a bug in the architecture where hypervisor timer controls have UNKNOWN behavior under nested virt. - Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG in cpu hotplug resulting from per-CPU accessor sanity checking. - Make WFI emulation on GICv4 systems robust w.r.t. preemption, consistently requesting a doorbell interrupt on vcpu_put() - Uphold RES0 sysreg behavior when emulating older PMU versions - Avoid macro expansion when initializing PMU register names, ensuring the tracepoints pretty-print the sysreg.
-
Mohamed Khalfella authored
Commit 6018b585 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") added a check to fail histogram creation if save_hist_vars() failed to add histogram to hist_vars list. But the commit failed to set ret to failed return code before jumping to unregister histogram, fix it. Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com Cc: stable@vger.kernel.org Fixes: 6018b585 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Chen Lin authored
When ring_buffer_swap_cpu was called during resize process, the cpu buffer was swapped in the middle, resulting in incorrect state. Continuing to run in the wrong state will result in oops. This issue can be easily reproduced using the following two scripts: /tmp # cat test1.sh //#! /bin/sh for i in `seq 0 100000` do echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 done /tmp # cat test2.sh //#! /bin/sh for i in `seq 0 100000` do echo irqsoff > /sys/kernel/debug/tracing/current_tracer sleep 1 echo nop > /sys/kernel/debug/tracing/current_tracer sleep 1 done /tmp # ./test1.sh & /tmp # ./test2.sh & A typical oops log is as follows, sometimes with other different oops logs. [ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8 [ 231.713375] Modules linked in: [ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec2 #15 [ 231.716750] Hardware name: linux,dummy-virt (DT) [ 231.718152] Workqueue: events update_pages_handler [ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 231.721171] pc : rb_update_pages+0x378/0x3f8 [ 231.722212] lr : rb_update_pages+0x25c/0x3f8 [ 231.723248] sp : ffff800082b9bd50 [ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0 [ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a [ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000 [ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510 [ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002 [ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558 [ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001 [ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000 [ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208 [ 231.744196] Call trace: [ 231.744892] rb_update_pages+0x378/0x3f8 [ 231.745893] update_pages_handler+0x1c/0x38 [ 231.746893] process_one_work+0x1f0/0x468 [ 231.747852] worker_thread+0x54/0x410 [ 231.748737] kthread+0x124/0x138 [ 231.749549] ret_from_fork+0x10/0x20 [ 231.750434] ---[ end trace 0000000000000000 ]--- [ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 233.721696] Mem abort info: [ 233.721935] ESR = 0x0000000096000004 [ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits [ 233.722596] SET = 0, FnV = 0 [ 233.722805] EA = 0, S1PTW = 0 [ 233.723026] FSC = 0x04: level 0 translation fault [ 233.723458] Data abort info: [ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000 [ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 233.726720] Modules linked in: [ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec2 #15 [ 233.727777] Hardware name: linux,dummy-virt (DT) [ 233.728225] Workqueue: events update_pages_handler [ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 233.729054] pc : rb_update_pages+0x1a8/0x3f8 [ 233.729334] lr : rb_update_pages+0x154/0x3f8 [ 233.729592] sp : ffff800082b9bd50 [ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418 [ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003 [ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58 [ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001 [ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000 [ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c [ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0 [ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000 [ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000 [ 233.734418] Call trace: [ 233.734593] rb_update_pages+0x1a8/0x3f8 [ 233.734853] update_pages_handler+0x1c/0x38 [ 233.735148] process_one_work+0x1f0/0x468 [ 233.735525] worker_thread+0x54/0x410 [ 233.735852] kthread+0x124/0x138 [ 233.736064] ret_from_fork+0x10/0x20 [ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060) [ 233.736959] ---[ end trace 0000000000000000 ]--- After analysis, the seq of the error is as follows [1-5]: int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, int cpu_id) { for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //1. get cpu_buffer, aka cpu_buffer(A) ... ... schedule_work_on(cpu, &cpu_buffer->update_pages_work); //2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to // update_pages_handler, do the update process, set 'update_done' in // complete(&cpu_buffer->update_done) and to wakeup resize process. //----> //3. Just at this moment, ring_buffer_swap_cpu is triggered, //cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer. //ring_buffer_swap_cpu is called as the 'Call trace' below. Call trace: dump_backtrace+0x0/0x2f8 show_stack+0x18/0x28 dump_stack+0x12c/0x188 ring_buffer_swap_cpu+0x2f8/0x328 update_max_tr_single+0x180/0x210 check_critical_timing+0x2b4/0x2c8 tracer_hardirqs_on+0x1c0/0x200 trace_hardirqs_on+0xec/0x378 el0_svc_common+0x64/0x260 do_el0_svc+0x90/0xf8 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x180/0x1c0 //<---- /* wait for all the updates to complete */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //4. get cpu_buffer, cpu_buffer(B) is used in the following process, //the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong. //for example, cpu_buffer(A)->update_done will leave be set 1, and will //not 'wait_for_completion' at the next resize round. if (!cpu_buffer->nr_pages_to_update) continue; if (cpu_online(cpu)) wait_for_completion(&cpu_buffer->update_done); cpu_buffer->nr_pages_to_update = 0; } ... } //5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong, //Continuing to run in the wrong state, then oops occurs. Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cnSigned-off-by: Chen Lin <chen.lin5@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-