- 06 Apr, 2023 7 commits
-
-
Pavel Begunkov authored
Separate ->task_complete path in __io_cq_unlock_post_flush(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/baa9b8d822f024e4ee01c40209dbbe38d9c8c11d.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Every task_work will try to wake the task to be executed, which causes excessive scheduling and additional overhead. For some tw it's justified, but others won't do much but post a single CQE. When a task waits for multiple cqes, every such task_work will wake it up. Instead, the task may give a hint about how many cqes it waits for, io_req_local_work_add() will compare against it and skip wake ups if #cqes + #tw is not enough to satisfy the waiting condition. Task_work that uses the optimisation should be simple enough and never post more than one CQE. It's also ignored for non DEFER_TASKRUN rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We'll need to grab some information from the previous request in the tw list, inline llist_add(), it'll be used in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0165493af7b379943c792114b972f331e7d7d10.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We pass 'allow_local' into io_req_task_work_add() but will need more flags. Replace it with a flags bit field and name this allow_local flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4c0f01e7ef4e6feebfb199093cc995af7a19befa.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush() use equivalent io_cqring_wake(). With that we can clean it up further and remove __io_cqring_wake(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/662ee5d898168ac206be06038525e97b64072a46.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We currently pin the ctx for io_req_local_work_add() with percpu_ref_get/put, which implies two rcu_read_lock/unlock pairs and some extra overhead on top in the fast path. Replace it with a pure rcu read and let io_ring_exit_work() synchronise against it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cbdfcb6b232627f30e9e50ef91f13c4f05910247.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Move ctx pinning from io_req_local_work_add() to the caller, looks better and makes working with the code a bit easier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/49c0dbed390b0d6d04cb942dd3592879fd5bfb1b.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Apr, 2023 1 commit
-
-
Jens Axboe authored
Rather than check this in the fast path issue, it makes more sense to just assign the copy of the data when we're setting it up anyway. This makes the code a bit cleaner, and removes the need for this check in the issue path. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 04 Apr, 2023 13 commits
-
-
Pavel Begunkov authored
The number of entries in the rsrc node cache is limited to 512, which still seems unnecessarily large. Add per cache thresholds and set to to 32 for the rsrc node cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d0cd538b944dac0bf878e276fc0199f21e6bccea.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Every struct io_rsrc_node takes a struct io_rsrc_data reference, which means all rsrc updates do 2 extra atomics. Replace atomics refcounting with a int as it's all done under ->uring_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e73c3d6820cf679532696d790b5b8fae23537213.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We should hold ->uring_lock while putting nodes with io_put_rsrc_node(), add a lockdep check for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b50d5f156ac41450029796738c1dfd22a521df7a.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Add allocation cache for struct io_rsrc_node, it's always allocated and put under ->uring_lock, so it doesn't need any extra synchronisation around caches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/252a9d9ef9654e6467af30fdc02f57c0118fb76e.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
struct delayed_work rsrc_put_work was previously used to offload node freeing because io_rsrc_node_ref_zero() was previously called by RCU in the IRQ context. Now, as percpu refcounting is gone, we can do it eagerly at the spot without pushing it to a worker. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/13fb1aac1e8d068ad8fd4a0c6d0d157ab61b90c0.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Every io_rsrc_node keeps a list of items to put, and all entries are kmalloc()'ed. However, it's quite often to queue up only one entry per node, so let's add an inline entry there to avoid extra allocations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c482c1c652c45c85ac52e67c974bc758a50fed5f.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We have too many "rsrc" around which makes the name of struct io_rsrc_node::rsrc_list confusing. The field is responsible for keeping a list of files or buffers, so call it item_list and add comments around. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3e34d4dfc1fdbb6b520f904ee6187c2ccf680efe.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We use ->rsrc_ref_lock spinlock to protect ->rsrc_ref_list in io_rsrc_node_ref_zero(). Now we removed pcpu refcounting, which means io_rsrc_node_ref_zero() is not executed from the irq context as an RCU callback anymore, and we also put it under ->uring_lock. io_rsrc_node_switch(), which queues up nodes into the list, is also protected by ->uring_lock, so we can safely get rid of ->rsrc_ref_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6b60af883c263551190b526a55ff2c9d5ae07141.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Currently, for nodes we have an atomic counter and some cached (non-atomic) refs protected by uring_lock. Let's put all ref manipulations under uring_lock and get rid of the atomic part. It's free as in all cases we care about we already hold the lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/25b142feed7d831008257d90c8b17c0115d4fc15.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_free_req() is not often used but nevertheless problematic as there is no way to know the current context, it may be used from the submission path or even by an irq handler. Push it to a fresh context using task_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3a92fe80bb068757e51aaa0b105cfbe8f5dfee9e.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_req_put_rsrc() doesn't need any locking, so move it out of a spinlock section in __io_req_complete_post() and adjust helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d5b87a5f31270dade6805f7acafc4cc34b84b241.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We cache refs of the current node (i.e. ctx->rsrc_node) in ctx->rsrc_cached_refs. We'll be moving away from atomics, so move the cached refs in struct io_rsrc_node for now. It's a prep patch and shouldn't change anything in practise. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9edc3669c1d71b06c2dca78b2b2b8bb9292738b9.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
One problem with the current rsrc infra is that often updates will generates lots of rsrc nodes, each carry pcpu refs. That takes quite a lot of memory, especially if there is a stall, and takes lots of CPU cycles. Only pcpu allocations takes >50 of CPU with a naive benchmark updating files in a loop. Replace pcpu refs with normal refcounting. There is already a hot path avoiding atomics / refs, but following patches will further improve it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e9ed8a9457b331a26555ff9443afc64cdaab7247.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 03 Apr, 2023 19 commits
-
-
Jens Axboe authored
We already do this manually for the !SQPOLL case, do it in general and we can also dump the ugly min3() in io_submit_sqes(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
It has nothing to do with the SQE at this point, it's a request submission. While in there, get rid of the 'force_nonblock' argument which is also dead, as we only pass in true. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
For task works we're passing around a bool pointer for whether the current ring is locked or not, let's wrap it in a structure, that will make it more opaque preventing abuse and will also help us to pass more info in the future if needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Before cond_resched()'ing in handle_tw_list() we also drop the current ring context, and so the next loop iteration will need to pick/pin a new context and do trylock. The chunk removed by this patch was intended to be an optimisation covering exactly this case, i.e. retaking the lock after reschedule, but in reality it's skipped for the first iteration after resched as described and will keep hammering the lock if it's contended. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Since the move to PF_IO_WORKER, we don't juggle memory context manually anymore. Remove that outdated part of the comment for __io_worker_idle(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
There are two leftover structures from the notification registration mechanism that has never been released, kill them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f05f65aebaf8b1b5bf28519a8fdb350e3e7c9ad0.1679924536.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Gabriel Krisman Bertazi authored
Since commit 0654b05e7e65 ("io_uring: One wqe per wq"), we have just a single io_wqe instance embedded per io_wq. Drop the extra structure in favor of accessing struct io_wq directly, cleaning up quite a bit of dereferences and backpointers. No functional changes intended. Tested with liburing's testsuite and mmtests performance microbenchmarks. I didn't observe any performance regressions. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Gabriel Krisman Bertazi authored
Since we now have a single io_wqe per io_wq instead of per-node, and in preparation to its removal, move the accounting into the parent structure. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
On at least parisc, we have strict requirements on how we virtually map an address that is shared between the application and the kernel. On these platforms, IOU_PBUF_RING_MMAP should be used when setting up a shared ring buffer for provided buffers. If the application is mapping these pages and asking the kernel to pin+map them as well, then we have no control over what virtual address we get in the kernel. For that case, do a sanity check if SHM_COLOUR is defined, and disallow the mapping request. The application must fall back to using IOU_PBUF_RING_MMAP for this case, and liburing will do that transparently with the set of helpers that it has. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Breno Leitao authored
Add support for KASAN in the alloc_caches (apoll and netmsg_cache). Thus, if something touches the unused caches, it will raise a KASAN warning/exception. It poisons the object when the object is put to the cache, and unpoisons it when the object is gotten or freed. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Breno Leitao authored
Having cache entries linked using the hlist format brings no benefit, and also requires an unnecessary extra pointer address per cache entry. Use the internal io_wq_work_node single-linked list for the internal alloc caches (async_msghdr and async_poll) This is required to be able to use KASAN on cache entries, since we do not need to touch unused (and poisoned) cache entries when adding more entries to the list. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Breno Leitao authored
Right now io_wq allocates one io_wqe per NUMA node. As io_wq is now bound to a task, the task basically uses only the NUMA local io_wqe, and almost never changes NUMA nodes, thus, the other wqes are mostly unused. Allocate just one io_wqe embedded into io_wq, and uses all possible cpus (cpu_possible_mask) in the io_wqe->cpumask. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The ring mapped provided buffer rings rely on the application allocating the memory for the ring, and then the kernel will map it. This generally works fine, but runs into issues on some architectures where we need to be able to ensure that the kernel and application virtual address for the ring play nicely together. This at least impacts architectures that set SHM_COLOUR, but potentially also anyone setting SHMLBA. To use this variant of ring provided buffers, the application need not allocate any memory for the ring. Instead the kernel will do so, and the allocation must subsequently call mmap(2) on the ring with the offset set to: IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT) to get a virtual address for the buffer ring. Normally the application would allocate a suitable piece of memory (and correctly aligned) and simply pass that in via io_uring_buf_reg.ring_addr and the kernel would map it. Outside of the setup differences, the kernel allocate + user mapped provided buffer ring works exactly the same. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for allowing flags to be set for registration, rename the padding and use it for that. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Rather than rely on checking buffer_list->buf_pages or ->buf_nr_pages, add a separate member that tracks if this is a ring mapped provided buffer list or not. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for allowing the kernel to allocate the provided buffer rings and have the application mmap it instead, abstract out the current method of pinning and mapping the user allocated ring. No functional changes intended in this patch. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Helge Deller authored
Some architectures have memory cache aliasing requirements (e.g. parisc) if memory is shared between userspace and kernel. This patch fixes the kernel to return an aliased address when asked by userspace via mmap(). Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
io_uring hashes writes to a given file/inode so that it can serialize them. This is useful if the file system needs exclusive access to the file to perform the write, as otherwise we end up with a ton of io-wq threads trying to lock the inode at the same time. This can cause excessive system time. But if the file system has flagged that it supports parallel O_DIRECT writes, then there's no need to serialize the writes. Check for that through FMODE_DIO_PARALLEL_WRITE and don't hash it if we don't need to. In a basic test of 8 threads writing to a file on XFS on a gen2 Optane, with each thread writing in 4k chunks, it improves performance from ~1350K IOPS (or ~5290MiB/sec) to ~1410K IOPS (or ~5500MiB/sec). Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Some filesystems support multiple threads writing to the same file with O_DIRECT without requiring exclusive access to it. io_uring can use this hint to avoid serializing dio writes to this inode, instead allowing them to run in parallel. XFS and ext4 both fall into this category, so set the flag for both of them. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-