Commits · c66ae3ec38f946edb1776d25c1c8cd63803b8ec3 · Kirill Smelkov / linux

06 Apr, 2023 7 commits

io_uring: refactor __io_cq_unlock_post_flush() · c66ae3ec

Pavel Begunkov authored Apr 06, 2023

Separate ->task_complete path in __io_cq_unlock_post_flush().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/baa9b8d822f024e4ee01c40209dbbe38d9c8c11d.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c66ae3ec

io_uring: reduce scheduling due to tw · 8751d154

Pavel Begunkov authored Apr 06, 2023

Every task_work will try to wake the task to be executed, which causes
excessive scheduling and additional overhead. For some tw it's
justified, but others won't do much but post a single CQE.

When a task waits for multiple cqes, every such task_work will wake it
up. Instead, the task may give a hint about how many cqes it waits for,
io_req_local_work_add() will compare against it and skip wake ups
if #cqes + #tw is not enough to satisfy the waiting condition. Task_work
that uses the optimisation should be simple enough and never post more
than one CQE. It's also ignored for non DEFER_TASKRUN rings.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8751d154

io_uring: inline llist_add() · 51509400

Pavel Begunkov authored Apr 06, 2023

We'll need to grab some information from the previous request in the tw
list, inline llist_add(), it'll be used in the following patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0165493af7b379943c792114b972f331e7d7d10.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

51509400

io_uring: add tw add flags · 8501fe70

Pavel Begunkov authored Apr 06, 2023

We pass 'allow_local' into io_req_task_work_add() but will need more
flags. Replace it with a flags bit field and name this allow_local
flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4c0f01e7ef4e6feebfb199093cc995af7a19befa.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8501fe70

io_uring: refactor io_cqring_wake() · 6e7248ad

Pavel Begunkov authored Apr 06, 2023

Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush()
use equivalent io_cqring_wake(). With that we can clean it up further
and remove __io_cqring_wake().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/662ee5d898168ac206be06038525e97b64072a46.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6e7248ad

io_uring: optimize local tw add ctx pinning · d73a572d

Pavel Begunkov authored Apr 06, 2023

We currently pin the ctx for io_req_local_work_add() with
percpu_ref_get/put, which implies two rcu_read_lock/unlock pairs and some
extra overhead on top in the fast path. Replace it with a pure rcu read
and let io_ring_exit_work() synchronise against it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cbdfcb6b232627f30e9e50ef91f13c4f05910247.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d73a572d

io_uring: move pinning out of io_req_local_work_add · ab1c590f

Pavel Begunkov authored Apr 06, 2023

Move ctx pinning from io_req_local_work_add() to the caller, looks
better and makes working with the code a bit easier.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/49c0dbed390b0d6d04cb942dd3592879fd5bfb1b.1680782017.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ab1c590f

05 Apr, 2023 1 commit

io_uring/uring_cmd: assign ioucmd->cmd at async prep time · 758d5d64

Jens Axboe authored Apr 05, 2023

Rather than check this in the fast path issue, it makes more sense to
just assign the copy of the data when we're setting it up anyway. This
makes the code a bit cleaner, and removes the need for this check in
the issue path.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

758d5d64

04 Apr, 2023 13 commits

io_uring/rsrc: add custom limit for node caching · 69bbc6ad

Pavel Begunkov authored Apr 04, 2023

The number of entries in the rsrc node cache is limited to 512, which
still seems unnecessarily large. Add per cache thresholds and set to
to 32 for the rsrc node cache.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d0cd538b944dac0bf878e276fc0199f21e6bccea.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

69bbc6ad

io_uring/rsrc: optimise io_rsrc_data refcounting · 757ef468

Pavel Begunkov authored Apr 04, 2023

Every struct io_rsrc_node takes a struct io_rsrc_data reference, which
means all rsrc updates do 2 extra atomics. Replace atomics refcounting
with a int as it's all done under ->uring_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e73c3d6820cf679532696d790b5b8fae23537213.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

757ef468

io_uring/rsrc: add lockdep sanity checks · 1f2c8f61

Pavel Begunkov authored Apr 04, 2023

We should hold ->uring_lock while putting nodes with io_put_rsrc_node(),
add a lockdep check for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b50d5f156ac41450029796738c1dfd22a521df7a.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

1f2c8f61

io_uring/rsrc: cache struct io_rsrc_node · 9eae8655

Pavel Begunkov authored Apr 04, 2023

Add allocation cache for struct io_rsrc_node, it's always allocated and
put under ->uring_lock, so it doesn't need any extra synchronisation
around caches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/252a9d9ef9654e6467af30fdc02f57c0118fb76e.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

9eae8655

io_uring/rsrc: don't offload node free · 36b9818a

Pavel Begunkov authored Apr 04, 2023

struct delayed_work rsrc_put_work was previously used to offload node
freeing because io_rsrc_node_ref_zero() was previously called by RCU in
the IRQ context. Now, as percpu refcounting is gone, we can do it
eagerly at the spot without pushing it to a worker.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/13fb1aac1e8d068ad8fd4a0c6d0d157ab61b90c0.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

36b9818a

io_uring/rsrc: optimise io_rsrc_put allocation · ff7c75ec

Pavel Begunkov authored Apr 04, 2023

Every io_rsrc_node keeps a list of items to put, and all entries are
kmalloc()'ed. However, it's quite often to queue up only one entry per
node, so let's add an inline entry there to avoid extra allocations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c482c1c652c45c85ac52e67c974bc758a50fed5f.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ff7c75ec

io_uring/rsrc: rename rsrc_list · c824986c

Pavel Begunkov authored Apr 04, 2023

We have too many "rsrc" around which makes the name of struct
io_rsrc_node::rsrc_list confusing. The field is responsible for keeping
a list of files or buffers, so call it item_list and add comments
around.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3e34d4dfc1fdbb6b520f904ee6187c2ccf680efe.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c824986c

io_uring/rsrc: kill rsrc_ref_lock · 0a4813b1

Pavel Begunkov authored Apr 04, 2023

We use ->rsrc_ref_lock spinlock to protect ->rsrc_ref_list in
io_rsrc_node_ref_zero(). Now we removed pcpu refcounting, which means
io_rsrc_node_ref_zero() is not executed from the irq context as an RCU
callback anymore, and we also put it under ->uring_lock.
io_rsrc_node_switch(), which queues up nodes into the list, is also
protected by ->uring_lock, so we can safely get rid of ->rsrc_ref_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b60af883c263551190b526a55ff2c9d5ae07141.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0a4813b1

io_uring/rsrc: protect node refs with uring_lock · ef8ae64f

Pavel Begunkov authored Apr 04, 2023

Currently, for nodes we have an atomic counter and some cached
(non-atomic) refs protected by uring_lock. Let's put all ref
manipulations under uring_lock and get rid of the atomic part.
It's free as in all cases we care about we already hold the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25b142feed7d831008257d90c8b17c0115d4fc15.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ef8ae64f

io_uring: io_free_req() via tw · 03adabe8

Pavel Begunkov authored Apr 04, 2023

io_free_req() is not often used but nevertheless problematic as there is
no way to know the current context, it may be used from the submission
path or even by an irq handler. Push it to a fresh context using
task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a92fe80bb068757e51aaa0b105cfbe8f5dfee9e.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

03adabe8

io_uring: don't put nodes under spinlocks · 2ad4c6d0

Pavel Begunkov authored Apr 04, 2023

io_req_put_rsrc() doesn't need any locking, so move it out of
a spinlock section in __io_req_complete_post() and adjust helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5b87a5f31270dade6805f7acafc4cc34b84b241.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

2ad4c6d0

io_uring/rsrc: keep cached refs per node · 8e15c0e7

Pavel Begunkov authored Apr 04, 2023

We cache refs of the current node (i.e. ctx->rsrc_node) in
ctx->rsrc_cached_refs. We'll be moving away from atomics, so move the
cached refs in struct io_rsrc_node for now. It's a prep patch and
shouldn't change anything in practise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9edc3669c1d71b06c2dca78b2b2b8bb9292738b9.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8e15c0e7

io_uring/rsrc: use non-pcpu refcounts for nodes · b8fb5b4f

Pavel Begunkov authored Apr 04, 2023

One problem with the current rsrc infra is that often updates will
generates lots of rsrc nodes, each carry pcpu refs. That takes quite a
lot of memory, especially if there is a stall, and takes lots of CPU
cycles. Only pcpu allocations takes >50 of CPU with a naive benchmark
updating files in a loop.

Replace pcpu refs with normal refcounting. There is already a hot path
avoiding atomics / refs, but following patches will further improve it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e9ed8a9457b331a26555ff9443afc64cdaab7247.1680576071.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b8fb5b4f

03 Apr, 2023 19 commits

io_uring: cap io_sqring_entries() at SQ ring size · e3ef728f

Jens Axboe authored Mar 30, 2023

We already do this manually for the !SQPOLL case, do it in general and
we can also dump the ugly min3() in io_submit_sqes().
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e3ef728f

io_uring: rename trace_io_uring_submit_sqe() tracepoint · 2ad57931

Jens Axboe authored Mar 30, 2023

It has nothing to do with the SQE at this point, it's a request
submission. While in there, get rid of the 'force_nonblock' argument
which is also dead, as we only pass in true.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2ad57931

io_uring: encapsulate task_work state · a282967c

Pavel Begunkov authored Mar 27, 2023

For task works we're passing around a bool pointer for whether the
current ring is locked or not, let's wrap it in a structure, that
will make it more opaque preventing abuse and will also help us
to pass more info in the future if needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

a282967c

io_uring: remove extra tw trylocks · 13bfa6f1

Pavel Begunkov authored Mar 27, 2023

Before cond_resched()'ing in handle_tw_list() we also drop the current
ring context, and so the next loop iteration will need to pick/pin a new
context and do trylock.

The chunk removed by this patch was intended to be an optimisation
covering exactly this case, i.e. retaking the lock after reschedule, but
in reality it's skipped for the first iteration after resched as
described and will keep hammering the lock if it's contended.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

13bfa6f1

io_uring/io-wq: drop outdated comment · 07d99096

Jens Axboe authored Mar 27, 2023

Since the move to PF_IO_WORKER, we don't juggle memory context manually
anymore. Remove that outdated part of the comment for __io_worker_idle().
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07d99096

io_uring: kill unused notif declarations · d322818e

Pavel Begunkov authored Mar 27, 2023

There are two leftover structures from the notification registration
mechanism that has never been released, kill them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f05f65aebaf8b1b5bf28519a8fdb350e3e7c9ad0.1679924536.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d322818e

io-wq: Drop struct io_wqe · eb47943f

Gabriel Krisman Bertazi authored Mar 21, 2023

Since commit 0654b05e7e65 ("io_uring: One wqe per wq"), we have just a
single io_wqe instance embedded per io_wq. Drop the extra structure in
favor of accessing struct io_wq directly, cleaning up quite a bit of
dereferences and backpointers.

No functional changes intended. Tested with liburing's testsuite
and mmtests performance microbenchmarks. I didn't observe any
performance regressions.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

eb47943f

io-wq: Move wq accounting to io_wq · dfd63baf

Gabriel Krisman Bertazi authored Mar 21, 2023

Since we now have a single io_wqe per io_wq instead of per-node, and in
preparation to its removal, move the accounting into the parent
structure.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

dfd63baf

io_uring/kbuf: disallow mapping a badly aligned provided ring buffer · fcb46c0c

Jens Axboe authored Mar 17, 2023

On at least parisc, we have strict requirements on how we virtually map
an address that is shared between the application and the kernel. On
these platforms, IOU_PBUF_RING_MMAP should be used when setting up a
shared ring buffer for provided buffers. If the application is mapping
these pages and asking the kernel to pin+map them as well, then we have
no control over what virtual address we get in the kernel.

For that case, do a sanity check if SHM_COLOUR is defined, and disallow
the mapping request. The application must fall back to using
IOU_PBUF_RING_MMAP for this case, and liburing will do that transparently
with the set of helpers that it has.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fcb46c0c

io_uring: Add KASAN support for alloc_caches · e1fe7ee8

Breno Leitao authored Feb 23, 2023

Add support for KASAN in the alloc_caches (apoll and netmsg_cache).
Thus, if something touches the unused caches, it will raise a KASAN
warning/exception.

It poisons the object when the object is put to the cache, and unpoisons
it when the object is gotten or freed.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

e1fe7ee8

io_uring: Move from hlist to io_wq_work_node · efba1a9e

Breno Leitao authored Feb 23, 2023

Having cache entries linked using the hlist format brings no benefit, and
also requires an unnecessary extra pointer address per cache entry.

Use the internal io_wq_work_node single-linked list for the internal
alloc caches (async_msghdr and async_poll)

This is required to be able to use KASAN on cache entries, since we do
not need to touch unused (and poisoned) cache entries when adding more
entries to the list.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

efba1a9e

io_uring: One wqe per wq · da64d6db

Breno Leitao authored Mar 10, 2023

Right now io_wq allocates one io_wqe per NUMA node.  As io_wq is now
bound to a task, the task basically uses only the NUMA local io_wqe, and
almost never changes NUMA nodes, thus, the other wqes are mostly
unused.

Allocate just one io_wqe embedded into io_wq, and uses all possible cpus
(cpu_possible_mask) in the io_wqe->cpumask.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

da64d6db

io_uring: add support for user mapped provided buffer ring · c56e022c

Jens Axboe authored Mar 14, 2023

The ring mapped provided buffer rings rely on the application allocating
the memory for the ring, and then the kernel will map it. This generally
works fine, but runs into issues on some architectures where we need
to be able to ensure that the kernel and application virtual address for
the ring play nicely together. This at least impacts architectures that
set SHM_COLOUR, but potentially also anyone setting SHMLBA.

To use this variant of ring provided buffers, the application need not
allocate any memory for the ring. Instead the kernel will do so, and
the allocation must subsequently call mmap(2) on the ring with the
offset set to:

	IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)

to get a virtual address for the buffer ring. Normally the application
would allocate a suitable piece of memory (and correctly aligned) and
simply pass that in via io_uring_buf_reg.ring_addr and the kernel would
map it.

Outside of the setup differences, the kernel allocate + user mapped
provided buffer ring works exactly the same.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c56e022c

io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to'flags' · 81cf17cd

Jens Axboe authored Mar 14, 2023

In preparation for allowing flags to be set for registration, rename
the padding and use it for that.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

81cf17cd

io_uring/kbuf: add buffer_list->is_mapped member · 25a2c188

Jens Axboe authored Mar 14, 2023

Rather than rely on checking buffer_list->buf_pages or ->buf_nr_pages,
add a separate member that tracks if this is a ring mapped provided
buffer list or not.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25a2c188

io_uring/kbuf: move pinning of provided buffer ring into helper · ba56b632

Jens Axboe authored Mar 14, 2023

In preparation for allowing the kernel to allocate the provided buffer
rings and have the application mmap it instead, abstract out the
current method of pinning and mapping the user allocated ring.

No functional changes intended in this patch.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ba56b632

io_uring: Adjust mapping wrt architecture aliasing requirements · d808459b

Helge Deller authored Feb 16, 2023

Some architectures have memory cache aliasing requirements (e.g. parisc)
if memory is shared between userspace and kernel. This patch fixes the
kernel to return an aliased address when asked by userspace via mmap().
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d808459b

io_uring: avoid hashing O_DIRECT writes if the filesystem doesn't need it · d4755e15

Jens Axboe authored Mar 07, 2023

io_uring hashes writes to a given file/inode so that it can serialize
them. This is useful if the file system needs exclusive access to the
file to perform the write, as otherwise we end up with a ton of io-wq
threads trying to lock the inode at the same time. This can cause
excessive system time.

But if the file system has flagged that it supports parallel O_DIRECT
writes, then there's no need to serialize the writes. Check for that
through FMODE_DIO_PARALLEL_WRITE and don't hash it if we don't need to.

In a basic test of 8 threads writing to a file on XFS on a gen2 Optane,
with each thread writing in 4k chunks, it improves performance from
~1350K IOPS (or ~5290MiB/sec) to ~1410K IOPS (or ~5500MiB/sec).
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d4755e15

fs: add FMODE_DIO_PARALLEL_WRITE flag · d8aeb44a

Jens Axboe authored Mar 07, 2023

Some filesystems support multiple threads writing to the same file with
O_DIRECT without requiring exclusive access to it. io_uring can use this
hint to avoid serializing dio writes to this inode, instead allowing them
to run in parallel.

XFS and ext4 both fall into this category, so set the flag for both of
them.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d8aeb44a