Commits · d285da7dbd3b3cc9b4cf822039a87ca4e4106ecf · Kirill Smelkov / linux

15 Apr, 2024 40 commits

io_uring/net: set MSG_ZEROCOPY for sendzc in advance · d285da7d

Pavel Begunkov authored Apr 08, 2024

We can set MSG_ZEROCOPY at the preparation step, do it so we don't have
to care about it later in the issue callback.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: get rid of io_notif_complete_tw_ext · 6b7f864b

Pavel Begunkov authored Apr 08, 2024

io_notif_complete_tw_ext() can be removed and combined with
io_notif_complete_tw to make it simpler without sacrificing
anything.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: merge ubuf sendzc callbacks · 99863292

Pavel Begunkov authored Apr 08, 2024

Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a
premature optimisation that doesn't give us much. Merge the functions into
one and reclaim some simplicity back.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d44d68f6f7add33a0dcf0b7fd7b73c2dc543604f.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: return void from io_put_kbuf_comp() · bbbef3e9

Ming Lei authored Apr 07, 2024

The only caller doesn't handle the return value of io_put_kbuf_comp(), so
change its return type into void.

Also follow Jens's suggestion to rename it as io_put_kbuf_drop().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove io_req_put_rsrc_locked() · c29006a2

Pavel Begunkov authored Apr 05, 2024

io_req_put_rsrc_locked() is a weird shim function around
io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding
->uring_lock, so we can just use it directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove async request cache · d9713ad3

Pavel Begunkov authored Apr 05, 2024

io_req_complete_post() was a sole user of ->locked_free_list, but
since we just gutted the function, the cache is not used anymore and
can be removed.

->locked_free_list served as an asynchronous counterpart of the main
request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq.
Now they're all forced to be completed into the main cache directly,
off of the normal completion path or via io_free_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: turn implicit assumptions into a warning · de96e9ae

Pavel Begunkov authored Apr 05, 2024

io_req_complete_post() is now io-wq only and shouldn't be used outside
of it, i.e. it relies that io-wq holds a ref for the request as
explained in a comment below. Let's add a warning to enforce the
assumption and make sure nobody would try to do anything weird.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: kill dead code in io_req_complete_post · f3913000

Ming Lei authored Apr 05, 2024

Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"),
io_req_complete_post() is only called from io-wq submit work, where the
request reference is guaranteed to be grabbed and won't drop to zero
in io_req_complete_post().

Kill the dead code, meantime add req_ref_put() to put the reference.

Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/kbuf: remove dead define · 285207f6

Jens Axboe authored Mar 29, 2024

We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: fix warnings on shadow variables · 1da2f311

Jens Axboe authored Mar 29, 2024

There are a few of those:

io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow]
  170 |                 struct file *f = io_file_from_index(&ctx->file_table, i);
      |                              ^
io_uring/fdinfo.c:53:67: note: previous declaration is here
   53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      |                                                                   ^
io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow]
  187 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/cancel.c:166:31: note: previous declaration is here
  166 |                              struct io_uring_task *tctx,
      |                                                    ^
io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow]
  371 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/register.c:312:24: note: previous declaration is here
  312 |         struct io_uring_task *tctx = NULL;
      |                               ^

and a simple cleanup gets rid of them. For the fdinfo case, make a
distinction between the file being passed in (for the ring), and the
registered files we iterate. For the other two cases, just get rid of
shadowed variable, there's no reason to have a new one.
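
The pattern behind these warnings can be shown in a reduced userspace form. This is only a sketch: struct node, sum_values() and the variable names are hypothetical and are not taken from fdinfo.c, cancel.c or register.c.

```c
#include <stddef.h>

struct node {
	struct node *next;
	int value;
};

/*
 * Shadowing form, the kind of code -Wshadow flags: an inner declaration
 * reuses the name of an outer variable and hides it inside the loop:
 *
 *	int total = 0;
 *	for (; n; n = n->next) {
 *		int total = n->value;	// shadows the outer 'total'
 *		...
 *	}
 */

/* Cleaned-up form: one name per object, no inner redeclaration. */
int sum_values(const struct node *head)
{
	int total = 0;

	for (const struct node *cur = head; cur; cur = cur->next)
		total += cur->value;
	return total;
}
```

Building with -Wshadow (GCC or clang) reports the commented-out form and is silent on sum_values().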
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: move mapping/allocation helpers to a separate file · f15ed8b4

Jens Axboe authored Mar 27, 2024

Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: use unpin_user_pages() where appropriate · 18595c0a

Jens Axboe authored Mar 13, 2024

There are a few cases of open-rolled loops around unpin_user_page(), use
the generic helper instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring · 87585b05

Jens Axboe authored Mar 12, 2024

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context, as that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
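
The "grab a reference only if it's not zero already" step can be sketched in userspace with C11 atomics. This is an illustration, not the kernel code: struct buf_list, buf_list_tryget() and buf_list_put() are hypothetical names, and the RCU read-side protection that keeps the object's memory valid during the lookup is not modelled here.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct buf_list {
	atomic_int refs;	/* 0 means teardown has already started */
};

/* Take a reference only if the count is still non-zero. */
bool buf_list_tryget(struct buf_list *bl)
{
	int old = atomic_load(&bl->refs);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&bl->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool buf_list_put(struct buf_list *bl)
{
	return atomic_fetch_sub(&bl->refs, 1) == 1;
}
```

A caller that gets false from buf_list_tryget() must treat the list as gone; a caller that gets true from buf_list_put() is the one responsible for freeing it.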
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/kbuf: vmap pinned buffer ring · e270bfd2

Jens Axboe authored Mar 12, 2024

This avoids needing to care about HIGHMEM, and it makes the buffer
indexing easier as both ring provided buffer methods are now virtually
mapped in a contiguous fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: unify io_pin_pages() · 1943f96b

Jens Axboe authored Mar 13, 2024

Move it into io_uring.c where it belongs, and use it in there as well
rather than have two implementations of this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: use vmap() for ring mapping · 09fc75e0

Jens Axboe authored Mar 13, 2024

This is the last holdout which does odd page checking, convert it to
vmap just like what is done for the non-mmap path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: get rid of remap_pfn_range() for mapping rings/sqes · 3ab1db3c

Jens Axboe authored Mar 13, 2024

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.

This just covers the rings/sqes; the other remaining mmap user of
remap_pfn_range() will be converted separately. Once that is done,
we can kill the old alloc/free code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


mm: add nommu variant of vm_insert_pages() · 62346c6c

Jens Axboe authored Mar 16, 2024

An identical one exists for vm_insert_page(), add one for
vm_insert_pages() to avoid needing to check for CONFIG_MMU in code using
it.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: Avoid anonymous enums in io_uring uapi · 0f21a957

Gabriel Krisman Bertazi authored Mar 28, 2024

While valid C, anonymous enums confuse Cython (Python to C translator),
as reported by Ritesh (YoSTEALTH) [1] .  Since people rely on it when
building against liburing and we want to keep this header in sync with
the library version, let's name the existing enums in the uapi header.
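
As a small illustration of the difference, both forms below are valid C, but only the second gives the constants a type name that a binding generator can refer to. The DEMO_* identifiers are hypothetical and do not come from the uapi header.

```c
/* Anonymous enum: valid C, but there is no tag to refer to. */
enum {
	DEMO_FLAG_A = 1 << 0,
	DEMO_FLAG_B = 1 << 1,
};

/* Named enum: same kind of constants, plus a tag tools can bind to. */
enum demo_mode {
	DEMO_MODE_X,
	DEMO_MODE_Y,
};

/* A named enum can also be used as a parameter type. */
int demo_mode_is_y(enum demo_mode mode)
{
	return mode == DEMO_MODE_Y;
}
```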

[1] https://github.com/cython/cython/issues/3240
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240328210935.25640-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: use the right type for work_llist empty check · 22537c9f

Jens Axboe authored Mar 25, 2024

io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.
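
Why the wrong helper still worked can be seen in a reduced form. The sketch below uses hypothetical demo_* types rather than the kernel's definitions: a macro that only touches ->first accepts any struct with such a member, while typed helpers let the compiler diagnose a mismatched list head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock-less list head: a single 'first' pointer. */
struct demo_llist_node { struct demo_llist_node *next; };
struct demo_llist_head { struct demo_llist_node *first; };

/* Hypothetical work list head: 'first' plus a 'last' pointer. */
struct demo_wq_node { struct demo_wq_node *next; };
struct demo_wq_list { struct demo_wq_node *first, *last; };

/* A macro only needs a '->first' member, so it accepts either head. */
#define DEMO_FIRST_IS_NULL(head) ((head)->first == NULL)

/* Typed helpers: passing the other head type draws a compiler diagnostic. */
bool demo_llist_empty(const struct demo_llist_head *head)
{
	return head->first == NULL;
}

bool demo_wq_list_empty(const struct demo_wq_list *list)
{
	return list->first == NULL;
}
```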

Fixes: dac6a0ea ("io_uring: ensure iopoll runs local task work as well")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: Remove the now superfluous sentinel elements from ctl_table array · a80929d1

Joel Granados authored Mar 28, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from kernel_io_uring_disabled_table
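
The difference between the two table styles can be sketched in plain C. This is a generic illustration with hypothetical names (struct demo_entry and the demo_lookup_* helpers), not the sysctl registration API: the sentinel form needs one extra empty element per table, while the sized form relies on the caller passing the element count.

```c
#include <stddef.h>
#include <string.h>

struct demo_entry {
	const char *name;	/* NULL name marks the sentinel */
	int value;
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Sentinel style: walk until the empty terminator element. */
int demo_lookup_sentinel(const struct demo_entry *tbl, const char *name)
{
	for (; tbl->name; tbl++)
		if (strcmp(tbl->name, name) == 0)
			return tbl->value;
	return -1;
}

/* Sized style: the element count is passed in, no terminator needed. */
int demo_lookup_sized(const struct demo_entry *tbl, size_t nr,
		      const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (strcmp(tbl[i].name, name) == 0)
			return tbl[i].value;
	return -1;
}
```

With the sized form, a table is declared without the trailing empty element and passed together with DEMO_ARRAY_SIZE(table).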
Signed-off-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: Remove unused function · 4e9706c6

Jiapeng Chong authored Mar 28, 2024

The function is defined in the io_uring.c file, but not called
elsewhere, so delete the unused function.

io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: re-arrange Makefile order · 77a1cd5e

Jens Axboe authored Mar 26, 2024

The object list is a bit of a mess, with core and opcode files mixed in.
Re-arrange it so that we have the core bits first, and then opcode
specific files after that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: refill request cache in memory order · 05eb5fe2

Jens Axboe authored Mar 25, 2024

The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:

1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack

Just add them in reverse to the stack.
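
The ordering effect is easy to reproduce with a plain LIFO stack. In this sketch (hypothetical demo_* names, not the kernel's request cache), a batch laid out in ascending memory order is pushed last-to-first, so later pops return it first-to-last.

```c
#include <stddef.h>

struct demo_req {
	struct demo_req *next;
};

struct demo_stack {
	struct demo_req *top;
};

void demo_push(struct demo_stack *s, struct demo_req *r)
{
	r->next = s->top;
	s->top = r;
}

struct demo_req *demo_pop(struct demo_stack *s)
{
	struct demo_req *r = s->top;

	if (r)
		s->top = r->next;
	return r;
}

/* Push a batch in reverse so that pops come out in memory order. */
void demo_refill(struct demo_stack *s, struct demo_req *batch, size_t nr)
{
	while (nr--)
		demo_push(s, &batch[nr]);
}
```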
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/poll: shrink alloc cache size to 32 · da22bdf3

Jens Axboe authored Mar 21, 2024

This should be plenty, rather than the default of 128, and matches what
we have on the rsrc and futex side as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/alloc_cache: switch to array based caching · 414d0f45

Jens Axboe authored Mar 20, 2024

Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that is cheaper to manage.

Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.

Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.

Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.

This reduces the overhead of the recycled allocations, and it reduces
the amount of code needed to support recycling to about half of
what it currently is.
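
A minimal userspace sketch of an array-based object cache follows, to show the shape of the idea. The struct layout and the demo_cache_* names are assumptions made for illustration and do not mirror the kernel's alloc_cache API. The point is that cached objects need no embedded list node: the cache keeps plain pointers in a fixed-size array.

```c
#include <stdbool.h>
#include <stdlib.h>

struct demo_cache {
	void **entries;		/* array of cached object pointers */
	unsigned int nr;	/* entries currently cached */
	unsigned int max;	/* capacity of the array */
};

bool demo_cache_init(struct demo_cache *c, unsigned int max)
{
	c->entries = calloc(max, sizeof(*c->entries));
	c->nr = 0;
	c->max = max;
	return c->entries != NULL;
}

/* Stash an object for reuse; returns false if the cache is full. */
bool demo_cache_put(struct demo_cache *c, void *obj)
{
	if (c->nr >= c->max)
		return false;
	c->entries[c->nr++] = obj;
	return true;
}

/* Hand back a cached object, or NULL if the cache is empty. */
void *demo_cache_get(struct demo_cache *c)
{
	return c->nr ? c->entries[--c->nr] : NULL;
}

void demo_cache_free(struct demo_cache *c)
{
	free(c->entries);
	c->entries = NULL;
	c->nr = c->max = 0;
}
```

When demo_cache_put() returns false the caller frees the object normally, and when demo_cache_get() returns NULL it falls back to a fresh allocation.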
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: drop ->prep_async() · e10677a8

Jens Axboe authored Mar 18, 2024

It's now unused, drop the code related to it. This includes the
io_issue_defs->manual_alloc field.

While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/uring_cmd: defer SQE copying until it's needed · 5eff57fa

Jens Axboe authored Mar 20, 2024

The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, -EIOCBQUEUED will get returned on issue, as the
operation got successfully queued. For that case, a persistent SQE isn't
needed, as it's just used for issue.

Unless execution goes async immediately, defer copying the double SQE
until it's necessary.
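
The deferral can be pictured as a borrowed pointer that is only turned into an owned copy on demand. The sketch below is a userspace illustration with hypothetical names (struct demo_sqe, struct demo_cmd, demo_prep(), demo_make_persistent()); the 128-byte size stands in for a doubled 64-byte entry and is an assumption, not a value taken from the commit.

```c
#include <string.h>

struct demo_sqe {
	unsigned char bytes[128];	/* assumed size of a "double" entry */
};

struct demo_cmd {
	const struct demo_sqe *sqe;	/* ring memory, or 'copy' below */
	struct demo_sqe copy;		/* persistent storage, filled lazily */
};

/* Prep: just borrow the submitter's entry, no copy yet. */
void demo_prep(struct demo_cmd *cmd, const struct demo_sqe *ring_sqe)
{
	cmd->sqe = ring_sqe;
}

/* Called only when the command must outlive the borrowed entry. */
void demo_make_persistent(struct demo_cmd *cmd)
{
	if (cmd->sqe == &cmd->copy)
		return;		/* already copied */
	memcpy(&cmd->copy, cmd->sqe, sizeof(cmd->copy));
	cmd->sqe = &cmd->copy;
}
```

Commands that complete or get queued straight from the issue path never call demo_make_persistent(), so they skip the 128-byte copy entirely.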

This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:

    10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep

where the prep side drops from 10.60% to ~2%, which is more expected.
Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
to where it was before the async command prep.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/uring_cmd: switch to always allocating async data · d10f19df

Jens Axboe authored Mar 18, 2024

Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: move connect to always using async data · e2ea5a70

Jens Axboe authored Mar 18, 2024

While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/rw: add iovec recycling · d6f911a6

Jens Axboe authored Mar 18, 2024

Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.
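
The recycling idea, holding on to a previously allocated iovec array and only growing it when a request needs more segments, can be sketched in userspace as follows. struct demo_async, its field names and demo_get_iovec() are hypothetical and chosen for illustration; the KASAN handling mentioned above is not modelled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/uio.h>

struct demo_async {
	struct iovec *free_iovec;	/* recycled allocation, may be NULL */
	size_t free_iov_nr;		/* capacity of free_iovec, in entries */
};

/* Return an iovec array with room for 'nr' entries, reusing if possible. */
struct iovec *demo_get_iovec(struct demo_async *io, size_t nr)
{
	if (nr > io->free_iov_nr) {
		struct iovec *iov;

		iov = realloc(io->free_iovec, nr * sizeof(*iov));
		if (!iov)
			return NULL;	/* old array is still valid */
		io->free_iovec = iov;
		io->free_iov_nr = nr;
	}
	return io->free_iovec;
}
```

Once the array has grown to the largest segment count a workload uses, later requests reuse it without touching the allocator.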
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/rw: cleanup retry path · cca65713

Jens Axboe authored Mar 22, 2024

We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: get rid of struct io_rw_state · 0d10bd77

Jens Axboe authored Mar 18, 2024

A separate state struct is not needed anymore, just fold it in with
io_async_rw.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/rw: always setup io_async_rw for read/write requests · a9165b83

Jens Axboe authored Mar 18, 2024

read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.

Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.

This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.

In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup() · d80f9407

Jens Axboe authored Mar 18, 2024

Now that iovec recycling is being done, the iovec is no longer being
freed in there. Hence the kmsg parameter is now useless.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: add iovec recycling · 75191341

Jens Axboe authored Mar 16, 2024

Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recycling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: remove (now) dead code in io_netmsg_recycle() · 9f8539fe

Jens Axboe authored Mar 20, 2024

All net commands have async data at this point, there's no reason to
check if this is the case or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: kill io_msg_alloc_async_prep() · 6498c5c9

Jens Axboe authored Mar 18, 2024

We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: get rid of ->prep_async() for send side · 50220d6a

Jens Axboe authored Mar 18, 2024

Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This means any usage of io_sendrecv_prep_async() and
io_sendmsg_prep_async(), and hence the forced async setup path is now
unified with the normal prep setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: get rid of ->prep_async() for receive side · c6f32c7d

Jens Axboe authored Mar 18, 2024

Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This reduces the footprint of the multishot fast
path of multiple invocations of ->issue() per prep, and also means that
using ->prep_async() can be dropped for recvmsg as this is now done via
setup on the prep side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
