- 23 Mar, 2020 4 commits
-
-
Pavel Begunkov authored
We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized.

Keep identically hashed work items contiguously in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all of the same hash in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
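To illustrate the idea, here is a minimal, self-contained sketch of tail-tracked hash buckets in a singly linked work list. The names and the bucket count are illustrative, not the actual io-wq code:

    struct work_node {
            struct work_node *next;
            unsigned int hash;
    };

    #define NR_BUCKETS 64   /* illustrative bucket count */

    struct wqe_sketch {
            struct work_node *list, *list_tail;      /* whole work list */
            struct work_node *hash_tail[NR_BUCKETS]; /* tail of each hash run */
    };

    /* Enqueue a hashed item right behind the current tail of its bucket,
     * keeping identically hashed items contiguous in the list. */
    static void enqueue_hashed(struct wqe_sketch *wqe, struct work_node *work)
    {
            unsigned int bucket = work->hash % NR_BUCKETS;
            struct work_node *tail = wqe->hash_tail[bucket];

            wqe->hash_tail[bucket] = work;
            if (!tail) {
                    /* first item with this hash: append at the list end */
                    work->next = NULL;
                    if (wqe->list_tail)
                            wqe->list_tail->next = work;
                    else
                            wqe->list = work;
                    wqe->list_tail = work;
                    return;
            }
            /* splice in directly after the existing run of the same hash */
            work->next = tail->next;
            tail->next = work;
            if (wqe->list_tail == tail)
                    wqe->list_tail = work;
    }

A dequeue that hits a hashed item can then walk from that item to wqe->hash_tail[bucket] and take the whole run in one go, under a single lock acquisition.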
-
Hillf Danton authored
Sync removal of file is only used in case of a GFP_KERNEL kmalloc failure, at the cost of io_file_put::done and work flush, while a glitch like that can be handled at the call site without too much pain.

That said, what is proposed is to drop sync removal of file, and the kink in the neck as well.

Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hillf Danton authored
A case of task hung was reported by syzbot:

    INFO: task syz-executor975:9880 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc6-syzkaller #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor975 D27576  9880   9878 0x80004000
    Call Trace:
     schedule+0xd0/0x2a0 kernel/sched/core.c:4154
     schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
     do_wait_for_common kernel/sched/completion.c:83 [inline]
     __wait_for_common kernel/sched/completion.c:104 [inline]
     wait_for_common kernel/sched/completion.c:115 [inline]
     wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
     io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
     __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
     io_sqe_files_update fs/io_uring.c:5918 [inline]
     __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
     __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
     __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
     __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
     do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

and bisect pointed to 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update").

It is down to the order that we wait for work done before flushing it while nobody is likely going to wake us up.

We can drop that completion on stack, as flushing work itself is a sync operation we need and no more is left behind it. To that end, io_file_put::done is re-used for indicating if it can be freed in the workqueue worker context.

Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>

Rename ->done to ->free_pfile

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
work->data and work->list are shared in a union. io_wq_assign_next() sets ->data if a req has a linked_timeout, but then io-wq may want to use work->list, e.g. to re-enqueue a request, corrupting ->data.

->data is not necessary, just remove it and extract the linked_timeout through @link_list.

Fixes: 60cf46ae ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
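In simplified form, the aliasing looks like this (a sketch, not the actual struct layout):

    struct work_node { struct work_node *next; };

    struct io_wq_work_sketch {
            union {
                    struct work_node list;  /* used by io-wq for (re-)queueing */
                    void *data;             /* was used to stash the linked timeout */
            };
            /* ... */
    };

Any re-enqueue that writes ->list silently clobbers a pointer stashed in ->data, which is why the stash had to go.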
-
- 22 Mar, 2020 1 commit
-
-
Pavel Begunkov authored
After io_assign_current_work() of a linked work, it can be decided to offload it to another thread, doing io_wqe_enqueue(). However, until the next io_assign_current_work() it can be cancelled, which isn't handled. Don't assign it if it's not going to be executed.

Fixes: 60cf46ae ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 Mar, 2020 1 commit
-
-
Lukas Bulwahn authored
Commit bbbdeb47 ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header.

Since then, ./scripts/spdxcheck.py complains:

    include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR

Add parentheses to make spdxcheck.py happy.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 Mar, 2020 1 commit
-
-
Jens Axboe authored
With the previous fixes for number of files open checking, I added some debug code to see if we had other spots where we're checking rlimit() against the async io-wq workers. The only one I found was file size checking, which we should also honor.

During write and fallocate prep, store the max file size and override that for the current ask if we're in io-wq worker context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
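A hedged sketch of the mechanism (field names simplified; the actual patch plumbs this through the write/fallocate prep and issue paths):

    /* at prep time, in the submitting task's context */
    req->fsize = rlimit(RLIMIT_FSIZE);

    /* at issue time, if running from an io-wq worker whose own
     * RLIMIT_FSIZE would not match the submitter's */
    current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize;
    /* ... perform the write/fallocate ... */
    current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;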
-
- 14 Mar, 2020 3 commits
-
-
Pavel Begunkov authored
Enable io-wq hashing stuff for dependent works simply by re-enqueueing such requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
It's a preparation patch removing io_wq_enqueue_hashed(), which now should be done by io_wq_hash_work() + io_wq_enqueue().

Also, set the hash value for dependent works, and do it as late as possible, because req->file can be unavailable before that. This hash will be ignored by io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
This little tweak restores the behaviour that was before the recent io_worker_handle_work() optimisation patches. It makes the function do cond_resched() and flush_signals() only if there is actual work to execute.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 12 Mar, 2020 1 commit
-
-
Pavel Begunkov authored
Deduplicate cancellation parts, as many of them look the same, e.g.

- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Mar, 2020 3 commits
-
-
Jens Axboe authored
Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size. Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This just syncs the header with the liburing version, so there's no confusion on the license of the header parts.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Xiaoguang Wang authored
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to do io completion events polling again; they can rely on io_sq_thread to do the polling work, which can reduce cpu usage and uring_lock contention.

I modified the fio io_uring engine code a bit to evaluate the performance:

    static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                    continue;
            }
    -       if (!o->sqpoll_thread) {
    +       if (o->sqpoll_thread && o->hipri) {
                    r = io_uring_enter(ld, 0, actual_min, IORING_ENTER_GETEVENTS);
                    if (r < 0) {

and used:

    fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read
        -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k
        -size=10G -numjobs=1 -time_based -runtime=120

original codes
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage |     100% |     100% |     100% |     100% |     100%
--------------------------------------------------------------------

with patch
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage |    63.8% |    74.4% |    81.1% |    83.7% |    82.4%
--------------------------------------------------------------------
bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
--------------------------------------------------------------------

From the above test results, we can see that bw has a 5.5%~13% improvement, and the fio process's cpu usage also drops much. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in this case, io_sq_thread always has 100% cpu usage.

I think this patch will be friendly to applications which will often use io_uring_wait_cqe() or similar from liburing.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Mar, 2020 8 commits
-
-
YueHaibing authored
If CONFIG_NET is not set, gcc warns:

    fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
     static int io_setup_async_msg(struct io_kiocb *req,
                ^~~~~~~~~~~~~~~~~~

There are many functions wrapped by CONFIG_NET; move them together to simplify the code, which also fixes this warning.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Minor tweaks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Not easy to tell if we're going over the size of bits we can shove in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check for it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
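The pattern looks roughly like this (a sketch; the enum entries and the exact BUILD_BUG_ON() expression are illustrative, and the check lives inside an init function):

    enum {
            REQ_F_FIXED_FILE_BIT,
            REQ_F_IO_DRAIN_BIT,
            /* ... one entry per flag ... */
            __REQ_F_LAST_BIT,       /* end-of-bits marker, not a real flag */
    };

    /* trips at compile time once the flags outgrow the field */
    BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(req->flags));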
-
Jens Axboe authored
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Like IORING_OP_READV, this is limited to supporting just a single segment in the iovec passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This splits it into two parts, one that imports the message, and one that imports the iovec. This allows a caller to only do the first part, and import the iovec manually afterwards.

No functional changes in this patch.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This adds support for the vectored read. This is limited to supporting just 1 segment in the iov, and is provided just for convenience for applications that use IORING_OP_READV already.

The iov helpers will be used for IORING_OP_RECVMSG as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd.

Now that we have fast (non-async thread) support, a task can have tons of pending reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as:

    (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission, a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
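On the application side, decoding the selected buffer from a completion is a one-liner. A hedged userspace sketch using only the uapi constants named above:

    #include <linux/io_uring.h>

    /* Returns the selected buffer ID, or -1 if no buffer was attached
     * to this completion. */
    static int cqe_buffer_id(const struct io_uring_cqe *cqe)
    {
            if (!(cqe->flags & IORING_CQE_F_BUFFER))
                    return -1;
            return cqe->flags >> IORING_CQE_BUFFER_SHIFT;
    }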
-
Jens Axboe authored
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and lookup the buffers, while the buffer ID can be used to notify the application which buffer in the group was used.

The addr passed in is the starting buffer address, and length is each buffer's length. A number of buffers to add can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified.

No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the maximum ID that can be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
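For example, handing the kernel eight contiguous 4KB buffers as group 7, with buffer IDs 0 through 7. This sketch assumes liburing's io_uring_prep_provide_buffers() helper (which may postdate this commit); the group ID 7 is arbitrary:

    #include <liburing.h>
    #include <stdlib.h>

    static int provide_bufs(struct io_uring *ring)
    {
            void *mem = malloc(8 * 4096);   /* one slab, sliced into 8 buffers */
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!mem || !sqe)
                    return -1;
            /* addr, per-buffer len, nr of buffers, group ID, starting buffer ID */
            io_uring_prep_provide_buffers(sqe, mem, 4096, 8, 7, 0);
            return io_uring_submit(ring);
    }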
-
- 04 Mar, 2020 8 commits
-
-
Jens Axboe authored
This just prepares the ring for having lists of buffers associated with it, that the application can provide for SQEs to consume instead of providing their own.

The buffers are organized by group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
First it changes io-wq interfaces. It replaces {get,put}_work() with free_work(), which is guaranteed to be called exactly once. It also enforces the free_work() callback to be non-NULL.

io_uring follows the changes, and instead of putting a submission reference in io_put_req_async_completion(), it will be done in io_free_work(). As this removes io_get_work() with its corresponding refcount_inc(), the ref balance is maintained.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
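A sketch of the contract change (struct and field names simplified, not the actual io-wq definitions):

    struct io_wq_work;

    /* before: paired hooks the wq owner had to keep balanced */
    struct io_wq_data_old {
            void (*get_work)(struct io_wq_work *);
            void (*put_work)(struct io_wq_work *);
    };

    /* after: one mandatory hook, invoked exactly once per work item */
    struct io_wq_data_new {
            void (*free_work)(struct io_wq_work *);  /* must be non-NULL */
    };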
-
Pavel Begunkov authored
When executing non-linked hashed work, io_worker_handle_work() will lock-unlock wqe->lock to update the hash, and then immediately lock-unlock to get the next work. Optimise this case and do lock/unlock only once.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
There are 2 optimisations:

- Now, io_worker_handle_work() does io_assign_current_work() twice per request, and each one adds a lock/unlock(worker->lock) pair. The first is to reset worker->cur_work to NULL, and the second to set a real work shortly after. If there is a dependent work, set it immediately, which effectively removes the extra NULL'ing.

- There is no use in taking wqe->lock for linked works, as they are not hashed now. Optimise it out.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
This is a preparation patch, it adds some helpers and makes the next patches cleaner:

- extract io_impersonate_work() and io_assign_current_work()
- replace the @next label with a nested do-while
- move put_work() right after NULL'ing cur_work

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
If after dropping the submission reference req->refs == 1, the request is done, because this one is for io_put_work() and will be dropped synchronously shortly after. In this case it's safe to steal a next work from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
There will be no use for @nxt in the handlers, and it doesn't work anyway, so purge it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
The rule is simple: any async handler gets a submission ref and should put it at the end. Make them all follow it, and so be more consistent.

This is a preparation patch, and as io_wq_assign_next() currently won't ever work, this doesn't care to use io_put_req_find_next() instead of io_put_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

refcount_inc_not_zero() -> refcount_inc() fix.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 03 Mar, 2020 1 commit
-
-
Pavel Begunkov authored
Don't abuse labels for plain and straightforward code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 02 Mar, 2020 9 commits
-
-
Nathan Chancellor authored
Clang warns:

    fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
            if (def->pollin)
                ^~~~~~~~~~~
    fs/io_uring.c:4182:2: note: uninitialized use occurs here
            mask |= POLLERR | POLLPRI;
            ^~~~
    fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always true
            if (def->pollin)
            ^~~~~~~~~~~~~~~~
    fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence this warning
            __poll_t mask, ret;
                         ^
                          = 0
    1 warning generated.

io_op_defs has many definitions where pollin is not set, so mask indeed might be uninitialized. Initialize it to zero and change the next assignment to |=, in case further masks are added in the future, to avoid missing changing the assignment then.

Fixes: d7718a9d ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so it's useless setting it for a next req of a link. Thus, remove it from io_prep_linked_timeout(), and inline the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
After __io_queue_sqe() ended up in io_queue_async_work(), it's already known that there is no @nxt req, so skip the check and return from the function.

Also, @nxt initialisation now can be done just before io_put_req_find_next(), as there is no jumping until it's checked.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case.

We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason.

eventfd notification needs a special case as well, to avoid stack blowing recursion or deadlocks.

These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Store the io_kiocb in the private field instead of the poll entry, this is in preparation for allowing multiple waitqueues.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Oleg Nesterov authored
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING.

And in fact the only reason why task_work_run() needs ->pi_lock is the possible race with task_work_cancel(); we can optimize this code and make the locking more clear.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
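An illustrative sketch of the fast path, not the exact kernel code:

    /* task_work_run() fast path: if nothing is queued and the task is
     * not exiting, bail out before touching any lock or cmpxchg(). */
    work = READ_ONCE(task->task_works);
    if (!work && !(task->flags & PF_EXITING))
            return;

    /* only the slow path, which must serialize against
     * task_work_cancel() removing entries, takes ->pi_lock */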
-
Pavel Begunkov authored
@hash_map is unsigned long, but BIT_ULL() is used for manipulations. BIT() is a better match as it returns exactly an unsigned long value.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
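The type difference in one line (assuming the usual <linux/bits.h> definitions):

    unsigned long hash_map;

    hash_map |= BIT_ULL(bit);  /* unsigned long long: wider than the field on 32-bit */
    hash_map |= BIT(bit);      /* unsigned long: exactly @hash_map's type */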
-