- 01 Feb, 2021 38 commits
-
-
Pavel Begunkov authored
WARNING: inconsistent lock state inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs->lock); <Interrupt> lock(&fs->lock); *** DEADLOCK *** 1 lock held by syz-executor217/8450: #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442 stack backtrace: CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> [...] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_req_clean_work fs/io_uring.c:1398 [inline] io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046 io_free_req fs/io_uring.c:2269 [inline] io_double_put_req fs/io_uring.c:2392 [inline] io_put_req+0xf9/0x570 fs/io_uring.c:2388 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline] __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline] run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline] sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629 RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline] RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199 spin_unlock_irq include/linux/spinlock.h:404 [inline] io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639 io_queue_link_head fs/io_uring.c:6650 [inline] io_submit_sqe fs/io_uring.c:6697 [inline] io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Don't free requests from under hrtimer context (softirq) as it may sleep or take spinlocks improperly (e.g. non-irq versions). Cc: stable@vger.kernel.org # 5.6+ Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dan Carpenter authored
If we hit a "goto out_free;" before the "ctx->file_data" pointer has been assigned then it leads to a NULL derefence when we call: free_fixed_rsrc_data(ctx->file_data); We can fix this by moving the assignment earlier. Fixes: 1ad555c6 ("io_uring: create common fixed_rsrc_data allocation routines") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hao Xu authored
Abaci reported this issue: #[ 605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds. [ 605.172123] Not tainted 5.10.0+ #1 [ 605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.173915] task:kworker/u4:1 state:D stack: 0 pid: 53 ppid: 2 flags:0x00004000 [ 605.175130] Workqueue: events_unbound io_ring_exit_work [ 605.175931] Call Trace: [ 605.176334] __schedule+0xe0e/0x25a0 [ 605.176971] ? firmware_map_remove+0x1a1/0x1a1 [ 605.177631] ? write_comp_data+0x2a/0x80 [ 605.178272] schedule+0xd0/0x270 [ 605.178811] schedule_timeout+0x6b6/0x940 [ 605.179415] ? mark_lock.part.0+0xca/0x1420 [ 605.180062] ? usleep_range+0x170/0x170 [ 605.180684] ? wait_for_completion+0x16d/0x280 [ 605.181392] ? mark_held_locks+0x9e/0xe0 [ 605.182079] ? rwlock_bug.part.0+0x90/0x90 [ 605.182853] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 605.183817] wait_for_completion+0x175/0x280 [ 605.184713] ? wait_for_completion_interruptible+0x340/0x340 [ 605.185611] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.186307] ? migrate_swap_stop+0x9c0/0x9c0 [ 605.187046] kthread_park+0x127/0x1c0 [ 605.187738] io_sq_thread_stop+0xd5/0x530 [ 605.188459] io_ring_exit_work+0xb1/0x970 [ 605.189207] process_one_work+0x92c/0x1510 [ 605.189947] ? pwq_dec_nr_in_flight+0x360/0x360 [ 605.190682] ? rwlock_bug.part.0+0x90/0x90 [ 605.191430] ? write_comp_data+0x2a/0x80 [ 605.192207] worker_thread+0x9b/0xe20 [ 605.192900] ? process_one_work+0x1510/0x1510 [ 605.193599] kthread+0x353/0x460 [ 605.194154] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.194910] ? kthread_create_on_node+0x100/0x100 [ 605.195821] ret_from_fork+0x1f/0x30 [ 605.196605] [ 605.196605] Showing all locks held in the system: [ 605.197598] 1 lock held by khungtaskd/25: [ 605.198301] #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30 [ 605.199914] 3 locks held by kworker/u4:1/53: [ 605.200609] #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510 [ 605.202108] #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510 [ 605.203681] #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50 [ 605.205183] 3 locks held by systemd-journal/161: [ 605.206037] 1 lock held by syslog-ng/254: [ 605.206674] 2 locks held by agetty/311: [ 605.207292] #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.208715] #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.210131] 2 locks held by bash/677: [ 605.210723] #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.212105] #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.213777] [ 605.214151] ============================================= I believe this is caused by the follow race: (ctx_list is empty now) => io_put_sq_data | ==> kthread_park(sqd->thread); | ====> set KTHREAD_SHOULD_PARK | ====> wake_up_process(k) | sq thread is running | | | needs_sched is true since no ctx, | so TASK_INTERRUPTIBLE set and schedule | out then never wake up again | ====> wait_for_completion | (stuck here) So check if sqthread gets park flag right before schedule(). since ctx_list is always empty when this problem happens, here I put kthread_should_park() before setting the wakeup flag(ctx_list is empty so this for loop is fast), where is close enough to schedule(). The problem doesn't show again in my repro testing after this fix. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Add the missing kernel io_uring header, add Pavel as a reviewer, and exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al (mainly) with bug reports, patches, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
noah authored
This patch adds support for skipping a file descriptor when using IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn added as a #define in io_uring.h Signed-off-by: noah <goldstein.w.n@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Replace a while with a simple for loop, that looks way more natural, and enables us to use "continue" as indexes are no more updated by hand in the end of the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We grab a task for each request and while putting it it also have to do extra work like inflight accounting and waking up that task. This sequence is duplicated several time, it's good time to add a helper. More to that, the helper generates better code due to better locality and so not failing alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
For SQPOLL io_uring we want to have only one file note held by sqo_task. Add a warning to make sure it holds. It's deep in io_uring_add_task_file() out of hot path, so shouldn't hurt. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yejune Deng authored
The function io_remove_personalities() is very similar to io_unregister_personality(),so implement io_remove_personalities() calling io_unregister_personality(). Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging it as uncancelable, kill it of. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We currently split the close into two, in case we have a ->flush op that we can't safely handle from non-blocking context. This requires us to flag the op as uncancelable if we do need to punt it async, and that means special handling for just this op type. Use __close_fd_get_file() and grab the files lock so we can get the file and check if we need to go async in one atomic operation. That gets rid of the need for splitting this into two steps, and hence the need for IO_WQ_WORK_NO_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Assumes current->files->file_lock is already held on invocation. Helps the caller check the file before removing the fd, if it needs to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
When a request is completed with comp_state, its completion reference put is deferred to io_submit_flush_completions(), but the submission is put not far from there, so do it together to save one atomic dec per request. That targets requests that complete inline, e.g. buffered rw, send/recv. Proper benchmarking haven't been conducted but for nops(batch=32) it was around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_submit_flush_completions() is called down the stack in the _state version of io_req_complete(), that's ok because is only called by io_uring opcode handler functions directly. Move it up to __io_queue_sqe() as preparation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
__io_req_complete() inlining is a bit weird, some compilers don't optimise out the non-NULL branch of it even when called as io_req_complete(). Help it a bit by extracting state and stateless helpers out of __io_req_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Deduplicates translation of timeout flags into hrtimer_mode. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
When io_req_task_work_add() fails, the request will be cancelled by enqueueing via task_works of io-wq. Extract a function for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
The check in io_state_file_put() is optimised pretty well when called from __io_file_get(). Don't pollute the code with all these variants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Get rid of a label in io_alloc_req(), it's cleaner to do return directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Apparently, there is one more place hand coded calculation of number of CQ events in the ring. Use __io_cqring_events() helper in io_get_cqring() as well. Naturally, assembly stays identical. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Inline it in its only user, that's cleaner Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
The name is confusing and it's used only in one place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
personality_idr is usually synchronised by uring_lock, the exception would be removing personalities in io_ring_ctx_wait_and_kill(), which is legit as refs are killed by that point but still would be more resilient to do it under the lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
It's awkward to pass return a value into a function for it to return it back. Check it at the caller site and clean up io_resubmit_prep() a bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so it's checked first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Make the percpu ref release function names consistent between rsrc data and nodes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Create common alloc/free fixed_rsrc_data routines for both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove buffer part] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Create common routines to be used for both files/buffers registration. [remove io_sqe_rsrc_set_node substitution] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [merge, quiesce only for files] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
A simple prep patch allowing to set refnode callbacks after it was allocated. This needed to 1) keep ourself off of hi-level functions where it's not pretty and they are not necessary 2) amortise ref_node allocation in the future, e.g. for updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Split alloc_fixed_file_ref_node into resource generic/specific parts, to be leveraged for fixed buffers. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Encapsulate resource reference locking into separate routines. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Uplevel ref_list and make it common to all resources. This is to allow one common ref_list to be used for both files, and buffers in upcoming patches. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Generalize io_queue_rsrc_removal to handle both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove io_mapped_ubuf from rsrc tables/etc. for now] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
This is a prep rename patch for subsequent patches to generalize file registration. [io_uring_rsrc_update:: rename fds -> data] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [leave io_uring_files_update as struct] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Move allocation of buffer management structures, and validation of buffers into separate routines. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bijan Mottahedeh authored
Split io_sqe_buffer_register into two routines: - io_sqe_buffer_register() registers a single buffer - io_sqe_buffers_register iterates over all user specified buffers Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Instead of being pessimistic and assume that path lookup will block, use LOOKUP_CACHED to attempt just a cached lookup. This ensures that the fast path is always done inline, and we only punt to async context if IO is needed to satisfy the lookup. For forced nonblock open attempts, mark the file O_NONBLOCK over the actual ->open() call as well. We can safely clear this again before doing fd_install(), so it'll never be user visible that we fiddled with it. This greatly improves the performance of file open where the dentry is already cached: ached 5.10-git 5.10-git+LOOKUP_CACHED Speedup --------------------------------------------------------------- 33% 1,014,975 900,474 1.1x 89% 545,466 292,937 1.9x 100% 435,636 151,475 2.9x The more cache hot we are, the faster the inline LOOKUP_CACHED optimization helps. This is unsurprising and expected, as a thread offload becomes a more dominant part of the total overhead. If we look at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in the dentry cache will yield: 275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0 275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0 275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991 275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1 275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4 which shows a failed nonblock lookup, then punt to worker, and then we complete with fd == 4. This takes 65 usec in total. Re-running the same test case again: 281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0 281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0 281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4 shows the same request completing inline, also returning fd == 4. This takes 6 usec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.12/io_uring Merge RESOLVE_CACHED bits from Al, as the io_uring changes will build on top of that. * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED fs: add support for LOOKUP_CACHED saner calling conventions for unlazy_child() fs: make unlazy_walk() error handling consistent fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast() do_tmpfile(): don't mess with finish_open()
-
- 31 Jan, 2021 2 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull EFI fix from Borislav Petkov: "A single fix from Lukas: handle boolean device properties imported from Apple firmware correctly" * tag 'efi-urgent-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/apple-properties: Reinstate support for boolean properties
-