1. 05 Feb, 2021 2 commits
  2. 04 Feb, 2021 13 commits
  3. 01 Feb, 2021 25 commits
    • Pavel Begunkov's avatar
      io_uring: simplify do_read return parsing · 57cd657b
      Pavel Begunkov authored
      do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
      enough of a case to prioritise it. Fold it into ret < 0 check, so we get
      rid of an extra if and make it a bit more readable.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      57cd657b
    • Pavel Begunkov's avatar
      io_uring: deduplicate adding to REQ_F_INFLIGHT · ce3d5aae
      Pavel Begunkov authored
      We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to
      extract a helper for marking requests as so.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ce3d5aae
    • Pavel Begunkov's avatar
      io_uring: remove work flags after cleanup · e86d0047
      Pavel Begunkov authored
      Shouldn't be a problem now, but it's better to clean
      REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are
      killed, so cancellation see them.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e86d0047
    • Pavel Begunkov's avatar
      io_uring: inline io_req_drop_files() · 34e08fed
      Pavel Begunkov authored
      req->files now have same lifetime as all other iowq-work resources,
      inline io_req_drop_files() for consistency. Moreover, since
      REQ_F_INFLIGHT is no more files specific, the function name became
      very confusing.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      34e08fed
    • Pavel Begunkov's avatar
      io_uring: kill not used needs_file_no_error · ba13e23f
      Pavel Begunkov authored
      We have no request types left using needs_file_no_error, remove it.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ba13e23f
    • Pavel Begunkov's avatar
      io_uring: fix inconsistent lock state · 9ae1f8dd
      Pavel Begunkov authored
      WARNING: inconsistent lock state
      
      inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
      syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
      ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&fs->lock);
        <Interrupt>
          lock(&fs->lock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor217/8450:
       #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442
      
      stack backtrace:
      CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
      [...]
       _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:354 [inline]
       io_req_clean_work fs/io_uring.c:1398 [inline]
       io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
       __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
       io_free_req fs/io_uring.c:2269 [inline]
       io_double_put_req fs/io_uring.c:2392 [inline]
       io_put_req+0xf9/0x570 fs/io_uring.c:2388
       io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
       __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
       __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
       hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
       local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
       __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
       asm_call_irq_on_stack+0xf/0x20
       </IRQ>
       __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
       run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
       sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
      RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
      RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
       spin_unlock_irq include/linux/spinlock.h:404 [inline]
       io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
       __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
       io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
       io_queue_link_head fs/io_uring.c:6650 [inline]
       io_submit_sqe fs/io_uring.c:6697 [inline]
       io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
       __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Don't free requests from under hrtimer context (softirq) as it may sleep
      or take spinlocks improperly (e.g. non-irq versions).
      
      Cc: stable@vger.kernel.org # 5.6+
      Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9ae1f8dd
    • Dan Carpenter's avatar
      io_uring: Fix NULL dereference in error in io_sqe_files_register() · 13770a71
      Dan Carpenter authored
      If we hit a "goto out_free;" before the "ctx->file_data" pointer has
      been assigned then it leads to a NULL derefence when we call:
      
      	free_fixed_rsrc_data(ctx->file_data);
      
      We can fix this by moving the assignment earlier.
      
      Fixes: 1ad555c6 ("io_uring: create common fixed_rsrc_data allocation routines")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      13770a71
    • Hao Xu's avatar
      io_uring: check kthread parked flag before sqthread goes to sleep · 8b28fdf2
      Hao Xu authored
      Abaci reported this issue:
      
      #[  605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds.
      [  605.172123]       Not tainted 5.10.0+ #1
      [  605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.173915] task:kworker/u4:1    state:D stack:    0 pid:   53 ppid:     2 flags:0x00004000
      [  605.175130] Workqueue: events_unbound io_ring_exit_work
      [  605.175931] Call Trace:
      [  605.176334]  __schedule+0xe0e/0x25a0
      [  605.176971]  ? firmware_map_remove+0x1a1/0x1a1
      [  605.177631]  ? write_comp_data+0x2a/0x80
      [  605.178272]  schedule+0xd0/0x270
      [  605.178811]  schedule_timeout+0x6b6/0x940
      [  605.179415]  ? mark_lock.part.0+0xca/0x1420
      [  605.180062]  ? usleep_range+0x170/0x170
      [  605.180684]  ? wait_for_completion+0x16d/0x280
      [  605.181392]  ? mark_held_locks+0x9e/0xe0
      [  605.182079]  ? rwlock_bug.part.0+0x90/0x90
      [  605.182853]  ? lockdep_hardirqs_on_prepare+0x286/0x400
      [  605.183817]  wait_for_completion+0x175/0x280
      [  605.184713]  ? wait_for_completion_interruptible+0x340/0x340
      [  605.185611]  ? _raw_spin_unlock_irq+0x24/0x30
      [  605.186307]  ? migrate_swap_stop+0x9c0/0x9c0
      [  605.187046]  kthread_park+0x127/0x1c0
      [  605.187738]  io_sq_thread_stop+0xd5/0x530
      [  605.188459]  io_ring_exit_work+0xb1/0x970
      [  605.189207]  process_one_work+0x92c/0x1510
      [  605.189947]  ? pwq_dec_nr_in_flight+0x360/0x360
      [  605.190682]  ? rwlock_bug.part.0+0x90/0x90
      [  605.191430]  ? write_comp_data+0x2a/0x80
      [  605.192207]  worker_thread+0x9b/0xe20
      [  605.192900]  ? process_one_work+0x1510/0x1510
      [  605.193599]  kthread+0x353/0x460
      [  605.194154]  ? _raw_spin_unlock_irq+0x24/0x30
      [  605.194910]  ? kthread_create_on_node+0x100/0x100
      [  605.195821]  ret_from_fork+0x1f/0x30
      [  605.196605]
      [  605.196605] Showing all locks held in the system:
      [  605.197598] 1 lock held by khungtaskd/25:
      [  605.198301]  #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30
      [  605.199914] 3 locks held by kworker/u4:1/53:
      [  605.200609]  #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510
      [  605.202108]  #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510
      [  605.203681]  #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50
      [  605.205183] 3 locks held by systemd-journal/161:
      [  605.206037] 1 lock held by syslog-ng/254:
      [  605.206674] 2 locks held by agetty/311:
      [  605.207292]  #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
      [  605.208715]  #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
      [  605.210131] 2 locks held by bash/677:
      [  605.210723]  #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
      [  605.212105]  #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
      [  605.213777]
      [  605.214151] =============================================
      
      I believe this is caused by the follow race:
      
      (ctx_list is empty now)
      => io_put_sq_data               |
      ==> kthread_park(sqd->thread);  |
      ====> set KTHREAD_SHOULD_PARK	|
      ====> wake_up_process(k)        | sq thread is running
      				|
      				|
      				| needs_sched is true since no ctx,
      				| so TASK_INTERRUPTIBLE set and schedule
      				| out then never wake up again
      				|
      ====> wait_for_completion	|
      	(stuck here)
      
      So check if sqthread gets park flag right before schedule().
      since ctx_list is always empty when this problem happens, here I put
      kthread_should_park() before setting the wakeup flag(ctx_list is empty
      so this for loop is fast), where is close enough to schedule(). The
      problem doesn't show again in my repro testing after this fix.
      Reported-by: default avatarAbaci <abaci@linux.alibaba.com>
      Signed-off-by: default avatarHao Xu <haoxu@linux.alibaba.com>
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8b28fdf2
    • Pavel Begunkov's avatar
      MAINTAINERS: update io_uring section · 090da7d5
      Pavel Begunkov authored
      Add the missing kernel io_uring header, add Pavel as a reviewer, and
      exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al
      (mainly) with bug reports, patches, etc.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      090da7d5
    • noah's avatar
      io_uring: Add skip option for __io_sqe_files_update · 4e0377a1
      noah authored
      This patch adds support for skipping a file descriptor when using
      IORING_REGISTER_FILES_UPDATE.  __io_sqe_files_update will skip fds set
      to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn
      added as a #define in io_uring.h
      Signed-off-by: default avatarnoah <goldstein.w.n@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4e0377a1
    • Pavel Begunkov's avatar
      io_uring: cleanup files_update looping · 67973b93
      Pavel Begunkov authored
      Replace a while with a simple for loop, that looks way more natural, and
      enables us to use "continue" as indexes are no more updated by hand in
      the end of the loop.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      67973b93
    • Pavel Begunkov's avatar
      io_uring: consolidate putting reqs task · 7c660731
      Pavel Begunkov authored
      We grab a task for each request and while putting it it also have to do
      extra work like inflight accounting and waking up that task. This
      sequence is duplicated several time, it's good time to add a helper.
      More to that, the helper generates better code due to better locality
      and so not failing alias analysis.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7c660731
    • Pavel Begunkov's avatar
      io_uring: ensure only sqo_task has file notes · ecfc8492
      Pavel Begunkov authored
      For SQPOLL io_uring we want to have only one file note held by
      sqo_task. Add a warning to make sure it holds. It's deep in
      io_uring_add_task_file() out of hot path, so shouldn't hurt.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ecfc8492
    • Yejune Deng's avatar
      io_uring: simplify io_remove_personalities() · 0bead8cd
      Yejune Deng authored
      The function io_remove_personalities() is very similar to
      io_unregister_personality(),so implement io_remove_personalities()
      calling io_unregister_personality().
      Signed-off-by: default avatarYejune Deng <yejune.deng@gmail.com>
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0bead8cd
    • Jens Axboe's avatar
      io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL · 4014d943
      Jens Axboe authored
      It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
      it as uncancelable, kill it of.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4014d943
    • Jens Axboe's avatar
      io_uring: get rid of intermediate IORING_OP_CLOSE stage · 9eac1904
      Jens Axboe authored
      We currently split the close into two, in case we have a ->flush op
      that we can't safely handle from non-blocking context. This requires
      us to flag the op as uncancelable if we do need to punt it async, and
      that means special handling for just this op type.
      
      Use __close_fd_get_file() and grab the files lock so we can get the file
      and check if we need to go async in one atomic operation. That gets rid
      of the need for splitting this into two steps, and hence the need for
      IO_WQ_WORK_NO_CANCEL.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9eac1904
    • Jens Axboe's avatar
      fs: provide locked helper variant of close_fd_get_file() · 53dec2ea
      Jens Axboe authored
      Assumes current->files->file_lock is already held on invocation. Helps
      the caller check the file before removing the fd, if it needs to.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      53dec2ea
    • Pavel Begunkov's avatar
      io_uring: save atomic dec for inline executed reqs · e342c807
      Pavel Begunkov authored
      When a request is completed with comp_state, its completion reference
      put is deferred to io_submit_flush_completions(), but the submission
      is put not far from there, so do it together to save one atomic dec per
      request. That targets requests that complete inline, e.g. buffered rw,
      send/recv.
      
      Proper benchmarking haven't been conducted but for nops(batch=32) it was
      around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e342c807
    • Pavel Begunkov's avatar
      io_uring: don't flush CQEs deep down the stack · 9affd664
      Pavel Begunkov authored
      io_submit_flush_completions() is called down the stack in the _state
      version of io_req_complete(), that's ok because is only called by
      io_uring opcode handler functions directly. Move it up to
      __io_queue_sqe() as preparation.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9affd664
    • Pavel Begunkov's avatar
      io_uring: help inlining of io_req_complete() · a38d68db
      Pavel Begunkov authored
      __io_req_complete() inlining is a bit weird, some compilers don't
      optimise out the non-NULL branch of it even when called as
      io_req_complete(). Help it a bit by extracting state and stateless
      helpers out of __io_req_complete().
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a38d68db
    • Pavel Begunkov's avatar
      io_uring: add a helper timeout mode calculation · 8662daec
      Pavel Begunkov authored
      Deduplicates translation of timeout flags into hrtimer_mode.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8662daec
    • Pavel Begunkov's avatar
      io_uring: deduplicate failing task_work_add · eab30c4d
      Pavel Begunkov authored
      When io_req_task_work_add() fails, the request will be cancelled by
      enqueueing via task_works of io-wq. Extract a function for that.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      eab30c4d
    • Pavel Begunkov's avatar
      io_uring: remove __io_state_file_put · 02b23a9a
      Pavel Begunkov authored
      The check in io_state_file_put() is optimised pretty well when called
      from __io_file_get(). Don't pollute the code with all these variants.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      02b23a9a
    • Pavel Begunkov's avatar
      io_uring: simplify io_alloc_req() · 85bcb6c6
      Pavel Begunkov authored
      Get rid of a label in io_alloc_req(), it's cleaner to do return
      directly.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      85bcb6c6
    • Pavel Begunkov's avatar
      io_uring: further deduplicate #CQ events calc · 888aae2e
      Pavel Begunkov authored
      Apparently, there is one more place hand coded calculation of number of
      CQ events in the ring. Use __io_cqring_events() helper in
      io_get_cqring() as well. Naturally, assembly stays identical.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      888aae2e