1. 12 Feb, 2021 10 commits
  2. 11 Feb, 2021 6 commits
  3. 10 Feb, 2021 23 commits
      io_uring: remove redundant initialization of variable ret · 4a245479
      Colin Ian King authored
      The variable ret is initialized with a value that is never read and
      is then updated with a new value. The initialization is redundant and
      can be removed.
      
      Addresses-Coverity: ("Unused value")
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4a245479
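      An illustrative sketch of the pattern Coverity flags as an unused
      value, using a hypothetical helper rather than the real io_uring code:
      the first assignment to ret is a dead store, since ret is
      unconditionally overwritten before it is ever read.

        static int handle_req(struct io_kiocb *req)
        {
                int ret = 0;                    /* dead store: never read */

                ret = issue_request(req);       /* hypothetical call; first real use of ret */
                return ret;
        }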
      io_uring: unpark SQPOLL thread for cancelation · 34343786
      Pavel Begunkov authored
      We park the SQPOLL task before going into io_uring_cancel_files(), so
      the task won't run task_works, including those that might be important
      for the cancellation passes. In this case it's io_poll_remove_one(),
      which frees requests via io_put_req_deferred().
      
      Unpark it while waiting; that's fine, as we disable submissions
      beforehand, so no new requests will be generated.
      
      INFO: task syz-executor893:8493 blocked for more than 143 seconds.
      Call Trace:
       context_switch kernel/sched/core.c:4327 [inline]
       __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
       schedule+0xcf/0x270 kernel/sched/core.c:5157
       io_uring_cancel_files fs/io_uring.c:8912 [inline]
       io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
       __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
       io_uring_files_cancel include/linux/io_uring.h:51 [inline]
       do_exit+0x2fe/0x2ae0 kernel/exit.c:780
       do_group_exit+0x125/0x310 kernel/exit.c:922
       __do_sys_exit_group kernel/exit.c:933 [inline]
       __se_sys_exit_group kernel/exit.c:931 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Cc: stable@vger.kernel.org # 5.5+
      Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34343786
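      A hedged sketch of the waiting loop shape this change produces: the
      SQPOLL task is unparked around the sleep so it can run its pending
      task_work (e.g. the io_poll_remove_one() -> io_put_req_deferred()
      path), then re-parked before the cancellation code touches shared
      state again. The inflight predicate and the park helpers are stand-ins
      for the ones in fs/io_uring.c, not the literal diff.

        while (files_still_inflight(ctx, files)) {      /* hypothetical check */
                if (ctx->sq_data)
                        io_sq_thread_unpark(ctx->sq_data);  /* let SQPOLL run task_work */
                schedule();
                if (ctx->sq_data)
                        io_sq_thread_park(ctx->sq_data);    /* re-park before continuing */
        }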
      io_uring: place ring SQ/CQ arrays under memcg memory limits · 26bfa89e
      Jens Axboe authored
      Instead of imposing rlimit memlock limits on the rings themselves,
      ensure that we account them properly under memcg with __GFP_ACCOUNT.
      We retain rlimit memlock for registered buffers; this change is just
      for the ring arrays themselves.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      26bfa89e
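      A minimal sketch of the allocation side of this change, modelled on
      the io_mem_alloc() helper in fs/io_uring.c but to be read as
      illustrative rather than the exact diff: adding __GFP_ACCOUNT makes
      the SQ/CQ ring pages chargeable to the allocating task's memory
      cgroup.

        static void *io_mem_alloc(size_t size)
        {
                gfp_t gfp = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
                            __GFP_COMP | __GFP_ACCOUNT;

                return (void *) __get_free_pages(gfp, get_order(size));
        }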
      io_uring: enable kmemcg account for io_uring requests · 91f245d5
      Jens Axboe authored
      This puts io_uring request allocations under memory cgroup accounting
      and limits.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91f245d5
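      A hedged sketch of how this is typically done for a dedicated slab
      cache: create the io_kiocb cache with SLAB_ACCOUNT so each request
      allocation is charged to the caller's memcg. The exact flag set here
      is illustrative.

        req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                SLAB_ACCOUNT);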
      io_uring: enable req cache for IRQ driven IO · c7dae4ba
      Jens Axboe authored
      This is the last class of requests that cannot utilize the req alloc
      cache. Add a per-ctx req cache that is protected by the completion_lock,
      and refill our submit side cache when it gets over our batch count.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c7dae4ba
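      A hedged sketch of the IRQ-side half of such a cache: completions from
      IRQ context can't touch the uring_lock-protected submit cache, so they
      push freed requests onto a per-ctx list under completion_lock, and the
      submit side later splices that list into its own cache. Field and
      helper names below are hypothetical.

        static void req_free_irq(struct io_ring_ctx *ctx, struct io_kiocb *req)
        {
                unsigned long flags;

                spin_lock_irqsave(&ctx->completion_lock, flags);
                list_add(&req->compl.list, &ctx->locked_free_list);
                ctx->locked_free_nr++;
                spin_unlock_irqrestore(&ctx->completion_lock, flags);
        }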
      io_uring: fix possible deadlock in io_uring_poll · ed670c3f
      Hao Xu authored
      Abaci reported the following issue:
      
      [   30.615891] ======================================================
      [   30.616648] WARNING: possible circular locking dependency detected
      [   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
      [   30.618035] ------------------------------------------------------
      [   30.618914] a.out/1128 is trying to acquire lock:
      [   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
      [   30.620505]
      [   30.620505] but task is already holding lock:
      [   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.622349]
      [   30.622349] which lock already depends on the new lock.
      [   30.622349]
      [   30.623289]
      [   30.623289] the existing dependency chain (in reverse order) is:
      [   30.624243]
      [   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
      [   30.625263]        lock_acquire+0x2c7/0x390
      [   30.625868]        __mutex_lock+0xae/0x9f0
      [   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
      [   30.627278]        io_uring_poll+0xcb/0xd0
      [   30.627890]        ep_item_poll.isra.14+0x4e/0x90
      [   30.628531]        do_epoll_ctl+0xb7e/0x1120
      [   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
      [   30.629770]        do_syscall_64+0x2d/0x40
      [   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.631187]
      [   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
      [   30.631985]        check_prevs_add+0x226/0xb00
      [   30.632584]        __lock_acquire+0x1237/0x13a0
      [   30.633207]        lock_acquire+0x2c7/0x390
      [   30.633740]        __mutex_lock+0xae/0x9f0
      [   30.634258]        __ep_eventpoll_poll+0x9f/0x220
      [   30.634879]        __io_arm_poll_handler+0xbf/0x220
      [   30.635462]        io_issue_sqe+0xa6b/0x13e0
      [   30.635982]        __io_queue_sqe+0x10b/0x550
      [   30.636648]        io_queue_sqe+0x235/0x470
      [   30.637281]        io_submit_sqes+0xcce/0xf10
      [   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.638465]        do_syscall_64+0x2d/0x40
      [   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.639643]
      [   30.639643] other info that might help us debug this:
      [   30.639643]
      [   30.640618]  Possible unsafe locking scenario:
      [   30.640618]
      [   30.641402]        CPU0                    CPU1
      [   30.641938]        ----                    ----
      [   30.642664]   lock(&ctx->uring_lock);
      [   30.643425]                                lock(&ep->mtx);
      [   30.644498]                                lock(&ctx->uring_lock);
      [   30.645668]   lock(&ep->mtx);
      [   30.646321]
      [   30.646321]  *** DEADLOCK ***
      [   30.646321]
      [   30.647642] 1 lock held by a.out/1128:
      [   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.649954]
      [   30.649954] stack backtrace:
      [   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
      [   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   30.652290] Call Trace:
      [   30.652688]  dump_stack+0xac/0xe3
      [   30.653164]  check_noncircular+0x11e/0x130
      [   30.653747]  ? check_prevs_add+0x226/0xb00
      [   30.654303]  check_prevs_add+0x226/0xb00
      [   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
      [   30.655564]  __lock_acquire+0x1237/0x13a0
      [   30.656262]  lock_acquire+0x2c7/0x390
      [   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.658014]  __mutex_lock+0xae/0x9f0
      [   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.659112]  ? mark_held_locks+0x5a/0x80
      [   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   30.660885]  ? trace_hardirqs_on+0x46/0x110
      [   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.662696]  __ep_eventpoll_poll+0x9f/0x220
      [   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
      [   30.663875]  __io_arm_poll_handler+0xbf/0x220
      [   30.664463]  io_issue_sqe+0xa6b/0x13e0
      [   30.664984]  ? __lock_acquire+0x782/0x13a0
      [   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.666170]  ? __io_queue_sqe+0x10b/0x550
      [   30.666725]  __io_queue_sqe+0x10b/0x550
      [   30.667252]  ? __fget_files+0x131/0x260
      [   30.667791]  ? io_req_prep+0xd8/0x1090
      [   30.668316]  ? io_queue_sqe+0x235/0x470
      [   30.668868]  io_queue_sqe+0x235/0x470
      [   30.669398]  io_submit_sqes+0xcce/0xf10
      [   30.669931]  ? xa_load+0xe4/0x1c0
      [   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
      [   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
      [   30.672380]  do_syscall_64+0x2d/0x40
      [   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.673503] RIP: 0033:0x7fd89c813239
      [   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
      [   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
      [   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
      [   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
      [   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
      [   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
      
      This can happen if we epoll_wait on an io_uring fd while an sqe in
      that same io_uring instance is reading/writing the epoll fd. So don't
      flush the cqring overflow list from io_uring_poll(); just do a simple
      check instead.
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ed670c3f
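      A hedged approximation of the poll-side change: instead of flushing
      the CQ overflow list from ->poll() (which takes uring_lock and creates
      the ep->mtx -> uring_lock dependency in the lockdep chain above), only
      check whether completions are available or an overflow is pending.
      The field and helper names follow fs/io_uring.c of that era but should
      be treated as illustrative.

        if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
                mask |= EPOLLIN | EPOLLRDNORM;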
      io_uring: defer flushing cached reqs · e5d1bc0a
      Pavel Begunkov authored
      While there are requests in the allocation cache, use them; only when
      those run out, go for the memory stashed in comp.free_list. As list
      manipulations are generally heavy and not cache friendly, flush them
      all, or as many as we can, in one go.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: return success/failure from io_flush_cached_reqs()]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5d1bc0a
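      A hedged sketch of the resulting allocation order, approximating the
      post-series io_alloc_req()/io_flush_cached_reqs() pair rather than
      quoting it: serve from the lockless per-ctx array first, refill it
      from the stashed free list in one go when it runs dry, and only then
      fall back to the slab.

        static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
        {
                struct io_submit_state *state = &ctx->submit_state;

                if (!state->free_reqs && !io_flush_cached_reqs(ctx))
                        return kmem_cache_alloc(req_cachep, GFP_KERNEL);

                state->free_reqs--;
                return state->reqs[state->free_reqs];
        }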
      io_uring: take comp_state from ctx · c5eef2b9
      Pavel Begunkov authored
      __io_queue_sqe() is always called with a non-NULL comp_state, which is
      taken directly from context. Don't pass it around but infer from ctx.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5eef2b9
      io_uring: enable req cache for task_work items · 65453d1e
      Jens Axboe authored
      task_work is run without utilizing the req alloc cache, so any deferred
      items don't get to take advantage of either the alloc or free side of it.
      With task_work now being wrapped by io_uring, we can use the ctx
      completion state to both use the req cache and the completion flush
      batching.
      
      With this, the only request type that cannot take advantage of the req
      cache is IRQ driven IO for regular files / block devices. Anything else,
      including IOPOLL polled IO to those same types, will take advantage of it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      65453d1e
      io_uring: provide FIFO ordering for task_work · 7cbf1722
      Jens Axboe authored
      task_work is a LIFO list, due to how it's implemented as a lockless
      list. For long chains of task_work, this can be problematic as the
      first entry added is the last one processed. Similarly, we'd waste
      a lot of CPU cycles reversing this list.
      
      Wrap the task_work so we have a single task_work entry per task per
      ctx, and use that to run it in the right order.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7cbf1722
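      A hedged sketch of the wrapping this commit describes, with
      hypothetical structure and helper names: instead of queueing one
      (LIFO) task_work callback per request, callbacks are appended to a
      per-task list and a single real task_work entry drains that list in
      submission (FIFO) order.

        struct io_task_work {
                struct io_task_work     *next;  /* appended at the tail */
                void                    (*func)(struct io_task_work *tw);
        };

        static void tctx_task_work(struct callback_head *cb)
        {
                /* detach the pending list once, then run it in FIFO order */
                struct io_task_work *tw = detach_pending(cb);   /* hypothetical */

                while (tw) {
                        struct io_task_work *next = tw->next;

                        tw->func(tw);
                        tw = next;
                }
        }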
      io_uring: use persistent request cache · 1b4c351f
      Jens Axboe authored
      Now that we have the submit_state in the ring itself, we can have io_kiocb
      allocations that are persistent across invocations. This reduces the time
      spent doing slab allocations and frees.
      
      [sil: rebased]
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1b4c351f
      io_uring: feed reqs back into alloc cache · 6ff119a6
      Pavel Begunkov authored
      Make io_req_free_batch(), which is used for inline executed requests
      and IOPOLL, return requests back into the allocation cache, so we
      avoid most of the kmalloc()/kfree() work for those cases.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6ff119a6
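      A hedged sketch of the recycle step, with illustrative names: once a
      request's resources have been dropped, park the io_kiocb in the
      per-ctx array for the next submission to reuse instead of freeing it.

        static void req_recycle(struct io_submit_state *state,
                                struct io_kiocb *req)
        {
                if (state->free_reqs < ARRAY_SIZE(state->reqs))
                        state->reqs[state->free_reqs++] = req;  /* reused on next alloc */
                else
                        kmem_cache_free(req_cachep, req);       /* cache full: free it */
        }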
      io_uring: persistent req cache · bf019da7
      Pavel Begunkov authored
      Don't free batch-allocated requests across syscalls.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bf019da7
      io_uring: count ctx refs separately from reqs · 9ae72463
      Pavel Begunkov authored
      Currently batch free handles request memory freeing and ctx ref
      putting together. Separate them and use different counters; that will
      be needed for reusing request memory.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9ae72463
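      A hedged sketch of the separated counters: the batch tracks how many
      io_kiocbs it holds and, independently, how many ctx references it
      owes, dropping the latter in one percpu_ref_put_many() call at flush
      time. The struct layout is illustrative; percpu_ref_put_many() is the
      real primitive.

        struct req_batch {
                unsigned int    to_free;        /* requests to free or recycle */
                unsigned int    ctx_refs;       /* ctx refs to drop at flush   */
        };

        static void batch_drop_ctx_refs(struct io_ring_ctx *ctx,
                                        struct req_batch *rb)
        {
                if (rb->ctx_refs)
                        percpu_ref_put_many(&ctx->refs, rb->ctx_refs);
        }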
      io_uring: remove fallback_req · 3893f39f
      Pavel Begunkov authored
      Remove fallback_req for now, it gets in the way of other changes.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3893f39f
      io_uring: submit-completion free batching · 905c172f
      Pavel Begunkov authored
      io_submit_flush_completions() does completion batching, but may also use
      free batching as iopoll does. The main beneficiaries should be buffered
      reads/writes and send/recv.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      905c172f
      io_uring: replace list with array for compl batch · 6dd0be1e
      Pavel Begunkov authored
      Reincarnation of an old patch that replaces a list in struct
      io_compl_batch with an array. It's needed to avoid hooking requests
      via their compl.list, because that won't always be available in the
      future.
      
      It also allows splitting io_submit_flush_completions() so we don't
      free under locks, removing an unlock/lock pair that carried a long
      comment describing when it could be done.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6dd0be1e
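      A hedged sketch of the batch container after this change, with an
      illustrative size and field names: completed requests are collected in
      a fixed-size array and flushed when it fills, instead of being
      threaded onto a linked list through each request's compl.list.

        #define IO_COMPL_BATCH  32

        struct io_comp_state {
                unsigned int    nr;
                struct io_kiocb *reqs[IO_COMPL_BATCH];  /* flushed when full */
        };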
      io_uring: don't reinit submit state every time · 5087275d
      Pavel Begunkov authored
      Now that submit_state is retained across syscalls, we can save
      ourselves from initialising it from the ground up for each
      io_submit_sqes(). Set some fields during ctx allocation, and just keep
      them consistent afterwards.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: remove unnecessary zeroing of ctx members]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5087275d
      io_uring: remove ctx from comp_state · ba88ff11
      Pavel Begunkov authored
      Completion state is closely bound to the ctx; we don't need to store a
      ctx pointer inside it, as we always have the ctx around to pass to
      flush.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ba88ff11
      io_uring: don't keep submit_state on stack · 258b29a9
      Pavel Begunkov authored
      struct io_submit_state is quite big (168 bytes) and is going to grow.
      It's better not to keep it on the stack as we do now. Move it into the
      context; it's always protected by uring_lock, so it's fine to have
      only one instance of it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      258b29a9
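      A minimal sketch of the resulting layout (other fields elided, names
      as in fs/io_uring.c but to be read as illustrative): the submit state
      is embedded in the ring context, whose uring_lock already serialises
      submission, so a single instance suffices.

        struct io_ring_ctx {
                /* ... other fields elided ... */
                struct io_submit_state  submit_state;   /* protected by uring_lock */
        };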
      io_uring: don't propagate io_comp_state · 889fca73
      Pavel Begunkov authored
      There is no reason to drag io_comp_state into opcode handlers, we just
      need a flag and the actual work will be done in __io_queue_sqe().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      889fca73
      io_uring: make op handlers always take issue flags · 61e98203
      Pavel Begunkov authored
      Make the opcode handler interfaces a bit more consistent by always
      passing in issue flags. A bulky but easy and mechanical change.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      61e98203
      io_uring: replace force_nonblock with flags · 45d189c6
      Pavel Begunkov authored
      Replace the bool force_nonblock argument with flags. It serves the
      long-standing goal of differentiating the context from which we
      execute. Currently we have some places where invariants, like holding
      uring_lock, are only subtly inferred.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      45d189c6
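      A hedged sketch of the interface after these two patches: opcode
      handlers take a bitmask of issue flags instead of a bool
      force_nonblock, making the calling context explicit. The flag values
      and the trivial handler below are illustrative, not the exact code.

        enum {
                IO_URING_F_NONBLOCK             = 1,    /* was force_nonblock == true */
                IO_URING_F_COMPLETE_DEFER       = 2,    /* caller can batch completions */
        };

        static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
        {
                if (issue_flags & IO_URING_F_NONBLOCK) {
                        /* nonblocking attempt: bail out if we would block */
                }
                return 0;
        }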
  4. 08 Feb, 2021 1 commit