1. 10 Feb, 2021 19 commits
    • Jens Axboe's avatar
      io_uring: enable req cache for IRQ driven IO · c7dae4ba
      Jens Axboe authored
      This is the last class of requests that cannot utilize the req alloc
      cache. Add a per-ctx req cache that is protected by the completion_lock,
      and refill our submit side cache when it gets over our batch count.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c7dae4ba
    • Hao Xu's avatar
      io_uring: fix possible deadlock in io_uring_poll · ed670c3f
      Hao Xu authored
      Abaci reported follow issue:
      
      [   30.615891] ======================================================
      [   30.616648] WARNING: possible circular locking dependency detected
      [   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
      [   30.618035] ------------------------------------------------------
      [   30.618914] a.out/1128 is trying to acquire lock:
      [   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
      [   30.620505]
      [   30.620505] but task is already holding lock:
      [   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.622349]
      [   30.622349] which lock already depends on the new lock.
      [   30.622349]
      [   30.623289]
      [   30.623289] the existing dependency chain (in reverse order) is:
      [   30.624243]
      [   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
      [   30.625263]        lock_acquire+0x2c7/0x390
      [   30.625868]        __mutex_lock+0xae/0x9f0
      [   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
      [   30.627278]        io_uring_poll+0xcb/0xd0
      [   30.627890]        ep_item_poll.isra.14+0x4e/0x90
      [   30.628531]        do_epoll_ctl+0xb7e/0x1120
      [   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
      [   30.629770]        do_syscall_64+0x2d/0x40
      [   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.631187]
      [   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
      [   30.631985]        check_prevs_add+0x226/0xb00
      [   30.632584]        __lock_acquire+0x1237/0x13a0
      [   30.633207]        lock_acquire+0x2c7/0x390
      [   30.633740]        __mutex_lock+0xae/0x9f0
      [   30.634258]        __ep_eventpoll_poll+0x9f/0x220
      [   30.634879]        __io_arm_poll_handler+0xbf/0x220
      [   30.635462]        io_issue_sqe+0xa6b/0x13e0
      [   30.635982]        __io_queue_sqe+0x10b/0x550
      [   30.636648]        io_queue_sqe+0x235/0x470
      [   30.637281]        io_submit_sqes+0xcce/0xf10
      [   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.638465]        do_syscall_64+0x2d/0x40
      [   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.639643]
      [   30.639643] other info that might help us debug this:
      [   30.639643]
      [   30.640618]  Possible unsafe locking scenario:
      [   30.640618]
      [   30.641402]        CPU0                    CPU1
      [   30.641938]        ----                    ----
      [   30.642664]   lock(&ctx->uring_lock);
      [   30.643425]                                lock(&ep->mtx);
      [   30.644498]                                lock(&ctx->uring_lock);
      [   30.645668]   lock(&ep->mtx);
      [   30.646321]
      [   30.646321]  *** DEADLOCK ***
      [   30.646321]
      [   30.647642] 1 lock held by a.out/1128:
      [   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.649954]
      [   30.649954] stack backtrace:
      [   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
      [   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   30.652290] Call Trace:
      [   30.652688]  dump_stack+0xac/0xe3
      [   30.653164]  check_noncircular+0x11e/0x130
      [   30.653747]  ? check_prevs_add+0x226/0xb00
      [   30.654303]  check_prevs_add+0x226/0xb00
      [   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
      [   30.655564]  __lock_acquire+0x1237/0x13a0
      [   30.656262]  lock_acquire+0x2c7/0x390
      [   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.658014]  __mutex_lock+0xae/0x9f0
      [   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.659112]  ? mark_held_locks+0x5a/0x80
      [   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   30.660885]  ? trace_hardirqs_on+0x46/0x110
      [   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.662696]  __ep_eventpoll_poll+0x9f/0x220
      [   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
      [   30.663875]  __io_arm_poll_handler+0xbf/0x220
      [   30.664463]  io_issue_sqe+0xa6b/0x13e0
      [   30.664984]  ? __lock_acquire+0x782/0x13a0
      [   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.666170]  ? __io_queue_sqe+0x10b/0x550
      [   30.666725]  __io_queue_sqe+0x10b/0x550
      [   30.667252]  ? __fget_files+0x131/0x260
      [   30.667791]  ? io_req_prep+0xd8/0x1090
      [   30.668316]  ? io_queue_sqe+0x235/0x470
      [   30.668868]  io_queue_sqe+0x235/0x470
      [   30.669398]  io_submit_sqes+0xcce/0xf10
      [   30.669931]  ? xa_load+0xe4/0x1c0
      [   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
      [   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
      [   30.672380]  do_syscall_64+0x2d/0x40
      [   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.673503] RIP: 0033:0x7fd89c813239
      [   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
      [   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
      [   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
      [   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
      [   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
      [   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
      
      This might happen if we do epoll_wait on a uring fd while reading/writing
      the former epoll fd in a sqe in the former uring instance.
      So let's don't flush cqring overflow list, just do a simple check.
      Reported-by: default avatarAbaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Signed-off-by: default avatarHao Xu <haoxu@linux.alibaba.com>
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ed670c3f
    • Pavel Begunkov's avatar
      io_uring: defer flushing cached reqs · e5d1bc0a
      Pavel Begunkov authored
      Awhile there are requests in the allocation cache -- use them, only if
      those ended go for the stashed memory in comp.free_list. As list
      manipulation are generally heavy and are not good for caches, flush them
      all or as much as can in one go.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      [axboe: return success/failure from io_flush_cached_reqs()]
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e5d1bc0a
    • Pavel Begunkov's avatar
      io_uring: take comp_state from ctx · c5eef2b9
      Pavel Begunkov authored
      __io_queue_sqe() is always called with a non-NULL comp_state, which is
      taken directly from context. Don't pass it around but infer from ctx.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c5eef2b9
    • Jens Axboe's avatar
      io_uring: enable req cache for task_work items · 65453d1e
      Jens Axboe authored
      task_work is run without utilizing the req alloc cache, so any deferred
      items don't get to take advantage of either the alloc or free side of it.
      With task_work now being wrapped by io_uring, we can use the ctx
      completion state to both use the req cache and the completion flush
      batching.
      
      With this, the only request type that cannot take advantage of the req
      cache is IRQ driven IO for regular files / block devices. Anything else,
      including IOPOLL polled IO to those same tyes, will take advantage of it.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      65453d1e
    • Jens Axboe's avatar
      io_uring: provide FIFO ordering for task_work · 7cbf1722
      Jens Axboe authored
      task_work is a LIFO list, due to how it's implemented as a lockless
      list. For long chains of task_work, this can be problematic as the
      first entry added is the last one processed. Similarly, we'd waste
      a lot of CPU cycles reversing this list.
      
      Wrap the task_work so we have a single task_work entry per task per
      ctx, and use that to run it in the right order.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7cbf1722
    • Jens Axboe's avatar
      io_uring: use persistent request cache · 1b4c351f
      Jens Axboe authored
      Now that we have the submit_state in the ring itself, we can have io_kiocb
      allocations that are persistent across invocations. This reduces the time
      spent doing slab allocations and frees.
      
      [sil: rebased]
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1b4c351f
    • Pavel Begunkov's avatar
      io_uring: feed reqs back into alloc cache · 6ff119a6
      Pavel Begunkov authored
      Make io_req_free_batch(), which is used for inline executed requests and
      IOPOLL, to return requests back into the allocation cache, so avoid
      most of kmalloc()/kfree() for those cases.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6ff119a6
    • Pavel Begunkov's avatar
      io_uring: persistent req cache · bf019da7
      Pavel Begunkov authored
      Don't free batch-allocated requests across syscalls.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bf019da7
    • Pavel Begunkov's avatar
      io_uring: count ctx refs separately from reqs · 9ae72463
      Pavel Begunkov authored
      Currently batch free handles request memory freeing and ctx ref putting
      together. Separate them and use different counters, that will be needed
      for reusing reqs memory.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9ae72463
    • Pavel Begunkov's avatar
      io_uring: remove fallback_req · 3893f39f
      Pavel Begunkov authored
      Remove fallback_req for now, it gets in the way of other changes.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3893f39f
    • Pavel Begunkov's avatar
      io_uring: submit-completion free batching · 905c172f
      Pavel Begunkov authored
      io_submit_flush_completions() does completion batching, but may also use
      free batching as iopoll does. The main beneficiaries should be buffered
      reads/writes and send/recv.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      905c172f
    • Pavel Begunkov's avatar
      io_uring: replace list with array for compl batch · 6dd0be1e
      Pavel Begunkov authored
      Reincarnation of an old patch that replaces a list in struct
      io_compl_batch with an array. It's needed to avoid hooking requests via
      their compl.list, because it won't be always available in the future.
      
      It's also nice to split io_submit_flush_completions() to avoid free
      under locks and remove unlock/lock with a long comment describing when
      it can be done.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6dd0be1e
    • Pavel Begunkov's avatar
      io_uring: don't reinit submit state every time · 5087275d
      Pavel Begunkov authored
      As now submit_state is retained across syscalls, we can save ourself
      from initialising it from ground up for each io_submit_sqes(). Set some
      fields during ctx allocation, and just keep them always consistent.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      [axboe: remove unnecessary zeroing of ctx members]
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5087275d
    • Pavel Begunkov's avatar
      io_uring: remove ctx from comp_state · ba88ff11
      Pavel Begunkov authored
      completion state is closely bound to ctx, we don't need to store ctx
      inside as we always have it around to pass to flush.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ba88ff11
    • Pavel Begunkov's avatar
      io_uring: don't keep submit_state on stack · 258b29a9
      Pavel Begunkov authored
      struct io_submit_state is quite big (168 bytes) and going to grow. It's
      better to not keep it on stack as it is now. Move it to context, it's
      always protected by uring_lock, so it's fine to have only one instance
      of it.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      258b29a9
    • Pavel Begunkov's avatar
      io_uring: don't propagate io_comp_state · 889fca73
      Pavel Begunkov authored
      There is no reason to drag io_comp_state into opcode handlers, we just
      need a flag and the actual work will be done in __io_queue_sqe().
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      889fca73
    • Pavel Begunkov's avatar
      io_uring: make op handlers always take issue flags · 61e98203
      Pavel Begunkov authored
      Make opcode handler interfaces a bit more consistent by always passing
      in issue flags. Bulky but pretty easy and mechanical change.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      61e98203
    • Pavel Begunkov's avatar
      io_uring: replace force_nonblock with flags · 45d189c6
      Pavel Begunkov authored
      Replace bool force_nonblock with flags. It has a long standing goal of
      differentiating context from which we execute. Currently we have some
      subtle places where some invariants, like holding of uring_lock, are
      subtly inferred.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      45d189c6
  2. 08 Feb, 2021 1 commit
  3. 05 Feb, 2021 3 commits
  4. 04 Feb, 2021 13 commits
  5. 01 Feb, 2021 4 commits