1. 24 Jul, 2020 7 commits
    • io_uring: extract io_sendmsg_copy_hdr() · 2ae523ed
      Pavel Begunkov authored
      Don't repeat the send msg initialisation code; it's error-prone.
      Extract and use a helper function.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use more specific type in rcv/snd msg cp · 1400e697
      Pavel Begunkov authored
      send/recv msghdr initialisation works with struct io_async_msghdr, but
      pulls in the whole struct io_async_ctx for no reason. That complicates it
      with composite accessing, e.g. io->msg.

      Use and pass the most specific type, which is struct io_async_msghdr.
      It is the largest field in union io_async_ctx, so this doesn't save stack
      space, but it looks clearer.
      Most of the changes replace "io->msg." with "iomsg->".
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: rename sr->msg into umsg · 270a5940
      Pavel Begunkov authored
      Every second field in send/recv is called msg. Make it a bit more
      understandable by renaming ->msg, which is a user-provided pointer,
      to ->umsg.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix sq array offset calculation · b36200f5
      Dmitry Vyukov authored
      rings_size() sets sq_offset to the total size of the rings (the returned
      value which is used for memory allocation). This is wrong: sq array should
      be located within the rings, not after them. Set sq_offset to where it
      should be.
      
      Fixes: 75b28aff ("io_uring: allocate the two rings together")
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hristo Venev <hristo@venev.name>
      Cc: io-uring@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
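
The offset fix described in the entry above can be sketched in userspace C. This is only an illustration, not the kernel's rings_size(): the sk_ names and the header, entry and cache-line sizes are assumptions made up for the example, and overflow checks are left out. The point it shows is that the SQ array offset has to be recorded before the array size is added to the running total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for illustration; not the kernel's values. */
#define SK_HDR_SIZE   64u  /* stand-in for the fixed part of the rings struct */
#define SK_CQE_SIZE   16u  /* stand-in for the size of one completion entry */
#define SK_CACHE_LINE 64u  /* stand-in for the cache-line alignment */

/* Round v up to a multiple of a; a must be a power of two. */
static size_t sk_align_up(size_t v, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (v + a - 1) & ~(a - 1);
}

/*
 * Return the total allocation size and store in *sq_offset the offset at
 * which the SQ index array starts. Overflow checks are omitted for brevity.
 */
size_t sk_rings_size(unsigned sq_entries, unsigned cq_entries, size_t *sq_offset)
{
	size_t off = SK_HDR_SIZE + (size_t)cq_entries * SK_CQE_SIZE;

	off = sk_align_up(off, SK_CACHE_LINE);
	if (sq_offset)
		*sq_offset = off;	/* record the offset before adding the array */

	off += (size_t)sq_entries * sizeof(uint32_t);
	return off;
}
```

With these made-up sizes, the recorded offset plus the array length never exceeds the returned total, which is the property the buggy version violated.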
    • Merge branch 'io_uring-5.8' into for-5.9/io_uring · 760618f7
      Jens Axboe authored
      Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked
      mem accounting require a fixup, and only one of the spots is noticed
      by git, as the other merges cleanly. The flags fix from io_uring-5.8
      also causes a merge conflict, as do the leak fix for recvmsg, the double
      poll fix, and the link failure locking fix.
      
      * io_uring-5.8:
        io_uring: fix lockup in io_fail_links()
        io_uring: fix ->work corruption with poll_add
        io_uring: missed req_init_async() for IOSQE_ASYNC
        io_uring: always allow drain/link/hardlink/async sqe flags
        io_uring: ensure double poll additions work with both request types
        io_uring: fix recvmsg memory leak with buffer selection
        io_uring: fix not initialised work->flags
        io_uring: fix missing msg_name assignment
        io_uring: account user memory freed when exit has been queued
        io_uring: fix memleak in io_sqe_files_register()
        io_uring: fix memleak in __io_sqe_files_update()
        io_uring: export cq overflow status to userspace
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix lockup in io_fail_links() · 4ae6dbd6
      Pavel Begunkov authored
      io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to a nested
      spin_lock(completion_lock) and a lockup.
      
      [  197.680409] rcu: INFO: rcu_preempt detected expedited stalls on
      	CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
      [  197.680411] rcu: blocking rcu_node structures:
      [  197.680412] Task dump for CPU 6:
      [  197.680413] link-timeout    R  running task        0  1669
      	1 0x8000008a
      [  197.680414] Call Trace:
      [  197.680420]  ? io_req_find_next+0xa0/0x200
      [  197.680422]  ? io_put_req_find_next+0x2a/0x50
      [  197.680423]  ? io_poll_task_func+0xcf/0x140
      [  197.680425]  ? task_work_run+0x67/0xa0
      [  197.680426]  ? do_exit+0x35d/0xb70
      [  197.680429]  ? syscall_trace_enter+0x187/0x2c0
      [  197.680430]  ? do_group_exit+0x43/0xa0
      [  197.680448]  ? __x64_sys_exit_group+0x18/0x20
      [  197.680450]  ? do_syscall_64+0x52/0xa0
      [  197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix ->work corruption with poll_add · d5e16d8e
      Pavel Begunkov authored
      req->work might already be initialised by the time it gets into
      __io_arm_poll_handler(), which will corrupt it by using fields that are
      in a union with req->work. Luckily, the only side effect is a missing
      put_creds(). Clean req->work before going there.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
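
The union hazard described in the entry above can be shown with a small standalone sketch. The struct layout and the sk_ names here are hypothetical and far simpler than the real request structure. They only show why the work side must be cleaned (and its creds handed back for release) before the poll side starts writing to the shared storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the two users of the same storage. */
struct sk_work { const void *creds; unsigned flags; };
struct sk_poll { void *head; unsigned events; };

struct sk_req {
	union {
		struct sk_work work;	/* valid while the request is async work */
		struct sk_poll poll;	/* valid once polling has been armed */
	};
};

/* Zero ->work and return the creds pointer so the caller can release it. */
const void *sk_clean_work(struct sk_req *req)
{
	const void *creds = req->work.creds;

	memset(&req->work, 0, sizeof(req->work));
	return creds;
}

/* Arm polling: from here on the shared storage belongs to ->poll. */
void sk_arm_poll(struct sk_req *req, void *head, unsigned events)
{
	assert(head != NULL);
	req->poll.head = head;
	req->poll.events = events;
}
```

If sk_arm_poll() ran without sk_clean_work() first, the creds pointer would simply be overwritten and could never be released, which matches the "missing put_creds()" side effect the commit mentions.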
  2. 23 Jul, 2020 1 commit
  3. 18 Jul, 2020 2 commits
    • io_uring: always allow drain/link/hardlink/async sqe flags · 61710e43
      Daniele Albano authored
      We currently filter these for timeout_remove/async_cancel/files_update,
      but we should only be filtering for fixed file and buffer select. This
      also causes a second read of sqe->flags, which isn't needed.
      
      Just check req->flags for the relevant bits. This then allows these
      commands to be used in links, for example, like everything else.
      Signed-off-by: Daniele Albano <d.albano@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
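
A minimal sketch of the narrower check described in the entry above, assuming made-up SK_ flag names and bit positions that are local to this example. It rejects only the two flags the commit says should be filtered (fixed file and buffer select) and lets drain, link, hardlink and async through.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; names and values are local to this sketch. */
#define SK_FIXED_FILE    (1u << 0)
#define SK_IO_DRAIN      (1u << 1)
#define SK_IO_LINK       (1u << 2)
#define SK_IO_HARDLINK   (1u << 3)
#define SK_ASYNC         (1u << 4)
#define SK_BUFFER_SELECT (1u << 5)
#define SK_ALL_FLAGS     ((1u << 6) - 1)

/*
 * Return true if a command that takes neither a file nor a buffer may be
 * issued with these flags. Only the two irrelevant flags are rejected.
 */
bool sk_flags_allowed(unsigned req_flags)
{
	assert((req_flags & ~SK_ALL_FLAGS) == 0);
	return (req_flags & (SK_FIXED_FILE | SK_BUFFER_SELECT)) == 0;
}
```

Because the check reads a single already-parsed flags word, it also avoids the second read of the submission entry that the commit calls unnecessary.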
    • io_uring: ensure double poll additions work with both request types · 807abcb0
      Jens Axboe authored
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered when we use poll-triggered retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 15 Jul, 2020 1 commit
  5. 12 Jul, 2020 2 commits
  6. 10 Jul, 2020 2 commits
    • io_uring: account user memory freed when exit has been queued · 309fc03a
      Jens Axboe authored
      We currently account the memory after the exit work has been run, but
      that leaves a gap between a process closing its ring and the memory
      being accounted as freed. If the memlocked ulimit is borderline, then
      that can introduce spurious setup errors returning -ENOMEM because the
      free work hasn't been run yet.
      
      Account this as freed when we close the ring, so as not to expose a tiny
      gap where setting up a new ring can fail.
      
      Fixes: 85faa7b8 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix memleak in io_sqe_files_register() · 667e57da
      Yang Yingliang authored
      I got a memleak report while fuzz testing:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on the error path to avoid
      leaking the refcount.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
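
The error-path rule behind the fix above (undo an earlier initialisation when a later step fails) can be sketched with ordinary heap allocations. Everything here is hypothetical: the sk_ functions are not kernel APIs, and the first allocation merely stands in for the state that percpu_ref_init() sets up.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int sk_live;	/* number of allocations not yet freed */

static void *sk_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		sk_live++;
	return p;
}

static void sk_free(void *p)
{
	if (p) {
		free(p);
		sk_live--;
	}
}

struct sk_files { void *ref; void *table; };

int sk_live_allocs(void)
{
	return sk_live;
}

/* Two-step setup; fail_second_step forces the later step to fail. */
int sk_files_register(struct sk_files *f, int fail_second_step)
{
	f->ref = sk_alloc(8);		/* stands in for the refcount init */
	if (!f->ref)
		return -ENOMEM;

	f->table = fail_second_step ? NULL : sk_alloc(64);
	if (!f->table) {
		sk_free(f->ref);	/* the step the fix adds: undo the init */
		f->ref = NULL;
		return -ENOMEM;
	}
	return 0;
}

void sk_files_unregister(struct sk_files *f)
{
	assert(f->ref && f->table);
	sk_free(f->table);
	sk_free(f->ref);
	f->table = f->ref = NULL;
}
```

A failed registration should leave sk_live_allocs() at the value it had before the call; without the sk_free() on the error path it would stay one higher, which is the kind of leak the report shows.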
  7. 09 Jul, 2020 4 commits
    • io_uring: remove dead 'ctx' argument and move forward declaration · 4349f30e
      Jens Axboe authored
      We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
      on the mm of the current task. Drop the argument.
      
      Move io_file_put_work() to where we have the other forward declarations
      of functions.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: get rid of __req_need_defer() · 2bc9930e
      Jens Axboe authored
      We have just one caller of this, req_need_defer(), so inline the
      code in there instead.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix memleak in __io_sqe_files_update() · f3bd9dae
      Yang Yingliang authored
      I got a memleak report while fuzz testing:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() fails, we need to put the file obtained by
      fget() to avoid the memleak.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: export cq overflow status to userspace · 6d5f9049
      Xiaoguang Wang authored
      Applications that do not want to call io_uring_enter() to reap and
      handle cqes may rely entirely on liburing's io_uring_peek_cqe(). But if
      the cq ring has overflowed, io_uring_peek_cqe() is currently not aware
      of the overflow and won't enter the kernel to flush cqes. The test
      program below reveals this bug:
      
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <liburing.h>

      /* Submit requests until the kernel refuses with -EBUSY, then reap them. */
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              /* Reap using peek only; this never enters the kernel by itself. */
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      /* liburing returns -errno, so negate it */
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(-ret));
                                      break;
                              }
                              printf("left requests: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requests: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export cq overflow status to userspace by adding a new
      IORING_SQ_CQ_OVERFLOW flag, so that helper functions in liburing, such as
      io_uring_peek_cqe(), can be aware of the cq overflow and flush accordingly.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
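
How a userspace helper might use the exported flag can be sketched as follows. This is not liburing's implementation: struct sk_ring_view, sk_cq_needs_flush() and the SK_SQ_CQ_OVERFLOW bit position are assumptions made for the example. The sketch only captures the decision "the ring looks empty, but the kernel reports an overflow, so enter the kernel to flush".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_SQ_CQ_OVERFLOW (1u << 1)	/* illustrative bit position */

/* Hypothetical snapshot of the ring fields a peek helper would look at. */
struct sk_ring_view {
	unsigned cq_head;	/* next completion the application will read */
	unsigned cq_tail;	/* next completion the kernel will write */
	unsigned sq_flags;	/* flags word shared by the kernel */
};

/*
 * Return true when peeking found nothing in the ring but completions are
 * parked in the kernel's overflow list and need a flush.
 */
bool sk_cq_needs_flush(const struct sk_ring_view *r)
{
	assert(r != NULL);
	if (r->cq_head != r->cq_tail)
		return false;	/* completions are visible; consume those first */
	return (r->sq_flags & SK_SQ_CQ_OVERFLOW) != 0;
}
```

Without such a check, a peek-only loop like the one in the test program would spin on an empty ring forever while completions sit in the overflow list.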
  8. 08 Jul, 2020 2 commits
  9. 07 Jul, 2020 3 commits
  10. 06 Jul, 2020 3 commits
  11. 05 Jul, 2020 6 commits
  12. 04 Jul, 2020 1 commit
    • io_uring: fix regression with always ignoring signals in io_cqring_wait() · b7db41c9
      Jens Axboe authored
      When switching to TWA_SIGNAL for task_work notifications, we also made
      any signal based condition in io_cqring_wait() return -ERESTARTSYS.
      This breaks applications that rely on using signals to abort someone
      waiting for events.
      
      Check if we have a signal pending because of queued task_work, and
      repeat the signal check once we've run the task_work. This provides a
      reliable way of telling the two apart.
      
      Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
      we don't have the dependency situation described in the original commit,
      and we can get by with just using TWA_RESUME like we previously did.
      
      Fixes: ce593a6c ("io_uring: use signal based task_work running")
      Cc: stable@vger.kernel.org # v5.7
      Reported-by: Andres Freund <andres@anarazel.de>
      Tested-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
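
The two ideas in the entry above can be modelled in a few lines of standalone C. This is a toy model under stated assumptions, not kernel code: struct sk_task and the sk_ functions are invented for the example, and "running the work" is reduced to bumping a counter. It shows picking the notification mode by whether an eventfd is registered, and telling a real signal apart from a wakeup that only announces queued work by rechecking after the work has run.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_notify { SK_NOTIFY_RESUME = 1, SK_NOTIFY_SIGNAL = 2 };

/* Hypothetical model of the waiting task's state. */
struct sk_task {
	bool real_signal;	/* a genuine signal is pending */
	bool work_notify;	/* wakeup raised only to announce queued work */
	int  work_run;		/* how many times queued work has been run */
};

/* Use the signal-style notification only when an eventfd is registered. */
enum sk_notify sk_pick_notify(bool has_eventfd)
{
	return has_eventfd ? SK_NOTIFY_SIGNAL : SK_NOTIFY_RESUME;
}

static bool sk_signal_pending(const struct sk_task *t)
{
	return t->real_signal || t->work_notify;
}

/* Return 0 to keep waiting, or -EINTR if a real signal should abort the wait. */
int sk_wait_check(struct sk_task *t)
{
	assert(t != NULL);
	if (!sk_signal_pending(t))
		return 0;
	if (t->work_notify) {
		t->work_run++;		/* run the queued work */
		t->work_notify = false;	/* the notification is now consumed */
	}
	/* Repeat the check: whatever is still pending is a real signal. */
	return sk_signal_pending(t) ? -EINTR : 0;
}
```

In this model a wait interrupted only by queued work resumes, while one with a genuine signal pending still aborts, which is the behaviour applications relying on signals expect.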
  13. 30 Jun, 2020 6 commits
    • io_uring: use signal based task_work running · ce593a6c
      Jens Axboe authored
      Since 5.7, we've been using task_work to trigger async running of
      requests in the context of the original task. This generally works
      great, but there's a case where if the task is currently blocked
      in the kernel waiting on a condition to become true, it won't process
      task_work. Even though the task is woken, it just checks whatever
      condition it's waiting on, and goes back to sleep if it's still false.
      
      This is a problem if that very condition only becomes true when that
      task_work is run. An example of that is the task registering an eventfd
      with io_uring, and it's now blocked waiting on an eventfd read. That
      read could depend on a completion event, and that completion event
      won't get triggered until task_work has been run.
      
      Use the TWA_SIGNAL notification for task_work, so that we ensure that
      the task always runs the work when queued.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • task_work: teach task_work_add() to do signal_wake_up() · e91b4816
      Oleg Nesterov authored
      So that the target task will exit the wait_event_interruptible-like
      loop and call task_work_run() asap.
      
      The patch turns "bool notify" into a 0,TWA_RESUME,TWA_SIGNAL enum; the
      new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to
      avoid the race with recalc_sigpending(), so the patch also adds the
      new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.
      
      TODO: once this patch is merged we need to change all current users
      of task_work_add(notify = true) to use TWA_RESUME.
      
      Cc: stable@vger.kernel.org # v5.7
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix missing ->mm on exit · 8eb06d7e
      Pavel Begunkov authored
      There is a fancy bug where an exiting user task may not have ->mm,
      which makes task_work try to do kthread_use_mm(ctx->sqo_mm).
      
      Don't do that if sqo_mm is NULL.
      
      [  290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238
      	kthread_use_mm+0xf3/0x110
      [  290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G
      	I E     5.8.0-rc2-00066-g9b21720607cf #531
      [  290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110
      ...
      [  290.460584] Call Trace:
      [  290.460584]  __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30
      [  290.460584]  __io_req_task_submit+0x64/0x80
      [  290.460584]  io_req_task_submit+0x15/0x20
      [  290.460585]  task_work_run+0x67/0xa0
      [  290.460585]  do_exit+0x35d/0xb70
      [  290.460585]  do_group_exit+0x43/0xa0
      [  290.460585]  get_signal+0x140/0x900
      [  290.460586]  do_signal+0x37/0x780
      [  290.460586]  __prepare_exit_to_usermode+0x126/0x1c0
      [  290.460586]  __syscall_return_slowpath+0x3b/0x1c0
      [  290.460587]  do_syscall_64+0x5f/0xa0
      [  290.460587]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      followed by faults.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: optimise io_req_find_next() fast check · 3fa5e0f3
      Pavel Begunkov authored
      gcc 9.2.0 compiles io_req_find_next() as a separate function, leaving
      the first REQ_F_LINK_HEAD fast check not inlined. Help it by splitting
      out the check from the function.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
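
The split described in the entry above follows a common pattern: keep the cheap flag test in a small inline wrapper and leave the rest in an out-of-line function. A minimal sketch with made-up names (struct sk_req, SK_LINK_HEAD, sk_find_next) that do not mirror the real request structure:

```c
#include <assert.h>
#include <stddef.h>

#define SK_LINK_HEAD (1u << 0)	/* illustrative flag bit */

/* Hypothetical, simplified request with a single linked successor. */
struct sk_req {
	unsigned flags;
	struct sk_req *next;
};

/* Slow path: detach and return the linked request. Kept out of line. */
struct sk_req *sk_find_next_slow(struct sk_req *req)
{
	struct sk_req *nxt;

	assert(req->flags & SK_LINK_HEAD);
	nxt = req->next;
	req->next = NULL;
	req->flags &= ~SK_LINK_HEAD;
	return nxt;
}

/* Fast path: a single flag test the compiler can inline at every call site. */
static inline struct sk_req *sk_find_next(struct sk_req *req)
{
	if (!(req->flags & SK_LINK_HEAD))
		return NULL;
	return sk_find_next_slow(req);
}
```

Callers whose request heads no link pay only for the flag test and never take the function call.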
    • io_uring: simplify io_async_task_func() · 0be0b0e3
      Pavel Begunkov authored
      Greatly simplify io_async_task_func() by removing functionality
      duplicated from __io_req_task_submit(). This does one extra spin
      lock/unlock for the cancelled poll case, but that shouldn't happen often.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix NULL mm in io_poll_task_func() · ea1164e5
      Pavel Begunkov authored
      io_poll_task_func() hand-codes link submission, forgetting to set
      TASK_RUNNING, acquire mm, etc. Call the existing helper for that,
      i.e. __io_req_task_submit().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>