1. 23 Mar, 2020 4 commits
    • io-wq: handle hashed writes in chains · 86f3cd1b
      Pavel Begunkov authored
      We always punt async buffered writes to an io-wq helper, as the core
      kernel does not have IOCB_NOWAIT support for that. Most buffered async
      writes complete very quickly, as it's just a copy operation. This means
      that doing multiple locking roundtrips on the shared wqe lock for each
      buffered write is wasteful. Additionally, buffered writes are hashed
      work items, which means that any buffered write to a given file is
      serialized.
      
      Keep identically hashed work items contiguously in @wqe->work_list, and
      track a tail for each hash bucket. On dequeue of a hashed item, splice
      all of the same hash in one go using the tracked tail. Until the batch
      is done, the caller doesn't have to synchronize with the wqe or worker
      locks again.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
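      The batching described in "io-wq: handle hashed writes in chains" can be
      sketched in userspace as follows. This is a minimal illustration, not the
      kernel code: the names (struct work_queue, enqueue, dequeue_batch,
      NR_BUCKETS) are hypothetical, and it assumes a hashed head item may always
      be taken, which ignores any check for a hash that is already being worked
      on.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUCKETS 8      /* hypothetical number of hash buckets */
#define NOT_HASHED (-1)   /* hypothetical marker for unhashed work */

struct work {
    int id;
    int hash;             /* bucket index, or NOT_HASHED */
    struct work *next;
};

struct work_queue {
    struct work *head, *tail;
    struct work *hash_tail[NR_BUCKETS]; /* last queued item per bucket */
};

/* Queue an item, keeping items with the same hash next to each other. */
static void enqueue(struct work_queue *q, struct work *w)
{
    struct work *prev = NULL;

    assert(w->hash == NOT_HASHED || (w->hash >= 0 && w->hash < NR_BUCKETS));

    if (w->hash != NOT_HASHED) {
        prev = q->hash_tail[w->hash];
        q->hash_tail[w->hash] = w;
    }

    if (prev) {
        /* Insert right after the current tail of this hash run. */
        w->next = prev->next;
        prev->next = w;
        if (q->tail == prev)
            q->tail = w;
    } else {
        /* Unhashed item, or first item of its hash: append at the end. */
        w->next = NULL;
        if (q->tail)
            q->tail->next = w;
        else
            q->head = w;
        q->tail = w;
    }
}

/*
 * Detach the head item. If it is hashed, detach the whole run of items
 * with the same hash in one step, using the tracked tail. Returns a
 * NULL-terminated chain, or NULL if the queue is empty.
 */
static struct work *dequeue_batch(struct work_queue *q)
{
    struct work *first = q->head, *last;

    if (!first)
        return NULL;

    last = first;
    if (first->hash != NOT_HASHED) {
        last = q->hash_tail[first->hash];
        q->hash_tail[first->hash] = NULL;
    }

    q->head = last->next;
    if (!q->head)
        q->tail = NULL;
    last->next = NULL;
    return first;
}
```

      In a real queue, enqueue() and dequeue_batch() would run under the queue
      lock; the point of the sketch is that the returned chain can then be
      processed without taking that lock again per item.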
    • io-uring: drop 'free_pfile' in struct io_file_put · a5318d3c
      Hillf Danton authored
      Sync removal of a file is only used when a GFP_KERNEL kmalloc fails,
      at the cost of io_file_put::done and a work flush, while a glitch
      like that can be handled at the call site without too much pain.
      
      That said, what is proposed is to drop sync removal of the file, and
      the kink in the neck as well.
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-uring: drop completion when removing file · 4afdb733
      Hillf Danton authored
      A hung task was reported by syzbot:
      
      INFO: task syz-executor975:9880 blocked for more than 143 seconds.
            Not tainted 5.6.0-rc6-syzkaller #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor975 D27576  9880   9878 0x80004000
      Call Trace:
       schedule+0xd0/0x2a0 kernel/sched/core.c:4154
       schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
       do_wait_for_common kernel/sched/completion.c:83 [inline]
       __wait_for_common kernel/sched/completion.c:104 [inline]
       wait_for_common kernel/sched/completion.c:115 [inline]
       wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
       io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
       __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
       io_sqe_files_update fs/io_uring.c:5918 [inline]
       __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
       __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
       __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
       __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      and bisect pointed to 05f3fb3c ("io_uring: avoid ring quiesce for
      fixed file set unregister and update").
      
      It comes down to ordering: we wait for the work to be done before
      flushing it, while nobody is likely to wake us up.
      
      We can drop that on-stack completion, as flushing the work is itself the
      sync operation we need and nothing is left behind it.
      
      To that end, io_file_put::done is re-used to indicate whether it can be
      freed in the workqueue worker context.
      Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      
      Rename ->done to ->free_pfile
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix ->data corruption on re-enqueue · 18a542ff
      Pavel Begunkov authored
      work->data and work->list share a union. io_wq_assign_next() sets
      ->data if a req has a linked_timeout, but then io-wq may want to use
      work->list, e.g. to re-enqueue a request, thereby corrupting ->data.
      
      ->data is not necessary, just remove it and extract linked_timeout
      through @link_list.
      
      Fixes: 60cf46ae ("io-wq: hash dependent work")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
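      The corruption described in "Fix ->data corruption on re-enqueue" follows
      from how a union shares storage. The sketch below is a userspace
      illustration with hypothetical types (struct demo_work, struct list_node),
      not the io-wq definitions, and it assumes the usual case where all object
      pointers share one representation.

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
    struct list_node *next;
};

/* Hypothetical work item: ->list and ->data occupy the same bytes. */
struct demo_work {
    union {
        struct list_node list;  /* used while the work sits on a queue */
        void *data;             /* used to carry a pointer between stages */
    };
    unsigned int flags;
};

static_assert(sizeof(void *) == sizeof(struct list_node),
              "sketch assumes both union members have the same size");

/*
 * Store a pointer in ->data, then link the work into a list as a
 * re-enqueue would. Returns 1 if ->data still holds the stored
 * pointer afterwards, 0 if it was overwritten.
 */
static int data_survives_requeue(void)
{
    int linked_timeout;                  /* stands in for the linked request */
    struct list_node queue_tail = { .next = NULL };
    struct demo_work w = { .flags = 0 };

    w.data = &linked_timeout;
    w.list.next = &queue_tail;           /* re-enqueue writes the list member */

    return w.data == (void *)&linked_timeout;
}
```

      Calling data_survives_requeue() returns 0 here: the list link overwrote
      the stored pointer, which is why dropping ->data and deriving the value
      from elsewhere removes the hazard.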
  2. 22 Mar, 2020 1 commit
  3. 21 Mar, 2020 1 commit
  4. 20 Mar, 2020 1 commit
    • io_uring: honor original task RLIMIT_FSIZE · 4ed734b0
      Jens Axboe authored
      With the previous fixes for number of files open checking, I added some
      debug code to see if we had other spots where we're checking rlimit()
      against the async io-wq workers. The only one I found was file size
      checking, which we should also honor.
      
      During write and fallocate prep, store the max file size and override
      that for the current task if we're in io-wq worker context.
      
      Cc: stable@vger.kernel.org # 5.1+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
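      The idea in "honor original task RLIMIT_FSIZE" is to capture the
      submitter's limit at prep time so that a worker applies that value rather
      than its own. A rough userspace sketch follows; struct demo_req,
      demo_prep() and demo_write_allowed() are hypothetical names, and the
      kernel applies the stored limit differently from this explicit check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical request that remembers the submitter's file size limit. */
struct demo_req {
    rlim_t fsize;
};

/* Prep step, run in the submitting task: snapshot RLIMIT_FSIZE. */
static int demo_prep(struct demo_req *req)
{
    struct rlimit rl;

    assert(req != NULL);
    if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
        return -1;
    req->fsize = rl.rlim_cur;
    return 0;
}

/*
 * Execution step, possibly run by a different worker: check the write
 * against the limit stored in the request, not the worker's own limit.
 * end_offset is the file offset just past the last byte written.
 */
static int demo_write_allowed(const struct demo_req *req,
                              unsigned long long end_offset)
{
    if (req->fsize == RLIM_INFINITY)
        return 1;
    return end_offset <= (unsigned long long)req->fsize;
}
```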
  5. 14 Mar, 2020 3 commits
  6. 12 Mar, 2020 1 commit
  7. 11 Mar, 2020 3 commits
    • io_uring: fix truncated async read/readv and write/writev retry · 3f9d6441
      Jens Axboe authored
      Ensure we keep the truncated value, if we did truncate it. If not, we
      might read/write more than the registered buffer size.
      
      Also for retry, ensure that we return the truncated mapped value for
      the vectorized versions of the read/write commands.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: dual license io_uring.h uapi header · bbbdeb47
      Jens Axboe authored
      This just syncs the header with the liburing version, so there's no
      confusion on the license of the header parts.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · 32b2244a
      Xiaoguang Wang authored
      When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
      to poll for io completion events themselves; they can rely on io_sq_thread to
      do the polling work, which reduces cpu usage and uring_lock contention.
      
      I modified the fio io_uring engine code a bit to evaluate the performance:
      static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                              continue;
                      }
      
      -               if (!o->sqpoll_thread) {
      +               if (o->sqpoll_thread && o->hipri) {
                              r = io_uring_enter(ld, 0, actual_min,
                                                      IORING_ENTER_GETEVENTS);
                              if (r < 0) {
      
      and used "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
      -rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
      -size=10G -numjobs=1  -time_based -runtime=120"
      
      original code
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
      fio cpu usage |     100% |     100% |     100% |     100% |     100%
      --------------------------------------------------------------------
      
      with patch
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
      fio cpu usage |    63.8% |    74.4% |    81.1% |    83.7% |    82.4%
      --------------------------------------------------------------------
      bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
      --------------------------------------------------------------------
      
      From the test results above, we can see that bw improves by roughly
      5.5% to 13%, and the fio process's cpu usage also drops considerably. Note
      that this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
      are both enabled; in this case, io_sq_thread always has 100% cpu usage.
      I think this patch will be friendly to applications which often use
      io_uring_wait_cqe() or similar from liburing.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 10 Mar, 2020 8 commits
    • io_uring: Fix unused function warnings · 469956e8
      YueHaibing authored
      If CONFIG_NET is not set, gcc warns:
      
      fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
       static int io_setup_async_msg(struct io_kiocb *req,
                  ^~~~~~~~~~~~~~~~~~
      
      There are many functions wrapped by CONFIG_NET; move them
      together to simplify the code, which also fixes this warning.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      
      Minor tweaks.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add end-of-bits marker and build time verify it · 84557871
      Jens Axboe authored
      Not easy to tell if we're going over the size of bits we can shove
      in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
      check for it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
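      The end-of-bits marker pattern from "add end-of-bits marker and build time
      verify it" can be sketched in plain C11, with static_assert standing in
      for the kernel's BUILD_BUG_ON(). The flag names and struct demo_req below
      are hypothetical and are not the io_uring request flags.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical request with a fixed-width flags word. */
struct demo_req {
    unsigned int flags;
};

/* Bit positions; the last enumerator is a marker, not a usable flag. */
enum {
    DEMO_F_FIXED_FILE_BIT,
    DEMO_F_LINK_BIT,
    DEMO_F_BUFFER_SELECT_BIT,

    DEMO_F_LAST_BIT  /* must stay last: counts the bits in use */
};

/* Fails the build if a new bit no longer fits in demo_req.flags. */
static_assert(DEMO_F_LAST_BIT <=
              (int)(sizeof(((struct demo_req *)0)->flags) * CHAR_BIT),
              "too many flag bits for demo_req.flags");

/* Turn a bit position into its mask. */
static unsigned int demo_flag(unsigned int bit)
{
    assert(bit < (unsigned int)DEMO_F_LAST_BIT);
    return 1U << bit;
}
```

      Adding an enumerator before DEMO_F_LAST_BIT moves the marker
      automatically, so the check does not need to be updated by hand.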
    • io_uring: provide means of removing buffers · 067524e9
      Jens Axboe authored
      We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
      is to trigger IO on them. The usual case of shrinking a buffer pool
      would be to just not replenish the buffers when IO completes, and
      instead just free it. But it may be nice to have a way to manually
      remove a number of buffers from a given group, and
      IORING_OP_REMOVE_BUFFERS provides that functionality.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG · 52de1fe1
      Jens Axboe authored
      Like IORING_OP_READV, this is limited to supporting just a single
      segment in the iovec passed in.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • net: abstract out normal and compat msghdr import · 0a384abf
      Jens Axboe authored
      This splits it into two parts, one that imports the message, and one
      that imports the iovec. This allows a caller to only do the first part,
      and import the iovec manually afterwards.
      
      No functional changes in this patch.
      Acked-by: David Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV · 4d954c25
      Jens Axboe authored
      This adds support for the vectored read. This is limited to supporting
      just 1 segment in the iov, and is provided just for convenience for
      applications that use IORING_OP_READV already.
      
      The iov helpers will be used for IORING_OP_RECVMSG as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support buffer selection for OP_READ and OP_RECV · bcda7baa
      Jens Axboe authored
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of reads or writes pending. But that means they need buffers to
      back that data, and if the number of connections is high enough, having
      them preallocated for all possible connections is infeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
      
      Requests need to support this feature. For now, IORING_OP_READ and
      IORING_OP_RECV support it. This is checked on SQE submission; a CQE with
      res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
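      The cqe->flags encoding given in "support buffer selection for OP_READ and
      OP_RECV" can be decoded on the application side roughly as below. The two
      constants are defined locally so the sketch is self-contained; the values
      are what the io_uring uapi header uses to the best of my knowledge, so
      check them against your own header. encode_cqe_flags() and cqe_buffer_id()
      are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies for the sketch; normally these come from <linux/io_uring.h>. */
#define IORING_CQE_F_BUFFER     (1U << 0)
#define IORING_CQE_BUFFER_SHIFT 16

/* Build the flags word the same way the commit message describes it. */
static uint32_t encode_cqe_flags(uint16_t buffer_id)
{
    return ((uint32_t)buffer_id << IORING_CQE_BUFFER_SHIFT) |
           IORING_CQE_F_BUFFER;
}

/*
 * Extract the selected buffer ID from a completion's flags.
 * Returns 1 and stores the ID if a buffer was selected, 0 otherwise.
 */
static int cqe_buffer_id(uint32_t cqe_flags, uint16_t *buffer_id)
{
    assert(buffer_id != NULL);

    if (!(cqe_flags & IORING_CQE_F_BUFFER))
        return 0;
    *buffer_id = (uint16_t)(cqe_flags >> IORING_CQE_BUFFER_SHIFT);
    return 1;
}
```

      An application would typically call cqe_buffer_id() on each completion,
      use the returned ID to find its buffer, and provide the buffer again once
      it has consumed the data.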
    • io_uring: add IORING_OP_PROVIDE_BUFFERS · ddf0322d
      Jens Axboe authored
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and lookup the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer address,
      and length is the length of each buffer. A number of buffers to add can be
      specified, in which case addr is incremented by length for each addition,
      and each buffer increments the buffer ID specified.
      
      No validation is done of the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
      maximum ID that can be used.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
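      The address and ID arithmetic described in "add IORING_OP_PROVIDE_BUFFERS"
      can be sketched as a small helper that lays out the buffers one request
      would add. struct demo_buf and layout_buffers() are hypothetical; this
      only mirrors the arithmetic in the commit message, not the kernel's
      bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one provided buffer. */
struct demo_buf {
    uint64_t addr;  /* start address of this buffer */
    uint32_t len;   /* length of this buffer */
    uint16_t bid;   /* buffer ID reported back on completion */
};

/*
 * Fill out[] with nbufs buffers: the first starts at addr with ID
 * start_bid; each following one starts len bytes later and has the
 * next ID. Returns the number of entries written (at most cap).
 */
static size_t layout_buffers(uint64_t addr, uint32_t len, unsigned int nbufs,
                             uint16_t start_bid, struct demo_buf *out,
                             size_t cap)
{
    size_t i;

    assert(out != NULL);

    for (i = 0; i < nbufs && i < cap; i++) {
        out[i].addr = addr;
        out[i].len = len;
        out[i].bid = (uint16_t)(start_bid + i); /* IDs are 16 bits wide */
        addr += len;
    }
    return i;
}
```

      For example, a start address of 0x1000 with a length of 4096, three
      buffers and a starting ID of 10 yields buffers at 0x1000, 0x2000 and
      0x3000 with IDs 10, 11 and 12.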
  9. 04 Mar, 2020 8 commits
  10. 03 Mar, 2020 1 commit
  11. 02 Mar, 2020 9 commits
    • io_uring: Ensure mask is initialized in io_arm_poll_handler · 8755d97a
      Nathan Chancellor authored
      Clang warns:
      
      fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
      whenever 'if' condition is false [-Wsometimes-uninitialized]
              if (def->pollin)
                  ^~~~~~~~~~~
      fs/io_uring.c:4182:2: note: uninitialized use occurs here
              mask |= POLLERR | POLLPRI;
              ^~~~
      fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
      true
              if (def->pollin)
              ^~~~~~~~~~~~~~~~
      fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
      this warning
              __poll_t mask, ret;
                           ^
                            = 0
      1 warning generated.
      
      io_op_defs has many definitions where pollin is not set, so mask might
      indeed be uninitialized. Initialize it to zero and change the next
      assignment to |=, so that the assignment need not be revisited if
      further masks are added in the future.
      
      Fixes: d7718a9d ("io_uring: use poll driven retry for files that support it")
      Link: https://github.com/ClangBuiltLinux/linux/issues/916
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
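      The fix in "Ensure mask is initialized in io_arm_poll_handler" amounts to
      starting from zero and only ever OR-ing bits in. A userspace sketch of
      that shape follows; struct demo_op_def and demo_poll_mask() are
      hypothetical, and the exact event bits the kernel sets may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <poll.h>

/* Hypothetical per-opcode table entry. */
struct demo_op_def {
    unsigned int pollin : 1;   /* opcode can wait for readability */
    unsigned int pollout : 1;  /* opcode can wait for writability */
};

/* Build the poll event mask for an opcode. */
static int demo_poll_mask(const struct demo_op_def *def)
{
    int mask = 0;  /* defined even when neither branch below is taken */

    assert(def != NULL);

    if (def->pollin)
        mask |= POLLIN;
    if (def->pollout)
        mask |= POLLOUT;
    mask |= POLLERR | POLLPRI;
    return mask;
}
```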
    • io_uring: remove io_prep_next_work() · 3b17cf5a
      Pavel Begunkov authored
      io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so
      it's useless to set it for the next req of a link. Thus, remove it
      from io_prep_linked_timeout(), and inline the function.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove extra nxt check after punt · 4bc4494e
      Pavel Begunkov authored
      After __io_queue_sqe() ended up in io_queue_async_work(), it's already
      known that there is no @nxt req, so skip the check and return from the
      function.
      
      Also, @nxt initialisation now can be done just before
      io_put_req_find_next(), as there is no jumping until it's checked.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use poll driven retry for files that support it · d7718a9d
      Jens Axboe authored
      Currently io_uring tries any request in a non-blocking manner, if it can,
      and then retries from a worker thread if we get -EAGAIN. Now that we have
      a new and fancy poll based retry backend, use that to retry requests if
      the file supports it.
      
      This means that, for example, an IORING_OP_RECVMSG on a socket no longer
      requires an async thread to complete the IO. If we get -EAGAIN reading
      from the socket in a non-blocking manner, we arm a poll handler for
      notification on when the socket becomes readable. When it does, the
      pending read is executed directly by the task again, through the io_uring
      task work handlers. Not only is this faster and more efficient, it also
      means we're not generating potentially tons of async threads that just
      sit and block, waiting for the IO to complete.
      
      The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
      pollable IO is fast, and that poll<link>other_op is fast as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
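      The retry strategy in "use poll driven retry for files that support it"
      has a familiar userspace analogue: try the read without blocking, and on
      -EAGAIN wait for readiness and try again in the same thread instead of
      handing the read to a blocking helper. The sketch below shows only that
      analogue using read(2) and poll(2); the function names are hypothetical
      and nothing here is io_uring code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success. */
static int make_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try a non-blocking read first. If the descriptor is not ready, wait
 * for it to become readable and retry. Returns the read() result, or
 * -1 with errno set to ETIMEDOUT if nothing arrived within timeout_ms.
 */
static ssize_t read_with_poll_retry(int fd, void *buf, size_t len,
                                    int timeout_ms)
{
    assert(buf != NULL);

    for (;;) {
        ssize_t n = read(fd, buf, len);

        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;  /* data, end of file, or a real error */

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int r = poll(&pfd, 1, timeout_ms);

        if (r == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (r < 0 && errno != EINTR)
            return -1;
        /* Readable (or interrupted): loop and retry the read. */
    }
}
```

      The design point matches the commit message: no extra thread sits blocked
      on the descriptor; the waiting is expressed as a readiness notification
      followed by a retry.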
    • io_uring: mark requests that we can do poll async in io_op_defs · 8a72758c
      Jens Axboe authored
      Add a pollin/pollout field to the request table, and have commands that
      we can safely poll for properly marked.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add per-task callback handler · b41e9852
      Jens Axboe authored
      For poll requests, it's not uncommon to link a read (or write) after
      the poll to execute immediately after the file is marked as ready.
      Since the poll completion is called inside the waitqueue wake up handler,
      we have to punt that linked request to async context. This slows down
      the processing, and actually means it's faster to not use a link for this
      use case.
      
      We also run into problems if the completion_lock is contended, as we're
      doing a different lock ordering than the issue side is. Hence we have
      to do trylock for completion, and if that fails, go async. Poll removal
      needs to go async as well, for the same reason.
      
      eventfd notification needs special case as well, to avoid stack blowing
      recursion or deadlocks.
      
      These are all deficiencies that were inherited from the aio poll
      implementation, but I think we can do better. When a poll completes,
      simply queue it up in the task poll list. When the task completes the
      list, we can run dependent links inline as well. This means we never
      have to go async, and we can remove a bunch of code associated with
      that, and optimizations to try and make that run faster. The diffstat
      speaks for itself.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: store io_kiocb in wait->private · c2f2eb7d
      Jens Axboe authored
      Store the io_kiocb in the private field instead of the poll entry, this
      is in preparation for allowing multiple waitqueues.
      
      No functional changes in this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • task_work_run: don't take ->pi_lock unconditionally · 6fb61492
      Oleg Nesterov authored
      As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
      if task->task_works == NULL && !PF_EXITING.
      
      And in fact the only reason why task_work_run() needs ->pi_lock is
      the possible race with task_work_cancel(), we can optimize this code
      and make the locking more clear.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: use BIT for ulong hash · 3684f246
      Pavel Begunkov authored
      @hash_map is unsigned long, but BIT_ULL() is used for manipulations.
      BIT() is a better match as it returns exactly unsigned long value.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
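      The type argument in "use BIT for ulong hash" can be shown with local
      stand-ins for the two macros. DEMO_BIT and DEMO_BIT_ULL below are
      hypothetical userspace equivalents written for this sketch, and the
      hash_* helpers are illustrative names only.

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-ins for the kernel's BIT() and BIT_ULL(). */
#define DEMO_BIT(nr)     (1UL << (nr))
#define DEMO_BIT_ULL(nr) (1ULL << (nr))

/* DEMO_BIT() yields exactly the type of an unsigned long bitmap. */
static_assert(sizeof(DEMO_BIT(0)) == sizeof(unsigned long),
              "DEMO_BIT produces an unsigned long");
static_assert(sizeof(DEMO_BIT_ULL(0)) == sizeof(unsigned long long),
              "DEMO_BIT_ULL produces an unsigned long long");

/* Mark a hash bucket as busy in an unsigned long bitmap. */
static unsigned long hash_mark_busy(unsigned long hash_map, unsigned int hash)
{
    assert(hash < sizeof(unsigned long) * CHAR_BIT);
    return hash_map | DEMO_BIT(hash);
}

/* Clear a hash bucket's busy bit. */
static unsigned long hash_clear_busy(unsigned long hash_map, unsigned int hash)
{
    assert(hash < sizeof(unsigned long) * CHAR_BIT);
    return hash_map & ~DEMO_BIT(hash);
}

/* Test whether a hash bucket is busy. */
static int hash_is_busy(unsigned long hash_map, unsigned int hash)
{
    assert(hash < sizeof(unsigned long) * CHAR_BIT);
    return (hash_map & DEMO_BIT(hash)) != 0;
}
```

      Where unsigned long is narrower than unsigned long long, a 64-bit mask
      would be silently truncated when stored into the map, which is one reason
      to keep the macro and the map type matched.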