1. 11 Nov, 2019 7 commits
    • io_uring: provide fallback request for OOM situations · 0ddf92e8
      Jens Axboe authored
      One thing that really sucks for userspace APIs is if the kernel passes
      back -ENOMEM/-EAGAIN for resource shortages. The application really has
      no idea of what to do in those cases. Should it try and reap
      completions? Probably a good idea. Will it solve the issue? Who knows.
      
      This patch adds a simple fallback mechanism for when we fail to
      allocate memory for a request. If the slab allocation fails, we punt
      to a pre-allocated request. There's just one of these per io_ring_ctx,
      but the important part is that if we ever return -EBUSY to the
      application, the application knows that it can wait for events and
      make forward progress once events have completed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
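The fallback idea described above can be sketched in userspace C. This is a minimal single-threaded model, not the kernel code: `struct req`, `struct ring_ctx`, `get_req`, `put_req`, and the `simulate_oom` test hook are all hypothetical names, and a real implementation would need an atomic claim on the fallback slot.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct req {
    bool is_fallback;
};

struct ring_ctx {
    struct req fallback;     /* the single pre-allocated request */
    bool fallback_in_use;
    bool simulate_oom;       /* test hook: pretend the allocator failed */
};

/* Try the normal allocator first, then the reserved request. */
static struct req *get_req(struct ring_ctx *ctx, int *err)
{
    struct req *r;

    assert(ctx != NULL && err != NULL);
    *err = 0;
    r = ctx->simulate_oom ? NULL : calloc(1, sizeof(*r));
    if (r)
        return r;
    if (!ctx->fallback_in_use) {
        ctx->fallback_in_use = true;
        ctx->fallback.is_fallback = true;
        return &ctx->fallback;
    }
    /* Fallback is busy: the caller should reap completions and retry. */
    *err = -EBUSY;
    return NULL;
}

/* Release a request, returning the fallback slot if that is what it was. */
static void put_req(struct ring_ctx *ctx, struct req *r)
{
    if (r == &ctx->fallback)
        ctx->fallback_in_use = false;
    else
        free(r);
}
```

The point of the design is that -EBUSY is only returned while the fallback request is in flight, so waiting for a completion is guaranteed to free it again.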
    • io_uring: convert accept4() -ERESTARTSYS into -EINTR · 8e3cca12
      Jens Axboe authored
      If we cancel a pending accept operation with a signal, we get
      -ERESTARTSYS returned. Turn that into -EINTR for userspace; we should
      not be returning -ERESTARTSYS.

      Fixes: 17f2fe35 ("io_uring: add support for IORING_OP_ACCEPT")
      Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix error clear of ->file_table in io_sqe_files_register() · 46568e9b
      Jens Axboe authored
      syzbot reports that when using failslab and friends, we can get a double
      free in io_sqe_files_unregister():
      
      BUG: KASAN: double-free or invalid-free in
      io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
      
      CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x197/0x210 lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
        kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
        __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
        kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
        __cache_free mm/slab.c:3426 [inline]
        kfree+0x10a/0x2c0 mm/slab.c:3757
        io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
        io_ring_ctx_free fs/io_uring.c:3998 [inline]
        io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
        io_uring_release+0x42/0x50 fs/io_uring.c:4068
        __fput+0x2ff/0x890 fs/file_table.c:280
        ____fput+0x16/0x20 fs/file_table.c:313
        task_work_run+0x145/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x904/0x2e60 kernel/exit.c:817
        do_group_exit+0x135/0x360 kernel/exit.c:921
        __do_sys_exit_group kernel/exit.c:932 [inline]
        __se_sys_exit_group kernel/exit.c:930 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
        do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x43f2c8
      Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
      6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
      00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
      RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
      RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
      RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
      R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
      
      This happens if we fail allocating the file tables. For that case we do
      free the file table correctly, but we forget to set it to NULL. This
      means that ring teardown will see it as being non-NULL, and attempt to
      free it again.
      
      Fix this by clearing the file_table pointer if we free the table.
      
      Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
      Fixes: 65e19f54 ("io_uring: support for larger fixed file sets")
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
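The fix described above follows a common pattern: clear a pointer whenever the error path frees it, so that teardown does not free it a second time. A minimal userspace sketch, assuming hypothetical names (`struct ctx`, `files_register`, `files_unregister`, and a `fail_second_stage` switch that stands in for failslab):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified context; the real io_ring_ctx differs. */
struct ctx {
    void **file_table;
    unsigned nr_tables;
};

/* Register tables; fail_second_stage simulates a later allocation failing. */
static int files_register(struct ctx *ctx, unsigned nr_tables,
                          int fail_second_stage)
{
    assert(ctx != NULL && nr_tables > 0);
    ctx->file_table = calloc(nr_tables, sizeof(*ctx->file_table));
    if (!ctx->file_table)
        return -ENOMEM;
    if (fail_second_stage) {
        free(ctx->file_table);
        ctx->file_table = NULL;  /* the fix: do not leave a stale pointer */
        return -ENOMEM;
    }
    ctx->nr_tables = nr_tables;
    return 0;
}

/* Teardown: safe after a failed register because free(NULL) is a no-op. */
static void files_unregister(struct ctx *ctx)
{
    free(ctx->file_table);
    ctx->file_table = NULL;
    ctx->nr_tables = 0;
}
```

Without the assignment to NULL in the error path, `files_unregister` would pass the already freed pointer to `free` again, which is the double free KASAN reported.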
    • io_uring: separate the io_free_req and io_free_req_find_next interface · c69f8dbe
      Jackie Liu authored
      Split io_free_req the same way io_put_req and io_put_req_find_next are
      already split. No functional changes.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: keep io_put_req only responsible for release and put req · ec9c02ad
      Jackie Liu authored
      We already have io_put_req_find_next to find the next req of the link,
      so io_put_req should not be used to find it as well. The two should be
      functions of the same level.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove passed in 'ctx' function parameter ctx if possible · a197f664
      Jackie Liu authored
      In many functions the main object is req, and req->ctx has already
      been set at initialization time, so there is no need to pass in the
      ctx from the caller.

      Cleanup, no functional change.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: reduce/pack size of io_ring_ctx · 206aefde
      Jens Axboe authored
      With the recent flurry of additions and changes to io_uring, the
      layout of io_ring_ctx has become a bit stale. We're right now at
      704 bytes in size on my x86-64 build, or 11 cachelines. This
      patch does two things:
      
      - We have two completion structs embedded, which we only use for
        quiesce of the ctx (or shutdown) and for sqthread init cases.
        That's 2x32 bytes right there; let's dynamically allocate them.
      
      - Reorder the struct a bit with an eye on cachelines, use cases,
        and holes.
      
      With this patch, we're down to 512 bytes, or 8 cachelines.
      Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
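The cacheline arithmetic and the effect of reordering members can be illustrated with a small sketch. It assumes 64-byte cachelines (as on x86-64); the two structs are made-up examples with the same members in a different order, not the layout of io_ring_ctx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64u  /* assumption: 64-byte cachelines */

/* Number of cachelines a structure of the given size spans. */
static size_t cachelines(size_t bytes)
{
    return (bytes + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}

/* Small and large members interleaved: the compiler inserts padding holes. */
struct ctx_loose {
    uint8_t a;
    uint64_t b;
    uint8_t c;
    uint64_t d;
    uint8_t e;
};

/* Same members, largest alignment first: the holes collapse. */
struct ctx_reordered {
    uint64_t b;
    uint64_t d;
    uint8_t a;
    uint8_t c;
    uint8_t e;
};

/* Bytes saved purely by reordering; the exact value is ABI dependent. */
static size_t bytes_saved(void)
{
    assert(sizeof(struct ctx_reordered) <= sizeof(struct ctx_loose));
    return sizeof(struct ctx_loose) - sizeof(struct ctx_reordered);
}
```

With these definitions, `cachelines(704)` and `cachelines(512)` reproduce the 11 and 8 cachelines quoted in the commit message. On the kernel side, a tool such as pahole is the usual way to inspect holes in a real struct.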
  2. 07 Nov, 2019 2 commits
    • io_uring: properly mark async work as bounded vs unbounded · 5f8fd2d3
      Jens Axboe authored
      Now that io-wq supports separating the two request lifetime types, mark
      the following IO as having unbounded runtimes:
      
      - Any read/write to a non-regular file
      - Any specific networked IO
      - Any poll command
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: add support for bounded vs unbounded work · c5def4ab
      Jens Axboe authored
      io_uring supports request types that basically have two different
      lifetimes:
      
      1) Bounded completion time. These are requests like disk reads or writes,
         which we know will finish in a finite amount of time.
      2) Unbounded completion time. These are generally networked IO, where we
         have no idea how long they will take to complete. Another example is
         POLL commands.
      
      This patch provides support for io-wq to handle these differently, so we
      don't starve bounded requests by tying up workers for too long. By default
      all work is bounded, unless otherwise specified in the work item.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
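One way to picture the separation described above is to keep a worker budget per lifetime class, so that long-running unbounded work can never consume the slots reserved for bounded work. The sketch below is an assumption about the general shape only: `WORK_UNBOUND`, `struct acct`, `struct pool`, and the helper names are hypothetical and do not claim to match io-wq.

```c
#include <assert.h>
#include <stdbool.h>

enum { ACCT_BOUND = 0, ACCT_UNBOUND = 1, ACCT_NR };

#define WORK_UNBOUND 0x1u  /* hypothetical per-work flag */

/* Per-class worker accounting. */
struct acct {
    unsigned max_workers;
    unsigned nr_running;
};

struct pool {
    struct acct acct[ACCT_NR];
};

/* Work is bounded by default, unless the item says otherwise. */
static int work_acct_index(unsigned flags)
{
    return (flags & WORK_UNBOUND) ? ACCT_UNBOUND : ACCT_BOUND;
}

/* Claim a worker slot from the class the work belongs to. */
static bool try_start_work(struct pool *p, unsigned flags)
{
    struct acct *a = &p->acct[work_acct_index(flags)];

    assert(a->nr_running <= a->max_workers);
    if (a->nr_running == a->max_workers)
        return false;  /* this class is saturated; the other is unaffected */
    a->nr_running++;
    return true;
}

/* Return the slot when the work item finishes. */
static void finish_work(struct pool *p, unsigned flags)
{
    struct acct *a = &p->acct[work_acct_index(flags)];

    assert(a->nr_running > 0);
    a->nr_running--;
}
```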
  3. 09 Nov, 2019 2 commits
    • io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful() · 91d666ea
      Jens Axboe authored
      We hold the wqe lock at this point (which is also annotated), so there's
      no need to use the careful variant of list_empty().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for backlogged CQ ring · 1d7bb1d5
      Jens Axboe authored
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
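The back-pressure behaviour described above can be modelled in a few lines of userspace C. This is a toy model under stated assumptions: the ring holds plain `int` results, the backlog is a small fixed array rather than a list, and `cq_post`, `cq_reap`, and `submit` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define CQ_ENTRIES  2u  /* tiny ring so overflow is easy to trigger */
#define BACKLOG_MAX 8u  /* toy limit; the model asserts if exceeded */

struct cq {
    int ring[CQ_ENTRIES];
    unsigned head, tail;        /* free-running indices */
    int backlog[BACKLOG_MAX];
    unsigned backlog_len;
};

static unsigned cq_ready(const struct cq *cq)
{
    return cq->tail - cq->head;
}

/* Move as many backlogged events into the ring as currently fit. */
static void cq_flush_backlog(struct cq *cq)
{
    unsigned moved = 0;

    while (moved < cq->backlog_len && cq_ready(cq) < CQ_ENTRIES)
        cq->ring[cq->tail++ % CQ_ENTRIES] = cq->backlog[moved++];
    memmove(cq->backlog, cq->backlog + moved,
            (cq->backlog_len - moved) * sizeof(cq->backlog[0]));
    cq->backlog_len -= moved;
}

/* Post a completion: never dropped, backlogged if the ring is full. */
static void cq_post(struct cq *cq, int res)
{
    if (cq->backlog_len == 0 && cq_ready(cq) < CQ_ENTRIES) {
        cq->ring[cq->tail++ % CQ_ENTRIES] = res;
        return;
    }
    assert(cq->backlog_len < BACKLOG_MAX);
    cq->backlog[cq->backlog_len++] = res;
}

/* Reap one completion from the ring; returns 1 if one was available. */
static int cq_reap(struct cq *cq, int *res)
{
    if (cq_ready(cq) == 0)
        return 0;
    *res = cq->ring[cq->head++ % CQ_ENTRIES];
    return 1;
}

/* Submission path: refuse new IO while a backlog exists, but flush first. */
static int submit(struct cq *cq)
{
    if (cq->backlog_len > 0) {
        cq_flush_backlog(cq);
        return -EBUSY;
    }
    return 0;
}
```

An application seeing -EBUSY from `submit` can call `cq_reap` right away, since whatever fit has already been moved into the ring.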
  4. 08 Nov, 2019 3 commits
    • io_uring: pass in io_kiocb to fill/add CQ handlers · 78e19bbe
      Jens Axboe authored
      This is in preparation for handling CQ ring overflow a bit smarter. We
      should not have any functional changes in this patch. Most of the
      changes are fairly straightforward; the only ones that stick out a bit
      are the ones that change __io_free_req() to take the reference count
      into account. If the request hasn't been submitted yet, we know it's
      safe to simply ignore references and free it. But let's clean these up
      too, as later patches will depend on the caller doing the right thing if
      the completion logging grabs a reference to the request.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: make io_cqring_events() take 'ctx' as argument · 84f97dc2
      Jens Axboe authored
      The rings can be derived from the ctx, and we need the ctx there for
      a future change.
      
      No functional changes in this patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for linked SQE timeouts · 2665abfd
      Jens Axboe authored
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specified can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 07 Nov, 2019 4 commits
  6. 06 Nov, 2019 4 commits
  7. 05 Nov, 2019 2 commits
  8. 04 Nov, 2019 2 commits
  9. 02 Nov, 2019 1 commit
  10. 01 Nov, 2019 3 commits
    • io_uring: remove io_uring_add_to_prev() trace event · 0069fc6b
      Jens Axboe authored
      This internal logic was killed with the conversion to io-wq, so we no
      longer have a need for this particular trace. Kill it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait · e9ffa5c2
      Jackie Liu authored
      We don't use -ERESTARTSYS to tell the application layer to restart the
      system call; we return -EINTR instead. So we can set -EINTR directly
      when woken up by a signal, which saves an assignment and a
      comparison.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe authored
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
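The three result codes described above can be summarised with a small model. It is a simplification under stated assumptions: requests live in a flat array, matching is by `user_data` only, and running work is never interrupted (the commit notes that socket-based IO can be). `struct req`, `enum req_state`, and `async_cancel` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum req_state { REQ_PENDING, REQ_RUNNING, REQ_DONE };

/* Hypothetical in-flight request record. */
struct req {
    uint64_t user_data;
    enum req_state state;
    int res;  /* completion result of the original request */
};

/*
 * Returns the result the cancel request itself completes with:
 *   0         target found and cancelled (it completes with -ECANCELED)
 *   -EALREADY target found but already running
 *   -ENOENT   no matching in-flight request
 */
static int async_cancel(struct req *reqs, size_t n, uint64_t target)
{
    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct req *r = &reqs[i];

        if (r->user_data != target || r->state == REQ_DONE)
            continue;
        if (r->state == REQ_RUNNING)
            return -EALREADY;
        r->state = REQ_DONE;
        r->res = -ECANCELED;
        return 0;
    }
    return -ENOENT;
}
```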
  11. 30 Oct, 2019 1 commit
    • io_uring: io_wq_create() returns an error pointer, not NULL · 975c99a5
      Jens Axboe authored
      syzbot reported an issue where we crash at setup time if failslab is
      used. The issue is that io_wq_create() returns an error pointer on
      failure, not NULL. Hence io_uring thought the io-wq was set up just
      fine, but in reality it's a garbage error pointer.
      
      Use IS_ERR() instead of a NULL check, and assign ret appropriately.
      
      Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
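The bug class described above is easy to reproduce outside the kernel. The sketch below re-creates the error-pointer helpers in userspace, modelled on the kernel's include/linux/err.h; `wq_create` and `ring_setup` are hypothetical stand-ins for the real setup path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace copies of the kernel-style error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int wq_storage;  /* dummy object standing in for a work queue */

/* Hypothetical constructor: returns an error pointer, never NULL. */
static void *wq_create(int fail)
{
    return fail ? ERR_PTR(-ENOMEM) : (void *)&wq_storage;
}

/* Setup path: a NULL check here would accept the error pointer as valid. */
static int ring_setup(int fail, void **wq_out)
{
    void *wq = wq_create(fail);

    assert(wq_out != NULL);
    if (IS_ERR(wq)) {
        *wq_out = NULL;
        return (int)PTR_ERR(wq);  /* propagate the encoded errno */
    }
    *wq_out = wq;
    return 0;
}
```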
  12. 29 Oct, 2019 9 commits
    • io_uring: fix race with canceling timeouts · 842f9612
      Jens Axboe authored
      If we get -1 from hrtimer_try_to_cancel(), we know that the timer
      is running. Hence leave all completion to the timeout handler. If
      we don't, we can corrupt the list and miss a completion.
      
      Fixes: 11365043 ("io_uring: add support for canceling timeout requests")
      Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for larger fixed file sets · 65e19f54
      Jens Axboe authored
      There have been a few requests for supporting more fixed files than 1024.
      This isn't really tricky to do, we just need to split up the file table
      into multiple tables and index appropriately. As we do so, reduce the
      max single file table to 512. This enables us to do single page allocs
      always for the tables, which is an improvement over the situation prior.
      
      This patch adds support for up to 64K files, which should be enough for
      everyone.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
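The two-level indexing described above comes down to a shift and a mask. A minimal sketch, assuming 512 entries per table and a 64K limit as stated in the commit message; the macro and function names are hypothetical. With 8-byte pointers, 512 entries occupy 4096 bytes, which is a single page on systems with 4 KiB pages.

```c
#include <assert.h>

#define TABLE_SHIFT     9
#define TABLE_SIZE      (1u << TABLE_SHIFT)  /* 512 entries per table */
#define TABLE_MASK      (TABLE_SIZE - 1)
#define MAX_FIXED_FILES 65536u               /* 64K files in total */

/* Position of a fixed file: which table, and which slot inside it. */
struct file_slot {
    unsigned table;
    unsigned index;
};

static struct file_slot locate(unsigned i)
{
    struct file_slot s;

    assert(i < MAX_FIXED_FILES);
    s.table = i >> TABLE_SHIFT;
    s.index = i & TABLE_MASK;
    return s;
}

/* Number of tables needed to hold nr_files entries (rounded up). */
static unsigned tables_needed(unsigned nr_files)
{
    assert(nr_files <= MAX_FIXED_FILES);
    return (nr_files + TABLE_SIZE - 1) / TABLE_SIZE;
}
```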
    • io_uring: protect fixed file indexing with array_index_nospec() · b7620121
      Jens Axboe authored
      We index the file tables with a user given value. After we check
      it's within our limits, use array_index_nospec() to prevent any
      spectre attacks here.
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
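The check-then-clamp pattern described above can be sketched as follows. The mask computation is modelled on the kernel's generic fallback for array_index_nospec(); this userspace copy only illustrates the arithmetic and gives no speculation guarantee, since the real helper also relies on compiler barriers and per-architecture overrides. It assumes `index` and `size` fit in a `long` and that right-shifting a negative `long` is arithmetic, as on GCC and Clang. `index_mask_nospec` and `lookup` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones if index < size, zero otherwise, computed without a branch. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* Bounds check first, then clamp the index before using it. */
static void *lookup(void **table, unsigned long size, unsigned long index)
{
    assert(table != NULL);
    if (index >= size)
        return NULL;
    index &= index_mask_nospec(index, size);
    return table[index];
}
```

The architectural bounds check still does the real validation; the mask only ensures that an out-of-range index is forced to zero on a mispredicted path.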
    • io_uring: add support for IORING_OP_ACCEPT · 17f2fe35
      Jens Axboe authored
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • net: add __sys_accept4_file() helper · de2ea4b6
      Jens Axboe authored
      This is identical to __sys_accept4(), except it takes a struct file
      instead of an fd, and it also allows passing in extra file->f_flags
      flags. The latter is done to support masking in O_NONBLOCK without
      manipulating the original file flags.
      
      No functional changes in this patch.
      
      Cc: netdev@vger.kernel.org
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for async work inheriting files · fcb323cc
      Jens Axboe authored
      This is in preparation for adding opcodes that need to add new files
      in a process file table, system calls like open(2) or accept4(2).
      
      If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
      item. If work that needs to get punted to async context has this
      set, the async worker will assume the original task file table before
      executing the work.
      
      Note that opcodes that need access to the current files of an
      application cannot be done through IORING_SETUP_SQPOLL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace workqueue usage with io-wq · 561fb04a
      Jens Axboe authored
      Drop various work-arounds we have for workqueues:
      
      - We no longer need the async_list for tracking sequential IO.
      
      - We don't have to maintain our own mm tracking/setting.
      
      - We don't need a separate workqueue for buffered writes. This didn't
        even work that well to begin with, as it was suboptimal for multiple
        buffered writers on multiple files.
      
      - We can properly cancel pending interruptible work. This fixes
        deadlocks, particularly with socket IO, where we cannot cancel the
        requests when the io_uring is closed. Hence the ring will wait forever for
        these requests to complete, which may never happen. This is different
        from disk IO where we know requests will complete in a finite amount
        of time.
      
      - Due to being able to cancel interruptible work that is already
        running, we can implement file table support for work. We need that
        for supporting system calls that add to a process file table.
      
      - It gets us one step closer to adding async support for any system
        call.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe authored
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix mm_fault with READ/WRITE_FIXED · 95a1b3ff
      Pavel Begunkov authored
      Commit fb5ccc98 ("io_uring: Fix broken links with offloading")
      introduced a potential performance regression with unconditionally
      taking mm even for READ/WRITE_FIXED operations.
      
      Restore the logic that handled this. mm-faulted requests will go
      through the generic submission path, thus honoring links and drains,
      but will fail later on the req->has_user check.
      
      Fixes: fb5ccc98 ("io_uring: Fix broken links with offloading")
      Cc: stable@vger.kernel.org # v5.4
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>