1. 08 Nov, 2019 1 commit
    • io_uring: add support for linked SQE timeouts · 2665abfd
      Jens Axboe authored
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specified can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
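As a sketch of how an application might set up such a pair: mark the dependent request as the head of a link, then queue the timeout as the next SQE. The struct layout, opcode value, and flag bit below are illustrative stand-ins for the UAPI definitions in `<linux/io_uring.h>`, not the real layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins; the real SQE layout and constants live in
 * the <linux/io_uring.h> UAPI header. */
struct sqe {
    uint8_t  opcode;
    uint8_t  flags;
    uint64_t addr;      /* for the link timeout: pointer to a timespec */
    uint32_t len;
    uint64_t user_data;
};
enum { OP_LINK_TIMEOUT = 1 };   /* stand-in for IORING_OP_LINK_TIMEOUT */
#define SQE_IO_LINK (1u << 2)   /* stand-in for IOSQE_IO_LINK */

/* Link a timeout to a request: chain the request to the next SQE,
 * then fill that SQE with the timeout command. */
static void prep_linked_timeout(struct sqe *req, struct sqe *timeout,
                                const void *timespec)
{
    req->flags |= SQE_IO_LINK;                     /* next SQE is linked */
    memset(timeout, 0, sizeof(*timeout));
    timeout->opcode = OP_LINK_TIMEOUT;
    timeout->addr = (uint64_t)(uintptr_t)timespec; /* relative or absolute */
    timeout->len = 1;                              /* one timespec */
}
```

Whichever side finishes first then cancels the other, per the commit message.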
  2. 07 Nov, 2019 4 commits
  3. 06 Nov, 2019 4 commits
  4. 05 Nov, 2019 2 commits
  5. 04 Nov, 2019 2 commits
  6. 02 Nov, 2019 1 commit
  7. 01 Nov, 2019 3 commits
    • io_uring: remove io_uring_add_to_prev() trace event · 0069fc6b
      Jens Axboe authored
      This internal logic was killed with the conversion to io-wq, so we no
      longer have a need for this particular trace. Kill it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait · e9ffa5c2
      Jackie Liu authored
      We don't use -ERESTARTSYS to tell the application layer to restart
      the system call, but instead return -EINTR. We can set -EINTR
      directly when woken up by a signal, which saves an assignment and a
      comparison.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe authored
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
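The completion codes described above can be summarized in application-side handling; a minimal sketch (the helper name is hypothetical, not kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Interpret the CQE result of an async cancel request, per the rules
 * in the commit message above. */
static const char *cancel_result(int res)
{
    switch (res) {
    case 0:         return "cancelled";       /* target completed with -ECANCELED */
    case -EALREADY: return "already running"; /* target may still complete, possibly in error */
    case -ENOENT:   return "not found";       /* no matching request */
    default:        return "unexpected";
    }
}
```

The cancel SQE itself carries the target's user_data in `sqe->addr`, as the message describes.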
  8. 30 Oct, 2019 1 commit
    • io_uring: io_wq_create() returns an error pointer, not NULL · 975c99a5
      Jens Axboe authored
      syzbot reported an issue where we crash at setup time if failslab is
      used. The issue is that io_wq_create() returns an error pointer on
      failure, not NULL. Hence io_uring thought the io-wq was setup just
      fine, but in reality it's a garbage error pointer.
      
      Use IS_ERR() instead of a NULL check, and assign ret appropriately.
      
      Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 29 Oct, 2019 21 commits
    • io_uring: fix race with canceling timeouts · 842f9612
      Jens Axboe authored
      If we get -1 from hrtimer_try_to_cancel(), we know that the timer
      is running. Hence leave all completion to the timeout handler. If
      we don't, we can corrupt the list and miss a completion.
      
      Fixes: 11365043 ("io_uring: add support for canceling timeout requests")
      Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for larger fixed file sets · 65e19f54
      Jens Axboe authored
      There have been a few requests for supporting more than 1024 fixed
      files. This isn't really tricky to do; we just need to split the file
      table into multiple tables and index appropriately. As we do so,
      reduce the max single file table to 512. This enables us to always do
      single page allocs for the tables, which is an improvement over the
      prior situation.
      
      This patch adds support for up to 64K files, which should be enough for
      everyone.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
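The two-level lookup described above amounts to a split index; a minimal sketch, assuming the 512-entry per-table limit from the commit:

```c
#include <assert.h>

#define FILES_PER_TABLE 512  /* max entries per single-page table */

/* Split a fixed-file index into (table, slot) for a two-level lookup. */
static unsigned file_table(unsigned i) { return i / FILES_PER_TABLE; }
static unsigned file_slot(unsigned i)  { return i % FILES_PER_TABLE; }
```

At the 64K-file maximum this needs at most 128 tables, each a single page.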
    • io_uring: protect fixed file indexing with array_index_nospec() · b7620121
      Jens Axboe authored
      We index the file tables with a user-given value. After we check
      that it's within our limits, use array_index_nospec() to prevent any
      Spectre attacks here.
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
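The kernel helper clamps a possibly-speculated index without a branch. A simplified userspace sketch of the masking idea (the real array_index_nospec() also involves architecture-specific details, so this is illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Returns index when index < size, else 0, branchlessly: a speculated
 * out-of-bounds access then reads element 0 rather than memory at an
 * attacker-controlled offset. */
static size_t index_nospec(size_t index, size_t size)
{
    /* index < size  => (index - size) is negative  => mask = all ones
     * index >= size => difference is non-negative  => mask = 0 */
    size_t mask = (size_t)((intptr_t)(index - size) >>
                           (sizeof(intptr_t) * 8 - 1));
    return index & mask;
}
```

The key property is that the bounds decision is computed with arithmetic the CPU cannot mispredict, unlike a conditional branch.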
    • io_uring: add support for IORING_OP_ACCEPT · 17f2fe35
      Jens Axboe authored
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
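The try-nonblocking-then-punt flow above can be sketched with a pluggable accept function. Names here are illustrative; the real code goes through __sys_accept4_file() (added in the next commit below):

```c
#include <assert.h>
#include <errno.h>

typedef int (*accept_fn)(int sockfd);

/* Try the accept non-blocking first; if it would block, mark the
 * request as punted to async (io-wq) context. */
static int submit_accept(int sockfd, accept_fn try_accept, int *punted)
{
    int ret = try_accept(sockfd);
    if (ret == -EAGAIN) {
        *punted = 1;   /* retried later by an async worker */
        return 0;
    }
    *punted = 0;       /* completed inline */
    return ret;
}

/* Example accept functions, for illustration only. */
static int accept_ready(int fd)      { (void)fd; return 7; /* new fd */ }
static int accept_wouldblock(int fd) { (void)fd; return -EAGAIN; }
```

The same inline-first pattern is used by the other io_uring opcodes mentioned in the message.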
    • net: add __sys_accept4_file() helper · de2ea4b6
      Jens Axboe authored
      This is identical to __sys_accept4(), except it takes a struct file
      instead of an fd, and it also allows passing in extra file->f_flags
      flags. The latter is done to support masking in O_NONBLOCK without
      manipulating the original file flags.
      
      No functional changes in this patch.
      
      Cc: netdev@vger.kernel.org
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for async work inheriting files · fcb323cc
      Jens Axboe authored
      This is in preparation for adding opcodes that need to add new files
      in a process file table, system calls like open(2) or accept4(2).
      
      If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
      item. If work that needs to get punted to async context has this
      set, the async worker will assume the original task's file table before
      executing the work.
      
      Note that opcodes that need access to the current files of an
      application cannot be done through IORING_SETUP_SQPOLL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace workqueue usage with io-wq · 561fb04a
      Jens Axboe authored
      Drop various work-arounds we have for workqueues:
      
      - We no longer need the async_list for tracking sequential IO.
      
      - We don't have to maintain our own mm tracking/setting.
      
      - We don't need a separate workqueue for buffered writes. This didn't
        even work that well to begin with, as it was suboptimal for multiple
        buffered writers on multiple files.
      
      - We can properly cancel pending interruptible work. This fixes
        deadlocks, particularly with socket IO, where we cannot cancel
        requests when the io_uring is closed. Otherwise the ring will wait
        forever for these requests to complete, which may never happen. This
        is different from disk IO, where we know requests will complete in a
        finite amount of time.
      
      - Due to being able to cancel interruptible work that is already
        running, we can implement file table support for work. We need that
        for supporting system calls that add to a process file table.
      
      - It gets us one step closer to adding async support for any system
        call.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe authored
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the lifetime of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
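The hashed-insertion point above can be sketched: bucket work by a key such as the inode, so buffered writes to the same file serialize while work on different files proceeds concurrently. The bucket count, helper name, and hash constant are illustrative, not io-wq's actual scheme:

```c
#include <assert.h>
#include <stdint.h>

#define WQ_HASH_ORDER 6  /* 64 buckets; the count here is illustrative */

/* Map a work key (e.g. an inode pointer) to a bucket: all work in one
 * bucket runs serially, work in different buckets runs concurrently. */
static unsigned work_bucket(uintptr_t key)
{
    /* multiplicative hash on the key, keeping the low-order bucket bits */
    return (unsigned)(key * 2654435761u) & ((1u << WQ_HASH_ORDER) - 1);
}
```

This gives per-file serialization without dedicating a workqueue to buffered writes or throttling overall concurrency, as the list above describes.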
    • io_uring: Fix mm_fault with READ/WRITE_FIXED · 95a1b3ff
      Pavel Begunkov authored
      Commit fb5ccc98 ("io_uring: Fix broken links with offloading")
      introduced a potential performance regression with unconditionally
      taking mm even for READ/WRITE_FIXED operations.
      
      Bring back the logic handling it. mm-faulted requests will go through
      the generic submission path, thus honoring links and drains, but will
      fail further on at the req->has_user check.
      
      Fixes: fb5ccc98 ("io_uring: Fix broken links with offloading")
      Cc: stable@vger.kernel.org # v5.4
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove index from sqe_submit · fa456228
      Pavel Begunkov authored
      submit->index is used only for a bounds check in the submission path
      (i.e. head < ctx->sq_entries). However, it will always be true, as:
      1. it's already validated by io_get_sqring()
      2. ctx->sq_entries can't be changed in between, because of the held
      ctx->uring_lock and ctx->refs.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add set of tracing events · c826bd7a
      Dmitrii Dolgov authored
      To trace io_uring activity one can get information from workqueue and
      io trace events, but some parts can be hard to identify via this
      approach. Making what happens inside io_uring more transparent is
      important for reasoning about many aspects of it, hence introduce a
      set of tracing events.
      
      All such events can be roughly divided into two categories:
      
      * those that help understand correctness (from both the kernel
        and the application point of view). E.g. ring creation, file
        registration, or waiting for an available CQE. The proposed approach
        is to get a pointer to the original structure of interest (ring
        context, or request), and then find relevant events.
        io_uring_queue_async_work also exposes a pointer to the work_struct,
        to be able to track down corresponding workqueue events.
      
      * those that provide performance-related information. Mostly this is
        about events that change the flow of requests, e.g. whether an async
        work was queued, or delayed due to some dependencies. Another
        important case is how io_uring optimizations (e.g. registered files)
        are utilized.
      Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for canceling timeout requests · 11365043
      Jens Axboe authored
      We might have cases where the need for a specific timeout is gone, so
      add support for canceling an existing timeout operation. This works
      like the POLL_REMOVE command: the application passes in the user_data
      of the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for absolute timeouts · a41525ab
      Jens Axboe authored
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace s->needs_lock with s->in_async · ba5290cc
      Jackie Liu authored
      There is no functional change; this just cleans up the code, using
      s->in_async to make it clear where the code is running.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow application controlled CQ ring size · 33a107f0
      Jens Axboe authored
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE lifetime is different from that of the IO request itself: the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain applications don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
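The validation rule above (flag set, power-of-two entries) can be sketched; the flag name below is a stand-in for IORING_SETUP_CQSIZE, and the exact bit value is an assumption:

```c
#include <assert.h>
#include <stdint.h>

#define SETUP_CQSIZE (1u << 3)  /* stand-in for IORING_SETUP_CQSIZE */

/* Validate an application-chosen CQ ring size: cq_entries is only
 * consulted when the flag is set, and must then be a power of two. */
static int cq_entries_valid(uint32_t flags, uint32_t cq_entries)
{
    if (!(flags & SETUP_CQSIZE))
        return 1;  /* kernel falls back to 2 * sq_entries */
    return cq_entries != 0 && (cq_entries & (cq_entries - 1)) == 0;
}
```

The `x & (x - 1)` test is the standard power-of-two check: it clears the lowest set bit, leaving zero only when exactly one bit was set.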
    • io_uring: add support for IORING_REGISTER_FILES_UPDATE · c3a31e60
      Jens Axboe authored
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, if the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
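The two update rules above can be sketched on a plain int table where -1 marks an empty slot (the kernel operates on struct file pointers, so this is a model of the logic only):

```c
#include <assert.h>

/* Apply an io_uring_files_update-style array of fds at an offset. */
static void apply_files_update(int *table, unsigned offset,
                               const int *fds, unsigned nr)
{
    for (unsigned i = 0; i < nr; i++) {
        if (fds[i] == -1)
            table[offset + i] = -1;      /* case 1: remove existing file */
        else
            table[offset + i] = fds[i];  /* case 2: replace, or add if empty */
    }
}
```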
    • io_uring: allow sparse fixed file sets · 08a45173
      Jens Axboe authored
      This is in preparation for allowing updates to fixed file sets without
      requiring a full unregister+register.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: run dependent links inline if possible · ba816ad6
      Jens Axboe authored
      Currently any dependent link is executed from a new workqueue context,
      which means that we'll be doing a context switch per link in the chain.
      If we are running the completion of the current request from our async
      workqueue and find that the next request is a link, then run it directly
      from the workqueue context instead of forcing another switch.
      
      This improves the performance of linked SQEs, and reduces the CPU
      overhead.
      Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
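The optimization above can be sketched as a loop that walks the link chain in the current context instead of requeueing each dependent request. The structures are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

struct req {
    struct req *link;            /* next request in the chain, or NULL */
    void (*run)(struct req *);   /* execute this request */
};

/* Run a completed request's dependents inline: no context switch per
 * link, since we are already in the async worker's context. */
static void run_link_chain(struct req *r)
{
    while (r) {
        struct req *next = r->link;
        r->run(r);               /* runs directly, no requeue */
        r = next;
    }
}

/* Example request body, for illustration only. */
static int runs;
static void count_run(struct req *r) { (void)r; runs++; }
```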
    • um-ubd: Entrust re-queue to the upper layers · d848074b
      Anton Ivanov authored
      Fixes crashes due to ubd requeue logic conflicting with the block-mq
      logic. Crash is reproducible in 5.0 - 5.3.
      
      Fixes: 53766def ("um: Clean-up command processing in UML UBD driver")
      Cc: stable@vger.kernel.org # v5.0+
      Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf
      Anton Eidelman authored
      groups_only mode in nvme_read_ana_log() is no longer used: remove it.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042
      Anton Eidelman authored
      The following scenario results in an IO hang:
      1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
         NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
      2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
         This fails because ctrl disconnects.
         Therefore nvme_update_ns_ana_state() is not called
         and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
      3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
         nvme_read_ana_log(ctrl, groups_only=true).
         However, nvme_update_ana_state() does not update namespaces
         because nr_nsids = 0 (due to groups_only mode).
      4) scan_work calls nvme_validate_ns(), finds the ns and re-validates it OK.
      
      Result:
      The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
      Consequently ctrl will never be considered a viable path by __nvme_find_path().
      IO will hang if ctrl is the only or the last path to the namespace.
      
      More generally, while the ctrl is reconnecting, its ANA state may
      change. And because nvme_mpath_init() requests the ANA log in
      groups_only mode, these changes are not propagated to the existing
      ctrl namespaces. This may result in a malfunction or an IO hang.
      
      Solution:
      nvme_mpath_init() will call nvme_read_ana_log() with groups_only set
      to false. This will not harm the new ctrl case (no namespaces
      present), and will make sure the ANA state of namespaces gets updated
      after reconnect.
      
      Note: Another option would be for nvme_mpath_init() to invoke
      nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 28 Oct, 2019 1 commit
    • io_uring: don't touch ctx in setup after ring fd install · 044c1ab3
      Jens Axboe authored
      syzkaller reported an issue where it looks like a malicious app can
      trigger a use-after-free, reading the ctx ->sq_array and ->rings
      values right after having installed the ring fd in the process file
      table.
      
      Defer ring fd installation until after we're done reading those
      values.
      
      Fixes: 75b28aff ("io_uring: allocate the two rings together")
      Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>