1. 19 May, 2023 1 commit
    • io_uring: maintain ordering for DEFER_TASKRUN tw list · 3af0356c
      Jens Axboe authored
      We use lockless lists for the local and deferred task_work, which means
      that when we queue up events for processing, we ultimately process them
      in reverse order to how they were received. This usually doesn't matter,
      but for some cases, it does seem to make a big difference. Do the right
      thing and reverse the list before processing it, so that we know it's
      processed in the same order in which it was received.
      
      This makes a rather big difference for some medium load network tests,
      where consistency of performance was a bit all over the place. Here's
      a case that has 4 connections each doing two sends and receives:
      
      io_uring port=10002: rps:161.13k Bps:  1.45M idle=256ms
      io_uring port=10002: rps:107.27k Bps:  0.97M idle=413ms
      io_uring port=10002: rps:136.98k Bps:  1.23M idle=321ms
      io_uring port=10002: rps:155.58k Bps:  1.40M idle=268ms
      
      and after the change:
      
      io_uring port=10002: rps:205.48k Bps:  1.85M idle=140ms user=40ms
      io_uring port=10002: rps:203.57k Bps:  1.83M idle=139ms user=20ms
      io_uring port=10002: rps:218.79k Bps:  1.97M idle=106ms user=30ms
      io_uring port=10002: rps:217.88k Bps:  1.96M idle=110ms user=20ms
      io_uring port=10002: rps:222.31k Bps:  2.00M idle=101ms user=0ms
      io_uring port=10002: rps:218.74k Bps:  1.97M idle=102ms user=20ms
      io_uring port=10002: rps:208.43k Bps:  1.88M idle=125ms user=40ms
      
      using more of the time to actually process work rather than sitting
      idle.
      
      No effects have been observed at the peak end of the spectrum, where
      performance is still the same even with deep batch depths (and hence
      more items to reverse).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
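      The reordering described in this commit can be illustrated with a small
      userspace sketch. It is not the kernel code: the struct and function
      names (tw_node, tw_push, tw_reverse) are hypothetical, and it is
      single-threaded, so it omits the atomic operations a real lockless list
      needs. It only shows why head insertion yields newest-first order and
      how an in-place reversal restores arrival order. The kernel's llist API
      offers llist_reverse_order() for the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a lockless-list node. */
struct tw_node {
    struct tw_node *next;
    int seq;                /* arrival order, for illustration only */
};

/*
 * Push at the head, the way a lockless list adds entries:
 * the newest entry ends up first, so a plain walk sees
 * entries in reverse arrival order.
 */
static struct tw_node *tw_push(struct tw_node *head, struct tw_node *node)
{
    node->next = head;
    return node;
}

/* Reverse the list in place so a walk follows arrival order. */
static struct tw_node *tw_reverse(struct tw_node *head)
{
    struct tw_node *rev = NULL;

    while (head) {
        struct tw_node *next = head->next;

        head->next = rev;
        rev = head;
        head = next;
    }
    return rev;
}
```

      The reversal is a single O(n) pass with no allocation, which is
      consistent with the observation above that deep batches showed no
      measurable cost.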
  2. 17 May, 2023 4 commits
  3. 16 May, 2023 5 commits
    • io_uring: Add io_uring_setup flag to pre-register ring fd and never install it · 6e76ac59
      Josh Triplett authored
      With IORING_REGISTER_USE_REGISTERED_RING, an application can register
      the ring fd and use it via registered index rather than installed fd.
      This allows using a registered ring for everything *except* the initial
      mmap.
      
      With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the
      user, rather than requiring a subsequent mmap.
      
      The combination of the two allows a user to operate *entirely* via a
      registered ring fd, making it unnecessary to ever install the fd in the
      first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make
      io_uring_setup register the fd and return a registered index, without
      installing the fd.
      
      This allows an application to avoid touching the fd table at all, and
      allows a library to never even momentarily install a file descriptor.
      
      This splits out an io_ring_add_registered_file helper from
      io_ring_add_registered_fd, for use by io_uring_setup.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for user allocated memory for rings/sqes · 03d89a2d
      Jens Axboe authored
      Currently io_uring applications must call mmap(2) twice to map the rings
      themselves, and the sqes array. This works fine, but it does not support
      using huge pages to back the rings/sqes.
      
      Provide a way for the application to pass in pre-allocated memory for
      the rings/sqes, which can then suitably be allocated from shmfs or
      via mmap to get huge page support.
      
      Particularly for larger rings, this reduces the number of TLB entries
      needed.
      
      If an application wishes to take advantage of that, it must pre-allocate
      the memory needed for the sq/cq ring, and the sqes. The former must
      be passed in via the io_uring_params->cq_off.user_data field, while the
      latter is passed in via the io_uring_params->sq_off.user_data field. Then
      it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field,
      and io_uring will then map the existing memory into the kernel for shared
      use. The application must not call mmap(2) to map the rings as it
      otherwise would have; that will now fail with -EINVAL if this setup flag
      was used.
      
      The pages used for the rings and sqes must be contiguous. The intent here
      is clearly that huge pages should be used, otherwise the normal setup
      procedure works fine as-is. The application may use one huge page for
      both the rings and sqes.
      
      Outside of those initialization changes, everything works like it did
      before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add ring freeing helper · 9c189eee
      Jens Axboe authored
      We free the rings and sqes separately. Move that into a helper that does
      both the freeing and clearing of the memory.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
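      As a rough illustration of the helper pattern this commit describes, the
      userspace sketch below frees two allocations and clears the pointers in
      one place. The names (sketch_ring_ctx, sketch_rings_free) are
      hypothetical, and free(3) stands in for whatever the kernel uses to
      release ring memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context holding the two separately allocated regions. */
struct sketch_ring_ctx {
    void *rings;
    void *sq_sqes;
};

/*
 * Free both regions and clear the pointers, so callers on every
 * teardown path get the same behavior and a repeated call is harmless
 * (free(NULL) is a no-op).
 */
static void sketch_rings_free(struct sketch_ring_ctx *ctx)
{
    free(ctx->rings);
    free(ctx->sq_sqes);
    ctx->rings = NULL;
    ctx->sq_sqes = NULL;
}
```

      Clearing the pointers after freeing guards against a double free if an
      error path and the normal teardown path both call the helper.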
    • io_uring: return error pointer from io_mem_alloc() · e27cef86
      Jens Axboe authored
      In preparation for having more than one type of ring allocator, make the
      existing one return a valid pointer or an error pointer rather than just
      NULL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
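      The error-pointer convention mentioned in this commit encodes a negative
      errno value in the returned pointer, so the caller can tell why an
      allocation failed instead of seeing only NULL. The sketch below mimics
      that idea in userspace. All names are hypothetical stand-ins (the kernel
      provides ERR_PTR(), IS_ERR() and PTR_ERR() for this), and the 4095 limit
      is an assumption mirroring the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed upper bound on errno values, mirroring the kernel's MAX_ERRNO. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the negative errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the top SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical allocator: returns zeroed memory or an error pointer. */
static void *sketch_mem_alloc(size_t size)
{
    void *p;

    if (size == 0)
        return sketch_err_ptr(-EINVAL);
    p = calloc(1, size);
    return p ? p : sketch_err_ptr(-ENOMEM);
}
```

      A caller checks sketch_is_err() on the result and, on failure, reads the
      reason with sketch_ptr_err(). This lets different allocators report
      distinct failures through the same return type.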
    • io_uring: remove sq/cq_off memset · 9b1b58ca
      Jens Axboe authored
      The memset only serves to clear two reserved members that are not
      otherwise set; clear them manually instead. This is in preparation for
      using one of these members for a new feature.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 15 May, 2023 3 commits
  5. 14 May, 2023 13 commits
  6. 13 May, 2023 14 commits