  19 Jul, 2022 11 commits
    • Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux · 7f9eee19
      Jakub Kicinski authored
      Pavel Begunkov says:
      
      ====================
      io_uring zerocopy send
      
      The patchset implements io_uring zerocopy send. It works with both registered
      and normal buffers; mixing is allowed but not recommended. Apart from usual
      request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
      the userspace when buffers are freed and can be reused (see API design below),
      which is delivered into io_uring's Completion Queue. Those "buffer-free"
      notifications are not necessarily per request, but the userspace has control
      over it and can explicitly attach a number of requests to a single
      notification. The series also adds some internal optimisations when used with
      registered buffers, such as removing page referencing.
      
      From the kernel networking perspective there are two main changes. The first
      one is passing ubuf_info into the network layer from io_uring (inside an
      in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
      caching on the io_uring side, but also helps to avoid cross-referencing
      and synchronisation problems. The second part is an optional optimisation
      removing page referencing for requests with registered buffers.
      
      UDP was benchmarked with an optimised version of the selftest (see [1]), which
      sends a bunch of requests, waits for completions and repeats. The "zc + flush"
      column posts one additional "buffer-free" notification per request, while plain
      "zc" doesn't post buffer notifications at all.
      
      NIC (requests / second):
      IO size | non-zc    | zc             | zc + flush
      4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
      1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
      1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
      600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
      
      dummy (requests / second):
      IO size | non-zc    | zc             | zc + flush
      8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
      4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
      1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
      600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
      
      Previously it also brought a massive performance speedup compared to the
      msg_zerocopy tool (see [3]), which is probably not that interesting. There is
      also a bunch of additional refcounting optimisations that were omitted from
      the series for simplicity; as they don't change the picture drastically, they
      will be sent as a follow-up, along with flushing optimisations closing the
      performance gap between the last two columns.
      
      For TCP on localhost (with hacks enabling localhost zerocopy) and including
      additional overhead for receive:
      
      IO size | non-zc    | zc
      1200    | 4174      | 4148
      4096    | 7597      | 11228
      
      Using a real NIC with 1200-byte payloads, zc is ~5-10% worse than non-zc; the
      omitted optimisations may help somewhat. It should look better at 4000 bytes,
      but that couldn't be tested properly because of setup problems.
      
      Links:
      
        liburing (benchmark + tests):
        [1] https://github.com/isilence/liburing/tree/zc_v4
      
        kernel repo:
        [2] https://github.com/isilence/linux/tree/zc_v4
      
        RFC v1:
        [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
      
        RFC v2:
        https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
      
        Net patches are based on:
        git@github.com:isilence/linux.git zc_v4-net-base
        or
        https://github.com/isilence/linux/tree/zc_v4-net-base
      
      API design overview:
      
        The series introduces an io_uring concept of notifiers. From the userspace
        perspective it's an entity to which it can bind one or more requests and then
        request to flush it. Flushing a notifier makes it impossible to attach new
        requests to it, and instructs the notifier to post a completion once all
        requests attached to it are completed and the kernel doesn't need the buffers
        anymore.
      
        Notifications are stored in notification slots, which should be registered as
        an array in io_uring. Each slot stores only one notifier at any particular
        moment. Flushing removes it from the slot and the slot automatically replaces
        it with a new notifier. All operations with notifiers are done by specifying
        the index of the slot it's currently in.
      
        When registering a notification, the userspace specifies a u64 tag for each
        slot, which will be copied into notification completion entries as
        cqe::user_data. cqe::res is 0, and cqe::flags is set to a wrap-around u32
        sequence number counting the notifiers of a slot.
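        For illustration, a minimal sketch of how userspace might tell these
        notification CQEs apart from ordinary request completions. The 0xcafe tag
        and the two handler functions are hypothetical, and the consumer assumes a
        single slot was registered with that tag:

          #include <liburing.h>

          extern void buffers_reusable(unsigned int slot_seq);
          extern void handle_send_completion(struct io_uring_cqe *cqe);

          static void drain_cqes(struct io_uring *ring)
          {
                  struct io_uring_cqe *cqe;

                  while (io_uring_peek_cqe(ring, &cqe) == 0) {
                          if (cqe->user_data == 0xcafe && cqe->res == 0) {
                                  /* "buffer-free" notification: buffers bound to the
                                   * flushed notifier can be reused; cqe->flags carries
                                   * the slot's sequence number */
                                  buffers_reusable(cqe->flags);
                          } else {
                                  /* regular send completion, cqe->res is bytes sent */
                                  handle_send_completion(cqe);
                          }
                          io_uring_cqe_seen(ring, cqe);
                  }
          }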
      
      ====================
      
      Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: support externally provided ubufs · eb315a7d
      Pavel Begunkov authored
      Teach tcp how to use external ubuf_info provided in msghdr and
      also prepare it for managed frags by sprinkling
      skb_zcopy_downgrade_managed() where it could mix managed and unmanaged
      frags.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6/udp: support externally provided ubufs · 1fd3ae8c
      Pavel Begunkov authored
      Teach ipv6/udp how to use external ubuf_info provided in msghdr and
      also prepare it for managed frags by sprinkling
      skb_zcopy_downgrade_managed() where it could mix managed and unmanaged
      frags.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4/udp: support externally provided ubufs · c445f31b
      Pavel Begunkov authored
      Teach ipv4/udp how to use external ubuf_info provided in msghdr and
      also prepare it for managed frags by sprinkling
      skb_zcopy_downgrade_managed() where it could mix managed and unmanaged
      frags.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
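      The three "support externally provided ubufs" patches above (tcp, ipv6/udp,
      ipv4/udp) follow roughly the pattern below. It is a sketch of the control
      flow, not the exact diff, and it assumes the msg_ubuf field and the
      managed-frag helpers added elsewhere in this series:

        /* (1) picking up a caller-provided ubuf_info in the sendmsg path */
        struct ubuf_info *uarg = NULL;

        if (msg->msg_ubuf) {
                /* externally provided, e.g. by io_uring: take a reference
                 * instead of allocating a ubuf_info per call */
                uarg = msg->msg_ubuf;
                net_zcopy_get(uarg);
        } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
                /* classic MSG_ZEROCOPY path */
                uarg = msg_zerocopy_realloc(sk, len, skb_zcopy(skb));
        }

        /* (2) at the spots that may append frags the stack must reference
         * itself, drop the "managed" status first so an skb never mixes
         * managed and unmanaged frags */
        skb_zcopy_downgrade_managed(skb);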
    • net: introduce __skb_fill_page_desc_noacc · 84ce071e
      Pavel Begunkov authored
      Managed pages contain pinned userspace pages and are controlled by upper
      layers, so there is no need to track skb->pfmemalloc for them. Introduce
      a helper that fills frags while ignoring page tracking; it'll be needed
      later.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
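      Roughly, the split is: the new _noacc helper only fills the frag descriptor,
      while the accounted variant keeps the pfmemalloc propagation on top of it.
      A sketch based on the description above, not the verbatim kernel code:

        static inline void
        __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo, int i,
                                   struct page *page, int off, int size)
        {
                skb_frag_t *frag = &shinfo->frags[i];

                /* fill the frag only: no page reference is taken and no
                 * skb->pfmemalloc accounting is done */
                __skb_frag_set_page(frag, page);
                skb_frag_off_set(frag, off);
                skb_frag_size_set(frag, size);
        }

        static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
                                                struct page *page, int off, int size)
        {
                __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);

                /* only the accounted variant propagates pfmemalloc from the
                 * page to the skb */
                page = compound_head(page);
                if (page_is_pfmemalloc(page))
                        skb->pfmemalloc = true;
        }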
    • net: introduce managed frags infrastructure · 753f1ca4
      Pavel Begunkov authored
      Some users like io_uring can do page pinning more efficiently, so we
      want a way to delegate referencing to other subsystems. For that, add
      a new flag called SKBFL_MANAGED_FRAG_REFS. When it is set, the skb doesn't
      hold page references and upper layers are responsible for managing page
      lifetime.
      
      It's allowed to convert skbs from managed to normal by calling
      skb_zcopy_downgrade_managed(). The function will take all needed
      page references and clear the flag. It's needed, for instance,
      to avoid mixing managed modes.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
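      A sketch of how the downgrade described above can look, based on the commit
      message rather than the verbatim helpers:

        static inline bool skb_zcopy_managed(const struct sk_buff *skb)
        {
                return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS;
        }

        /* convert a managed skb back to a normal one: take the page
         * references the skb was not holding, then clear the flag */
        static void skb_zcopy_downgrade_managed(struct sk_buff *skb)
        {
                int i;

                if (!skb_zcopy_managed(skb))
                        return;

                for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                        skb_frag_ref(skb, i);
                skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAG_REFS;
        }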
    • net: Allow custom iter handler in msghdr · ebe73a28
      David Ahern authored
      Add support for custom iov_iter handling to msghdr. The idea is that
      in-kernel subsystems want control over how an SG is split.
      Signed-off-by: David Ahern <dsahern@kernel.org>
      [pavel: move callback into msghdr]
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
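      The shape of the hook, assuming a callback field along these lines in
      struct msghdr and __zerocopy_sg_from_iter() as the default path; the
      signatures are approximate:

        /* new in-kernel-only field in struct msghdr: */
        int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
                            struct iov_iter *from, size_t length);

        /* where the zerocopy path turns the iov_iter into skb frags, the
         * in-kernel caller may override how the SG list is built: */
        if (msg && msg->sg_from_iter)
                err = msg->sg_from_iter(sk, skb, from, length);
        else
                err = __zerocopy_sg_from_iter(sk, skb, from, length);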
    • skbuff: carry external ubuf_info in msghdr · 7c701d92
      Pavel Begunkov authored
      Make it possible for in-kernel network callers like io_uring to pass in a
      custom ubuf_info by setting it in a new field of struct msghdr.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
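      On the caller side the hand-off then looks roughly as below; my_ubuf,
      user_buf, sock and len are hypothetical names standing in for the caller's
      own state:

        struct msghdr msg = { .msg_flags = MSG_ZEROCOPY | MSG_DONTWAIT };
        struct iovec iov;
        int ret;

        /* caller-owned ubuf_info (e.g. io_uring's notifier); its callback
         * runs once the network stack no longer needs the pages */
        msg.msg_ubuf = my_ubuf;

        ret = import_single_range(WRITE, user_buf, len, &iov, &msg.msg_iter);
        if (!ret)
                ret = sock_sendmsg(sock, &msg);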
    • net/mlx5: CT: Remove warning of ignore_flow_level support for non PF · 22df2e93
      Roi Dayan authored
      ignore_flow_level isn't supported for SFs, and so it causes
      post_act and ct to warn about it for each SF.
      Emit the warning only for the PF.
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Add resiliency for PTP TX port timestamp · 58a51894
      Aya Levin authored
      PTP TX port timestamp relies on receiving 2 CQEs for each outgoing
      packet (WQE). The regular CQE has a less accurate timestamp than the
      wire CQE. On link change, the wire CQE may get lost. Let the driver
      detect and restore the relation between the CQEs, and re-sync after
      timeout.
      
      Add resiliency for this as follows: add an id (producer counter)
      into the WQE's metadata. This id will be received in the wire
      CQE (in the wqe_counter field). On handling the wire CQE, if there is no
      match, reply to the PTP application with the timestamp from the regular
      CQE and restore the sync between the CQEs and their SKBs. This patch
      adds 2 ptp counters:
      1) ptp_cq0_resync_event: number of times a mismatch was detected between
         the regular CQE and the wire CQE.
      2) ptp_cq0_resync_cqe: total number of missing wire CQEs.
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
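      The matching logic can be pictured roughly as below. The types and helpers
      are hypothetical stand-ins for the driver's own bookkeeping, kept only to
      show how the producer-counter id is reconciled with the wire CQE's
      wqe_counter:

        #include <stdint.h>
        #include <stdbool.h>

        struct pending_ts {
                uint16_t id;          /* producer counter put in WQE metadata */
                bool     want_hw_ts;  /* still waiting for its wire CQE */
        };

        struct ptp_stats {
                uint64_t resync_event; /* mismatches detected */
                uint64_t resync_cqe;   /* wire CQEs that never arrived */
        };

        /* Called per wire CQE; 'fifo' holds outstanding entries in submission
         * order, '*head' is the oldest outstanding index, 'mask' the ring mask.
         * Assumes a matching entry exists in the fifo. */
        static void handle_wire_cqe(struct pending_ts *fifo, unsigned int *head,
                                    unsigned int mask, uint16_t wqe_counter,
                                    struct ptp_stats *stats)
        {
                if (fifo[*head & mask].id != wqe_counter)
                        stats->resync_event++;  /* lost at least one wire CQE */

                /* entries whose wire CQE got lost keep the less accurate
                 * timestamp of their regular CQE */
                while (fifo[*head & mask].id != wqe_counter) {
                        stats->resync_cqe++;
                        fifo[(*head)++ & mask].want_hw_ts = false;
                }

                /* the matching entry gets the accurate wire timestamp */
                fifo[(*head)++ & mask].want_hw_ts = false;
        }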