1. 19 Jul, 2022 9 commits
    • net: devlink: remove unused locked functions · f655dacb
      Jiri Pirko authored
      Remove locked versions of functions that are no longer used by anyone.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netdevsim: convert driver to use unlocked devlink API during init/fini · 012ec02a
      Jiri Pirko authored
      Prepare for devlink reload being called with devlink->lock held and
      convert the netdevsim driver to use unlocked devlink API during init and
      fini flows. Take devl_lock() in reload_down() and reload_up() ops in the
      meantime before reload cmd is converted to take the lock itself.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: add unlocked variants of devlink_region_create/destroy() functions · eb0e9fa2
      Jiri Pirko authored
      Add unlocked variants of devlink_region_create/destroy() functions
      to be used in drivers called-in with devlink->lock held.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: convert driver to use unlocked devlink API during init/fini · 72a4c8c9
      Jiri Pirko authored
      Prepare for devlink reload being called with devlink->lock held and
      convert the mlxsw driver to use unlocked devlink API during init and
      fini flows. Take devl_lock() in reload_down() and reload_up() ops in the
      meantime before reload cmd is converted to take the lock itself.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: add unlocked variants of devlink_dpipe*() functions · 70a2ff89
      Jiri Pirko authored
      Add unlocked variants of devlink_dpipe*() functions to be used
      in drivers called-in with devlink->lock held.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: add unlocked variants of devlink_sb*() functions · 755cfa69
      Jiri Pirko authored
      Add unlocked variants of devlink_sb*() functions to be used
      in drivers called-in with devlink->lock held.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: add unlocked variants of devlink_resource*() functions · c223d6a4
      Jiri Pirko authored
      Add unlocked variants of devlink_resource*() functions to be used
      in drivers called-in with devlink->lock held.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: add unlocked variants of devlink_trap*() functions · 852e85a7
      Jiri Pirko authored
      Add unlocked variants of devlink_trap*() functions to be used in drivers
      called-in with devlink->lock held.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
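The "unlocked variant" pattern used throughout this series can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the `Devlink` class, its `regions` list, and `driver_init()` are hypothetical stand-ins, and the method names merely mirror the `devlink_*` (locking) and `devl_*` (caller holds the lock) naming convention.

```python
import threading


class Devlink:
    """Toy stand-in for a devlink instance (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.regions = []

    def devl_region_create(self, name):
        # Unlocked variant: the caller must already hold the instance lock.
        # Lock.locked() only tells us that *someone* holds it, which is
        # enough for this single-threaded illustration.
        if not self.lock.locked():
            raise RuntimeError("caller must hold the instance lock")
        self.regions.append(name)
        return name

    def devlink_region_create(self, name):
        # Locked wrapper: take the lock, then defer to the unlocked variant.
        with self.lock:
            return self.devl_region_create(name)


def driver_init(devlink):
    # An init flow that takes the lock once and calls unlocked variants inside.
    with devlink.lock:
        devlink.devl_region_create("fw-dump")
        devlink.devl_region_create("regs")
```

Holding the lock once around the whole init flow is what makes it possible for a caller such as reload to take the lock itself later without the driver re-acquiring it.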
    • net: devlink: avoid false DEADLOCK warning reported by lockdep · e26fde2f
      Moshe Shemesh authored
      Add a lock_class_key per devlink instance to avoid a false DEADLOCK
      warning from lockdep when driver code locks more than one devlink
      instance, for example in the flow that opens VFs.
      
      Kernel log:
      [  101.433802] ============================================
      [  101.433803] WARNING: possible recursive locking detected
      [  101.433810] 5.19.0-rc1+ #35 Not tainted
      [  101.433812] --------------------------------------------
      [  101.433813] bash/892 is trying to acquire lock:
      [  101.433815] ffff888127bfc2f8 (&devlink->lock){+.+.}-{3:3}, at: probe_one+0x3c/0x690 [mlx5_core]
      [  101.433909]
                     but task is already holding lock:
      [  101.433910] ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
      [  101.433989]
                     other info that might help us debug this:
      [  101.433990]  Possible unsafe locking scenario:
      
      [  101.433991]        CPU0
      [  101.433991]        ----
      [  101.433992]   lock(&devlink->lock);
      [  101.433993]   lock(&devlink->lock);
      [  101.433995]
                      *** DEADLOCK ***
      
      [  101.433996]  May be due to missing lock nesting notation
      
      [  101.433996] 6 locks held by bash/892:
      [  101.433998]  #0: ffff88810eb50448 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0
      [  101.434009]  #1: ffff888114777c88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x20d/0x520
      [  101.434017]  #2: ffff888102b58660 (kn->active#231){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x230/0x520
      [  101.434023]  #3: ffff888102d70198 (&dev->mutex){....}-{3:3}, at: sriov_numvfs_store+0x132/0x310
      [  101.434031]  #4: ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
      [  101.434108]  #5: ffff88812adce198 (&dev->mutex){....}-{3:3}, at: __device_attach+0x76/0x430
      [  101.434116]
                     stack backtrace:
      [  101.434118] CPU: 5 PID: 892 Comm: bash Not tainted 5.19.0-rc1+ #35
      [  101.434120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  101.434130] Call Trace:
      [  101.434133]  <TASK>
      [  101.434135]  dump_stack_lvl+0x57/0x7d
      [  101.434145]  __lock_acquire.cold+0x1df/0x3e7
      [  101.434151]  ? register_lock_class+0x1880/0x1880
      [  101.434157]  lock_acquire+0x1c1/0x550
      [  101.434160]  ? probe_one+0x3c/0x690 [mlx5_core]
      [  101.434229]  ? lockdep_hardirqs_on_prepare+0x400/0x400
      [  101.434232]  ? __xa_alloc+0x1ed/0x2d0
      [  101.434236]  ? ksys_write+0xf3/0x1d0
      [  101.434239]  __mutex_lock+0x12c/0x14b0
      [  101.434243]  ? probe_one+0x3c/0x690 [mlx5_core]
      [  101.434312]  ? probe_one+0x3c/0x690 [mlx5_core]
      [  101.434380]  ? devlink_alloc_ns+0x11b/0x910
      [  101.434385]  ? mutex_lock_io_nested+0x1320/0x1320
      [  101.434388]  ? lockdep_init_map_type+0x21a/0x7d0
      [  101.434391]  ? lockdep_init_map_type+0x21a/0x7d0
      [  101.434393]  ? __init_swait_queue_head+0x70/0xd0
      [  101.434397]  probe_one+0x3c/0x690 [mlx5_core]
      [  101.434467]  pci_device_probe+0x1b4/0x480
      [  101.434471]  really_probe+0x1e0/0xaa0
      [  101.434474]  __driver_probe_device+0x219/0x480
      [  101.434478]  driver_probe_device+0x49/0x130
      [  101.434481]  __device_attach_driver+0x1b8/0x280
      [  101.434484]  ? driver_allows_async_probing+0x140/0x140
      [  101.434487]  bus_for_each_drv+0x123/0x1a0
      [  101.434489]  ? bus_for_each_dev+0x1a0/0x1a0
      [  101.434491]  ? lockdep_hardirqs_on_prepare+0x286/0x400
      [  101.434494]  ? trace_hardirqs_on+0x2d/0x100
      [  101.434498]  __device_attach+0x1a3/0x430
      [  101.434501]  ? device_driver_attach+0x1e0/0x1e0
      [  101.434503]  ? pci_bridge_d3_possible+0x1e0/0x1e0
      [  101.434506]  ? pci_create_resource_files+0xeb/0x190
      [  101.434511]  pci_bus_add_device+0x6c/0xa0
      [  101.434514]  pci_iov_add_virtfn+0x9e4/0xe00
      [  101.434517]  ? trace_hardirqs_on+0x2d/0x100
      [  101.434521]  sriov_enable+0x64a/0xca0
      [  101.434524]  ? pcibios_sriov_disable+0x10/0x10
      [  101.434528]  mlx5_core_sriov_configure+0xab/0x280 [mlx5_core]
      [  101.434602]  sriov_numvfs_store+0x20a/0x310
      [  101.434605]  ? sriov_totalvfs_show+0xc0/0xc0
      [  101.434608]  ? sysfs_file_ops+0x170/0x170
      [  101.434611]  ? sysfs_file_ops+0x117/0x170
      [  101.434614]  ? sysfs_file_ops+0x170/0x170
      [  101.434616]  kernfs_fop_write_iter+0x348/0x520
      [  101.434619]  new_sync_write+0x2e5/0x520
      [  101.434621]  ? new_sync_read+0x520/0x520
      [  101.434624]  ? lock_acquire+0x1c1/0x550
      [  101.434626]  ? lockdep_hardirqs_on_prepare+0x400/0x400
      [  101.434630]  vfs_write+0x5cb/0x8d0
      [  101.434633]  ksys_write+0xf3/0x1d0
      [  101.434635]  ? __x64_sys_read+0xb0/0xb0
      [  101.434638]  ? lockdep_hardirqs_on_prepare+0x286/0x400
      [  101.434640]  ? syscall_enter_from_user_mode+0x1d/0x50
      [  101.434643]  do_syscall_64+0x3d/0x90
      [  101.434647]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  101.434650] RIP: 0033:0x7f5ff536b2f7
      [  101.434658] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
      1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f
      05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      [  101.434661] RSP: 002b:00007ffd9ea85d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  101.434664] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5ff536b2f7
      [  101.434666] RDX: 0000000000000002 RSI: 000055c4c279e230 RDI: 0000000000000001
      [  101.434668] RBP: 000055c4c279e230 R08: 000000000000000a R09: 0000000000000001
      [  101.434669] R10: 000055c4c283cbf0 R11: 0000000000000246 R12: 0000000000000002
      [  101.434670] R13: 00007f5ff543d500 R14: 0000000000000002 R15: 00007f5ff543d700
      [  101.434673]  </TASK>
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
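The warning in the log above arises because both mutexes are reported under the same lock class, `&devlink->lock`. A toy model can show why giving each instance its own class silences it. This is a deliberately simplified sketch in Python: `ToyLockdep` and `open_vfs()` are hypothetical names, and real lockdep tracks far more than same-class nesting.

```python
class ToyLockdep:
    """Tracks the classes of held locks and flags same-class nesting."""

    def __init__(self):
        self.held = []      # lock classes currently held, in acquisition order
        self.warnings = []  # messages for suspicious acquisitions

    def acquire(self, lock_class):
        # Nesting two locks of the same class looks like recursion to the
        # checker, even when they belong to different objects.
        if lock_class in self.held:
            self.warnings.append("possible recursive locking: " + lock_class)
        self.held.append(lock_class)

    def release(self, lock_class):
        self.held.remove(lock_class)


def open_vfs(dep, pf_class, vf_class):
    # The PF holds its devlink lock while the VF probe takes the VF's lock.
    dep.acquire(pf_class)
    dep.acquire(vf_class)
    dep.release(vf_class)
    dep.release(pf_class)
    return dep.warnings
```

With one shared class for both instances the model records a warning; with a distinct class per instance it records none, which mirrors the effect described in the commit message.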
  2. 18 Jul, 2022 21 commits
    • atl1c: use netif_napi_add_tx() for Tx NAPI · 6e693a10
      Sieng-Piaw Liew authored
      Use netif_napi_add_tx() for NAPI in Tx direction instead of the regular
      netif_napi_add() function.
      Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: fix Clang -Wunused-const-variable warning on 'ksz_dt_ids' · da53af8c
      Arun Ramadoss authored
      This patch removes the of_match_ptr() wrapper around the reference to
      ksz_dt_ids, which produced the unused-variable warning.
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Arnd Bergmann <arnd@kernel.org>
      Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tls-rx-avoid-skb_cow_data' · fd18d5f1
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      tls: rx: avoid skb_cow_data()
      
      TLS calls skb_cow_data() on the skb it received from strparser
      whenever it needs to hold onto the skb with the decrypted data.
      (The alternative being decrypting directly to a user space buffer
      in which case the input skb doesn't get modified or used after.)
      TLS needs the decrypted skb:
       - almost always with TLS 1.3 (unless the new NoPad is enabled);
       - when user space buffer is too small to fit the record;
       - when BPF sockmap is enabled.
      
      Most of the time the skb we get out of strparser is a clone of
      a 64kB data unit coalesced by GRO. To make things worse skb_cow_data()
      tries to output a linear skb and allocates it with GFP_ATOMIC.
      This occasionally fails even under moderate memory pressure.
      
      This patch set rejigs the TLS Rx so that we don't expect decryption
      in place. The decryption handlers return an skb which may or may not
      be the skb from strparser. For TLS 1.3 this results in a 20-30%
      performance improvement without NoPad enabled.
      
      v2: rebase after 3d8c51b2 ("net/tls: Check for errors in tls_device_init")
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: decrypt into a fresh skb · fd31f399
      Jakub Kicinski authored
      We currently CoW Rx skbs whenever we can't decrypt to a user
      space buffer. The skbs can be enormous (64kB) and CoW does
      a linear alloc which has a strong chance of failing under
      memory pressure. Or even without, skb_cow_data() assumes
      GFP_ATOMIC.
      
      Allocate a new frag'd skb and decrypt into it. We finally
      take advantage of the decrypted skb getting returned via
      darg.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: async: don't put async zc on the list · cbbdee99
      Jakub Kicinski authored
      The "zero-copy" path in SW TLS will engage either for no skbs or
      for all but last. If the recvmsg parameters are right and the
      socket can do ZC we'll ZC until the iterator can't fit a full
      record at which point we'll decrypt one more record and copy
      over the necessary bits to fill up the request.
      
      The only reason we hold onto the ZC skbs which went thru the async
      path until the end of recvmsg() is to count bytes. We need an accurate
      count of zc'ed bytes so that we can calculate how much of the non-zc'd
      data to copy. To allow freeing input skbs on the ZC path count only
      how much of the list we'll need to consume.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: async: hold onto the input skb · c618db2a
      Jakub Kicinski authored
      Async crypto currently benefits from the fact that we decrypt
      in place. When we allow input and output to be different skbs
      we will have to hang onto the input while we move to the next
      record. Clone the inputs and keep them on a list.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: async: adjust record geometry immediately · 6ececdc5
      Jakub Kicinski authored
      Async crypto TLS Rx currently waits for crypto to be done
      in order to strip the TLS header and trailer. Simplify
      the code by moving the pointers immediately. Since only
      TLS 1.2 is supported here, there is no message padding.
      
      This simplifies the decryption into a new skb in the next
      patch as we don't have to worry about input vs output
      skb in the decrypt_done() handler any more.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: return the decrypted skb via darg · 6bd116c8
      Jakub Kicinski authored
      Instead of using ctx->recv_pkt after decryption read the skb
      from darg.skb. This moves the decision of what the "output skb"
      is to the decrypt handlers. For now after decrypt handler returns
      successfully ctx->recv_pkt is simply moved to darg.skb, but it
      will change soon.
      
      Note that tls_decrypt_sg() cannot clear the ctx->recv_pkt
      because it gets called to re-encrypt (i.e. by the device offload).
      So we need an awkward temporary if() in tls_rx_one_record().
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: read the input skb from ctx->recv_pkt · 541cc48b
      Jakub Kicinski authored
      Callers always pass ctx->recv_pkt into decrypt_skb_update(),
      and it propagates it to its callees. This may give someone
      the false impression that those functions can accept any valid
      skb containing a TLS record. That's not the case, the record
      sequence number is read from the context, and they can only
      take the next record coming out of the strp.
      
      Let the functions get the skb from the context instead of
      passing it in. This will also make it cleaner to return
      a different skb than ctx->recv_pkt as the decrypted one
      later on.
      
      Since we're touching the definition of decrypt_skb_update()
      use this as an opportunity to rename it.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: factor out device darg update · 8a958732
      Jakub Kicinski authored
      I already forgot to transform darg from input to output
      semantics once on the NIC inline crypto fastpath. To
      avoid this happening again create a device equivalent
      of decrypt_internal(). A function responsible for decryption
      and transforming darg.
      
      While at it rename decrypt_internal() to a hopefully slightly
      more meaningful name.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: remove the message decrypted tracking · 53d57999
      Jakub Kicinski authored
      We no longer allow a decrypted skb to remain linked to ctx->recv_pkt.
      Anything on the list is decrypted, anything on ctx->recv_pkt needs
      to be decrypted.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: don't keep decrypted skbs on ctx->recv_pkt · abb47dc9
      Jakub Kicinski authored
      Detach the skb from ctx->recv_pkt after decryption is done,
      even if we can't consume it.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: don't try to keep the skbs always on the list · 008141de
      Jakub Kicinski authored
      I thought that having the skb either always on the ctx->rx_list
      or ctx->recv_pkt will simplify the handling, as we would not
      have to remember to flip it from one to the other on exit paths.
      
      This became a little harder to justify with the fix for BPF
      sockmaps. Subsequent changes will make the situation even worse.
      Queue the skbs only when really needed.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: allow only one reader at a time · 4cbc325e
      Jakub Kicinski authored
      recvmsg() in TLS gets data from the skb list (rx_list) or fresh
      skbs we read from TCP via strparser. The former holds skbs which were
      already decrypted for peek or decrypted and partially consumed.
      
      tls_wait_data() only notices appearance of fresh skbs coming out
      of TCP (or psock). It is possible, if there is a concurrent call
      to peek() and recv() that the peek() will move the data from input
      to rx_list without recv() noticing. recv() will then read data out
      of order or never wake up.
      
      This is not a practical use case/concern, but it makes the self
      tests less reliable. This patch solves the problem by allowing
      only one reader in.
      
      Because having multiple processes calling read()/peek() is not
      normal, avoid adding a lock and try to fast-path the single-reader
      case.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
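The "one reader at a time" idea can be sketched as a small gate that lets a lone reader through without waiting and parks any concurrent reader until the first one leaves. This is a userspace Python model, not the kernel implementation: `SingleReaderGate` and its methods are hypothetical names, and the condition variable's internal lock stands in for whatever lock the caller already holds (the commit message says no new lock is added).

```python
import threading


class SingleReaderGate:
    """Admits one reader at a time; extra readers wait for their turn."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reader_present = False

    def enter(self):
        with self._cond:
            # Fast path: with no reader inside, the loop body never runs.
            while self._reader_present:
                self._cond.wait()
            self._reader_present = True

    def exit(self):
        with self._cond:
            self._reader_present = False
            # Wake one waiting reader, if any.
            self._cond.notify()
```

A reader would call `enter()` before touching the input queue or the list of already decrypted data and `exit()` afterwards, so a concurrent peek can no longer move data between the two while another reader is waiting on only one of them.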
    • Merge branch 'net-smc-virt-contig-buffers' · 3898f52c
      David S. Miller authored
      Wen Gu says:
      
      ====================
      net/smc: Introduce virtually contiguous buffers for SMC-R
      
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which will cause unexpected hangs
      and further stability risks.
      
      So this patch set is aimed to allow SMC-R link group to use virtually
      contiguous sndbufs and RMBs to avoid potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be set
      by sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in the data path, caused by the additional address
         translation of the sndbuf by the RNIC in Tx. In general, though,
         translating addresses through the MTT is fast. According to the qperf
         test, this part of the regression is mostly below 10% in latency
         and bandwidth (see patch 5/6 for details).

      2) regression in the buffer initialization and destruction path, caused
         by the additional MR operations on sndbufs. Thanks to the link group
         buffer reuse mechanism, the impact of this regression decreases as
         the number of buffer reuses increases.
      
      Patch set overview:
      - Patches 1/6 and 2/6 simplify and optimize the DMA sync
        operations, which reduces overhead on the data path, especially
        when using virtually contiguous buffers;
      - Patches 3/6 and 4/6 introduce a sysctl smcr_buf_type to set the type
        of buffers in newly created link groups;
      - Patch 5/6 allows SMC-R to use virtually contiguous sndbufs and RMBs,
        including buffer creation, destruction, MR operation and access;
      - Patch 6/6 extends the netlink attribute for the buffer type of an SMC-R link group.
      
      v1->v2:
      - Patch 5/6 fixes build issue on 32bit;
      - Patch 3/6 adds description of new sysctl in smc-sysctl.rst;
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Extend SMC-R link group netlink attribute · ddefb2d2
      Wen Gu authored
      Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
      Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
      SMC-R link group.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R · b8d19945
      Wen Gu authored
      On long-running enterprise production servers, high-order contiguous
      memory pages are usually very rare and in most cases we can only get
      fragmented pages.
      
      When replacing TCP with SMC-R in such production scenarios, attempting
      to allocate high-order physically contiguous sndbufs and RMBs may result
      in frequent memory compaction, which will cause unexpected hangs
      and further stability risks.
      
      So this patch is aimed to allow SMC-R link group to use virtually
      contiguous sndbufs and RMBs to avoid potential issues mentioned above.
      Whether to use physically or virtually contiguous buffers can be set
      by sysctl smcr_buf_type.
      
      Note that using virtually contiguous buffers will bring an acceptable
      performance regression, which can be mainly divided into two parts:
      
      1) regression in the data path, caused by the additional address
         translation of the sndbuf by the RNIC in Tx. In general, though,
         translating addresses through the MTT is fast.
      
         Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
         latency and bandwidth test with physically and virtually contiguous
         buffers are as follows:
      
      - client:
        smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
        -t 5 -vu tcp_{bw|lat}
      - server:
        smc_run taskset -c <cpu> qperf
      
         [latency]
         msgsize              tcp            smcr        smcr-use-virt-buf
         1               11.17 us         7.56 us         7.51 us (-0.67%)
         2               10.65 us         7.74 us         7.56 us (-2.31%)
         4               11.11 us         7.52 us         7.59 us ( 0.84%)
         8               10.83 us         7.55 us         7.51 us (-0.48%)
         16              11.21 us         7.46 us         7.51 us ( 0.71%)
         32              10.65 us         7.53 us         7.58 us ( 0.61%)
         64              10.95 us         7.74 us         7.80 us ( 0.76%)
         128             11.14 us         7.83 us         7.87 us ( 0.47%)
         256             10.97 us         7.94 us         7.92 us (-0.28%)
         512             11.23 us         7.94 us         8.20 us ( 3.25%)
         1024            11.60 us         8.12 us         8.20 us ( 0.96%)
         2048            14.04 us         8.30 us         8.51 us ( 2.49%)
         4096            16.88 us         9.13 us         9.07 us (-0.64%)
         8192            22.50 us        10.56 us        11.22 us ( 6.26%)
         16384           28.99 us        12.88 us        13.83 us ( 7.37%)
         32768           40.13 us        16.76 us        16.95 us ( 1.16%)
         65536           68.70 us        24.68 us        24.85 us ( 0.68%)
         [bandwidth]
         msgsize                tcp              smcr          smcr-use-virt-buf
         1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
         2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
         4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
         8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
         16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
         32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
         64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
         128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
         256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
         512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
         1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
         2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
         4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
         8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
         16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
         32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
         65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)
      
      2) regression in the buffer initialization and destruction path, caused
         by the additional MR operations on sndbufs. Thanks to the link group
         buffer reuse mechanism, the impact of this regression decreases as
         the number of buffer reuses increases.

         Taking 256KB sndbuf and RMB as an example, the latencies of some key
         SMC-R buffer-related functions, obtained by bpftrace, are as follows:
      
         Function                         Phys-bufs           Virt-bufs
         smcr_new_buf_create()             67154 ns            79164 ns
         smc_ib_buf_map_sg()                 525 ns              928 ns
         smc_ib_get_memory_region()       162294 ns           161191 ns
         smc_wr_reg_send()                  9957 ns             9635 ns
         smc_ib_put_memory_region()       203548 ns           198374 ns
         smc_ib_buf_unmap_sg()               508 ns             1158 ns
      
      ------------
      Test environment notes:
      1. The above tests were run on 2 VMs within the same host.
      2. The NIC is a ConnectX-4Lx, using SR-IOV and passing through 2 VFs to
         each VM respectively.
      3. The VMs' vCPUs are bound to different physical CPUs, and the bound
         physical CPUs are isolated by the `isolcpus=xxx` cmdline option.
      4. The NICs' queue number is set to 1.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
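The percentages in the last column of the qperf tables above appear to be the change of the smcr-use-virt-buf result relative to the smcr result. A minimal Python sketch of that arithmetic follows; `relative_change_pct` is a hypothetical helper name. Values recomputed from the rounded table columns can differ slightly from the printed percentages, presumably because those were derived from unrounded measurements.

```python
def relative_change_pct(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Latency at msgsize 8192: physically vs. virtually contiguous buffers (us).
latency_change = relative_change_pct(10.56, 11.22)

# Bandwidth at msgsize 65536: physically vs. virtually contiguous (MB/s).
bandwidth_change = relative_change_pct(4735.61, 4555.17)
```

For latency a positive value means the virtually contiguous run was slower; for bandwidth a negative value means it moved less data per second.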
    • net/smc: Use sysctl-specified types of buffers in new link group · b984f370
      Wen Gu authored
      This patch introduces a new SMC-R specific element buf_type
      in struct smc_link_group, for recording the value of sysctl
      smcr_buf_type when the link group is created.

      A newly created link group will create and reuse buffers of the
      type specified by buf_type.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Introduce a sysctl for setting SMC-R buffer type · 4bc5008e
      Wen Gu authored
      This patch introduces the sysctl smcr_buf_type for setting
      the type of SMC-R sndbufs and RMBs.

      Valid values include:

      - SMCR_PHYS_CONT_BUFS, which means use physically contiguous
        buffers for better performance and is the default value.

      - SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
        buffers in case physically contiguous memory is scarce.

      - SMCR_MIXED_BUFS, which means first try to use physically
        contiguous buffers. If not available, then use virtually
        contiguous buffers.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
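The three smcr_buf_type settings described above amount to a small allocation policy, which can be sketched as follows. This is a Python model rather than the kernel code: `choose_buffer` and the two allocator callbacks are hypothetical, and the numeric enum values are an assumption of this sketch.

```python
from enum import IntEnum


class SmcrBufType(IntEnum):
    # Names follow the commit message; the numeric values are assumed here.
    PHYS_CONT_BUFS = 0
    VIRT_CONT_BUFS = 1
    MIXED_BUFS = 2


def choose_buffer(buf_type, alloc_phys, alloc_virt):
    """Return (kind, buffer); the allocators return None on failure."""
    if buf_type in (SmcrBufType.PHYS_CONT_BUFS, SmcrBufType.MIXED_BUFS):
        buf = alloc_phys()
        if buf is not None:
            return ("phys", buf)
        if buf_type == SmcrBufType.PHYS_CONT_BUFS:
            # Physically contiguous only: no fallback is attempted.
            return (None, None)
    # VIRT_CONT_BUFS, or MIXED_BUFS after the physical attempt failed.
    buf = alloc_virt()
    return ("virt", buf) if buf is not None else (None, None)
```

In the mixed mode the virtually contiguous allocator is only reached when the physically contiguous attempt fails, which matches the "first try ... if not available" wording of the commit message.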
    • net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu · 0ef69e78
      Guangguan Wang authored
      Some CPUs, such as Xeon, can guarantee DMA cache coherency,
      so there is no need to use the DMA sync APIs to flush the cache on
      such CPUs. To avoid calling the DMA sync APIs on the I/O path, use
      dma_need_sync() to check whether an smc_buf_desc needs DMA sync when
      the smc_buf_desc is created.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
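The optimization described above boils down to deciding once, at buffer creation, whether syncs are needed and then skipping them cheaply on the hot path. A minimal Python model of that idea, with hypothetical names (`BufDesc`, `platform_needs_sync`, `sync_calls`) that are not taken from the kernel source:

```python
class BufDesc:
    """Toy buffer descriptor that caches the need-sync decision."""

    def __init__(self, platform_needs_sync):
        # Evaluated once at creation time instead of on every I/O.
        self.is_dma_need_sync = platform_needs_sync
        self.sync_calls = 0  # counts the (simulated) expensive sync operations

    def sync_for_device(self):
        if not self.is_dma_need_sync:
            return  # cache-coherent platform: nothing to flush
        self.sync_calls += 1
```

On a platform that reports no need for syncing, every call returns immediately, so the I/O path pays only for a flag check.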
    • net/smc: remove redundant dma sync ops · 6d52e2de
      Guangguan Wang authored
      smc_ib_sync_sg_for_cpu/device are the ops used for DMA memory cache
      consistency. SMC sndbufs are DMA buffers that the CPU writes data to
      and the PCIe device reads data from. So for sndbufs,
      smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
      redundant, as the PCIe device will not write to the buffers. SMC RMBs
      are DMA buffers that the PCIe device writes data to and the CPU reads
      data from. So for RMBs, smc_ib_sync_sg_for_cpu is needed and
      smc_ib_sync_sg_for_device is redundant, as the CPU will not write to
      the buffers.
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
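The reasoning above reduces to a mapping from buffer kind to the one sync direction that is still required. A small Python sketch of that mapping follows; `required_sync` is a hypothetical helper, and the returned strings simply reuse the function names from the commit message.

```python
def required_sync(buffer_kind):
    """Return the only sync op still needed for the given SMC buffer kind."""
    if buffer_kind == "sndbuf":
        # CPU writes, device reads: hand the data over to the device.
        return "smc_ib_sync_sg_for_device"
    if buffer_kind == "rmb":
        # Device writes, CPU reads: make the data visible to the CPU.
        return "smc_ib_sync_sg_for_cpu"
    raise ValueError("unknown buffer kind: " + buffer_kind)
```

Each buffer kind has a single writer, so only the sync that transfers ownership from that writer to the reader is kept.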
  3. 16 Jul, 2022 4 commits
  4. 15 Jul, 2022 6 commits