1. 16 Jun, 2017 7 commits
    • net/mlx4_en: Poll XDP TX completion queue in RX NAPI · 6c78511b
      Tariq Toukan authored
      Instead of having their own NAPIs, XDP TX completion queues get
      polled within the corresponding RX NAPI.
      This prevents any possible race on TX ring prod/cons indices,
      between the context that issues the transmits (RX NAPI) and the
      context that handles the completions (was previously done in
      a separate NAPI).
      
      This also improves performance, as it decreases the number
      of NAPIs running on a CPU, saving the overhead of syncing
      and switching between the contexts.
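The single-context idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the driver's code: `struct tx_ring`, `RING_SIZE`, `hw_done`, and all function names are hypothetical stand-ins. The point it shows is that when the same poll routine both reaps completions (advancing `cons`) and issues transmits (advancing `prod`), the two indices are only ever touched from one context.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical ring size */

struct tx_ring {
    uint32_t prod;    /* advanced when a packet is queued for transmit */
    uint32_t cons;    /* advanced when a completion is reaped */
    uint32_t hw_done; /* stand-in for completions reported by hardware */
};

/* Reap finished transmits; returns how many slots were freed. */
static uint32_t poll_xdp_tx_cq(struct tx_ring *r)
{
    uint32_t freed = r->hw_done - r->cons;

    r->cons = r->hw_done;
    return freed;
}

/* Queue one packet; returns 0 on success, -1 when the ring is full. */
static int xdp_xmit(struct tx_ring *r)
{
    if (r->prod - r->cons >= RING_SIZE)
        return -1;
    r->prod++;
    return 0;
}

/*
 * One RX poll: completions are handled first, then up to `budget`
 * transmits are issued, all from the same context.
 * Returns the number of packets queued.
 */
static int rx_napi_poll(struct tx_ring *r, int budget)
{
    int sent = 0;

    poll_xdp_tx_cq(r);
    while (sent < budget && xdp_xmit(r) == 0)
        sent++;
    return sent;
}
```

Because `rx_napi_poll()` is the only caller of both helpers, no locking or atomic access to `prod` and `cons` is needed in this model.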
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
      IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve XDP xmit function · 36ea7964
      Tariq Toukan authored
      Several performance improvements in XDP TX datapath,
      including:
      - Ring a single doorbell for the XDP TX ring per NAPI budget,
        instead of once per lower threshold (previously 8).
        This includes removing the flow of immediate doorbell ringing
        in case of a full TX ring.
      - Compiler branch predictor hints.
      - Calculate values at compile time rather than at run time.
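A rough illustration of the first two items, assuming a simplified ring model: the structure, field names, and functions below are hypothetical and do not come from the driver. `likely()`/`unlikely()` are defined here the way the kernel commonly defines them, on top of the GCC/Clang builtin `__builtin_expect`. The doorbell is modeled as a counter that is bumped once per poll, and a full ring simply rejects the packet without forcing an early doorbell.

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-style branch hints built on a GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

struct xdp_tx_ring {
    unsigned int used;      /* descriptors currently in the ring */
    unsigned int size;      /* ring capacity */
    unsigned int pending;   /* descriptors written since the last doorbell */
    unsigned int doorbells; /* counts simulated doorbell (MMIO) writes */
};

/* Queue one descriptor; a full ring rejects it without ringing. */
static bool xdp_tx_enqueue(struct xdp_tx_ring *r)
{
    if (unlikely(r->used == r->size))
        return false;
    r->used++;
    r->pending++;
    return true;
}

/* Ring the doorbell once for everything queued so far. */
static void xdp_tx_flush(struct xdp_tx_ring *r)
{
    if (likely(r->pending)) {
        r->doorbells++;
        r->pending = 0;
    }
}

/* One poll: queue up to `budget` packets, then ring a single doorbell. */
static int napi_poll(struct xdp_tx_ring *r, int budget)
{
    int done = 0;

    for (int i = 0; i < budget; i++)
        if (xdp_tx_enqueue(r))
            done++;
    xdp_tx_flush(r);
    return done;
}
```

With a budget of 64, this model performs one doorbell write per poll, where a threshold of 8 would have needed eight.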
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
      IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve stack xmit function · f28186d6
      Tariq Toukan authored
      Several small code and performance improvements in stack TX datapath,
      including:
      - Compiler branch predictor hints.
      - Minimize variable scope.
      - Move tx_info non-inline flow handling to a separate function.
      - Calculate data_offset at compile time rather than at run time
        (for the !lso_header_size branch).
      - Avoid the ternary operator ("?:") when the value can be preset in
        a matching branch.
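The last two items can be sketched as follows. Everything here is an assumption made for illustration: `CTRL_SEG_SIZE`, `DS_SIZE`, `ALIGN_UP`, and `struct tx_layout` are hypothetical names and values, not the driver's definitions. The sketch shows a `data_offset` that is a plain constant when there is no LSO header, and an `is_lso` flag set inside the branch that already tests the condition rather than recomputed later with `?:`.

```c
#include <assert.h>
#include <stddef.h>

#define CTRL_SEG_SIZE 16u /* hypothetical control segment size */
#define DS_SIZE       16u /* hypothetical descriptor alignment unit */

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

struct tx_layout {
    size_t data_offset; /* where the data segments start */
    int is_lso;         /* preset in the branch that already knows it */
};

static struct tx_layout tx_layout_for(size_t lso_header_size)
{
    struct tx_layout l;

    if (lso_header_size) {
        /* Offset depends on a run-time value. */
        l.data_offset = ALIGN_UP(CTRL_SEG_SIZE + lso_header_size, DS_SIZE);
        l.is_lso = 1;
    } else {
        /* Offset is a compile-time constant on this path. */
        l.data_offset = CTRL_SEG_SIZE;
        l.is_lso = 0;
    }
    return l;
}
```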
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to be measurable; no degradation observed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve transmit CQ polling · cc26a490
      Tariq Toukan authored
      Several small performance improvements in TX CQ polling,
      including:
      - Compiler branch predictor hints.
      - Minimize variable scope.
      - A more proper check of the CQ type.
      - Use a boolean instead of an int for a binary indication.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Packet-rate tests for both regular stack and XDP use cases:
      No noticeable gain, no degradation.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve receive data-path · 9bcee89a
      Tariq Toukan authored
      Several small performance improvements in RX datapath,
      including:
      - Compiler branch predictor hints.
      - Replace a multiplication with a shift operation.
      - Minimize variable scope.
      - Write-prefetch for packet header.
      - Avoid the ternary operator ("?:") when the value can be preset in
        a matching branch.
      - Save a branch by updating the RX ring doorbell within
        mlx4_en_refill_rx_buffers(), which now returns void.
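Three of these items lend themselves to a small sketch. The names and values below (`LOG_STRIDE`, `struct rx_ring`, `refill_rx_buffers`) are hypothetical and only loosely modeled on the description; this is not the body of mlx4_en_refill_rx_buffers(). `__builtin_prefetch` is a GCC/Clang builtin whose second argument of 1 requests a prefetch for write.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_STRIDE 6u /* hypothetical: 64-byte descriptors */

/* Byte offset of a descriptor: a shift replaces index * 64. */
static inline uint32_t desc_offset(uint32_t index)
{
    return index << LOG_STRIDE;
}

/* Hint that the header at p is about to be written. */
static inline void prefetch_for_write(void *p)
{
    __builtin_prefetch(p, 1);
}

struct rx_ring {
    uint32_t prod;     /* buffers posted to hardware */
    uint32_t cons;     /* buffers consumed by the poll loop */
    uint32_t size;     /* ring capacity */
    uint32_t doorbell; /* stand-in for the doorbell record */
};

/*
 * Refill consumed slots and update the doorbell in the same place,
 * so the caller has no return value to test.
 */
static void refill_rx_buffers(struct rx_ring *r)
{
    uint32_t missing = r->size - (r->prod - r->cons);

    if (!missing)
        return;
    r->prod += missing;
    r->doorbell = r->prod;
}
```

In this model the caller simply invokes `refill_rx_buffers()` at the end of a poll; the check for whether anything was refilled lives in one place instead of being repeated after the call.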
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON
      (enable by ethtool -L <interface> rx 1).
      
      XDP_DROP packet rate:
      Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).
      
      Drop packets in TC:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
      -------------------------------------
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
      IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Optimized single ring steering · 4931c6ef
      Saeed Mahameed authored
      Avoid touching RX QP RSS context when loading with only
      one RX ring, to allow optimized A0 RX steering.
      
      Enable by:
      - loading mlx4_core with module param: log_num_mgm_entry_size = -6.
      - then: ethtool -L <interface> rx 1
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_DROP packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
      IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
      -------------------------------------
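The Gain column in these tables appears to be the relative increase in packet rate, rounded to a whole percent. A minimal helper for reproducing it, assuming that definition (the function name is hypothetical):

```c
#include <assert.h>

/*
 * Relative gain in percent, rounded to the nearest integer.
 * Assumes after_mpps >= before_mpps > 0; the +0.5 rounding
 * is only suitable for non-negative gains.
 */
static long gain_percent(double before_mpps, double after_mpps)
{
    double gain = (after_mpps - before_mpps) / before_mpps * 100.0;

    return (long)(gain + 0.5);
}
```

For the table above, `gain_percent(20.5, 28.1)` gives 37 and `gain_percent(18.4, 28.1)` gives 53, matching the reported figures.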
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Remove unused argument in TX datapath function · cf97050d
      Tariq Toukan authored
      Remove owner argument, as it is obsolete and unused.
      This also saves the overhead of calculating its value in data-path.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Jun, 2017 33 commits