- 16 Oct, 2021 21 commits
-
-
Jean Sacren authored
The last argument of the device_create() call should be a template string. The tap_name variable should be the argument to that string, not an argument of the call itself. Add the template string and turn tap_name into its argument.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
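A minimal sketch of the pattern being fixed; device_create() takes a printf-style format as its last fixed argument, and the surrounding names (tap_class, devt, dev) are illustrative:

	/* Before: tap_name is used as the format string itself, so a '%'
	 * in the device name would be misinterpreted by the formatter.
	 */
	device_create(tap_class, NULL, devt, dev, tap_name);

	/* After: a constant "%s" template turns tap_name into plain data. */
	device_create(tap_class, NULL, devt, dev, "%s", tap_name);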
-
Jean Sacren authored
The last argument of the device_create() call should be a template string. The tap_name variable should be the argument to that string, not an argument of the call itself. Add the template string and turn tap_name into its argument (same fix as above, applied to a second call site).

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
David S. Miller authored
Saeed Mahameed says:

====================
mlx5-updates-2021-10-15

1) From Rongwei Liu: Use system_image_guid and native_port_num when bonding. Don't rely on PCIe IDs anymore. With some specific NICs, the physical devices may have PCIe IDs like 0001:01:00.0/1 and 0002:02:00.0/1. All of these devices should have the same system_image_guid, and the device index can be queried from native_port_num. For matching sibling devices/ports of the same HCA, compare the HCA GUID reported on each device rather than just assuming PCIe IDs have similar attributes.

2) From Amir Tzin: Use HCA-defined timeouts. Replace hard coded timeouts with values stored by firmware in the default timeouts register (DTOR). Timeouts are read during driver load. If DTOR is not supported by the firmware, fall back to the hard coded defaults instead.

3) From Shay Drory: Disable RoCE at the HCA level. Disable RoCE in firmware when the devlink roce parameter is set to off.

4) A small set of trivial cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
David S. Miller authored
Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-10-14

Maciej Machnikowski says:

Extend the driver implementation to support PTP pins on E810-T and derivative devices.

E810-T adapters are equipped with:
- 2 external bidirectional SMA connectors
- 1 internal TX U.FL shared with SMA1
- 1 internal RX U.FL shared with SMA2

The SMA and U.FL configuration is controlled by the external multiplexer.

E810-T derivatives are equipped with:
- 2 1PPS outputs on SDP20 and SDP22
- 2 1PPS inputs on SDP21 and SDP23

---
v2:
- Remove defensive programming check and simplify return statement (Patch 3)
- Remove unnecessary parentheses (Patch 4)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Mat Martineau says:

====================
mptcp: A few fixes

This set has three separate changes for the net-next tree:

Patch 1 guarantees safe handling and a warning if a NULL value is encountered when gathering subflow data for the MPTCP_SUBFLOW_ADDRS socket option.

Patch 2 increases the default number of subflows allowed per MPTCP connection.

Patch 3 makes an existing function 'static'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mat Martineau authored
This function is only used within pm_netlink.c now.

Fixes: 06706542 ("mptcp: add the outgoing MP_PRIO support")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
The current default does not allow additional subflows, mostly as a safety restriction to avoid uncontrolled resource consumption on busy servers. Still, the system admin and/or the application have to opt in to MPTCP explicitly, and after that they also need to change (increase) the default maximum number of additional subflows. Set that to a reasonable default and make end users' lives easier. Additionally, update some self-tests accordingly.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
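For reference, the per-namespace MPTCP limits can be inspected and overridden with iproute2; the value below is purely illustrative, not necessarily the new default:

	ip mptcp limits show
	ip mptcp limits set subflows 2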
-
Tim Gardner authored
Coverity complains of a possible NULL dereference in mptcp_getsockopt_subflow_addrs():

    861 } else if (sk->sk_family == AF_INET6) {
        3. returned_null: inet6_sk returns NULL. [show details]
        4. var_assigned: Assigning: np = NULL return value from inet6_sk.
    862 const struct ipv6_pinfo *np = inet6_sk(sk);

Fix this by checking for NULL.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/231
Fixes: c11c5906 ("mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt support")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[mjm: Added WARN_ON_ONCE() to the unexpected case]
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
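A minimal sketch of the defensive pattern described, with the surrounding control flow elided; WARN_ON_ONCE() flags the unexpected case without crashing:

	const struct ipv6_pinfo *np = inet6_sk(sk);

	/* inet6_sk() can return NULL here; warn once and bail out
	 * instead of dereferencing it.
	 */
	if (WARN_ON_ONCE(!np))
		return;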
-
Rongwei Liu authored
With specific NICs, the PFs may have different PCIe IDs like 0001:01:00.0/1 and 0002:02:00.0/1. For PFs with the same system_image_guid, the driver should consider them part of the same physical NIC, and it is legal to bond them together. If firmware doesn't support system_image_guid, set it to zero and fall back to using PCIe IDs.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Rongwei Liu authored
Using "native_port_num" supports more NICs. Fall back to PCIe IDs if the "native_port_num" query fails.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Rongwei Liu authored
Downstream patches.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Rongwei Liu authored
When querying system_image_guid from firmware, we should check the return value first. The buffer content is valid only if the query succeeds.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Len Baker authored
As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of overflow. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. So, refactor the code a bit to use the purpose-specific kcalloc() function instead of an open-coded size * count argument to kzalloc().

[1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
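A minimal before/after sketch of the refactor (the buffer and count names are illustrative):

	/* Before: open-coded multiplication can overflow and silently
	 * under-allocate.
	 */
	buf = kzalloc(count * sizeof(*buf), GFP_KERNEL);

	/* After: kcalloc() zeroes the memory like kzalloc() and returns
	 * NULL on multiplication overflow instead of wrapping.
	 */
	buf = kcalloc(count, sizeof(*buf), GFP_KERNEL);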
-
Abhiram R N authored
Since EOPNOTSUPP and EINVAL are returned from multiple places in the driver, it is difficult to understand the reason from the error code alone. With a netlink extack message the exact reason is known, which aids debugging.

Signed-off-by: Abhiram R N <abhiramrn@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
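A minimal sketch of the pattern; the condition and message text are illustrative, while NL_SET_ERR_MSG_MOD() is the standard helper for attaching a reason to the netlink ack:

	if (!supported_match) {
		NL_SET_ERR_MSG_MOD(extack, "Unsupported match field");
		return -EOPNOTSUPP;
	}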
-
Paul Blakey authored
If CT fails to initialize its rhashtables, it doesn't destroy the ct nat global table. Destroy the ct nat global table on ct init failure.

Fixes: d7cade51 ("net/mlx5e: check return value of rhashtable_init")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Shay Drory authored
Currently, when a user disables RoCE via the devlink param, the change isn't passed down to the device. If the device allows disabling RoCE at the device level, make use of it. This instructs the device to skip memory allocations related to RoCE functionality which are otherwise done by the device.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
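For reference, the devlink knob in question is driven like this (the PCI address is illustrative; enable_roce is a driverinit-mode parameter, so a reload applies it):

	devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
	devlink dev reload pci/0000:06:00.0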
-
Moosa Baransi authored
Enable steering IPoIB packets via ethtool, the same way it is done today for Ethernet packets.

Signed-off-by: Moosa Baransi <moosab@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Vlad Buslov authored
Currently, SMFS mode doesn't support rx-loopback flows, which causes bridge egress rules to be rejected because, without a hint, rules for both rx and tx destinations are created by default. Provide explicit flow source hints for compatibility with SMFS.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Amir Tzin authored
Replace hard coded timeouts with values stored by firmware in the default timeouts register (DTOR). Timeouts are read during driver load. If DTOR is not supported by the firmware, fall back to the hard coded defaults instead.

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Amir Tzin authored
Replace hard coded timeouts with values stored in the firmware's init segment. Timeouts are read from the init segment during driver load. If init segment timeouts are not supported, fall back to the hard coded defaults instead. Also move pre-initialization timeouts, which cannot be read from firmware, to the new mechanism.

Signed-off-by: Amir Tzin <amirtz@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Amir Tzin authored
Add the needed structures and defines for DTOR (default timeouts register). This will be used to get timeout values from FW instead of hard coded values in the driver code, thus enabling support for slower devices which need longer timeouts.

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
- 15 Oct, 2021 19 commits
-
-
Maciej Fijalkowski authored
Go through the code base and use the ice_for_each_* macros. While at it, introduce an ice_for_each_xdp_txq() macro that can be used for looping over the xdp_rings array. The commit does not introduce any new functionality.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
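The new macro presumably follows the existing ice_for_each_* convention; a sketch:

	#define ice_for_each_xdp_txq(vsi, i) \
		for ((i) = 0; (i) < (vsi)->num_xdp_txq; (i)++)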
-
Maciej Fijalkowski authored
Under rare circumstances there might be a situation where the requirement of having an XDP Tx queue per CPU cannot be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need to place accesses to xdp_ring inside a critical section protected by a spinlock. These accesses happen to be in the hot path, so introduce a static branch that will be triggered from the control plane when the driver cannot provide a Tx queue dedicated to XDP on each CPU.

The design that has been picked is to allow any number of XDP Tx queues that is at least half the count of CPUs that the platform has. For a lower number, the driver will bail out with a response to the user that there were not enough Tx resources to allow configuring XDP. The sharing of rings is signalled via static branch enablement, which in turn indicates that the lock for xdp_ring accesses needs to be taken in the hot path. The approach based on a static branch has no impact on the performance of the non-fallback path.

One thing worth mentioning is that the static branch will act as a global driver switch, meaning that if one PF runs out of Tx resources, the other PFs that the ice driver is servicing will suffer as well. However, given that the HW handled by the ice driver has 1024 Tx queues per PF, this is currently an unlikely scenario.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
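A minimal sketch of the static-branch scheme described; the key name and lock field follow the description but are written here as illustration, not as the driver's exact code:

	DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);

	/* control plane: flip the key when XDP Tx queues must be shared */
	if (vsi->num_xdp_txq < num_possible_cpus())
		static_branch_inc(&ice_xdp_locking_key);

	/* hot path: compiled to a NOP when no sharing is in effect */
	if (static_branch_unlikely(&ice_xdp_locking_key))
		spin_lock(&xdp_ring->tx_lock);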
-
Maciej Fijalkowski authored
Optimize Tx descriptor cleaning for XDP. The current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs, that keep track of the descriptor that should be looked at when the need for cleaning arises and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable.

The first thing is to get away from setting the RS bit on each descriptor. Do this only once NTU is higher than the current @next_rs value. In that case, grab tx_desc[next_rs], set the RS bit in the descriptor and advance @next_rs by 32.

The second thing is to clean the Tx ring only when there are fewer than 32 free entries. For that case, look up tx_desc[next_dd] for the DD bit. This bit is written back by HW to let the driver know that the xmit was successful. It will happen only for those descriptors that had the RS bit set. Clean only 32 descriptors and advance @next_dd.

The actual cleaning routine is moved from ice_napi_poll() down to ice_xmit_xdp_ring(). It is safe to do so as the XDP ring will not get any SKBs that would rely on interrupts for cleaning. A nice side effect is that for the rare Tx fallback path (which the next patch is going to introduce) we don't have to trigger the SW irq to clean the ring.

With those two concepts the ring is kept almost full, but it is guaranteed that the driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced one by one. The test was conducted with the ice HW bombarded with packets from a HW generator configured to generate 30 flows.

The xdp2 sample yields the following results:
<snip>
proto 17: 79973066 pkt/s
proto 17: 80018911 pkt/s
proto 17: 80004654 pkt/s
proto 17: 79992395 pkt/s
proto 17: 79975162 pkt/s
proto 17: 79955054 pkt/s
proto 17: 79869168 pkt/s
proto 17: 79823947 pkt/s
proto 17: 79636971 pkt/s
</snip>

As that sample reports the Rx'ed frames, let's look at the sar output. It says that what we Rx'ed we do actually Tx, with no noticeable drops:

Average: IFACE   rxpck/s      txpck/s      rxkB/s     txkB/s     rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1  79842324.00  79842310.40  4678261.17 4678260.38 0.00    0.00    0.00     38.32

with tx_busy staying calm. When compared to the state before:

Average: IFACE   rxpck/s      txpck/s      rxkB/s     txkB/s     rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1  90919711.60  42233822.60  5327326.85 2474638.04 0.00    0.00    0.00     43.64

it can be observed that the amount of txpck/s is almost doubled, meaning that performance is improved by around 90%. All of this is due to drops in the driver; previously the tx_busy stat was bumped at a 7 Mpps rate.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
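A condensed sketch of the two-field scheme under the constant threshold of 32; the macro and helper names here are illustrative rather than the driver's exact code:

	#define TX_THRESH 32

	/* Set RS only once per TX_THRESH produced descriptors. */
	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
		tx_desc->cmd_type_offset_bsz |=
			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
		xdp_ring->next_rs += TX_THRESH;
	}

	/* Clean a TX_THRESH batch only when the ring is nearly full and
	 * HW has written the DD bit back at tx_desc[next_dd].
	 */
	if (free_descs(xdp_ring) < TX_THRESH)
		clean_xdp_ring(xdp_ring);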
-
Maciej Fijalkowski authored
With the rings being split, it is now convenient to introduce a pointer to the XDP ring within the Rx ring. For XDP_TX workloads this means that the xdp_rings array access, previously executed per processed frame, will be skipped. Also, read the XDP prog once per NAPI and, if the prog is present, set up the local xdp_ring pointer. Reading the prog a single time was discussed in [1], with some concern raised by Toke around dispatcher handling and the need to go through the RCU grace period in the ndo_bpf driver callback, but ice currently tears down NAPI instances regardless of the prog's presence on the VSI. Although the pointer to the XDP ring introduced to the Rx ring makes things a lot slimmer/simpler, I still feel that a single prog read per NAPI lifetime is beneficial. A further patch that introduces the fallback path will also profit from this, as the xdp_ring pointer will be set during the XDP rings setup.

[1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Maciej Fijalkowski authored
xdp_frame is not needed for the XDP_TX data path in the ice driver's case. For this data path, cleaning of the sent descriptor will not happen anywhere outside of the driver, which means that the information about the underlying memory model carried via xdp_frame will not be used. Therefore, this conversion can simply be dropped, which relieves the CPU a bit.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Maciej Fijalkowski authored
There has been a long-lasting issue of improper xdp_rings indexing for XDP_TX and XDP_REDIRECT actions. Given that currently rx_ring->q_index is mixed with smp_processor_id(), there could be a situation where Tx descriptors are produced onto the XDP Tx ring but the tail is never bumped, for example when a particular queue id is pinned to a non-matching IRQ line. Address this problem by ignoring the user ring count setting and always initializing the xdp_rings array to be of num_possible_cpus() size. Then, always use smp_processor_id() as the index into the xdp_rings array. This provides serialization, as at any given time only a single softirq can run on a particular CPU.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
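A sketch of the indexing change (illustrative):

	/* Before: ring choice keyed off the Rx queue, which may not match
	 * the CPU the softirq actually runs on, so the tail bump could be
	 * missed.
	 */
	xdp_ring = vsi->xdp_rings[rx_ring->q_index];

	/* After: one XDP Tx ring per possible CPU, indexed by the executing
	 * CPU; only one softirq runs per CPU at a time, which serializes
	 * ring accesses.
	 */
	xdp_ring = vsi->xdp_rings[smp_processor_id()];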
-
Maciej Fijalkowski authored
While it was convenient to have a generic ring structure that served both the Tx and Rx sides, the next commits are going to introduce several Tx-specific fields, so in order to avoid hurting the Rx side, let's pull the Tx ring out into new ice_tx_ring and ice_rx_ring structs. The Rx ring could be handled by the old ice_ring, which would reduce the code churn within this patch, but that would make things asymmetric. Make a union out of the ring container within ice_q_vector so that it is possible to iterate over the newly introduced ice_tx_ring. Remove @size, as it is only accessed from the control path and can be calculated pretty easily. Change the definitions of ice_update_ring_stats and ice_fetch_u64_stats_per_ring so that they are ring-agnostic and can be used for both Rx and Tx rings. The sizes of the Rx and Tx ring structs are 256 and 192 bytes, respectively. In the Rx ring, xdp_rxq_info occupies its own cacheline, so that is the major difference now.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Maciej Fijalkowski authored
Currently, ice_container_type is scoped only to ice_ethtool.c. The next commit, which splits the ice_ring struct into Rx/Tx-specific ring structs, is also going to modify the type of the linked list of rings within ice_ring_container. Therefore, the functions that take ice_ring_container as an input argument will need to be aware of the ring type being looked up. Embed ice_container_type within ice_ring_container and initialize it properly when allocating the q_vectors.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Maciej Fijalkowski authored
This field is dead and the driver makes no use of it. Simply remove it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
David S. Miller authored
Ioana Ciornei says:

====================
dpaa2-eth: add support for IRQ coalescing

This patch set adds support for interrupt coalescing in dpaa2-eth. The first patches add support for the hardware-level configuration of IRQ coalescing in the dpio driver, while the ones that touch the dpaa2-eth driver are responsible for the ethtool user interaction.

With adaptive IRQ coalescing in place and enabled, we have observed the following changes in interrupt rates on one A72 core @2.2GHz (LX2160A) while running an Rx TCP flow. The TCP stream is sent on a 10Gbit link and the only CPU that does Rx is fully utilized.

                            IRQ rate (irqs/sec)
    before: 4.59 Gbits/sec    24k
    after:  5.67 Gbits/sec    1.3k
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ioana Ciornei authored
Add support for adaptive interrupt coalescing to the dpaa2-eth driver. First of all, ETHTOOL_COALESCE_USE_ADAPTIVE_RX is defined as a supported coalesce parameter, and the requested state is configured through the dpio APIs added in the previous patch. Besides the ethtool API interaction, we keep track of how many bytes and frames are dequeued per CDAN (Channel Data Availability Notification) and update the Net DIM instance through the dpaa2_io_update_net_dim() API.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
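For reference, the resulting behavior is driven through the standard ethtool coalescing interface; the interface name and value below are illustrative:

	ethtool -C eth0 adaptive-rx on              # let Net DIM pick rx-usecs
	ethtool -C eth0 adaptive-rx off rx-usecs 64 # fixed manual setting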
-
Ioana Ciornei authored
Use the generic dynamic interrupt moderation (dim) framework to implement adaptive interrupt coalescing on Rx. With the per-packet interrupt scheme, a high interrupt rate has been noted for moderate traffic flows, leading to high CPU utilization. The dpio driver exports new functions to enable/disable adaptive IRQ coalescing on a DPIO object, to query the state, or to update Net DIM with a new set of bytes and frames dequeued.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ioana Ciornei authored
Use the newly exported dpio driver API to manually configure the IRQ coalescing parameters requested by the user. The .get_coalesce() and .set_coalesce() ethtool callbacks are implemented and directly export or set up rx-usecs on all the configured channels.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ioana Ciornei authored
In DPAA2 based SoCs, the IRQ coalescing support per software portal has 2 configurable parameters:
- the IRQ timeout period (QBMAN_CINH_SWP_ITPR): how many 256 QBMAN cycles need to pass until a dequeue interrupt is asserted.
- the IRQ threshold (QBMAN_CINH_SWP_DQRR_ITR): how many dequeue responses in the DQRR ring would generate an IRQ.

Add support for setting up and querying these IRQ coalescing related parameters.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
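A sketch of the microseconds-to-register conversion implied by the 256-cycle granularity; the helper name and rounding behavior are assumptions, not the driver's actual code:

	/* One ITP unit is 256 QBMAN clock cycles. */
	static u32 qbman_usecs_to_itp(u32 usecs, u32 qbman_clk_hz)
	{
		u64 cycles = (u64)usecs * qbman_clk_hz;

		do_div(cycles, 1000000); /* cycles elapsed in 'usecs' */
		do_div(cycles, 256);     /* scale down to ITP units */
		return cycles;
	}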
-
Ioana Ciornei authored
Through the dpio_get_attributes() firmware call, the dpio driver has access to the QBMAN clock frequency. Extend the structure which holds the firmware's response so that we can have access to this information. This will be needed in the next patches, which add support for interrupt coalescing; it has to be configured based on this frequency.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says:

====================
net/sched: implement L4S style ce_threshold_ect1 marking

As suggested by Ingemar Johansson, Neal Cardwell, and others, fq_codel can be used for Low Latency, Low Loss, Scalable Throughput (L4S) with a small change. In ce_threshold_ect1 mode, only ECT(1) packets can be marked CE if their sojourn time is above the threshold.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Add a TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency, Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold. If enabled, only packets with ECT(1) can be transformed to CE if their sojourn time is above the ce_threshold.

Note that this new option does not change the rules of the codel law. In particular, if TCA_FQ_CODEL_ECN is left enabled (the default when the fq_codel qdisc is created), ECT(0) packets can still get CE if the codel law (as governed by limit/target) decides so.

Section 4.3.b of the current draft [1] states:

  b. A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can be used for L4S. For instance within each queue of an FQ-CoDel system, as well as a CoDel AQM, there is typically also ECN marking at an immediate (unsmoothed) shallow threshold to support use in data centres (see Sec.5.2.7 of [RFC8290]). This can be modified so that the shallow threshold is solely applied to ECT(1) packets. Then if there is a flow of non-ECN or ECT(0) packets in the per-flow-queue, the Classic AQM (e.g. CoDel) is applied; while if there is a flow of ECT(1) packets in the queue, the shallower (typically sub-millisecond) threshold is applied.

Tested:

	tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec
	netperf ... -t TCP_STREAM -- -K dctcp
	tc -s -d qd sh dev eth1

	qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
	 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
	 backlog 0b 0p requeues 152013
	  maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
	  new_flows_len 0 old_flows_len 0

[1] L4S current draft: https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Cc: Bob Briscoe <in@bobbriscoe.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
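A condensed sketch of the marking rule this option adds; the field names follow the patch description and the surrounding dequeue plumbing is elided:

	/* In ce_threshold_ect1 mode the shallow CE threshold applies only
	 * to packets carrying ECT(1) in the ECN field.
	 */
	if (sojourn_time > p->ce_threshold &&
	    (!p->ce_threshold_ect1 ||
	     (skb_get_dsfield(skb) & INET_ECN_MASK) == INET_ECN_ECT_1))
		INET_ECN_set_ce(skb);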
-
Eric Dumazet authored
skb_get_dsfield(skb) gets the dsfield from the skb, or -1 if an error was found. This is basically a wrapper around ipv4_get_dsfield() and ipv6_get_dsfield(). It is used by the following patch for fq_codel.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
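A sketch of what such a wrapper looks like, assuming the usual header-pull checks before reading the IP header:

	static inline int skb_get_dsfield(struct sk_buff *skb)
	{
		switch (skb_protocol(skb, true)) {
		case cpu_to_be16(ETH_P_IP):
			if (!pskb_network_may_pull(skb, sizeof(struct iphdr)))
				break;
			return ipv4_get_dsfield(ip_hdr(skb));
		case cpu_to_be16(ETH_P_IPV6):
			if (!pskb_network_may_pull(skb, sizeof(struct ipv6hdr)))
				break;
			return ipv6_get_dsfield(ipv6_hdr(skb));
		}
		return -1;
	}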
-
Eric Dumazet authored
Use of the percpu_counter structure to track the count of orphaned sockets is causing problems on modern hosts with 256 cpus or more. Stefan Bach reported serious spinlock contention in real workloads, which I was able to reproduce with a netfilter rule dropping incoming FIN packets:

    53.56%  server  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
               --53.51%--_raw_spin_lock_irqsave
                         |
                         --53.51%--__percpu_counter_sum
                                   tcp_check_oom
                                   |
                                   |--39.03%--__tcp_close
                                   |          tcp_close
                                   |          inet_release
                                   |          inet6_release
                                   |          sock_close
                                   |          __fput
                                   |          ____fput
                                   |          task_work_run
                                   |          exit_to_usermode_loop
                                   |          do_syscall_64
                                   |          entry_SYSCALL_64_after_hwframe
                                   |          __GI___libc_close
                                   |
                                   --14.48%--tcp_out_of_resources
                                             tcp_write_timeout
                                             tcp_retransmit_timer
                                             tcp_write_timer_handler
                                             tcp_write_timer
                                             call_timer_fn
                                             expire_timers
                                             __run_timers
                                             run_timer_softirq
                                             __softirqentry_text_start

As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), the default batch size is too big for the default value of tcp_max_orphans (262144). But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit and tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive().

One solution is to use plain per-cpu counters and have a timer periodically refresh a cached sum. Updating this cache every 100ms seems about right; TCP pressure state does not change radically over shorter periods. percpu_counter was nice 15 years ago, when hosts had fewer than 16 cpus; not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com>. Remove the unused socket argument from tcp_too_many_orphans().

Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
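A minimal sketch of the scheme described: plain per-cpu counters plus a cached sum refreshed on a 100ms cadence, so the hot path reads a single int. All names here are illustrative, not the patch's actual symbols:

	static DEFINE_PER_CPU(int, orphan_count);
	static int orphan_cache; /* refreshed every 100ms */

	static void orphan_refresh(struct work_struct *work);
	static DECLARE_DELAYED_WORK(orphan_work, orphan_refresh);

	static int orphan_sum(void)
	{
		int cpu, total = 0;

		for_each_possible_cpu(cpu)
			total += per_cpu(orphan_count, cpu);
		return total;
	}

	static void orphan_refresh(struct work_struct *work)
	{
		/* The expensive cross-CPU walk happens here, off the hot path. */
		WRITE_ONCE(orphan_cache, orphan_sum());
		schedule_delayed_work(&orphan_work, msecs_to_jiffies(100));
	}

	/* hot path, e.g. a tcp_too_many_orphans()-style check */
	static bool too_many_orphans(int shift)
	{
		return READ_ONCE(orphan_cache) << shift > sysctl_tcp_max_orphans;
	}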
-