Commits · 9e86a297796bc35a8d2a63f8c6d16bd7262dc948 · Kirill Smelkov / linux

19 Jul, 2023 35 commits

selftests: mptcp: sockopt: format subtests results in TAP · 9e86a297

Matthieu Baerts authored Jul 17, 2023

The current selftests infrastructure formats the results in TAP 13. This
version doesn't support subtests and only the end result of each
selftest is taken into account. It means that a single issue in a
subtest of a selftest containing multiple subtests forces the whole
selftest to be marked as failed. It also means that subtests results are
not tracked by CIs executing selftests.

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

It is particularly interesting to do that when validating stable kernels
with the last version of the test suite: tests might fail because a
feature is not supported but the test didn't skip that part. In this
case, if subtests are not tracked, the whole selftest will be marked as
failed making the other subtests useless because their results are
ignored.

This patch formats subtests results in TAP in mptcp_sockopt.sh selftest.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e86a297

selftests: mptcp: simult flows: format subtests results in TAP · 675d9933

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch formats subtests results in TAP in simult_flows.sh selftest.

675d9933

selftests: mptcp: diag: format subtests results in TAP · ce990257

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch formats subtests results in TAP in diag.sh selftest.

ce990257

selftests: mptcp: join: format subtests results in TAP · 7f117cd3

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch formats subtests results in TAP in mptcp_join.sh selftest.

In this selftest and before starting each subtest, the 'reset' function
is called. We can then check if the previous test has passed, failed or
has been skipped from there.

7f117cd3

selftests: mptcp: pm_netlink: format subtests results in TAP · d85555ac

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch formats subtests results in TAP in pm_netlink.sh selftest.

d85555ac

selftests: mptcp: connect: format subtests results in TAP · dd350f46

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch formats subtests results in TAP in mptcp_connect.sh selftest.

dd350f46

selftests: mptcp: lib: format subtests results in TAP · c4192967

Matthieu Baerts authored Jul 17, 2023

MPTCP selftests run hundreds of various subtests. It is then important
to track each of them and not one result per selftest.

This patch adds some helpers in mptcp_lib.sh to be able to easily format
subtests results in TAP in the different MPTCP selftests.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/368Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4192967

selftests: mptcp: userspace_pm: reduce dup code around printf · d8463d81

Matthieu Baerts authored Jul 17, 2023

In this selftest, "printf" is always used with "stdbuf".

With a new helper, it is possible to call "stdbuf" only from one place.
This makes the code a bit clearer to read.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8463d81

selftests: mptcp: userspace_pm: uniform results printing · e198ad75

Matthieu Baerts authored Jul 17, 2023

There are a few reasons to do that:

- When the tabs are not printed as 8 spaces, some results were not
  properly aligned

- Some lines printing the test name were very long due to the use of a
  lot of spaces/tabs at the end and stdbuf at the beginning.

- To reduce duplicated code, e.g. to print what has failed and set the
  status

But by centralising how the test results are printed, this also prepares
future commits to avoid more duplicated code and ease the tracking of
the different subtests.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

e198ad75

selftests: mptcp: userspace_pm: fix shellcheck warnings · 8320b138

Matthieu Baerts authored Jul 17, 2023

shellcheck recently helped to find an issue where a wrong variable name
was used. It is then good to fix the other harmless issues in order to
spot "real" ones later.

Here, three categories of warnings are ignored:

- SC2317: Command appears to be unreachable. The cleanup() function is
  invoke indirectly via the EXIT trap.

- SC2034: Variable appears unused. The check_expected_one() function
  takes the name of the variable in argument but it ends up reading the
  content: indirect usage.

- SC2086: Double quote to prevent globbing and word splitting. This is
  recommended but the current usage is correct and there is no need to
  do all these modifications to be compliant with this rule.

One error has been fixed with SC2181: Check exit code directly with e.g.
'if ! mycmd;', not indirectly with $?.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

8320b138

selftests: mptcp: userspace pm: don't stop if error · e141c1e8

Matthieu Baerts authored Jul 17, 2023

No more tests were executed after a failure but it is still interesting
to get results for all the tests to better understand what's still OK
and what's not after a modification.

Now we only exit earlier if the two connections cannot be established.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

e141c1e8

selftests: mptcp: connect: don't stop if error · edbc16c4

Matthieu Baerts authored Jul 17, 2023

No more tests were executed after a failure but it is still interesting
to get results for all the tests to better understand what's still OK
and what's not after a modification.

Now we only exit earlier if the basic tests are failing: no ping going
through namespaces or unable to transfer data on the loopback interface.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

edbc16c4

net: stmmac: xgmac: Fix L3L4 filter count · 47448ff2

Rohan G Thomas authored Jul 17, 2023

Get the exact count of L3L4 filters when the L3L4FNUM field of
HW_FEATURE1 register is >= 8. If L3L4FNUM < 8, then the number of L3L4
filters supported by XGMAC is equal to L3L4FNUM. From L3L4FNUM >= 8
the number of L3L4 filters goes on like 8, 16, 32, ... Current
maximum of L3L4FNUM = 10.

Also, fix the XGMAC_IDDR bitmask of L3L4_ADDR_CTRL register. IDDR
field starts from the 8th bit of the L3L4_ADDR_CTRL register. IDDR[3:0]
indicates the type of L3L4 filter register while IDDR[8:4] indicates
the filter number (0 to 31). So overall 9 bits are used for IDDR
(i.e. L3L4_ADDR_CTRL[16:8]) to address the registers of all the
filters. Currently, XGMAC_IDDR is GENMASK(15,8), causing issues
accessing L3L4 filters above 15 for those XGMACs configured with more
than 16 L3L4 filters.
Signed-off-by: Rohan G Thomas <rohan.g.thomas@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47448ff2

Merge branch 'backup-nexthop-ID' · b3f937f1

David S. Miller authored Jul 19, 2023

Ido Schimmel says:

====================
Add backup nexthop ID support

tl;dr
=====

This patchset adds a new bridge port attribute specifying the nexthop
object ID to attach to a redirected skb as tunnel metadata. The ID is
used by the VXLAN driver to choose the target VTEP for the skb. This is
useful for EVPN multi-homing, where we want to redirect local
(intra-rack) traffic upon carrier loss through one of the other VTEPs
(ES peers) connected to the target host.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches in a rack. These switches act as VTEPs and are not directly
connected (as opposed to MLAG), but can communicate with each other (as
well as with VTEPs in remote racks) via spine switches over L3.

The control plane uses Type 1 routes [1] to create a mapping between an
ES and VTEPs where the ES has active links. In addition, the control
plane uses Type 2 routes [2] to create a mapping between {MAC, VLAN} and
an ES.

These tables are then used by the control plane to instruct VTEPs how to
reach remote hosts. For example, assuming {MAC X, VLAN Y} is accessible
via ES1 and this ES has active links to VTEP1 and VTEP2. The control
plane will program the following entries to a remote VTEP:

 # ip nexthop add id 1 via $VTEP1_IP fdb
 # ip nexthop add id 2 via $VTEP2_IP fdb
 # ip nexthop add id 10 group 1/2 fdb
 # bridge fdb add $MAC_X dev vx0 master extern_learn vlan $VLAN_Y
 # bridge fdb add $MAC_Y dev vx0 self extern_learn nhid 10 src_vni $VNI_Y

Remote traffic towards the host will be load balanced between VTEP1 and
VTEP2. If the control plane notices a carrier loss on the ES1 link
connected to VTEP1, it will issue a Type 1 route withdraw, prompting
remote VTEPs to remove the effected nexthop from the group:

 # ip nexthop replace id 10 group 2 fdb

Motivation
==========

While remote traffic can be redirected to a VTEP with an active ES link
by withdrawing a Type 1 route, the same is not true for local traffic. A
host that is multi-homed to VTEP1 and VTEP2 via another ES (e.g., ES2)
will send its traffic to {MAC X, VLAN Y} via one of these two switches,
according to its LAG hash algorithm which is not under our control. If
the traffic arrives at VTEP1 - which no longer has an active ES1 link -
it will be dropped due to the carrier loss.

In MLAG setups, the above problem is solved by redirecting the traffic
through the peer link upon carrier loss. This is achieved by defining
the peer link as the backup port of the host facing bond. For example:

 # bridge link set dev bond0 backup_port bond_peer

Unlike MLAG, there is no peer link between the leaf switches in EVPN.
Instead, upon carrier loss, local traffic should be redirected through
one of the active ES peers. This can be achieved by defining the VXLAN
port as the backup port of the host facing bonds. For example:

 # bridge link set dev es1_bond backup_port vx0

However, the VXLAN driver is not programmed with FDB entries for locally
attached hosts and therefore does not know to which VTEP to redirect the
traffic to. This will result in the traffic being replicated to all the
VTEPs (potentially hundreds) in the network and each VTEP dropping the
traffic, except for the active ES peer.

Avoiding the flooding by programming local FDB entries in the VXLAN
driver is not a viable solution as it requires to significantly increase
the number of programmed FDB entries.

Implementation
==============

The proposed solution is to create an FDB nexthop group for each ES with
the IP addresses of the active ES peers and set this ID as the backup
nexthop ID (new bridge port attribute) of the ES link. For example, on
VTEP1:

 # ip nexthop add id 1 via $VTEP2_IP fdb
 # ip nexthop add id 10 group 1 fdb
 # bridge link set dev es1_bond backup_nhid 10
 # bridge link set dev es1_bond backup_port vx0

When the ES link loses its carrier, traffic will be redirected to the
VXLAN port, but instead of only attaching the tunnel ID (i.e., VNI) as
tunnel metadata to the skb, the backup nexthop ID will be attached as
well. The VXLAN driver will then use this information to forward the skb
via the nexthop object associated with the ID, as if the skb hit an FDB
entry associated with this ID.

Testing
=======

A test for both the existing backup port attribute as well as the new
backup nexthop ID attribute is added in patch #4.

Patchset overview
=================

Patch #1 extends the tunnel key structure with the new nexthop ID field.

Patch #2 uses the new field in the VXLAN driver to forward packets via
the specified nexthop ID.

Patch #3 adds the new backup nexthop ID bridge port attribute and
adjusts the bridge driver to attach the ID as tunnel metadata upon
redirection.

Patch #4 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

* Added Nik's tags.

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-7.1
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
[3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1
[4] https://lore.kernel.org/netdev/20230713070925.3955850-1-idosch@nvidia.com/
====================
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3f937f1

selftests: net: Add bridge backup port and backup nexthop ID test · b4084530

Ido Schimmel authored Jul 17, 2023

Add test cases for bridge backup port and backup nexthop ID, testing
both good and bad flows.

Example truncated output:

 # ./test_bridge_backup_port.sh
 [...]
 Tests passed:  83
 Tests failed:   0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4084530

bridge: Add backup nexthop ID support · 29cfb2aa

Ido Schimmel authored Jul 17, 2023

Add a new bridge port attribute that allows attaching a nexthop object
ID to an skb that is redirected to a backup bridge port with VLAN
tunneling enabled.

Specifically, when redirecting a known unicast packet, read the backup
nexthop ID from the bridge port that lost its carrier and set it in the
bridge control block of the skb before forwarding it via the backup
port. Note that reading the ID from the bridge port should not result in
a cache miss as the ID is added next to the 'backup_port' field that was
already accessed. After this change, the 'state' field still stays on
the first cache line, together with other data path related fields such
as 'flags and 'vlgrp':

struct net_bridge_port {
        struct net_bridge *        br;                   /*     0     8 */
        struct net_device *        dev;                  /*     8     8 */
        netdevice_tracker          dev_tracker;          /*    16     0 */
        struct list_head           list;                 /*    16    16 */
        long unsigned int          flags;                /*    32     8 */
        struct net_bridge_vlan_group * vlgrp;            /*    40     8 */
        struct net_bridge_port *   backup_port;          /*    48     8 */
        u32                        backup_nhid;          /*    56     4 */
        u8                         priority;             /*    60     1 */
        u8                         state;                /*    61     1 */
        u16                        port_no;              /*    62     2 */
        /* --- cacheline 1 boundary (64 bytes) --- */
[...]
} __attribute__((__aligned__(8)));

When forwarding an skb via a bridge port that has VLAN tunneling
enabled, check if the backup nexthop ID stored in the bridge control
block is valid (i.e., not zero). If so, instead of attaching the
pre-allocated metadata (that only has the tunnel key set), allocate a
new metadata, set both the tunnel key and the nexthop object ID and
attach it to the skb.

By default, do not dump the new attribute to user space as a value of
zero is an invalid nexthop object ID.

The above is useful for EVPN multihoming. When one of the links
composing an Ethernet Segment (ES) fails, traffic needs to be redirected
towards the host via one of the other ES peers. For example, if a host
is multihomed to three different VTEPs, the backup port of each ES link
needs to be set to the VXLAN device and the backup nexthop ID needs to
point to an FDB nexthop group that includes the IP addresses of the
other two VTEPs. The VXLAN driver will extract the ID from the metadata
of the redirected skb, calculate its flow hash and forward it towards
one of the other VTEPs. If the ID does not exist, or represents an
invalid nexthop object, the VXLAN driver will drop the skb. This
relieves the bridge driver from the need to validate the ID.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

29cfb2aa

vxlan: Add support for nexthop ID metadata · d977e1c8

Ido Schimmel authored Jul 17, 2023

VXLAN FDB entries can point to FDB nexthop objects. Each such object
includes the IP address(es) of remote VTEP(s) via which the target host
is accessible. Example:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 via 192.0.2.17 fdb
 # ip nexthop add id 1000 group 1/2 fdb
 # bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 1000 src_vni 10020

This is useful for EVPN multihoming where a single host can be connected
to multiple VTEPs. The source VTEP will calculate the flow hash of the
skb and forward it towards the IP address of one of the VTEPs member in
the nexthop group.

There are cases where an external entity (e.g., the bridge driver) can
provide not only the tunnel ID (i.e., VNI) of the skb, but also the ID
of the nexthop object via which the skb should be forwarded.

Therefore, in order to support such cases, when the VXLAN device is in
external / collect metadata mode and the tunnel info attached to the skb
is of bridge type, extract the nexthop ID from the tunnel info. If the
ID is valid (i.e., non-zero), forward the skb via the nexthop object
associated with the ID, as if the skb hit an FDB entry associated with
this ID.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d977e1c8

ip_tunnels: Add nexthop ID field to ip_tunnel_key · 8bb5e825

Ido Schimmel authored Jul 17, 2023

Extend the ip_tunnel_key structure with a field indicating the ID of the
nexthop object via which the skb should be routed.

The field is going to be populated in subsequent patches by the bridge
driver in order to indicate to the VXLAN driver which FDB nexthop object
to use in order to reach the target host.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8bb5e825

Merge branch 'remove-unnecessary-void-conversions' · 3223eeaf

Jakub Kicinski authored Jul 18, 2023

Wu Yunchuan says:

====================
Remove unnecessary (void*) conversions

Remove (void*) conversions under "drivers/net" directory.

PATCH v2 link:
https://lore.kernel.org/all/20230710063828.172593-1-suhui@nfschina.com/
PATCH v1 link:
https://lore.kernel.org/all/20230628024121.1439149-1-yunchuan@nfschina.com/
====================

Link: https://lore.kernel.org/r/20230717030937.53818-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3223eeaf

net: bna: Remove unnecessary (void*) conversions · 1d5123ef

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct bnad_tx_info *) or
(struct bnad_rx_info *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Link: https://lore.kernel.org/r/20230717031229.55169-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1d5123ef

can: ems_pci: Remove unnecessary (void*) conversions · 9235e3bc

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct ems_pci_card *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20230717031221.55073-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9235e3bc

net: mdio: Remove unnecessary (void*) conversions · 04115deb

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct xgene_mdio_pdata *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230717031212.54991-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

04115deb

ethernet: smsc: remove unnecessary (void*) conversions · 099090c6

Wu Yunchuan authored Jul 17, 2023

No need cast (voidd*) to (struct smsc911x_data *) or
(struct smsc9420_pdata *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Link: https://lore.kernel.org/r/20230717031204.54912-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

099090c6

ice: remove unnecessary (void*) conversions · c59cc267

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct ice_ring_container *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Link: https://lore.kernel.org/r/20230717031154.54740-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c59cc267

net: hns: Remove unnecessary (void*) conversions · 406eb9cf

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct hns_mdio_device *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Reviewed-by: Hao Lan <lanhao@huawei.com>
Link: https://lore.kernel.org/r/20230717031137.54639-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

406eb9cf

net: hns3: remove unnecessary (void*) conversions. · 14fbcad0

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct hns3_nic_priv *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Reviewed-by: Hao Lan <lanhao@huawei.com>
Link: https://lore.kernel.org/r/20230717031128.54557-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

14fbcad0

net: ppp: Remove unnecessary (void*) conversions · 89c04d6c

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct sock *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20230717031115.54432-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

89c04d6c

net: atlantic: Remove unnecessary (void*) conversions · f15fbe46

Wu Yunchuan authored Jul 17, 2023

No need cast (void*) to (struct hw_atl2_priv *).
Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com>
Link: https://lore.kernel.org/r/20230717031055.54266-1-yunchuan@nfschina.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f15fbe46

tcp: get rid of sysctl_tcp_adv_win_scale · dfa2f048

Eric Dumazet authored Jul 17, 2023

With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:

TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.

tcp_win_from_space() implementation is limited to few cases.

For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.

For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.

For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.

This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb->len/skb->truesize ratio.

This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dfa2f048

Merge branch 'net-mana-fix-doorbell-access-for-receive-queues' · 63c8778d

Jakub Kicinski authored Jul 18, 2023

Long Li says:

====================
net: mana: Fix doorbell access for receive queues

This patchset fixes the issues discovered during 200G physical link
tests. It fixes doorbell usage and WQE format for receive queues.
====================

Link: https://lore.kernel.org/r/1689622539-5334-1-git-send-email-longli@linuxonhyperv.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

63c8778d

net: mana: Use the correct WQE count for ringing RQ doorbell · f5e39b57

Long Li authored Jul 17, 2023

The hardware specification specifies that WQE_COUNT should set to 0 for
the Receive Queue. Although currently the hardware doesn't enforce the
check, in the future releases it may check on this value.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1689622539-5334-3-git-send-email-longli@linuxonhyperv.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f5e39b57

net: mana: Batch ringing RX queue doorbell on receiving packets · da4e8648

Long Li authored Jul 17, 2023

It's inefficient to ring the doorbell page every time a WQE is posted to
the received queue. Excessive MMIO writes result in CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.

Move the code for ringing doorbell page to where after we have posted all
WQEs to the receive queue during a callback from napi_poll().

With this change, tests showed an improvement from 120G/s to 160G/s on a
200G physical link, with 16 or 32 hardware queues.

Tests showed no regression in network latency benchmarks on single
connection.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1689622539-5334-2-git-send-email-longli@linuxonhyperv.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

da4e8648

net: mvpp2: debugfs: remove redundant parameter check in three functions · f8e34332

Minjie Du authored Jul 17, 2023

As per the comment above debugfs_create_dir(), it is not expected to
return an error, so an extra error check is not needed.
Drop the return check of debugfs_create_dir() in
mvpp2_dbgfs_c2_entry_init(), mvpp2_dbgfs_flow_tbl_entry_init()
and mvpp2_dbgfs_cls_init().
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20230717025538.2848-1-duminjie@vivo.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f8e34332

net: txgbe: change LAN reset mode · 9843814f

Jiawen Wu authored Jul 17, 2023

The old way to do LAN reset is sending reset command to firmware. Once
firmware performs reset, it reconfigures what it needs.

In the new firmware versions, veto bit is introduced for NCSI/LLDP to
block PHY domain in LAN reset. At this point, writing register of LAN
reset directly makes the same effect as the old way. And it does not
reset MNG domain, so that veto bit does not change.

Since veto bit was never used, the old firmware is compatible with the
driver before and after this change. The new firmware needs to use with
the driver after this change if it wants to implement the new feature,
otherwise it is the same as the old firmware.
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230717021333.94181-1-jiawenwu@trustnetic.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9843814f

selftests/net: replace manual array size calc with ARRAYSIZE macro. · 3645c71b

Mahmoud Maatuq authored Jul 16, 2023

fixes coccinelle WARNING: Use ARRAY_SIZE
Signed-off-by: Mahmoud Maatuq <mahmoudmatook.mm@gmail.com>
Link: https://lore.kernel.org/r/20230716184349.2124858-1-mahmoudmatook.mm@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3645c71b

18 Jul, 2023 5 commits

rtnetlink: Move nesting cancellation rollback to proper function · 4a59cdfd

Gal Pressman authored Jul 16, 2023

Make rtnl_fill_vf() cancel the vfinfo attribute on error instead of the
inner rtnl_fill_vfinfo(), as it is the function that starts it.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20230716072440.2372567-1-gal@nvidia.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

4a59cdfd

igc: Add TransmissionOverrun counter · d3750076

Muhammad Husaini Zulkifli authored Jul 14, 2023

Add TransmissionOverrun as per defined by IEEE 802.1Q Bridges.
TransmissionOverrun counter shall be incremented if the implementation
detects that a frame from a given queue is still being transmitted by
the MAC when that gate-close event for that queue occurs.

This counter is utilised by the Certification conformance test to
inform the user application whether any packets are currently being
transmitted on a particular queue during a gate-close event.

Intel Discrete I225/I226 have a mechanism to not transmit a packets if
the gate open time is insufficient for the packet transmission by setting
the Strict_End bit. Thus, it is expected for this counter to be always
zero at this moment.

Inspired from enetc_taprio_stats() and enetc_taprio_queue_stats(), now
driver also report the tx_overruns counter per traffic class.

User can get this counter by using below command:
1) tc -s qdisc show dev <interface> root
2) tc -s class show dev <interface>

Test Result (Before):
class mq :1 root
 Sent 1289 bytes 20 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq :2 root
 Sent 124 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq :3 root
 Sent 46028 bytes 86 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq :4 root
 Sent 2596 bytes 14 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

Test Result (After):
class taprio 100:1 root
 Sent 8491 bytes 38 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Transmit overruns: 0
class taprio 100:2 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Transmit overruns: 0
class taprio 100:3 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Transmit overruns: 0
class taprio 100:4 root
 Sent 994 bytes 11 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
 Transmit overruns: 0
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20230714201428.1718097-1-anthony.l.nguyen@intel.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

d3750076

ptp: Explicitly include correct DT includes · 9ffc4de5

Rob Herring authored Jul 14, 2023

The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20230714174922.4063153-1-robh@kernel.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

9ffc4de5

netconsole: Append kernel version to message · c62c0a17

Breno Leitao authored Jul 14, 2023

Create a new netconsole runtime option that prepends the kernel version in
the netconsole message. This is useful to map kernel messages to kernel
version in a simple way, i.e., without checking somewhere which kernel
version the host that sent the message is using.

If this option is selected, then the "<release>," is prepended before the
netconsole message. This is an example of a netconsole output, with
release feature enabled:

6.4.0-01762-ga1ba2ffe946e;12,426,112883998,-;this is a test

Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230714111330.3069605-1-leitao@debian.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

c62c0a17

Merge branch 'remove-some-unused-phylink-legacy' · a7f6eb19

Paolo Abeni authored Jul 18, 2023

Russell King says:

====================
Remove some unused phylink legacy

I believe we are now in a position where some of the legacy phylink code
can be removed!

I believe that all DSA drivers do not make use of any pre-March 2020
phylink behaviour - all drivers now seem to set legacy_pre_march2020 to
false, and the conditions that DSA sets it to true are no longer
satisifed by any driver.

Moreover, no one uses the .mac_an_restart() method, so this can also be
removed.
====================

Link: https://lore.kernel.org/r/ZLERQ2OBrv44Ppyc@shell.armlinux.org.ukSigned-off-by: Paolo Abeni <pabeni@redhat.com>

a7f6eb19