Commits · c4d7eb57687f358cd498ea3624519236af8db97e · Kirill Smelkov / linux

02 Feb, 2021 9 commits

net/mxl5e: Add change profile method · c4d7eb57

Saeed Mahameed authored Mar 22, 2020

Port nic netdevice will be used as uplink representor in downstream
patches. Add change profile method to allow changing a mlx5e netdevice
profile dynamically.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>

c4d7eb57

net/mlx5e: Separate between netdev objects and mlx5e profiles initialization · 3ef14e46

Saeed Mahameed authored Feb 25, 2020

1) Initialize netdevice features and structures on netdevice allocation
   and outside of the mlx5e profile.

2) As now mlx5e netdevice private params will be setup on profile init only
   after netdevice features are already set, we add  a call to
   netde_update_features() to resolve any conflict.
   This is nice since we reuse the fix_features ndo code if a profile
   wants different default features, instead of duplicating features
   conflict resolution code on profile initialization.

3) With this we achieve total separation between mlx5e profiles and
   netdevices, and will allow replacing mlx5e profiles on the fly to reuse
   the same netdevice for multiple profiles.
   e.g. for uplink representor profile as shown in the following patch

4) Profile callbacks are not allowed to touch netdev->features directly
   anymore, since in downstream patch we will detach/attach netdev
   dynamically to profile, hence we move the code dealing with
   netdev->features from profile->init() to fix_features ndo, and we
   will call netdev_update_features() on
   mlx5e_attach_netdev(profile, netdev);
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>

3ef14e46

Merge branch 'rework-the-memory-barrier-for-scrq-entry' · 9ae4bdc6

Jakub Kicinski authored Feb 01, 2021

Lijun Pan says:

====================
rework the memory barrier for SCRQ entry

This series rework the memory barrier for SCRQ (Sub-Command-Response
Queue) entry.
====================

Link: https://lore.kernel.org/r/20210130011905.1485-1-ljp@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9ae4bdc6

ibmvnic: remove unnecessary rmb() inside ibmvnic_poll · 2719cb44

Lijun Pan authored Jan 29, 2021

rmb() can be removed since:
1. pending_scrq() has dma_rmb() at the function end;
2. dma_rmb(), though weaker, is enough here.
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Dwip Banerjee <dnbanerg@us.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2719cb44

ibmvnic: rework to ensure SCRQ entry reads are properly ordered · 665ab1eb

Lijun Pan authored Jan 29, 2021

Move the dma_rmb() between pending_scrq() and ibmvnic_next_scrq()
into the end of pending_scrq() to save the duplicated code since
this dma_rmb will be used 3 times.
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

665ab1eb

Merge tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 1a2b60f6

Jakub Kicinski authored Feb 01, 2021

Saeed Mahameed says:

====================
mlx5-dr-2021-01-29

Add support for Connect-X6DX Software steering

This series adds SW Steering support for Connect-X6DX.

Since the STE and actions formats are different on this new HW,
we implemented the HW specific STEv1 layer on the infrastructure
implemented in previous mlx5 DR patchset to support all the
functionalities as previous devices.

Most of the code in this series very is low level HW specific, we
implement the function pointers for the generic SW steering layer.

* tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Allow SW steering for sw_owner_v2 devices
  net/mlx5: DR, Copy all 64B whenever replacing STE in the head of miss-list
  net/mlx5: DR, Use HW specific logic API when writing STE
  net/mlx5: DR, Use the right size when writing partial STE into HW
  net/mlx5: DR, Add STEv1 modify header logic
  net/mlx5: DR, Add STEv1 action apply logic
  net/mlx5: DR, Add STEv1 setters and getters
  net/mlx5: DR, Allow native protocol support for HW STEv1
  net/mlx5: DR, Add HW STEv1 match logic
  net/mlx5: DR, Add match STEv1 structs to ifc
  net/mlx5: DR, Fix potential shift wrapping of 32-bit value
====================

Link: https://lore.kernel.org/r/20210130022618.317351-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1a2b60f6

Merge branch 'net-dsa-hellcreek-report-tables-sizes' · f222a993

Jakub Kicinski authored Feb 01, 2021

Kurt Kanzenbach says:

====================
net: dsa: hellcreek: Report tables sizes

Florian, Andrew and Vladimir suggested at some point to use devlink for
reporting tables, features and debugging counters instead of using debugfs and
printk.

So, start by reporting the VLAN and FDB table sizes.
====================

Link: https://lore.kernel.org/r/20210130135934.22870-1-kurt@kmk-computers.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f222a993

net: dsa: hellcreek: Report FDB table occupancy · 8486e83f

Kurt Kanzenbach authored Jan 30, 2021

Report the FDB table size and occupancy via devlink. The actual size depends on
the used Hellcreek version:

|root@tsn:~# devlink resource show platform/ff240000.switch
|platform/ff240000.switch:
|  name VLAN size 4096 occ 2 unit entry dpipe_tables none
|  name FDB size 256 occ 6 unit entry dpipe_tables none
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8486e83f

net: dsa: hellcreek: Report VLAN table occupancy · 7f976d5c

Kurt Kanzenbach authored Jan 30, 2021

The VLAN membership configuration is cached in software already. So, it can be
reported via devlink. Add support for it:

|root@tsn:~# devlink resource show platform/ff240000.switch
|platform/ff240000.switch:
|  name VLAN size 4096 occ 4 unit entry dpipe_tables none
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7f976d5c

30 Jan, 2021 31 commits

tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size · 14e8e0f6

Neal Cardwell authored Jan 29, 2021

This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsuigned 31 bit field.

This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
fix commit 9d9b1ee0 ("tcp: fix TCP_USER_TIMEOUT with zero window").

This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
or a positive MTU value returned by tcp_mss_to_mtu()
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210129185438.1813237-1-ncardwell.kernel@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

14e8e0f6

Merge branch 'net-bridge-drop-hosts-limit-sysfs-and-add-a-comment' · 4e146def

Jakub Kicinski authored Jan 29, 2021

Nikolay Aleksandrov says:

====================
net: bridge: drop hosts limit sysfs and add a comment

As recently discussed[1] we should stop extending the bridge sysfs
support for new options and move to using netlink only, so patch 01
drops the recently added hosts limit sysfs support which is still in
net-next only and patch 02 adds comments in br_sysfs_br/if.c to warn
against adding new sysfs options.

[1] https://lore.kernel.org/netdev/20210128105201.7c6bed82@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/T/#mda7265b2e57b52bdab863f286efa85291cf83822
====================

Link: https://lore.kernel.org/r/20210129115142.188455-2-razor@blackwall.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4e146def

net: bridge: add warning comments to avoid extending sysfs · 1e16f382

Nikolay Aleksandrov authored Jan 29, 2021

We're moving to netlink-only options, so add comments in the bridge's
sysfs files to warn against adding any new sysfs entries.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1e16f382

net: bridge: mcast: drop hosts limit sysfs support · 7d0888d5

Nikolay Aleksandrov authored Jan 29, 2021

We decided to stop adding new sysfs bridge options and continue with
netlink only, so remove hosts limit sysfs support.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7d0888d5

Merge branch 'tag_8021q-for-ocelot-switches' · 56435d91

Jakub Kicinski authored Jan 29, 2021

Vladimir Oltean says:

====================
tag_8021q for Ocelot switches

The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port,
and the external ports, in the majority of use cases, run at 1G. This
means that, when the CPU injects traffic into the switch, it is very
easy to run into congestion. This is not to say that it is impossible to
enter congestion even with all ports running at the same speed, just
that the default configuration is already very prone to that by design.

Normally, the way to deal with that is using Ethernet flow control
(PAUSE frames).

However, this functionality is not working today with the ENETC - Felix
switch pair. The hardware issue is undergoing documentation right now as
an erratum within NXP, but several customers have been requesting a
reasonable workaround for it.

In truth, the LS1028A has 2 internal port pairs. The lack of flow control
is an issue only when NPI mode (Node Processor Interface, aka the mode
where the "CPU port module", which carries DSA-style tagged packets, is
connected to a regular Ethernet port) is used, and NPI mode is supported
by Felix on a single port.

In past BSPs, we have had setups where both internal port pairs were
enabled. We were advertising the following setup:

"data port"     "control port"
  (2.5G)            (1G)

   eno2             eno3
    ^                ^
    |                |
    | regular        | DSA-tagged
    | frames         | frames
    |                |
    v                v
   swp4             swp5

This works but is highly unpractical, due to NXP shifting the task of
designing a functional system (choosing which port to use, depending on
type of traffic required) up to the end user. The swpN interfaces would
have to be bridged with swp4, in order for the eno2 "data port" to have
access to the outside network. And the swpN interfaces would still be
capable of IP networking. So running a DHCP client would give us two IP
interfaces from the same subnet, one assigned to eno2, and the other to
swpN (0, 1, 2, 3).

Also, the dual port design doesn't scale. When attaching another DSA
switch to a Felix port, the end result is that the "data port" cannot
carry any meaningful data to the external world, since it lacks the DSA
tags required to traverse the sja1105 switches below. All that traffic
needs to go through the "control port".

So in newer BSPs there was a desire to simplify that setup, and only
have one internal port pair:

   eno2            eno3
    ^
    |
    | DSA-tagged    x disabled
    | frames
    |
    v
   swp4            swp5

However, this setup only exacerbates the issue of not having flow
control on the NPI port, since that is the only port now. Also, there
are use cases that still require the "data port", such as IEEE 802.1CB
(TSN stream identification doesn't work over an NPI port), source
MAC address learning over NPI, etc.

Again, there is a desire to keep the simplicity of the single internal
port setup, while regaining the benefits of having a dedicated data port
as well. And this series attempts to deliver just that.

So the NPI functionality is disabled conditionally. Its purpose was:
- To ensure individually addressable ports on TX. This can be replaced
  by using some designated VLAN tags which are pushed by the DSA tagger
  code, then removed by the switch (so they are invisible to the outside
  world and to the user).
- To ensure source port identification on RX. Again, this can be
  replaced by using some designated VLAN tags to encapsulate all RX
  traffic (each VLAN uniquely identifies a source port). The DSA tagger
  determines which port it was based on the VLAN number, then removes
  that header.
- To deliver PTP timestamps. This cannot be obtained through VLAN
  headers, so we need to take a step back and see how else we can do
  that. The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
  injection/extraction from the CPU port module using register-based
  MMIO, and not over Ethernet. We will need to do the same from DSA,
  which makes this tagger a sort of hybrid between DSA and pure
  switchdev.
====================

Link: https://lore.kernel.org/r/20210129010009.3959398-1-olteanv@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

56435d91

net: dsa: felix: perform switch setup for tag_8021q · e21268ef

Vladimir Oltean authored Jan 29, 2021

Unlike sja1105, the only other user of the software-defined tag_8021q.c
tagger format, the implementation we choose for the Felix DSA switch
driver preserves full functionality under a vlan_filtering bridge
(i.e. IP termination works through the DSA user ports under all
circumstances).

The tag_8021q protocol just wants:
- Identifying the ingress switch port based on the RX VLAN ID, as seen
  by the CPU. We achieve this by using the TCAM engines (which are also
  used for tc-flower offload) to push the RX VLAN as a second, outer
  tag, on egress towards the CPU port.
- Steering traffic injected into the switch from the network stack
  towards the correct front port based on the TX VLAN, and consuming
  (popping) that header on the switch's egress.

A tc-flower pseudocode of the static configuration done by the driver
would look like this:

$ tc qdisc add dev <cpu-port> clsact
$ for eth in swp0 swp1 swp2 swp3; do \
	tc filter add dev <cpu-port> egress flower indev ${eth} \
		action vlan push id <rxvlan> protocol 802.1ad; \
	tc filter add dev <cpu-port> ingress protocol 802.1Q flower
		vlan_id <txvlan> action vlan pop \
		action mirred egress redirect dev ${eth}; \
done

but of course since DSA does not register network interfaces for the CPU
port, this configuration would be impossible for the user to do. Also,
due to the same reason, it is impossible for the user to inadvertently
delete these rules using tc. These rules do not collide in any way with
tc-flower, they just consume some TCAM space, which is something we can
live with.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e21268ef

net: dsa: add a second tagger for Ocelot switches based on tag_8021q · 7c83a7c5

Vladimir Oltean authored Jan 29, 2021

There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.

Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
  source address learning is performed, no TSN stream classification,
  etc.
- Flow control is not functional over an NPI port (PAUSE frames are
  encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
  in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
  port is currently either disabled, or operated as a plain user port
  (albeit an internally-facing one). Having the ability to configure the
  two CPU ports symmetrically could pave the way for e.g. creating a LAG
  between them, to increase bandwidth seamlessly for the system.

So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:

echo ocelot-8021q > /sys/class/<dsa-master>/dsa/tagging
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7c83a7c5

net: dsa: felix: convert to the new .change_tag_protocol DSA API · adb3dccf

Vladimir Oltean authored Jan 29, 2021

In expectation of the new tag_ocelot_8021q tagger implementation, we
need to be able to do runtime switchover between one tagger and another.
So we must structure the existing code for the current NPI-based tagger
in a certain way.

We move the felix_npi_port_init function in expectation of the future
driver configuration necessary for tag_ocelot_8021q: we would like to
not have the NPI-related bits interspersed with the tag_8021q bits.

The conversion from this:

	ocelot_write_rix(ocelot,
			 ANA_PGID_PGID_PGID(GENMASK(ocelot->num_phys_ports, 0)),
			 ANA_PGID_PGID, PGID_UC);

to this:

	cpu_flood = ANA_PGID_PGID_PGID(BIT(ocelot->num_phys_ports));
	ocelot_rmw_rix(ocelot, cpu_flood, cpu_flood, ANA_PGID_PGID, PGID_UC);

is perhaps non-trivial, but is nonetheless non-functional. The PGID_UC
(replicator for unknown unicast) is already configured out of hardware
reset to flood to all ports except ocelot->num_phys_ports (the CPU port
module). All we change is that we use a read-modify-write to only add
the CPU port module to the unknown unicast replicator, as opposed to
doing a full write to the register.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

adb3dccf

net: dsa: allow changing the tag protocol via the "tagging" device attribute · 53da0eba

Vladimir Oltean authored Jan 29, 2021

Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot

which is a read-only device attribute, introduced in the kernel as
commit 98cdb480 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").

It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.

In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.

Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
  the initial tagging protocol driver. Nonetheless, new drivers should
  report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
  protocol, no later than the .setup() method, as well as destroying
  resources used by the last tagger in use, no earlier than the
  .teardown() method.

For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.

Testing:

$ insmod mscc_felix.ko
[   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[   97.261724] libphy: VSC9959 internal MDIO bus: probed
[   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[   98.527948] device eno2 entered promiscuous mode
[   98.544755] DSA: tree 0 setup

$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
 -  10.0.0.1 ping statistics  -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms

$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
        #!/bin/bash

        ip link set swp0 down
        ip link set swp1 down
        ip link set swp2 down
        ip link set swp3 down
        ip link set swp5 down
        ip link set eno2 down
        echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
        ip link set eno2 up
        ip link set swp0 up
        ip link set swp1 up
        ip link set swp2 up
        ip link set swp3 up
        ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[  645.544426] mscc_felix 0000:00:00.5: Link is Down
[  645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

53da0eba

net: dsa: keep a copy of the tagging protocol in the DSA switch tree · 357f203b

Vladimir Oltean authored Jan 29, 2021

Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.

So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.

It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).

This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?

One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.

Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.

There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

357f203b

net: dsa: document the existing switch tree notifiers and add a new one · 886f8e26

Vladimir Oltean authored Jan 29, 2021

The existence of dsa_broadcast has generated some confusion in the past:
https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html

So let's document the existing dsa_port_notify and dsa_broadcast
functions and explain when each of them should be used.

Also, in fact, the in-between function has always been there but was
lacking a name, and is the main reason for this patch: dsa_tree_notify.
Refactor dsa_broadcast to use it.

This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
where it really belonged in the first place, but had no companion so it
stood with dsa_port_notify.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

886f8e26

net: mscc: ocelot: don't use NPI tag prefix for the CPU port module · cacea62f

Vladimir Oltean authored Jan 29, 2021

Context: Ocelot switches put the injection/extraction frame header in
front of the Ethernet header. When used in NPI mode, a DSA master would
see junk instead of the destination MAC address, and it would most
likely drop the packets. So the Ocelot frame header can have an optional
prefix, which is just "ff:ff:ff:ff:ff:fe > ff:ff:ff:ff:ff:ff" padding
put before the actual tag (still before the real Ethernet header) such
that the DSA master thinks it's looking at a broadcast frame with a
strange EtherType.

Unfortunately, a lesson learned in commit 69df578c ("net: mscc:
ocelot: eliminate confusion between CPU and NPI port") seems to have
been forgotten in the meanwhile.

The CPU port module and the NPI port have independent settings for the
length of the tag prefix. However, the driver is using the same variable
to program both of them.

There is no reason really to use any tag prefix with the CPU port
module, since that is not connected to any Ethernet port. So this patch
makes the inj_prefix and xtr_prefix variables apply only to the NPI
port (which the switchdev ocelot_vsc7514 driver does not use).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cacea62f

net: mscc: ocelot: reapply bridge forwarding mask on bonding join/leave · 9b521250

Vladimir Oltean authored Jan 29, 2021

Applying the bridge forwarding mask currently is done only on the STP
state changes for any port. But it depends on both STP state changes,
and bonding interface state changes. Export the bit that recalculates
the forwarding mask so that it could be reused, and call it when a port
starts and stops offloading a bonding interface.

Now that the logic is split into a separate function, we can rename "p"
into "port", since the "port" variable was already taken in
ocelot_bridge_stp_state_set. Also, we can rename "i" into "lag", to make
it more clear what is it that we're iterating through.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9b521250

net: mscc: ocelot: store a namespaced VCAP filter ID · 50c6cc5b

Vladimir Oltean authored Jan 29, 2021

We will be adding some private VCAP filters that should not interfere in
any way with the filters added using tc-flower. So we need to allocate
some IDs which will not be used by tc.

Currently ocelot uses an u32 id derived from the flow cookie, which in
itself is an unsigned long. This is a problem in itself, since on 64 bit
systems, sizeof(unsigned long)=8, so the driver is already truncating
these.

Create a struct ocelot_vcap_id which contains the full unsigned long
cookie from tc, as well as a boolean that is supposed to namespace the
filters added by tc with the ones that aren't.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

50c6cc5b

net: mscc: ocelot: export VCAP structures to include/soc/mscc · 0e9bb4e9

Vladimir Oltean authored Jan 29, 2021

The Felix driver will need to preinstall some VCAP filters for its
tag_8021q implementation (outside of the tc-flower offload logic), so
these need to be exported to the common includes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0e9bb4e9

net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN · 9c7caf28

Vladimir Oltean authored Jan 29, 2021

The sja1105 implementation can be blind about this, but the felix driver
doesn't do exactly what it's being told, so it needs to know whether it
is a TX or an RX VLAN, so it can install the appropriate type of TCAM
rule.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9c7caf28

vmxnet3: Remove buf_info from device accessible structures · de1da8bc

Ronak Doshi authored Jan 27, 2021

buf_info structures in RX & TX queues are private driver data that
do not need to be visible to the device.  Although there is physical
address and length in the queue descriptor that points to these
structures, their layout is not standardized, and device never looks
at them.

So lets allocate these structures in non-DMA-able memory, and fill
physical address as all-ones and length as zero in the queue
descriptor.

That should alleviate worries brought by Martin Radev in
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210104/022829.html
that malicious vmxnet3 device could subvert SVM/TDX guarantees.
Signed-off-by: Petr Vandrovec <petr@vmware.com>
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

de1da8bc

net: dsa: hellcreek: Add missing TAPRIO dependency · 6c13d75b

Kurt Kanzenbach authored Jan 28, 2021

Add missing dependency to TAPRIO to avoid build failures such as:

|ERROR: modpost: "taprio_offload_get" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!
|ERROR: modpost: "taprio_offload_free" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!

Fixes: 24dfc6eb ("net: dsa: hellcreek: Add TAPRIO offloading support")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20210128163338.22665-1-kurt@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6c13d75b

net: proc: speedup /proc/net/netstat · 0d6cd689

Eric Dumazet authored Jan 28, 2021

Use cache friendly helpers to better use cpu caches
while reading /proc/net/netstat

Tested on a platform with 256 threads (AMD Rome)

Before: 305 usec spent in netstat_seq_show()
After: 130 usec spent in netstat_seq_show()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0d6cd689

net: Remove redundant calls of sk_tx_queue_clear(). · df610cd9

Kuniyuki Iwashima authored Jan 29, 2021

The commit 41b14fb8 ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4 ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.

Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it does the only allocation and its callers
initialize fields.

CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jpSigned-off-by: Jakub Kicinski <kuba@kernel.org>

df610cd9

Merge branch 'net-hns3-updates-for-next' · 77609b1d

Jakub Kicinski authored Jan 29, 2021

Huazhong Tan says:

====================
net: hns3: updates for -next

This patchset adds dump tm info of nodes, priority and qset in debugfs.
Three debugfs files tm_nodes, tm_priority and tm_qset are created in
new tm directory, and use cat command to dump their info, for examples:

$ cat tm_nodes
       BASE_ID  MAX_NUM
PG         0         8
PRI        0         8
QSET       0         8
QUEUE      0      1024

$ cat tm_priority
ID    MODE  DWRR  C_IR_B  C_IR_U  C_IR_S  C_BS_B  C_BS_S  C_FLAG  C_RATE(Mbps)  P_IR_B  P_IR_U  P_IR_S  P_BS_B  P_BS_S  P_FLAG  P_RATE(Mbps)
0000  dwrr  100     0       0       0       5      20       0          0        150       7       0       5      20       0          0
0001    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0002    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0003    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0004    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0005    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0006    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0007    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0

$ cat tm_qset
ID    MAP_PRI  LINK_VLD  MODE  DWRR
0000     0        1      dwrr  100
0001     0        0        sp    0
0002     0        0        sp    0
0003     0        0        sp    0
0004     0        0        sp    0
0005     0        0        sp    0
0006     0        0        sp    0

change log:
V2: add readonly files for dump all nodes, priority and qset info
    suggested by Jakub Kicinski.

previous version:
V1: https://patchwork.kernel.org/project/netdevbpf/patch/1610694569-43099-1-git-send-email-tanhuazhong@huawei.com/
====================

Link: https://lore.kernel.org/r/1611834696-56207-1-git-send-email-tanhuazhong@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

77609b1d

net: hns3: add debugfs support for tm nodes, priority and qset info · 04987ca1

Guangbin Huang authored Jan 28, 2021

In order to query tm info of nodes, priority and qset
for debugging, adds three debugfs files tm_nodes,
tm_priority and tm_qset in newly created tm directory.

Unlike previous debugfs commands, these three files
just support read ops, so they only support to use cat
command to dump their info.

The new tm file style is acccording to suggestion from
Jakub Kicinski's opinion as link https://lkml.org/lkml/2020/9/29/2101.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

04987ca1

net: hns3: add interfaces to query information of tm priority/qset · 2bbad0aa

Guangbin Huang authored Jan 28, 2021

Add some interfaces to get information of tm priority and qset,
then they can be used by debugfs.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2bbad0aa

Merge branch 'net-add-support-for-ip-generic-checksum-offload-for-gre' · 2d88296a

Jakub Kicinski authored Jan 29, 2021

Xin Long says:

====================
net: add support for ip generic checksum offload for gre

This patchset it to add ip generic csum processing first in
skb_csum_hwoffload_help() in Patch 1/2 and then add csum
offload support for GRE header in Patch 2/2.
====================

Link: https://lore.kernel.org/r/cover.1611825446.git.lucien.xin@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2d88296a

ip_gre: add csum offload support for gre header · efa1a65c

Xin Long authored Jan 28, 2021

This patch is to add csum offload support for gre header:

On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.

On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().

Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.

v1->v2:
  - remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
    and it will always do checksum for SCTP in sctp_packet_pack()
    when it's not a GSO packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

efa1a65c

net: support ip generic csum processing in skb_csum_hwoffload_help · 62fafcd6

Xin Long authored Jan 28, 2021

NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.

However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features doesn't support NETIF_F_HW_CSUM, but supports
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.

This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.

Note that we're using skb->csum_offset to check if it's a TCP/UDP
proctol, this might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
we'd better fix this until a new protocol comes with a same csum
offset.

v1->v2:
  - not extend skb->csum_not_inet, but use skb->csum_offset to tell
    if it's an UDP/TCP csum packet.
v2->v3:
  - add a note in the changelog, as Willem suggested.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

62fafcd6

Merge tag 'linux-can-next-for-5.12-20210129' of... · fd3d3755

Jakub Kicinski authored Jan 29, 2021

Merge tag 'linux-can-next-for-5.12-20210129' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
linux-can-next-for-5.12-20210129

All patches are by me and target the mcp251xfd driver. The first 4
patches update the information regarding the "85% of (FSYSCLK/2)"
errata. The other 4 are misc cleanups, unitfy error messages, add
missing postfix to a macro, simplify the return of a function, and
make use of dev_err_probe() in the mcp251xfd_probe() function.
====================

Link: https://lore.kernel.org/r/20210129084302.3040284-1-mkl@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fd3d3755

net: mhi: Get rid of local rx queue count · 6e10785e

Loic Poulain authored Jan 11, 2021

Use the new mhi_get_free_desc_count helper to track queue usage
instead of relying on the locally maintained rx_queued count.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6e10785e

net: mhi: Get RX queue size from MHI core · e6ec3ccd

Loic Poulain authored Jan 28, 2021

The RX queue size can be determined at runtime by retrieving the
number of available transfer descriptors.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e6ec3ccd

Merge branch 'mhi-net-immutable' of https://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi · 2bca263c
Jakub Kicinski authored Jan 29, 2021
```
Needed by mhi-net patches.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
```
2bca263c

docs: networking: timestamping: fix section title markup · 5daf8384

Jan Luebbe authored Jan 28, 2021

This section was missed during the conversion to ReST, so convert it in the
same style as the surrounding section titles.
Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lore.kernel.org/r/20210128111930.29473-1-jlu@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5daf8384