Commits · 16db6323042f39b6f49148969e9d03d11265bc1b · Kirill Smelkov / linux

26 Jan, 2021 7 commits

bnxt_en: Update firmware interface to 1.10.2.11. · 16db6323

Michael Chan authored Jan 25, 2021

Updates to backing store APIs, QoS profiles, and push buffer initial
index support.

Since the new HWRM_FUNC_BACKING_STORE_CFG message size has increased,
we need to add some compatibility logic to fall back to the smaller
legacy size if firmware cannot accept the larger message size.  The new
fields added to the structure are not used yet.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
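
The fallback described in this commit message can be pictured as choosing the largest message size the firmware will take. The Python sketch below is only a model of that idea: `FakeFirmware`, `pick_backing_store_cfg_len`, and both sizes are hypothetical stand-ins, not the driver's code or real HWRM values.

```python
# Hypothetical sizes; the commit message does not state the real lengths.
NEW_MSG_LEN = 320
LEGACY_MSG_LEN = 256


class FakeFirmware:
    """Stand-in for firmware that accepts requests up to max_len bytes."""

    def __init__(self, max_len):
        self.max_len = max_len

    def accepts(self, msg_len):
        return msg_len <= self.max_len


def pick_backing_store_cfg_len(fw):
    """Prefer the larger message, fall back to the legacy size.

    Returns the accepted length, or raises if even the legacy size
    is rejected.
    """
    for msg_len in (NEW_MSG_LEN, LEGACY_MSG_LEN):
        if fw.accepts(msg_len):
            return msg_len
    raise RuntimeError("firmware rejected both message sizes")
```

Trying the sizes in descending order keeps the caller unaware of which firmware generation it is talking to.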


Merge branch 'dsa-add-mt7530-gpio-support' · ae189ccb

Jakub Kicinski authored Jan 25, 2021

DENG Qingfang says:

====================
dsa: add MT7530 GPIO support

MT7530's LED controller can be used as GPIO controller.
Add support for it.
====================

Link: https://lore.kernel.org/r/20210125044322.6280-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: dsa: mt7530: MT7530 optional GPIO support · 429a0ede

DENG Qingfang authored Jan 25, 2021

MT7530's LED controller can drive up to 15 LED/GPIOs.

Add support for GPIO control and allow users to use its GPIOs by
setting gpio-controller property in device tree.
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


dt-bindings: net: dsa: add MT7530 GPIO controller binding · 974d5ba6

DENG Qingfang authored Jan 25, 2021

Add device tree binding to support MT7530 GPIO controller.
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: ethernet: mediatek: support setting MTU · 4fd59792

DENG Qingfang authored Jan 25, 2021

MT762x HW, except for MT7628, supports frame length up to 2048
(maximum length on GDM), so allow setting MTU up to 2030.

Also set the default frame length to the hardware default 1518.
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210125042046.5599-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
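
The MTU figures in this commit message are consistent with subtracting an 18-byte per-frame overhead from the frame length. The sketch below works through that arithmetic, assuming the overhead is a 14-byte Ethernet header plus a 4-byte FCS; the driver's own constants are not shown in the message, so treat the names here as illustrative.

```python
ETH_HLEN = 14     # Ethernet header: two 6-byte MAC addresses plus 2-byte EtherType
ETH_FCS_LEN = 4   # frame check sequence appended to every frame


def max_mtu(max_frame_len):
    """Largest payload that fits in a frame of max_frame_len bytes."""
    return max_frame_len - (ETH_HLEN + ETH_FCS_LEN)


print(max_mtu(2048))  # GDM maximum frame length from the commit message
print(max_mtu(1518))  # hardware default frame length
```

Under this assumption the 2048-byte limit yields an MTU of 2030, and the 1518-byte default yields the usual 1500.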


bridge: Use PTR_ERR_OR_ZERO instead if(IS_ERR(...)) + PTR_ERR · 8d21c882

Jiapeng Zhong authored Jan 25, 2021

coccicheck suggested using PTR_ERR_OR_ZERO(); looking at the code, the
suggestion is valid.

Fix the following coccicheck warnings:

./net/bridge/br_multicast.c:1295:7-13: WARNING: PTR_ERR_OR_ZERO can be
used.
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/1611542381-91178-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
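
For readers unfamiliar with the helper, the following Python model shows why PTR_ERR_OR_ZERO() can replace an IS_ERR() check followed by PTR_ERR(). It models pointers as 64-bit unsigned integers and error pointers as the top 4095 values of that range; the function names mirror the kernel macros but this is a model, not kernel code.

```python
MAX_ERRNO = 4095
WORD = 1 << 64  # model a 64-bit unsigned long


def err_ptr(errno_value):
    """Encode a negative errno as a pointer-sized value, like ERR_PTR()."""
    return errno_value % WORD


def is_err(ptr):
    """True if ptr falls in the last MAX_ERRNO values of the address range."""
    return ptr >= WORD - MAX_ERRNO


def ptr_err(ptr):
    """Recover the negative errno from an error pointer, like PTR_ERR()."""
    return ptr - WORD


def ptr_err_or_zero(ptr):
    """Single-call form of: if is_err(ptr) return ptr_err(ptr), else 0."""
    return ptr_err(ptr) if is_err(ptr) else 0
```

The one-call form returns the same value as the two-step form for both valid and error pointers, which is what the coccicheck rule relies on.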


octeontx2-af: Support ESP/AH RSS hashing · b9b7421a

Subbaraya Sundeep authored Jan 23, 2021

Support hashing of the SPI and sequence number fields of the
ESP/AH header for RSS. By default, ESP/AH fields are not
considered for RSS and need to be enabled explicitly as below:
ethtool -U eth0 rx-flow-hash esp4 sdfn
or
ethtool -U eth0 rx-flow-hash ah4 sdfn
or
ethtool -U eth0 rx-flow-hash esp6 sdfn
or
ethtool -U eth0 rx-flow-hash ah6 sdfn

To disable hashing of ESP/AH fields:
ethtool -U eth0 rx-flow-hash esp4 sd
or
ethtool -U eth0 rx-flow-hash ah4 sd
or
ethtool -U eth0 rx-flow-hash esp6 sd
or
ethtool -U eth0 rx-flow-hash ah6 sd
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Link: https://lore.kernel.org/r/1611378552-13288-1-git-send-email-sundeep.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


25 Jan, 2021 1 commit

tg3: improve PCI VPD access · 24f97b6a

Heiner Kallweit authored Jan 22, 2021

When working on the PCI VPD code I also tested with a Broadcom BCM95719
card. tg3 uses internal NVRAM access with this card, so I forced it to
PCI VPD mode for testing. PCI VPD access fails
(i + PCI_VPD_LRDT_TAG_SIZE + j > len) because only TG3_NVM_VPD_LEN (256)
bytes are read, but PCI VPD has 400 bytes on this card.

So add a constant TG3_NVM_PCI_VPD_MAX_LEN that defines the maximum
PCI VPD size. The actual VPD size is returned by pci_read_vpd().
In addition it's not worth looping over pci_read_vpd(). If we miss the
125ms timeout per VPD dword read then definitely something is wrong,
and if the tg3 module loading is killed then there's also not much
benefit in retrying the VPD read.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/cb9e9113-0861-3904-87e0-d4c4ab3c8860@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


24 Jan, 2021 6 commits

Merge branch 'net-dsa-hellcreek-add-taprio-offloading' · a61e4b60

Jakub Kicinski authored Jan 23, 2021

Kurt Kanzenbach says:

====================
net: dsa: hellcreek: Add TAPRIO offloading

The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port
has eight egress queues. Traffic is mapped to a traffic class via the
PCP field of a VLAN-tagged frame.

Previous attempts:
 * https://lkml.kernel.org/netdev/20201121115703.23221-1-kurt@linutronix.de/
 * https://lkml.kernel.org/netdev/20210116124922.32356-1-kurt@linutronix.de/
====================

Link: https://lore.kernel.org/r/20210123105633.16753-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: dsa: hellcreek: Add TAPRIO offloading support · 24dfc6eb

Kurt Kanzenbach authored Jan 23, 2021

The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port has eight
egress queues. Traffic is mapped to a traffic class via the PCP field of a
VLAN-tagged frame.

The TAPRIO Qdisc already implements that. Therefore, this interface can simply
be reused. Add .port_setup_tc() accordingly.

The activation of a schedule on a port is split into two parts:

 * Programming the necessary gate control list (GCL)
 * Setting up delayed work for starting the schedule

The hardware supports starting a schedule up to eight seconds in the future. The
TAPRIO interface provides an absolute base time. Therefore, periodic delayed
work is leveraged to check whether a schedule may be started or not.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
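
The periodic check described in this commit message can be modeled as a small decision function. This is a sketch under stated assumptions: the function name and return values are hypothetical, the eight-second window is treated as inclusive, and base times in the past are not handled specially.

```python
NSEC_PER_SEC = 1_000_000_000
MAX_START_AHEAD_NS = 8 * NSEC_PER_SEC  # hardware limit stated in the commit message


def schedule_action(now_ns, base_time_ns):
    """Decide what one run of the delayed work should do.

    Returns "start" when the absolute base time is close enough for the
    hardware to be armed, otherwise "retry" so the work is queued again.
    """
    if base_time_ns - now_ns <= MAX_START_AHEAD_NS:
        return "start"
    return "retry"
```

A schedule with a base time 20 seconds away would therefore be retried until the clock is within the window, at which point it is handed to the hardware.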


net: mhi: Set wwan device type · b80b5dbf

Loic Poulain authored Jan 22, 2021

The 'wwan' devtype is meant for devices that require additional
configuration to be used, like WWAN specific APN setup over AT/QMI
commands, rmnet link creation, etc. This is the case for MHI (Modem
Host Interface) netdev which targets modem/WWAN endpoints.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/1611328554-1414-1-git-send-email-loic.poulain@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Merge branch 'udp-allow-forwarding-of-plain-non-fraglisted-udp-gro-packets' · acb4151f

Jakub Kicinski authored Jan 23, 2021

Alexander Lobakin says:

====================
udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

This series allows forming UDP GRO packets when there is no socket
(i.e. for forwarding). To not change the current datapath, this is
performed only when the new corresponding netdev feature is enabled
via Ethtool (and fraglisted GRO is disabled).
Prior to this point, only fraglisted UDP GRO was available. Plain UDP
GRO shows better forwarding performance when a target NIC is capable
of GSO UDP offload.

Since v3 [2]:
 - rename introduced netdev feature to reflect that it targets
   forwarding and doesn't touch fraglisted GRO at all (Willem de Bruijn).

Since v2 [1]:
 - convert to a series;
 - new: add new netdev_feature to explicitly enable/disable UDP GRO
   when there is no socket, defaults to off (Paolo Abeni).

Since v1 [0]:
 - drop redundant 'if (sk)' check (Alexander Duyck);
 - add a ref in the commit message to one more commit that was
   an important step for UDP GRO forwarding.

[0] https://lore.kernel.org/netdev/20210112211536.261172-1-alobakin@pm.me
[1] https://lore.kernel.org/netdev/20210113103232.4761-1-alobakin@pm.me
[2] https://lore.kernel.org/netdev/20210118193122.87271-1-alobakin@pm.me
====================

Link: https://lore.kernel.org/r/20210122181909.36340-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


udp: allow forwarding of plain (non-fraglisted) UDP GRO packets · 36707061

Alexander Lobakin authored Jan 22, 2021

Commit 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.") not only
added support for fraglisted UDP GRO, but also adjusted the logic
so that non-fraglisted UDP GRO started to work for forwarding
too.
Commit 2e4ef10f ("net: add GSO UDP L4 and GSO fraglists to the
list of software-backed types") added GSO UDP L4 to the list of
software GSO types to allow virtual netdevs to forward such packets
as is, up to the real drivers.

Tests showed that forwarding and NATing of plain UDP GRO packets
currently work correctly, regardless of whether the target
netdevice supports hardware/driver GSO UDP L4 or not.
Add the last element and allow forming plain UDP GRO packets if
we are on the forwarding path and the new NETIF_F_GRO_UDP_FWD is
enabled on the receiving netdevice.

If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
fraglisted GRO takes precedence. This keeps the current behaviour
and is generally more optimal for now, as the number of NICs with
hardware USO offload is relatively small.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
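
The precedence rule in this commit message can be summarized for the forwarding case (no local socket) as follows. The Python model below is a simplification: the `Feature` flags and `forwarding_gro_mode` are illustrative names, and the socket-owned paths are deliberately left out.

```python
from enum import Flag


class Feature(Flag):
    """Stand-ins for the two netdev feature bits discussed above."""
    GRO_FRAGLIST = 1
    GRO_UDP_FWD = 2


def forwarding_gro_mode(features):
    """GRO mode for a UDP packet that has no local socket."""
    if Feature.GRO_FRAGLIST in features:
        return "fraglist"   # takes precedence when both bits are set
    if Feature.GRO_UDP_FWD in features:
        return "plain"
    return "none"           # default: the datapath is unchanged
```

With both features enabled the model returns "fraglist", matching the stated intent of keeping the current behaviour unless the user opts in to plain GRO alone.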


net: introduce a netdev feature for UDP GRO forwarding · 6f1c0ea1

Alexander Lobakin authored Jan 22, 2021

Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow the user
to turn UDP GRO on and off for forwarding.
It defaults to off so as not to change the current datapath.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


23 Jan, 2021 26 commits

Merge branch 'remove-unneeded-phy-time-stamping-option' · 692347a9

Jakub Kicinski authored Jan 23, 2021

Richard Cochran says:

====================
Remove unneeded PHY time stamping option.

The NETWORK_PHY_TIMESTAMPING configuration option adds additional
checks into the networking hot path, and it is only needed by two
rather esoteric devices, namely the TI DP83640 PHYTER and the ZHAW
InES 1588 IP core.  Very few end users have these devices, and those
that do have them are building specialized embedded systems.

Unfortunately two unrelated drivers depend on this option, and two
defconfigs enable it.  It is probably my fault for not paying enough
attention in reviews.

This series corrects the gratuitous use of NETWORK_PHY_TIMESTAMPING.
====================

Link: https://lore.kernel.org/r/cover.1611198584.git.richardcochran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: mvpp2: Remove unneeded Kconfig dependency. · 04cbb740

Richard Cochran authored Jan 20, 2021

The mvpp2 is an Ethernet driver, and it implements MAC style time
stamping of PTP frames.  It has no need of the expensive option to
enable PHY time stamping.  Remove the incorrect dependency.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: dsa: mv88e6xxx: Remove bogus Kconfig dependency. · 57ba0077

Richard Cochran authored Jan 20, 2021

The mv88e6xxx is a DSA driver, and it implements DSA style time
stamping of PTP frames.  It has no need of the expensive option to
enable PHY time stamping.  Remove the bogus dependency.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Merge branch 'net-ipa-napi-poll-updates' · e7b76db3

Jakub Kicinski authored Jan 23, 2021

Alex Elder says:

====================
net: ipa: NAPI poll updates

While reviewing the IPA NAPI polling code in detail I found two
problems.  This series fixes those, and implements a few other
improvements to this part of the code.

The first two patches are minor bug fixes that avoid extra passes
through the poll function.  The third simplifies code inside the
polling loop a bit.

The last two update how interrupts are disabled; previously it was
possible for another I/O completion condition to be recorded before
NAPI got scheduled.
====================

Link: https://lore.kernel.org/r/20210121114821.26495-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: ipa: disable IEOB interrupts before clearing · 7bd9785f

Alex Elder authored Jan 21, 2021

Currently in gsi_isr_ieob(), event ring IEOB interrupts are disabled
one at a time.  The loop disables the IEOB interrupt for all event
rings represented in the event mask.  Instead, just disable them all
at once.

Disable them all *before* clearing the interrupt condition.  This
guarantees we'll schedule NAPI for each event once, before another
IEOB interrupt could be signaled.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
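
The difference between disabling interrupts one event ring at a time and disabling them with a single mask can be sketched with plain bit operations. This is a model only: the function names are hypothetical, and "writes" simply counts how many times the enable mask would be updated.

```python
def disable_one_at_a_time(enabled, event_mask):
    """Old shape: one enable-mask update per event ring in event_mask."""
    writes = 0
    while event_mask:
        lowest = event_mask & -event_mask  # isolate the lowest set bit
        enabled &= ~lowest
        event_mask ^= lowest
        writes += 1
    return enabled, writes


def disable_all_at_once(enabled, event_mask):
    """New shape: a single update covering every ring in event_mask."""
    return enabled & ~event_mask, 1
```

Both produce the same final enable mask; the second does so in one step, which is what allows it to happen before the interrupt condition is cleared.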


net: ipa: repurpose gsi_irq_ieob_disable() · 5725593e

Alex Elder authored Jan 21, 2021

Rename gsi_irq_ieob_disable() to be gsi_irq_ieob_disable_one().

Introduce a new function gsi_irq_ieob_disable() that takes a mask of
events to disable rather than a single event id.  This will be used
in the next patch.

Rename gsi_irq_ieob_enable() to be gsi_irq_ieob_enable_one() to be
consistent.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: ipa: have gsi_channel_update() return a value · 223f5b34

Alex Elder authored Jan 21, 2021

Have gsi_channel_update() return the first transaction in the
updated completed transaction list, or NULL if no new transactions
have been added.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: ipa: heed napi_complete() return value · 148604e7

Alex Elder authored Jan 21, 2021

Pay attention to the return value of napi_complete(), completing
polling only if it returns true.

Just use napi rather than &channel->napi as the argument passed to
napi_complete().
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: ipa: count actual work done in gsi_channel_poll() · c80c4a1e

Alex Elder authored Jan 21, 2021

There is an off-by-one problem in gsi_channel_poll().  The count of
transactions completed is incremented each time through the loop
*before* determining whether there is any more work to do.  As a
result, if we exit the loop early the counter's value is one more
than the number of transactions actually processed.

Instead, increment the count after processing, to ensure it reflects
the number of processed transactions.  The result is more naturally
described as a for loop rather than a while loop, so change that.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
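
The off-by-one described in this commit message is easy to reproduce in a small model. The two functions below are simplified Python stand-ins for the before and after loop shapes, not the driver's code; `transactions` is just a list of pending items.

```python
def poll_count_before(transactions, budget):
    """Old shape: the count is bumped before checking for more work."""
    pending = list(transactions)
    count = 0
    while count < budget:
        count += 1
        if not pending:
            break  # early exit, but count was already incremented
        pending.pop(0)
    return count


def poll_count_after(transactions, budget):
    """New shape: a for loop that counts only processed transactions."""
    pending = list(transactions)
    count = 0
    for _ in range(budget):
        if not pending:
            break
        pending.pop(0)
        count += 1
    return count
```

With two pending transactions and a budget of four, the first version reports three and the second reports two, which is the number actually processed.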


Merge branch 'mlxsw-expose-number-of-physical-ports' · 59a49d96

Jakub Kicinski authored Jan 22, 2021

Ido Schimmel says:

====================
mlxsw: Expose number of physical ports

The switch ASIC has a limited capacity of physical ports that it can
support. While each system is brought up with a different number of
ports, this number can be increased via splitting up to the ASIC's
limit.

Expose physical ports as a devlink resource so that user space will have
visibility into the maximum number of ports that can be supported and
the current occupancy. With this resource it is possible, for example,
to write generic (i.e., not platform dependent) tests for port
splitting.

Patch #1 adds the new resource and patch #2 adds a selftest.

v2:
* Add the physical ports resource as a generic devlink resource so that
  it could be re-used by other device drivers
====================

Link: https://lore.kernel.org/r/20210121131024.2656154-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


selftests: mlxsw: Add a scale test for physical ports · 5154b1b8

Danielle Ratson authored Jan 21, 2021

Query the maximum number of supported physical ports using devlink-resource
and test that this number can be reached by splitting each of the
splittable ports to its width. Test that an error is returned in case
the maximum number is exceeded.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


mlxsw: Register physical ports as a devlink resource · 321f7ab0

Danielle Ratson authored Jan 21, 2021

The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.

Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.

In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Merge branch 'htb-offload' · 35187642

Jakub Kicinski authored Jan 22, 2021

Maxim Mikityanskiy says:

====================
HTB offload

This series adds support for HTB offload to the HTB qdisc, and adds
usage to mlx5 driver.

The previous RFCs are available at [1], [2].

The feature is intended to solve the performance bottleneck caused by
the single lock of the HTB qdisc, which prevents it from scaling well.
The HTB algorithm itself is offloaded to the device, eliminating the
need to take the root lock of HTB on every packet. Classification part
is done in clsact (still in software) to avoid acquiring the lock, which
imposes a limitation that filters can target only leaf classes.

The speedup on Mellanox ConnectX-6 Dx was 14.2 times in the UDP
multi-stream test, compared to software HTB implementation (more details
in the mlx5 patch).

[1]: https://www.spinics.net/lists/netdev/msg628422.html
[2]: https://www.spinics.net/lists/netdev/msg663548.html

v2 changes:

Fixed sparse and smatch warnings. Formatted HTB patches to 80 chars per
line.

v3 changes:

Fixed the CI failure on parisc with 16-bit xchg by replacing it with
WRITE_ONCE. Fixed the capability bits in mlx5_ifc.h and the value of
MLX5E_QOS_MAX_LEAF_NODES.

v4 changes:

Check if HTB is root when offloading. Add extack for hardware errors.
Rephrase explanations of how it works in the commit message. Remove %hu
from format strings. Add resiliency when leaf_del_last fails to create a
new leaf node.
====================

Link: https://lore.kernel.org/r/20210119120815.463334-1-maximmi@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net/mlx5e: Support HTB offload · 214baf22

Maxim Mikityanskiy authored Jan 19, 2021

This commit adds support for HTB offload in the mlx5e driver.

Performance:

  NIC: Mellanox ConnectX-6 Dx
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)

  100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
  48 traffic classes, flower used for steering
  No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
  throughput.

  Baseline: 98.7 Gbps, 8.25 Mpps
  HTB: 6.7 Gbps, 0.56 Mpps
  HTB offload: 95.6 Gbps, 8.00 Mpps

Limitations:

1. 256 leaf nodes, 3 levels of depth.

2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
the bandwidth is split among the siblings according to these weights.
Other parameters for classes are not supported.

Ethtool statistics support for QoS SQs is also added. The counters are
called qos_txN_*, where N is the QoS queue number (starting from 0, the
numeration is separate from the normal SQs), and * is the counter name
(the counters are the same as for the normal SQs).
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


sch_htb: Stats for offloaded HTB · 83271586

Maxim Mikityanskiy authored Jan 19, 2021

This commit adds support for statistics of offloaded HTB. Bytes and
packets counters for leaf and inner nodes are supported, the values are
taken from per-queue qdiscs, and the numbers that the user sees should
have the same behavior as the software (non-offloaded) HTB.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


sch_htb: Hierarchical QoS hardware offload · d03b195b

Maxim Mikityanskiy authored Jan 19, 2021

HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.

After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.

ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.

The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

    # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
    classid 1:10

It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:

    # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
    action skbedit priority 1:10

This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.

Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel or iproute2.

2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.

Some features of HTB may not be supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. Therefore
the offload is not enabled by default; a new parameter is used to enable it:

    # tc qdisc replace dev eth0 root handle 1: htb offload
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
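
The data path in this commit message has the driver translate the class carried in the packet priority into a hardware queue. The sketch below models that lookup in Python. It assumes the usual tc handle layout (hexadecimal major in the upper 16 bits, minor in the lower 16); `select_queue` and the queue numbers are hypothetical and do not reflect any particular driver.

```python
def parse_classid(text):
    """Turn 'major:minor' (hexadecimal fields) into a 32-bit handle."""
    major, minor = text.split(":")
    return (int(major, 16) << 16) | int(minor, 16)


def select_queue(skb_priority, class_to_queue, default_queue=0):
    """Pick the hardware queue backing the leaf class named by the priority."""
    return class_to_queue.get(skb_priority, default_queue)


# Hypothetical mapping: leaf class 1:10 is backed by queue 8, 1:20 by queue 9.
class_to_queue = {parse_classid("1:10"): 8, parse_classid("1:20"): 9}
```

A packet marked with priority 1:10 by the clsact filter would land on queue 8 in this model, while an unclassified packet falls back to the default queue.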


net: sched: Add extack to Qdisc_class_ops.delete · 4dd78a73

Maxim Mikityanskiy authored Jan 19, 2021

In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: sched: Add multi-queue support to sch_tree_lock · ca1e4ab1

Maxim Mikityanskiy authored Jan 19, 2021

The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
However, hardware-offloaded HTB will start setting this flag while also
using sch_tree_lock.

The current implementation of sch_tree_lock basically locks on
qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
is the root itself, and its dev_queue just points to queue 0, while not
actually being used, because there are real per-queue qdiscs.

This patch changes the logic of sch_tree_lock and sch_tree_unlock to
lock the qdisc itself if it's the MQROOT.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
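
The lock selection change in this commit message can be modeled as choosing which qdisc's lock to take. In the Python sketch below, `MQROOT` is a placeholder bit for this model only (not the kernel's TCQ_F_MQROOT value), and `dev_queue_qdisc` stands in for `qdisc->dev_queue->qdisc`.

```python
from dataclasses import dataclass
from typing import Optional

MQROOT = 0x1  # placeholder flag bit for this model


@dataclass(eq=False)
class Qdisc:
    name: str
    flags: int = 0
    dev_queue_qdisc: Optional["Qdisc"] = None  # models qdisc->dev_queue->qdisc


def tree_lock_target(q):
    """Return the qdisc whose lock protects q's tree."""
    if q.flags & MQROOT:
        return q               # an MQROOT qdisc is the root: lock it directly
    return q.dev_queue_qdisc   # otherwise: the qdisc attached to its queue
```

For an MQROOT qdisc the queue-0 qdisc is never consulted, which avoids locking a per-queue qdisc that does not actually protect the tree.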


Merge branch 'tcp-add-cmsg-rx-timestamps-to-rx-zerocopy' · 04a88637

Jakub Kicinski authored Jan 22, 2021

Arjun Roy says:

====================
tcp: add CMSG+rx timestamps to rx. zerocopy

Provide CMSG and receive timestamp support to TCP
receive zerocopy. Patch 1 refactors CMSG pending state for
tcp_recvmsg() to avoid the use of magic numbers; patch 2 implements
receive timestamp via CMSG support for receive zerocopy, and uses the
constants added in patch 1.
====================

Link: https://lore.kernel.org/r/20210121004148.2340206-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


tcp: Add receive timestamp support for receive zerocopy. · 7eeba170

Arjun Roy authored Jan 20, 2021

tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


tcp: Remove CMSG magic numbers for tcp_recvmsg(). · 925bba24

Arjun Roy authored Jan 20, 2021

At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().

To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
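
Replacing magic numbers with named flags, as this commit message describes, looks roughly like the following in a Python model. The enum members and their values are illustrative; the commit message does not list the actual constants.

```python
from enum import IntFlag


class TcpCmsg(IntFlag):
    """Named bits for pending control messages (illustrative values)."""
    INQ = 1  # an in-queue size hint is pending
    TS = 2   # a receive timestamp is pending


def pending_cmsgs(want_inq, have_timestamp):
    """Build the pending-CMSG flag set without bare 1 and 2 literals."""
    flags = TcpCmsg(0)
    if want_inq:
        flags |= TcpCmsg.INQ
    if have_timestamp:
        flags |= TcpCmsg.TS
    return flags
```

Callers can then test `TcpCmsg.TS in flags` instead of `flags & 2`, which is what makes the constants reusable outside the original function.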


Merge branch 'net-bridge-multicast-add-initial-eht-support' · 5225d5f5

Jakub Kicinski authored Jan 22, 2021

Nikolay Aleksandrov says:

====================
net: bridge: multicast: add initial EHT support

This set adds explicit host tracking support for IGMPv3/MLDv2. The
already present per-port fast leave flag is used to enable it since that
is the primary goal of EHT, to track a group and its S,Gs usage per-host
and when left without any interested hosts delete them before the standard
timers. The EHT code is pretty self-contained and not enabled by default.
There is no new uAPI added, all of the functionality is currently hidden
behind the fast leave flag. In the future that will change (more below).
The host tracking uses two new sets per port group: one having an entry for
each host which contains that host's view of the group (source list and
filter mode), and one set which contains an entry for each source having
an internal set which contains an entry for each host that has reported
an interest for that source. RB trees are used for all sets so they're
compact when not used and fast when we need to do lookups.
To illustrate it:
 [ bridge port group ]
  ` [ host set (rb) ]
   ` [ host entry with a list of sources and filter mode ]
  ` [ source set (rb) ]
   ` [ source entry ]
    ` [ source host set (rb) ]
     ` [ source host entry with a timer ]

The number of tracked sources per host is limited to the maximum total
number of S,G entries per port group - PG_SRC_ENT_LIMIT (currently 32).
The number of hosts is unlimited, I think the argument that a local
attacker can exhaust the memory/cause high CPU usage can be applied to
fdb entries as well which are unlimited. In the future if needed we can
add an option to limit these, but I don't think it's necessary for a
start. All of the new sets are protected by the bridge's multicast lock.
I'm pretty sure we'll be changing the cases and improving the
convergence time in the future, but this seems like a good start.

Patch breakdown:
 patch 1 -  4: minor cleanups and preparations for EHT
 patch      5: adds the new structures which will be used in the
               following patches
 patch      6: adds support to create, destroy and lookup host entries
 patch      7: adds support to create, delete and lookup source set entries
 patch      8: adds a host "delete" function which is just a host's
               source list flush since that would automatically delete
               the host
 patch 9 - 10: add support for handling all IGMPv3/MLDv2 report types
               more information can be found in the individual patches
 patch     11: optimizes a specific TO_INCLUDE use-case with host timeouts
 patch     12: handles per-host filter mode changing (include <-> exclude)
 patch     13: pulls out block group deletion since now it can be
               deleted in both filter modes
 patch     14: marks deletions done due to fast leave

Future plans:
 - export host information
 - add an option to reduce queries
 - add an option to limit the number of host entries
 - tune more fast leave cases for quicker convergence

By the way I think this is the first open-source EHT implementation, I
couldn't find any while researching it. :)
====================

Link: https://lore.kernel.org/r/20210120145203.1109140-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
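
The two per-port-group sets described in this cover letter can be modeled with ordinary dictionaries. The Python sketch below is a simplification under stated assumptions: it uses dicts and sets instead of RB trees, omits timers and the source limit, and the class and method names are hypothetical.

```python
class PortGroupEht:
    """Model of one bridge port group's host-tracking state."""

    def __init__(self):
        self.hosts = {}    # host -> {"filter_mode": str, "sources": set}
        self.sources = {}  # source -> set of hosts interested in it

    def report(self, host, source, filter_mode="include"):
        """Record that host is interested in source."""
        entry = self.hosts.setdefault(
            host, {"filter_mode": filter_mode, "sources": set()})
        entry["filter_mode"] = filter_mode
        entry["sources"].add(source)
        self.sources.setdefault(source, set()).add(host)

    def leave(self, host, source):
        """Drop host's interest; return True if no host wants source now."""
        entry = self.hosts.get(host)
        if entry is not None:
            entry["sources"].discard(source)
            if not entry["sources"]:
                del self.hosts[host]  # an empty source list deletes the host
        interested = self.sources.get(source, set())
        interested.discard(host)
        if not interested:
            self.sources.pop(source, None)
            return True
        return False
```

When `leave` returns True the source has no interested hosts left, which is the condition under which the real implementation can delete entries ahead of the standard timers.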


net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletes · d5a10222

Nikolay Aleksandrov authored Jan 20, 2021

Mark groups which were deleted due to fast leave/EHT.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: bridge: multicast: handle block pg delete for all cases · e87e4b5c

Nikolay Aleksandrov authored Jan 20, 2021

A block report can result in empty source and host sets for both include
and exclude groups so if there are no hosts left we can safely remove
the group. Pull the block group handling so it can cover both cases and
add a check if EHT requires the delete.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: bridge: multicast: add EHT host filter_mode handling · c9739016

Nikolay Aleksandrov authored Jan 20, 2021

We should be able to handle host filter mode changes. For exclude mode
we must create a zero-src entry so the group will be kept even without
any S,G entries (non-zero source sets). That entry doesn't count toward
the entry limit and can always be created. Its timer is refreshed on new
exclude reports, and if the host filter mode changes to include, the
entry is removed and we rely only on the non-zero source sets.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


net: bridge: multicast: optimize TO_INCLUDE EHT timeouts · b66bf55b

Nikolay Aleksandrov authored Jan 20, 2021

This is an optimization specifically for TO_INCLUDE which sends queries
for the older entries and thus lowers the S,G timers to LMQT. If we have
the following situation for a group in either include or exclude mode:
 - host A was interested in srcs X and Y, but is timing out
 - host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
   to LMQT
 - host B sends BLOCK src Z after LMQT time has passed
 => since host B is the last host we can delete the group, but if we
    still have host A's EHT entries for X and Y (i.e. if they weren't
    lowered to LMQT previously) then we'll have to wait another LMQT
    time before deleting the group, with this optimization we can
    directly remove it regardless of the group mode as there are no more
    interested hosts
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
