Commits · 0f9ad4e7594419e906427bceba8e52467412895b · Kirill Smelkov / linux

15 Sep, 2020 9 commits

Merge branch 's390-qeth-next' · 0f9ad4e7

David S. Miller authored Sep 15, 2020

Julian Wiedmann says:

====================
s390/qeth: updates 2020-09-10

subject to positive review by the bridge maintainers on patch 5,
please apply the following patch series to netdev's net-next tree.

Alexandra adds BR_LEARNING_SYNC support to qeth. In addition to the
main qeth changes (controlling the feature, and raising switchdev
events), this also needs
- Patch 1 and 2 for some s390/cio infrastructure improvements
  (acked by Heiko to go in via net-next), and
- Patch 5 to introduce a new switchdev_notifier_type, so that a driver
  can clear all previously learned entries from the bridge FDB in case
  things go out-of-sync later on.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0f9ad4e7

s390/qeth: implement ndo_bridge_setlink for learning_sync · 521c65b6

Alexandra Winter authored Sep 10, 2020

Documentation/networking/switchdev.txt and 'man bridge' indicate that the
learning_sync bridge attribute is used to control whether a given
device will sync MAC addresses learned on its device port to a master
bridge FDB, where they will show up as 'extern_learn offload'. So we map
qeth_l2_dev2br_an_set() to the learning_sync bridge link attribute.

Turning off learning_sync will flush all extern_learn entries from the
bridge fdb and all pending events from the card's work queue.

When the hardware interface goes offline with learning_sync on
(e.g. for HW recovery), all extern_learn entries will be flushed from the
bridge fdb and all pending events from the card's work queue. When the
interface goes online again, it will send new notifications for all then
valid MACs. learning_sync attribute can not be modified while interface is
offline. See
'commit e6e771b3 ("s390/qeth: detach netdevice while card is offline")'

An alternative implementation would be to always offload the 'learning'
attribute of a software bridge to the hardware interface attached to it
and thus implicitly enable fdb notification. This was not chosen for 2
reasons:
1) In our case the software bridge is NOT a representation of a hardware
switch. It is just connected to a smart NIC that is able to inform
about the addresses attached to it. It is not necessarily using source
MAC learning for this and other bridgeports can be attached to other
NICs with different properties.
2) We want a means to enable this notification explicitly. There may be
cases where a bridgeport is set to 'learning', but we do not want to
enable the notification.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

521c65b6

s390/qeth: implement ndo_bridge_getlink for learning_sync · 780b6e7d

Alexandra Winter authored Sep 10, 2020

Documentation/networking/switchdev.txt and 'man bridge' indicate that the
learning_sync bridge attribute is used to indicate whether a given
device will sync MAC addresses learned on its device port to a master
bridge FDB.

learning_sync attribute can not be read while interface is offline (down).
See
'commit e6e771b3 ("s390/qeth: detach netdevice while card is offline")'
We return EOPNOTSUPP and not EONODEV in this case, because EONOTSUPP is the
only rc that is tolerated by 'bridge -d link show'.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

780b6e7d

s390/qeth: Reset address notification in case of buffer overflow · 817741a8

Alexandra Winter authored Sep 10, 2020

In case hardware sends more device-to-bridge-address-change notfications
than the qeth-l2 driver can handle, the hardware will send an overflow
event and then stop sending any events. It expects software to flush its
FDB and start over again. Re-enabling address-change-notification will
report all current addresses.

In order to re-enable address-change-notification this patch defines
the functions qeth_l2_dev2br_an_set() and qeth_l2_dev2br_an_set_cb
to enable or disable dev-to-bridge-address-notification.

A following patch will use the learning_sync bridgeport flag to trigger
enabling or disabling of address-change-notification, so we define
priv->brport_features to store the current setting. BRIDGE_INFO and
ADDR_INFO functionality are mutually exclusive, whereas ADDR_INFO and
qeth_l2_vnicc* can be used together.

Alternative implementations to handle buffer overflow:
Just re-enabling notification and adding all newly reported addresses
would cover any lost 'add' events, but not the lost 'delete' events.
Then these invalid addresses would stay in the bridge FDB as long as the
device exists.
Setting the net device down and up, would be an alternative, but is a bit
drastic. If the net device has many secondary addresses this will create
many delete/add events at its peers which could de-stabilize the
network segment.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

817741a8

bridge: Add SWITCHDEV_FDB_FLUSH_TO_BRIDGE notifier · d05e8e68

Alexandra Winter authored Sep 10, 2020

so the switchdev can notifiy the bridge to flush non-permanent fdb entries
for this port. This is useful whenever the hardware fdb of the switchdev
is reset, but the netdev and the bridgeport are not deleted.

Note that this has the same effect as the IFLA_BRPORT_FLUSH attribute.

CC: Jiri Pirko <jiri@resnulli.us>
CC: Ivan Vecera <ivecera@redhat.com>
CC: Roopa Prabhu <roopa@nvidia.com>
CC: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d05e8e68

s390/qeth: Translate address events into switchdev notifiers · 10a6cfc0

Alexandra Winter authored Sep 10, 2020

A qeth-l2 HiperSockets card can show switch-ish behaviour in the sense,
that it can report all MACs that are reachable via this interface. Just
like a switch device, it can notify the software bridge about changes
to its fdb. This patch exploits this device-to-bridge-notification and
extracts the relevant information from the hardware events to generate
notifications to an attached software bridge.

There are 2 sources for this information:
1) The reply message of Perform-Network-Subchannel-Operations (PNSO)
(operation code ADDR_INFO) reports all addresses that are currently
reachable (implemented in a later patch).
2) As long as device-to-bridge-notification is enabled, hardware will
generate address change notification events, whenever the content of
the hardware fdb changes (this patch).

The bridge_hostnotify feature (PNSO operation code BRIDGE_INFO) uses
the same address change notification events. We need to distinguish
between qeth_pnso_mode QETH_PNSO_BRIDGEPORT and QETH_PNSO_ADDR_INFO
and call a different handler. In both cases deadlocks must be
prevented, if the workqueue is drained under lock and QETH_PNSO_NONE,
when notification is disabled.

bridge_hostnotify generates udev events, there is no intend to do the same
for dev2br. Instead this patch will generate SWITCHDEV_FDB_ADD_TO_BRIDGE
and SWITCHDEV_FDB_DEL_TO_BRIDGE notifications, that will cause the
software bridge to add (or delete) entries to its fdb as 'extern_learn
offload'.

Documentation/networking/switchdev.txt proposes to add
"depends NET_SWITCHDEV" to driver's Kconfig. This is not done here,
so even in absence of the NET_SWITCHDEV module, the QETH_L2 module will
still be built, but then the switchdev notifiers will have no effect.

No VLAN filtering is done on the entries and VLAN information is not
passed on to the bridge fdb entries. This could be added later.
For now VLAN interfaces can be defined on the upper bridge interface.

Multicast entries are not passed on to the bridge fdb.
This could be added later. For now mcast flooding can be used in the
bridge.

The card reports all MACs that are in its FDB, but we must not pass on
MACs that are registered for this interface.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10a6cfc0

s390/qeth: Detect PNSO OC3 capability · fa115adf

Alexandra Winter authored Sep 10, 2020

This patch detects whether device-to-bridge-notification, provided
by the Perform Network Subchannel Operation (PNSO) operation code
ADDR_INFO (OC3), is supported by this card. A following patch will
map this to the learning_sync bridgeport flag, so we store it in
priv->brport_hw_features in bridgeport flag format.

Only IQD cards provide PNSO.
There is a feature bit to indicate whether the machine provides OC3,
unfortunately it is not set on old machines.
So PNSO is called to find out. As this will disable notification
and is exclusive with bridgeport_notification, this must be done
during card initialisation before previous settings are restored.

PNSO functionality requires some configuration values that are added to
the qeth_card.info structure. Some helper functions are defined to fill
them out when the card is brought online and some other places are
adapted, that can also benefit from these fields.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa115adf

s390/cio: Helper functions to read CSSID, IID, and CHID · b983aa1f

Alexandra Winter authored Sep 10, 2020

Add helper functions to expose Channel Subsystem ID (CSSID), MIF Image Id
(IID), Channel ID (CHID) and Channel Path ID (CHPID).
These values are required by the qeth driver's exploitation of network-
address-change-notifications to determine which entries belong to this
interface.

Store the Partition identifier in System log, as this may be used to map
a Linux view to a Hardware view for debugging purpose.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b983aa1f

s390/cio: Add new Operation Code OC3 to PNSO · 4fea49a7

Alexandra Winter authored Sep 10, 2020

Add support for operation code 3 (OC3) of the
Perform-Network-Subchannel-Operations (PNSO) function
of the Channel-Subsystem-Call (CHSC) instruction.

PNSO provides 2 operation codes:
OC0 - BRIDGE_INFO
OC3 - ADDR_INFO (new)

Extend the function calls to *pnso* to pass the OC and
add new response code 0108.

Support for OC3 is indicated by a flag in the css_general_characteristics.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4fea49a7

14 Sep, 2020 31 commits

tcp: schedule EPOLLOUT after a partial sendmsg · afb83012

Soheil Hassas Yeganeh authored Sep 14, 2020

For EPOLLET, applications must call sendmsg until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT is sent if there was
a failure upon memory allocation.

As a result on high-speed NICs, userspace observes multiple small
sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
1-2 TSOs in between two sendmsg syscalls:

// One large partial send due to memory allocation failure.
sendmsg(20MB)   = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB)   = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...)    = EAGAIN
// At this point, userspace can assume an EPOLLOUT.

To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after partial sendmsg.

After this commit userspace can assume that it will receive an EPOLLOUT
after the first partial sendmsg. This EPOLLOUT will benefit from
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in write queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

afb83012

tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit · 8ba3c9d1

Soheil Hassas Yeganeh authored Sep 14, 2020

If there was any event available on the TCP socket, tcp_poll()
will be called to retrieve all the events.  In tcp_poll(), we call
sk_stream_is_writeable() which returns true as long as we are at least
one byte below notsent_lowat.  This will result in quite a few
spurious EPLLOUT and frequent tiny sendmsg() calls as a result.

Similar to sk_stream_write_space(), use __sk_stream_is_writeable
with a wake value of 1, so that we set EPOLLOUT only if half the
space is available for write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ba3c9d1

ionic: fix up debugfs after queue swap · ed6d9b02

Shannon Nelson authored Sep 13, 2020

Clean and rebuild the debugfs info for the queues being swapped.

Fixes: a34e25ab ("ionic: change the descriptor ring length without full reset")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed6d9b02

__netif_receive_skb_core: don't untag vlan from skb on DSA master · b14a9fc4

Vladimir Oltean authored Sep 12, 2020

A DSA master interface has upper network devices, each representing an
Ethernet switch port attached to it. Demultiplexing the source ports and
setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
packet_type handler. Catch-all because DSA vendors have various header
implementations, which can be placed anywhere in the frame: before the
DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
handler acts like an rx_handler more than anything.

It is unlikely for the DSA master interface to have any other upper than
the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
is very likely that the DSA master will have no 8021q upper. So
__netif_receive_skb_core() will try to untag the VLAN, despite the fact
that the DSA switch interface might have an 8021q upper. So the skb will
never reach that.

So far, this hasn't been a problem because most of the possible
placements of the DSA switch header mentioned in the first paragraph
will displace the VLAN header when the DSA master receives the frame, so
__netif_receive_skb_core() will not actually execute any VLAN-specific
code for it. This only becomes a problem when the DSA switch header does
not displace the VLAN header (for example with a tail tag).

What the patch does is it bypasses the untagging of the skb when there
is a DSA switch attached to this net device. So, DSA is the only
packet_type handler which requires seeing the VLAN header. Once skb->dev
will be changed, __netif_receive_skb_core() will be invoked again and
untagging, or delivery to an 8021q upper, will happen in the RX of the
DSA switch interface itself.

*see commit 9eb8eff0 ("net: bridge: allow enslaving some DSA master
network devices". This is actually the reason why I prefer keeping DSA
as a packet_type handler of ETH_P_XDSA rather than converting to an
rx_handler. Currently the rx_handler code doesn't support chaining, and
this is a problem because a DSA master might be bridged.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b14a9fc4

Merge branch 'net-next-dsa-mt7530-add-support-for-MT7531' · 0ca6d8b7

David S. Miller authored Sep 14, 2020

Landen Chao says:

====================
net-next: dsa: mt7530: add support for MT7531

This patch series adds support for MT7531.

MT7531 is the next generation of MT7530 which could be found on Mediatek
router platforms such as MT7622 or MT7629.

It is also a 7-ports switch with 5 giga embedded phys, 2 cpu ports, and
the same MAC logic of MT7530. Cpu port 6 only supports SGMII interface.
Cpu port 5 supports either RGMII or SGMII in different HW SKU, but cannot
be muxed to PHY of port 0/4 like mt7530. Due to support for SGMII
interface, pll, and pad setting are different from MT7530.

MT7531 SGMII interface can be configured in following mode:
- 'SGMII AN mode' with in-band negotiation capability
    which is compatible with PHY_INTERFACE_MODE_SGMII.
- 'SGMII force mode' without in-band negotiation
    which is compatible with 10B/8B encoding of
    PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
- 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
    which is compatible with 10B/8B encoding of
    PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.

v4 -> v5
- Add fixed-link node to dsa cpu port in dts file by suggestion of
  Vladimir Oltean.

v3 -> v4
- Adjust the coding style by suggestion of Jakub Kicinski.
  Remove unnecessary jumping label, merge continuous numeric 'switch
  cases' into one line, and keep the variables longest to shortest
  (reverse xmas tree).

v2 -> v3
- Keep the same setup logic of mt7530/mt7621 because these series of
  patches is for adding mt7531 hardware.
- Do not adjust rgmii delay when vendor phy driver presents in order to
  prevent double adjustment by suggestion of Andrew Lunn.
- Remove redundant 'Example 4' from dt-bindings by suggestion of
  Rob Herring.
- Fix typo.

v1 -> v2
- change phylink_validate callback function to support full-duplex
  gigabit only to match hardware capability.
- add description of SGMII interface.
- configure mt7531 cpu port in fastest speed by default.
- parse SGMII control word for in-band negotiation mode.
- configure RGMII delay based on phy.rst.
- Rename the definition in the header file to avoid potential conflicts.
- Add wrapper function for mdio read/write to support both C22 and C45.
- correct fixed-link speed of 2500base-x in dts.
- add MT7531 port mirror setting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0ca6d8b7

arm64: dts: mt7622: add mt7531 dsa to bananapi-bpi-r64 board · 79a675e6

Landen Chao authored Sep 11, 2020

Add mt7531 dsa to bananapi-bpi-r64 board for 5 giga Ethernet ports support.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Tested-By: Frank Wunderlich <frank-w@public-files.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

79a675e6

arm64: dts: mt7622: add mt7531 dsa to mt7622-rfb1 board · 6af06448

Landen Chao authored Sep 11, 2020

Add mt7531 dsa to mt7622-rfb1 board for 5 giga Ethernet ports support.
mt7622 only supports 1 sgmii interface, so either gmac0 or gmac1 can be
configured as sgmii interface. In this patch, change to connect mt7622
gmac0 and mt7531 port6 through sgmii interface.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6af06448

net: dsa: mt7530: Add the support of MT7531 switch · c288575f

Landen Chao authored Sep 11, 2020

Add new support for MT7531:

MT7531 is the next generation of MT7530. It is also a 7-ports switch with
5 giga embedded phys, 2 cpu ports, and the same MAC logic of MT7530. Cpu
port 6 only supports SGMII interface. Cpu port 5 supports either RGMII
or SGMII in different HW sku, but cannot be muxed to PHY of port 0/4 like
mt7530. Due to SGMII interface support, pll, and pad setting are different
from MT7530. This patch adds different initial setting, and SGMII phylink
handlers of MT7531.

MT7531 SGMII interface can be configured in following mode:
- 'SGMII AN mode' with in-band negotiation capability
    which is compatible with PHY_INTERFACE_MODE_SGMII.
- 'SGMII force mode' without in-band negotiation
    which is compatible with 10B/8B encoding of
    PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
- 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
    which is compatible with 10B/8B encoding of
    PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c288575f

dt-bindings: net: dsa: add new MT7531 binding to support MT7531 · 27834b02

Landen Chao authored Sep 11, 2020

Add devicetree binding to support the compatible mt7531 switch as used
in the MediaTek MT7531 switch.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27834b02

net: dsa: mt7530: Extend device data ready for adding a new hardware · 88bdef8b

Landen Chao authored Sep 11, 2020

Add a structure holding required operations for each device such as device
initialization, PHY port read or write, a checker whether PHY interface is
supported on a certain port, MAC port setup for either bus pad or a
specific PHY interface.

The patch is done for ready adding a new hardware MT7531, and keep the
same setup logic of existing hardware.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88bdef8b

net: dsa: mt7530: Refine message in Kconfig · dc8ef938

Landen Chao authored Sep 11, 2020

Refine message in Kconfig with fixing typo and an explicit MT7621 support.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc8ef938

drivers/net/wan/x25_asy: Remove an unnecessary x25_type_trans call · 4b468385

Xie He authored Sep 11, 2020

x25_type_trans only needs to be called before we call netif_rx to pass
the skb to upper layers.

It does not need to be called before lapb_data_received. The LAPB module
does not need the fields that are set by calling it.

In the other two X.25 drivers - lapbether and hdlc_x25. x25_type_trans
is only called before netif_rx and not before lapb_data_received.

Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b468385

net: try to avoid unneeded backlog flush · 2de79ee2

Paolo Abeni authored Sep 10, 2020

flush_all_backlogs() may cause deadlock on systems
running processes with FIFO scheduling policy.

The above is critical in -RT scenarios, where user-space
specifically ensure no network activity is scheduled on
the CPU running the mentioned FIFO process, but still get
stuck.

This commit tries to address the problem checking the
backlog status on the remote CPUs before scheduling the
flush operation. If the backlog is empty, we can skip it.

v1 -> v2:
 - explicitly clear flushed cpu mask - Eric
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2de79ee2

Merge branch 'mlxsw-Derive-SBIB-from-maximum-port-speed-and-MTU' · 7b2d1b8d

David S. Miller authored Sep 14, 2020

Ido Schimmel says:

====================
mlxsw: Derive SBIB from maximum port speed & MTU

Petr says:

Internal buffer is a part of port headroom used for packets that are
mirrored due to triggers that the Spectrum ASIC considers "egress". Besides
ACL mirroring on port egresss this includes also packets mirrored due to
ECN marking.

This patchset changes the way the internal mirroring buffer is reserved.
Currently the buffer reflects port MTU and speed accurately. In the future,
mlxsw should support dcbnl_setbuffer hook to allow the users to set buffer
sizes by hand. In that case, there might not be enough space for growth of
the internal mirroring buffer due to MTU and speed changes. While vetoing
MTU changes would be merely confusing, port speed changes cannot be vetoed,
and such change would simply lead to issues in packet mirroring.

For these reasons, with these patches the internal mirroring buffer is
derived from maximum MTU and maximum speed achievable on the port.

Patches #1 and #2 introduce a new callback to determine the maximum speed a
given port can achieve.

With patches #3 and #4, the information about, respectively, maximum MTU
and maximum port speed, is kept in struct mlxsw_sp_port.

In patch #5, maximum MTU and maximum speed are used to determine the size
of the internal buffer. MTU update and speed update hooks are dropped,
because they are no longer necessary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7b2d1b8d

mlxsw: spectrum_span: Derive SBIB from maximum port speed & MTU · 532b49e4

Petr Machata authored Sep 13, 2020

The SBIB register configures the size of an internal buffer that the
Spectrum ASICs use when mirroring traffic on egress. This size should be
taken into account when validating that the port headroom buffers are not
larger than the chip can handle. Up until now this was not done, which is
incidentally not a problem, because the priority group buffers that mlxsw
auto-configures are small enough that the boundary condition could not be
violated.

However when dcbnl_setbuffer is implemented, the user has control over
sizes of PG buffers, and they might overshoot the headroom capacity.
However the size of the SBIB buffer depends on port speed, and that cannot
be vetoed. Therefore SBIB size should be deduced from maximum port speed.

Additionally, once the buffers are configured by hand, the user could get
into an uncomfortable situation where their MTU change requests get vetoed,
because the SBIB does not fit anymore. Therefore derive SBIB size from
maximum permissible MTU as well.

Remove all the code that adjusted the SBIB size whenever speed or MTU
changed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

532b49e4

mlxsw: spectrum: Keep maximum speed around · 3232e8c6

Petr Machata authored Sep 13, 2020

The maximum port speed depends on link modes supported by the port, and for
Ethernet ports is constant. The maximum speed will be handy when setting
SBIB, the internal buffer used for traffic mirroring. Therefore, keep it in
struct mlxsw_sp_port for easy access.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3232e8c6

mlxsw: spectrum: Keep maximum MTU around · 2ecf87ae

Petr Machata authored Sep 13, 2020

The maximum port MTU depends on port type. On Spectrum, mlxsw configures
all ports as Ethernet ports, and the maximum MTU therefore never changes.
Besides checking MTU configuration, maximum MTU will also be handy when
setting SBIB, the internal buffer used for traffic mirroring. Therefore,
keep it in struct mlxsw_sp_port for easy access.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ecf87ae

mlxsw: spectrum_ethtool: Introduce ptys_max_speed callback · 60fbc521

Petr Machata authored Sep 13, 2020

The SBIB register configures the size of an internal buffer that the
Spectrum ASICs use when mirroring traffic on egress. This size should be
taken into account when validating that the port headroom buffers are not
larger than the chip can handle. Up until now this was not done, which is
incidentally not a problem, because the priority group buffers that mlxsw
auto-configures are small enough that the boundary condition could not be
violated.

When dcbnl_setbuffer is implemented, the user gets control over sizes of PG
buffers, and they might overshoot the headroom capacity. However the size
of the SBIB buffer depends on port speed, which cannot be vetoed. There is
obviously no way to retroactively push back on requests for overlarge PG
buffers, or reject an overlarge MTU, or cancel losslessness of a certain
PG.

Therefore, instead of taking into account the current speed when
calculating SBIB buffer size, take into account the maximum speed that a
port with given Ethernet protocol capabilities can have.

To that end, add a new ethtool callback, ptys_max_speed, which determines
this maximum speed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60fbc521

mlxsw: spectrum_ethtool: Extract a helper to get Ethernet attributes · d24ca6c0

Petr Machata authored Sep 13, 2020

In order to allow reusing the logic, extract from
mlxsw_sp_port_get_link_ksettings() the code to obtain Ethernet protocol
attributes, mlxsw_sp_port_ptys_query().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d24ca6c0

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 7952d7ed

David S. Miller authored Sep 14, 2020

Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2020-09-14

This series contains updates to i40e driver only.

Li RongQing removes binding affinity mask to a fixed CPU and sets
prefetch of Rx buffer page to occur conditionally.

Björn provides AF_XDP performance improvements by not prefetching HW
descriptors, using 16 byte descriptors, and moving buffer allocation
out of Rx processing loop.

v2: Define prefetch_page_address in a common header for patch 2.
Dropped, previous, patch 5 as it is being reworked to be more
generalized.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7952d7ed

Merge tag 'rxrpc-next-20200914' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · e0d9ae69

David S. Miller authored Sep 14, 2020

David Howells says:

====================
rxrpc: Fixes for the connection manager rewrite

Here are some fixes for the connection manager rewrite:

 (1) Fix a goto to the wrong place in error handling.

 (2) Fix a missing NULL pointer check.

 (3) The stored allocation error needs to be stored signed.

 (4) Fix a leak of connection bundle when clearing connections due to
     net namespace exit.

 (5) Fix an overget of the bundle when setting up a new client conn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e0d9ae69

hinic: add vxlan segmentation and cs offload support · 33acd755

Luo bin authored Sep 14, 2020

Add NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_UDP_TUNNEL_CSUM features
to support vxlan segmentation and checksum offload. Ipip and ipv6
tunnel packets are regarded as non-tunnel pkt for hw and as for other
type of tunnel pkts, checksum offload is disabled.
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33acd755

net: qlcnic: remove unused variable 'val' in qlcnic_83xx_cam_unlock() · f3694707

Zhang Changzhong authored Sep 14, 2020

Fixes the following W=1 kernel build warning(s):

drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c:661:6: warning:
 variable 'val' set but not used [-Wunused-but-set-variable]
  661 |  u32 val;
      |      ^~~

After commit 7f966452 ("qlcnic: 83xx memory map and HW access
routines"), variable 'val' is never used in qlcnic_83xx_cam_unlock(), so
removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3694707

net: pxa168_eth: remove unused variable 'retval' int pxa168_eth_change_mtu() · f7ab0f04

Zhang Changzhong authored Sep 14, 2020

Fixes the following W=1 kernel build warning(s):

drivers/net/ethernet/marvell/pxa168_eth.c:1190:6: warning:
 variable 'retval' set but not used [-Wunused-but-set-variable]
 1190 |  int retval;
      |      ^~~~~~

Function pxa168_eth_change_mtu() always return zero, so variable 'retval'
is redundant, just remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7ab0f04

net: fec: ptp: remove unused variable 'ns' in fec_time_keep() · 992bae7e

Zhang Changzhong authored Sep 14, 2020

Fixes the following W=1 kernel build warning(s):

drivers/net/ethernet/freescale/fec_ptp.c:523:6: warning:
 variable 'ns' set but not used [-Wunused-but-set-variable]
  523 |  u64 ns;
      |      ^~

After commit 6605b730 ("FEC: Add time stamping code and a PTP
hardware clock"), variable 'ns' is never used in fec_time_keep(),
so removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

992bae7e

net: dnet: remove unused variable 'tx_status 'in dnet_start_xmit() · 85743cea

Zhang Changzhong authored Sep 14, 2020

Fixes the following W=1 kernel build warning(s):

drivers/net/ethernet/dnet.c:510:6: warning:
 variable 'tx_status' set but not used [-Wunused-but-set-variable]
  u32 tx_status, irq_enable;
      ^~~~~~~~~

After commit 47964174 ("dnet: Dave DNET ethernet controller driver
(updated)"), variable 'tx_status' is never used in dnet_start_xmit(),
so removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

85743cea

tcp: remove SOCK_QUEUE_SHRUNK · 0cbe6a8f

Eric Dumazet authored Sep 14, 2020

SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.

This is later used from tcp_check_space() before
considering to send EPOLLOUT.

Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.

This stall can happen if TCP_NOTSENT_LOWAT is used.

With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.

This also removes a cache line dirtying for typical RPC
workloads.

Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0cbe6a8f

net/packet: Fix a comment about hard_header_len and headroom allocation · b4c58814

Xie He authored Sep 14, 2020

This comment is outdated and no longer reflects the actual implementation
of af_packet.c.

Reasons for the new comment:

1.

In af_packet.c, the function packet_snd first reserves a headroom of
length (dev->hard_header_len + dev->needed_headroom).
Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header,
which calls dev->header_ops->create, to create the link layer header.
If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of
length (dev->hard_header_len), and checks if the user has provided a
header sized between (dev->min_header_len) and (dev->hard_header_len)
(in dev_validate_header).
This shows the developers of af_packet.c expect hard_header_len to
be consistent with header_ops.

2.

In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment.
That comment states that prepending an LL header internally in a driver
is considered a bug. I believe this bug can be fixed by setting
hard_header_len to 0, making the internal header completely invisible
to af_packet.c (and requesting the headroom in needed_headroom instead).

3.

There is a commit for a WiFi driver:
commit 9454f7a8 ("mwifiex: set needed_headroom, not hard_header_len")
According to the discussion about it at:
  https://patchwork.kernel.org/patch/11407493/
The author tried to set the WiFi driver's hard_header_len to the Ethernet
header length, and request additional header space internally needed by
setting needed_headroom.
This means this usage is already adopted by driver developers.

Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4c58814

Merge branch 'mptcp-introduce-support-for-real-multipath-xmit' · b91c06c5

David S. Miller authored Sep 14, 2020

Paolo Abeni says:

====================
mptcp: introduce support for real multipath xmit

This series enable MPTCP socket to transmit data on multiple subflows
concurrently in a load balancing scenario.

First the receive code path is refactored to better deal with out-of-order
data (patches 1-7). An RB-tree is introduced to queue MPTCP-level out-of-order
data, closely resembling the TCP level OoO handling.

When data is sent on multiple subflows, the peer can easily see OoO - "future"
data at the MPTCP level, especially if speeds, delay, or jitter are not
symmetric.

The other major change regards the netlink PM, which is extended to allow
creating non backup subflows in patches 9-11.

There are a few smaller additions, like the introduction of OoO related mibs,
send buffer autotuning and better ack handling.

Finally a bunch of new self-tests is introduced. The new feature is tested
ensuring that the B/W used by an MPTCP socket using multiple subflows matches
the link aggregated B/W - we use low B/W virtual links, to ensure the tests
are not CPU bounded.

v1 -> v2:
  - fix 32 bit build breakage
  - fix a bunch of checkpatch issues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b91c06c5

mptcp: simult flow self-tests · 1a418cb8

Paolo Abeni authored Sep 14, 2020

Add a bunch of test-cases for multiple subflow xmit:
create multiple subflows simulating different links
condition via netem and verify that the msk is able
to use completely the aggregated bandwidth.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1a418cb8

mptcp: call tcp_cleanup_rbuf on subflows · c76c6956

Paolo Abeni authored Sep 14, 2020

That is needed to let the subflows announce promptly when new
space is available in the receive buffer.

tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c76c6956