Commits · 90ce665c6a40dc1be771bf5f86e624c0acf3a76f · Kirill Smelkov / linux

26 May, 2020 27 commits

net: mdiobus: add clause 45 mdiobus accessors · 90ce665c

Russell King authored May 26, 2020

There is a recurring pattern throughout some of the PHY code converting
a devad and regnum to our packed clause 45 representation. Rather than
having this scattered around the code, let's put a common translation
function in mdio.h, and provide some register accessors.

Convert the phylib core, phylink, bcm87xx and cortina to use these.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90ce665c

Merge branch 'flow-mpls' · 8928e19a

David S. Miller authored May 26, 2020

Guillaume Nault says:

====================
flow_dissector, cls_flower: Add support for multiple MPLS Label Stack Entries

Currently, the flow dissector and the Flower classifier can only handle
the first entry of an MPLS label stack. This patch series generalises
the code to allow parsing and matching the Label Stack Entries that
follow.

Patch 1 extends the flow dissector to parse MPLS LSEs until the Bottom
Of Stack bit is reached. The number of parsed LSEs is capped at
FLOW_DIS_MPLS_MAX (arbitrarily set to 7). Flower and the NFP driver
are updated to take into account the new layout of struct
flow_dissector_key_mpls.

Patch 2 extends Flower. It defines new netlink attributes, which are
independent from the previous MPLS ones. Mixing the old and the new
attributes in a same filter is not allowed. For backward compatibility,
the old attributes are used when dumping filters that don't require the
new ones.

Changes since v2:
  * Fix compilation with the new MLX5 bareudp tunnel code.

Changes since v1:
  * Fix compilation of NFP driver (kbuild test robot).
  * Fix sparse warning with entropy label (kbuild test robot).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8928e19a

cls_flower: Support filtering on multiple MPLS Label Stack Entries · 61aec25a

Guillaume Nault authored May 26, 2020

With struct flow_dissector_key_mpls now recording the first
FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of
these LSEs independently.

In order to avoid creating new netlink attributes for every possible
depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute
that contains the list of LSEs to match. Each LSE is represented by
another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains
the attributes representing the depth and the MPLS fields to match at
this depth (label, TTL, etc.).

For each MPLS field, the mask is always set to all-ones, as this is
what the original API did. We could allow user configurable masks in
the future if there is demand for more flexibility.

The new API also allows to only specify an LSE depth. In that case,
Flower only verifies that the MPLS label stack depth is greater or
equal to the provided depth (that is, an LSE exists at this depth).

Filters that only match on one (or more) fields of the first LSE are
dumped using the old netlink attributes, to avoid confusing user space
programs that don't understand the new API.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61aec25a

flow_dissector: Parse multiple MPLS Label Stack Entries · 58cff782

Guillaume Nault authored May 26, 2020

The current MPLS dissector only parses the first MPLS Label Stack
Entry (second LSE can be parsed too, but only to set a key_id).

This patch adds the possibility to parse several LSEs by making
__skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long
as the Bottom Of Stack bit hasn't been seen, up to a maximum of
FLOW_DIS_MPLS_MAX entries.

FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for
many practical purposes, without wasting too much space.

To record the parsed values, flow_dissector_key_mpls is modified to
store an array of stack entries, instead of just the values of the
first one. A bit field, "used_lses", is also added to keep track of
the LSEs that have been set. The objective is to avoid defining a
new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack.

TC flower is adapted for the new struct flow_dissector_key_mpls layout.
Matching on several MPLS Label Stack Entries will be added in the next
patch.

The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and
mlx5's parse_tunnel() now verify that the rule only uses the first LSE
and fail if it doesn't.

Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is
slightly modified. Instead of recording the first Entropy Label, it
now records the last one. This shouldn't have any consequences since
there doesn't seem to have any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY
in the tree. We'd probably better do a hash of all parsed MPLS labels
instead (excluding reserved labels) anyway. That'd give better entropy
and would probably also simplify the code. But that's not the purpose
of this patch, so I'm keeping that as a future possible improvement.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58cff782

Merge tag 'batadv-next-for-davem-20200526' of git://git.open-mesh.org/linux-merge · fb8ddaa9

David S. Miller authored May 26, 2020

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - Fix revert dynamic lockdep key changes for batman-adv,
   by Sven Eckelmann

 - use rcu_replace_pointer() where appropriate, by Antonio Quartulli

 - Revert "disable ethtool link speed detection when auto negotiation
   off", by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fb8ddaa9

Merge branch 'tipc-add-some-improvements' · 6a862a44

David S. Miller authored May 26, 2020

Tuong Lien says:

====================
tipc: add some improvements

This series adds some improvements to TIPC.

The first patch improves the TIPC broadcast's performance with the 'Gap
ACK blocks' mechanism similar to unicast before, while the others give
support on tracing & statistics for broadcast links, and an alternative
to carry broadcast retransmissions via unicast which might be useful in
some cases.

Besides, the Nagle algorithm can now automatically 'adjust' itself
depending on the specific network condition a stream connection runs by
the last patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6a862a44

tipc: add test for Nagle algorithm effectiveness · 0a3e060f

Tuong Lien authored May 26, 2020

When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.

In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.

In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.

When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.

Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.

This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.

In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.

Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a3e060f

tipc: add support for broadcast rcv stats dumping · 03b6fefd

Tuong Lien authored May 26, 2020

This commit enables dumping the statistics of a broadcast-receiver link
like the traditional 'broadcast-link' one (which is for broadcast-
sender). The link dumping can be triggered via netlink (e.g. the
iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
indicator.

The name of a broadcast-receiver link of a specific peer will be in the
format: 'broadcast-link:<peer-id>'.

For example:

Link <broadcast-link:1001002>
  Window:50 packets
  RX packets:7841 fragments:2408/440 bundles:0/0
  TX packets:0 fragments:0/0 bundles:0/0
  RX naks:0 defs:124 dups:0
  TX naks:21 acks:0 retrans:0
  Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the
usual way via netlink by specifying that link name in command.

Note: the 'tipc_link_name_ext()' is removed because the link name can
now be retrieved simply via the 'l->name'.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

03b6fefd

tipc: enable broadcast retrans via unicast · a91d55d1

Tuong Lien authored May 26, 2020

In some environment, broadcast traffic is suppressed at high rate (i.e.
a kind of bandwidth limit setting). When it is applied, TIPC broadcast
can still run successfully. However, when it comes to a high load, some
packets will be dropped first and TIPC tries to retransmit them but the
packet retransmission is intentionally broadcast too, so making things
worse and not helpful at all.

This commit enables the broadcast retransmission via unicast which only
retransmits packets to the specific peer that has really reported a gap
i.e. not broadcasting to all nodes in the cluster, so will prevent from
being suppressed, and also reduce some overheads on the other peers due
to duplicates, finally improve the overall TIPC broadcast performance.

Note: the functionality can be turned on/off via the sysctl file:

echo 1 > /proc/sys/net/tipc/bc_retruni
echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

a91d55d1

tipc: add back link trace events · c6ed7a5c

Tuong Lien authored May 26, 2020

In the previous commit ("tipc: add Gap ACK blocks support for broadcast
link"), we have removed the following link trace events due to the code
changes:

- tipc_link_bc_ack
- tipc_link_retrans

This commit adds them back along with some minor changes to adapt to
the new code.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6ed7a5c

tipc: introduce Gap ACK blocks for broadcast link · d7626b5a

Tuong Lien authored May 26, 2020

As achieved through commit 9195948f ("tipc: improve TIPC throughput
by Gap ACK blocks"), we apply the same mechanism for the broadcast link
as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
consist of two parts built for both the broadcast and unicast types:

 31                       16 15                        0
+-------------+-------------+-------------+-------------+
|  bgack_cnt  |  ugack_cnt  |            len            |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > bc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > uc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible.

We also increase the max number of Gap ACK blocks to 128, allowing upto
64 blocks per type (total buffer size = 516 bytes).

Besides, the 'tipc_link_advance_transmq()' function is refactored which
is applicable for both the unicast and broadcast cases now, so some old
functions can be removed and the code is optimized.

With the patch, TIPC broadcast is more robust regardless of packet loss
or disorder, latency, ... in the underlying network. Its performance is
boost up significantly.
For example, experiment with a 5% packet loss rate results:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    0m 42.46s
user    0m 1.16s
sys     0m 17.67s

Without the patch:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    8m 27.94s
user    0m 0.55s
sys     0m 2.38s
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7626b5a

qed: Add EDPM mode type for user-fw compatibility · ff937b91

Yuval Basson authored May 26, 2020

In older FW versions the completion flag was treated as the ack flag in
edpm messages. Expose the FW option of setting which mode the QP is in
by adding a flag to the qedr <-> qed API.

Flag is added for backward compatibility with libqedr.
This flag will be set by qedr after determining whether the libqedr is
using the updated version.

Fixes: f1093940 ("qed: Add support for QP verbs")
Signed-off-by: Yuval Basson <yuval.bason@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff937b91

tcp: tcp_v4_err() icmp skb is named icmp_skb · 23917494

Eric Dumazet authored May 25, 2020

I missed the fact that tcp_v4_err() differs from tcp_v6_err().

After commit 4d1a2d9e ("Rename skb to icmp_skb in tcp_v4_err()")
the skb argument has been renamed to icmp_skb only in one function.

I will in a future patch reconciliate these functions to avoid
this kind of confusion.

Fixes: 45af29ca ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23917494

batman-adv: Revert "disable ethtool link speed detection when auto negotiation off" · 9ad346c9

Sven Eckelmann authored Nov 25, 2019

The commit 8c46fcd7 ("batman-adv: disable ethtool link speed detection
when auto negotiation off") disabled the usage of ethtool's link_ksetting
when auto negotation was enabled due to invalid values when used with
tun/tap virtual net_devices. According to the patch, automatic measurements
should be used for these kind of interfaces.

But there are major flaws with this argumentation:

* automatic measurements are not implemented
* auto negotiation has nothing to do with the validity of the retrieved
  values

The first point has to be fixed by a longer patch series. The "validity"
part of the second point must be addressed in the same patch series by
dropping the usage of ethtool's link_ksetting (thus always doing automatic
measurements over ethernet).

Drop the patch again to have more default values for various net_device
types/configurations. The user can still overwrite them using the
batadv_hardif's BATADV_ATTR_THROUGHPUT_OVERRIDE.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

9ad346c9

Merge branch 'r8169-sync-hw-config-for-few-chip-versions-with-r8168-vendor-driver' · d52caf04

David S. Miller authored May 25, 2020

Heiner Kallweit says:

====================
r8169: sync hw config for few chip versions with r8168 vendor driver

Sync hw config for few chip versions with r8168 vendor driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d52caf04

r8169: sync RTL8168f/RTL8411 hw config with vendor driver · d05890c5

Heiner Kallweit authored May 25, 2020

Sync hw config for RTL8168f/RTL8411 with r8168 vendor driver.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d05890c5

r8169: sync RTL8168evl hw config with vendor driver · 33b00ca1

Heiner Kallweit authored May 25, 2020

Sync hw config for RTL8168evl with r8168 vendor driver.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33b00ca1

r8169: sync RTL8168h hw config with vendor driver · ee1350f9

Heiner Kallweit authored May 25, 2020

Sync hw config for RTL8168h with r8168 vendor driver.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee1350f9

r8169: sync RTL8168g hw config with vendor driver · d29d5ff9

Heiner Kallweit authored May 25, 2020

Sync hw config for RTL8168g with r8168 vendor driver.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d29d5ff9

Merge tag 'wireless-drivers-next-2020-05-25' of... · 3248044e

David S. Miller authored May 25, 2020

Merge tag 'wireless-drivers-next-2020-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.8

Second set of patches for v5.8. Lots of new features and new supported
hardware for mt76. Also rtw88 got new hardware support.

Major changes:

rtw88

* add support for Realtek 8723DE PCI adapter

* rename rtw88.ko/rtwpci.ko to rtw88_core.ko/rtw88_pci.ko

iwlwifi

* stop supporting swcrypto and bt_coex_active module parameters on
  mvm devices

* enable A-AMSDU in low latency

mt76

* new devices for mt76x0/mt76x2

* support for non-offload firmware on mt7663

* hw/sched scan support for mt7663

* mt7615/mt7663 MSI support

* TDLS support

* mt7603/mt7615 rate control fixes

* new driver for mt7915

* wowlan support for mt7663

* suspend/resume support for mt7663
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3248044e

bridge: mrp: Fix out-of-bounds read in br_mrp_parse · 617504c6

Horatiu Vultur authored May 25, 2020

The issue was reported by syzbot. When the function br_mrp_parse was
called with a valid net_bridge_port, the net_bridge was an invalid
pointer. Therefore the check br->stp_enabled could pass/fail
depending where it was pointing in memory.
The fix consists of setting the net_bridge pointer if the port is a
valid pointer.

Reported-by: syzbot+9c6f0f1f8e32223df9a4@syzkaller.appspotmail.com
Fixes: 65369933 ("bridge: mrp: Integrate MRP into the bridge")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

617504c6

drivers: ipa: print dev_err info accurately · 07153961

Wang Wenhu authored May 24, 2020

Print certain name string instead of hard-coded "memory" for dev_err
output, which would be more accurate and helpful for debugging.
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Cc: Alex Elder <elder@kernel.org>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

07153961

ptp_clock: Let the ADJ_OFFSET interface respect the ADJ_NANO flag for PHC devices. · eabd5c9d

Richard Cochran authored May 24, 2020

In commit 184ecc9e ("ptp: Add adjphase
function to support phase offset control.") the PTP Hardware Clock
interface expanded to support the ADJ_OFFSET offset mode.  However,
the implementation did not respect the traditional yet pedantic
distinction between units of microseconds and nanoseconds signaled by
the ADJ_NANO flag.  This patch fixes the issue by adding logic to
handle that flag.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eabd5c9d

tcp: allow traceroute -Mtcp for unpriv users · 45af29ca

Eric Dumazet authored May 24, 2020

Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.

$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.

$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
 2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
 3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
 4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
 5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
 6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
 7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
 8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
 9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C

We can easily support traceroute over TCP, by queueing an error message
into socket error queue.

Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SNT state.

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
        msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
        msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                   {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
        msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144

Suggested-by: Maciej Żenczykowski <maze@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

45af29ca

bnx2x: allow bnx2x_bsc_read() to schedule · 880f8f99

Eric Dumazet authored May 23, 2020

bnx2x_warpcore_read_sfp_module_eeprom() can call bnx2x_bsc_read()
three times before giving up.

This causes latency blips of at least 31 ms (58 ms being reported
by our teams)

Convert the long lasting loops of udelay() to usleep_range() ones,
and breaks the loops on precise time tracking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <aelior@marvell.com>
Cc: Sudarsana Kalluru <skalluru@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

880f8f99

ipv4: potential underflow in compat_ip_setsockopt() · 6a1015b0

Dan Carpenter authored May 23, 2020

The value of "n" is capped at 0x1ffffff but it checked for negative
values.  I don't think this causes a problem but I'm not certain and
it's harmless to prevent it.

Fixes: 2e041728 ("ipv4: do compat setsockopt for MCAST_MSFILTER directly")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a1015b0

mvneta: MVNETA_SKB_HEADROOM set last 3 bits to zero · ca23cb0b

Sven Auhagen authored May 23, 2020

For XDP the MVNETA_SKB_HEADROOM is used as an offset for
the received data.
The MVNETA manual states that the last 3 bits assumed to be 0.

This is currently the case but lets make it explicit in the definition
to prevent future problems.
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca23cb0b

25 May, 2020 13 commits

vxlan: Do not assume RTNL is held in vxlan_fdb_info() · 06ec313e

Ido Schimmel authored May 25, 2020

vxlan_fdb_info() is not always called with RTNL held or from an RCU
read-side critical section. For example, in the following call path:

vxlan_cleanup()
  vxlan_fdb_destroy()
    vxlan_fdb_notify()
      __vxlan_fdb_notify()
        vxlan_fdb_info()

The use of rtnl_dereference() can therefore result in the following
splat [1].

Fix this by dereferencing the nexthop under RCU read-side critical
section.

[1]
[May24 22:56] =============================
[  +0.004676] WARNING: suspicious RCU usage
[  +0.004614] 5.7.0-rc5-custom-16219-g201392003491 #2772 Not tainted
[  +0.007116] -----------------------------
[  +0.004657] drivers/net/vxlan.c:276 suspicious rcu_dereference_check() usage!
[  +0.008164]
              other info that might help us debug this:

[  +0.009126]
              rcu_scheduler_active = 2, debug_locks = 1
[  +0.007504] 5 locks held by bash/6892:
[  +0.004392]  #0: ffff8881d47e3410 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: __do_execve_file.isra.27+0x392/0x23c0
[  +0.011795]  #1: ffff8881d47e34b0 (&sig->exec_update_mutex){+.+.}-{3:3}, at: flush_old_exec+0x510/0x2030
[  +0.010947]  #2: ffff8881a141b0b0 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x9c0/0x2590
[  +0.010585]  #3: ffff888230009d50 ((&vxlan->age_timer)){+.-.}-{0:0}, at: call_timer_fn+0xe8/0x800
[  +0.010192]  #4: ffff888183729bc8 (&vxlan->hash_lock[h]){+.-.}-{2:2}, at: vxlan_cleanup+0x133/0x4a0
[  +0.010382]
              stack backtrace:
[  +0.005103] CPU: 1 PID: 6892 Comm: bash Not tainted 5.7.0-rc5-custom-16219-g201392003491 #2772
[  +0.009675] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  +0.010155] Call Trace:
[  +0.002775]  <IRQ>
[  +0.002313]  dump_stack+0xfd/0x178
[  +0.003895]  lockdep_rcu_suspicious+0x14a/0x153
[  +0.005157]  vxlan_fdb_info+0xe39/0x12a0
[  +0.004775]  __vxlan_fdb_notify+0xb8/0x160
[  +0.004672]  vxlan_fdb_notify+0x8e/0xe0
[  +0.004370]  vxlan_fdb_destroy+0x117/0x330
[  +0.004662]  vxlan_cleanup+0x1aa/0x4a0
[  +0.004329]  call_timer_fn+0x1c4/0x800
[  +0.004357]  run_timer_softirq+0x129d/0x17e0
[  +0.004762]  __do_softirq+0x24c/0xaef
[  +0.004232]  irq_exit+0x167/0x190
[  +0.003767]  smp_apic_timer_interrupt+0x1dd/0x6a0
[  +0.005340]  apic_timer_interrupt+0xf/0x20
[  +0.004620]  </IRQ>

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06ec313e

Merge branch 'mlxsw-Various-trap-changes-part-1' · f36221e8

David S. Miller authored May 24, 2020

Ido Schimmel says:

====================
mlxsw: Various trap changes - part 1

This patch set contains various changes in mlxsw trap configuration.
Another set will perform similar changes before exposing control traps
(e.g., IGMP query, ARP request) via devlink-trap.

Tested with existing devlink-trap selftests. Please see individual
patches for a detailed changelog.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f36221e8

mlxsw: spectrum: Fix spelling mistake in trap's name · 154388e1

Ido Schimmel authored May 25, 2020

Fix incorrect spelling of "advertisement".
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

154388e1

mlxsw: spectrum: Use dedicated trap group for sampled packets · ce3c3bf0

Ido Schimmel authored May 25, 2020

The rate with which packets are sampled is determined by user space, so
there is no need to associate such packets with a policer.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce3c3bf0

mlxsw: spectrum: Use same trap group for IPv6 ND and ARP packets · b33f5d9f

Ido Schimmel authored May 25, 2020

Both packet types are needed for the same reason (neighbour discovery),
so associate them with the same trap group.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b33f5d9f

mlxsw: spectrum: Rename ARP trap group · 32446438

Ido Schimmel authored May 25, 2020

The ARP trap group will be used for IPv6 ND traps in the next patch, so
rename it to "NEIGH_DISCOVERY" which is more appropriate.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

32446438

mlxsw: spectrum_trap: Remove unnecessary field · d88f8cc1

Ido Schimmel authored May 25, 2020

Now that traffic class (TC) and priority are set to the same value,
there is no need to store both. Remove the first.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d88f8cc1

mlxsw: spectrum: Align TC and trap priority · 5047d819

Ido Schimmel authored May 25, 2020

The traffic class (TC) attribute of packet traps determines through which
TC a packet trap will be scheduled through the CPU port.

The priority attribute determines which trap will be triggered in case
several packet traps match a packet.

We try to configure these attributes to the same value for all packet
traps as there is little reason not to.

Some packet traps did not use the same value, so rectify that now.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5047d819

mlxsw: spectrum_buffers: Assign non-zero quotas to TC 0 of the CPU port · e0d84847

Ido Schimmel authored May 25, 2020

As explained in commit 9ffcc372 ("mlxsw: spectrum: Allow packets to
be trapped from any PG"), incoming packets can be admitted to the shared
buffer and forwarded / trapped, if:

(Ingress{Port}.Usage < Thres && Ingress{Port,PG}.Usage < Thres &&
 Egress{Port}.Usage < Thres && Egress{Port,TC}.Usage < Thres)
||
(Ingress{Port}.Usage < Min || Ingress{Port,PG} < Min ||
 Egress{Port}.Usage < Min || Egress{Port,TC}.Usage < Min)

Trapped packets are scheduled to transmission through the CPU port.
Currently, the minimum and maximum quotas of traffic class (TC) 0 of the
CPU port are 0, which means it is not usable.

Assign non-zero quotas to TC 0 of the CPU port, so that it could be
utilized by subsequent patches.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e0d84847

mlxsw: spectrum: Change default rate and priority of DHCP packets · 938e6d0b

Ido Schimmel authored May 25, 2020

Reduce the default acceptable rate of DHCP packets to 128 packets per
second and reduce their priority. This is reasonable given the Spectrum
ASICs are limited to 128 ports at the moment.

These are only the default values. Users will be able to modify them via
devlink-trap.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

938e6d0b

mlxsw: spectrum: Trap IPv4 DHCP packets in router · 0ecb9474

Ido Schimmel authored May 25, 2020

Currently, IPv4 DHCP packets are trapped during L2 forwarding, which
means that packets might be trapped unnecessarily. Instead, only trap
the DHCP packets that reach the router. Either because they were flooded
to the router port or forwarded to it by the FDB. This is consistent
with the corresponding IPv6 trap.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0ecb9474

mlxsw: spectrum: Use same trap group for MLD and IGMP packets · 99129069

Ido Schimmel authored May 25, 2020

Both packet types are needed for the same reason (multicast snooping),
so associate them with the same trap group.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

99129069

mlxsw: spectrum: Rename IGMP trap group · debb7af6

Ido Schimmel authored May 25, 2020

The IGMP trap group will be used for MLD traps in the next patch, so
rename it to "MC_SNOOPING" which is more appropriate.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

debb7af6