Commits · 4e096a18867a5a989b510f6999d9c6b6622e8f7b · Kirill Smelkov / linux

24 Feb, 2021 3 commits

net: introduce CAN specific pointer in the struct net_device · 4e096a18

Oleksij Rempel authored Feb 23, 2021

Since 20dd3850 ("can: Speed up CAN frame receiption by using
ml_priv") the CAN framework uses per device specific data in the AF_CAN
protocol. For this purpose the struct net_device->ml_priv is used. Later
the ml_priv usage in CAN was extended for other users, one of them being
CAN_J1939.

Later in the kernel ml_priv was converted to an union, used by other
drivers. E.g. the tun driver started storing it's stats pointer.

Since tun devices can claim to be a CAN device, CAN specific protocols
will wrongly interpret this pointer, which will cause system crashes.
Mostly this issue is visible in the CAN_J1939 stack.

To fix this issue, we request a dedicated CAN pointer within the
net_device struct.

Reported-by: syzbot+5138c4dd15a0401bec7b@syzkaller.appspotmail.com
Fixes: 20dd3850 ("can: Speed up CAN frame receiption by using ml_priv")
Fixes: ffd956ee ("can: introduce CAN midlayer private and allocate it automatically")
Fixes: 9d71dd0c ("can: add support of SAE J1939 protocol")
Fixes: 497a5757 ("tun: switch to net core provided statistics counters")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210223070127.4538-1-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4e096a18

net: usb: qmi_wwan: support ZTE P685M modem · 88eee9b7

Lech Perczak authored Feb 23, 2021

Now that interface 3 in "option" driver is no longer mapped, add device
ID matching it to qmi_wwan.

The modem is used inside ZTE MF283+ router and carriers identify it as
such.
Interface mapping is:
0: QCDM, 1: AT (PCUI), 2: AT (Modem), 3: QMI, 4: ADB

T:  Bus=02 Lev=02 Prnt=02 Port=05 Cnt=01 Dev#=  3 Spd=480  MxCh= 0
D:  Ver= 2.01 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=19d2 ProdID=1275 Rev=f0.00
S:  Manufacturer=ZTE,Incorporated
S:  Product=ZTE Technologies MSM
S:  SerialNumber=P685M510ZTED0000CP&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&0
C:* #Ifs= 5 Cfg#= 1 Atr=a0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
E:  Ad=83(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
E:  Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
E:  Ad=85(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
E:  Ad=84(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
E:  Ad=87(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
E:  Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 4 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
E:  Ad=88(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=05(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Link: https://lore.kernel.org/r/20210223183456.6377-1-lech.perczak@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

88eee9b7

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 6fbd15c0

Jakub Kicinski authored Feb 23, 2021

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-02-22

Dave corrects reporting of max TCs to use the value from hardware
capabilities and setting of DCBx capability bits when changing between
SW and FW LLDP.

Brett fixes trusted VF multicast promiscuous not receiving expected
packets and corrects VF max packet size when a port VLAN is configured.

Henry updates available RSS queues following a change in channel count
with a user defined LUT.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  ice: update the number of available RSS queues
  ice: Fix state bits on LLDP mode switch
  ice: Account for port VLAN in VF max packet size calculation
  ice: Set trusted VF as default VSI when setting allmulti on
  ice: report correct max number of TCs
====================

Link: https://lore.kernel.org/r/20210222235814.834282-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6fbd15c0

23 Feb, 2021 35 commits

Merge branch 'wireguard-fixes-for-5-12-rc1' · fcb30073

Jakub Kicinski authored Feb 23, 2021

Jason Donenfeld says:

====================
wireguard fixes for 5.12-rc1

This series has a collection of fixes that have piled up for a little
while now, that I unfortunately didn't get a chance to send out earlier.

1) Removes unlikely() from IS_ERR(), since it's already implied.

2) Remove a bogus sparse annotation that hasn't been needed for years.

3) Addition test in the test suite for stressing parallel ndo_start_xmit.

4) Slight struct reordering in preparation for subsequent fix.

5) If skb->protocol is bogus, we no longer attempt to send icmp messages.

6) Massive memory usage fix, hit by larger deployments.

7) Fix typo in kconfig dependency logic.

(1) and (2) are tiny cleanups, and (3) is just a test, so if you're
trying to reduce churn, you could not backport these. But (4), (5), (6),
and (7) fix problems and should be applied to stable. IMO, it's probably
easiest to just apply them all to stable.
====================

Link: https://lore.kernel.org/r/20210222162549.3252778-1-Jason@zx2c4.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fcb30073

wireguard: kconfig: use arm chacha even with no neon · bce24739

Jason A. Donenfeld authored Feb 22, 2021

The condition here was incorrect: a non-neon fallback implementation is
available on arm32 when NEON is not supported.
Reported-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bce24739

wireguard: queueing: get rid of per-peer ring buffers · 8b5553ac

Jason A. Donenfeld authored Feb 22, 2021

Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is an 90% reduction. Ninety percent! With some single-machine
deployments approaching 500,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb->next pointer (the skb_list_* singly
linked list approach), so we form the per-peer queue around the unused
skb->prev pointer (which sort of makes sense because the links are
pointing backwards). Use of skb_queue_* is not possible here, because
that is based on doubly linked lists and spinlocks. Multiple cores can
write into the queue at any given time, because its writes occur in the
start_xmit path or in the udp_recv path. But reads happen in a single
workqueue item per-peer, amounting to a multi-producer, single-consumer
paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (some tiny fraction of the
0.15% of partial adds on a fully loaded 16-core x86_64 system), causes
the queue reader to terminate early. However, because every packet sent
queues up the same workqueue item after it is fully added, the worker
resumes again, and stopping early isn't actually a problem, since at
that point the packet wouldn't have yet been added to the encryption
queue. These properties allow us to avoid disabling interrupts or
spinning. The design is based on Dmitry Vyukov's algorithm [1].

Performance-wise, ordinarily list-based queues aren't preferable to
ringbuffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all and
all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and atomic_
dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.

But all and all, this enables us to get massive memory savings, allowing
WireGuard to scale for real world deployments, without taking much of a
performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queueReviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8b5553ac

wireguard: device: do not generate ICMP for non-IP packets · 99fff526

Jason A. Donenfeld authored Feb 22, 2021

If skb->protocol doesn't match the actual skb->data header, it's
probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is
expecting to reply to a valid IP packet. So this commit has that early
mismatch case jump to a later error label.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

99fff526

wireguard: peer: put frequently used members above cache lines · 5a059869

Jason A. Donenfeld authored Feb 22, 2021

The is_dead boolean is checked for every single packet, while the
internal_id member is used basically only for pr_debug messages. So it
makes sense to hoist up is_dead into some space formerly unused by a
struct hole, while demoting internal_api to below the lowest struct
cache line.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5a059869

wireguard: selftests: test multiple parallel streams · d5a49aa6

Jason A. Donenfeld authored Feb 22, 2021

In order to test ndo_start_xmit being called in parallel, explicitly add
separate tests, which should all run on different cores. This should
help tease out bugs associated with queueing up packets from different
cores in parallel. Currently, it hasn't found those types of bugs, but
given future planned work, this is a useful regression to avoid.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d5a49aa6

wireguard: socket: remove bogus __be32 annotation · 7f57bd8d

Jann Horn authored Feb 22, 2021

The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove
the bogus annotation.

This was introduced in
https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82
in the historical WireGuard repo because the old code used to
zero-initialize multiple members as follows:

    endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0;

Because fl.saddr is fixed-endian and an assignment returns a value with the
type of its left operand, this meant that sparse detected an assignment
between values of different endianness.

Since then, this assignment was already split up into separate statements;
just the cast survived.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7f57bd8d

wireguard: avoid double unlikely() notation when using IS_ERR() · 30ac4e2f

Antonio Quartulli authored Feb 22, 2021

The definition of IS_ERR() already applies the unlikely() notation
when checking the error status of the passed pointer. For this
reason there is no need to have the same notation outside of
IS_ERR() itself.

Clean up code by removing redundant notation.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

30ac4e2f

net: qrtr: Fix memory leak in qrtr_tun_open · fc0494ea

Takeshi Misawa authored Feb 22, 2021

If qrtr_endpoint_register() failed, tun is leaked.
Fix this, by freeing tun in error path.

syzbot report:
BUG: memory leak
unreferenced object 0xffff88811848d680 (size 64):
  comm "syz-executor684", pid 10171, jiffies 4294951561 (age 26.070s)
  hex dump (first 32 bytes):
    80 dd 0a 84 ff ff ff ff 00 00 00 00 00 00 00 00  ................
    90 d6 48 18 81 88 ff ff 90 d6 48 18 81 88 ff ff  ..H.......H.....
  backtrace:
    [<0000000018992a50>] kmalloc include/linux/slab.h:552 [inline]
    [<0000000018992a50>] kzalloc include/linux/slab.h:682 [inline]
    [<0000000018992a50>] qrtr_tun_open+0x22/0x90 net/qrtr/tun.c:35
    [<0000000003a453ef>] misc_open+0x19c/0x1e0 drivers/char/misc.c:141
    [<00000000dec38ac8>] chrdev_open+0x10d/0x340 fs/char_dev.c:414
    [<0000000079094996>] do_dentry_open+0x1e6/0x620 fs/open.c:817
    [<000000004096d290>] do_open fs/namei.c:3252 [inline]
    [<000000004096d290>] path_openat+0x74a/0x1b00 fs/namei.c:3369
    [<00000000b8e64241>] do_filp_open+0xa0/0x190 fs/namei.c:3396
    [<00000000a3299422>] do_sys_openat2+0xed/0x230 fs/open.c:1172
    [<000000002c1bdcef>] do_sys_open fs/open.c:1188 [inline]
    [<000000002c1bdcef>] __do_sys_openat fs/open.c:1204 [inline]
    [<000000002c1bdcef>] __se_sys_openat fs/open.c:1199 [inline]
    [<000000002c1bdcef>] __x64_sys_openat+0x7f/0xe0 fs/open.c:1199
    [<00000000f3a5728f>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<000000004b38b7ec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 28fb4e59 ("net: qrtr: Expose tunneling endpoint to user space")
Reported-by: syzbot+5d6e4af21385f5cfc56a@syzkaller.appspotmail.com
Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Link: https://lore.kernel.org/r/20210221234427.GA2140@DESKTOPSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fc0494ea

vxlan: move debug check after netdev unregister · 92584ddf

Taehee Yoo authored Feb 21, 2021

The debug check must be done after unregister_netdevice_many() call --
the hlist_del_rcu() for this is done inside .ndo_stop.

This is the same with commit 0fda7600 ("geneve: move debug check after
netdev unregister")

Test commands:
    ip netns del A
    ip netns add A
    ip netns add B

    ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \
	    remote 10.0.0.2 dstport 4789 srcport 4789 4789
    ip netns exec B ip link set vxlan0 netns A
    ip netns exec A ip link set vxlan0 up
    ip netns del B

Splat looks like:
[   73.176249][    T7] ------------[ cut here ]------------
[   73.178662][    T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.182597][    T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix
[   73.190113][    T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838
[   73.193037][    T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   73.196986][    T7] Workqueue: netns cleanup_net
[   73.198946][    T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.201509][    T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc
[   73.208813][    T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286
[   73.211027][    T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0
[   73.213702][    T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070
[   73.216169][    T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61
[   73.218463][    T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000
[   73.220794][    T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40
[   73.223337][    T7] FS:  0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000
[   73.225814][    T7] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   73.227616][    T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0
[   73.229700][    T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   73.231820][    T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   73.233844][    T7] Call Trace:
[   73.234698][    T7]  ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan]
[   73.235962][    T7]  ? ops_exit_list.isra.11+0x93/0x140
[   73.237134][    T7]  cleanup_net+0x45e/0x8a0
[ ... ]

Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

92584ddf

Merge branch 'r8152-minor-adjustments' · 2c8396de

Jakub Kicinski authored Feb 23, 2021

Hayes Wang says:

====================
r8152: minor adjustments

These patches are used to adjust the code.
====================

Link: https://lore.kernel.org/r/1394712342-15778-341-Taiwan-albertk@realtek.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2c8396de

r8152: spilt rtl_set_eee_plus and r8153b_green_en · 40fa7568

Hayes Wang authored Feb 19, 2021

Add rtl_eee_plus_en() and rtl_green_en().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

40fa7568

r8152: replace netif_err with dev_err · 156c3207

Hayes Wang authored Feb 19, 2021

Some messages are before calling register_netdev(), so replace
netif_err() with dev_err().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

156c3207

r8152: check if the pointer of the function exists · c79515e4

Hayes Wang authored Feb 19, 2021

Return error code if autosuspend_en, eee_get, or eee_set don't exist.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c79515e4

r8152: enable U1/U2 for USB_SPEED_SUPER · 7a0ae61a

Hayes Wang authored Feb 19, 2021

U1/U2 shoued be enabled for USB 3.0 or later. The USB 2.0 doesn't
support it.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7a0ae61a

net/sched: cls_flower: validate ct_state for invalid and reply flags · 3aed8b63

wenxu authored Feb 23, 2021

Add invalid and reply flags validate in the fl_validate_ct_state.
This makes the checking complete if compared to ovs'
validate_ct_state().
Signed-off-by: wenxu <wenxu@ucloud.cn>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1614064315-364-1-git-send-email-wenxu@ucloud.cnSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3aed8b63

Merge branch 'net-dsa-learning-fixes-for-b53-bcm_sf2' · f3f9be9c

Jakub Kicinski authored Feb 23, 2021

Florian Fainelli says:

====================
net: dsa: Learning fixes for b53/bcm_sf2

This patch series contains a couple of fixes for the b53/bcm_sf2 drivers
with respect to configuring learning.

The first patch is wiring-up the necessary dsa_switch_ops operations in
order to support the offloading of bridge flags.

The second patch corrects the switch driver's default learning behavior
which was unfortunately wrong from day one.

This is submitted against "net" because this is technically a bug fix
since ports should not have had learning enabled by default but given
this is dependent upon Vladimir's recent br_flags series, there is no
Fixes tag provided.

I will be providing targeted stable backports that look a bit
different.

Changes in v2:

- added first patch
- updated second patch to include BR_LEARNING check in br_flags_pre as
  a support bridge flag to offload
====================

Link: https://lore.kernel.org/r/20210222223010.2907234-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f3f9be9c

net: dsa: b53: Support setting learning on port · f9b3827e

Florian Fainelli authored Feb 22, 2021

Add support for being able to set the learning attribute on port, and
make sure that the standalone ports start up with learning disabled.

We can remove the code in bcm_sf2 that configured the ports learning
attribute because we want the standalone ports to have learning disabled
by default and port 7 cannot be bridged, so its learning attribute will
not change past its initial configuration.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f9b3827e

net: dsa: bcm_sf2: Wire-up br_flags_pre, br_flags and set_mrouter · e6dd86ed

Florian Fainelli authored Feb 22, 2021

Because bcm_sf2 implements its own dsa_switch_ops we need to export the
b53_br_flags_pre(), b53_br_flags() and b53_set_mrouter so we can wire-up
them up like they used to be with the former b53_br_egress_floods().

Fixes: a8b659e7 ("net: dsa: act as passthrough for bridge port flags")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e6dd86ed

Marvell Sky2 Ethernet adapter: fix warning messages. · 18755e27

Krzysztof Halasa authored Feb 18, 2021

sky2.c driver uses netdev_warn() before the net device is initialized.
Fix it by using dev_warn() instead.
Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>

Link: https://lore.kernel.org/r/m3a6s1r1ul.fsf@t19.piap.plSigned-off-by: Jakub Kicinski <kuba@kernel.org>

18755e27

bcm63xx_enet: fix sporadic kernel panic · 9bc1ef64

Sieng Piaw Liew authored Feb 22, 2021

In ndo_stop functions, netdev_completed_queue() is called during forced
tx reclaim, after netdev_reset_queue(). This may trigger kernel panic if
there is any tx skb left.

This patch moves netdev_reset_queue() to after tx reclaim, so BQL can
complete successfully then reset.
Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
Fixes: 4c59b0f5 ("bcm63xx_enet: add BQL support")
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210222013530.1356-1-liew.s.piaw@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9bc1ef64

net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending · ee576c47

Jason A. Donenfeld authored Feb 23, 2021

The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
it with IPCB or IP6CB, assuming the skb to have come directly from the
inet layer. But when the packet comes from the ndo layer, especially
when forwarded, there's no telling what might be in skb->cb at that
point. As a result, the icmp sending code risks reading bogus memory
contents, which can result in nasty stack overflows such as this one
reported by a user:

    panic+0x108/0x2ea
    __stack_chk_fail+0x14/0x20
    __icmp_send+0x5bd/0x5c0
    icmp_ndo_send+0x148/0x160

In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
from it. The optlen parameter there is of particular note, as it can
induce writes beyond bounds. There are quite a few ways that can happen
in __ip_options_echo. For example:

    // sptr/skb are attacker-controlled skb bytes
    sptr = skb_network_header(skb);
    // dptr/dopt points to stack memory allocated by __icmp_send
    dptr = dopt->__data;
    // sopt is the corrupt skb->cb in question
    if (sopt->rr) {
        optlen  = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
        soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
	// this now writes potentially attacker-controlled data, over
	// flowing the stack:
        memcpy(dptr, sptr+sopt->rr, optlen);
    }

In the icmpv6_send case, the story is similar, but not as dire, as only
IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
worse than the iif case, but it is passed to ipv6_find_tlv, which does
a bit of bounds checking on the value.

This is easy to simulate by doing a `memset(skb->cb, 0x41,
sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
good fortune and the rarity of icmp sending from that context that we've
avoided reports like this until now. For example, in KASAN:

    BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
    Write of size 38 at addr ffff888006f1f80e by task ping/89
    CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
    Call Trace:
     dump_stack+0x9a/0xcc
     print_address_description.constprop.0+0x1a/0x160
     __kasan_report.cold+0x20/0x38
     kasan_report+0x32/0x40
     check_memory_region+0x145/0x1a0
     memcpy+0x39/0x60
     __ip_options_echo+0xa0e/0x12b0
     __icmp_send+0x744/0x1700

Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
the v4 case, while the rest did not. So this commit actually removes the
gtp-specific zeroing, while putting the code where it belongs in the
shared infrastructure of icmp{,v6}_ndo_send.

This commit fixes the issue by passing an empty IPCB or IP6CB along to
the functions that actually do the work. For the icmp_send, this was
already trivial, thanks to __icmp_send providing the plumbing function.
For icmpv6_send, this required a tiny bit of refactoring to make it
behave like the v4 case, after which it was straight forward.

Fixes: a2b78e9b ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs")
Reported-by: SinYu <liuxyon@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ee576c47

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 42870a1a

Jakub Kicinski authored Feb 22, 2021

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-02-19

This series contains updates to i40e driver only.

Slawomir resolves an issue with the IPv6 extension headers being
processed incorrectly.

Keita Suzuki fixes a memory leak on probe failure.

Mateusz initializes AQ command structures to zero to comply with
spec, fixes FW flow control settings being overwritten and resolves an
issue with adding VLAN filters after enabling FW LLDP. He also adds
an additional check when adding TC filter as the current check doesn't
properly distinguish between IPv4 and IPv6.

Sylwester removes setting disabled bit when syncing filters as this
prevents VFs from completing setup.

Norbert cleans up sparse warnings.

v2:
- Fix fixes tag on patch 7

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  i40e: Fix endianness conversions
  i40e: Fix add TC filter for IPv6
  i40e: Fix VFs not created
  i40e: Fix addition of RX filters after enabling FW LLDP agent
  i40e: Fix overwriting flow control settings during driver loading
  i40e: Add zero-initialization of AQ command structures
  i40e: Fix memory leak in i40e_probe
  i40e: Fix flow for IPv6 next header (extension header)
====================

Link: https://lore.kernel.org/r/20210219213606.2567536-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

42870a1a

net/mlx4_core: Add missed mlx4_free_cmd_mailbox() · 8eb65fda

Chuhong Yuan authored Feb 21, 2021

mlx4_do_mirror_rule() forgets to call mlx4_free_cmd_mailbox() to
free the memory region allocated by mlx4_alloc_cmd_mailbox() before
an exit.
Add the missed call to fix it.

Fixes: 78efed27 ("net/mlx4_core: Support mirroring VF DMFS rules on both ports")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210221143559.390277-1-hslester96@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8eb65fda

net: stmmac: fix CBS idleslope and sendslope calculation · 24877687

Song, Yoong Siang authored Feb 18, 2021

When link speed is not 100 Mbps, port transmit rate and speed divider
are set to 8 and 1000000 respectively. These values are incorrect for
CBS idleslope and sendslope HW values calculation if the link speed is
not 1 Gbps.

This patch adds switch statement to set the values of port transmit rate
and speed divider for 10 Gbps, 5 Gbps, 2.5 Gbps, 1 Gbps, and 100 Mbps.
Note that CBS is not supported at 10 Mbps.

Fixes: bc41a668 ("net: stmmac: tc: Remove the speed dependency")
Fixes: 1f705bc6 ("net: stmmac: Add support for CBS QDISC")
Signed-off-by: Song, Yoong Siang <yoong.siang.song@intel.com>
Link: https://lore.kernel.org/r/1613655653-11755-1-git-send-email-yoong.siang.song@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

24877687

Merge branch 'mptcp-a-bunch-of-fixes' · e5bcf0e8

Jakub Kicinski authored Feb 22, 2021

Paolo Abeni says:

====================
mptcp: a bunch of fixes

This series bundle a few MPTCP fixes for the current net tree.
They have been detected via syzkaller and packetdrill

Patch 1 fixes a slow close for orphaned sockets

Patch 2 fixes another hangup at close time, when no
data was actually transmitted before close

Patch 3 fixes a memory leak with unusual sockopts

Patch 4 fixes stray wake-ups on listener sockets
====================

Link: https://lore.kernel.org/r/cover.1613755058.git.pabeni@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e5bcf0e8

mptcp: do not wakeup listener for MPJ subflows · 52557dbc

Paolo Abeni authored Feb 19, 2021

MPJ subflows are not exposed as fds to user spaces. As such,
incoming MPJ subflows are removed from the accept queue by
tcp_check_req()/tcp_get_cookie_sock().

Later tcp_child_process() invokes subflow_data_ready() on the
parent socket regardless of the subflow kind, leading to poll
wakeups even if the later accept will block.

Address the issue by double-checking the queue state before
waking the user-space.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/164Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

52557dbc

mptcp: provide subflow aware release function · ad98dd37

Florian Westphal authored Feb 19, 2021

mptcp re-used inet(6)_release, so the subflow sockets are ignored.
Need to invoke ip(v6)_mc_drop_socket function to ensure mcast join
resources get free'd.

Fixes: 717e79c8 ("mptcp: Add setsockopt()/getsockopt() socket operations")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/110Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ad98dd37

mptcp: fix DATA_FIN generation on early shutdown · d87903b6

Paolo Abeni authored Feb 19, 2021

If the msk is closed before sending or receiving any data,
no DATA_FIN is generated, instead an MPC ack packet is
crafted out.

In the above scenario, the MPTCP protocol creates and sends a
pure ack and such packets matches also the criteria for an
MPC ack and the protocol tries first to insert MPC options,
leading to the described error.

This change addresses the issue by avoiding the insertion of an
MPC option for DATA_FIN packets or if the sub-flow is not
established.

To avoid doing multiple times the same test, fetch the data_fin
flag in a bool variable and pass it to both the interested
helpers.

Fixes: 6d0060f6 ("mptcp: Write MPTCP DSS headers to outgoing data packets")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d87903b6

mptcp: fix DATA_FIN processing for orphaned sockets · 341c6524

Paolo Abeni authored Feb 19, 2021

Currently we move orphaned msk sockets directly from FIN_WAIT2
state to CLOSE, with the rationale that incoming additional
data could be just dropped by the TCP stack/TW sockets.

Anyhow we miss sending MPTCP-level ack on incoming DATA_FIN,
and that may hang the peers.

Fixes: e16163b6 ("mptcp: refactor shutdown and close")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

341c6524

net: dsa: Fix dependencies with HSR · 94ead4ca

Florian Fainelli authored Feb 19, 2021

The core DSA framework uses hsr_is_master() which would not resolve to a
valid symbol if HSR is built-into the kernel and DSA is a module.

Fixes: 18596f50 ("net: dsa: add support for offloading HSR")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20210220051222.15672-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

94ead4ca

net: phy: icplus: call phy_restore_page() when phy_select_page() fails · 4e9d9d1f

Dan Carpenter authored Feb 19, 2021

The comments to phy_select_page() say that "phy_restore_page() must
always be called after this, irrespective of success or failure of this
call." If we don't call phy_restore_page() then we are still holding
the phy_lock_mdio_bus() so it eventually leads to a dead lock.

Fixes: 32ab60e5 ("net: phy: icplus: add MDI/MDIX support for IP101A/G")
Fixes: f9bc51e6 ("net: phy: icplus: fix paged register access")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michael Walle <michael@walle.cc>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/YC+OpFGsDPXPnXM5@mwandaSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4e9d9d1f

net: mvpp2: skip RSS configurations on loopback port · 0a8a8000

Stefan Chulski authored Feb 18, 2021

PPv2 loopback port doesn't support RSS, so we should
skip RSS configurations for this port.
Signed-off-by: Stefan Chulski <stefanc@marvell.com>
Reviewed-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/r/1613652123-19021-1-git-send-email-stefanc@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0a8a8000

dpaa_eth: fix the access method for the dpaa_napi_portal · 433dfc99

Camelia Groza authored Feb 18, 2021

The current use of container_of is flawed and unnecessary. Obtain
the dpaa_napi_portal reference from the private percpu data instead.

Fixes: a1e031ff ("dpaa_eth: add XDP_REDIRECT support")
Reported-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20210218182106.22613-1-camelia.groza@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

433dfc99

net: ag71xx: remove unnecessary MTU reservation · 04b385f3

DENG Qingfang authored Feb 18, 2021

2 bytes of the MTU are reserved for Atheros DSA tag, but DSA core
has already handled that since commit dc0fe7d4.
Remove the unnecessary reservation.

Fixes: d51b6ce4 ("net: ethernet: add ag71xx driver")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20210218034514.3421-1-dqfext@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

04b385f3

22 Feb, 2021 2 commits

ice: update the number of available RSS queues · 0393e46a

Henry Tieman authored Dec 03, 2020

It was possible to have Rx queues that were not available for use
by RSS. This would happen when increasing the number of Rx queues
while there was a user defined RSS LUT.

Always update the number of available RSS queues when changing the
number of Rx queues.

Fixes: 87324e74 ("ice: Implement ethtool ops for channels")
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

0393e46a

ice: Fix state bits on LLDP mode switch · 0d4907f6

Dave Ertman authored Nov 20, 2020

DCBX_CAP bits were not being adjusted when switching
between SW and FW controlled LLDP.

Adjust bits to correctly indicate which mode the
LLDP logic is in.

Fixes: b94b013e ("ice: Implement DCBNL support")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

0d4907f6