Commits · 721111b1b29c67fd18ac2f69b3a48c06ba996762 · Kirill Smelkov / linux

10 Nov, 2021 21 commits

gve: fix unmatched u64_stats_update_end() · 721111b1

Dan Carpenter authored Nov 10, 2021

The u64_stats_update_end() call is supposed to be inside the curly
braces so it pairs with the u64_stats_update_begin().

Fixes: 37149e93 ("gve: Implement packet continuation for RX.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

721111b1

net: ethernet: lantiq_etop: Fix compilation error · 68eabc34

Aleksander Jan Bajkowski authored Nov 09, 2021

This fixes the error detected when compiling the driver.

Fixes: 14d4e308 ("net: lantiq: configure the burst length in ethernet drivers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

68eabc34

selftests: forwarding: Fix packet matching in mirroring selftests · af0a5111

Petr Machata authored Nov 09, 2021

In commit 6de6e46d ("cls_flower: Fix inability to match GRE/IPIP
packets"), cls_flower was fixed to match an outer packet of a tunneled
packet as would be expected, rather than dissecting to the inner packet and
matching on that.

This fix uncovered several issues in packet matching in mirroring
selftests:

- in mirror_gre_bridge_1d_vlan.sh and mirror_gre_vlan_bridge_1q.sh, the
  vlan_ethtype match is copied around as "ip", even as some of the tests
  are running over ip6gretap. This is fixed by using an "ipv6" for
  vlan_ethtype in the ip6gretap tests.

- in mirror_gre_changes.sh, a filter to count GRE packets is set up to
  match TTL of 50. This used to trigger in the offloaded datapath, where
  the envelope TTL was matched, but not in the software datapath, which
  considered TTL of the inner packet. Now that both match consistently, all
  the packets were double-counted. This is fixed by marking the filter as
  skip_hw, leaving only the SW datapath component active.

Fixes: 6de6e46d ("cls_flower: Fix inability to match GRE/IPIP packets")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af0a5111

vsock: prevent unnecessary refcnt inc for nonblocking connect · c7cd82b9

Eiichi Tsukata authored Nov 09, 2021

Currently vosck_connect() increments sock refcount for nonblocking
socket each time it's called, which can lead to memory leak if
it's called multiple times because connect timeout function decrements
sock refcount only once.

Fixes it by making vsock_connect() return -EALREADY immediately when
sock state is already SS_CONNECTING.

Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7cd82b9

net: marvell: mvpp2: Fix wrong SerDes reconfiguration order · bb7bbb6e

Marek Behún authored Nov 08, 2021

Commit bfe301eb ("net: mvpp2: convert to use
mac_prepare()/mac_finish()") introduced a bug wherein it leaves the MAC
RESET register asserted after mac_finish(), due to wrong order of
function calls.

Before it was:
  .mac_config()
    mvpp22_mode_reconfigure()
      assert reset
    mvpp2_xlg_config()
      deassert reset

Now it is:
  .mac_prepare()
  .mac_config()
    mvpp2_xlg_config()
      deassert reset
  .mac_finish()
    mvpp2_xlg_config()
      assert reset

Obviously this is wrong.

This bug is triggered when phylink tries to change the PHY interface
mode from a GMAC mode (sgmii, 1000base-x, 2500base-x) to XLG mode
(10gbase-r, xaui). The XLG mode does not work since reset is left
asserted. Only after
  ifconfig down && ifconfig up
is called will the XLG mode work.

Move the call to mvpp22_mode_reconfigure() to .mac_prepare()
implementation. Since some of the subsequent functions need to know
whether the interface is being changed, we unfortunately also need to
pass around the new interface mode before setting port->phy_interface.

Fixes: bfe301eb ("net: mvpp2: convert to use mac_prepare()/mac_finish()")
Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb7bbb6e

net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory · 7a166854

Christophe JAILLET authored Nov 08, 2021

It is spurious to allocate a bitmap without initializing it.
So, better safe than sorry, initialize it to 0 at least to have some known
values.

While at it, switch to the devm_bitmap_ API which is less verbose.

Fixes: 4b41d343 ("net: ethernet: ti: cpsw: allow untagged traffic on host port")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

7a166854

net: stmmac: allow a tc-taprio base-time of zero · f64ab8e4

Vladimir Oltean authored Nov 08, 2021

Commit fe28c53e ("net: stmmac: fix taprio configuration when
base_time is in the past") allowed some base time values in the past,
but apparently not all, the base-time value of 0 (Jan 1st 1970) is still
explicitly denied by the driver.

Remove the bogus check.

Fixes: b60189e0 ("net: stmmac: Integrate EST with TAPRIO scheduler API")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

f64ab8e4

selftests: net: test_vxlan_under_vrf: fix HV connectivity test · e7e4785f

Andrea Righi authored Nov 05, 2021

It looks like test_vxlan_under_vrf.sh is always failing to verify the
connectivity test during the ping between the two simulated VMs.

This is due to the fact that veth-hv in each VM should have a distinct
MAC address.

Fix by setting a unique MAC address on each simulated VM interface.

Without this fix:

 $ sudo ./tools/testing/selftests/net/test_vxlan_under_vrf.sh
 Checking HV connectivity                                           [ OK ]
 Check VM connectivity through VXLAN (underlay in the default VRF)  [FAIL]

With this fix applied:

 $ sudo ./tools/testing/selftests/net/test_vxlan_under_vrf.sh
 Checking HV connectivity                                           [ OK ]
 Check VM connectivity through VXLAN (underlay in the default VRF)  [ OK ]
 Check VM connectivity through VXLAN (underlay in a VRF)            [FAIL]

NOTE: the connectivity test with the underlay VRF is still failing; it
seems that ARP requests are blocked at the simulated hypervisor level,
probably due to some missing ARP forwarding rules. This requires more
investigation (in the meantime we may consider to set that test as
expected failure - XFAIL).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7e4785f

Merge branch 'hns3-fixes' · 1413ff13

David S. Miller authored Nov 10, 2021

Guangbin Huang says:

====================
net: hns3: add some fixes for -net

This series adds some fixes for the HNS3 ethernet driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1413ff13

net: hns3: allow configure ETS bandwidth of all TCs · 688db0c7

Guangbin Huang authored Nov 10, 2021

Currently, driver only allow configuring ETS bandwidth of TCs according
to the max TC number queried from firmware. However, the hardware actually
supports 8 TCs and users may need to configure ETS bandwidth of all TCs,
so remove the restriction.

Fixes: 330baff5 ("net: hns3: add ETS TC weight setting in SSU module")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

688db0c7

net: hns3: remove check VF uc mac exist when set by PF · 91fcc79b

Guangbin Huang authored Nov 10, 2021

If users set unicast mac address for VFs by PF, they need to guarantee all
VFs' address is different. This patch removes the check mac address exist
of VFs, for usrs can refresh mac addresses of all VFs directly without
need to modify the exist mac address to other value firstly.

Fixes: 8e6de441 ("net: hns3: add support for configuring VF MAC from the host")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

91fcc79b

net: hns3: fix some mac statistics is always 0 in device version V2 · 1122eac1

Guangbin Huang authored Nov 10, 2021

When driver queries the register number of mac statistics from firmware,
the old firmware runs in device version V2 only returns number of valid
registers, not include number of three reserved registers among of them.
It cause driver doesn't record the last three data when query mac
statistics.

To fix this problem, driver never query register number in device version
V2 and set it to a fixed value which include three reserved registers.

Fixes: c8af2887 ("net: hns3: add support pause/pfc durations for mac statistics")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1122eac1

net: hns3: fix kernel crash when unload VF while it is being reset · e140c798

Yufeng Mo authored Nov 10, 2021

When fully configure VLANs for a VF, then unload the VF while
triggering a reset to PF, will cause a kernel crash because the
irq is already uninit.

[ 293.177579] ------------[ cut here ]------------
[ 293.183502] kernel BUG at drivers/pci/msi.c:352!
[ 293.189547] Internal error: Oops - BUG: 0 [#1] SMP
......
[ 293.390124] Workqueue: hclgevf hclgevf_service_task [hclgevf]
[ 293.402627] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 293.414324] pc : free_msi_irqs+0x19c/0x1b8
[ 293.425429] lr : free_msi_irqs+0x18c/0x1b8
[ 293.436545] sp : ffff00002716fbb0
[ 293.446950] x29: ffff00002716fbb0 x28: 0000000000000000
[ 293.459519] x27: 0000000000000000 x26: ffff45b91ea16b00
[ 293.472183] x25: 0000000000000000 x24: ffffa587b08f4700
[ 293.484717] x23: ffffc591ac30e000 x22: ffffa587b08f8428
[ 293.497190] x21: ffffc591ac30e300 x20: 0000000000000000
[ 293.509594] x19: ffffa58a062a8300 x18: 0000000000000000
[ 293.521949] x17: 0000000000000000 x16: ffff45b91dcc3f48
[ 293.534013] x15: 0000000000000000 x14: 0000000000000000
[ 293.545883] x13: 0000000000000040 x12: 0000000000000228
[ 293.557508] x11: 0000000000000020 x10: 0000000000000040
[ 293.568889] x9 : ffff45b91ea1e190 x8 : ffffc591802d0000
[ 293.580123] x7 : ffffc591802d0148 x6 : 0000000000000120
[ 293.591190] x5 : ffffc591802d0000 x4 : 0000000000000000
[ 293.602015] x3 : 0000000000000000 x2 : 0000000000000000
[ 293.612624] x1 : 00000000000004a4 x0 : ffffa58a1e0c6b80
[ 293.623028] Call trace:
[ 293.630340] free_msi_irqs+0x19c/0x1b8
[ 293.638849] pci_disable_msix+0x118/0x140
[ 293.647452] pci_free_irq_vectors+0x20/0x38
[ 293.656081] hclgevf_uninit_msi+0x44/0x58 [hclgevf]
[ 293.665309] hclgevf_reset_rebuild+0x1ac/0x2e0 [hclgevf]
[ 293.674866] hclgevf_reset+0x358/0x400 [hclgevf]
[ 293.683545] hclgevf_reset_service_task+0xd0/0x1b0 [hclgevf]
[ 293.693325] hclgevf_service_task+0x4c/0x2e8 [hclgevf]
[ 293.702307] process_one_work+0x1b0/0x448
[ 293.710034] worker_thread+0x54/0x468
[ 293.717331] kthread+0x134/0x138
[ 293.724114] ret_from_fork+0x10/0x18
[ 293.731324] Code: f940b000 b4ffff00 a903e7b8 f90017b6 (d4210000)

This patch fixes the problem by waiting for the VF reset done
while unloading the VF.

Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e140c798

net: hns3: sync rx ring head in echo common pull · 3b6db4a0

Yufeng Mo authored Nov 10, 2021

When the driver processes rx packets, the head pointer is updated only
after the number of received packets reaches 16. However, hardware
relies on the head pointer to calculate the number of FBDs. As a result,
the hardware calculates the FBD incorrectly. Therefore, the driver
proactively updates the head pointer in each common poll to ensure that
the number of FBDs calculated by the hardware is correct.

Fixes: 68752b24 ("net: hns3: schedule the polling again when allocation fails")
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b6db4a0

net: hns3: fix pfc packet number incorrect after querying pfc parameters · 0b653a81

Jie Wang authored Nov 10, 2021

Currently, driver will send command to firmware to query pfc packet number
when user uses dcb tool to get pfc parameters. However, the periodic
service task will also periodically query and record MAC statistics,
including pfc packet number.

As the hardware registers of statistics is cleared after reading, it will
cause pfc packet number of MAC statistics are not correct after using dcb
tool to get pfc parameters.

To fix this problem, when user uses dcb tool to get pfc parameters, driver
updates MAC statistics firstly and then get pfc packet number from MAC
statistics.

Fixes: 64fd2300 ("net: hns3: add support for querying pfc puase packets statistic")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b653a81

net: hns3: fix ROCE base interrupt vector initialization bug · beb27ca4

Jie Wang authored Nov 10, 2021

Currently, NIC init ROCE interrupt vector with MSIX interrupt. But ROCE use
pci_irq_vector() to get interrupt vector, which adds the relative interrupt
vector again and gets wrong interrupt vector.

So fixes it by assign relative interrupt vector to ROCE instead of MSIX
interrupt vector and delete the unused struct member base_msi_vector
declaration of hclgevf_dev.

Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

beb27ca4

net: hns3: fix failed to add reuse multicast mac addr to hardware when mc mac table is full · 3b4c6566

Guangbin Huang authored Nov 10, 2021

Currently, when driver is failed to add a new multicast mac address to
hardware due to the multicast mac table is full, it will directly return.
In this case, if the multicast mac list has some reuse addresses after the
new address, those reuse addresses will never be added to hardware.

To fix this problem, if function hclge_add_mc_addr_common() returns
-ENOSPC, hclge_sync_vport_mac_list() should judge whether continue or
stop to add next address.

As function hclge_sync_vport_mac_list() needs parameter mac_type to know
whether is uc or mc, refine this function to add parameter mac_type and
remove parameter sync. So does function hclge_unsync_vport_mac_list().

Fixes: ee4bcd3b ("net: hns3: refactor the MAC address configure")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b4c6566

net: mana: Fix spelling mistake "calledd" -> "called" · 8f1bc38b

Colin Ian King authored Nov 08, 2021

There is a spelling mistake in a dev_info message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/20211108201817.43121-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8f1bc38b

net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_any · 6dc25401

Eric Dumazet authored Nov 08, 2021

1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls
   ktime_mono_to_any() with out-of-bound value.

2) if q->tk_offset is changed in taprio_parse_clockid(),
   taprio_get_time() might also call ktime_mono_to_any()
   with out-of-bound value as sysbot found:

UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27
index 3 is out of range for type 'ktime_t *[3]'
CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
 ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908
 get_tcp_tstamp net/sched/sch_taprio.c:322 [inline]
 get_packet_txtime net/sched/sch_taprio.c:353 [inline]
 taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420
 taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485
 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785
 __dev_xmit_skb net/core/dev.c:3869 [inline]
 __dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194
 batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108
 batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
 batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
 batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701
 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
 kthread+0x405/0x4f0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Fixes: 7ede7b03 ("taprio: make clock reference conversions easier")
Fixes: 54002066 ("taprio: Adjust timestamps for TCP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vedang Patel <vedang.patel@intel.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6dc25401

amt: use cancel_delayed_work() instead of flush_delayed_work() in amt_fini() · 43aa4937

Taehee Yoo authored Nov 08, 2021

When the amt module is being removed, it calls flush_delayed_work() to exit
source_gc_wq. But it wouldn't be exited properly because the
amt_source_gc_work(), which is the callback function of source_gc_wq
internally calls mod_delayed_work() again.
So, amt_source_gc_work() would be called after the amt module is removed.
Therefore kernel panic would occur.
In order to avoid it, cancel_delayed_work() should be used instead of
flush_delayed_work().

Test commands:
   modprobe amt
   modprobe -rv amt

Splat looks like:
 BUG: unable to handle page fault for address: fffffbfff80f50db
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 1237ee067 P4D 1237ee067 PUD 1237b2067 PMD 100c11067 PTE 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0+ #27
 5a0ebebc29fe5c40c68bea90197606c3a832b09f
 RIP: 0010:run_timer_softirq+0x221/0xfc0
 Code: 00 00 4c 89 e1 4c 8b 30 48 c1 e9 03 80 3c 29 00 0f 85 ed 0b 00 00
 4d 89 34 24 4d 85 f6 74 19 49 8d 7e 08 48 89 f9 48 c1 e9 03 <80> 3c 29 00
 0f 85 fa 0b 00 00 4d 89 66 08 83 04 24 01 49 89 d4 48
 RSP: 0018:ffff888119009e50 EFLAGS: 00010806
 RAX: ffff8881191f8a80 RBX: 00000000007ffe2a RCX: 1ffffffff80f50db
 RDX: ffff888119009ed0 RSI: 0000000000000008 RDI: ffffffffc07a86d8
 RBP: dffffc0000000000 R08: ffff8881191f8280 R09: ffffed102323f061
 R10: ffff8881191f8307 R11: ffffed102323f060 R12: ffff888119009ec8
 R13: 00000000000000c0 R14: ffffffffc07a86d0 R15: ffff8881191f82e8
 FS:  0000000000000000(0000) GS:ffff888119000000(0000)
 knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: fffffbfff80f50db CR3: 00000001062dc002 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <IRQ>
  ? add_timer+0x650/0x650
  ? kvm_clock_read+0x14/0x30
  ? ktime_get+0xb9/0x180
  ? rcu_read_lock_held_common+0xe/0xa0
  ? rcu_read_lock_sched_held+0x56/0xc0
  ? rcu_read_lock_bh_held+0xa0/0xa0
  ? hrtimer_interrupt+0x271/0x790
  __do_softirq+0x1d0/0x88f
  irq_exit_rcu+0xe7/0x120
  sysvec_apic_timer_interrupt+0x8a/0xb0
  </IRQ>
  <TASK>
[ ... ]

Fixes: bc54e49c ("amt: add multicast(IGMP) report message handler")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20211108145340.17208-1-ap420073@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

43aa4937

net: dsa: mv88e6xxx: Don't support >1G speeds on 6191X on ports other than 10 · dc2fc9f0

Marek Behún authored Nov 04, 2021

Model 88E6191X only supports >1G speeds on port 10. Port 0 and 9 are
only 1G.

Fixes: de776d0d ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Signed-off-by: Marek Behún <kabel@kernel.org>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20211104171747.10509-1-kabel@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dc2fc9f0

09 Nov, 2021 8 commits

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fceb0795

Jakub Kicinski authored Nov 09, 2021

Alexei Starovoitov says:

====================
pull-request: bpf 2021-11-09

We've added 7 non-merge commits during the last 3 day(s) which contain
a total of 10 files changed, 174 insertions(+), 48 deletions(-).

The main changes are:

1) Various sockmap fixes, from John and Jussi.

2) Fix out-of-bound issue with bpf_pseudo_func, from Martin.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
  bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
  bpf, sockmap: Fix race in ingress receive verdict with redirect to self
  bpf, sockmap: Remove unhash handler for BPF sockmap usage
  bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
  bpf: selftest: Trigger a DCE on the whole subprog
  bpf: Stop caching subprog index in the bpf_pseudo_func insn
====================

Link: https://lore.kernel.org/r/20211109215702.38350-1-alexei.starovoitov@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fceb0795

amt: add IPV6 Kconfig dependency · 9758aba8

Arnd Bergmann authored Nov 08, 2021

This driver cannot be built-in if IPV6 is a loadable module:

x86_64-linux-ld: drivers/net/amt.o: in function `amt_build_mld_gq':
amt.c:(.text+0x2e7d): undefined reference to `ipv6_dev_get_saddr'

Add the idiomatic Kconfig dependency that all such modules
have.

Fixes: b9022b53 ("amt: add control plane of amt interface")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9758aba8

gve: Fix off by one in gve_tx_timeout() · 1c360cc1

Dan Carpenter authored Nov 09, 2021

The priv->ntfy_blocks[] has "priv->num_ntfy_blks" elements so this >
needs to be >= to prevent an off by one bug.  The priv->ntfy_blocks[]
array is allocated in gve_alloc_notify_blocks().

Fixes: 87a7f321 ("gve: Recover from queue stall due to missed IRQ")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c360cc1

hamradio: defer 6pack kfree after unregister_netdev · 0b911192

Lin Ma authored Nov 08, 2021

There is a possible race condition (use-after-free) like below

 (USE)                       |  (FREE)
  dev_queue_xmit             |
   __dev_queue_xmit          |
    __dev_xmit_skb           |
     sch_direct_xmit         | ...
      xmit_one               |
       netdev_start_xmit     | tty_ldisc_kill
        __netdev_start_xmit  |  6pack_close
         sp_xmit             |   kfree
          sp_encaps          |
                             |

According to the patch "defer ax25 kfree after unregister_netdev", this
patch reorder the kfree after the unregister_netdev to avoid the possible
UAF as the unregister_netdev() is well synchronized and won't return if
there is a running routine.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b911192

hamradio: defer ax25 kfree after unregister_netdev · 3e0588c2

Lin Ma authored Nov 08, 2021

There is a possible race condition (use-after-free) like below

 (USE)                       |  (FREE)
ax25_sendmsg                 |
 ax25_queue_xmit             |
  dev_queue_xmit             |
   __dev_queue_xmit          |
    __dev_xmit_skb           |
     sch_direct_xmit         | ...
      xmit_one               |
       netdev_start_xmit     | tty_ldisc_kill
        __netdev_start_xmit  |  mkiss_close
         ax_xmit             |   kfree
          ax_encaps          |
                             |

Even though there are two synchronization primitives before the kfree:
1. wait_for_completion(&ax->dead). This can prevent the race with
routines from mkiss_ioctl. However, it cannot stop the routine coming
from upper layer, i.e., the ax25_sendmsg.

2. netif_stop_queue(ax->dev). It seems that this line of code aims to
halt the transmit queue but it fails to stop the routine that already
being xmit.

This patch reorder the kfree after the unregister_netdev to avoid the
possible UAF as the unregister_netdev() is well synchronized and won't
return if there is a running routine.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e0588c2

net: sungem_phy: fix code indentation · 54f0bad6

Jean Sacren authored Nov 07, 2021

Remove extra space in front of the return statement.

Fixes: eb5b5b2f ("sungem_phy: support bcm5461 phy, autoneg.")
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54f0bad6

bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg · b2c46181

Jussi Maki authored Nov 03, 2021

The current conversion of skb->data_end reads like this:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r2 +116)
   563: (1f) r1 -= r11

But similar to the case in 84f44df6 ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect skb->len
when src == dst. In this case we end up generating this xlated code:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r1 +116)
   563: (1f) r1 -= r11

... where line 560 is the reading 4B of (skb->data + 112) instead of the
intended skb->len Here the skb pointer in r1 gets set to skb->data and the
later deref for skb->len ends up following skb->data instead of skb.

This fixes the issue similarly to the patch mentioned above by creating an
additional temporary variable and using to store the register when dst_reg =
src_reg. We name the variable bpf_temp_reg and place it in the cb context for
sk_skb. Then we restore from the temp to ensure nothing is lost.

Fixes: 16137b09 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com

b2c46181

bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding · e0dc3b93

John Fastabend authored Nov 03, 2021

Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.

But, more importantly strparser is using the following to access its metadata.

  (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

Where _strp_msg is defined as:

  struct _strp_msg {
        struct strp_msg            strp;                 /*     0     8 */
        int                        accum_len;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 2 */
        /* last cacheline: 12 bytes */
  };

So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.

In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com

e0dc3b93

08 Nov, 2021 3 commits

bpf, sockmap: Fix race in ingress receive verdict with redirect to self · c5d2177a

John Fastabend authored Nov 03, 2021

A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:

 None, Tx, Rx, and TxRx

To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.

In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.

However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.

       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue

This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.

To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.

With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com

c5d2177a

bpf, sockmap: Remove unhash handler for BPF sockmap usage · b8b8315e

John Fastabend authored Nov 03, 2021

We do not need to handle unhash from BPF side we can simply wait for the
close to happen. The original concern was a socket could transition from
ESTABLISHED state to a new state while the BPF hook was still attached.
But, we convinced ourself this is no longer possible and we also improved
BPF sockmap to handle listen sockets so this is no longer a problem.

More importantly though there are cases where unhash is called when data is
in the receive queue. The BPF unhash logic will flush this data which is
wrong. To be correct it should keep the data in the receive queue and allow
a receiving application to continue reading the data. This may happen when
tcp_abort() is received for example. Instead of complicating the logic in
unhash simply moving all this to tcp_close() hook solves this.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com

b8b8315e

bpf, sockmap: Use stricter sk state checks in sk_lookup_assign · 40a34121

John Fastabend authored Nov 03, 2021

In order to fix an issue with sockets in TCP sockmap redirect cases we plan
to allow CLOSE state sockets to exist in the sockmap. However, the check in
bpf_sk_lookup_assign() currently only invalidates sockets in the
TCP_ESTABLISHED case relying on the checks on sockmap insert to ensure we
never SOCK_CLOSE state sockets in the map.

To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
explicitly test for the accepted cases. Namely, a tcp socket in TCP_LISTEN
or a udp socket in TCP_CLOSE state. This also makes the code more resilent
to future changes.
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com

40a34121

07 Nov, 2021 8 commits

litex_liteeth: Fix a double free in the remove function · c45231a7

Christophe JAILLET authored Nov 07, 2021

'netdev' is a managed resource allocated in the probe using
'devm_alloc_etherdev()'.
It must not be freed explicitly in the remove function.

Fixes: ee7da21a ("net: Add driver for LiteX's LiteETH network interface")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

c45231a7

nfc: pn533: Fix double free when pn533_fill_fragment_skbs() fails · 9fec40f8

Chengfeng Ye authored Nov 05, 2021

skb is already freed by dev_kfree_skb in pn533_fill_fragment_skbs,
but follow error handler branch when pn533_fill_fragment_skbs()
fails, skb is freed again, results in double free issue. Fix this
by not free skb in error path of pn533_fill_fragment_skbs.

Fixes: 963a82e0 ("NFC: pn533: Split large Tx frames in chunks")
Fixes: 93ad4202 ("NFC: pn533: Target mode Tx fragmentation support")
Signed-off-by: Chengfeng Ye <cyeaa@connect.ust.hk>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9fec40f8

selftests: net: tls: remove unused variable and code · 62b12ab5

Anders Roxell authored Nov 05, 2021

When building selftests/net with clang, the compiler warn about the
function abs() see below:

tls.c:657:15: warning: variable 'len_compared' set but not used [-Wunused-but-set-variable]
        unsigned int len_compared = 0;
                     ^

Rework to remove the unused variable and the for-loop where the variable
'len_compared' was assinged.

Fixes: 7f657d5b ("selftests: tls: add selftests for TLS sockets")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62b12ab5

net: marvell: prestera: fix hw structure laid out · e1464db5

Volodymyr Mytnyk authored Nov 05, 2021

The prestera FW v4.0 support commit has been merged
accidentally w/o review comments addressed and waiting
for the final patch set to be uploaded. So, fix the remaining
comments related to structure laid out and build issues.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: bb5dbf2c ("net: marvell: prestera: add firmware v4.0 support")
Signed-off-by: Volodymyr Mytnyk <vmytnyk@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1464db5

sctp: remove unreachable code from sctp_sf_violation_chunk() · e7ea51cd

Alexey Khoroshilov authored Nov 05, 2021

sctp_sf_violation_chunk() is not called with asoc argument equal to NULL,
but if that happens it would lead to NULL pointer dereference
in sctp_vtag_verify().

The patch removes code that handles NULL asoc in sctp_sf_violation_chunk().

Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Proposed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7ea51cd

llc: fix out-of-bound array index in llc_sk_dev_hash() · 8ac9dfd5

Eric Dumazet authored Nov 05, 2021

Both ifindex and LLC_SK_DEV_HASH_ENTRIES are signed.

This means that (ifindex % LLC_SK_DEV_HASH_ENTRIES) is negative
if @ifindex is negative.

We could simply make LLC_SK_DEV_HASH_ENTRIES unsigned.

In this patch I chose to use hash_32() to get more entropy
from @ifindex, like llc_sk_laddr_hashfn().

UBSAN: array-index-out-of-bounds in ./include/net/llc.h:75:26
index -43 is out of range for type 'hlist_head [64]'
CPU: 1 PID: 20999 Comm: syz-executor.3 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
 llc_sk_dev_hash include/net/llc.h:75 [inline]
 llc_sap_add_socket+0x49c/0x520 net/llc/llc_conn.c:697
 llc_ui_bind+0x680/0xd70 net/llc/af_llc.c:404
 __sys_bind+0x1e9/0x250 net/socket.c:1693
 __do_sys_bind net/socket.c:1704 [inline]
 __se_sys_bind net/socket.c:1702 [inline]
 __x64_sys_bind+0x6f/0xb0 net/socket.c:1702
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa503407ae9

Fixes: 6d2e3ea2 ("llc: use a device based hash table to speed up multicast delivery")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ac9dfd5

net: hisilicon: fix hsn3_ethtool kernel-doc warnings · 85879f13

Randy Dunlap authored Nov 05, 2021

Fix kernel-doc warnings and spacing in hns3_ethtool.c:

hns3_ethtool.c:246: warning: No description found for return value of 'hns3_lp_run_test'
hns3_ethtool.c:408: warning: expecting prototype for hns3_nic_self_test(). Prototype was for hns3_self_test() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Peng Li <lipeng321@huawei.com>
Cc: Guangbin Huang <huangguangbin2@huawei.com>
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

85879f13

nfc: port100: lower verbosity of cancelled URB messages · 08fcdfa6

Krzysztof Kozlowski authored Nov 07, 2021

It is not an error to receive an URB with -ENOENT because it can come
from regular user operations, e.g. pressing CTRL+C when running nfctool
from neard. Make it a debugging message, not an error.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08fcdfa6