Commits · 630e4576f83accf90366686f39808d665d8dbecc · Kirill Smelkov / linux

05 Apr, 2021 2 commits

net-ipv6: bugfix - raw & sctp - switch to ipv6_can_nonlocal_bind() · 630e4576

Maciej Żenczykowski authored Apr 05, 2021

Found by virtue of ipv6 raw sockets not honouring the per-socket
IP{,V6}_FREEBIND setting.

Based on hits found via:
git grep '[.]ip_nonlocal_bind'
We fix both raw ipv6 sockets to honour IP{,V6}_FREEBIND and IP{,V6}_TRANSPARENT,
and we fix sctp sockets to honour IP{,V6}_TRANSPARENT (they already honoured
FREEBIND), and not just the ipv6 'ip_nonlocal_bind' sysctl.

The helper is defined as:
static inline bool ipv6_can_nonlocal_bind(struct net *net, struct inet_sock *inet) {
return net->ipv6.sysctl.ip_nonlocal_bind || inet->freebind || inet->transparent;
}
so this change only widens the accepted opt-outs and is thus a clean bugfix.

I'm not entirely sure what 'fixes' tag to add, since this is AFAICT an ancient bug,
but IMHO this should be applied to stable kernels as far back as possible.
As such I'm adding a 'fixes' tag with the commit that originally added the helper,
which happened in 4.19. Backporting to older LTS kernels (at least 4.9 and 4.14)
would presumably require open-coding it or backporting the helper as well.

Other possibly relevant commits:
v4.18-rc6-1502-g83ba4645 net: add helpers checking if socket can be bound to nonlocal address
v4.18-rc6-1431-gd0c1f011 net/ipv6: allow any source address for sendmsg pktinfo with ip_nonlocal_bind
v4.14-rc5-271-gb71d21c2 sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
v4.7-rc7-1883-g9b974202 sctp: support ipv6 nonlocal bind
v4.1-12247-g35a256fe ipv6: Nonlocal bind

Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 83ba4645 ("net: add helpers checking if socket can be bound to nonlocal address")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

630e4576

openvswitch: fix send of uninitialized stack memory in ct limit reply · 4d51419d

Ilya Maximets authored Apr 04, 2021

'struct ovs_zone_limit' has more members than initialized in
ovs_ct_limit_get_default_limit().  The rest of the memory is a random
kernel stack content that ends up being sent to userspace.

Fix that by using designated initializer that will clear all
non-specified fields.

Fixes: 11efd5cb ("openvswitch: Support conntrack zone limit")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d51419d

02 Apr, 2021 3 commits

net: macb: restore cmp registers on resume path · a14d273b

Claudiu Beznea authored Apr 02, 2021

Restore CMP screener registers on resume path.

Fixes: c1e85c6c ("net: macb: save/restore the remaining registers and features")
Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a14d273b

net: cls_api: Fix uninitialised struct field bo->unlocked_driver_cb · 990b03b0

Yunjian Wang authored Apr 01, 2021

The 'unlocked_driver_cb' struct field in 'bo' is not being initialized
in tcf_block_offload_init(). The uninitialized 'unlocked_driver_cb'
will be used when calling unlocked_driver_cb(). So initialize 'bo' to
zero to avoid the issue.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 0fdcf78d ("net: use flow_indr_dev_setup_offload()")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

990b03b0

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 29684d80

David S. Miller authored Apr 02, 2021

Alexei Starovoitov says:

====================
pull-request: bpf 2021-04-01

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 8 day(s) which contain
a total of 10 files changed, 151 insertions(+), 26 deletions(-).

The main changes are:

1) xsk creation fixes, from Ciara.

2) bpf_get_task_stack fix, from Dave.

3) trampoline in modules fix, from Jiri.

4) bpf_obj_get fix for links and progs, from Lorenz.

5) struct_ops progs must be gpl compatible fix, from Toke.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

29684d80

01 Apr, 2021 17 commits

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 9256ce33

David S. Miller authored Apr 01, 2021

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-04-01

This series contains updates to i40e driver only.

Arkadiusz fixes warnings for inconsistent indentation.

Magnus fixes an issue on xsk receive where single packets over time
are batched rather than received immediately.

Eryk corrects warnings and reporting of veb-stats.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9256ce33

Merge branch 'mptcp-deadlock' · efd2e92d

David S. Miller authored Apr 01, 2021

Paolo Abeni says:

====================
mptcp: mptcp: fix deadlock in mptcp{,6}_release

syzkaller has reported a few deadlock triggered by
mptcp{,6}_release.

These patches address the issue in the easy way - blocking
the relevant, multicast related, sockopt options on MPTCP
sockets.

Note that later on net-next we are going to revert patch 1/2,
as a part of a larger MPTCP sockopt implementation refactor
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

efd2e92d

mptcp: revert "mptcp: provide subflow aware release function" · 0a3cc579

Paolo Abeni authored Apr 01, 2021

This change reverts commit ad98dd37 ("mptcp: provide subflow aware
release function"). The latter introduced a deadlock spotted by
syzkaller and is not needed anymore after the previous commit.

Fixes: ad98dd37 ("mptcp: provide subflow aware release function")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a3cc579

mptcp: forbit mcast-related sockopt on MPTCP sockets · 86581852

Paolo Abeni authored Apr 01, 2021

Unrolling mcast state at msk dismantel time is bug prone, as
syzkaller reported:

======================================================
WARNING: possible circular locking dependency detected
5.11.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor905/8822 is trying to acquire lock:
ffffffff8d678fe8 (rtnl_mutex){+.+.}-{3:3}, at: ipv6_sock_mc_close+0xd7/0x110 net/ipv6/mcast.c:323

but task is already holding lock:
ffff888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1600 [inline]
ffff888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: mptcp6_release+0x57/0x130 net/mptcp/protocol.c:3507

which lock already depends on the new lock.

Instead we can simply forbit any mcast-related setsockopt

Fixes: 717e79c8 ("mptcp: Add setsockopt()/getsockopt() socket operations")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86581852

drivers: net: fix memory leak in peak_usb_create_dev · a0b96b4a

Pavel Skripkin authored Apr 01, 2021

syzbot reported memory leak in peak_usb.
The problem was in case of failure after calling
->dev_init()[2] in peak_usb_create_dev()[1]. The data
allocated int dev_init() wasn't freed, so simple
->dev_free() call fix this problem.

backtrace:
    [<0000000079d6542a>] kmalloc include/linux/slab.h:552 [inline]
    [<0000000079d6542a>] kzalloc include/linux/slab.h:682 [inline]
    [<0000000079d6542a>] pcan_usb_fd_init+0x156/0x210 drivers/net/can/usb/peak_usb/pcan_usb_fd.c:868   [2]
    [<00000000c09f9057>] peak_usb_create_dev drivers/net/can/usb/peak_usb/pcan_usb_core.c:851 [inline] [1]
    [<00000000c09f9057>] peak_usb_probe+0x389/0x490 drivers/net/can/usb/peak_usb/pcan_usb_core.c:949

Reported-by: syzbot+91adee8d9ebb9193d22d@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0b96b4a

net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...); · 98184612

Norman Maurer authored Apr 01, 2021

Support for UDP_GRO was added in the past but the implementation for
getsockopt was missed which did lead to an error when we tried to
retrieve the setting for UDP_GRO. This patch adds the missing switch
case for UDP_GRO

Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Norman Maurer <norman_maurer@apple.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

98184612

drivers: net: fix memory leak in atusb_probe · 6b9fbe16

Pavel Skripkin authored Apr 01, 2021

syzbot reported memory leak in atusb_probe()[1].
The problem was in atusb_alloc_urbs().
Since urb is anchored, we need to release the reference
to correctly free the urb

backtrace:
    [<ffffffff82ba0466>] kmalloc include/linux/slab.h:559 [inline]
    [<ffffffff82ba0466>] usb_alloc_urb+0x66/0xe0 drivers/usb/core/urb.c:74
    [<ffffffff82ad3888>] atusb_alloc_urbs drivers/net/ieee802154/atusb.c:362 [inline][2]
    [<ffffffff82ad3888>] atusb_probe+0x158/0x820 drivers/net/ieee802154/atusb.c:1038 [1]

Reported-by: syzbot+28a246747e0a465127f3@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b9fbe16

Merge branch 'AF_XDP Socket Creation Fixes' · 6dcc4e38

Alexei Starovoitov authored Apr 01, 2021

Ciara Loftus says:

====================

This series fixes some issues around socket creation for AF_XDP.

Patch 1 fixes a potential NULL pointer dereference in
xsk_socket__create_shared.

Patch 2 ensures that the umem passed to xsk_socket__create(_shared)
remains unchanged in event of failure.

Patch 3 makes it possible for xsk_socket__create(_shared) to
succeed even if the rx and tx XDP rings have already been set up by
introducing a new fields to struct xsk_umem which represent the ring
setup status for the xsk which shares the fd with the umem.

v3->v4:
* Reduced nesting in xsk_put_ctx as suggested by Alexei.
* Use bools instead of a u8 and flags to represent the
  ring setup status as suggested by Björn.

v2->v3:
* Instead of ignoring the return values of the setsockopt calls, introduce
  a new flag to determine whether or not to call them based on the ring
  setup status as suggested by Alexei.

v1->v2:
* Simplified restoring the _save pointers as suggested by Magnus.
* Fixed the condition which determines whether to unmap umem rings
 when socket create fails.

====================
Acked-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

6dcc4e38

libbpf: Only create rx and tx XDP rings when necessary · ca7a83e2

Ciara Loftus authored Mar 31, 2021

Prior to this commit xsk_socket__create(_shared) always attempted to create
the rx and tx rings for the socket. However this causes an issue when the
socket being setup is that which shares the fd with the UMEM. If a
previous call to this function failed with this socket after the rings were
set up, a subsequent call would always fail because the rings are not torn
down after the first call and when we try to set them up again we encounter
an error because they already exist. Solve this by remembering whether the
rings were set up by introducing new bools to struct xsk_umem which
represent the ring setup status and using them to determine whether or
not to set up the rings.

Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331061218.1647-4-ciara.loftus@intel.com

ca7a83e2

libbpf: Restore umem state after socket create failure · 43f1bc1e

Ciara Loftus authored Mar 31, 2021

If the call to xsk_socket__create fails, the user may want to retry the
socket creation using the same umem. Ensure that the umem is in the
same state on exit if the call fails by:
1. ensuring the umem _save pointers are unmodified.
2. not unmapping the set of umem rings that were set up with the umem
during xsk_umem__create, since those maps existed before the call to
xsk_socket__create and should remain in tact even in the event of
failure.

Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331061218.1647-3-ciara.loftus@intel.com

43f1bc1e

libbpf: Ensure umem pointer is non-NULL before dereferencing · df662016

Ciara Loftus authored Mar 31, 2021

Calls to xsk_socket__create dereference the umem to access the
fill_save and comp_save pointers. Make sure the umem is non-NULL
before doing this.

Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210331061218.1647-2-ciara.loftus@intel.com

df662016

bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET · d37300ed

Lorenz Bauer authored Mar 26, 2021

As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds
currently don't allow modifications this is a precaution, not a
straight up bug fix.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com

d37300ed

bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET · 25fc94b2

Lorenz Bauer authored Mar 26, 2021

Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access
permissions based on file_flags, but the returned fd ignores flags.
This means that any user can acquire a "read-write" fd for a pinned
link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in
file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc.

Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works
because OBJ_GET by default returns a read write mapping and libbpf
doesn't expose a way to override this behaviour for programs
and links.

Fixes: 70ed506c ("bpf: Introduce pinnable bpf_link abstraction")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com

25fc94b2

bpf: Refcount task stack in bpf_get_task_stack · 06ab134c

Dave Marchevsky authored Mar 31, 2021

On x86 the struct pt_regs * grabbed by task_pt_regs() points to an
offset of task->stack. The pt_regs are later dereferenced in
__bpf_get_stack (e.g. by user_mode() check). This can cause a fault if
the task in question exits while bpf_get_task_stack is executing, as
warned by task_stack_page's comment:

* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead.  task_stack_page will return a pointer
* that could get freed out from under you.

Taking the comment's advice and using try_get_task_stack() and
put_task_stack() to hold task->stack refcount, or bail early if it's
already 0. Incrementing stack_refcount will ensure the task's stack
sticks around while we're using its data.

I noticed this bug while testing a bpf task iter similar to
bpf_iter_task_stack in selftests, except mine grabbed user stack, and
getting intermittent crashes, which resulted in dumps like:

  BUG: unable to handle page fault for address: 0000000000003fe0
  \#PF: supervisor read access in kernel mode
  \#PF: error_code(0x0000) - not-present page
  RIP: 0010:__bpf_get_stack+0xd0/0x230
  <snip...>
  Call Trace:
  bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec
  bpf_iter_run_prog+0x24/0x81
  __task_seq_show+0x58/0x80
  bpf_seq_read+0xf7/0x3d0
  vfs_read+0x91/0x140
  ksys_read+0x59/0xd0
  do_syscall_64+0x48/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: fa28dcb8 ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com

06ab134c

i40e: Fix display statistics for veb_tc · c3214de9

Eryk Rybak authored Mar 02, 2021

If veb-stats was enabled, the ethtool stats triggered a warning
due to invalid size: 'unexpected stat size for veb.tc_%u_tx_packets'.
This was due to an incorrect structure definition for the statistics.
Structures and functions have been improved in line with requirements
for the presentation of statistics, in particular for the functions:
'i40e_add_ethtool_stats' and 'i40e_add_stat_strings'.

Fixes: 1510ae0b ("i40e: convert VEB TC stats to use an i40e_stats array")
Signed-off-by: Eryk Rybak <eryk.roch.rybak@intel.com>
Signed-off-by: Grzegorz Szczurek <grzegorzx.szczurek@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

c3214de9

i40e: fix receiving of single packets in xsk zero-copy mode · 528060ef

Magnus Karlsson authored Mar 19, 2021

Fix so that single packets are received immediately instead of in
batches of 8. If you sent 1 pps to a system, you received 8 packets
every 8 seconds instead of 1 packet every second. The problem behind
this was that the work_done reporting from the Tx part of the driver
was broken. The work_done reporting in i40e controls not only the
reporting back to the napi logic but also the setting of the interrupt
throttling logic. When Tx or Rx reports that it has more to do,
interrupts are throttled or coalesced and when they both report that
they are done, interrupts are armed right away. If the wrong work_done
value is returned, the logic will start to throttle interrupts in a
situation where it should have just enabled them. This leads to the
undesired batching behavior seen in user-space.

Fix this by returning the correct boolean value from the Tx xsk
zero-copy path. Return true if there is nothing to do or if we got
fewer packets to process than we asked for. Return false if we got as
many packets as the budget since there might be more packets we can
process.

Fixes: 3106c580 ("i40e: Use batched xsk Tx interfaces to increase performance")
Reported-by: Sreedevi Joshi <sreedevi.joshi@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

528060ef

i40e: Fix inconsistent indenting · 7a75a841

Arkadiusz Kubalewski authored Feb 17, 2021

Fixed new static analysis findings:
"warn: inconsistent indenting" - introduced lately,
reported with lkp and smatch build.

Fixes: 4b208eaa ("i40e: Add init and default config of software based DCB")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

7a75a841

31 Mar, 2021 17 commits

xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory model · 622d1369

Ong Boon Leong authored Mar 31, 2021

xdp_return_frame() may be called outside of NAPI context to return
xdpf back to page_pool. xdp_return_frame() calls __xdp_return() with
napi_direct = false. For page_pool memory model, __xdp_return() calls
xdp_return_frame_no_direct() unconditionally and below false negative
kernel BUG throw happened under preempt-rt build:

[  430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884
[  430.451678] caller is __xdp_return+0x1ff/0x2e0
[  430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G     U      E     5.12.0-rc2+ #45

Changes in v2:
 - This patch fixes the issue by making xdp_return_frame_no_direct() is
   only called if napi_direct = true, as recommended for better by
   Jesper Dangaard Brouer. Thanks!

Fixes: 2539650f ("xdp: Helpers for disabling napi_direct of xdp_return_frame")
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

622d1369

Revert "net: correct sk_acceptq_is_full()" · c609e6aa

Eric Dumazet authored Mar 31, 2021

This reverts commit f211ac15.

We had similar attempt in the past, and we reverted it.

History:

64a14651 [NET]: Revert incorrect accept queue backlog changes.
8488df89 [NET]: Fix bugs in "Whether sock accept queue is full" checking

I am adding a fat comment so that future attempts will
be much harder.

Fixes: f211ac15 ("net: correct sk_acceptq_is_full()")
Cc: iuyacan <yacanliu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c609e6aa

Merge tag 'mlx5-fixes-2021-03-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9dc22c0d

David S. Miller authored Mar 31, 2021

Saeed Mahameed says:

====================
mlx5 fixes 2021-03-31

This series introduces some fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9dc22c0d

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · c9170f13

David S. Miller authored Mar 31, 2021

Steffen Klassert says:

====================
pull request (net): ipsec 2021-03-31

1) Fix ipv4 pmtu checks for xfrm anf vti interfaces.
   From Eyal Birger.

2) There are situations where the socket passed to
   xfrm_output_resume() is not the same as the one
   attached to the skb. Use the socket passed to
   xfrm_output_resume() to avoid lookup failures
   when xfrm is used with VRFs.
   From Evan Nimmo.

3) Make the xfrm_state_hash_generation sequence counter per
   network namespace because but its write serialization
   lock is also per network namespace. Write protection
   is insufficient otherwise.
   From Ahmed S. Darwish.

4) Fixup sctp featue flags when used with esp offload.
   From Xin Long.

5) xfrm BEET mode doesn't support fragments for inner packets.
   This is a limitation of the protocol, so no fix possible.
   Warn at least to notify the user about that situation.
   From Xin Long.

6) Fix NULL pointer dereference on policy lookup when
   namespaces are uses in combination with esp offload.

7) Fix incorrect transformation on esp offload when
   packets get segmented at layer 3.

8) Fix some user triggered usages of WARN_ONCE in
   the xfrm compat layer.
   From Dmitry Safonov.

Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c9170f13

net/rds: Fix a use after free in rds_message_map_pages · bdc2ab5c

Lv Yunlong authored Mar 30, 2021

In rds_message_map_pages, the rm is freed by rds_message_put(rm).
But rm is still used by rm->data.op_sg in return value.

My patch assigns ERR_CAST(rm->data.op_sg) to err before the rm is
freed to avoid the uaf.

Fixes: 7dba9203 ("net/rds: Use ERR_PTR for rds_message_alloc_sgs()")
Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bdc2ab5c

neighbour: Disregard DEAD dst in neigh_update · d47ec7a0

Tong Zhu authored Mar 19, 2021

After a short network outage, the dst_entry is timed out and put
in DST_OBSOLETE_DEAD. We are in this code because arp reply comes
from this neighbour after network recovers. There is a potential
race condition that dst_entry is still in DST_OBSOLETE_DEAD.
With that, another neighbour lookup causes more harm than good.

In best case all packets in arp_queue are lost. This is
counterproductive to the original goal of finding a better path
for those packets.

I observed a worst case with 4.x kernel where a dst_entry in
DST_OBSOLETE_DEAD state is associated with loopback net_device.
It leads to an ethernet header with all zero addresses.
A packet with all zero source MAC address is quite deadly with
mac80211, ath9k and 802.11 block ack.  It fails
ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx
queue (ath_tx_complete_aggr). BAW (block ack window) is not
updated. BAW logic is damaged and ath9k transmission is disabled.
Signed-off-by: Tong Zhu <zhutong@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d47ec7a0

net/mlx5e: Guarantee room for XSK wakeup NOP on async ICOSQ · 3ff3874f

Tariq Toukan authored Feb 10, 2021

XSK wakeup flow triggers an IRQ by posting a NOP WQE and hitting
the doorbell on the async ICOSQ.
It maintains its state so that it doesn't issue another NOP WQE
if it has an outstanding one already.

For this flow to work properly, the NOP post must not fail.
Make sure to reserve room for the NOP WQE in all WQE posts to the
async ICOSQ.

Fixes: 8d94b590 ("net/mlx5e: Turn XSK ICOSQ into a general asynchronous one")
Fixes: 1182f365 ("net/mlx5e: kTLS, Add kTLS RX HW offload support")
Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

3ff3874f

net/mlx5e: Consider geneve_opts for encap contexts · 929a2fad

Dima Chumak authored Feb 11, 2021

Current algorithm for encap keys is legacy from initial vxlan
implementation and doesn't take into account all possible fields of a
tunnel. For example, for a Geneve tunnel, which may have additional TLV
options, they are ignored when comparing encap keys and a rule can be
attached to an incorrect encap entry.

Fix that by introducing encap_info_equal() operation in
struct mlx5e_tc_tunnel. Geneve tunnel type uses custom implementation,
which extends generic algorithm and considers options if they are set.

Fixes: 7f1a546e ("net/mlx5e: Consider tunnel type for encap contexts")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

929a2fad

net/mlx5: Don't request more than supported EQs · a7b76002

Daniel Jurgens authored Dec 05, 2019

Calculating the number of compeltion EQs based on the number of
available IRQ vectors doesn't work now that all async EQs share one IRQ.
Thus the max number of EQs can be exceeded on systems with more than
approximately 256 CPUs. Take this into account when calculating the
number of available completion EQs.

Fixes: 81bfa206 ("net/mlx5: Use a single IRQ for all async EQs")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

a7b76002

net/mlx5e: kTLS, Fix RX counters atomicity · 6f4fdd53

Tariq Toukan authored Feb 28, 2021

Some TLS RX counters increment per socket/connection, and are not
protected against parallel modifications from several cores.
Switch them to atomic counters by taking them out of the RQ stats into
the global atomic TLS stats.

In this patch, we touch 'rx_tls_ctx/del' that count the number of
device-offloaded RX TLS connections added/deleted.
These counters are updated in the add/del callbacks, out of the fast
data-path.

This change is not needed for counters that increment only in NAPI
context, as they are protected by the NAPI mechanism.
Keep them as tls_* counters under 'struct mlx5e_rq_stats'.

Fixes: 76c1e1ac ("net/mlx5e: kTLS, Add kTLS RX stats")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

6f4fdd53

net/mlx5e: kTLS, Fix TX counters atomicity · a51bce96

Tariq Toukan authored Feb 28, 2021

Some TLS TX counters increment per socket/connection, and are not
protected against parallel modifications from several cores.
Switch them to atomic counters by taking them out of the SQ stats into
the global atomic TLS stats.

In this patch, we touch a single counter 'tx_tls_ctx' that counts the
number of device-offloaded TX TLS connections added.
Now that this counter can be increased without the for having the SQ
context in hand, move it to the mlx5e_ktls_add_tx() callback where it
really belongs, out of the fast data-path.

This change is not needed for counters that increment only in NAPI
context or under the TX lock, as they are already protected.
Keep them as tls_* counters under 'struct mlx5e_sq_stats'.

Fixes: d2ead1f3 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

a51bce96

net/mlx5: E-switch, Create vport miss group only if src rewrite is supported · e929e3da

Maor Dickman authored Mar 15, 2021

Create send to vport miss group was added in order to support traffic
recirculation to root table with metadata source rewrite.
This group is created also in case source rewrite isn't supported.

Fixed by creating send to vport miss group only if source rewrite is
supported by FW.

Fixes: 8e404fef ("net/mlx5e: Match recirculated packet miss in slow table using reg_c1")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

e929e3da

net/mlx5e: Fix ethtool indication of connector type · 3211434d

Aya Levin authored Mar 24, 2021

Use connector_type read from PTYS register when it's valid, based on
corresponding capability bit.

Fixes: 5b4793f8 ("net/mlx5e: Add support for reading connector type from PTYS")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

3211434d

net/mlx5: Delete auxiliary bus driver eth-rep first · 1f90aedf

Maor Dickman authored Mar 22, 2021

Delete auxiliary bus drivers flow deletes the eth driver
first and then the eth-reps driver but eth-reps devices resources
are depend on eth device.

Fixed by changing the delete order of auxiliary bus drivers to delete
the eth-rep driver first and after it the eth driver.

Fixes: 601c10c8 ("net/mlx5: Delete custom device management logic")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

1f90aedf

net/mlx5e: Fix mapping of ct_label zero · d24f847e

Ariel Levkovich authored Mar 09, 2021

ct_label 0 is a default label each flow has and therefore
there can be rules that match on ct_label=0 without a prior
rule that set the ct_label to this value.

The ct_label value is not used directly in the HW rules and
instead it is mapped to some id within a defined range and this
id is used to set and match the metadata register which carries
the ct_label.

If we have a rule that matches on ct_label=0, the hw rule will
perform matching on a value that is != 0 because of the mapping
from label to id. Since the metadata register default value is
0 and it was never set before to anything else by an action that
sets the ct_label, there will always be a mismatch between that
register and the value in the rule.

To support such rule, a forced mapping of ct_label 0 to id=0
is done so that it will match the metadata register default
value of 0.

Fixes: 54b154ec ("net/mlx5e: CT: Map 128 bits labels to 32 bit map ID")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

d24f847e

net: ensure mac header is set in virtio_net_hdr_to_skb() · 61431a59

Eric Dumazet authored Mar 30, 2021

Commit 924a9bc3 ("net: check if protocol extracted by virtio_net_hdr_set_proto is correct")
added a call to dev_parse_header_protocol() but mac_header is not yet set.

This means that eth_hdr() reads complete garbage, and syzbot complained about it [1]

This patch resets mac_header earlier, to get more coverage about this change.

Audit of virtio_net_hdr_to_skb() callers shows that this change should be safe.

[1]

BUG: KASAN: use-after-free in eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
Read of size 2 at addr ffff888017a6200b by task syz-executor313/8409

CPU: 1 PID: 8409 Comm: syz-executor313 Not tainted 5.12.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
 __kasan_report mm/kasan/report.c:399 [inline]
 kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
 eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
 dev_parse_header_protocol include/linux/netdevice.h:3177 [inline]
 virtio_net_hdr_to_skb.constprop.0+0x99d/0xcd0 include/linux/virtio_net.h:83
 packet_snd net/packet/af_packet.c:2994 [inline]
 packet_sendmsg+0x2325/0x52b0 net/packet/af_packet.c:3031
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:674
 sock_no_sendpage+0xf3/0x130 net/core/sock.c:2860
 kernel_sendpage.part.0+0x1ab/0x350 net/socket.c:3631
 kernel_sendpage net/socket.c:3628 [inline]
 sock_sendpage+0xe5/0x140 net/socket.c:947
 pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
 splice_from_pipe_feed fs/splice.c:418 [inline]
 __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
 splice_from_pipe fs/splice.c:597 [inline]
 generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
 do_splice_from fs/splice.c:767 [inline]
 do_splice+0xb7e/0x1940 fs/splice.c:1079
 __do_splice+0x134/0x250 fs/splice.c:1144
 __do_sys_splice fs/splice.c:1350 [inline]
 __se_sys_splice fs/splice.c:1332 [inline]
 __x64_sys_splice+0x198/0x250 fs/splice.c:1332
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46

Fixes: 924a9bc3 ("net: check if protocol extracted by virtio_net_hdr_set_proto is correct")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Balazs Nemeth <bnemeth@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61431a59

net: phy: broadcom: Only advertise EEE for supported modes · c056d480

Florian Fainelli authored Mar 30, 2021

We should not be advertising EEE for modes that we do not support,
correct that oversight by looking at the PHY device supported linkmodes.

Fixes: 99cec8a4 ("net: phy: broadcom: Allow enabling or disabling of EEE")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c056d480

30 Mar, 2021 1 commit

nfp: flower: ignore duplicate merge hints from FW · 2ea538db

Yinjun Zhang authored Mar 30, 2021

A merge hint message needs some time to process before the merged
flow actually reaches the firmware, during which we may get duplicate
merge hints if there're more than one packet that hit the pre-merged
flow. And processing duplicate merge hints will cost extra host_ctx's
which are a limited resource.

Avoid the duplicate merge by using hash table to store the sub_flows
to be merged.

Fixes: 8af56f40 ("nfp: flower: offload merge flows")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ea538db