Commits · 60d7ceea4b2a2f7106f61a0bb712a71868875106 · Kirill Smelkov / linux

29 Jul, 2022 25 commits

net/mlx4: Lock mlx4 devlink reload callback · 60d7ceea

Moshe Shemesh authored Jul 28, 2022

Change devlink instance locks in mlx4 driver to have devlink reload
callback locked, while keeping all driver paths which leads to devl_ API
functions called by the mlx4 driver locked.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

60d7ceea

net/mlx4: Use devl_ API for devlink port register / unregister · a8c05514

Moshe Shemesh authored Jul 28, 2022

Use devl_ API to call devl_port_register() and devl_port_unregister()
instead of devlink_port_register() and devlink_port_unregister(). Add
devlink instance lock in mlx4 driver paths to these functions.

This will be used by the downstream patch to invoke mlx4 devlink reload
callbacks with devlink lock held.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a8c05514

net/mlx4: Use devl_ API for devlink region create / destroy · 9cb7e94a

Moshe Shemesh authored Jul 28, 2022

Use devl_ API to call devl_region_create() and devl_region_destroy()
instead of devlink_region_create() and devlink_region_destroy().
Add devlink instance lock in mlx4 driver paths to these functions.

This will be used by the downstream patch to invoke mlx4 devlink reload
callbacks with devlink lock held.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9cb7e94a

net/mlx5: Lock mlx5 devlink reload callbacks · 84a433a4

Moshe Shemesh authored Jul 28, 2022

Change devlink instance locks in mlx5 driver to have devlink reload
callbacks locked, while keeping all driver paths which lead to devl_ API
functions called by the driver locked.

Add mlx5_load_one_devl_locked() and mlx5_unload_one_devl_locked() which
are used by the paths which are already locked such as devlink reload
callbacks.

This patch makes the driver use devl_ API also for traps register as
these functions are called from the driver paths parallel to reload that
requires locking now.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

84a433a4

net/mlx5: Move fw reset unload to mlx5_fw_reset_complete_reload · c12f4c6a

Moshe Shemesh authored Jul 28, 2022

Refactor fw reset code to have the unload driver part done on
mlx5_fw_reset_complete_reload(), so if it was called by the PF which
initiated the reload fw activate flow, the unload part will be handled
by the mlx5_devlink_reload_fw_activate() callback itself and not by the
reset event work.

This will be used by the downstream patch to invoke devlink reload
callbacks with devlink lock held.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c12f4c6a

net: devlink: remove region snapshots list dependency on devlink->lock · 2dec18ad

Jiri Pirko authored Jul 28, 2022

After mlx4 driver is converted to do locked reload,
devlink_region_snapshot_create() may be called from both locked and
unlocked context.

Note that in mlx4 region snapshots could be created on any command
failure. That can happen in any flow that involves commands to FW,
which means most of the driver flows.

So resolve this by removing dependency on devlink->lock for region
snapshots list consistency and introduce new mutex to ensure it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2dec18ad

net: devlink: remove region snapshot ID tracking dependency on devlink->lock · 5502e871

Jiri Pirko authored Jul 28, 2022

After mlx4 driver is converted to do locked reload, functions to get/put
regions snapshot ID may be called from both locked and unlocked context.

So resolve this by removing dependency on devlink->lock for region
snapshot ID tracking by using internal xa_lock() to maintain
shapshot_ids xa_array consistency.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5502e871

Merge branch 'add-framework-for-selftests-in-devlink' · 1515a1b8

Jakub Kicinski authored Jul 28, 2022

Vikas Gupta says:

====================
add framework for selftests in devlink

Add support for selftests in the devlink framework.
Adds a callback .selftests_check and .selftests_run in devlink_ops.
User can add test(s) suite which is subsequently passed to the driver
and driver can opt for running particular tests based on its capabilities.

Patchset adds a flash based test for the bnxt_en driver.
====================

Link: https://lore.kernel.org/r/20220727165721.37959-1-vikas.gupta@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1515a1b8

bnxt_en: implement callbacks for devlink selftests · 5b6ff128

vikas authored Jul 27, 2022

Add callbacks
=============
.selftest_check: returns true for flash selftest.
.selftest_run: runs a flash selftest.

Also, refactor NVM APIs so that they can be
used with devlink and ethtool both.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5b6ff128

devlink: introduce framework for selftests · 08f588fa

Vikas Gupta authored Jul 27, 2022

Add a framework for running selftests.
Framework exposes devlink commands and test suite(s) to the user
to execute and query the supported tests by the driver.

Below are new entries in devlink_nl_ops
devlink_nl_cmd_selftests_show_doit/dumpit: To query the supported
selftests by the drivers.
devlink_nl_cmd_selftests_run: To execute selftests. Users can
provide a test mask for executing group tests or standalone tests.

Documentation/networking/devlink/ path is already part of MAINTAINERS &
the new files come under this path. Hence no update needed to the
MAINTAINERS
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

08f588fa

Merge branch 'mlx5e-use-tls-tx-pool-to-improve-connection-rate' · 68be7b82

Jakub Kicinski authored Jul 28, 2022

Tariq Toukan says:

====================
mlx5e use TLS TX pool to improve connection rate

To offload encryption operations, the mlx5 device maintains state and
keeps track of every kTLS device-offloaded connection.  Two HW objects
are used per TX context of a kTLS offloaded connection: a. Transport
interface send (TIS) object, to reach the HW context.  b. Data Encryption
Key (DEK) to perform the crypto operations.

These two objects are created and destroyed per TLS TX context, via FW
commands.  In total, 4 FW commands are issued per TLS TX context, which
seriously limits the connection rate.

In this series, we aim to save creation and destroy of TIS objects by
recycling them.  Upon recycling of a TIS, the HW still needs to be
notified for the re-mapping between a TIS and a context. This is done by
posting WQEs via an SQ, significantly faster API than the FW command
interface.

A pool is used for recycling. The pool dynamically interacts to the load
and connection rate, growing and shrinking accordingly.

Saving the TIS FW commands per context increases connection rate by ~42%,
from 11.6K to 16.5K connections per sec.

Connection rate is still limited by FW bottleneck due to the remaining
per context FW commands (DEK create/destroy). This will soon be addressed
in a followup series.  By combining the two series, the FW bottleneck
will be released, and a significantly higher (about 100K connections per
sec) kTLS TX device-offloaded connection rate is reached.
====================

Link: https://lore.kernel.org/r/20220727094346.10540-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

68be7b82

net/mlx5e: kTLS, Dynamically re-size TX recycling pool · 624bf099

Tariq Toukan authored Jul 27, 2022

Let the TLS TX recycle pool be more flexible in size, by continuously
and dynamically allocating and releasing HW resources in response to
changes in the connections rate and load.

Allocate and release pool entries in bulks (16). Use a workqueue to
release/allocate in the background. Allocate a new bulk when the pool
size goes lower than the low threshold (1K). Symmetric operation is done
when the pool size gets greater than the upper threshold (4K).

Every idle pool entry holds: 1 TIS, 1 DEK (HW resources), in addition to
~100 bytes in host memory.

Start with an empty pool to minimize memory and HW resources waste for
non-TLS users that have the device-offload TLS enabled.

Upon a new request, in case the pool is empty, do not wait for a whole bulk
allocation to complete.  Instead, trigger an instant allocation of a single
resource to reduce latency.

Performance tests:
Before: 11,684 CPS
After:  16,556 CPS
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

624bf099

net/mlx5e: kTLS, Recycle objects of device-offloaded TLS TX connections · c4dfe704

Tariq Toukan authored Jul 27, 2022

The transport interface send (TIS) object is responsible for performing
all transport related operations of the transmit side.  The ConnectX HW
uses a TIS object to save and access the TLS crypto information and state
of an offloaded TX kTLS connection.

Before this patch, we used to create a new TIS per connection and destroy
it once it’s closed. Every create and destroy of a TIS is a FW command.

Same applies for the private TLS context, where we used to dynamically
allocate and free it per connection.

Resources recycling reduce the impact of the allocation/free operations
and helps speeding up the connection rate.

In this feature we maintain a pool of TX objects and use it to recycle
the resources instead of re-creating them per connection.

A cached TIS popped from the pool is updated to serve the new connection
via the fast-path HW interface, updating the tls static and progress
params. This is a very fast operation, significantly faster than FW
commands.

On recycling, a WQE fence is required after the context params change.
This guarantees that the data is sent after the context has been
successfully updated in hardware, and that the context modification
doesn't interfere with existing traffic.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c4dfe704

net/mlx5e: kTLS, Take stats out of OOO handler · 23b1cf1e

Tariq Toukan authored Jul 27, 2022

Let the caller of mlx5e_ktls_tx_handle_ooo() take care of updating the
stats, according to the returned value.  As the switch/case blocks are
already there, this change saves unnecessary branches in the handler.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

23b1cf1e

net/mlx5e: kTLS, Introduce TLS-specific create TIS · da6682fa

Tariq Toukan authored Jul 27, 2022

TLS TIS objects have a defined role in mapping and reaching the HW TLS
contexts.  Some standard TIS attributes (like LAG port affinity) are
not relevant for them.

Use a dedicated TLS TIS create function instead of the generic
mlx5e_create_tis.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

da6682fa

net/tls: Multi-threaded calls to TX tls_dev_del · 7adc91e0

Tariq Toukan authored Jul 27, 2022

Multiple TLS device-offloaded contexts can be added in parallel via
concurrent calls to .tls_dev_add, while calls to .tls_dev_del are
sequential in tls_device_gc_task.

This is not a sustainable behavior. This creates a rate gap between add
and del operations (addition rate outperforms the deletion rate).  When
running for enough time, the TLS device resources could get exhausted,
failing to offload new connections.

Replace the single-threaded garbage collector work with a per-context
alternative, so they can be handled on several cores in parallel. Use
a new dedicated destruct workqueue for this.

Tested with mlx5 device:
Before: 22141 add/sec,   103 del/sec
After:  11684 add/sec, 11684 del/sec
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7adc91e0

net/tls: Perform immediate device ctx cleanup when possible · 113671b2

Tariq Toukan authored Jul 27, 2022

TLS context destructor can be run in atomic context. Cleanup operations
for device-offloaded contexts could require access and interaction with
the device callbacks, which might sleep. Hence, the cleanup of such
contexts must be deferred and completed inside an async work.

For all others, this is not necessary, as cleanup is atomic. Invoke
cleanup immediately for them, avoiding queueing redundant gc work.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

113671b2

tls: rx: Fix unsigned comparison with less than zero · 8fd1e151

Yang Li authored Jul 28, 2022

The return from the call to tls_rx_msg_size() is int, it can be
a negative error code, however this is being assigned to an
unsigned long variable 'sz', so making 'sz' an int.

Eliminate the following coccicheck warning:
./net/tls/tls_strp.c:211:6-8: WARNING: Unsigned expression compared with zero: sz < 0
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220728031019.32838-1-yang.lee@linux.alibaba.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8fd1e151

Merge branch 'tls-rx-follow-ups-to-rx-work' · 37e26188

Jakub Kicinski authored Jul 28, 2022

Jakub Kicinski says:

====================
tls: rx: follow ups to rx work

A selection of unrelated changes. First some selftest polishing.
Next a change to rcvtimeo handling for locking based on an exchange
with Eric. Follow up to Paolo's comments from yesterday. Last but
not least a fix to a false positive warning, turns out I've been
testing with DEBUG_NET=n this whole time.
====================

Link: https://lore.kernel.org/r/20220727031524.358216-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

37e26188

tls: rx: fix the false positive warning · e20691fa

Jakub Kicinski authored Jul 26, 2022

I went too far in the accessor conversion, we can't use tls_strp_msg()
after decryption because the message may not be ready. What we care
about on this path is that the output skb is detached, i.e. we didn't
somehow just turn around and used the input skb with its TCP data
still attached. So look at the anchor directly.

Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e20691fa

tls: strp: rename and multithread the workqueue · d11ef9cc

Jakub Kicinski authored Jul 26, 2022

Paolo points out that there seems to be no strong reason strparser
users a single threaded workqueue. Perhaps there were some performance
or pinning considerations? Since we don't know (and it's the slow path)
let's default to the most natural, multi-threaded choice.

Also rename the workqueue to "tls-".
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d11ef9cc

tls: rx: don't consider sock_rcvtimeo() cumulative · 70f03fc2

Jakub Kicinski authored Jul 26, 2022

Eric indicates that restarting rcvtimeo on every wait may be fine.
I thought that we should consider it cumulative, and made
tls_rx_reader_lock() return the remaining timeo after acquiring
the reader lock.

tls_rx_rec_wait() gets its timeout passed in by value so it
does not keep track of time previously spent.

Make the lock waiting consistent with tls_rx_rec_wait() - don't
keep track of time spent.

Read the timeo fresh in tls_rx_rec_wait().
It's unclear to me why callers are supposed to cache the value.

Link: https://lore.kernel.org/all/CANn89iKcmSfWgvZjzNGbsrndmCch2HC_EPZ7qmGboDNaWoviNQ@mail.gmail.com/Signed-off-by: Jakub Kicinski <kuba@kernel.org>

70f03fc2

selftests: tls: handful of memrnd() and length checks · 86c591fb

Jakub Kicinski authored Jul 26, 2022

Add a handful of memory randomizations and precise length checks.
Nothing is really broken here, I did this to increase confidence
when debugging. It does fix a GCC warning, tho. Apparently GCC
recognizes that memory needs to be initialized for send() but
does not recognize that for write().
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

86c591fb

net: usb: delete extra space and tab in blank line · efe3e6b5

Xie Shaowen authored Jul 27, 2022

delete extra space and tab in blank line, there is no functional change.
Signed-off-by: Xie Shaowen <studentxswpy@163.com>
Link: https://lore.kernel.org/r/20220727081253.3043941-1-studentxswpy@163.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

efe3e6b5

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 272ac32f
Jakub Kicinski authored Jul 28, 2022
```
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
```
272ac32f

28 Jul, 2022 15 commits

Merge tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 33ea1340

Linus Torvalds authored Jul 28, 2022

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth and netfilter, no known blockers for
  the release.

  Current release - regressions:

   - wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop(), fix
     taking the lock before its initialized

   - Bluetooth: mgmt: fix double free on error path

  Current release - new code bugs:

   - eth: ice: fix tunnel checksum offload with fragmented traffic

  Previous releases - regressions:

   - tcp: md5: fix IPv4-mapped support after refactoring, don't take the
     pure v6 path

   - Revert "tcp: change pingpong threshold to 3", improving detection
     of interactive sessions

   - mld: fix netdev refcount leak in mld_{query | report}_work() due to
     a race

   - Bluetooth:
      - always set event mask on suspend, avoid early wake ups
      - L2CAP: fix use-after-free caused by l2cap_chan_put

   - bridge: do not send empty IFLA_AF_SPEC attribute

  Previous releases - always broken:

   - ping6: fix memleak in ipv6_renew_options()

   - sctp: prevent null-deref caused by over-eager error paths

   - virtio-net: fix the race between refill work and close, resulting
     in NAPI scheduled after close and a BUG()

   - macsec:
      - fix three netlink parsing bugs
      - avoid breaking the device state on invalid change requests
      - fix a memleak in another error path

  Misc:

   - dt-bindings: net: ethernet-controller: rework 'fixed-link' schema

   - two more batches of sysctl data race adornment"

* tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
  stmmac: dwmac-mediatek: fix resource leak in probe
  ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
  net: ping6: Fix memleak in ipv6_renew_options().
  net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
  sctp: leave the err path free in sctp_stream_init to sctp_stream_free
  sfc: disable softirqs for ptp TX
  ptp: ocp: Select CRC16 in the Kconfig.
  tcp: md5: fix IPv4-mapped support
  virtio-net: fix the race between refill work and close
  mptcp: Do not return EINPROGRESS when subflow creation succeeds
  Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
  Bluetooth: Always set event mask on suspend
  Bluetooth: mgmt: Fix double free on error path
  wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
  ice: do not setup vlan for loopback VSI
  ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
  ice: Fix VSIs unable to share unicast MAC
  ice: Fix tunnel checksum offload with fragmented traffic
  ice: Fix max VLANs available for VF
  netfilter: nft_queue: only allow supported familes and hooks
  ...

33ea1340

stmmac: dwmac-mediatek: fix resource leak in probe · 4d3d3a1b

Dan Carpenter authored Jul 28, 2022

If mediatek_dwmac_clks_config() fails, then call stmmac_remove_config_dt()
before returning. Otherwise it is a resource leak.

Fixes: fa4b3ca6 ("stmmac: dwmac-mediatek: fix clock issue")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YuJ4aZyMUlG6yGGa@kiliSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4d3d3a1b

ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr · 85f0173d

Ziyang Xuan authored Jul 28, 2022

Change net device's MTU to smaller than IPV6_MIN_MTU or unregister
device while matching route. That may trigger null-ptr-deref bug
for ip6_ptr probability as following.

=========================================================
BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
Read of size 4 at addr 0000000000000308 by task ping6/263

CPU: 2 PID: 263 Comm: ping6 Not tainted 5.19.0-rc7+ #14
Call trace:
 dump_backtrace+0x1a8/0x230
 show_stack+0x20/0x70
 dump_stack_lvl+0x68/0x84
 print_report+0xc4/0x120
 kasan_report+0x84/0x120
 __asan_load4+0x94/0xd0
 find_match.part.0+0x70/0x134
 __find_rr_leaf+0x408/0x470
 fib6_table_lookup+0x264/0x540
 ip6_pol_route+0xf4/0x260
 ip6_pol_route_output+0x58/0x70
 fib6_rule_lookup+0x1a8/0x330
 ip6_route_output_flags_noref+0xd8/0x1a0
 ip6_route_output_flags+0x58/0x160
 ip6_dst_lookup_tail+0x5b4/0x85c
 ip6_dst_lookup_flow+0x98/0x120
 rawv6_sendmsg+0x49c/0xc70
 inet_sendmsg+0x68/0x94

Reproducer as following:
Firstly, prepare conditions:
$ip netns add ns1
$ip netns add ns2
$ip link add veth1 type veth peer name veth2
$ip link set veth1 netns ns1
$ip link set veth2 netns ns2
$ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
$ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
$ip netns exec ns1 ifconfig veth1 up
$ip netns exec ns2 ifconfig veth2 up
$ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
$ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1

Secondly, execute the following two commands in two ssh windows
respectively:
$ip netns exec ns1 sh
$while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done

$ip netns exec ns1 sh
$while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done

It is because ip6_ptr has been assigned to NULL in addrconf_ifdown() firstly,
then ip6_ignore_linkdown() accesses ip6_ptr directly without NULL check.

	cpu0			cpu1
fib6_table_lookup
__find_rr_leaf
			addrconf_notify [ NETDEV_CHANGEMTU ]
			addrconf_ifdown
			RCU_INIT_POINTER(dev->ip6_ptr, NULL)
find_match
ip6_ignore_linkdown

So we can add NULL check for ip6_ptr before using in ip6_ignore_linkdown() to
fix the null-ptr-deref bug.

Fixes: dcd1f572 ("net/ipv6: Remove fib6_idev")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220728013307.656257-1-william.xuanziyang@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

85f0173d

net: ping6: Fix memleak in ipv6_renew_options(). · e2732600

Kuniyuki Iwashima authored Jul 27, 2022

When we close ping6 sockets, some resources are left unfreed because
pingv6_prot is missing sk->sk_prot->destroy().  As reported by
syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.

    struct ipv6_sr_hdr *hdr;
    char data[24] = {0};
    int fd;

    hdr = (struct ipv6_sr_hdr *)data;
    hdr->hdrlen = 2;
    hdr->type = IPV6_SRCRT_TYPE_4;

    fd = socket(AF_INET6, SOCK_DGRAM, NEXTHDR_ICMP);
    setsockopt(fd, IPPROTO_IPV6, IPV6_RTHDR, data, 24);
    close(fd);

To fix memory leaks, let's add a destroy function.

Note the socket() syscall checks if the GID is within the range of
net.ipv4.ping_group_range.  The default value is [1, 0] so that no
GID meets the condition (1 <= GID <= 0).  Thus, the local DoS does
not succeed until we change the default value.  However, at least
Ubuntu/Fedora/RHEL loosen it.

    $ cat /usr/lib/sysctl.d/50-default.conf
    ...
    -net.ipv4.ping_group_range = 0 2147483647

Also, there could be another path reported with these options, and
some of them require CAP_NET_RAW.

  setsockopt
      IPV6_ADDRFORM (inet6_sk(sk)->pktoptions)
      IPV6_RECVPATHMTU (inet6_sk(sk)->rxpmtu)
      IPV6_HOPOPTS (inet6_sk(sk)->opt)
      IPV6_RTHDRDSTOPTS (inet6_sk(sk)->opt)
      IPV6_RTHDR (inet6_sk(sk)->opt)
      IPV6_DSTOPTS (inet6_sk(sk)->opt)
      IPV6_2292PKTOPTIONS (inet6_sk(sk)->opt)

  getsockopt
      IPV6_FLOWLABEL_MGR (inet6_sk(sk)->ipv6_fl_list)

For the record, I left a different splat with syzbot's one.

  unreferenced object 0xffff888006270c60 (size 96):
    comm "repro2", pid 231, jiffies 4294696626 (age 13.118s)
    hex dump (first 32 bytes):
      01 00 00 00 44 00 00 00 00 00 00 00 00 00 00 00  ....D...........
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<00000000f6bc7ea9>] sock_kmalloc (net/core/sock.c:2564 net/core/sock.c:2554)
      [<000000006d699550>] do_ipv6_setsockopt.constprop.0 (net/ipv6/ipv6_sockglue.c:715)
      [<00000000c3c3b1f5>] ipv6_setsockopt (net/ipv6/ipv6_sockglue.c:1024)
      [<000000007096a025>] __sys_setsockopt (net/socket.c:2254)
      [<000000003a8ff47b>] __x64_sys_setsockopt (net/socket.c:2265 net/socket.c:2262 net/socket.c:2262)
      [<000000007c409dcb>] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
      [<00000000e939c4a9>] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[0]: https://syzkaller.appspot.com/bug?extid=a8430774139ec3ab7176

Fixes: 6d0bfe22 ("net: ipv6: Add IPv6 support to the ping socket.")
Reported-by: syzbot+a8430774139ec3ab7176@syzkaller.appspotmail.com
Reported-by: Ayushman Dutta <ayudutta@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220728012220.46918-1-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e2732600

watch_queue: Fix missing locking in add_watch_to_object() · e64ab2db

Linus Torvalds authored Jul 28, 2022

If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.

KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.

While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list.  The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.

Fix this by locking the watcher list in add_watch_to_object().

Fixes: c73be61c ("pipe: Add general notification queue support")
Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e64ab2db

watch_queue: Fix missing rcu annotation · e0339f03

David Howells authored Jul 28, 2022

Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).

Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.

Fixes: c73be61c ("pipe: Add general notification queue support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e0339f03

net: cdns,macb: use correct xlnx prefix for Xilinx · 623cd870

Krzysztof Kozlowski authored Jul 26, 2022

Use correct vendor for Xilinx versions of Cadence MACB/GEM Ethernet
controller. The Versal compatible was not released, so it can be
changed. Zynq-7xxx and Ultrascale+ has to be kept in new and deprecated
form.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Harini Katakam <harini.katakam@amd.com>
Link: https://lore.kernel.org/r/20220726070802.26579-2-krzysztof.kozlowski@linaro.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

623cd870

dt-bindings: net: cdns,macb: use correct xlnx prefix for Xilinx · afa950b8

Krzysztof Kozlowski authored Jul 26, 2022

Use correct vendor for Xilinx versions of Cadence MACB/GEM Ethernet
controller. The Versal compatible was not released, so it can be
changed. Zynq-7xxx and Ultrascale+ has to be kept in new and deprecated
form.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220726070802.26579-1-krzysztof.kozlowski@linaro.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

afa950b8

net/funeth: Fix fun_xdp_tx() and XDP packet reclaim · 51a83391

Dimitris Michailidis authored Jul 26, 2022

The current implementation of fun_xdp_tx(), used for XPD_TX, is
incorrect in that it takes an address/length pair and later releases it
with page_frag_free(). It is OK for XDP_TX but the same code is used by
ndo_xdp_xmit. In that case it loses the XDP memory type and releases the
packet incorrectly for some of the types. Assorted breakage follows.

Change fun_xdp_tx() to take xdp_frame and rely on xdp_return_frame() in
reclaim.

Fixes: db37bc17 ("net/funeth: add the data path")
Signed-off-by: Dimitris Michailidis <dmichail@fungible.com>
Link: https://lore.kernel.org/r/20220726215923.7887-1-dmichail@fungible.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

51a83391

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 7d85e9cb

Paolo Abeni authored Jul 28, 2022

Tony Nguyen says:

====================
ice: PPPoE offload support

Marcin Szycik says:

Add support for dissecting PPPoE and PPP-specific fields in flow dissector:
PPPoE session id and PPP protocol type. Add support for those fields in
tc-flower and support offloading PPPoE. Finally, add support for hardware
offload of PPPoE packets in switchdev mode in ice driver.

Example filter:
tc filter add dev $PF1 ingress protocol ppp_ses prio 1 flower pppoe_sid \
    1234 ppp_proto ip skip_sw action mirred egress redirect dev $VF1_PR

Changes in iproute2 are required to use the new fields (will be submitted
soon).

ICE COMMS DDP package is required to create a filter in ice.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Add support for PPPoE hardware offload
  flow_offload: Introduce flow_match_pppoe
  net/sched: flower: Add PPPoE filter
  flow_dissector: Add PPPoE dissectors
====================

Link: https://lore.kernel.org/r/20220726203133.2171332-1-anthony.l.nguyen@intel.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

7d85e9cb

add missing includes and forward declarations to networking includes under linux/ · 5f10376b

Jakub Kicinski authored Jul 26, 2022

Similarly to a recent include/net/ cleanup, this patch adds
missing includes to networking headers under include/linux.
All these problems are currently masked by the existing users
including the missing dependency before the broken header.

Link: https://lore.kernel.org/all/20220723045755.2676857-1-kuba@kernel.org/ v1
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220726215652.158167-1-kuba@kernel.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

5f10376b

Revert "Merge branch 'octeontx2-minor-tc-fixes'" · 4158e389

Paolo Abeni authored Jul 26, 2022

This reverts commit 35d099da, reversing
changes made to 58d8bcd4.

I wrongly applied that to the net-next tree instead of the intended
target tree (net). Reverting it on net-next.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

4158e389

net: dsa: mv88e6xxx: fix speed setting for CPU/DSA ports · cc1049cc

Marcin Wojtas authored Jul 27, 2022

Commit 3c783b83 ("net: dsa: mv88e6xxx: get rid of SPEED_MAX setting")
stopped relying on SPEED_MAX constant and hardcoded speed settings
for the switch ports and rely on phylink configuration.

It turned out, however, that when the relevant code is called,
the mac_capabilites of CPU/DSA port remain unset.
mv88e6xxx_setup_port() is called via mv88e6xxx_setup() in
dsa_tree_setup_switches(), which precedes setting the caps in
phylink_get_caps down in the chain of dsa_tree_setup_ports().

As a result the mac_capabilites are 0 and the default speed for CPU/DSA
port is 10M at the start. To fix that, execute mv88e6xxx_get_caps()
and obtain the capabilities driectly.

Fixes: 3c783b83 ("net: dsa: mv88e6xxx: get rid of SPEED_MAX setting")
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20220726230918.2772378-1-mw@semihalf.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

cc1049cc

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · bf84719d

Jakub Kicinski authored Jul 27, 2022

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2022-07-26

This series contains updates to ice driver only.

Przemyslaw corrects accounting for VF VLANs to allow for correct number
of VLANs for untrusted VF. He also correct issue with checksum offload
on VXLAN tunnels.

Ani allows for two VSIs to share the same MAC address.

Maciej corrects checked bits for descriptor completion of loopback

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  ice: do not setup vlan for loopback VSI
  ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
  ice: Fix VSIs unable to share unicast MAC
  ice: Fix tunnel checksum offload with fragmented traffic
  ice: Fix max VLANs available for VF
====================

Link: https://lore.kernel.org/r/20220726204646.2171589-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bf84719d

net: devlink: remove redundant net_eq() check from sb_pool_get_dumpit() · 2bb88b2c

Jiri Pirko authored Jul 27, 2022

The net_eq() check is already performed inside
devlinks_xa_for_each_registered_get() helper, so remove the redundant
appearance.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20220727055912.568391-1-jiri@resnulli.usSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2bb88b2c