- 13 May, 2024 40 commits
-
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Nothing useful is done with the LPA variable in lynx_pcs_get_state_2500basex(), we can just remove the read. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240513115345.2452799-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Tariq Toukan says: ==================== mlx5 misc patches This series includes patches for the mlx5 driver. Patch 1 by Shay enables LAG with HCAs of 8 ports. Patch 2 by Carolina optimizes the safe switch channels operation for the TX-only changes. Patch 3 by Parav cleans up some unused code. ==================== Link: https://lore.kernel.org/r/20240512124306.740898-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Parav Pandit authored
MSIX irq allocation and free APIs are no longer in use. Hence, remove the dead code. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://lore.kernel.org/r/20240512124306.740898-4-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Carolina Jubran authored
It is not appropriate for the mlx5e_num_channels_changed function to be called solely for updating the TX queues, even if the channels number has not been changed. Move the code responsible for updating the TC and TX queues from mlx5e_num_channels_changed and produce a new function called mlx5e_update_tc_and_tx_queues. This new function should only be called when the channels number remains unchanged. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512124306.740898-3-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Shay Drory authored
This patch adds to mlx5 drivers support for 8 ports HCAs. Starting with ConnectX-8 HCAs with 8 ports are possible. As most driver parts aren't affected by such configuration most driver code is unchanged. Specially the only affected areas are: - Lag - Multiport E-Switch - Single FDB E-Switch All of the above are already factored in generic way, and LAG and VF LAG are tested, so all that left is to change a #define and remove checks which are no longer needed. However, Multiport E-Switch is not tested yet, so it is left untouched. This patch will allow to create hardware LAG/VF LAG when all 8 ports are added to the same bond device. for example, In order to activate the hardware lag a user can execute the following: ip link add bond0 type bond ip link set bond0 type bond miimon 100 mode 2 ip link set eth2 master bond0 ip link set eth3 master bond0 ip link set eth4 master bond0 ip link set eth5 master bond0 ip link set eth6 master bond0 ip link set eth7 master bond0 ip link set eth8 master bond0 ip link set eth9 master bond0 Where eth2, eth3, eth4, eth5, eth6, eth7, eth8 and eth9 are the PFs of the same HCA. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512124306.740898-2-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lukasz Majewski authored
After this change the single SAN device (ns3eth1) is now replaced with two SAN devices - respectively ns4eth1 and ns5eth1. It is possible to extend this script to have more SAN devices connected by adding them to ns3br1 bridge. Signed-off-by: Lukasz Majewski <lukma@denx.de> Link: https://lore.kernel.org/r/20240510143710.3916631-1-lukma@denx.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Oleksij Rempel says: ==================== net: dsa: microchip: DCB fixes This patch series address recommendation to rename IPV to IPM to avoid confusion with IPV name used in 802.1Qci PSFP. And restores default "PCP only" configuration as source of priorities to avoid possible regressions. ==================== Link: https://lore.kernel.org/r/20240510053828.2412516-1-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Oleksij Rempel authored
Before DCB support, the KSZ driver had only PCP as source of packet priority values. To avoid regressions, make PCP only as default value. User will need enable DSCP support manually. This patch do not affect other KSZ8 related quirks. User will still be warned by setting not support configurations for the port 2. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240510053828.2412516-4-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Oleksij Rempel authored
All other functions are commented. Add missing comments to following functions: ksz_set_global_dscp_entry() ksz_port_add_dscp_prio() ksz_port_del_dscp_prio() Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240510053828.2412516-3-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Oleksij Rempel authored
IPV is added and used term in 802.1Qci PSFP and merged into 802.1Q (from 802.1Q-2018) for another functions. Even it does similar operation holding temporal priority value internally (as it is named), because KSZ datasheet doesn't use the term of IPV (Internal Priority Value) and avoiding any confusion later when PSFP is in the Linux world, it is better to rename IPV to IPM (Internal Priority Mapping). In addition, LAN937x documentation already use IPV for 802.1Qci PSFP related functionality. Suggested-by: Woojung Huh <Woojung.Huh@microchip.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240510053828.2412516-2-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Samuel Thibault authored
628bc3e5 ("l2tp: Support several sockets with same IP/port quadruple") added support for several L2TPv2 tunnels using the same IP/port quadruple, but if an L2TPv3 socket exists it could eat all the trafic. We thus have to first use the version from the packet to get the proper tunnel, and only then check that the version matches. Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Reviewed-by: James Chapman <jchapman@katalix.com> Link: https://lore.kernel.org/r/20240509205812.4063198-1-samuel.thibault@ens-lyon.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Antonio Quartulli authored
For type String and Binary we are currently usinig the exact-len limit value as is without attempting any name resolution. However, the spec may specify the name of a constant rather than an actual value, which would result in using the constant name as is and thus break the policy. Ensure the limit value is passed to get_limit(), which will always attempt resolving the name before printing the policy rule. Signed-off-by: Antonio Quartulli <a@unstable.cc> Link: https://lore.kernel.org/r/20240510232202.24051-1-a@unstable.ccSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Daniel Jurgens says: ==================== Add TX stop/wake counters Several drivers provide TX stop and wake counters via ethtool stats. Add those to the netdev queue stats, and use them in virtio_net. ==================== Link: https://lore.kernel.org/r/20240510201927.1821109-1-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Daniel Jurgens authored
Add a tx queue stop and wake counters, they are useful for debugging. $ ./tools/net/ynl/cli.py --spec netlink/specs/netdev.yaml \ --dump qstats-get --json '{"scope": "queue"}' ... {'ifindex': 13, 'queue-id': 0, 'queue-type': 'tx', 'tx-bytes': 14756682850, 'tx-packets': 226465, 'tx-stop': 113208, 'tx-wake': 113208}, {'ifindex': 13, 'queue-id': 1, 'queue-type': 'tx', 'tx-bytes': 18167675008, 'tx-packets': 278660, 'tx-stop': 8632, 'tx-wake': 8632}] Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240510201927.1821109-3-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Daniel Jurgens authored
TX queue stop and wake are counted by some drivers. Support reporting these via netdev-genl queue stats. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthieu Baerts (NGI0) authored
A way for an application to know if an MPTCP connection fell back to TCP is to use getsockopt(MPTCP_INFO) and look for errors. The issue with this technique is that the same errors -- EOPNOTSUPP (IPv4) and ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the kernel doesn't support this socket option. The userspace then has to look at the kernel version to understand what the errors mean. It is not clean, and it doesn't take into account older kernels where the socket option has been backported. A cleaner way would be to expose this info to the TCP socket level. In case of MPTCP socket where no fallback happened, the socket options for the TCP level will be handled in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value 1, while TCP has to set 0. If the socket option is not supported, one of these two errors will be reported: - EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets - ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the socket received after an 'accept()', when the client didn't request to use MPTCP: this socket will be a TCP one, even if the listen socket was an MPTCP one. With this new option, the kernel can return a clear answer to both "Is this kernel new enough to tell me the fallback status?" and "If it is new enough, is it currently a TCP or MPTCP socket?" questions, while not breaking the previous method. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Richard Gobert says: ==================== net: gro: remove network_header use, move p->{flush/flush_id} calculations to L4 The cb fields network_offset and inner_network_offset are used instead of skb->network_header throughout GRO. These fields are then leveraged in the next commit to remove flush_id state from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be unnecessarily complicated due to encapsulation support in GRO. These fields are checked in L4 instead. 3rd patch adds tests for different flush_id flows in GRO. ==================== Link: https://lore.kernel.org/r/20240509190819.2985-1-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Richard Gobert authored
Added flush id selftests to test different cases where DF flag is set or unset and id value changes in the following packets. All cases where the packets should coalesce or should not coalesce are tested. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-4-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Richard Gobert authored
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows. These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow. This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb. This results in less parsing code for non-loop flush tests for TCP and UDP flows. To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization). perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Richard Gobert authored
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
David Arinzon says: ==================== ENA driver changes May 2024 This patchset contains several misc and minor changes to the ENA driver. ==================== Link: https://lore.kernel.org/r/20240512134637.25299-1-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Arinzon authored
For the purpose of obtaining better CPU utilization, minimum rx moderation interval is set to 20 usec. Signed-off-by: Osama Abboud <osamaabb@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512134637.25299-6-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Arinzon authored
strscpy copies as much of the string as possible, meaning that the destination string will be truncated in case of no space. As this is a non-critical error in our case, adding a debug level print for indication. This patch also removes a -1 which was added to ensure enough space for NUL, but strscpy destination string is guaranteed to be NUL-terminted, therefore, the -1 is not needed. Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512134637.25299-5-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Arinzon authored
Validate that `first` flag is set only for the first descriptor in multi-buffer packets. In case of an invalid descriptor, a reset will occur. A new reset reason for RX data corruption has been added. Signed-off-by: Shahar Itzko <itzko@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512134637.25299-4-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Arinzon authored
This patch makes two changes in order to fill holes and reduce ther overall size of the structures ena_com_dev and ena_com_rx_ctx. Signed-off-by: Shahar Itzko <itzko@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512134637.25299-3-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Arinzon authored
This patch adds a counter to the ena_adapter struct in order to keep track of reset failures. The counter is incremented every time either ena_restore_device() or ena_destroy_device() fail. Signed-off-by: Osama Abboud <osamaabb@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512134637.25299-2-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Westphal authored
Now that this test runs in netdev CI it looks like 10s isn't enough for debug kernels: selftests: net/netfilter: nft_flowtable.sh 2024/05/10 20:33:08 socat[12204] E write(7, 0x563feb16a000, 8192): Broken pipe FAIL: file mismatch for ns1 -> ns2 -rw------- 1 root root 37345280 May 10 20:32 /tmp/tmp.Am0yEHhNqI ... Looks like socat gets zapped too quickly, so increase timeout to 1m. Could also reduce tx file size for KSFT_MACHINE_SLOW, but its preferrable to have same test for both debug and nondebug. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240511064814.561525-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Joachim kindly merged the IPv6 support in https://github.com/troglobit/mtools/pull/2, so we can just use his version now. A few more fixes subsequently came in for IPv6, so even better. Check that the deployed mtools version is 3.0 or above. Note that the version check breaks compatibility with my fork where I didn't bump the version, but I assume that won't be a problem. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20240510112856.1262901-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Colin Ian King authored
There is a spelling mistake in a TH_LOG message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240510084811.3299685-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Daniel Golle authored
Setting LED_OFF via brightness_set should deactivate hw control, so make sure netdev trigger rules also get cleared in that case. This fixes unwanted restoration of the default netdev trigger rules and matches the behaviour when using the 'netdev' trigger without any hardware offloading. Fixes: 71e79430 ("net: phy: air_en8811h: Add the Airoha EN8811H PHY driver") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/5ed8ea615890a91fa4df59a7ae8311bbdf63cdcf.1715248281.git.daniel@makrotopia.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski authored
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 skips transaction if object type provides no .update interface. Patch #2 skips NETDEV_CHANGENAME which is unused. Patch #3 enables conntrack to handle Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) as untracked opposed to invalid packets. From Linus Luessing. Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of dropping them, from Jason Xing. Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0, also from Jason. Patch #6 removes reference in netfilter's sysctl documentation on pickup entries which were already removed by Florian Westphal. Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which allows to evict entries from the conntrack table, also from Florian. Patches #8 to #16 updates nf_tables pipapo set backend to allocate the datastructure copy on-demand from preparation phase, to better deal with OOM situations where .commit step is too late to fail. Series from Florian Westphal. Patch #17 adds a selftest with packetdrill to cover conntrack TCP state transitions, also from Florian. Patch #18 use GFP_KERNEL to clone elements from control plane to avoid quick atomic reserves exhaustion with large sets, reporter refers to million entries magnitude. * tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: allow clone callbacks to sleep selftests: netfilter: add packetdrill based conntrack tests netfilter: nft_set_pipapo: remove dirty flag netfilter: nft_set_pipapo: move cloning of match info to insert/removal path netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone netfilter: nft_set_pipapo: merge deactivate helper into caller netfilter: nft_set_pipapo: prepare walk function for on-demand clone netfilter: nft_set_pipapo: prepare destroy function for on-demand clone netfilter: nft_set_pipapo: make pipapo_clone helper return NULL netfilter: nft_set_pipapo: move prove_locking helper around netfilter: conntrack: remove flowtable early-drop test netfilter: conntrack: documentation: remove reference to non-existent sysctl netfilter: use NF_DROP instead of -NF_DROP netfilter: conntrack: dccp: try not to drop skb in conntrack netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler netfilter: nf_tables: skip transaction if update object is not implemented ==================== Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jose E. Marchesi authored
[Changes from V1: - The __compat_break has been abandoned in favor of a more readable can_loop macro that can be used anywhere, including loop conditions.] The macro list_for_each_entry is defined in bpf_arena_list.h as follows: #define list_for_each_entry(pos, head, member) \ for (void * ___tmp = (pos = list_entry_safe((head)->first, \ typeof(*(pos)), member), \ (void *)0); \ pos && ({ ___tmp = (void *)pos->member.next; 1; }); \ cond_break, \ pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member)) The macro cond_break, in turn, expands to a statement expression that contains a `break' statement. Compound statement expressions, and the subsequent ability of placing statements in the header of a `for' loop, are GNU extensions. Unfortunately, clang implements this GNU extension differently than GCC: - In GCC the `break' statement is bound to the containing "breakable" context in which the defining `for' appears. If there is no such context, GCC emits a warning: break statement without enclosing `for' o `switch' statement. - In clang the `break' statement is bound to the defining `for'. If the defining `for' is itself inside some breakable construct, then clang emits a -Wgcc-compat warning. This patch adds a new macro can_loop to bpf_experimental, that implements the same logic than cond_break but evaluates to a boolean expression. The patch also changes all the current instances of usage of cond_break withing the header of loop accordingly. Tested in bpf-next master. No regressions. Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Cc: david.faust@oracle.com Cc: cupertino.miranda@oracle.com Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Link: https://lore.kernel.org/r/20240511212243.23477-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jose E. Marchesi authored
The BPF selftest global_func10 in progs/test_global_func10.c contains: struct Small { long x; }; struct Big { long x; long y; }; [...] __noinline int foo(const struct Big *big) { if (!big) return 0; return bpf_get_prandom_u32() < big->y; } [...] SEC("cgroup_skb/ingress") __failure __msg("invalid indirect access to stack") int global_func10(struct __sk_buff *skb) { const struct Small small = {.x = skb->len }; return foo((struct Big *)&small) ? 1 : 0; } GCC emits a "maybe uninitialized" warning for the code above, because it knows `foo' accesses `big->y'. Since the purpose of this selftest is to check that the verifier will fail on this sort of invalid memory access, this patch just silences the compiler warning. Tested in bpf-next master. No regressions. Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Cc: david.faust@oracle.com Cc: cupertino.miranda@oracle.com Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240511212349.23549-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jose E. Marchesi authored
The BPF selftest test_global_func9.c performs type punning and breaks srict-aliasing rules. In particular, given: int global_func9(struct __sk_buff *skb) { int result = 0; [...] { const struct C c = {.x = skb->len, .y = skb->family }; result |= foo((const struct S *)&c); } } When building with strict-aliasing enabled (the default) the initialization of `c' gets optimized away in its entirely: [... no initialization of `c' ...] r1 = r10 r1 += -40 call foo w0 |= w6 Since GCC knows that `foo' accesses s->x, we get a "maybe uninitialized" warning. On the other hand, when strict-aliasing is disabled GCC only optimizes away the store to `.y': r1 = *(u32 *) (r6+0) *(u32 *) (r10+-40) = r1 ; This is .x = skb->len in `c' r1 = r10 r1 += -40 call foo w0 |= w6 In this case the warning is not emitted, because s-> is initialized. This patch disables strict aliasing in this test when building with GCC. clang seems to not optimize this particular code even when strict aliasing is enabled. Tested in bpf-next master. Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Cc: david.faust@oracle.com Cc: cupertino.miranda@oracle.com Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240511212213.23418-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Geliang Tang authored
The strdup() function returns a pointer to a new string which is a duplicate of the string "ifname". Memory for the new string is obtained with malloc(), and need to be freed with free(). This patch adds this missing "free(saved_hwtstamp_ifname)" in cleanup() to avoid a potential memory leak in xdp_hw_metadata.c. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/af9bcccb96655e82de5ce2b4510b88c9c8ed5ed0.1715417367.git.tanggeliang@kylinos.cnSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Cupertino Miranda authored
This patch corrects a few warnings to allow selftests to compile for GCC. -- progs/cpumask_failure.c -- progs/bpf_misc.h:136:22: error: ‘cpumask’ is used uninitialized [-Werror=uninitialized] 136 | #define __sink(expr) asm volatile("" : "+g"(expr)) | ^~~ progs/cpumask_failure.c:68:9: note: in expansion of macro ‘__sink’ 68 | __sink(cpumask); The macro __sink(cpumask) with the '+' contraint modifier forces the the compiler to expect a read and write from cpumask. GCC detects that cpumask is never initialized and reports an error. This patch removes the spurious non required definitions of cpumask. -- progs/dynptr_fail.c -- progs/dynptr_fail.c:1444:9: error: ‘ptr1’ may be used uninitialized [-Werror=maybe-uninitialized] 1444 | bpf_dynptr_clone(&ptr1, &ptr2); Many of the tests in the file are related to the detection of uninitialized pointers by the verifier. GCC is able to detect possible uninitialized values, and reports this as an error. The patch initializes all of the previous uninitialized structs. -- progs/test_tunnel_kern.c -- progs/test_tunnel_kern.c:590:9: error: array subscript 1 is outside array bounds of ‘struct geneve_opt[1]’ [-Werror=array-bounds=] 590 | *(int *) &gopt.opt_data = bpf_htonl(0xdeadbeef); | ^~~~~~~~~~~~~~~~~~~~~~~ progs/test_tunnel_kern.c:575:27: note: at offset 4 into object ‘gopt’ of size 4 575 | struct geneve_opt gopt; This tests accesses beyond the defined data for the struct geneve_opt which contains as last field "u8 opt_data[0]" which clearly does not get reserved space (in stack) in the function header. This pattern is repeated in ip6geneve_set_tunnel and geneve_set_tunnel functions. GCC is able to see this and emits a warning. The patch introduces a local struct that allocates enough space to safely allow the write to opt_data field. -- progs/jeq_infer_not_null_fail.c -- progs/jeq_infer_not_null_fail.c:21:40: error: array subscript ‘struct bpf_map[0]’ is partly outside array bounds of ‘struct <anonymous>[1]’ [-Werror=array-bounds=] 21 | struct bpf_map *inner_map = map->inner_map_meta; | ^~ progs/jeq_infer_not_null_fail.c:14:3: note: object ‘m_hash’ of size 32 14 | } m_hash SEC(".maps"); This example defines m_hash in the context of the compilation unit and casts it to struct bpf_map which is much smaller than the size of struct bpf_map. It errors out in GCC when it attempts to access an element that would be defined in struct bpf_map outsize of the defined limits for m_hash. This patch disables the warning through a GCC pragma. This changes were tested in bpf-next master selftests without any regressions. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Cc: jose.marchesi@oracle.com Cc: david.faust@oracle.com Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Link: https://lore.kernel.org/r/20240510183850.286661-2-cupertino.miranda@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Faust authored
This patch fixes an integer overflow warning raised by GCC in xdp_prognum1 of progs/test_xdp_vlan.c: GCC-BPF [test_maps] test_xdp_vlan.bpf.o progs/test_xdp_vlan.c: In function 'xdp_prognum1': progs/test_xdp_vlan.c:163:25: error: integer overflow in expression '(short int)(((__builtin_constant_p((int)vlan_hdr->h_vlan_TCI)) != 0 ? (int)(short unsigned int)((short int)((int)vlan_hdr->h_vlan_TCI << 8 >> 8) << 8 | (short int)((int)vlan_hdr->h_vlan_TCI << 0 >> 8 << 0)) & 61440 : (int)__builtin_bswap16(vlan_hdr->h_vlan_TCI) & 61440) << 8 >> 8) << 8' of type 'short int' results in '0' [-Werror=overflow] 163 | bpf_htons((bpf_ntohs(vlan_hdr->h_vlan_TCI) & 0xf000) | ^~~~~~~~~ The problem lies with the expansion of the bpf_htons macro and the expression passed into it. The bpf_htons macro (and similarly the bpf_ntohs macro) expand to a ternary operation using either __builtin_bswap16 or ___bpf_swab16 to swap the bytes, depending on whether the expression is constant. For an expression, with 'value' as a u16, like: bpf_htons (value & 0xf000) The entire (value & 0xf000) is 'x' in the expansion of ___bpf_swab16 and we get as one part of the expanded swab16: ((__u16)(value & 0xf000) << 8 >> 8 << 8 This will always evaluate to 0, which is intentional since this subexpression deals with the byte guaranteed to be 0 by the mask. However, GCC warns because the precise reason this always evaluates to 0 is an overflow. Specifically, the plain 0xf000 in the expression is a signed 32-bit integer, which causes 'value' to also be promoted to a signed 32-bit integer, and the combination of the 8-bit left shift and down-cast back to __u16 results in a signed overflow (really a 'warning: overflow in conversion from int to __u16' which is propegated up through the rest of the expression leading to the ultimate overflow warning above), which is a valid warning despite being the intended result of this code. Clang does not warn on this case, likely because it performs constant folding later in the compilation process relative to GCC. It seems that by the time clang does constant folding for this expression, the side of the ternary with this overflow has already been discarded. Fortunately, this warning is easily silenced by simply making the 0xf000 mask explicitly unsigned. This has no impact on the result. Signed-off-by: David Faust <david.faust@oracle.com> Cc: jose.marchesi@oracle.com Cc: cupertino.miranda@oracle.com Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240508193512.152759-1-david.faust@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tushar Vyavahare authored
Remove the redundant ethtool.h header file from tools/include/uapi/linux. The file is unnecessary as the system uses the kernel's include/uapi/linux/ethtool.h directly. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20240508104123.434769-1-tushar.vyavahare@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Jordan Rife says: ==================== Retire progs/test_sock_addr.c This patch series migrates remaining tests from bpf/test_sock_addr.c to prog_tests/sock_addr.c and progs/verifier_sock_addr.c in order to fully retire the old-style test program and expands test coverage to test previously untested scenarios related to sockaddr hooks. This is a continuation of the work started recently during the expansion of prog_tests/sock_addr.c. Link: https://lore.kernel.org/bpf/20240429214529.2644801-1-jrife@google.com/T/#u ======= Patches ======= * Patch 1 moves tests that check valid return values for recvmsg hooks into progs/verifier_sock_addr.c, a new addition to the verifier test suite. * Patches 2-5 lay the groundwork for test migration, enabling prog_tests/sock_addr.c to handle more test dimensions. * Patches 6-11 move existing tests to prog_tests/sock_addr.c. * Patch 12 removes some redundant test cases. * Patches 14-17 expand on existing test coverage. ==================== Link: https://lore.kernel.org/r/20240510190246.3247730-1-jrife@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-