- 12 Feb, 2022 1 commit
-
-
Toke Høiland-Jørgensen authored
When receiving netlink messages, libbpf was using a statically allocated stack buffer of 4k bytes. This happened to work fine on systems with a 4k page size, but on systems with larger page sizes it can lead to truncated messages. The user-visible impact of this was that libbpf would insist no XDP program was attached to some interfaces because that bit of the netlink message got chopped off. Fix this by switching to a dynamically allocated buffer; we borrow the approach from iproute2 of using recvmsg() with MSG_PEEK|MSG_TRUNC to get the actual size of the pending message before receiving it, adjusting the buffer as necessary. While we're at it, also add retries on interrupted system calls around the recvmsg() call. v2: - Move peek logic to libbpf_netlink_recv(), don't double free on ENOMEM. Fixes: 8bbb77b7 ("libbpf: Add various netlink helpers") Reported-by: Zhiqian Guan <zhguan@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20220211234819.612288-1-toke@redhat.com
-
- 11 Feb, 2022 6 commits
-
-
Andrii Nakryiko authored
Ensure that LIBBPF_0.7.0 inherits everything from LIBBPF_0.6.0. Fixes: dbdd2c7f ("libbpf: Add API to get/set log_level at per-program level") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220211205235.2089104-1-andrii@kernel.org
-
Andrii Nakryiko authored
Quentin Monnet says: ==================== Hi, this set aims at updating the way bpftool versions are numbered. Instead of copying the version from the kernel (given that the sources for the kernel and bpftool are shipped together), align it on libbpf's version number, with a fixed offset (6) to avoid going backwards. Please refer to the description of the second commit for details on the motivations. The patchset also adds the number of the version of libbpf that was used to compile to the output of "bpftool version". Bpftool makes such a heavy usage of libbpf that it makes sense to indicate what version was used to build it. v3: - Compute bpftool's version at compile time, but from the macros exposed by libbpf instead of calling a shell to compute $(BPFTOOL_VERSION) in the Makefile. - Drop the commit which would add a "libbpfversion" target to libbpf's Makefile. This is no longer necessary. - Use libbpf's major, minor versions with jsonw_printf() to avoid offsetting the version string to skip the "v" prefix. - Reword documentation change. v2: - Align on libbpf's version number instead of creating an independent versioning scheme. - Use libbpf_version_string() to retrieve and display libbpf's version. - Re-order patches (1 <-> 2). ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Quentin Monnet authored
Since the notion of versions was introduced for bpftool, it has been following the version number of the kernel (using the version number corresponding to the tree in which bpftool's sources are located). The rationale was that bpftool's features are loosely tied to BPF features in the kernel, and that we could defer versioning to the kernel repository itself. But this versioning scheme is confusing today, because a bpftool binary should be able to work with both older and newer kernels, even if some of its recent features won't be available on older systems. Furthermore, if bpftool is ported to other systems in the future, keeping a Linux-based version number is not a good option. Looking at other options, we could either have a totally independent scheme for bpftool, or we could align it on libbpf's version number (with an offset on the major version number, to avoid going backwards). The latter comes with a few drawbacks: - We may want bpftool releases in-between two libbpf versions. We can always append pre-release numbers to distinguish versions, although those won't look as "official" as something with a proper release number. But at the same time, having bpftool with version numbers that look "official" hasn't really been an issue so far. - If no new feature lands in bpftool for some time, we may move from e.g. 6.7.0 to 6.8.0 when libbpf levels up and have two different versions which are in fact the same. - Following libbpf's versioning scheme sounds better than kernel's, but ultimately it doesn't make too much sense either, because even though bpftool uses the lib a lot, its behaviour is not that much conditioned by the internal evolution of the library (or by new APIs that it may not use). Having an independent versioning scheme solves the above, but at the cost of heavier maintenance. Developers will likely forget to increase the numbers when adding features or bug fixes, and we would take the risk of having to send occasional "catch-up" patches just to update the version number. Based on these considerations, this patch aligns bpftool's version number on libbpf's. This is not a perfect solution, but 1) it's certainly an improvement over the current scheme, 2) the issues raised above are all minor at the moment, and 3) we can still move to an independent scheme in the future if we realise we need it. Given that libbpf is currently at version 0.7.0, and bpftool, before this patch, was at 5.16, we use an offset of 6 for the major version, bumping bpftool to 6.7.0. Libbpf does not export its patch number; leave bpftool's patch number at 0 for now. It remains possible to manually override the version number by setting BPFTOOL_VERSION when calling make. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220210104237.11649-3-quentin@isovalent.com
-
Quentin Monnet authored
To help users check what version of libbpf is being used with bpftool, print the number along with bpftool's own version number. Output: $ ./bpftool version ./bpftool v5.16.0 using libbpf v0.7 features: libbfd, libbpf_strict, skeletons $ ./bpftool version --json --pretty { "version": "5.16.0", "libbpf_version": "0.7", "features": { "libbfd": true, "libbpf_strict": true, "skeletons": true } } Note that libbpf does not expose its patch number. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220210104237.11649-2-quentin@isovalent.com
-
Song Liu authored
bpf_prog_pack causes build error with powerpc ppc64_defconfig: kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope 830 | unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)]; | ^~~~~~ This is because the marco expands as: unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \ + ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))]; where __pte_index_size is a global variable. Fix it by turning bitmap into a 0-length array. Fixes: 57631054 ("bpf: Introduce bpf_prog_pack allocator") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org
-
Lorenzo Bianconi authored
Update test_xdp_update_frags adding a test for a buffer size set to (MAX_SKB_FRAGS + 2) * PAGE_SIZE. The kernel is supposed to return -ENOMEM. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/3e4afa0ee4976854b2f0296998fe6754a80b62e5.1644366736.git.lorenzo@kernel.org
-
- 10 Feb, 2022 30 commits
-
-
Daniel Borkmann authored
Alexei Starovoitov says: ==================== The libbpf performs a set of complex operations to load BPF programs. With "loader program" and "CO-RE in the kernel" the loading job of libbpf was diminished. The light skeleton became lean enough to perform program loading and map creation tasks without libbpf. It's now possible to tweak it further to make light skeleton usable out of user space and out of kernel module. This allows bpf_preload.ko to drop user-mode-driver usage, drop host compiler dependency, allow cross compilation and simplify the code. It's a building block toward safe and portable kernel modules. v3->v4: - inlined skel_prep_init_value() as direct assignment in lskel v2->v3: - dropped vm_mmap() and switched to bpf_loader_ctx->flags & KERNEL approach. It allows bpf_preload.ko to be built-in. The kernel is able to load bpf progs before init process starts. - added comments (Yonghong's review) - added error checks in lskel (Andrii's review) - added Acks in all but 2nd patch. v1->v2: - removed redundant anon struct and added comments (Andrii's reivew) - added Yonghong's ack - fixed build warning when JIT is off ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
The main change is a move of the single line #include "iterators.lskel.h" from iterators/iterators.c to bpf_preload_kern.c. Which means that generated light skeleton can be used from user space or user mode driver like iterators.c or from the kernel module or the kernel itself. The direct use of light skeleton from the kernel module simplifies the code, since UMD is no longer necessary. The libbpf.a required user space and UMD. The CO-RE in the kernel and generated "loader bpf program" used by the light skeleton are capable to perform complex loading operations traditionally provided by libbpf. In addition UMD approach was launching UMD process every time bpffs has to be mounted. With light skeleton in the kernel the bpf_preload kernel module loads bpf iterators once and pins them multiple times into different bpffs mounts. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209232001.27490-6-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
Light skeleton and skel_internal.h have changed. Update iterators.lskel.h. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209232001.27490-5-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
Generealize light skeleton by hiding mmap details in skel_internal.h In this form generated lskel.h is usable both by user space and by the kernel. Note that previously #include <bpf/bpf.h> was in *.lskel.h file. To avoid #ifdef-s in a generated lskel.h the include of bpf.h is moved to skel_internal.h, but skel_internal.h is also used by gen_loader.c which is part of libbpf. Therefore skel_internal.h does #include "bpf.h" in case of user space, so gen_loader.c and lskel.h have necessary definitions. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209232001.27490-4-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
Prepare light skeleton to be used in the kernel module and in the user space. The look and feel of lskel.h is mostly the same with the difference that for user space the skel->rodata is the same pointer before and after skel_load operation, while in the kernel the skel->rodata after skel_open and the skel->rodata after skel_load are different pointers. Typical usage of skeleton remains the same for kernel and user space: skel = my_bpf__open(); skel->rodata->my_global_var = init_val; err = my_bpf__load(skel); err = my_bpf__attach(skel); // access skel->rodata->my_global_var; // access skel->bss->another_var; Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209232001.27490-3-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
bpf_sycall programs can be used directly by the kernel modules to load programs and create maps via kernel skeleton. . Export bpf_sys_bpf syscall wrapper to be used in kernel skeleton. . Export bpf_map_get to be used in kernel skeleton. . Allow prog_run cmd for bpf_syscall programs with recursion check. . Enable link_create and raw_tp_open cmds. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209232001.27490-2-alexei.starovoitov@gmail.com
-
kernel test robot authored
drivers/net/dsa/qca8k.c:422:37-43: ERROR: application of sizeof to pointer sizeof when applied to a pointer typed expression gives the size of the pointer Generated by: scripts/coccinelle/misc/noderef.cocci Fixes: 90386223 ("net: dsa: qca8k: add support for larger read/write size with mgmt Ethernet") CC: Ansuel Smith <ansuelsmth@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220209221304.GA17529@d2214a582157Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Minghao Chi (CGEL ZTE) authored
Replace zero-length array with flexible-array member and make use of the struct_size() helper in kmalloc(). For example: struct switchdev_deferred_item { ... unsigned long data[]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Commit 563f8e97 ("ipv4: Stop taking ECN bits into account in fib4-rules") replaced the validation test on frh->tos. While the new test is stricter for ECN bits, it doesn't detect the use of high order DSCP bits. This would be fine if IPv4 could properly handle them. But currently, most IPv4 lookups are done with the three high DSCP bits masked. Therefore, using these bits doesn't lead to the expected result. Let's reject such configurations again, so that nobody starts to use and make any assumption about how the stack handles the three high order DSCP bits in fib4 rules. Fixes: 563f8e97 ("ipv4: Stop taking ECN bits into account in fib4-rules") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Subbaraya Sundeep authored
This patch adds TC feature for VFs also. When MCAM rules are allocated for a VF then either TC or ntuple filters can be used. Below are the commands to use TC feature for a VF(say lbk0): devlink dev param set pci/0002:01:00.1 name mcam_count value 16 \ cmode runtime ethtool -K lbk0 hw-tc-offload on ifconfig lbk0 up tc qdisc add dev lbk0 ingress tc filter add dev lbk0 parent ffff: protocol ip flower skip_sw \ dst_mac 98:03:9b:83:aa:12 action police rate 100Mbit burst 5000 Also to modify any fields of the hardware context with NIX_AQ_INSTOP_WRITE command then corresponding masks of those fields must be set as per hardware. This was missing in ingress ratelimiting context. This patch sets those masks also. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Having to acquire rtnl from netdev_run_todo() for every dismantled device is not desirable when/if rtnl is under stress. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Venkata Sudheer Kumar Bhavaraju authored
Device firmware can assert if the device shutdown path in driver encounters an async. events from mfw (processed in qed_mcp_handle_events()) after qed_mcp_unload_req() returns. A call to qed_mcp_unload_req() currently marks the device as inactive and thus stops any new events, but there is a windows where in-flight events might still be received by the driver. To prevent this race condition, atomically set QED_MCP_BYPASS_PROC_BIT in qed_mcp_unload_req() to make sure qed_mcp_handle_events() ignores all events. Wait for any event that might already be in-process to complete by monitoring QED_MCP_IN_PROCESSING_BIT. Signed-off-by: Pravin Kumar Ganesh Dhende <pdhende@marvell.com> Signed-off-by: Venkata Sudheer Kumar Bhavaraju <vbhavaraju@marvell.com> Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jakub Kicinski says: ==================== net: ping6: support basic socket cmsgs Add support for common SOL_SOCKET cmsgs in ICMPv6 sockets. Extend the cmsg tests to cover more cmsgs and socket types. SOL_IPV6 cmsgs to follow. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Test TIMESTAMPING and TXTIME across UDP / ICMP and IP versions. Before ICMPv6 support: # ./tools/testing/selftests/net/cmsg_time.sh Case ICMPv6 - ts cnt returned '0', expected '2' Case ICMPv6 - ts0 SCHED returned '', expected 'OK' Case ICMPv6 - ts0 SND returned '', expected 'OK' Case ICMPv6 - TXTIME abs returned '', expected 'OK' Case ICMPv6 - TXTIME rel returned '', expected 'OK' FAIL - 5/36 cases failed After: # ./tools/testing/selftests/net/cmsg_time.sh OK Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Support requesting Tx timestamps: $ ./cmsg_sender -p i -t -4 $tgt 123 -d 1000 SCHED ts0 61us SND ts0 1071us Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Add ability to send delayed packets. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Test if setting SO_MARK with setsockopt works and if cmsg takes precedence over it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Use new capabilities of cmsg_sender to test ICMP and RAW sockets, previously only UDP was tested. Before SO_MARK support was added to ICMPv6: # ./cmsg_so_mark.sh Case ICMP rejection returned 0, expected 1 FAIL - 1/12 cases failed After: # ./cmsg_so_mark.sh OK Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Support sending fake ICMP(v6) messages and UDP via RAW sockets. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Parametrize the code so that it can support UDP and ICMP sockets in the future, and more cmsg types. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Rename the file in prep for generalization. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Minor reordering of the code and a call to sock_cmsg_send() gives us support for setting the common socket options via cmsg (the usual ones - SO_MARK, SO_TIMESTAMPING_OLD, SCM_TXTIME). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Nothing prevents the user from requesting timestamping on ping6 sockets, yet timestamps are not going to be reported. Plumb the flags through. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
We have ftrace and BPF today, there's no need for printing arguments at the start of a function. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Merge tag 'ieee802154-for-davem-2022-02-10' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next Stefan Schmidt says: ==================== pull-request: ieee802154-next 2022-02-10 An update from ieee802154 for your *net-next* tree. There is more ongoing in ieee802154 than usual. This will be the first pull request for this cycle, but I expect one more. Depending on review and rework times. Pavel Skripkin ported the atusb driver over to the new USB api to avoid unint problems as well as making use of the modern api without kmalloc() needs in he driver. Miquel Raynal landed some changes to ensure proper frame checksum checking with hwsim, documenting our use of wake and stop_queue and eliding a magic value by using the proper define. David Girault documented the address struct used in ieee802154. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueDavid S. Miller authored
Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2022-02-09 This series contains updates to ice driver only. Brett adds support for QinQ. This begins with code refactoring and re-organization of VLAN configuration functions to allow for introduction of VSI VLAN ops to enable setting and calling of respective operations based on device support of single or double VLANs. Implementations are added for outer VLAN support. To support QinQ, the device must be set to double VLAN mode (DVM). In order for this to occur, the DDP package and NVM must also support DVM. Functions to determine compatibility and properly configure the device are added as well as setting the proper bits to advertise and utilize the proper offloads. Support for VIRTCHNL_VF_OFFLOAD_VLAN_V2 is also included to allow for VF to negotiate and utilize this functionality. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski authored
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Conntrack sets on CHECKSUM_UNNECESSARY for UDP packet with no checksum, from Kevin Mitchell. 2) skb->priority support for nfqueue, from Nicolas Dichtel. 3) Remove conntrack extension register API, from Florian Westphal. 4) Move nat destroy hook to nf_nat_hook instead, to remove nf_ct_ext_destroy(), also from Florian. 5) Wrap pptp conntrack NAT hooks into single structure, from Florian Westphal. 6) Support for tcp option set to noop for nf_tables, also from Florian. 7) Do not run x_tables comment match from packet path in nf_tables, from Florian Westphal. 8) Replace spinlock by cmpxchg() loop to update missed ct event, from Florian Westphal. 9) Wrap cttimeout hooks into single structure, from Florian. 10) Add fast nft_cmp expression for up to 16-bytes. 11) Use cb->ctx to store context in ctnetlink dump, instead of using cb->args[], from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: use dump structure instead of raw args nfqueue: enable to set skb->priority netfilter: nft_cmp: optimize comparison for 16-bytes netfilter: cttimeout: use option structure netfilter: ecache: don't use nf_conn spinlock netfilter: nft_compat: suppress comment match netfilter: exthdr: add support for tcp option removal netfilter: conntrack: pptp: use single option structure netfilter: conntrack: remove extension register api netfilter: conntrack: handle ->destroy hook via nat_ops instead netfilter: conntrack: move extension sizes into core netfilter: conntrack: make all extensions 8-byte alignned netfilter: nfqueue: enable to get skb->priority netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY ==================== Link: https://lore.kernel.org/r/20220209133616.165104-1-pablo@netfilter.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
Commit 9652dc2e ("tcp: relax listening_hash operations") removed the need to disable bottom half while acquiring listening_hash.lock. There are still two callers left which disable bottom half before the lock is acquired. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves is also acquired with enabled bottom halves followed by disabling bottom halves. This is the reverse locking order. It has been observed with inet_listen_hashbucket::lock: local_bh_disable() + spin_lock(&ilb->lock): inet_listen() inet_csk_listen_start() sk->sk_prot->hash() := inet_hash() local_bh_disable() __inet_hash() spin_lock(&ilb->lock); acquire(&ilb->lock); Reverse order: spin_lock(&ilb2->lock) + local_bh_disable(): tcp_seq_next() listening_get_next() spin_lock(&ilb2->lock); acquire(&ilb2->lock); tcp4_seq_show() get_tcp4_sock() sock_i_ino() read_lock_bh(&sk->sk_callback_lock); acquire(softirq_ctrl) // <---- whoops acquire(&sk->sk_callback_lock) Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock. Split inet_unhash() and acquire the listen_hashbucket lock without disabling bottom halves; the inet_ehash lock with disabled bottom halves. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-02-09 We've added 126 non-merge commits during the last 16 day(s) which contain a total of 201 files changed, 4049 insertions(+), 2215 deletions(-). The main changes are: 1) Add custom BPF allocator for JITs that pack multiple programs into a huge page to reduce iTLB pressure, from Song Liu. 2) Add __user tagging support in vmlinux BTF and utilize it from BPF verifier when generating loads, from Yonghong Song. 3) Add per-socket fast path check guarding from cgroup/BPF overhead when used by only some sockets, from Pavel Begunkov. 4) Continued libbpf deprecation work of APIs/features and removal of their usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko and various others. 5) Improve BPF instruction set documentation by adding byte swap instructions and cleaning up load/store section, from Christoph Hellwig. 6) Switch BPF preload infra to light skeleton and remove libbpf dependency from it, from Alexei Starovoitov. 7) Fix architecture-agnostic macros in libbpf for accessing syscall arguments from BPF progs for non-x86 architectures, from Ilya Leoshkevich. 8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be of 16-bit field with anonymous zero padding, from Jakub Sitnicki. 9) Add new bpf_copy_from_user_task() helper to read memory from a different task than current. Add ability to create sleepable BPF iterator progs, from Kenny Yu. 10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski. 11) Generate temporary netns names for BPF selftests to avoid naming collisions, from Hangbin Liu. 12) Implement bpf_core_types_are_compat() with limited recursion for in-kernel usage, from Matteo Croce. 13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5 to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 14) Misc minor fixes to libbpf and selftests from various folks. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits) selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide libbpf: Fix compilation warning due to mismatched printf format selftests/bpf: Test BPF_KPROBE_SYSCALL macro libbpf: Add BPF_KPROBE_SYSCALL macro libbpf: Fix accessing the first syscall argument on s390 libbpf: Fix accessing the first syscall argument on arm64 libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390 libbpf: Fix accessing syscall arguments on riscv libbpf: Fix riscv register names libbpf: Fix accessing syscall arguments on powerpc selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro libbpf: Add PT_REGS_SYSCALL_REGS macro selftests/bpf: Fix an endianness issue in bpf_syscall_macro test bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE bpf: Fix leftover header->pages in sparc and powerpc code. libbpf: Fix signedness bug in btf_dump_array_data() selftests/bpf: Do not export subtest as standalone test bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures ... ==================== Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Menglong Dong authored
In the commit c504e5c2 ("net: skb: introduce kfree_skb_reason()") drop reason is introduced to the tracepoint of kfree_skb. Therefore, drop_monitor is able to report the drop reason to users by netlink. The drop reasons are reported as string to users, which is exactly the same as what we do when reporting it to ftrace. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220209060838.55513-1-imagedong@tencent.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 09 Feb, 2022 3 commits
-
-
Alexei Starovoitov authored
Jakub Sitnicki says: ==================== Following the recent split-up of the bpf_sock dst_port field, apply the same to technique to the bpf_sk_lookup remote_port field to make uAPI more user friendly. v1 -> v2: - Remove remote_port range check and cast to be16 in TEST_RUN for sk_lookup (kernel test robot) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Sitnicki authored
Extend the context access tests for sk_lookup prog to cover the surprising case of a 4-byte load from the remote_port field, where the expected value is actually shifted by 16 bits. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220209184333.654927-3-jakub@cloudflare.com
-
Jakub Sitnicki authored
remote_port is another case of a BPF context field documented as a 32-bit value in network byte order for which the BPF context access converter generates a load of a zero-padded 16-bit integer in network byte order. First such case was dst_port in bpf_sock which got addressed in commit 4421a582 ("bpf: Make dst_port field in struct bpf_sock 16-bit wide"). Loading 4-bytes from the remote_port offset and converting the value with bpf_ntohl() leads to surprising results, as the expected value is shifted by 16 bits. Reduce the confusion by splitting the field in two - a 16-bit field holding a big-endian integer, and a 16-bit zero-padding anonymous field that follows it. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com
-