- 13 Sep, 2023 2 commits
-
-
Phil Sutter authored
The value in idx and the number of rules handled in that particular __nf_tables_dump_rules() call is not identical. The former is a cursor to pick up from if multiple netlink messages are needed, so its value is ever increasing. Fixing this is not just a matter of subtracting s_idx from it, though: When resetting rules in multiple chains, __nf_tables_dump_rules() is called for each and cb->args[0] is not adjusted in between. Introduce a dedicated counter to record the number of rules reset in this call in a less confusing way. While being at it, prevent the direct return upon buffer exhaustion: Any rules previously dumped into that skb would evade audit logging otherwise. Fixes: 9b5ba5c9 ("netfilter: nf_tables: Unbreak audit log reset") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Florian Westphal authored
The size table is incorrect due to copypaste error, this reserves more size than needed. TSTAMP reserved 32 instead of 16 bytes. TIMEOUT reserved 16 instead of 8 bytes. Fixes: 5f31edc0 ("netfilter: conntrack: move extension sizes into core") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 11 Sep, 2023 1 commit
-
-
Pablo Neira Ayuso authored
Anonymous sets need to be populated once at creation and then they are bound to rule since 938154b9 ("netfilter: nf_tables: reject unbound anonymous set before commit phase"), otherwise transaction reports EINVAL. Userspace does not need to delete elements of anonymous sets that are not yet bound, reject this with EOPNOTSUPP. From flush command path, skip anonymous sets, they are expected to be bound already. Otherwise, EINVAL is hit at the end of this transaction for unbound sets. Fixes: 96518518 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 08 Sep, 2023 5 commits
-
-
Pablo Neira Ayuso authored
Skip GC run if iterator rewinds to the beginning with EAGAIN, otherwise GC might collect the same element more than once. Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
nft_trans_gc_queue_sync() enqueues the GC transaction and it allocates a new one. If this allocation fails, then stop this GC sync run and retry later. Fixes: 5f68718b ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
pipapo needs to enqueue GC transactions for catchall elements through nft_trans_gc_queue_sync(). Add nft_trans_gc_catchall_sync() and nft_trans_gc_catchall_async() to handle GC transaction queueing accordingly. Fixes: 5f68718b ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
rbtree GC does not modify the datastructure, instead it collects expired elements and it enqueues a GC transaction. Use a read spinlock instead to avoid data contention while GC worker is running. Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Chain binding only requires the rule addition/insertion command within the same transaction. Removal of rules from chain bindings within the same transaction makes no sense, userspace does not utilize this feature. Replace nft_chain_is_bound() check to nft_chain_binding() in rule deletion commands. Replace command implies a rule deletion, reject this command too. Rule flush command can also safely rely on this nft_chain_binding() check because unbound chains are not allowed since 62e1e94b ("netfilter: nf_tables: reject unbound chain set before commit phase"). Fixes: d0e2c7de ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Kevin Rich <kevinrich1337@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 07 Sep, 2023 13 commits
-
-
Vladimir Oltean authored
enetc_psi_create() returns an ERR_PTR() or a valid station interface pointer, but checking for the non-NULL quality of the return code blurs that difference away. So if enetc_psi_create() fails, we call enetc_psi_destroy() when we shouldn't. This will likely result in crashes, since enetc_psi_create() cleans up everything after itself when it returns an ERR_PTR(). Fixes: f0168042 ("net: enetc: reimplement RFS/RSS memory clearing as PCI quirk") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/netdev/582183ef-e03b-402b-8e2d-6d9bb3c83bd9@moroto.mountain/Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230906141609.247579-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
This reverts commit 39285e12. Looks like the change has unintended consequences in exposing objects before they are initialized. Let's drop this patch and try again in net-next. Reported-by: syzbot+44ae022028805f4600fc@syzkaller.appspotmail.com Fixes: 39285e12 ("net: team: do not use dynamic lockdep key") Link: https://lore.kernel.org/all/20230907103124.6adb7256@kernel.org/Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfPaolo Abeni authored
Florian Westphal says: ==================== netfilter updates for net This PR contains nf_tables updates for your *net* tree. This time almost all fixes are for old bugs: First patch fixes a 4-byte stack OOB write, from myself. This was broken ever since nftables was switches from 128 to 32bit register addressing in v4.1. 2nd patch fixes an out-of-bounds read. This has been broken ever since xt_osf got added in 2.6.31, the bug was then just moved around during refactoring, from Wander Lairson Costa. 3rd patch adds a missing enum description, from Phil Sutter. 4th patch fixes a UaF inftables that occurs when userspace adds elements with a timeout so small that expiration happens while the transaction is still in progress. Fix from Pablo Neira Ayuso. Patch 5 fixes a memory out of bounds access, this was broken since v4.20. Patch from Kyle Zeng and Jozsef Kadlecsik. Patch 6 fixes another bogus memory access when building audit record. Bug added in the previous pull request, fix from Pablo. netfilter pull request 2023-09-06 * tag 'nf-23-09-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: Unbreak audit log reset netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID netfilter: nfnetlink_osf: avoid OOB read netfilter: nftables: exthdr: fix 4-byte stack OOB write ==================== Link: https://lore.kernel.org/r/20230906162525.11079-1-fw@strlen.deSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paolo Abeni authored
Jijie Shao says: ==================== There are some bugfix for the HNS3 ethernet driver ==================== Link: https://lore.kernel.org/r/20230906072018.3020671-1-shaojijie@huawei.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jie Wang authored
HNS3 NIC does not support GSO partial packets segmentation. Actually tunnel packets for example NvGRE packets segment offload and checksum offload is already supported. There is no need to keep gso partial feature bit. So this patch removes it. Fixes: 76ad4f0e ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Yisen Zhuang authored
When sfp is absent or unidentified, the port type should be displayed as PORT_OTHERS, rather than PORT_FIBRE. Fixes: 88d10bd6 ("net: hns3: add support for multiple media type") Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jijie Shao authored
We hope that tc qdisc and dcb ets commands can not be used crosswise. If we want to use any of the commands to configure tc, We must use the other command to clear the existing configuration. However, when we configure a single tc with tc qdisc, we can still configure it with dcb ets. Because we use mqprio_active as the tag of tc qdisc configuration, but with dcb ets, we do not check mqprio_active. This patch fix this issue by check mqprio_active before executing the dcb ets command. and add dcb_ets_active to replace HCLGE_FLAG_DCB_ENABLE and HCLGE_FLAG_MQPRIO_ENABLE at the hclge layer, Fixes: cacde272 ("net: hns3: Add hclge_dcb module for the support of DCB feature") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Hao Chen authored
Now in hns3_dbg_uninit(), there may be concurrency between kfree buffer and read, it may result in memory error. Moving debugfs_remove_recursive() in front of kfree buffer to ensure they don't happen at the same time. Fixes: 5e69ea7e ("net: hns3: refactor the debugfs process") Signed-off-by: Hao Chen <chenhao418@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Hao Chen authored
req1->tcam_data is defined as "u8 tcam_data[8]", and we convert it as (u32 *) without considerring byte order conversion, it may result in printing wrong data for tcam_data. Convert tcam_data to (__le32 *) first to fix it. Fixes: b5a0b70d ("net: hns3: refactor dump fd tcam of debugfs") Signed-off-by: Hao Chen <chenhao418@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jijie Shao authored
support query tx timeout threshold by debugfs Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jian Shen authored
Currently, the driver knocks the ring doorbell before updating the ring->last_to_use in tx flow. if the hardware transmiting packet and napi poll scheduling are fast enough, it may get the old ring->last_to_use in drivers' napi poll. In this case, the driver will think the tx is not completed, and return directly without clear the flag __QUEUE_STATE_STACK_XOFF, which may cause tx timeout. Fixes: 20d06ca2 ("net: hns3: optimize the tx clean process") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Lukasz Majewski authored
The KSZ9477 errata points out (in 'Module 4') the link up/down problems when EEE (Energy Efficient Ethernet) is enabled in the device to which the KSZ9477 tries to auto negotiate. The suggested workaround is to clear advertisement of EEE for PHYs in this chip driver. To avoid regressions with other switch ICs the new MICREL_NO_EEE flag has been introduced. Moreover, the in-register disablement of MMD_DEVICE_ID_EEE_ADV.MMD_EEE_ADV MMD register is removed, as this code is both; now executed too late (after previous rework of the PHY and DSA for KSZ switches) and not required as setting all members of eee_broken_modes bit field prevents the KSZ9477 from advertising EEE. Fixes: 69d3b36c ("net: dsa: microchip: enable EEE support") # for KSZ9477 Signed-off-by: Lukasz Majewski <lukma@denx.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # Confirmed disabled EEE with oscilloscope. Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20230905093315.784052-1-lukma@denx.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf 2023-09-06 We've added 9 non-merge commits during the last 6 day(s) which contain a total of 12 files changed, 189 insertions(+), 44 deletions(-). The main changes are: 1) Fix bpf_sk_storage to address an invalid wait context lockdep report and another one to address missing omem uncharge, from Martin KaFai Lau. 2) Two BPF recursion detection related fixes, from Sebastian Andrzej Siewior. 3) Fix tailcall limit enforcement in trampolines for s390 JIT, from Ilya Leoshkevich. 4) Fix a sockmap refcount race where skbs in sk_psock_backlog can be referenced after user space side has already skb_consumed them, from John Fastabend. 5) Fix BPF CI flake/race wrt sockmap vsock write test where the transport endpoint is not connected, from Xu Kuohai. 6) Follow-up doc fix to address a cross-link warning, from Eduard Zingerman. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc bpf: bpf_sk_storage: Fix invalid wait context lockdep report s390/bpf: Pass through tail call counter in trampolines bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). bpf, sockmap: Fix skb refcnt race after locking changes docs/bpf: Fix "file doesn't exist" warnings in {llvm_reloc,btf}.rst selftests/bpf: Fix a CI failure caused by vsock write ==================== Link: https://lore.kernel.org/r/20230906095117.16941-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 06 Sep, 2023 19 commits
-
-
Pablo Neira Ayuso authored
Deliver audit log from __nf_tables_dump_rules(), table dereference at the end of the table list loop might point to the list head, leading to this crash. [ 4137.407349] BUG: unable to handle page fault for address: 00000000001f3c50 [ 4137.407357] #PF: supervisor read access in kernel mode [ 4137.407359] #PF: error_code(0x0000) - not-present page [ 4137.407360] PGD 0 P4D 0 [ 4137.407363] Oops: 0000 [#1] PREEMPT SMP PTI [ 4137.407365] CPU: 4 PID: 500177 Comm: nft Not tainted 6.5.0+ #277 [ 4137.407369] RIP: 0010:string+0x49/0xd0 [ 4137.407374] Code: ff 77 36 45 89 d1 31 f6 49 01 f9 66 45 85 d2 75 19 eb 1e 49 39 f8 76 02 88 07 48 83 c7 01 83 c6 01 48 83 c2 01 4c 39 cf 74 07 <0f> b6 02 84 c0 75 e2 4c 89 c2 e9 58 e5 ff ff 48 c7 c0 0e b2 ff 81 [ 4137.407377] RSP: 0018:ffff8881179737f0 EFLAGS: 00010286 [ 4137.407379] RAX: 00000000001f2c50 RBX: ffff888117973848 RCX: ffff0a00ffffff04 [ 4137.407380] RDX: 00000000001f3c50 RSI: 0000000000000000 RDI: 0000000000000000 [ 4137.407381] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff [ 4137.407383] R10: ffffffffffffffff R11: ffff88813584d200 R12: 0000000000000000 [ 4137.407384] R13: ffffffffa15cf709 R14: 0000000000000000 R15: ffffffffa15cf709 [ 4137.407385] FS: 00007fcfc18bb580(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000 [ 4137.407387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4137.407388] CR2: 00000000001f3c50 CR3: 00000001055b2001 CR4: 00000000001706e0 [ 4137.407390] Call Trace: [ 4137.407392] <TASK> [ 4137.407393] ? __die+0x1b/0x60 [ 4137.407397] ? page_fault_oops+0x6b/0xa0 [ 4137.407399] ? exc_page_fault+0x60/0x120 [ 4137.407403] ? asm_exc_page_fault+0x22/0x30 [ 4137.407408] ? string+0x49/0xd0 [ 4137.407410] vsnprintf+0x257/0x4f0 [ 4137.407414] kvasprintf+0x3e/0xb0 [ 4137.407417] kasprintf+0x3e/0x50 [ 4137.407419] nf_tables_dump_rules+0x1c0/0x360 [nf_tables] [ 4137.407439] ? __alloc_skb+0xc3/0x170 [ 4137.407442] netlink_dump+0x170/0x330 [ 4137.407447] __netlink_dump_start+0x227/0x300 [ 4137.407449] nf_tables_getrule+0x205/0x390 [nf_tables] Deliver audit log only once at the end of the rule dump+reset for consistency with the set dump+reset. Ensure audit reset access to table under rcu read side lock. The table list iteration holds rcu read lock side, but recent audit code dereferences table object out of the rcu read lock side. Fixes: ea078ae9 ("netfilter: nf_tables: Audit log rule reset") Fixes: 7e9be112 ("netfilter: nf_tables: Audit log setelem reset") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Kyle Zeng authored
The missing IP_SET_HASH_WITH_NET0 macro in ip_set_hash_netportnet can lead to the use of wrong `CIDR_POS(c)` for calculating array offsets, which can lead to integer underflow. As a result, it leads to slab out-of-bound access. This patch adds back the IP_SET_HASH_WITH_NET0 macro to ip_set_hash_netportnet to address the issue. Fixes: 886503f3 ("netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net") Suggested-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Pablo Neira Ayuso authored
New elements in this transaction might expired before such transaction ends. Skip sync GC for such elements otherwise commit path might walk over an already released object. Once transaction is finished, async GC will collect such expired element. Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Phil Sutter authored
Add a brief description to the enum's comment. Fixes: 837830a4 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Wander Lairson Costa authored
The opt_num field is controlled by user mode and is not currently validated inside the kernel. An attacker can take advantage of this to trigger an OOB read and potentially leak information. BUG: KASAN: slab-out-of-bounds in nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 Read of size 2 at addr ffff88804bc64272 by task poc/6431 CPU: 1 PID: 6431 Comm: poc Not tainted 6.0.0-rc4 #1 Call Trace: nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 nf_osf_find+0x186/0x2f0 net/netfilter/nfnetlink_osf.c:281 nft_osf_eval+0x37f/0x590 net/netfilter/nft_osf.c:47 expr_call_ops_eval net/netfilter/nf_tables_core.c:214 nft_do_chain+0x2b0/0x1490 net/netfilter/nf_tables_core.c:264 nft_do_chain_ipv4+0x17c/0x1f0 net/netfilter/nft_chain_filter.c:23 [..] Also add validation to genre, subtype and version fields. Fixes: 11eeef41 ("netfilter: passive OS fingerprint xtables match") Reported-by: Lucas Leong <wmliang@infosec.exchange> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Florian Westphal authored
If priv->len is a multiple of 4, then dst[len / 4] can write past the destination array which leads to stack corruption. This construct is necessary to clean the remainder of the register in case ->len is NOT a multiple of the register size, so make it conditional just like nft_payload.c does. The bug was added in 4.1 cycle and then copied/inherited when tcp/sctp and ip option support was added. Bug reported by Zero Day Initiative project (ZDI-CAN-21950, ZDI-CAN-21951, ZDI-CAN-21961). Fixes: 49499c3e ("netfilter: nf_tables: switch registers to 32 bit addressing") Fixes: 935b7f64 ("netfilter: nft_exthdr: add TCP option matching") Fixes: 133dc203 ("netfilter: nft_exthdr: Support SCTP chunks") Fixes: dbb5281a ("netfilter: nf_tables: add support for matching IPv4 options") Signed-off-by: Florian Westphal <fw@strlen.de>
-
Martin KaFai Lau authored
This patch checks the sk_omem_alloc has been uncharged by bpf_sk_storage during the __sk_destruct. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-4-martin.lau@linux.dev
-
Martin KaFai Lau authored
The commit c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into bpf_local_storage_unlink_nolock() which then later renamed to bpf_local_storage_destroy(). The commit accidentally passed the "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock() which then stopped the uncharge from happening to the sk->sk_omem_alloc. This missing uncharge only happens when the sk is going away (during __sk_destruct). This patch fixes it by always passing "uncharge_mem = true". It is a noop to the task/inode/cgroup storage because they do not have the map_local_storage_(un)charge enabled in the map_ops. A followup patch will be done in bpf-next to remove the uncharge_mem argument. A selftest is added in the next patch. Fixes: c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
-
Martin KaFai Lau authored
'./test_progs -t test_local_storage' reported a splat: [ 27.137569] ============================= [ 27.138122] [ BUG: Invalid wait context ] [ 27.138650] 6.5.0-03980-gd11ae1b1 #247 Tainted: G O [ 27.139542] ----------------------------- [ 27.140106] test_progs/1729 is trying to lock: [ 27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130 [ 27.141834] other info that might help us debug this: [ 27.142437] context-{5:5} [ 27.142856] 2 locks held by test_progs/1729: [ 27.143352] #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40 [ 27.144492] #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0 [ 27.145855] stack backtrace: [ 27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G O 6.5.0-03980-gd11ae1b1 #247 [ 27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 27.149127] Call Trace: [ 27.149490] <TASK> [ 27.149867] dump_stack_lvl+0x130/0x1d0 [ 27.152609] dump_stack+0x14/0x20 [ 27.153131] __lock_acquire+0x1657/0x2220 [ 27.153677] lock_acquire+0x1b8/0x510 [ 27.157908] local_lock_acquire+0x29/0x130 [ 27.159048] obj_cgroup_charge+0xf4/0x3c0 [ 27.160794] slab_pre_alloc_hook+0x28e/0x2b0 [ 27.161931] __kmem_cache_alloc_node+0x51/0x210 [ 27.163557] __kmalloc+0xaa/0x210 [ 27.164593] bpf_map_kzalloc+0xbc/0x170 [ 27.165147] bpf_selem_alloc+0x130/0x510 [ 27.166295] bpf_local_storage_update+0x5aa/0x8e0 [ 27.167042] bpf_fd_sk_storage_update_elem+0xdb/0x1a0 [ 27.169199] bpf_map_update_value+0x415/0x4f0 [ 27.169871] map_update_elem+0x413/0x550 [ 27.170330] __sys_bpf+0x5e9/0x640 [ 27.174065] __x64_sys_bpf+0x80/0x90 [ 27.174568] do_syscall_64+0x48/0xa0 [ 27.175201] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 27.175932] RIP: 0033:0x7effb40e41ad [ 27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8 [ 27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad [ 27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002 [ 27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788 [ 27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000 [ 27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000 [ 27.184958] </TASK> It complains about acquiring a local_lock while holding a raw_spin_lock. It means it should not allocate memory while holding a raw_spin_lock since it is not safe for RT. raw_spin_lock is needed because bpf_local_storage supports tracing context. In particular for task local storage, it is easy to get a "current" task PTR_TO_BTF_ID in tracing bpf prog. However, task (and cgroup) local storage has already been moved to bpf mem allocator which can be used after raw_spin_lock. The splat is for the sk storage. For sk (and inode) storage, it has not been moved to bpf mem allocator. Using raw_spin_lock or not, kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context. However, the local storage helper requires a verifier accepted sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running a bpf prog in a kzalloc unsafe context and also able to hold a verifier accepted sk pointer) could happen. This patch avoids kzalloc after raw_spin_lock to silent the splat. There is an existing kzalloc before the raw_spin_lock. At that point, a kzalloc is very likely required because a lookup has just been done before. Thus, this patch always does the kzalloc before acquiring the raw_spin_lock and remove the later kzalloc usage after the raw_spin_lock. After this change, it will have a charge and then uncharge during the syscall bpf_map_update_elem() code path. This patch opts for simplicity and not continue the old optimization to save one charge and uncharge. This issue is dated back to the very first commit of bpf_sk_storage which had been refactored multiple times to create task, inode, and cgroup storage. This patch uses a Fixes tag with a more recent commit that should be easier to do backport. Fixes: b00fa38a ("bpf: Enable non-atomic allocations in local storage") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
-
Ilya Leoshkevich authored
s390x eBPF programs use the following extension to the s390x calling convention: tail call counter is passed on stack at offset STK_OFF_TCCNT, which callees otherwise use as scratch space. Currently trampoline does not respect this and clobbers tail call counter. This breaks enforcing tail call limits in eBPF programs, which have trampolines attached to them. Fix by forwarding a copy of the tail call counter to the original eBPF program in the trampoline (for fexit), and by restoring it at the end of the trampoline (for fentry). Fixes: 528eb2cb ("s390/bpf: Implement arch_prepare_bpf_trampoline()") Reported-by: Leon Hwang <hffilwlqm@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com
-
Sebastian Andrzej Siewior authored
__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before performing the recursion check which means in case of a recursion __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx value. __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx after the recursion check which means in case of a recursion __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx isn't initialized upfront. Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made. Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens a few cycles later. Fixes: e384c7b7 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
-
Sebastian Andrzej Siewior authored
If __bpf_prog_enter_sleepable_recur() detects recursion then it returns 0 without undoing rcu_read_lock_trace(), migrate_disable() or decrementing the recursion counter. This is fine in the JIT case because the JIT code will jump in the 0 case to the end and invoke the matching exit trampoline (__bpf_prog_exit_sleepable_recur()). This is not the case in kern_sys_bpf() which returns directly to the caller with an error code. Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case. Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
-
Jakub Kicinski authored
sphinx complains about the use of "%PHYLINK_PCS_NEG_*": Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string. Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string. These are not valid symbols so drop the '%' prefix. Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *) or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already a handful of un-adorned DEFINE_* in this file. Fixes: f99d471a ("net: phylink: add PCS negotiation mode") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vladimir Oltean says: ==================== tc-cbs offload fixes for SJA1105 DSA Yanan Yang has pointed out to me that certain tc-cbs offloaded configurations do not appear to do any shaping on the LS1021A-TSN board (SJA1105T). This is due to an apparent documentation error that also made its way into the driver, which patch 1/3 now fixes. While investigating and then testing, I've found 2 more bugs, which are patches 2/3 and 3/3. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
The blamed commit left this delta behind: struct sja1105_cbs_entry { - u64 port; - u64 prio; + u64 port; /* Not used for SJA1110 */ + u64 prio; /* Not used for SJA1110 */ u64 credit_hi; u64 credit_lo; u64 send_slope; u64 idle_slope; }; but did not actually implement tc-cbs offload fully for the new switch. The offload is accepted, but it doesn't work. The difference compared to earlier switch generations is that now, the table of CBS shapers is sparse, because there are many more shapers, so the mapping between a {port, prio} and a table index is static, rather than requiring us to store the port and prio into the sja1105_cbs_entry. So, the problem is that the code programs the CBS shaper parameters at a dynamic table index which is incorrect. All that needs to be done for SJA1110 CBS shapers to work is to bypass the logic which allocates shapers in a dense manner, as for SJA1105, and use the fixed mapping instead. Fixes: 3e77e59b ("net: dsa: sja1105: add support for the SJA1110 switch family") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
After running command [2] too many times in a row: [1] $ tc qdisc add dev sw2p0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0 [2] $ tc qdisc replace dev sw2p0 parent 1:1 cbs offload 1 \ idleslope 120000 sendslope -880000 locredit -1320 hicredit 180 (aka more than priv->info->num_cbs_shapers times) we start seeing the following error message: Error: Specified device failed to setup cbs hardware offload. This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents the same API for the qdisc create and replace cases, and the sja1105 driver fails to distinguish between the 2. Thus, it always thinks that it must allocate the same shaper for a {port, queue} pair, when it may instead have to replace an existing one. Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
More careful measurement of the tc-cbs bandwidth shows that the stream bandwidth (effectively idleslope) increases, there is a larger and larger discrepancy between the rate limit obtained by the software Qdisc, and the rate limit obtained by its offloaded counterpart. The discrepancy becomes so large, that e.g. at an idleslope of 40000 (40Mbps), the offloaded cbs does not actually rate limit anything, and traffic will pass at line rate through a 100 Mbps port. The reason for the discrepancy is that the hardware documentation I've been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which the credit counter is increased". Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf (for SJA1110), the wording is different: "This field specifies the value, in bytes per second times link speed, by which the credit counter is increased". So there's an extra scaling for link speed that the driver is currently not accounting for, and apparently (empirically), that link speed is expressed in Kbps. I've pondered whether to pollute the sja1105_mac_link_up() implementation with CBS shaper reprogramming, but I don't think it is worth it. IMO, the UAPI exposed by tc-cbs requires user space to recalculate the sendslope anyway, since the formula for that depends on port_transmit_rate (see man tc-cbs), which is not an invariant from tc's perspective. So we use the offload->sendslope and offload->idleslope to deduce the original port_transmit_rate from the CBS formula, and use that value to scale the offload->sendslope and offload->idleslope to values that the hardware understands. Some numerical data points: 40Mbps stream, max interfering frame size 1500, port speed 100M --------------------------------------------------------------- tc-cbs parameters: idleslope 40000 sendslope -60000 locredit -900 hicredit 600 which result in hardware values: Before (doesn't work) After (works) credit_hi 600 600 credit_lo 900 900 send_slope 7500000 75 idle_slope 5000000 50 40Mbps stream, max interfering frame size 1500, port speed 1G ------------------------------------------------------------- tc-cbs parameters: idleslope 40000 sendslope -960000 locredit -1440 hicredit 60 which result in hardware values: Before (doesn't work) After (works) credit_hi 60 60 credit_lo 1440 1440 send_slope 120000000 120 idle_slope 5000000 5 5.12Mbps stream, max interfering frame size 1522, port speed 100M ----------------------------------------------------------------- tc-cbs parameters: idleslope 5120 sendslope -94880 locredit -1444 hicredit 77 which result in hardware values: Before (doesn't work) After (works) credit_hi 77 77 credit_lo 1444 1444 send_slope 11860000 118 idle_slope 640000 6 Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps. Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Reported-by: Yanan Yang <yanan.yang@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queueDavid S. Miller authored
Tony Nguyen says: ==================== Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80 Olga Zaborska says: Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx value between 64 and 80. All igb, igbvf and igc devices can use as low as 64 descriptors. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Bodong Wang authored
ACL flow table is required in switchdev mode when metadata is enabled, driver creates such table when loading each vport. However, not every vport is loaded in switchdev mode. Such as ECPF if it's the eswitch manager. In this case, ACL flow table is still needed. To make it modularized, create ACL flow table for eswitch manager as default and skip such operations when loading manager vport. Also, there is no need to load the eswitch manager vport in switchdev mode. This means there is no need to load it on regular connect-x HCAs where the PF is the eswitch manager. This will avoid creating duplicate ACL flow table for host PF vport. Fixes: 29bcb6e4 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules") Fixes: eb8e9fae ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager") Fixes: 5019833d ("net/mlx5: E-switch, Introduce helper function to enable/disable vports") Signed-off-by: Bodong Wang <bodong@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-