- 11 Aug, 2022 13 commits
-
-
Michal Jaron authored
During stress test with attaching and detaching VF from KVM and simultaneously changing VFs spoofcheck and trust there was a call trace in ice_reset_vf that VF's VSI is null. [145237.352797] WARNING: CPU: 46 PID: 840629 at drivers/net/ethernet/intel/ice/ice_vf_lib.c:508 ice_reset_vf+0x3d6/0x410 [ice] [145237.352851] Modules linked in: ice(E) vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio iavf dm_mod xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTC O_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl ipmi_si intel_cstate ipmi_devintf joydev intel_uncore m ei_me ipmi_msghandler i2c_i801 pcspkr mei lpc_ich ioatdma i2c_smbus acpi_pad acpi_power_meter ip_tables xfs libcrc32c i2c_algo_bit drm_sh mem_helper drm_kms_helper sd_mod t10_pi crc64_rocksoft syscopyarea crc64 sysfillrect sg sysimgblt fb_sys_fops drm i40e ixgbe ahci libahci libata crc32c_intel mdio dca wmi fuse [last unloaded: ice] [145237.352917] CPU: 46 PID: 840629 Comm: kworker/46:2 Tainted: G S W I E 5.19.0-rc6+ #24 [145237.352921] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [145237.352923] Workqueue: ice ice_service_task [ice] [145237.352948] RIP: 0010:ice_reset_vf+0x3d6/0x410 [ice] [145237.352984] Code: 30 ec f3 cc e9 28 fd ff ff 0f b7 4b 50 48 c7 c2 48 19 9c c0 4c 89 ee 48 c7 c7 30 fe 9e c0 e8 d1 21 9d cc 31 c0 e9 a 9 fe ff ff <0f> 0b b8 ea ff ff ff e9 c1 fc ff ff 0f 0b b8 fb ff ff ff e9 91 fe [145237.352987] RSP: 0018:ffffb453e257fdb8 EFLAGS: 00010246 [145237.352990] RAX: ffff8bd0040181c0 RBX: ffff8be68db8f800 RCX: 0000000000000000 [145237.352991] RDX: 000000000000ffff RSI: 0000000000000000 RDI: ffff8be68db8f800 [145237.352993] RBP: ffff8bd0040181c0 R08: 0000000000001000 R09: ffff8bcfd520e000 [145237.352995] R10: 0000000000000000 R11: 00008417b5ab0bc0 R12: 0000000000000005 [145237.352996] R13: ffff8bcee061c0d0 R14: ffff8bd004019640 R15: 0000000000000000 [145237.352998] FS: 0000000000000000(0000) GS:ffff8be5dfb00000(0000) knlGS:0000000000000000 [145237.353000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [145237.353002] CR2: 00007fd81f651d68 CR3: 0000001a0fe10001 CR4: 00000000001726e0 [145237.353003] Call Trace: [145237.353008] <TASK> [145237.353011] ice_process_vflr_event+0x8d/0xb0 [ice] [145237.353049] ice_service_task+0x79f/0xef0 [ice] [145237.353074] process_one_work+0x1c8/0x390 [145237.353081] ? process_one_work+0x390/0x390 [145237.353084] worker_thread+0x30/0x360 [145237.353087] ? process_one_work+0x390/0x390 [145237.353090] kthread+0xe8/0x110 [145237.353094] ? kthread_complete_and_exit+0x20/0x20 [145237.353097] ret_from_fork+0x22/0x30 [145237.353103] </TASK> Remove WARN_ON() from check if VSI is null in ice_reset_vf. Add "VF is already removed\n" in dev_dbg(). This WARN_ON() is unnecessary and causes call trace, despite that call trace, driver still works. There is no need for this warn because this piece of code is responsible for disabling VF's Tx/Rx queues when VF is disabled, but when VF is already removed there is no need to do reset or disable queues. Fixes: efe41860 ("ice: Fix memory corruption in VF driver") Signed-off-by: Michal Jaron <michalx.jaron@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Benjamin Mikailenko authored
In commit b03d519d ("ice: store VF pointer instead of VF ID") WARN_ON checks were added to validate the vsi->vf pointer and catch programming errors. However, one check to vsi->vf was missed. This caused a call trace when resetting VFs. Fix ice_vsi_rebuild by encompassing VF pointer in WARN_ON check. Fixes: b03d519d ("ice: store VF pointer instead of VF ID") Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski authored
Pull in bpf to silence a false positive warning. * git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Shut up kern_sys_bpf warning. Link: https://lore.kernel.org/r/CAADnVQK589CZN1Q9w8huJqkEyEed+ZMTWqcpA1Rm2CjN3a4XoQ@mail.gmail.com/Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alexei Starovoitov authored
Shut up this warning: kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes] int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size) Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
Currently, tls_device_down synchronizes with tls_device_resync_rx using RCU, however, the pointer to netdev is stored using WRITE_ONCE and loaded using READ_ONCE. Although such approach is technically correct (rcu_dereference is essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store NULL), using special RCU helpers for pointers is more valid, as it includes additional checks and might change the implementation transparently to the callers. Mark the netdev pointer as __rcu and use the correct RCU helpers to access it. For non-concurrent access pass the right conditions that guarantee safe access (locks taken, refcount value). Also use the correct helper in mlx5e, where even READ_ONCE was missing. The transition to RCU exposes existing issues, fixed by this commit: 1. bond_tls_device_xmit could read netdev twice, and it could become NULL the second time, after the NULL check passed. 2. Drivers shouldn't stop processing the last packet if tls_device_down just set netdev to NULL, before tls_dev_del was called. This prevents a possible packet drop when transitioning to the fallback software mode. Fixes: 89df6a81 ("net/bonding: Implement TLS TX device offload") Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Another device offload bug, we use the length of the output skb as an indication of how much data to copy. But that skb is sized to offset + record length, and we start from offset. So we end up double-counting the offset which leads to skb_copy_bits() returning -EFAULT. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser") Tested-by: Ran Rozenstein <ranro@nvidia.com> Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
We can't do skb_walk_frags() on the input skbs, because the input skbs is really just a pointer to the tcp read queue. We need to bound the "is decrypted" check by the amount of data in the message. Note that the walk in tls_device_reencrypt() is after a CoW so the skb there is safe to walk. Actually in the current implementation it can't have frags at all, but whatever, maybe one day it will. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser") Tested-by: Ran Rozenstein <ranro@nvidia.com> Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Thadeu Lima de Souza Cascardo authored
When a route filter is replaced and the old filter has a 0 handle, the old one won't be removed from the hashtable, while it will still be freed. The test was there since before commit 1109c005 ("net: sched: RCU cls_route"), when a new filter was not allocated when there was an old one. The old filter was reused and the reinserting would only be necessary if an old filter was replaced. That was still wrong for the same case where the old handle was 0. Remove the old filter from the list independently from its handle value. This fixes CVE-2022-2588, also reported as ZDI-CAN-17440. Reported-by: Zhenpeng Lin <zplin@u.northwestern.edu> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Kamal Mostafa <kamal@canonical.com> Cc: <stable@vger.kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220809170518.164662-1-cascardo@canonical.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
The custom multipath hash tests use mausezahn in order to test how changes in various packet fields affect the packet distribution across the available nexthops. The tool uses the libnet library for various low-level packet construction and injection. The library started using the "SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and for IPv4 sockets in version 1.2. When the option is not set, packets are not routed according to the table associated with the VRF master device and tests fail. Fix this by prefixing the command with "ip vrf exec", which will cause the route lookup to occur in the VRF routing table. This makes the tests pass regardless of the libnet library version. Fixes: 511e8db5 ("selftests: forwarding: Add test for custom multipath hash") Fixes: 185b0c19 ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE") Fixes: b7715acb ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE") Reported-by: Ivan Vecera <ivecera@redhat.com> Tested-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Link: https://lore.kernel.org/r/20220809113320.751413-1-idosch@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski authored
Daniel Borkmann says: ==================== bpf 2022-08-10 We've added 23 non-merge commits during the last 7 day(s) which contain a total of 19 files changed, 424 insertions(+), 35 deletions(-). The main changes are: 1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao. 2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia. 3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu. 4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev. 5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney. 6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi. 7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa. 8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai. 9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address offset buffer, from Aijun Sun. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits) selftests/bpf: Ensure sleepable program is rejected by hash map iter selftests/bpf: Add write tests for sk local storage map iterator selftests/bpf: Add tests for reading a dangling map iter fd bpf: Only allow sleepable program for resched-able iterator bpf: Check the validity of max_rdwr_access for sock local storage map iterator bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator bpf: Acquire map uref in .init_seq_private for sock local storage map iterator bpf: Acquire map uref in .init_seq_private for hash map iterator bpf: Acquire map uref in .init_seq_private for array map iterator bpf: Disallow bpf programs call prog_run command. bpf, arm64: Fix bpf trampoline instruction endianness selftests/bpf: Add test for prealloc_lru_pop bug bpf: Don't reinit map value in prealloc_lru_pop bpf: Allow calling bpf_prog_test kfuncs in tracing programs bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf bpf: Use proper target btf when exporting attach_btf_obj_id mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled bpf: Cleanup ftrace hash in bpf_trampoline_put BPF: Fix potential bad pointer dereference in bpf_sys_bpf() ... ==================== Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Hawkins Jiawei says: ==================== net: enhancements to sk_user_data field This patchset fixes refcount bug by adding SK_USER_DATA_PSOCK flag bit in sk_user_data field. The bug cause following info: WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 Modules linked in: CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0 <TASK> __refcount_add_not_zero include/linux/refcount.h:163 [inline] __refcount_inc_not_zero include/linux/refcount.h:227 [inline] refcount_inc_not_zero include/linux/refcount.h:245 [inline] sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682 sk_backlog_rcv include/net/sock.h:1061 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2849 release_sock+0x54/0x1b0 net/core/sock.c:3404 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909 __sys_shutdown_sock net/socket.c:2331 [inline] __sys_shutdown_sock net/socket.c:2325 [inline] __sys_shutdown+0xf1/0x1b0 net/socket.c:2343 __do_sys_shutdown net/socket.c:2351 [inline] __se_sys_shutdown net/socket.c:2349 [inline] __x64_sys_shutdown+0x50/0x70 net/socket.c:2349 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> To improve code maintainability, this patchset refactors sk_user_data flags code to be more generic. ==================== Link: https://lore.kernel.org/r/cover.1659676823.git.yin31149@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hawkins Jiawei authored
Refactor sk_user_data dereference using more generic function __rcu_dereference_sk_user_data_with_flags(), which improve its maintainability Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hawkins Jiawei authored
Syzkaller reports refcount bug as follows: ------------[ cut here ]------------ refcount_t: saturated; leaking memory. WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 Modules linked in: CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0 <TASK> __refcount_add_not_zero include/linux/refcount.h:163 [inline] __refcount_inc_not_zero include/linux/refcount.h:227 [inline] refcount_inc_not_zero include/linux/refcount.h:245 [inline] sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682 sk_backlog_rcv include/net/sock.h:1061 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2849 release_sock+0x54/0x1b0 net/core/sock.c:3404 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909 __sys_shutdown_sock net/socket.c:2331 [inline] __sys_shutdown_sock net/socket.c:2325 [inline] __sys_shutdown+0xf1/0x1b0 net/socket.c:2343 __do_sys_shutdown net/socket.c:2351 [inline] __se_sys_shutdown net/socket.c:2349 [inline] __x64_sys_shutdown+0x50/0x70 net/socket.c:2349 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> During SMC fallback process in connect syscall, kernel will replaces TCP with SMC. In order to forward wakeup smc socket waitqueue after fallback, kernel will sets clcsk->sk_user_data to origin smc socket in smc_fback_replace_callbacks(). Later, in shutdown syscall, kernel will calls sk_psock_get(), which treats the clcsk->sk_user_data as psock type, triggering the refcnt warning. So, the root cause is that smc and psock, both will use sk_user_data field. So they will mismatch this field easily. This patch solves it by using another bit(defined as SK_USER_DATA_PSOCK) in PTRMASK, to mark whether sk_user_data points to a psock object or not. This patch depends on a PTRMASK introduced in commit f1ff5ce2 ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). For there will possibly be more flags in the sk_user_data field, this patch also refactor sk_user_data flags code to be more generic to improve its maintainability. Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 10 Aug, 2022 27 commits
-
-
Alexei Starovoitov authored
Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset constitues three fixes for bpf map iterator: (1) patch 1~4: fix user-after-free during reading map iterator fd It is possible when both the corresponding link fd and map fd are closed bfore reading the iterator fd. I had squashed these four patches into one, but it was not friendly for stable backport, so I break these fixes into four single patches in the end. Patch 7 is its testing patch. (2) patch 5: fix invalidity check for values in sk local storage map Patch 8 adds two tests for it. (3) patch 6: reject sleepable program for non-resched map iterator Patch 9 add a test for it. Please check the individual patches for more details. And comments are always welcome. Regards, Tao Changes since v2: * patch 1~6: update commit messages (from Yonghong & Martin) * patch 7: add more detailed comments (from Yonghong) * patch 8: use NULL directly instead of (void *)0 v1: https://lore.kernel.org/bpf/20220806074019.2756957-1-houtao@huaweicloud.com ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Add a test to ensure sleepable program is rejected by hash map iterator. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-10-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Add test to validate the overwrite of sock local storage map value in map iterator and another one to ensure out-of-bound value writing is rejected. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-9-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
After closing both related link fd and map fd, reading the map iterator fd to ensure it is OK to do so. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-8-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
When a sleepable program is attached to a hash map iterator, might_fault() will report "BUG: sleeping function called from invalid context..." if CONFIG_DEBUG_ATOMIC_SLEEP is enabled. The reason is that rcu_read_lock() is held in bpf_hash_map_seq_next() and won't be released until all elements are traversed or bpf_hash_map_seq_stop() is called. Fixing it by reusing BPF_ITER_RESCHED to indicate that only non-sleepable program is allowed for iterator without BPF_ITER_RESCHED. We can revise bpf_iter_link_attach() later if there are other conditions which may cause rcu_read_lock() or spin_lock() issues. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-7-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
The value of sock local storage map is writable in map iterator, so check max_rdwr_access instead of max_rdonly_access. Fixes: 5ce6e77c ("bpf: Implement bpf iterator for sock local storage map") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-6-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
sock_map_iter_attach_target() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in sock_map_iter_detach_target() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). Fixing it by acquiring an extra map uref in .init_seq_private and releasing it in .fini_seq_private. Fixes: 03653515 ("net: Allow iterating sockmap and sockhash") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-5-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
bpf_iter_attach_map() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in bpf_iter_detach_map() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). So acquiring an extra map uref in bpf_iter_init_sk_storage_map() and releasing it in bpf_iter_fini_sk_storage_map(). Fixes: 5ce6e77c ("bpf: Implement bpf iterator for sock local storage map") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-4-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
bpf_iter_attach_map() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in bpf_iter_detach_map() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). So acquiring an extra map uref in bpf_iter_init_hash_map() and releasing it in bpf_iter_fini_hash_map(). Fixes: d6c4503c ("bpf: Implement bpf iterator for hash maps") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-3-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
bpf_iter_attach_map() acquires a map uref, and the uref may be released before or in the middle of iterating map elements. For example, the uref could be released in bpf_iter_detach_map() as part of bpf_link_release(), or could be released in bpf_map_put_with_uref() as part of bpf_map_release(). Alternative fix is acquiring an extra bpf_link reference just like a pinned map iterator does, but it introduces unnecessary dependency on bpf_link instead of bpf_map. So choose another fix: acquiring an extra map uref in .init_seq_private for array map iterator. Fixes: d3cc2ab5 ("bpf: Implement bpf iterator for array maps") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220810080538.1845898-2-houtao@huaweicloud.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN command from within the program. To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf() kernel function that can only be used by the kernel light skeleton directly. Reported-by: YiFei Zhu <zhuyifei@google.com> Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Xu Kuohai authored
The sparse tool complains as follows: arch/arm64/net/bpf_jit_comp.c:1684:16: warning: incorrect type in assignment (different base types) arch/arm64/net/bpf_jit_comp.c:1684:16: expected unsigned int [usertype] *branch arch/arm64/net/bpf_jit_comp.c:1684:16: got restricted __le32 [usertype] * arch/arm64/net/bpf_jit_comp.c:1700:52: error: subtraction of different types can't work (different base types) arch/arm64/net/bpf_jit_comp.c:1734:29: warning: incorrect type in assignment (different base types) arch/arm64/net/bpf_jit_comp.c:1734:29: expected unsigned int [usertype] * arch/arm64/net/bpf_jit_comp.c:1734:29: got restricted __le32 [usertype] * arch/arm64/net/bpf_jit_comp.c:1918:52: error: subtraction of different types can't work (different base types) This is because the variable branch in function invoke_bpf_prog and the variable branches in function prepare_trampoline are defined as type u32 *, which conflicts with ctx->image's type __le32 *, so sparse complains when assignment or arithmetic operation are performed on these two variables and ctx->image. Since arm64 instructions are always little-endian, change the type of these two variables to __le32 * and call cpu_to_le32() to convert instruction to little-endian before writing it to memory. This is also in line with emit() which internally does cpu_to_le32(), too. Fixes: efc9909f ("bpf, arm64: Add bpf trampoline for arm64") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/bpf/20220808040735.1232002-1-xukuohai@huawei.com
-
Jakub Kicinski authored
Commit 50a896cf ("genetlink: properly support per-op policy dumping") seems to have copy'n'pasted things a little incorrectly. The #define CTRL_ATTR_MCAST_GRP_MAX should have stayed right after the previous enum. The new CTRL_ATTR_POLICY_* needs its own define for MAX and that max should not contain the superfluous _DUMP in the name. We probably can't do anything about the CTRL_ATTR_POLICY_DUMP_MAX any more, there's likely code which uses it. For consistency (*cough* codegen *cough*) let's add the correctly name define nonetheless. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
After a failed devlink reload, devlink parameters are still registered, which means user space can set and get their values. In the case of the mlxsw "acl_region_rehash_interval" parameter, these operations will trigger a use-after-free [1]. Fix this by rejecting set and get operations while in the failed state. Return the "-EOPNOTSUPP" error code which does not abort the parameters dump, but instead causes it to skip over the problematic parameter. Another possible fix is to perform these checks in the mlxsw parameter callbacks, but other drivers might be affected by the same problem and I am not aware of scenarios where these stricter checks will cause a regression. [1] mlxsw_spectrum3 0000:00:10.0: Port 125: Failed to register netdev mlxsw_spectrum3 0000:00:10.0: Failed to create ports ================================================================== BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904 Read of size 4 at addr ffff8880099dcfd8 by task kworker/u4:4/777 CPU: 1 PID: 777 Comm: kworker/u4:4 Not tainted 5.19.0-rc7-custom-126601-gfe26f28c586d #1 Hardware name: QEMU MSN4700, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x92/0xbd lib/dump_stack.c:106 print_address_description mm/kasan/report.c:313 [inline] print_report.cold+0x5e/0x5cf mm/kasan/report.c:429 kasan_report+0xb9/0xf0 mm/kasan/report.c:491 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report_generic.c:306 mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904 mlxsw_sp_acl_region_rehash_intrvl_get+0x49/0x60 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c:1106 mlxsw_sp_params_acl_region_rehash_intrvl_get+0x33/0x80 drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3854 devlink_param_get net/core/devlink.c:4981 [inline] devlink_nl_param_fill+0x238/0x12d0 net/core/devlink.c:5089 devlink_param_notify+0xe5/0x230 net/core/devlink.c:5168 devlink_ns_change_notify net/core/devlink.c:4417 [inline] devlink_ns_change_notify net/core/devlink.c:4396 [inline] devlink_reload+0x15f/0x700 net/core/devlink.c:4507 devlink_pernet_pre_exit+0x112/0x1d0 net/core/devlink.c:12272 ops_pre_exit_list net/core/net_namespace.c:152 [inline] cleanup_net+0x494/0xc00 net/core/net_namespace.c:582 process_one_work+0x9fc/0x1710 kernel/workqueue.c:2289 worker_thread+0x675/0x10b0 kernel/workqueue.c:2436 kthread+0x30c/0x3d0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> The buggy address belongs to the physical page: page:ffffea0000267700 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x99dc flags: 0x100000000000000(node=0|zone=1) raw: 0100000000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880099dce80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8880099dcf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff8880099dcf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff8880099dd000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8880099dd080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Fixes: 98bbf70c ("mlxsw: spectrum: add "acl_region_rehash_interval" devlink param") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sun Shouxin authored
In my test, balance-alb bonding with two slaves eth0 and eth1, and then Bond0.150 is created with vlan id attached bond0. After adding bond0.150 into one linux bridge, I noted that Bond0, bond0.150 and bridge were assigned to the same MAC as eth0. Once bond0.150 receives a packet whose dest IP is bridge's and dest MAC is eth1's, the linux bridge will not match eth1's MAC entry in FDB, and not handle it as expected. The patch fix the issue, and diagram as below: eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac) | bond0.150(mac:eth0_mac) | bridge(ip:br_ip, mac:eth0_mac)--other port Suggested-by: Hu Yadi <huyd12@chinatelecom.cn> Signed-off-by: Sun Shouxin <sunshouxin@chinatelecom.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Clayton Yager authored
OutOctetsProtected, OutOctetsEncrypted, InOctetsValidated, and InOctetsDecrypted were incrementing by the total number of octets in frames instead of by the number of octets of User Data in frames. The Controlled Port statistics ifOutOctets and ifInOctets were incrementing by the total number of octets instead of the number of octets of the MSDUs plus octets of the destination and source MAC addresses. The Controlled Port statistics ifInDiscards and ifInErrors were not incrementing each time the counters they aggregate were. The Controlled Port statistic ifInErrors was not included in the output of macsec_get_stats64 so the value was not present in ip commands output. The ReceiveSA counters InPktsNotValid, InPktsNotUsingSA, and InPktsUnusedSA were not incrementing. Signed-off-by: Clayton Yager <Clayton_Yager@selinc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Peilin Ye authored
Imagine two non-blocking vsock_connect() requests on the same socket. The first request schedules @connect_work, and after it times out, vsock_connect_timeout() sets *sock* state back to TCP_CLOSE, but keeps *socket* state as SS_CONNECTING. Later, the second request returns -EALREADY, meaning the socket "already has a pending connection in progress", even though the first request has already timed out. As suggested by Stefano, fix it by setting *socket* state back to SS_UNCONNECTED, so that the second request will return -ETIMEDOUT. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Fixes: d021c344 ("VSOCK: Introduce VM Sockets") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Peilin Ye authored
An O_NONBLOCK vsock_connect() request may try to reschedule @connect_work. Imagine the following sequence of vsock_connect() requests: 1. The 1st, non-blocking request schedules @connect_work, which will expire after 200 jiffies. Socket state is now SS_CONNECTING; 2. Later, the 2nd, blocking request gets interrupted by a signal after a few jiffies while waiting for the connection to be established. Socket state is back to SS_UNCONNECTED, but @connect_work is still pending, and will expire after 100 jiffies. 3. Now, the 3rd, non-blocking request tries to schedule @connect_work again. Since @connect_work is already scheduled, schedule_delayed_work() silently returns. sock_hold() is called twice, but sock_put() will only be called once in vsock_connect_timeout(), causing a memory leak reported by syzbot: BUG: memory leak unreferenced object 0xffff88810ea56a40 (size 1232): comm "syz-executor756", pid 3604, jiffies 4294947681 (age 12.350s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 28 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 (..@............ backtrace: [<ffffffff837c830e>] sk_prot_alloc+0x3e/0x1b0 net/core/sock.c:1930 [<ffffffff837cbe22>] sk_alloc+0x32/0x2e0 net/core/sock.c:1989 [<ffffffff842ccf68>] __vsock_create.constprop.0+0x38/0x320 net/vmw_vsock/af_vsock.c:734 [<ffffffff842ce8f1>] vsock_create+0xc1/0x2d0 net/vmw_vsock/af_vsock.c:2203 [<ffffffff837c0cbb>] __sock_create+0x1ab/0x2b0 net/socket.c:1468 [<ffffffff837c3acf>] sock_create net/socket.c:1519 [inline] [<ffffffff837c3acf>] __sys_socket+0x6f/0x140 net/socket.c:1561 [<ffffffff837c3bba>] __do_sys_socket net/socket.c:1570 [inline] [<ffffffff837c3bba>] __se_sys_socket net/socket.c:1568 [inline] [<ffffffff837c3bba>] __x64_sys_socket+0x1a/0x20 net/socket.c:1568 [<ffffffff84512815>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff84512815>] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:80 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae <...> Use mod_delayed_work() instead: if @connect_work is already scheduled, reschedule it, and undo sock_hold() to keep the reference count balanced. Reported-and-tested-by: syzbot+b03f55bf128f9a38f064@syzkaller.appspotmail.com Fixes: d021c344 ("VSOCK: Introduce VM Sockets") Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Alonso authored
This reverts commit 36a15e1c. The usage of FLAG_SEND_ZLP causes problems to other firmware/hardware versions that have no issues. The FLAG_SEND_ZLP is not safe to use in this context. See: https://patchwork.ozlabs.org/project/netdev/patch/1270599787.8900.8.camel@Linuxdev4-laptop/#118378 The original problem needs another way to solve. Fixes: 36a15e1c ("net: usb: ax88179_178a needs FLAG_SEND_ZLP") Cc: stable@vger.kernel.org Reported-by: Ronald Wahl <ronald.wahl@raritan.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216327 Link: https://bugs.archlinux.org/task/75491Signed-off-by: Jose Alonso <joalonsof@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Topi Miettinen authored
'IPv4 and IPv4' should be 'IPv4 and IPv6'. Signed-off-by: Topi Miettinen <toiwoton@gmail.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Merge tag 'linux-can-fixes-for-6.0-20220810' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== this is a pull request of 4 patches for net/master, with the whitespace issue fixed. Fedor Pchelkin contributes 2 fixes for the j1939 CAN protocol. A patch by me for the ems_usb driver fixes an unaligned access warning. Sebastian Würl's patch for the mcp251x driver fixes a race condition in the receive interrupt. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Matthias May says: ==================== Do not use RT_TOS for IPv6 flowlabel According to Guillaume Nault RT_TOS should never be used for IPv6. Quote: RT_TOS() is an old macro used to interprete IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence. But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about. ==================== Link: https://lore.kernel.org/r/20220805191906.9323-1-matthias.may@westermo.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthias May authored
According to Guillaume Nault RT_TOS should never be used for IPv6. Quote: RT_TOS() is an old macro used to interprete IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence. But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about. Fixes: 571912c6 ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthias May authored
According to Guillaume Nault RT_TOS should never be used for IPv6. Quote: RT_TOS() is an old macro used to interprete IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence. But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about. Fixes: ce99f6b9 ("net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthias May authored
According to Guillaume Nault RT_TOS should never be used for IPv6. Quote: RT_TOS() is an old macro used to interprete IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence. But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about. Fixes: 1400615d ("vxlan: allow setting ipv6 traffic class") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthias May authored
According to Guillaume Nault RT_TOS should never be used for IPv6. Quote: RT_TOS() is an old macro used to interprete IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence. But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about. Fixes: 3a56f86f ("geneve: handle ipv6 priority like ipv4 tos") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Matthias May authored
The current code retrieves the TOS field after the lookup on the ipv4 routing table. The routing process currently only allows routing based on the original 3 TOS bits, and not on the full 6 DSCP bits. As a result the retrieved TOS is cut to the 3 bits. However for inheriting purposes the full 6 bits should be used. Extract the full 6 bits before the route lookup and use that instead of the cut off 3 TOS bits. Fixes: e305ac6c ("geneve: Add support to collect tunnel metadata.") Signed-off-by: Matthias May <matthias.may@westermo.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/20220805190006.8078-1-matthias.may@westermo.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-