- 02 May, 2024 8 commits
-
-
Paolo Abeni authored
Richard Gobert says: ==================== net: gro: add flush/flush_id checks and fix wrong offset in udp This series fixes a bug in the complete phase of UDP in GRO, in which socket lookup fails due to using network_header when parsing encapsulated packets. The fix is to add network_offset and inner_network_offset to napi_gro_cb and use these offsets for socket lookup. In addition p->flush/flush_id should be checked in all UDP flows. The same logic from tcp_gro_receive is applied for all flows in udp_gro_receive_segment. This prevents packets with mismatching network headers (flush/flush_id turned on) from merging in UDP GRO. The original series includes a change to vxlan test which adds the local parameter to prevent similar future bugs. I plan to submit it separately to net-next. This series is part of a previously submitted series to net-next: https://lore.kernel.org/all/20240408141720.98832-1-richardbgobert@gmail.com/ v3 -> v4: - Store network offsets, and use them only in udp_gro_complete flows - Correct commit hash used in Fixes tag - v3: https://lore.kernel.org/netdev/20240424163045.123528-1-richardbgobert@gmail.com/ v2 -> v3: - Add network_offsets and fix udp bug in a single commit to make backporting easier - Write to inner_network_offset in {inet,ipv6}_gro_receive - Use network_offsets union in tcp[46]_gro_complete as well - v2: https://lore.kernel.org/netdev/20240419153542.121087-1-richardbgobert@gmail.com/ v1 -> v2: - Use network_offsets instead of p_poff param as suggested by Willem - Check flush before postpull, and for all UDP GRO flows - v1: https://lore.kernel.org/netdev/20240412152120.115067-1-richardbgobert@gmail.com/ ==================== Link: https://lore.kernel.org/r/20240430143555.126083-1-richardbgobert@gmail.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Richard Gobert authored
GRO-GSO path is supposed to be transparent and as such L3 flush checks are relevant to all UDP flows merging in GRO. This patch uses the same logic and code from tcp_gro_receive, terminating merge if flush is non zero. Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Richard Gobert authored
Commits a6024562 ("udp: Add GRO functions to UDP socket") and 57c67ff4 ("udp: additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the complete phase of gro. The functions always return skb->network_header, which in the case of encapsulated packets at the gro complete phase, is always set to the innermost L3 of the packet. That means that calling {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_ L3/L4 may return an unexpected value. This incorrect usage leads to a bug in GRO's UDP socket lookup. udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These *_hdr functions return network_header which will point to the innermost L3, resulting in the wrong offset being used in __udp{4,6}_lib_lookup with encapsulated packets. This patch adds network_offset and inner_network_offset to napi_gro_cb, and makes sure both are set correctly. To fix the issue, network_offsets union is used inside napi_gro_cb, in which both the outer and the inner network offsets are saved. Reproduction example: Endpoint configuration example (fou + local address bind) # ip fou add port 6666 ipproto 4 # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip # ip link set tun1 up # ip a add 1.1.1.2/24 dev tun1 Netperf TCP_STREAM result on net-next before patch is applied: net-next main, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.28 2.37 net-next main, GRO disabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2745.06 patch applied, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 131072 16384 16384 5.01 2877.38 Fixes: a6024562 ("udp: Add GRO functions to UDP socket") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Shigeru Yoshida authored
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb() tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL while __ip_make_skb() is running, the function will access icmphdr in the skb even if it is not included. This causes the issue reported by KMSAN. Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL on the socket. Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These are union in struct flowi4 and are implicitly initialized by flowi4_init_output(), but we should not rely on specific union layout. Initialize these explicitly in raw_sendmsg(). [1] BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 ip_finish_skb include/net/ip.h:243 [inline] ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: slab_post_alloc_hook mm/slub.c:3804 [inline] slab_alloc_node mm/slub.c:3845 [inline] kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668 alloc_skb include/linux/skbuff.h:1318 [inline] __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 Fixes: 99e5acae ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Alexandra Winter authored
Symptom: When the hsuid attribute is set for the first time on an IQD Layer3 device while the corresponding network interface is already UP, the kernel will try to execute a napi function pointer that is NULL. Example: --------------------------------------------------------------------------- [ 2057.572696] illegal operation: 0001 ilc:1 [#1] SMP [ 2057.572702] Modules linked in: af_iucv qeth_l3 zfcp scsi_transport_fc sunrpc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink ghash_s390 prng xts aes_s390 des_s390 de s_generic sha3_512_s390 sha3_256_s390 sha512_s390 vfio_ccw vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio ext4 mbcache jbd2 qeth_l2 bridge stp llc dasd_eckd_mod qeth dasd_mod qdio ccwgroup pkey zcrypt [ 2057.572739] CPU: 6 PID: 60182 Comm: stress_client Kdump: loaded Not tainted 4.18.0-541.el8.s390x #1 [ 2057.572742] Hardware name: IBM 3931 A01 704 (LPAR) [ 2057.572744] Krnl PSW : 0704f00180000000 0000000000000002 (0x2) [ 2057.572748] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [ 2057.572751] Krnl GPRS: 0000000000000004 0000000000000000 00000000a3b008d8 0000000000000000 [ 2057.572754] 00000000a3b008d8 cb923a29c779abc5 0000000000000000 00000000814cfd80 [ 2057.572756] 000000000000012c 0000000000000000 00000000a3b008d8 00000000a3b008d8 [ 2057.572758] 00000000bab6d500 00000000814cfd80 0000000091317e46 00000000814cfc68 [ 2057.572762] Krnl Code:#0000000000000000: 0000 illegal >0000000000000002: 0000 illegal 0000000000000004: 0000 illegal 0000000000000006: 0000 illegal 0000000000000008: 0000 illegal 000000000000000a: 0000 illegal 000000000000000c: 0000 illegal 000000000000000e: 0000 illegal [ 2057.572800] Call Trace: [ 2057.572801] ([<00000000ec639700>] 0xec639700) [ 2057.572803] [<00000000913183e2>] net_rx_action+0x2ba/0x398 [ 2057.572809] [<0000000091515f76>] __do_softirq+0x11e/0x3a0 [ 2057.572813] [<0000000090ce160c>] do_softirq_own_stack+0x3c/0x58 [ 2057.572817] ([<0000000090d2cbd6>] do_softirq.part.1+0x56/0x60) [ 2057.572822] [<0000000090d2cc60>] __local_bh_enable_ip+0x80/0x98 [ 2057.572825] [<0000000091314706>] __dev_queue_xmit+0x2be/0xd70 [ 2057.572827] [<000003ff803dd6d6>] afiucv_hs_send+0x24e/0x300 [af_iucv] [ 2057.572830] [<000003ff803dd88a>] iucv_send_ctrl+0x102/0x138 [af_iucv] [ 2057.572833] [<000003ff803de72a>] iucv_sock_connect+0x37a/0x468 [af_iucv] [ 2057.572835] [<00000000912e7e90>] __sys_connect+0xa0/0xd8 [ 2057.572839] [<00000000912e9580>] sys_socketcall+0x228/0x348 [ 2057.572841] [<0000000091514e1a>] system_call+0x2a6/0x2c8 [ 2057.572843] Last Breaking-Event-Address: [ 2057.572844] [<0000000091317e44>] __napi_poll+0x4c/0x1d8 [ 2057.572846] [ 2057.572847] Kernel panic - not syncing: Fatal exception in interrupt ------------------------------------------------------------------------------------------- Analysis: There is one napi structure per out_q: card->qdio.out_qs[i].napi The napi.poll functions are set during qeth_open(). Since commit 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)") qeth_set_offline()/qeth_set_online() no longer call dev_close()/ dev_open(). So if qeth_free_qdio_queues() cleared card->qdio.out_qs[i].napi.poll while the network interface was UP and the card was offline, they are not set again. Reproduction: chzdev -e $devno layer2=0 ip link set dev $network_interface up echo 0 > /sys/bus/ccwgroup/devices/0.0.$devno/online echo foo > /sys/bus/ccwgroup/devices/0.0.$devno/hsuid echo 1 > /sys/bus/ccwgroup/devices/0.0.$devno/online -> Crash (can be enforced e.g. by af_iucv connect(), ip link down/up, ...) Note that a Completion Queue (CQ) is only enabled or disabled, when hsuid is set for the first time or when it is removed. Workarounds: - Set hsuid before setting the device online for the first time or - Use chzdev -d $devno; chzdev $devno hsuid=xxx; chzdev -e $devno; to set hsuid on an existing device. (this will remove and recreate the network interface) Fix: There is no need to free the output queues when a completion queue is added or removed. card->qdio.state now indicates whether the inbound buffer pool and the outbound queues are allocated. card->qdio.c_q indicates whether a CQ is allocated. Fixes: 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430091004.2265683-1-wintera@linux.ibm.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Guillaume Nault authored
Ensure the inner IP header is part of skb's linear data before reading its ECN bits. Otherwise we might read garbage. One symptom is the system erroneously logging errors like "vxlan: non-ECT from xxx.xxx.xxx.xxx with TOS=xxxx". Similar bugs have been fixed in geneve, ip_tunnel and ip6_tunnel (see commit 1ca1ba46 ("geneve: make sure to pull inner header in geneve_rx()") for example). So let's reuse the same code structure for consistency. Maybe we'll can add a common helper in the future. Fixes: d342894c ("vxlan: virtual extensible lan") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/1239c8db54efec341dd6455c77e0380f58923a3c.1714495737.git.gnault@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Xin Long authored
__skb_linearize() doesn't free the skb when it fails, so move '*buf = NULL' after __skb_linearize(), so that the skb can be freed on the err path. Fixes: b7df21cf ("tipc: skb_linearize the head skb when reassembling msgs") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Link: https://lore.kernel.org/r/90710748c29a1521efac4f75ea01b3b7e61414cf.1714485818.git.lucien.xin@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Paolo Abeni authored
Sam Page (sam4k) working with Trend Micro Zero Day Initiative reported a UAF in the tipc_buf_append() error path: BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183 Read of size 8 at addr ffff88804d2a7c80 by task poc/8034 CPU: 1 PID: 8034 Comm: poc Not tainted 6.8.2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 Call Trace: <IRQ> __dump_stack linux/lib/dump_stack.c:88 dump_stack_lvl+0xd9/0x1b0 linux/lib/dump_stack.c:106 print_address_description linux/mm/kasan/report.c:377 print_report+0xc4/0x620 linux/mm/kasan/report.c:488 kasan_report+0xda/0x110 linux/mm/kasan/report.c:601 kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183 skb_release_data+0x5af/0x880 linux/net/core/skbuff.c:1026 skb_release_all linux/net/core/skbuff.c:1094 __kfree_skb linux/net/core/skbuff.c:1108 kfree_skb_reason+0x12d/0x210 linux/net/core/skbuff.c:1144 kfree_skb linux/./include/linux/skbuff.h:1244 tipc_buf_append+0x425/0xb50 linux/net/tipc/msg.c:186 tipc_link_input+0x224/0x7c0 linux/net/tipc/link.c:1324 tipc_link_rcv+0x76e/0x2d70 linux/net/tipc/link.c:1824 tipc_rcv+0x45f/0x10f0 linux/net/tipc/node.c:2159 tipc_udp_recv+0x73b/0x8f0 linux/net/tipc/udp_media.c:390 udp_queue_rcv_one_skb+0xad2/0x1850 linux/net/ipv4/udp.c:2108 udp_queue_rcv_skb+0x131/0xb00 linux/net/ipv4/udp.c:2186 udp_unicast_rcv_skb+0x165/0x3b0 linux/net/ipv4/udp.c:2346 __udp4_lib_rcv+0x2594/0x3400 linux/net/ipv4/udp.c:2422 ip_protocol_deliver_rcu+0x30c/0x4e0 linux/net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2e4/0x520 linux/net/ipv4/ip_input.c:233 NF_HOOK linux/./include/linux/netfilter.h:314 NF_HOOK linux/./include/linux/netfilter.h:308 ip_local_deliver+0x18e/0x1f0 linux/net/ipv4/ip_input.c:254 dst_input linux/./include/net/dst.h:461 ip_rcv_finish linux/net/ipv4/ip_input.c:449 NF_HOOK linux/./include/linux/netfilter.h:314 NF_HOOK linux/./include/linux/netfilter.h:308 ip_rcv+0x2c5/0x5d0 linux/net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0x199/0x1e0 linux/net/core/dev.c:5534 __netif_receive_skb+0x1f/0x1c0 linux/net/core/dev.c:5648 process_backlog+0x101/0x6b0 linux/net/core/dev.c:5976 __napi_poll.constprop.0+0xba/0x550 linux/net/core/dev.c:6576 napi_poll linux/net/core/dev.c:6645 net_rx_action+0x95a/0xe90 linux/net/core/dev.c:6781 __do_softirq+0x21f/0x8e7 linux/kernel/softirq.c:553 do_softirq linux/kernel/softirq.c:454 do_softirq+0xb2/0xf0 linux/kernel/softirq.c:441 </IRQ> <TASK> __local_bh_enable_ip+0x100/0x120 linux/kernel/softirq.c:381 local_bh_enable linux/./include/linux/bottom_half.h:33 rcu_read_unlock_bh linux/./include/linux/rcupdate.h:851 __dev_queue_xmit+0x871/0x3ee0 linux/net/core/dev.c:4378 dev_queue_xmit linux/./include/linux/netdevice.h:3169 neigh_hh_output linux/./include/net/neighbour.h:526 neigh_output linux/./include/net/neighbour.h:540 ip_finish_output2+0x169f/0x2550 linux/net/ipv4/ip_output.c:235 __ip_finish_output linux/net/ipv4/ip_output.c:313 __ip_finish_output+0x49e/0x950 linux/net/ipv4/ip_output.c:295 ip_finish_output+0x31/0x310 linux/net/ipv4/ip_output.c:323 NF_HOOK_COND linux/./include/linux/netfilter.h:303 ip_output+0x13b/0x2a0 linux/net/ipv4/ip_output.c:433 dst_output linux/./include/net/dst.h:451 ip_local_out linux/net/ipv4/ip_output.c:129 ip_send_skb+0x3e5/0x560 linux/net/ipv4/ip_output.c:1492 udp_send_skb+0x73f/0x1530 linux/net/ipv4/udp.c:963 udp_sendmsg+0x1a36/0x2b40 linux/net/ipv4/udp.c:1250 inet_sendmsg+0x105/0x140 linux/net/ipv4/af_inet.c:850 sock_sendmsg_nosec linux/net/socket.c:730 __sock_sendmsg linux/net/socket.c:745 __sys_sendto+0x42c/0x4e0 linux/net/socket.c:2191 __do_sys_sendto linux/net/socket.c:2203 __se_sys_sendto linux/net/socket.c:2199 __x64_sys_sendto+0xe0/0x1c0 linux/net/socket.c:2199 do_syscall_x64 linux/arch/x86/entry/common.c:52 do_syscall_64+0xd8/0x270 linux/arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 linux/arch/x86/entry/entry_64.S:120 RIP: 0033:0x7f3434974f29 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007fff9154f2b8 EFLAGS: 00000212 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3434974f29 RDX: 00000000000032c8 RSI: 00007fff9154f300 RDI: 0000000000000003 RBP: 00007fff915532e0 R08: 00007fff91553360 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000212 R12: 000055ed86d261d0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> In the critical scenario, either the relevant skb is freed or its ownership is transferred into a frag_lists. In both cases, the cleanup code must not free it again: we need to clear the skb reference earlier. Fixes: 1149557d ("tipc: eliminate unnecessary linearization of incoming buffers") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-23852 Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/752f1ccf762223d109845365d07f55414058e5a3.1714484273.git.pabeni@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 01 May, 2024 8 commits
-
-
Jeffrey Altman authored
The find connection logic of Transarc's Rx was modified in the mid-1990s to support multi-homed servers which might send a response packet from an address other than the destination address in the received packet. The rules for accepting a packet by an Rx initiator (RX_CLIENT_CONNECTION) were altered to permit acceptance of a packet from any address provided that the port number was unchanged and all of the connection identifiers matched (Epoch, CID, SecurityClass, ...). This change applies the same rules to the Linux implementation which makes it consistent with IBM AFS 3.6, Arla, OpenAFS and AuriStorFS. Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: Jeffrey Altman <jaltman@auristor.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/20240419163057.4141728-1-marc.dionne@auristor.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Felix Fietkau authored
SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become invalid. Return NULL if such an skb is passed to skb_copy or skb_copy_expand, in order to prevent a crash on a potential later call to skb_gso_segment. Fixes: 3a1296a3 ("net: Support GRO/GSO fraglist chaining.") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Felix Fietkau authored
Calling skb_copy on a SKB_GSO_FRAGLIST skb is not valid, since it returns an invalid linearized skb. This code only needs to change the ethernet header, so pskb_copy is the right function to call here. Fixes: 6db6f0ea ("bridge: multicast to unicast") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
Christoph reported a splat hinting at a corrupted snd_una: WARNING: CPU: 1 PID: 38 at net/mptcp/protocol.c:1005 __mptcp_clean_una+0x4b3/0x620 net/mptcp/protocol.c:1005 Modules linked in: CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.9.0-rc1-gbbeac67456c9 #59 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:__mptcp_clean_una+0x4b3/0x620 net/mptcp/protocol.c:1005 Code: be 06 01 00 00 bf 06 01 00 00 e8 a8 12 e7 fe e9 00 fe ff ff e8 8e 1a e7 fe 0f b7 ab 3e 02 00 00 e9 d3 fd ff ff e8 7d 1a e7 fe <0f> 0b 4c 8b bb e0 05 00 00 e9 74 fc ff ff e8 6a 1a e7 fe 0f 0b e9 RSP: 0018:ffffc9000013fd48 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8881029bd280 RCX: ffffffff82382fe4 RDX: ffff8881003cbd00 RSI: ffffffff823833c3 RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: fefefefefefefeff R12: ffff888138ba8000 R13: 0000000000000106 R14: ffff8881029bd908 R15: ffff888126560000 FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f604a5dae38 CR3: 0000000101dac002 CR4: 0000000000170ef0 Call Trace: <TASK> __mptcp_clean_una_wakeup net/mptcp/protocol.c:1055 [inline] mptcp_clean_una_wakeup net/mptcp/protocol.c:1062 [inline] __mptcp_retrans+0x7f/0x7e0 net/mptcp/protocol.c:2615 mptcp_worker+0x434/0x740 net/mptcp/protocol.c:2767 process_one_work+0x1e0/0x560 kernel/workqueue.c:3254 process_scheduled_works kernel/workqueue.c:3335 [inline] worker_thread+0x3c7/0x640 kernel/workqueue.c:3416 kthread+0x121/0x170 kernel/kthread.c:388 ret_from_fork+0x44/0x50 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243 </TASK> When fallback to TCP happens early on a client socket, snd_nxt is not yet initialized and any incoming ack will copy such value into snd_una. If the mptcp worker (dumbly) tries mptcp-level re-injection after such ack, that would unconditionally trigger a send buffer cleanup using 'bad' snd_una values. We could easily disable re-injection for fallback sockets, but such dumb behavior already helped catching a few subtle issues and a very low to zero impact in practice. Instead address the issue always initializing snd_nxt (and write_seq, for consistency) at connect time. Fixes: 8fd73804 ("mptcp: fallback in case of simultaneous connect") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240429-upstream-net-20240429-mptcp-snd_nxt-init-connect-v1-1-59ceac0a7dcb@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vitaly Lifshits authored
This is a partial revert of commit 6dbdd4de ("e1000e: Workaround for sporadic MDI error on Meteor Lake systems"). The referenced commit used usleep_range inside the PHY access routines, which are sometimes called from an atomic context. This can lead to a kernel panic in some scenarios, such as cable disconnection and reconnection on vPro systems. Solve this by changing the usleep_range calls back to udelay. Fixes: 6dbdd4de ("e1000e: Workaround for sporadic MDI error on Meteor Lake systems") Cc: stable@vger.kernel.org Reported-by: Jérôme Carretero <cJ@zougloub.eu> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218740 Closes: https://lore.kernel.org/lkml/a7eb665c74b5efb5140e6979759ed243072cb24a.camel@zougloub.eu/Co-developed-by: Sasha Neftin <sasha.neftin@intel.com> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Tested-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240429171040.1152516-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Marek Behún authored
The Topaz family (88E6141 and 88E6341) only support 256 Forwarding Information Tables. Fixes: a75961d0 ("net: dsa: mv88e6xxx: Add support for ethernet switch 88E6341") Fixes: 1558727a ("net: dsa: mv88e6xxx: Add support for ethernet switch 88E6141") Signed-off-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20240429133832.9547-1-kabel@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
The selftest for the driver sends a dummy packet and checks if the packet will be received properly as it should be. The regular TX path and the selftest can use the same network queue so locking is required and was missing in the selftest path. This was addressed in the commit cited below. Unfortunately locking the TX queue requires BH to be disabled which is not the case in selftest path which is invoked in process context. Lockdep should be complaining about this. Use __netif_tx_lock_bh() for TX queue locking. Fixes: c650e048 ("cxgb4: Fix race between loopback and normal Tx path") Reported-by: "John B. Wyatt IV" <jwyatt@redhat.com> Closes: https://lore.kernel.org/all/Zic0ot5aGgR-V4Ks@thinkpad2021/Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240429091147.YWAaal4v@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yunsheng Lin authored
rxrpc_alloc_data_txbuf() may be called with data_align being zero in none_alloc_txbuf() and rxkad_alloc_txbuf(), data_align is supposed to be an order-based alignment value, but zero is not a valid order-based alignment value, and '~(data_align - 1)' doesn't result in a valid mask-based alignment value for __page_frag_alloc_align(). Fix it by passing a valid order-based alignment value in none_alloc_txbuf() and rxkad_alloc_txbuf(). Also use page_frag_alloc_align() expecting an order-based alignment value in rxrpc_alloc_data_txbuf() to avoid doing the alignment converting operation and to catch possible invalid alignment value in the future. Remove the 'if (data_align)' checking too, as it is always true for a valid order-based alignment value. Fixes: 6b253646 ("rxrpc: Fix use of changed alignment param to page_frag_alloc_align()") Fixes: 49489bb0 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags") CC: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240428111640.27306-1-linyunsheng@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 29 Apr, 2024 13 commits
-
-
David S. Miller authored
Guillaume Nault says: ==================== vxlan: Fix vxlan counters. Like most virtual devices, vxlan needs special care when updating its netdevice counters. This is done in patch 1. Patch 2 just adds a missing VNI counter update (found while working on patch 1). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
VXLAN stores per-VNI statistics using vxlan_vnifilter_count(). These statistics were not updated when arp_reduce() failed its pskb_may_pull() call. Use vxlan_vnifilter_count() to update the VNI counter when that happens. Fixes: 4095e0e1 ("drivers: vxlan: vnifilter: per vni stats") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
VXLAN devices update their stats locklessly. Therefore these counters should either be stored in per-cpu data structures or the updates should be done using atomic increments. Since the net_device_core_stats infrastructure is already used in vxlan_rcv(), use it for the other rx_dropped and tx_dropped counter updates. Update the other counters atomically using DEV_STATS_INC(). Fixes: d342894c ("vxlan: virtual extensible lan") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Asbjørn Sloth Tønnesen says: ==================== net: qede: avoid overruling error codes This series fixes the qede driver, so that qede_parse_flow_attr() and it's subfunctions doesn't get their error codes overruled (ie. turning -EOPNOTSUPP into -EINVAL). --- I have two more patches along the same lines, but they are not yet causing any issues, so I have them destined for net-next. (those are for qede_flow_spec_validate_unused() and qede_flow_parse_ports().) After that I have a series for converting to extack + the final one for validating control flags. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Asbjørn Sloth Tønnesen authored
When calling qede_parse_actions() then the return code was only used for a non-zero check, and then -EINVAL was returned. qede_parse_actions() can currently fail with: * -EINVAL * -EOPNOTSUPP This patch changes the code to use the actual return code, not just return -EINVAL. The blaimed commit broke the implicit assumption that only -EINVAL would ever be returned. Only compile tested. Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Asbjørn Sloth Tønnesen authored
In qede_flow_spec_to_rule(), when calling qede_parse_flow_attr() then the return code was only used for a non-zero check, and then -EINVAL was returned. qede_parse_flow_attr() can currently fail with: * -EINVAL * -EOPNOTSUPP * -EPROTONOSUPPORT This patch changes the code to use the actual return code, not just return -EINVAL. The blaimed commit introduced qede_flow_spec_to_rule(), and this call to qede_parse_flow_attr(), it looks like it just duplicated how it was already used. Only compile tested. Fixes: 37c5d3ef ("qede: use ethtool_rx_flow_rule() to remove duplicated parser code") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Asbjørn Sloth Tønnesen authored
In qede_add_tc_flower_fltr(), when calling qede_parse_flow_attr() then the return code was only used for a non-zero check, and then -EINVAL was returned. qede_parse_flow_attr() can currently fail with: * -EINVAL * -EOPNOTSUPP * -EPROTONOSUPPORT This patch changes the code to use the actual return code, not just return -EINVAL. The blaimed commit introduced these functions. Only compile tested. Fixes: 2ce9c93e ("qede: Ingress tc flower offload (drop action) support.") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Asbjørn Sloth Tønnesen authored
Explicitly set 'rc' (return code), before jumping to the unlock and return path. By not having any code depend on that 'rc' remains at it's initial value of -EINVAL, then we can re-use 'rc' for the return code of function calls in subsequent patches. Only compile tested. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Donald has been contributing to YNL a lot. Let's create a dedicated MAINTAINERS entry and add make his involvement official :) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Doug Berger says: ==================== net: bcmgenet: protect contended accesses Some registers may be modified by parallel execution contexts and require protections to prevent corruption. A review of the driver revealed the need for these additional protections. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Doug Berger authored
The UMAC_CMD register is written from different execution contexts and has insufficient synchronization protections to prevent possible corruption. Of particular concern are the acceses from the phy_device delayed work context used by the adjust_link call and the BH context that may be used by the ndo_set_rx_mode call. A spinlock is added to the driver to protect contended register accesses (i.e. reg_lock) and it is used to synchronize accesses to UMAC_CMD. Fixes: 1c1008c7 ("net: bcmgenet: add main driver file") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Doug Berger authored
The ndo_set_rx_mode function is synchronized with the netif_addr_lock spinlock and BHs disabled. Since this function is also invoked directly from the driver the same synchronization should be applied. Fixes: 72f96347 ("net: bcmgenet: set Rx mode before starting netif") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Doug Berger authored
The EXT_RGMII_OOB_CTRL register can be written from different contexts. It is predominantly written from the adjust_link handler which is synchronized by the phydev->lock, but can also be written from a different context when configuring the mii in bcmgenet_mii_config(). The chances of contention are quite low, but it is conceivable that adjust_link could occur during resume when WoL is enabled so use the phydev->lock synchronizer in bcmgenet_mii_config() to be sure. Fixes: afe3f907 ("net: bcmgenet: power on MII block for all MII modes") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 Apr, 2024 1 commit
-
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf 2024-04-26 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 14 files changed, 168 insertions(+), 72 deletions(-). The main changes are: 1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page, from Puranjay Mohan. 2) Fix a crash in XDP with devmap broadcast redirect when the latter map is in process of being torn down, from Toke Høiland-Jørgensen. 3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF program runtime stats, from Xu Kuohai. 4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue, from Jason Xing. 5) Fix BPF verifier error message in resolve_pseudo_ldimm64, from Anton Protopopov. 6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 bpf, x86: Fix PROBE_MEM runtime load check bpf: verifier: prevent userspace memory access xdp: use flags field to disambiguate broadcast redirect arm32, bpf: Reimplement sign-extension mov instruction riscv, bpf: Fix incorrect runtime stats bpf, arm64: Fix incorrect runtime stats bpf: Fix a verifier verbose message bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers MAINTAINERS: Update email address for Puranjay Mohan bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition ==================== Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 26 Apr, 2024 10 commits
-
-
David Howells authored
Fix extract_user_to_sg() so that it will break out of the loop if iov_iter_extract_pages() returns 0 rather than looping around forever. [Note that I've included two fixes lines as the function got moved to a different file and renamed] Fixes: 85dd2c8f ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Fixes: f5f82cd1 ("Move netfs_extract_iter_to_sg() to lib/scatterlist.c") Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Herbert Xu <herbert@gondor.apana.org.au> cc: netfs@lists.linux.dev Link: https://lore.kernel.org/r/1967121.1714034372@warthog.procyon.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alexei Starovoitov authored
Puranjay Mohan says: ==================== bpf: prevent userspace memory access V5: https://lore.kernel.org/bpf/20240324185356.59111-1-puranjay12@gmail.com/ Changes in V6: - Disable the verifier's instrumentation in x86-64 and update the JIT to take care of vsyscall page in addition to userspace addresses. - Update bpf_testmod to test for vsyscall addresses. V4: https://lore.kernel.org/bpf/20240321124640.8870-1-puranjay12@gmail.com/ Changes in V5: - Use TASK_SIZE_MAX + PAGE_SIZE, VSYSCALL_ADDR as userspace boundary in x86-64 JIT. - Added Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> V3: https://lore.kernel.org/bpf/20240321120842.78983-1-puranjay12@gmail.com/ Changes in V4: - Disable this feature on architectures that don't define CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. - By doing the above, we don't need anything explicitly for s390x. V2: https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@gmail.com/ Changes in V3: - Return 0 from bpf_arch_uaddress_limit() in disabled case because it returns u64. - Modify the check in verifier to no do instrumentation when uaddress_limit is 0. V1: https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@gmail.com/ Changes in V2: - Disable this feature on s390x. With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. x86-64 JIT already instruments the BPF_PROBE_MEM based loads with checks to skip loads from userspace addresses, but is doesn't check for vsyscall page because it falls in the kernel address space but is considered a userspace page. The second patch in this series fixes the x86-64 JIT to also skip loads from the vsyscall page. The last patch updates the bpf_testmod so this address can be checked as part of the selftests. Other architectures don't have the complexity of the vsyscall address and just need to skip loads from the userspace. To make this more scalable and robust, the verifier is updated in the first patch to instrument BPF_PROBE_MEM to skip loads from the userspace addresses. ==================== Link: https://lore.kernel.org/r/20240424100210.11982-1-puranjay@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Puranjay Mohan authored
The vsyscall is a legacy API for fast execution of system calls. It maps a page at address VSYSCALL_ADDR into the userspace program. This address is in the top 10MB of the address space: ffffffffff600000 - ffffffffff600fff | 4 kB | legacy vsyscall ABI The last commit fixes the x86-64 BPF JIT to skip accessing addresses in this memory region. Add this address to bpf_testmod_return_ptr() so we can make sure that it is fixed. After this change and without the previous commit, subprogs_extable selftest will crash the kernel. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20240424100210.11982-4-puranjay@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Puranjay Mohan authored
When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the address being loaded from is not necessarily valid. The BPF jit sets up exception handlers for each such load which catch page faults and 0 out the destination register. If the address for the load is outside kernel address space, the load will escape the exception handling and crash the kernel. To prevent this from happening, the emits some instruction to verify that addr is > end of userspace addresses. x86 has a legacy vsyscall ABI where a page at address 0xffffffffff600000 is mapped with user accessible permissions. The addresses in this page are considered userspace addresses by the fault handler. Therefore, a BPF program accessing this page will crash the kernel. This patch fixes the runtime checks to also check that the PROBE_MEM address is below VSYSCALL_ADDR. Example BPF program: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile unsigned long *)&sk->sk_tsq_flags; return 0; } BPF Assembly: 0: (79) r1 = *(u64 *)(r1 +0) 1: (79) r1 = *(u64 *)(r1 +344) 2: (b7) r0 = 0 3: (95) exit x86-64 JIT ========== BEFORE AFTER ------ ----- 0: nopl 0x0(%rax,%rax,1) 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 5: xchg %ax,%ax 7: push %rbp 7: push %rbp 8: mov %rsp,%rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi b: mov 0x0(%rdi),%rdi ------------------------------------------------------------------------------- f: movabs $0x100000000000000,%r11 f: movabs $0xffffffffff600000,%r10 19: add $0x2a0,%rdi 19: mov %rdi,%r11 20: cmp %r11,%rdi 1c: add $0x2a0,%r11 23: jae 0x0000000000000029 23: sub %r10,%r11 25: xor %edi,%edi 26: movabs $0x100000000a00000,%r10 27: jmp 0x000000000000002d 30: cmp %r10,%r11 29: mov 0x0(%rdi),%rdi 33: ja 0x0000000000000039 --------------------------------\ 35: xor %edi,%edi 2d: xor %eax,%eax \ 37: jmp 0x0000000000000040 2f: leave \ 39: mov 0x2a0(%rdi),%rdi 30: ret \-------------------------------------------- 40: xor %eax,%eax 42: leave 43: ret Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20240424100210.11982-3-puranjay@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Puranjay Mohan authored
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = *(size *)(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile long *)sk; return 0; } BPF Program before | BPF Program after ------------------ | ----------------- 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: 80083428 ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Bauer authored
Drop the flow-hash of the skb when forwarding to the L2TP netdev. This avoids the L2TP qdisc from using the flow-hash from the outer packet, which is identical for every flow within the tunnel. This does not affect every platform but is specific for the ethernet driver. It depends on the platform including L4 information in the flow-hash. One such example is the Mediatek Filogic MT798x family of networking processors. Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support") Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David Bauer <mail@david-bauer.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240424171110.13701-1-mail@david-bauer.netSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
syzbot triggered various splats (see [0] and links) by a crafted GSO packet of VIRTIO_NET_HDR_GSO_UDP layering the following protocols: ETH_P_8021AD + ETH_P_NSH + ETH_P_IPV6 + IPPROTO_UDP NSH can encapsulate IPv4, IPv6, Ethernet, NSH, and MPLS. As the inner protocol can be Ethernet, NSH GSO handler, nsh_gso_segment(), calls skb_mac_gso_segment() to invoke inner protocol GSO handlers. nsh_gso_segment() does the following for the original skb before calling skb_mac_gso_segment() 1. reset skb->network_header 2. save the original skb->{mac_heaeder,mac_len} in a local variable 3. pull the NSH header 4. resets skb->mac_header 5. set up skb->mac_len and skb->protocol for the inner protocol. and does the following for the segmented skb 6. set ntohs(ETH_P_NSH) to skb->protocol 7. push the NSH header 8. restore skb->mac_header 9. set skb->mac_header + mac_len to skb->network_header 10. restore skb->mac_len There are two problems in 6-7 and 8-9. (a) After 6 & 7, skb->data points to the NSH header, so the outer header (ETH_P_8021AD in this case) is stripped when skb is sent out of netdev. Also, if NSH is encapsulated by NSH + Ethernet (so NSH-Ethernet-NSH), skb_pull() in the first nsh_gso_segment() will make skb->data point to the middle of the outer NSH or Ethernet header because the Ethernet header is not pulled by the second nsh_gso_segment(). (b) While restoring skb->{mac_header,network_header} in 8 & 9, nsh_gso_segment() does not assume that the data in the linear buffer is shifted. However, udp6_ufo_fragment() could shift the data and change skb->mac_header accordingly as demonstrated by syzbot. If this happens, even the restored skb->mac_header points to the middle of the outer header. It seems nsh_gso_segment() has never worked with outer headers so far. At the end of nsh_gso_segment(), the outer header must be restored for the segmented skb, instead of the NSH header. To do that, let's calculate the outer header position relatively from the inner header and set skb->{data,mac_header,protocol} properly. [0]: BUG: KMSAN: uninit-value in ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline] BUG: KMSAN: uninit-value in ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] BUG: KMSAN: uninit-value in ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668 ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668 ipvlan_start_xmit+0x5c/0x1a0 drivers/net/ipvlan/ipvlan_main.c:222 __netdev_start_xmit include/linux/netdevice.h:4989 [inline] netdev_start_xmit include/linux/netdevice.h:5003 [inline] xmit_one net/core/dev.c:3547 [inline] dev_hard_start_xmit+0x244/0xa10 net/core/dev.c:3563 __dev_queue_xmit+0x33ed/0x51c0 net/core/dev.c:4351 dev_queue_xmit include/linux/netdevice.h:3171 [inline] packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3081 [inline] packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] __sys_sendto+0x735/0xa10 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b Uninit was created at: slab_post_alloc_hook mm/slub.c:3819 [inline] slab_alloc_node mm/slub.c:3860 [inline] __do_kmalloc_node mm/slub.c:3980 [inline] __kmalloc_node_track_caller+0x705/0x1000 mm/slub.c:4001 kmalloc_reserve+0x249/0x4a0 net/core/skbuff.c:582 __alloc_skb+0x352/0x790 net/core/skbuff.c:651 skb_segment+0x20aa/0x7080 net/core/skbuff.c:4647 udp6_ufo_fragment+0xcab/0x1150 net/ipv6/udp_offload.c:109 ipv6_gso_segment+0x14be/0x2ca0 net/ipv6/ip6_offload.c:152 skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53 nsh_gso_segment+0x6f4/0xf70 net/nsh/nsh.c:108 skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53 __skb_gso_segment+0x4b0/0x730 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x107f/0x1930 net/core/dev.c:3628 __dev_queue_xmit+0x1f28/0x51c0 net/core/dev.c:4343 dev_queue_xmit include/linux/netdevice.h:3171 [inline] packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3081 [inline] packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] __sys_sendto+0x735/0xa10 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b CPU: 1 PID: 5101 Comm: syz-executor421 Not tainted 6.8.0-rc5-syzkaller-00297-gf2e367d6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Fixes: c411ed85 ("nsh: add GSO support") Reported-and-tested-by: syzbot+42a0dc856239de4de60e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=42a0dc856239de4de60e Reported-and-tested-by: syzbot+c298c9f0e46a3c86332b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c298c9f0e46a3c86332b Link: https://lore.kernel.org/netdev/20240415222041.18537-1-kuniyu@amazon.com/Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240424023549.21862-1-kuniyu@amazon.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jakub Kicinski authored
Bui Quang Minh says: ==================== Ensure the copied buf is NUL terminated (part) I found that some drivers contains an out-of-bound read pattern like this kern_buf = memdup_user(user_buf, count); ... sscanf(kern_buf, ...); The sscanf can be replaced by some other string-related functions. This pattern can lead to out-of-bound read of kern_buf in string-related functions. This series fix the above issue by replacing memdup_user with memdup_user_nul. v1: https://lore.kernel.org/r/20240422-fix-oob-read-v1-0-e02854c30174@gmail.com ==================== Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-0-f1f1b53a10f4@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Bui Quang Minh authored
We try to access count + 1 byte from userspace with memdup_user(buffer, count + 1). However, the userspace only provides buffer of count bytes and only these count bytes are verified to be okay to access. To ensure the copied buffer is NUL terminated, we use memdup_user_nul instead. Fixes: 3a2eb515 ("octeontx2-af: Fix an off by one in rvu_dbg_qsize_write()") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-6-f1f1b53a10f4@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Bui Quang Minh authored
Currently, we allocate a nbytes-sized kernel buffer and copy nbytes from userspace to that buffer. Later, we use sscanf on this buffer but we don't ensure that the string is terminated inside the buffer, this can lead to OOB read when using sscanf. Fix this issue by using memdup_user_nul instead of memdup_user. Fixes: 7afc5dbd ("bna: Add debugfs interface.") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-2-f1f1b53a10f4@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-