1. 02 May, 2024 9 commits
    • MAINTAINERS: remove Ariel Elior · c9ccbcd9
      Jakub Kicinski authored
      aelior@marvell.com bounces, we haven't seen Ariel on lore
      since March 2022.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20240430233305.1356105-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-gro-add-flush-flush_id-checks-and-fix-wrong-offset-in-udp' · a257f093
      Paolo Abeni authored
      Richard Gobert says:
      
      ====================
      net: gro: add flush/flush_id checks and fix wrong offset in udp
      
      This series fixes a bug in the complete phase of UDP in GRO, in which
      socket lookup fails due to using network_header when parsing encapsulated
      packets. The fix is to add network_offset and inner_network_offset to
      napi_gro_cb and use these offsets for socket lookup.
      
      In addition p->flush/flush_id should be checked in all UDP flows. The
      same logic from tcp_gro_receive is applied for all flows in
      udp_gro_receive_segment. This prevents packets with mismatching network
      headers (flush/flush_id turned on) from merging in UDP GRO.
      
      The original series includes a change to vxlan test which adds the local
      parameter to prevent similar future bugs. I plan to submit it separately to
      net-next.
      
      This series is part of a previously submitted series to net-next:
      https://lore.kernel.org/all/20240408141720.98832-1-richardbgobert@gmail.com/
      
      v3 -> v4:
       - Store network offsets, and use them only in udp_gro_complete flows
       - Correct commit hash used in Fixes tag
       - v3:
       https://lore.kernel.org/netdev/20240424163045.123528-1-richardbgobert@gmail.com/
      
      v2 -> v3:
       - Add network_offsets and fix udp bug in a single commit to make backporting easier
       - Write to inner_network_offset in {inet,ipv6}_gro_receive
       - Use network_offsets union in tcp[46]_gro_complete as well
       - v2:
       https://lore.kernel.org/netdev/20240419153542.121087-1-richardbgobert@gmail.com/
      
      v1 -> v2:
       - Use network_offsets instead of p_poff param as suggested by Willem
       - Check flush before postpull, and for all UDP GRO flows
       - v1:
       https://lore.kernel.org/netdev/20240412152120.115067-1-richardbgobert@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240430143555.126083-1-richardbgobert@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: gro: add flush check in udp_gro_receive_segment · 5babae77
      Richard Gobert authored
      GRO-GSO path is supposed to be transparent and as such L3 flush checks are
      relevant to all UDP flows merging in GRO. This patch uses the same logic
      and code from tcp_gro_receive, terminating merge if flush is non zero.
      
      Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.")
      Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb · 5ef31ea5
      Richard Gobert authored
      Commits a6024562 ("udp: Add GRO functions to UDP socket") and 57c67ff4 ("udp:
      additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the
      complete phase of gro. The functions always return skb->network_header,
      which in the case of encapsulated packets at the gro complete phase, is
      always set to the innermost L3 of the packet. That means that calling
      {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in
      gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_
      L3/L4 may return an unexpected value.
      
      This incorrect usage leads to a bug in GRO's UDP socket lookup.
      udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These
      *_hdr functions return network_header which will point to the innermost L3,
      resulting in the wrong offset being used in __udp{4,6}_lib_lookup with
      encapsulated packets.
      
      This patch adds network_offset and inner_network_offset to napi_gro_cb, and
      makes sure both are set correctly.
      
      To fix the issue, network_offsets union is used inside napi_gro_cb, in
      which both the outer and the inner network offsets are saved.
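
      The idea can be illustrated with a small user-space model. This is only
      a sketch: toy_gro_cb, toy_skb, toy_record_l3() and toy_outer_l3_offset()
      are hypothetical names, and the union layout is an assumption based on
      the description above, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the offsets kept in napi_gro_cb; the layout is an assumption. */
struct toy_gro_cb {
	union {
		struct {
			uint16_t network_offset;       /* outermost L3 header */
			uint16_t inner_network_offset; /* innermost L3 header */
		};
		uint16_t network_offsets[2];
	};
};

/* Toy stand-in for an skb: one shared "current L3" offset plus the cb. */
struct toy_skb {
	uint16_t network_header; /* overwritten by every L3 layer that is parsed */
	struct toy_gro_cb cb;
};

/* Called once per L3 layer during receive; encap is 0 for outer, 1 for inner. */
static void toy_record_l3(struct toy_skb *skb, uint16_t off, int encap)
{
	assert(encap == 0 || encap == 1);
	skb->network_header = off;            /* ends up at the innermost L3 */
	skb->cb.network_offsets[encap] = off; /* both layers stay recoverable */
}

/* Offset that a lookup on the outer UDP header would use in the complete phase. */
static uint16_t toy_outer_l3_offset(const struct toy_skb *skb)
{
	return skb->cb.network_offset;
}
```

      For an Ethernet + IPv4 + UDP (FOU) + IPv4 packet, recording offsets 14
      and 42 leaves network_header at 42, while toy_outer_l3_offset() still
      returns 14.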
      
      Reproduction example:
      
      Endpoint configuration example (fou + local address bind)
      
          # ip fou add port 6666 ipproto 4
          # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip
          # ip link set tun1 up
          # ip a add 1.1.1.2/24 dev tun1
      
      Netperf TCP_STREAM result on net-next before patch is applied:
      
      net-next main, GRO enabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.28        2.37
      
      net-next main, GRO disabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.01     2745.06
      
      patch applied, GRO enabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.01     2877.38
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ipv4: Fix uninit-value access in __ip_make_skb() · fc1092f5
      Shigeru Yoshida authored
      KMSAN reported uninit-value access in __ip_make_skb() [1].  __ip_make_skb()
      tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
      race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
      while __ip_make_skb() is running, the function will access icmphdr in the
      skb even if it is not included. This causes the issue reported by KMSAN.
      
      Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
      on the socket.
      
      Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
      are union members in struct flowi4 and are implicitly initialized by
      flowi4_init_output(), but we should not rely on specific union layout.
      
      Initialize these explicitly in raw_sendmsg().
      
      [1]
      BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
       __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
       ip_finish_skb include/net/ip.h:243 [inline]
       ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
       raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
       inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x274/0x3c0 net/socket.c:745
       __sys_sendto+0x62c/0x7b0 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x130/0x200 net/socket.c:2199
       do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was created at:
       slab_post_alloc_hook mm/slub.c:3804 [inline]
       slab_alloc_node mm/slub.c:3845 [inline]
       kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
       kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
       __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
       alloc_skb include/linux/skbuff.h:1318 [inline]
       __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
       ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
       raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
       inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x274/0x3c0 net/socket.c:745
       __sys_sendto+0x62c/0x7b0 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x130/0x200 net/socket.c:2199
       do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb #25
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
      
      Fixes: 99e5acae ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • s390/qeth: Fix kernel panic after setting hsuid · 8a2e4d37
      Alexandra Winter authored
      Symptom:
      When the hsuid attribute is set for the first time on an IQD Layer3
      device while the corresponding network interface is already UP,
      the kernel will try to execute a napi function pointer that is NULL.
      
      Example:
      ---------------------------------------------------------------------------
      [ 2057.572696] illegal operation: 0001 ilc:1 [#1] SMP
      [ 2057.572702] Modules linked in: af_iucv qeth_l3 zfcp scsi_transport_fc sunrpc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
      nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink ghash_s390 prng xts aes_s390 des_s390 de
      s_generic sha3_512_s390 sha3_256_s390 sha512_s390 vfio_ccw vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio ext4 mbcache jbd2 qeth_l2 bridge stp llc dasd_eckd_mod qeth dasd_mod
       qdio ccwgroup pkey zcrypt
      [ 2057.572739] CPU: 6 PID: 60182 Comm: stress_client Kdump: loaded Not tainted 4.18.0-541.el8.s390x #1
      [ 2057.572742] Hardware name: IBM 3931 A01 704 (LPAR)
      [ 2057.572744] Krnl PSW : 0704f00180000000 0000000000000002 (0x2)
      [ 2057.572748]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
      [ 2057.572751] Krnl GPRS: 0000000000000004 0000000000000000 00000000a3b008d8 0000000000000000
      [ 2057.572754]            00000000a3b008d8 cb923a29c779abc5 0000000000000000 00000000814cfd80
      [ 2057.572756]            000000000000012c 0000000000000000 00000000a3b008d8 00000000a3b008d8
      [ 2057.572758]            00000000bab6d500 00000000814cfd80 0000000091317e46 00000000814cfc68
      [ 2057.572762] Krnl Code:#0000000000000000: 0000                illegal
                               >0000000000000002: 0000                illegal
                                0000000000000004: 0000                illegal
                                0000000000000006: 0000                illegal
                                0000000000000008: 0000                illegal
                                000000000000000a: 0000                illegal
                                000000000000000c: 0000                illegal
                                000000000000000e: 0000                illegal
      [ 2057.572800] Call Trace:
      [ 2057.572801] ([<00000000ec639700>] 0xec639700)
      [ 2057.572803]  [<00000000913183e2>] net_rx_action+0x2ba/0x398
      [ 2057.572809]  [<0000000091515f76>] __do_softirq+0x11e/0x3a0
      [ 2057.572813]  [<0000000090ce160c>] do_softirq_own_stack+0x3c/0x58
      [ 2057.572817] ([<0000000090d2cbd6>] do_softirq.part.1+0x56/0x60)
      [ 2057.572822]  [<0000000090d2cc60>] __local_bh_enable_ip+0x80/0x98
      [ 2057.572825]  [<0000000091314706>] __dev_queue_xmit+0x2be/0xd70
      [ 2057.572827]  [<000003ff803dd6d6>] afiucv_hs_send+0x24e/0x300 [af_iucv]
      [ 2057.572830]  [<000003ff803dd88a>] iucv_send_ctrl+0x102/0x138 [af_iucv]
      [ 2057.572833]  [<000003ff803de72a>] iucv_sock_connect+0x37a/0x468 [af_iucv]
      [ 2057.572835]  [<00000000912e7e90>] __sys_connect+0xa0/0xd8
      [ 2057.572839]  [<00000000912e9580>] sys_socketcall+0x228/0x348
      [ 2057.572841]  [<0000000091514e1a>] system_call+0x2a6/0x2c8
      [ 2057.572843] Last Breaking-Event-Address:
      [ 2057.572844]  [<0000000091317e44>] __napi_poll+0x4c/0x1d8
      [ 2057.572846]
      [ 2057.572847] Kernel panic - not syncing: Fatal exception in interrupt
      -------------------------------------------------------------------------------------------
      
      Analysis:
      There is one napi structure per out_q: card->qdio.out_qs[i].napi
      The napi.poll functions are set during qeth_open().
      
      Since
      commit 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)")
      qeth_set_offline()/qeth_set_online() no longer call dev_close()/
      dev_open(). So if qeth_free_qdio_queues() cleared
      card->qdio.out_qs[i].napi.poll while the network interface was UP and the
      card was offline, they are not set again.
      
      Reproduction:
      chzdev -e $devno layer2=0
      ip link set dev $network_interface up
      echo 0 > /sys/bus/ccwgroup/devices/0.0.$devno/online
      echo foo > /sys/bus/ccwgroup/devices/0.0.$devno/hsuid
      echo 1 > /sys/bus/ccwgroup/devices/0.0.$devno/online
      -> Crash (can be enforced e.g. by af_iucv connect(), ip link down/up, ...)
      
      Note that a Completion Queue (CQ) is only enabled or disabled, when hsuid
      is set for the first time or when it is removed.
      
      Workarounds:
      - Set hsuid before setting the device online for the first time
      or
      - Use chzdev -d $devno; chzdev $devno hsuid=xxx; chzdev -e $devno;
      to set hsuid on an existing device. (this will remove and recreate the
      network interface)
      
      Fix:
      There is no need to free the output queues when a completion queue is
      added or removed.
      card->qdio.state now indicates whether the inbound buffer pool and the
      outbound queues are allocated.
      card->qdio.c_q indicates whether a CQ is allocated.
      
      Fixes: 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240430091004.2265683-1-wintera@linux.ibm.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • vxlan: Pull inner IP header in vxlan_rcv(). · f7789419
      Guillaume Nault authored
      Ensure the inner IP header is part of skb's linear data before reading
      its ECN bits. Otherwise we might read garbage.
      One symptom is the system erroneously logging errors like
      "vxlan: non-ECT from xxx.xxx.xxx.xxx with TOS=xxxx".
      
      Similar bugs have been fixed in geneve, ip_tunnel and ip6_tunnel (see
      commit 1ca1ba46 ("geneve: make sure to pull inner header in
      geneve_rx()") for example). So let's reuse the same code structure for
      consistency. Maybe we can add a common helper in the future.
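
      As a rough user-space illustration of why the check matters, the sketch
      below refuses to read the ECN bits unless the whole basic IPv4 header
      lies inside the linear area. toy_inner_ipv4_ecn() is a hypothetical
      helper, not the code added to vxlan_rcv(), and it only checks bounds
      where real code would try to pull the missing bytes or drop the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read the two ECN bits of an IPv4 header located at inner_off, but only
 * when the whole 20-byte basic header lies inside the linear area.
 * Returns 0..3 on success, -1 when the header is not fully linear.
 */
static int toy_inner_ipv4_ecn(const uint8_t *linear, size_t linear_len,
			      size_t inner_off)
{
	const size_t ipv4_hdr_len = 20;

	assert(linear != NULL);
	if (inner_off > linear_len || linear_len - inner_off < ipv4_hdr_len)
		return -1; /* reading here could return garbage */

	return linear[inner_off + 1] & 0x3; /* low two bits of the TOS byte */
}
```

      With a 40-byte linear area and an inner header starting at offset 30,
      the helper returns -1 instead of reading bytes that are not there.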
      
      Fixes: d342894c ("vxlan: virtual extensible lan")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/1239c8db54efec341dd6455c77e0380f58923a3c.1714495737.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tipc: fix a possible memleak in tipc_buf_append · 97bf6f81
      Xin Long authored
      __skb_linearize() doesn't free the skb when it fails, so move
      '*buf = NULL' after __skb_linearize(), so that the skb can be
      freed on the err path.
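
      The ordering issue can be modelled in user space. The following is a
      sketch only: toy_buf, toy_linearize() and toy_append() are hypothetical
      stand-ins, not the tipc code. It shows the caller's reference being
      cleared only after the fallible step has succeeded.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_buf { int linear; };

static int toy_frees; /* counts frees so the behaviour can be observed */

static void toy_free(struct toy_buf *b)
{
	if (b) {
		free(b);
		toy_frees++;
	}
}

/* Like the helper described above: on failure it does NOT free b. */
static int toy_linearize(struct toy_buf *b, int fail)
{
	if (fail)
		return -1;
	b->linear = 1;
	return 0;
}

/*
 * Hand *buf over to *head. The caller's reference is cleared only after
 * the fallible step succeeded, so the error path can still free it.
 */
static int toy_append(struct toy_buf **head, struct toy_buf **buf, int fail)
{
	struct toy_buf *frag = *buf;

	assert(frag != NULL);
	if (toy_linearize(frag, fail))
		goto err;
	*buf = NULL;    /* moved after the call that can fail */
	*head = frag;
	return 0;
err:
	toy_free(*buf); /* *buf is still set here, so nothing leaks */
	*buf = NULL;
	return -1;
}
```

      If '*buf = NULL' came before toy_linearize(), the err path would call
      toy_free(NULL) and the buffer would leak.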
      
      Fixes: b7df21cf ("tipc: skb_linearize the head skb when reassembling msgs")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Link: https://lore.kernel.org/r/90710748c29a1521efac4f75ea01b3b7e61414cf.1714485818.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tipc: fix UAF in error path · 080cbb89
      Paolo Abeni authored
      Sam Page (sam4k) working with Trend Micro Zero Day Initiative reported
      a UAF in the tipc_buf_append() error path:
      
      BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x47e/0x4c0
      linux/net/core/skbuff.c:1183
      Read of size 8 at addr ffff88804d2a7c80 by task poc/8034
      
      CPU: 1 PID: 8034 Comm: poc Not tainted 6.8.2 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.16.0-debian-1.16.0-5 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack linux/lib/dump_stack.c:88
       dump_stack_lvl+0xd9/0x1b0 linux/lib/dump_stack.c:106
       print_address_description linux/mm/kasan/report.c:377
       print_report+0xc4/0x620 linux/mm/kasan/report.c:488
       kasan_report+0xda/0x110 linux/mm/kasan/report.c:601
       kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183
       skb_release_data+0x5af/0x880 linux/net/core/skbuff.c:1026
       skb_release_all linux/net/core/skbuff.c:1094
       __kfree_skb linux/net/core/skbuff.c:1108
       kfree_skb_reason+0x12d/0x210 linux/net/core/skbuff.c:1144
       kfree_skb linux/./include/linux/skbuff.h:1244
       tipc_buf_append+0x425/0xb50 linux/net/tipc/msg.c:186
       tipc_link_input+0x224/0x7c0 linux/net/tipc/link.c:1324
       tipc_link_rcv+0x76e/0x2d70 linux/net/tipc/link.c:1824
       tipc_rcv+0x45f/0x10f0 linux/net/tipc/node.c:2159
       tipc_udp_recv+0x73b/0x8f0 linux/net/tipc/udp_media.c:390
       udp_queue_rcv_one_skb+0xad2/0x1850 linux/net/ipv4/udp.c:2108
       udp_queue_rcv_skb+0x131/0xb00 linux/net/ipv4/udp.c:2186
       udp_unicast_rcv_skb+0x165/0x3b0 linux/net/ipv4/udp.c:2346
       __udp4_lib_rcv+0x2594/0x3400 linux/net/ipv4/udp.c:2422
       ip_protocol_deliver_rcu+0x30c/0x4e0 linux/net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x2e4/0x520 linux/net/ipv4/ip_input.c:233
       NF_HOOK linux/./include/linux/netfilter.h:314
       NF_HOOK linux/./include/linux/netfilter.h:308
       ip_local_deliver+0x18e/0x1f0 linux/net/ipv4/ip_input.c:254
       dst_input linux/./include/net/dst.h:461
       ip_rcv_finish linux/net/ipv4/ip_input.c:449
       NF_HOOK linux/./include/linux/netfilter.h:314
       NF_HOOK linux/./include/linux/netfilter.h:308
       ip_rcv+0x2c5/0x5d0 linux/net/ipv4/ip_input.c:569
       __netif_receive_skb_one_core+0x199/0x1e0 linux/net/core/dev.c:5534
       __netif_receive_skb+0x1f/0x1c0 linux/net/core/dev.c:5648
       process_backlog+0x101/0x6b0 linux/net/core/dev.c:5976
       __napi_poll.constprop.0+0xba/0x550 linux/net/core/dev.c:6576
       napi_poll linux/net/core/dev.c:6645
       net_rx_action+0x95a/0xe90 linux/net/core/dev.c:6781
       __do_softirq+0x21f/0x8e7 linux/kernel/softirq.c:553
       do_softirq linux/kernel/softirq.c:454
       do_softirq+0xb2/0xf0 linux/kernel/softirq.c:441
       </IRQ>
       <TASK>
       __local_bh_enable_ip+0x100/0x120 linux/kernel/softirq.c:381
       local_bh_enable linux/./include/linux/bottom_half.h:33
       rcu_read_unlock_bh linux/./include/linux/rcupdate.h:851
       __dev_queue_xmit+0x871/0x3ee0 linux/net/core/dev.c:4378
       dev_queue_xmit linux/./include/linux/netdevice.h:3169
       neigh_hh_output linux/./include/net/neighbour.h:526
       neigh_output linux/./include/net/neighbour.h:540
       ip_finish_output2+0x169f/0x2550 linux/net/ipv4/ip_output.c:235
       __ip_finish_output linux/net/ipv4/ip_output.c:313
       __ip_finish_output+0x49e/0x950 linux/net/ipv4/ip_output.c:295
       ip_finish_output+0x31/0x310 linux/net/ipv4/ip_output.c:323
       NF_HOOK_COND linux/./include/linux/netfilter.h:303
       ip_output+0x13b/0x2a0 linux/net/ipv4/ip_output.c:433
       dst_output linux/./include/net/dst.h:451
       ip_local_out linux/net/ipv4/ip_output.c:129
       ip_send_skb+0x3e5/0x560 linux/net/ipv4/ip_output.c:1492
       udp_send_skb+0x73f/0x1530 linux/net/ipv4/udp.c:963
       udp_sendmsg+0x1a36/0x2b40 linux/net/ipv4/udp.c:1250
       inet_sendmsg+0x105/0x140 linux/net/ipv4/af_inet.c:850
       sock_sendmsg_nosec linux/net/socket.c:730
       __sock_sendmsg linux/net/socket.c:745
       __sys_sendto+0x42c/0x4e0 linux/net/socket.c:2191
       __do_sys_sendto linux/net/socket.c:2203
       __se_sys_sendto linux/net/socket.c:2199
       __x64_sys_sendto+0xe0/0x1c0 linux/net/socket.c:2199
       do_syscall_x64 linux/arch/x86/entry/common.c:52
       do_syscall_64+0xd8/0x270 linux/arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x6f/0x77 linux/arch/x86/entry/entry_64.S:120
      RIP: 0033:0x7f3434974f29
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48
      89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
      01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
      RSP: 002b:00007fff9154f2b8 EFLAGS: 00000212 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3434974f29
      RDX: 00000000000032c8 RSI: 00007fff9154f300 RDI: 0000000000000003
      RBP: 00007fff915532e0 R08: 00007fff91553360 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000212 R12: 000055ed86d261d0
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>
      
      In the critical scenario, either the relevant skb is freed or its
      ownership is transferred into a frag_list. In both cases, the cleanup
      code must not free it again: we need to clear the skb reference earlier.
      
      Fixes: 1149557d ("tipc: eliminate unnecessary linearization of incoming buffers")
      Cc: stable@vger.kernel.org
      Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-23852
      Acked-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/752f1ccf762223d109845365d07f55414058e5a3.1714484273.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 May, 2024 8 commits
  3. 29 Apr, 2024 13 commits
  4. 27 Apr, 2024 1 commit
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · b2ff42c6
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2024-04-26
      
      We've added 12 non-merge commits during the last 22 day(s) which contain
      a total of 14 files changed, 168 insertions(+), 72 deletions(-).
      
      The main changes are:
      
      1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page,
         from Puranjay Mohan.
      
      2) Fix a crash in XDP with devmap broadcast redirect when the latter map
         is in process of being torn down, from Toke Høiland-Jørgensen.
      
      3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF
         program runtime stats, from Xu Kuohai.
      
      4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue,
         from Jason Xing.
      
      5) Fix BPF verifier error message in resolve_pseudo_ldimm64,
         from Anton Protopopov.
      
      6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item,
         from Andrii Nakryiko.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64
        bpf, x86: Fix PROBE_MEM runtime load check
        bpf: verifier: prevent userspace memory access
        xdp: use flags field to disambiguate broadcast redirect
        arm32, bpf: Reimplement sign-extension mov instruction
        riscv, bpf: Fix incorrect runtime stats
        bpf, arm64: Fix incorrect runtime stats
        bpf: Fix a verifier verbose message
        bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
        MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers
        MAINTAINERS: Update email address for Puranjay Mohan
        bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition
      ====================
      
      Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 26 Apr, 2024 9 commits
    • Fix a potential infinite loop in extract_user_to_sg() · 6a30653b
      David Howells authored
      Fix extract_user_to_sg() so that it will break out of the loop if
      iov_iter_extract_pages() returns 0 rather than looping around forever.
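
      The loop shape can be sketched in user space as follows. This is an
      illustration under assumptions: toy_source, toy_extract() and
      toy_extract_all() are hypothetical and do not mirror the signature of
      iov_iter_extract_pages(). The point is only that a zero return has to
      end the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data source that hands out at most `chunk` bytes per call. */
struct toy_source {
	size_t remaining;
	size_t chunk;
};

/* Returns the number of bytes extracted; 0 means no progress is possible. */
static size_t toy_extract(struct toy_source *src, size_t max)
{
	size_t n = src->chunk;

	if (n > src->remaining)
		n = src->remaining;
	if (n > max)
		n = max;
	src->remaining -= n;
	return n;
}

/* Collect up to `want` bytes, stopping as soon as a call makes no progress. */
static size_t toy_extract_all(struct toy_source *src, size_t want)
{
	size_t done = 0;

	assert(src != NULL);
	while (done < want) {
		size_t n = toy_extract(src, want - done);

		if (n == 0)
			break; /* without this check the loop would never end */
		done += n;
	}
	return done;
}
```

      Asking for 16 bytes from a source that only holds 10 returns 10 rather
      than spinning once the source is exhausted.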
      
      [Note that I've included two fixes lines as the function got moved to a
      different file and renamed]
      
      Fixes: 85dd2c8f ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator")
      Fixes: f5f82cd1 ("Move netfs_extract_iter_to_sg() to lib/scatterlist.c")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Steve French <sfrench@samba.org>
      cc: Herbert Xu <herbert@gondor.apana.org.au>
      cc: netfs@lists.linux.dev
      Link: https://lore.kernel.org/r/1967121.1714034372@warthog.procyon.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'bpf-prevent-userspace-memory-access' · a86538a2
      Alexei Starovoitov authored
      Puranjay Mohan says:
      
      ====================
      bpf: prevent userspace memory access
      
      V5: https://lore.kernel.org/bpf/20240324185356.59111-1-puranjay12@gmail.com/
      Changes in V6:
      - Disable the verifier's instrumentation in x86-64 and update the JIT to
        take care of vsyscall page in addition to userspace addresses.
      - Update bpf_testmod to test for vsyscall addresses.
      
      V4: https://lore.kernel.org/bpf/20240321124640.8870-1-puranjay12@gmail.com/
      Changes in V5:
      - Use TASK_SIZE_MAX + PAGE_SIZE, VSYSCALL_ADDR as userspace boundary in
        x86-64 JIT.
      - Added Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      
      V3: https://lore.kernel.org/bpf/20240321120842.78983-1-puranjay12@gmail.com/
      Changes in V4:
      - Disable this feature on architectures that don't define
        CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
      - By doing the above, we don't need anything explicitly for s390x.
      
      V2: https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@gmail.com/
      Changes in V3:
      - Return 0 from bpf_arch_uaddress_limit() in disabled case because it
        returns u64.
      - Modify the check in verifier to not do instrumentation when uaddress_limit
        is 0.
      
      V1: https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@gmail.com/
      Changes in V2:
      - Disable this feature on s390x.
      
      With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To
      thwart invalid memory accesses, the JITs add an exception table entry for
      all such accesses. But in case the src_reg + offset is a userspace address,
      the BPF program might read that memory if the user has mapped it.
      
      x86-64 JIT already instruments the BPF_PROBE_MEM based loads with checks to
      skip loads from userspace addresses, but it doesn't check for the vsyscall page
      because it falls in the kernel address space but is considered a userspace
      page. The second patch in this series fixes the x86-64 JIT to also skip
      loads from the vsyscall page. The last patch updates the bpf_testmod so
      this address can be checked as part of the selftests.
      
      Other architectures don't have the complexity of the vsyscall address and
      just need to skip loads from the userspace. To make this more scalable and
      robust, the verifier is updated in the first patch to instrument
      BPF_PROBE_MEM to skip loads from the userspace addresses.
      ====================
      
      Link: https://lore.kernel.org/r/20240424100210.11982-1-puranjay@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 · 7cd6750d
      Puranjay Mohan authored
      The vsyscall is a legacy API for fast execution of system calls. It maps
      a page at address VSYSCALL_ADDR into the userspace program. This address
      is in the top 10MB of the address space:
      
      ffffffffff600000 - ffffffffff600fff |    4 kB | legacy vsyscall ABI
      
      The last commit fixes the x86-64 BPF JIT to skip accessing addresses in
      this memory region. Add this address to bpf_testmod_return_ptr() so we
      can make sure that it is fixed.
      
      After this change and without the previous commit, subprogs_extable
      selftest will crash the kernel.
      Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
      Link: https://lore.kernel.org/r/20240424100210.11982-4-puranjay@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x86: Fix PROBE_MEM runtime load check · b599d7d2
      Puranjay Mohan authored
      When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the
      address being loaded from is not necessarily valid. The BPF jit sets up
      exception handlers for each such load which catch page faults and 0 out
      the destination register.
      
      If the address for the load is outside kernel address space, the load
      will escape the exception handling and crash the kernel. To prevent this
      from happening, the JIT emits some instructions to verify that addr is > end
      of userspace addresses.
      
      x86 has a legacy vsyscall ABI where a page at address 0xffffffffff600000
      is mapped with user accessible permissions. The addresses in this page
      are considered userspace addresses by the fault handler. Therefore, a
      BPF program accessing this page will crash the kernel.
      
      This patch fixes the runtime checks to also check that the PROBE_MEM
      address is below VSYSCALL_ADDR.
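
      The combined check can be written as a single unsigned comparison. The
      sketch below is a user-space model, not the JIT code: the function and
      macro names are hypothetical, and the two constants are copied from the
      immediates in the JIT listing that follows in this commit message. The
      lower bound is the immediate from the BEFORE column, which the series
      describes as TASK_SIZE_MAX + PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants as they appear in the JIT listing of this commit message. */
#define TOY_VSYSCALL_ADDR 0xffffffffff600000ULL
#define TOY_UADDR_LIMIT   0x0100000000000000ULL /* immediate in the BEFORE column */

/* The difference wraps modulo 2^64 and matches the AFTER column's immediate. */
static_assert(TOY_UADDR_LIMIT - TOY_VSYSCALL_ADDR == 0x0100000000a00000ULL,
	      "threshold should match the listing");

/*
 * True when addr is above the userspace limit AND below VSYSCALL_ADDR.
 * Subtracting VSYSCALL_ADDR first moves everything from VSYSCALL_ADDR up
 * to the top of the address space down to small values, so one unsigned
 * "greater than" rejects both that range and userspace addresses.
 */
static bool toy_probe_mem_load_allowed(uint64_t addr)
{
	uint64_t shifted = addr - TOY_VSYSCALL_ADDR;
	uint64_t threshold = TOY_UADDR_LIMIT - TOY_VSYSCALL_ADDR;

	return shifted > threshold;
}
```

      In this model a NULL pointer and 0xffffffffff600000 are both rejected,
      while an address such as 0xffff888000000000 passes the check.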
      
      Example BPF program:
      
       SEC("fentry/tcp_v4_connect")
       int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
       {
      	*(volatile unsigned long *)&sk->sk_tsq_flags;
      	return 0;
       }
      
      BPF Assembly:
      
       0: (79) r1 = *(u64 *)(r1 +0)
       1: (79) r1 = *(u64 *)(r1 +344)
       2: (b7) r0 = 0
       3: (95) exit
      
      			       x86-64 JIT
      			       ==========
      
                  BEFORE                                    AFTER
      	    ------                                    -----
      
       0:   nopl   0x0(%rax,%rax,1)             0:   nopl   0x0(%rax,%rax,1)
       5:   xchg   %ax,%ax                      5:   xchg   %ax,%ax
       7:   push   %rbp                         7:   push   %rbp
       8:   mov    %rsp,%rbp                    8:   mov    %rsp,%rbp
       b:   mov    0x0(%rdi),%rdi               b:   mov    0x0(%rdi),%rdi
      -------------------------------------------------------------------------------
       f:   movabs $0x100000000000000,%r11      f:   movabs $0xffffffffff600000,%r10
      19:   add    $0x2a0,%rdi                 19:   mov    %rdi,%r11
      20:   cmp    %r11,%rdi                   1c:   add    $0x2a0,%r11
      23:   jae    0x0000000000000029          23:   sub    %r10,%r11
      25:   xor    %edi,%edi                   26:   movabs $0x100000000a00000,%r10
      27:   jmp    0x000000000000002d          30:   cmp    %r10,%r11
      29:   mov    0x0(%rdi),%rdi              33:   ja     0x0000000000000039
      --------------------------------\        35:   xor    %edi,%edi
      2d:   xor    %eax,%eax           \       37:   jmp    0x0000000000000040
      2f:   leave                       \      39:   mov    0x2a0(%rdi),%rdi
      30:   ret                          \--------------------------------------------
                                               40:   xor    %eax,%eax
                                               42:   leave
                                               43:   ret
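
      The comparison in the AFTER column can be modelled in plain C. This is
      a sketch of the arithmetic only, not the JIT code: the two constants
      are read off the disassembly above (the user-address limit depends on
      the kernel's paging configuration), and the function name is
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Start of the vsyscall page (first movabs in the AFTER column). */
#define VSYSCALL_START UINT64_C(0xffffffffff600000)
/* End of user addresses as loaded into %r11 in the BEFORE column. */
#define USER_LIMIT     UINT64_C(0x0100000000000000)

/* Returns true when the load would be performed, false when it is skipped. */
static bool probe_mem_load_allowed(uint64_t base, int32_t off)
{
	uint64_t addr = base + (uint64_t)(int64_t)off;   /* mov + add  */
	uint64_t shifted = addr - VSYSCALL_START;        /* sub        */
	uint64_t bound = USER_LIMIT - VSYSCALL_START;    /* 2nd movabs */

	/*
	 * After the subtraction, addresses at or above VSYSCALL_START wrap
	 * to small values and user addresses stay at or below bound, so a
	 * single unsigned "above" test rejects both ranges.
	 */
	return shifted > bound;                          /* cmp + ja   */
}
```

      With these constants, bound works out to 0x100000000a00000, the value
      seen in the second movabs.
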
      Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
      Link: https://lore.kernel.org/r/20240424100210.11982-3-puranjay@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b599d7d2
    • Puranjay Mohan's avatar
      bpf: verifier: prevent userspace memory access · 66e13b61
      Puranjay Mohan authored
      With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To
      thwart invalid memory accesses, the JITs add an exception table entry
      for all such accesses. But in case the src_reg + offset is a userspace
      address, the BPF program might read that memory if the user has
      mapped it.
      
      Make the verifier add guard instructions around such memory accesses and
      skip the load if the address falls into the userspace region.
      
      The JITs need to implement bpf_arch_uaddress_limit() to define where
      the userspace addresses end for that architecture or TASK_SIZE is taken
      as default.
      
      The implementation is as follows:
      
      REG_AX =  SRC_REG
      if(offset)
      	REG_AX += offset;
      REG_AX >>= 32;
      if (REG_AX <= (uaddress_limit >> 32))
      	DST_REG = 0;
      else
      	DST_REG = *(size *)(SRC_REG + offset);
      
      Comparing just the upper 32 bits of the load address with the upper
      32 bits of uaddress_limit implies that the values are being aligned down
      to a 4GB boundary before comparison.
      
      The above means that all loads with address <= uaddress_limit + 4GB are
      skipped. This is acceptable because there is a large hole (much larger
      than 4GB) between userspace and kernel space memory, therefore a
      correctly functioning BPF program should not access this 4GB memory
      above the userspace.
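
      The guard and its 4GB granularity can be tried out in userspace with a
      small C model. This is a sketch of the pseudocode above, not verifier
      code; the function name is hypothetical, and the limit used in the
      usage note is the x86-64 value that appears in the listings below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the guard: returns true when the load at src + off would be
 * executed, false when DST_REG would be zeroed instead.
 */
static bool guard_allows_load(uint64_t src, int16_t off, uint64_t uaddress_limit)
{
	uint64_t ax = src;                      /* REG_AX = SRC_REG    */

	if (off)
		ax += (uint64_t)(int64_t)off;   /* REG_AX += offset    */
	ax >>= 32;                              /* keep upper 32 bits  */

	/* Only the upper halves are compared, hence the 4GB granularity. */
	return ax > (uaddress_limit >> 32);
}
```

      With uaddress_limit = 0x800000000000, an address such as
      0x8000ffffffff is still skipped because its upper 32 bits equal
      0x8000, while 0x800100000000 is the first address that is loaded.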
      
      Let's analyze what this patch does to the following fentry program
      dereferencing an untrusted pointer:
      
        SEC("fentry/tcp_v4_connect")
        int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
        {
                      *(volatile long *)sk;
                      return 0;
        }
      
          BPF Program before              |           BPF Program after
          ------------------              |           -----------------
      
        0: (79) r1 = *(u64 *)(r1 +0)          0: (79) r1 = *(u64 *)(r1 +0)
        -----------------------------------------------------------------------
        1: (79) r1 = *(u64 *)(r1 +0) --\      1: (bf) r11 = r1
        ----------------------------\   \     2: (77) r11 >>= 32
        2: (b7) r0 = 0               \   \    3: (b5) if r11 <= 0x8000 goto pc+2
        3: (95) exit                  \   \-> 4: (79) r1 = *(u64 *)(r1 +0)
                                       \      5: (05) goto pc+1
                                        \     6: (b7) r1 = 0
                                         \--------------------------------------
                                              7: (b7) r0 = 0
                                              8: (95) exit
      
      As you can see from above, in the best case (off=0), 5 extra instructions
      are emitted.
      
      Now, we analyze the same program after it has gone through the JITs of
      the x86-64, ARM64, and RISC-V architectures. We follow the single load
      instruction that has the untrusted pointer and see what instrumentation
      has been added around it.
      
                                      x86-64 JIT
                                      ==========
           JIT's Instrumentation
                (upstream)
           ---------------------
      
         0:   nopl   0x0(%rax,%rax,1)
         5:   xchg   %ax,%ax
         7:   push   %rbp
         8:   mov    %rsp,%rbp
         b:   mov    0x0(%rdi),%rdi
        ---------------------------------
         f:   movabs $0x800000000000,%r11
        19:   cmp    %r11,%rdi
        1c:   jb     0x000000000000002a
        1e:   mov    %rdi,%r11
        21:   add    $0x0,%r11
        28:   jae    0x000000000000002e
        2a:   xor    %edi,%edi
        2c:   jmp    0x0000000000000032
        2e:   mov    0x0(%rdi),%rdi
        ---------------------------------
        32:   xor    %eax,%eax
        34:   leave
        35:   ret
      
      The x86-64 JIT already emits some instructions to protect against user
      memory access. This patch doesn't make any changes for the x86-64 JIT.
      
                                        ARM64 JIT
                                        =========
      
              No Instrumentation                      Verifier's Instrumentation
                 (upstream)                                  (This patch)
              -----------------                       --------------------------
      
         0:   add     x9, x30, #0x0                0:   add     x9, x30, #0x0
         4:   nop                                  4:   nop
         8:   paciasp                              8:   paciasp
         c:   stp     x29, x30, [sp, #-16]!        c:   stp     x29, x30, [sp, #-16]!
        10:   mov     x29, sp                     10:   mov     x29, sp
        14:   stp     x19, x20, [sp, #-16]!       14:   stp     x19, x20, [sp, #-16]!
        18:   stp     x21, x22, [sp, #-16]!       18:   stp     x21, x22, [sp, #-16]!
        1c:   stp     x25, x26, [sp, #-16]!       1c:   stp     x25, x26, [sp, #-16]!
        20:   stp     x27, x28, [sp, #-16]!       20:   stp     x27, x28, [sp, #-16]!
        24:   mov     x25, sp                     24:   mov     x25, sp
        28:   mov     x26, #0x0                   28:   mov     x26, #0x0
        2c:   sub     x27, x25, #0x0              2c:   sub     x27, x25, #0x0
        30:   sub     sp, sp, #0x0                30:   sub     sp, sp, #0x0
        34:   ldr     x0, [x0]                    34:   ldr     x0, [x0]
      --------------------------------------------------------------------------------
        38:   ldr     x0, [x0] ----------\        38:   add     x9, x0, #0x0
      -----------------------------------\\       3c:   lsr     x9, x9, #32
        3c:   mov     x7, #0x0            \\      40:   cmp     x9, #0x10, lsl #12
        40:   mov     sp, sp               \\     44:   b.ls    0x0000000000000050
        44:   ldp     x27, x28, [sp], #16   \\--> 48:   ldr     x0, [x0]
        48:   ldp     x25, x26, [sp], #16    \    4c:   b       0x0000000000000054
        4c:   ldp     x21, x22, [sp], #16     \   50:   mov     x0, #0x0
        50:   ldp     x19, x20, [sp], #16      \---------------------------------------
        54:   ldp     x29, x30, [sp], #16         54:   mov     x7, #0x0
        58:   add     x0, x7, #0x0                58:   mov     sp, sp
        5c:   autiasp                             5c:   ldp     x27, x28, [sp], #16
        60:   ret                                 60:   ldp     x25, x26, [sp], #16
        64:   nop                                 64:   ldp     x21, x22, [sp], #16
        68:   ldr     x10, 0x0000000000000070     68:   ldp     x19, x20, [sp], #16
        6c:   br      x10                         6c:   ldp     x29, x30, [sp], #16
                                                  70:   add     x0, x7, #0x0
                                                  74:   autiasp
                                                  78:   ret
                                                  7c:   nop
                                                  80:   ldr     x10, 0x0000000000000088
                                                  84:   br      x10
      
      There are 6 extra instructions added in ARM64 in the best case. This will
      become 7 in the worst case (off != 0).
      
                                 RISC-V JIT (RISCV_ISA_C Disabled)
                                 ==========
      
              No Instrumentation          Verifier's Instrumentation
                 (upstream)                      (This patch)
              -----------------           --------------------------
      
         0:   nop                            0:   nop
         4:   nop                            4:   nop
         8:   li      a6, 33                 8:   li      a6, 33
         c:   addi    sp, sp, -16            c:   addi    sp, sp, -16
        10:   sd      s0, 8(sp)             10:   sd      s0, 8(sp)
        14:   addi    s0, sp, 16            14:   addi    s0, sp, 16
        18:   ld      a0, 0(a0)             18:   ld      a0, 0(a0)
      ---------------------------------------------------------------
        1c:   ld      a0, 0(a0) --\         1c:   mv      t0, a0
      --------------------------\  \        20:   srli    t0, t0, 32
        20:   li      a5, 0      \  \       24:   lui     t1, 4096
        24:   ld      s0, 8(sp)   \  \      28:   sext.w  t1, t1
        28:   addi    sp, sp, 16   \  \     2c:   bgeu    t1, t0, 12
        2c:   sext.w  a0, a5        \  \--> 30:   ld      a0, 0(a0)
        30:   ret                    \      34:   j       8
                                      \     38:   li      a0, 0
                                       \------------------------------
                                            3c:   li      a5, 0
                                            40:   ld      s0, 8(sp)
                                            44:   addi    sp, sp, 16
                                            48:   sext.w  a0, a5
                                            4c:   ret
      
      There are 7 extra instructions added in RISC-V.
      
      Fixes: 80083428 ("bpf, arm64: Add BPF exception tables")
      Reported-by: Breno Leitao <leitao@debian.org>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
      Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      66e13b61
    • David Bauer's avatar
      net l2tp: drop flow hash on forward · 42f853b4
      David Bauer authored
      Drop the flow-hash of the skb when forwarding to the L2TP netdev.
      
      This prevents the L2TP qdisc from using the flow-hash from the outer
      packet, which is identical for every flow within the tunnel.
      
      This does not affect every platform; it is specific to the ethernet
      driver and depends on the platform including L4 information in the
      flow-hash.
      
      One such example is the Mediatek Filogic MT798x family of networking
      processors.
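
      The effect can be illustrated with a toy model in C. Everything here
      is hypothetical (the struct, the helpers, and the hash function); it
      only mimics the idea that a cached hash is reused until it is cleared,
      after which it is recomputed from the packet now being handled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy packet: not struct sk_buff, just the two fields the sketch needs. */
struct toy_pkt {
	uint32_t hash;        /* cached flow hash, 0 means "not computed"  */
	uint32_t inner_flow;  /* stand-in for the inner addresses and ports */
};

/* Forget the cached hash so that it is recomputed on next use. */
static void toy_clear_hash(struct toy_pkt *p)
{
	p->hash = 0;
}

/* Return the cached hash, computing it from the inner flow if missing. */
static uint32_t toy_get_hash(struct toy_pkt *p)
{
	if (!p->hash) {
		p->hash = p->inner_flow * UINT32_C(2654435761);
		if (!p->hash)
			p->hash = 1;  /* keep 0 reserved for "not computed" */
	}
	return p->hash;
}
```

      Two packets from different inner flows that arrive with the same
      hash from the outer header keep that shared value until
      toy_clear_hash() is called; afterwards each gets a hash derived from
      its own inner flow.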
      
      Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
      Acked-by: James Chapman <jchapman@katalix.com>
      Signed-off-by: David Bauer <mail@david-bauer.net>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240424171110.13701-1-mail@david-bauer.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      42f853b4
    • Kuniyuki Iwashima's avatar
      nsh: Restore skb->{protocol,data,mac_header} for outer header in nsh_gso_segment(). · 4b911a96
      Kuniyuki Iwashima authored
      syzbot triggered various splats (see [0] and links) by a crafted GSO
      packet of VIRTIO_NET_HDR_GSO_UDP layering the following protocols:
      
        ETH_P_8021AD + ETH_P_NSH + ETH_P_IPV6 + IPPROTO_UDP
      
      NSH can encapsulate IPv4, IPv6, Ethernet, NSH, and MPLS.  As the inner
      protocol can be Ethernet, NSH GSO handler, nsh_gso_segment(), calls
      skb_mac_gso_segment() to invoke inner protocol GSO handlers.
      
      nsh_gso_segment() does the following for the original skb before
      calling skb_mac_gso_segment()
      
        1. reset skb->network_header
        2. save the original skb->{mac_header,mac_len} in a local variable
        3. pull the NSH header
        4. reset skb->mac_header
        5. set up skb->mac_len and skb->protocol for the inner protocol.
      
      and does the following for the segmented skb
      
        6. set htons(ETH_P_NSH) to skb->protocol
        7. push the NSH header
        8. restore skb->mac_header
        9. set skb->mac_header + mac_len to skb->network_header
       10. restore skb->mac_len
      
      There are two problems in 6-7 and 8-9.
      
        (a)
        After 6 & 7, skb->data points to the NSH header, so the outer header
        (ETH_P_8021AD in this case) is stripped when skb is sent out of netdev.
      
        Also, if NSH is encapsulated by NSH + Ethernet (so NSH-Ethernet-NSH),
        skb_pull() in the first nsh_gso_segment() will make skb->data point
        to the middle of the outer NSH or Ethernet header because the Ethernet
        header is not pulled by the second nsh_gso_segment().
      
        (b)
        While restoring skb->{mac_header,network_header} in 8 & 9,
        nsh_gso_segment() does not assume that the data in the linear
        buffer is shifted.
      
        However, udp6_ufo_fragment() could shift the data and change
        skb->mac_header accordingly as demonstrated by syzbot.
      
        If this happens, even the restored skb->mac_header points to
        the middle of the outer header.
      
      It seems nsh_gso_segment() has never worked with outer headers so far.
      
      At the end of nsh_gso_segment(), the outer header must be restored for
      the segmented skb, instead of the NSH header.
      
      To do that, let's calculate the outer header position relative to
      the inner header and set skb->{data,mac_header,protocol} properly.
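
      The difference between restoring a saved position and computing it
      relative to the inner header can be shown with a toy model. The
      struct and helpers below are hypothetical stand-ins, not kernel code,
      and all offsets are made-up example values.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an skb: offsets are counted from the buffer start. */
struct toy_skb {
	size_t data;        /* where "skb->data" currently points */
	size_t mac_header;  /* where the outermost header starts  */
};

/* Old approach: push only the NSH header, restore a saved absolute offset. */
static void toy_restore_saved(struct toy_skb *seg, size_t nsh_len,
			      size_t saved_mac_header)
{
	seg->data -= nsh_len;
	seg->mac_header = saved_mac_header;
}

/* Relative approach: walk back from the inner header over NSH + outer. */
static void toy_restore_relative(struct toy_skb *seg, size_t nsh_len,
				 size_t outer_hlen)
{
	seg->data -= nsh_len + outer_hlen;
	seg->mac_header = seg->data;
}
```

      If an inner handler moves the headers inside the linear buffer, the
      saved absolute offset goes stale, whereas the relative computation
      still lands on the start of the outer header.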
      
      [0]:
      BUG: KMSAN: uninit-value in ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline]
      BUG: KMSAN: uninit-value in ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline]
      BUG: KMSAN: uninit-value in ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668
       ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline]
       ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline]
       ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668
       ipvlan_start_xmit+0x5c/0x1a0 drivers/net/ipvlan/ipvlan_main.c:222
       __netdev_start_xmit include/linux/netdevice.h:4989 [inline]
       netdev_start_xmit include/linux/netdevice.h:5003 [inline]
       xmit_one net/core/dev.c:3547 [inline]
       dev_hard_start_xmit+0x244/0xa10 net/core/dev.c:3563
       __dev_queue_xmit+0x33ed/0x51c0 net/core/dev.c:4351
       dev_queue_xmit include/linux/netdevice.h:3171 [inline]
       packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
       packet_snd net/packet/af_packet.c:3081 [inline]
       packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
       slab_post_alloc_hook mm/slub.c:3819 [inline]
       slab_alloc_node mm/slub.c:3860 [inline]
       __do_kmalloc_node mm/slub.c:3980 [inline]
       __kmalloc_node_track_caller+0x705/0x1000 mm/slub.c:4001
       kmalloc_reserve+0x249/0x4a0 net/core/skbuff.c:582
       __alloc_skb+0x352/0x790 net/core/skbuff.c:651
       skb_segment+0x20aa/0x7080 net/core/skbuff.c:4647
       udp6_ufo_fragment+0xcab/0x1150 net/ipv6/udp_offload.c:109
       ipv6_gso_segment+0x14be/0x2ca0 net/ipv6/ip6_offload.c:152
       skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53
       nsh_gso_segment+0x6f4/0xf70 net/nsh/nsh.c:108
       skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53
       __skb_gso_segment+0x4b0/0x730 net/core/gso.c:124
       skb_gso_segment include/net/gso.h:83 [inline]
       validate_xmit_skb+0x107f/0x1930 net/core/dev.c:3628
       __dev_queue_xmit+0x1f28/0x51c0 net/core/dev.c:4343
       dev_queue_xmit include/linux/netdevice.h:3171 [inline]
       packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
       packet_snd net/packet/af_packet.c:3081 [inline]
       packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      CPU: 1 PID: 5101 Comm: syz-executor421 Not tainted 6.8.0-rc5-syzkaller-00297-gf2e367d6 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      
      Fixes: c411ed85 ("nsh: add GSO support")
      Reported-and-tested-by: syzbot+42a0dc856239de4de60e@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=42a0dc856239de4de60e
      Reported-and-tested-by: syzbot+c298c9f0e46a3c86332b@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=c298c9f0e46a3c86332b
      Link: https://lore.kernel.org/netdev/20240415222041.18537-1-kuniyu@amazon.com/
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240424023549.21862-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4b911a96
    • Jakub Kicinski's avatar
      Merge branch 'ensure-the-copied-buf-is-nul-terminated' · a5b1051a
      Jakub Kicinski authored
      Bui Quang Minh says:
      
      ====================
      Ensure the copied buf is NUL terminated (part)
      
      I found that some drivers contain an out-of-bounds read pattern like this:
      
      	kern_buf = memdup_user(user_buf, count);
      	...
      	sscanf(kern_buf, ...);
      
      The sscanf here can be any other string-related function. Because
      memdup_user does not NUL terminate the copy, this pattern can lead to
      an out-of-bounds read of kern_buf in string-related functions.
      
      This series fixes the above issue by replacing memdup_user with
      memdup_user_nul.
      
      v1: https://lore.kernel.org/r/20240422-fix-oob-read-v1-0-e02854c30174@gmail.com
      ====================
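
      The difference between the two copies can be reproduced in userspace.
      The helpers below are hypothetical stand-ins written with malloc and
      memcpy; they only mimic the termination behaviour described in the
      cover letter and are not the kernel implementations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for memdup_user(): copies exactly len bytes, no terminator. */
static char *toy_memdup(const char *src, size_t len)
{
	char *p = malloc(len ? len : 1);

	if (p)
		memcpy(p, src, len);
	return p;
}

/* Stand-in for memdup_user_nul(): copies len bytes and appends a NUL. */
static char *toy_memdup_nul(const char *src, size_t len)
{
	char *p = malloc(len + 1);

	if (p) {
		memcpy(p, src, len);
		p[len] = '\0';
	}
	return p;
}
```

      Only the buffer returned by toy_memdup_nul() is safe to hand to
      string functions such as sscanf or strlen; the one from toy_memdup()
      has no terminator, so those functions would read past its end.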
      
      Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-0-f1f1b53a10f4@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a5b1051a
    • Bui Quang Minh's avatar
      octeontx2-af: avoid off-by-one read from userspace · f299ee70
      Bui Quang Minh authored
      We try to access count + 1 bytes from userspace with memdup_user(buffer,
      count + 1). However, userspace only provides a buffer of count bytes and
      only these count bytes are verified to be okay to access. To ensure the
      copied buffer is NUL terminated, we use memdup_user_nul instead.
      
      Fixes: 3a2eb515 ("octeontx2-af: Fix an off by one in rvu_dbg_qsize_write()")
      Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
      Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-6-f1f1b53a10f4@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f299ee70