1. 03 May, 2024 21 commits
  2. 02 May, 2024 19 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e958da0d
      Jakub Kicinski authored
      Cross-merge networking fixes after downstream PR.
      
      Conflicts:
      
      include/linux/filter.h
      kernel/bpf/core.c
        66e13b61 ("bpf: verifier: prevent userspace memory access")
        d503a04f ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
      https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/
      
      No adjacent changes.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 545c4944
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bpf.
      
        Relatively calm week, likely due to public holiday in most places. No
        known outstanding regressions.
      
        Current release - regressions:
      
         - rxrpc: fix wrong alignmask in __page_frag_alloc_align()
      
         - eth: e1000e: change usleep_range to udelay in PHY mdic access
      
        Previous releases - regressions:
      
         - gro: fix udp bad offset in socket lookup
      
         - bpf: fix incorrect runtime stat for arm64
      
         - tipc: fix UAF in error path
      
         - netfs: fix a potential infinite loop in extract_user_to_sg()
      
         - eth: ice: ensure the copied buf is NUL terminated
      
         - eth: qeth: fix kernel panic after setting hsuid
      
        Previous releases - always broken:
      
         - bpf:
             - verifier: prevent userspace memory access
             - xdp: use flags field to disambiguate broadcast redirect
      
         - bridge: fix multicast-to-unicast with fraglist GSO
      
         - mptcp: ensure snd_nxt is properly initialized on connect
      
         - nsh: fix outer header access in nsh_gso_segment().
      
         - eth: bcmgenet: fix racing registers access
      
         - eth: vxlan: fix stats counters.
      
        Misc:
      
         - a bunch of MAINTAINERS file updates"
      
      * tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
        MAINTAINERS: mark MYRICOM MYRI-10G as Orphan
        MAINTAINERS: remove Ariel Elior
        net: gro: add flush check in udp_gro_receive_segment
        net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
        ipv4: Fix uninit-value access in __ip_make_skb()
        s390/qeth: Fix kernel panic after setting hsuid
        vxlan: Pull inner IP header in vxlan_rcv().
        tipc: fix a possible memleak in tipc_buf_append
        tipc: fix UAF in error path
        rxrpc: Clients must accept conn from any address
        net: core: reject skb_copy(_expand) for fraglist GSO skbs
        net: bridge: fix multicast-to-unicast with fraglist GSO
        mptcp: ensure snd_nxt is properly initialized on connect
        e1000e: change usleep_range to udelay in PHY mdic access
        net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341
        cxgb4: Properly lock TX queue for the selftest.
        rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
        vxlan: Add missing VNI filter counter update in arp_reduce().
        vxlan: Fix racy device stats updates.
        net: qede: use return from qede_parse_actions()
        ...
    • Merge branch 'bnxt_en-updates-for-net-next' · dcc61472
      Jakub Kicinski authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next
      
      The first patch converts the sw_stats field in the completion
      ring structure to a pointer.  This allows the group of
      completion rings using the same MSIX to share the same sw_stats
      structure.  Prior to this, the correct completion ring must be
      used to count packets.
      
      The next four patches remove the RTNL lock when calling the RoCE
      driver for asynchronous stop and start during error recovery and
      firmware reset.  The RTNL lock is replaced with a private mutex
      used to synchronize RoCE register, unregister, stop, and start.
      
      The last patch adds VF PCI IDs for the 5760X chips.
      
      v2: Dropped patch #1 from v1.  Will work with David to get that
      patch in separately.
      ====================
      
      Link: https://lore.kernel.org/r/20240501003056.100607-1-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Add VF PCI ID for 5760X (P7) chips · 54d0b84f
      Ajit Khaparde authored
      No driver logic changes are required to support the VFs, so just add
      the VF PCI ID.
      Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
      Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-7-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Optimize recovery path ULP locking in the driver · 3c163f35
      Kalesh AP authored
      In the error recovery path (AER, firmware recovery, etc), the
      driver notifies the RoCE driver via ULP_STOP before the reset
      and via ULP_START after the reset, all under RTNL_LOCK.  The
      RoCE driver can take a long time if there are a lot of QPs to
      destroy, so it is not ideal to hold the global RTNL lock.
      
      Rely on the new en_dev_lock mutex instead for ULP_STOP and
      ULP_START.  For the most part, we move the ULP_STOP call before
      we take the RTNL lock and move the ULP_START after RTNL unlock.
      Note that SRIOV re-enablement must be done after ULP_START
      or RoCE on the VFs will not resume properly after reset.
      
      The one scenario in bnxt_hwrm_if_change() where the RTNL lock
      is already taken in the .ndo_open() context requires the ULP
      restart to be deferred to the bnxt_sp_task() workqueue.
      Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
      Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-6-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Add a mutex to synchronize ULP operations · de21ec44
      Kalesh AP authored
      The current scheme relies heavily on the RTNL lock for all ULP
      operations between the L2 and the RoCE driver.  Add a new en_dev_lock
      mutex so that the asynchronous ULP_STOP and ULP_START operations
      can be serialized with bnxt_register_dev() and bnxt_unregister_dev()
      calls without relying on the RTNL lock.  The next patch will remove
      the RTNL lock from the ULP_STOP and ULP_START calls.
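A minimal userspace sketch of the idea described above, not the driver's code: a private mutex serializes register/unregister against asynchronous stop/start, so no global lock is needed. The struct layout, the `registered`/`stopped` flags, and all function names other than the `en_dev_lock` field are hypothetical stand-ins, and `pthread_mutex_t` substitutes for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the state shared by L2 and RoCE. */
struct en_dev {
    pthread_mutex_t en_dev_lock; /* stands in for the new private mutex */
    bool registered;             /* RoCE driver is currently registered */
    bool stopped;                /* a stop notification has been delivered */
};

void en_dev_init(struct en_dev *edev)
{
    assert(edev != NULL);
    pthread_mutex_init(&edev->en_dev_lock, NULL);
    edev->registered = false;
    edev->stopped = false;
}

void register_dev(struct en_dev *edev)
{
    pthread_mutex_lock(&edev->en_dev_lock);
    edev->registered = true;
    edev->stopped = false;
    pthread_mutex_unlock(&edev->en_dev_lock);
}

void unregister_dev(struct en_dev *edev)
{
    pthread_mutex_lock(&edev->en_dev_lock);
    edev->registered = false;
    pthread_mutex_unlock(&edev->en_dev_lock);
}

/* Returns true only if a stop notification was actually delivered. */
bool ulp_stop(struct en_dev *edev)
{
    bool delivered = false;

    pthread_mutex_lock(&edev->en_dev_lock);
    if (edev->registered && !edev->stopped) {
        edev->stopped = true;
        delivered = true;
    }
    pthread_mutex_unlock(&edev->en_dev_lock);
    return delivered;
}

/* Returns true only if a start notification was actually delivered. */
bool ulp_start(struct en_dev *edev)
{
    bool delivered = false;

    pthread_mutex_lock(&edev->en_dev_lock);
    if (edev->registered && edev->stopped) {
        edev->stopped = false;
        delivered = true;
    }
    pthread_mutex_unlock(&edev->en_dev_lock);
    return delivered;
}
```

Because every path takes the same private lock, a stop or start can never observe a half-finished register or unregister, which is the property the RTNL lock was previously providing.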
      Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
      Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-5-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Don't call ULP_STOP/ULP_START during L2 reset · f79d7a9f
      Michael Chan authored
      There is no need to call ULP_STOP and ULP_START before and after the
      L2 reset in bnxt_reset_task().  This L2 reset is done after detecting
      TX timeout, RX ring errors, or VF config changes.  The L2 reset does
      not affect RoCE since the firmware is not reset and the backing store
      is left alone.
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-4-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Don't support offline self test when RoCE driver is loaded · 895621f1
      Kalesh AP authored
      Offline self test is a very disruptive operation for RoCE and requires
      all active QPs to be destroyed.  With a large number of QPs, it can
      take a long time to destroy all the QPs and can time out.  Do not allow
      ethtool offline self test if the RoCE driver is registered on the
      device.
      Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
      Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-3-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: share NQ ring sw_stats memory with subrings · a75fbb3a
      Edwin Peer authored
      On P5_PLUS chips and later, the NQ rings have subrings for RX and TX
      completions respectively. These subrings are passed to the poll
      function instead of the base NQ, but each ring carries its own
      copy of the software ring statistics.
      
      For stats to be conveniently accessible in __bnxt_poll_work(), the
      statistics memory should either be shared between the NQ and its
      subrings or the subrings need to be included in the ethtool stats
      aggregation logic. This patch opts for the former, because it's more
      efficient and less confusing having the software statistics for a
      ring exist in a single place.
      
      Before this patch, the counter will not be displayed if the "wrong"
      cpr->sw_stats was used to increment a counter.
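The sharing scheme described above can be sketched in plain C as follows. This is an illustrative model only: the struct and function names (`cp_ring`, `nq_ring`, `nq_ring_init`, `poll_work`) are hypothetical and much simpler than the driver's real structures. The point it shows is that once `sw_stats` is a pointer, the NQ and its subrings can all refer to a single statistics block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct sw_stats {
    uint64_t rx_packets;
    uint64_t tx_packets;
};

struct cp_ring {
    struct sw_stats *sw_stats; /* a pointer, so several rings can share it */
};

struct nq_ring {
    struct cp_ring base;   /* the NQ itself */
    struct cp_ring rx_sub; /* RX completion subring */
    struct cp_ring tx_sub; /* TX completion subring */
};

/* Allocate one stats block and point the NQ and both subrings at it. */
int nq_ring_init(struct nq_ring *nq)
{
    struct sw_stats *stats = calloc(1, sizeof(*stats));

    if (!stats)
        return -1;
    nq->base.sw_stats = stats;
    nq->rx_sub.sw_stats = stats;
    nq->tx_sub.sw_stats = stats;
    return 0;
}

void nq_ring_free(struct nq_ring *nq)
{
    free(nq->base.sw_stats);
    nq->base.sw_stats = NULL;
    nq->rx_sub.sw_stats = NULL;
    nq->tx_sub.sw_stats = NULL;
}

/* The poll path only sees a subring, yet the count lands in the shared block. */
void poll_work(struct cp_ring *cpr, int is_rx)
{
    assert(cpr->sw_stats != NULL);
    if (is_rx)
        cpr->sw_stats->rx_packets++;
    else
        cpr->sw_stats->tx_packets++;
}
```

With this layout, code that aggregates statistics can read `nq->base.sw_stats` and still see increments made through either subring, so there is no "wrong" ring to count on.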
      
      Link: https://lore.kernel.org/netdev/CACKFLikEhVAJA+osD7UjQNotdGte+fth7zOy7yDdLkTyFk9Pyw@mail.gmail.com/
      Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240501003056.100607-2-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · fc1fa5a0
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      i40e: cleanups & refactors
      
      Ivan Vecera says:
      
      This series does the following:
      Patch 1 - Removes write-only flags field from i40e_veb structure and
                from i40e_veb_setup() parameters
      Patch 2 - Refactors parameter of i40e_notify_client_of_l2_param_changes()
                and i40e_notify_client_of_netdev_close()
      Patch 3 - Refactors parameter of i40e_detect_recover_hung()
      Patch 4 - Adds helper i40e_pf_get_main_vsi() to get main VSI and uses it
                in existing code
      Patch 5 - Consolidates checks whether given VSI is the main one
      Patch 6 - Adds helper i40e_pf_get_main_veb() to get main VEB and uses it
                in existing code
      Patch 7 - Adds helper i40e_vsi_reconfig_tc() to reconfigure TC for a
                particular VSI and uses it to replace existing open-coded pieces
      
      * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        i40e: Add and use helper to reconfigure TC for given VSI
        i40e: Add helper to access main VEB
        i40e: Consolidate checks whether given VSI is main
        i40e: Add helper to access main VSI
        i40e: Refactor argument of i40e_detect_recover_hung()
        i40e: Refactor argument of several client notification functions
        i40e: Remove flags field from i40e_veb
      ====================
      
      Link: https://lore.kernel.org/r/20240430180639.1938515-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: unregister lockdep keys in qdisc_create/qdisc_alloc error path · 86735b57
      Davide Caratti authored
      Naresh and Eric report several errors (corrupted elements in the dynamic
      key hash list), when running tdc.py or syzbot. The error path of
      qdisc_alloc() and qdisc_create() frees the qdisc memory, but it forgets
      to unregister the lockdep key, thus causing use-after-free like the
      following one:
      
       ==================================================================
       BUG: KASAN: slab-use-after-free in lockdep_register_key+0x5f2/0x700
       Read of size 8 at addr ffff88811236f2a8 by task ip/7925
      
       CPU: 26 PID: 7925 Comm: ip Kdump: loaded Not tainted 6.9.0-rc2+ #648
       Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0  07/26/2013
       Call Trace:
        <TASK>
        dump_stack_lvl+0x7c/0xc0
        print_report+0xc9/0x610
        kasan_report+0x89/0xc0
        lockdep_register_key+0x5f2/0x700
        qdisc_alloc+0x21d/0xb60
        qdisc_create_dflt+0x63/0x3c0
        attach_one_default_qdisc.constprop.37+0x8e/0x170
        dev_activate+0x4bd/0xc30
        __dev_open+0x275/0x380
        __dev_change_flags+0x3f1/0x570
        dev_change_flags+0x7c/0x160
        do_setlink+0x1ea1/0x34b0
        __rtnl_newlink+0x8c9/0x1510
        rtnl_newlink+0x61/0x90
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
       RIP: 0033:0x7f9503f4fa07
       Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       RSP: 002b:00007fff6c729068 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 000000006630c681 RCX: 00007f9503f4fa07
       RDX: 0000000000000000 RSI: 00007fff6c7290d0 RDI: 0000000000000003
       RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000078
       R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000001
       R13: 00007fff6c729180 R14: 0000000000000000 R15: 000055bf67dd9040
        </TASK>
      
       Allocated by task 7745:
        kasan_save_stack+0x1c/0x40
        kasan_save_track+0x10/0x30
        __kasan_kmalloc+0x7b/0x90
        __kmalloc_node+0x1ff/0x460
        qdisc_alloc+0xae/0xb60
        qdisc_create+0xdd/0xfb0
        tc_modify_qdisc+0x37e/0x1960
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
      
       Freed by task 7745:
        kasan_save_stack+0x1c/0x40
        kasan_save_track+0x10/0x30
        kasan_save_free_info+0x36/0x60
        __kasan_slab_free+0xfe/0x180
        kfree+0x113/0x380
        qdisc_create+0xafb/0xfb0
        tc_modify_qdisc+0x37e/0x1960
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
      
      Fix this by ensuring that lockdep_unregister_key() is called before the
      qdisc struct is freed, also in the error path of qdisc_create() and
      qdisc_alloc().
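The ordering requirement can be illustrated with a small userspace model. Everything here is a hypothetical stand-in: the toy `registered_keys` table plays the role of lockdep's dynamic key list, and `qdisc_like_create()` only mimics the shape of an allocator whose error path must undo a registration before freeing the object that contains the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_KEYS 8

/* Toy registry standing in for a table of dynamically registered keys. */
static const void *registered_keys[MAX_KEYS];

static int key_register(const void *key)
{
    for (int i = 0; i < MAX_KEYS; i++) {
        if (!registered_keys[i]) {
            registered_keys[i] = key;
            return 0;
        }
    }
    return -1; /* table full */
}

static void key_unregister(const void *key)
{
    for (int i = 0; i < MAX_KEYS; i++) {
        if (registered_keys[i] == key) {
            registered_keys[i] = NULL;
            return;
        }
    }
    assert(!"unregistering a key that was never registered");
}

int keys_in_use(void)
{
    int n = 0;

    for (int i = 0; i < MAX_KEYS; i++)
        n += registered_keys[i] != NULL;
    return n;
}

/* Hypothetical object that embeds the key whose address gets registered. */
struct qdisc_like {
    int root_lock_key;
};

/* fail_late simulates an error that happens after the key was registered. */
struct qdisc_like *qdisc_like_create(int fail_late)
{
    struct qdisc_like *q = malloc(sizeof(*q));

    if (!q)
        return NULL;
    if (key_register(&q->root_lock_key) < 0)
        goto err_free;
    if (fail_late)
        goto err_unregister;
    return q;

err_unregister:
    /* Must happen before free(), or the table keeps a dangling pointer. */
    key_unregister(&q->root_lock_key);
err_free:
    free(q);
    return NULL;
}

void qdisc_like_destroy(struct qdisc_like *q)
{
    key_unregister(&q->root_lock_key);
    free(q);
}
```

Skipping the `err_unregister` step in this model leaves a stale pointer in the table, which is the userspace analogue of the use-after-free reported by KASAN in the trace above.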
      
      Fixes: af0cb3fa ("net/sched: fix false lockdep warning on qdisc root lock")
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Closes: https://lore.kernel.org/netdev/20240429221706.1492418-1-naresh.kamboju@linaro.org/
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/2aa1ca0c0a3aa0acc15925c666c777a4b5de553c.1714496886.git.dcaratti@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • MAINTAINERS: mark MYRICOM MYRI-10G as Orphan · 78cfe547
      Jakub Kicinski authored
      Chris's email address bounces and lore hasn't seen an email
      from anyone with his name for almost a decade.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240430233532.1356982-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • MAINTAINERS: remove Ariel Elior · c9ccbcd9
      Jakub Kicinski authored
      aelior@marvell.com bounces, we haven't seen Ariel on lore
      since March 2022.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20240430233305.1356105-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-gro-add-flush-flush_id-checks-and-fix-wrong-offset-in-udp' · a257f093
      Paolo Abeni authored
      Richard Gobert says:
      
      ====================
      net: gro: add flush/flush_id checks and fix wrong offset in udp
      
      This series fixes a bug in the complete phase of UDP in GRO, in which
      socket lookup fails due to using network_header when parsing encapsulated
      packets. The fix is to add network_offset and inner_network_offset to
      napi_gro_cb and use these offsets for socket lookup.
      
      In addition p->flush/flush_id should be checked in all UDP flows. The
      same logic from tcp_gro_receive is applied for all flows in
      udp_gro_receive_segment. This prevents packets with mismatching network
      headers (flush/flush_id turned on) from merging in UDP GRO.
      
      The original series includes a change to vxlan test which adds the local
      parameter to prevent similar future bugs. I plan to submit it separately to
      net-next.
      
      This series is part of a previously submitted series to net-next:
      https://lore.kernel.org/all/20240408141720.98832-1-richardbgobert@gmail.com/
      
      v3 -> v4:
       - Store network offsets, and use them only in udp_gro_complete flows
       - Correct commit hash used in Fixes tag
       - v3:
       https://lore.kernel.org/netdev/20240424163045.123528-1-richardbgobert@gmail.com/
      
      v2 -> v3:
       - Add network_offsets and fix udp bug in a single commit to make backporting easier
       - Write to inner_network_offset in {inet,ipv6}_gro_receive
       - Use network_offsets union in tcp[46]_gro_complete as well
       - v2:
       https://lore.kernel.org/netdev/20240419153542.121087-1-richardbgobert@gmail.com/
      
      v1 -> v2:
       - Use network_offsets instead of p_poff param as suggested by Willem
       - Check flush before postpull, and for all UDP GRO flows
       - v1:
       https://lore.kernel.org/netdev/20240412152120.115067-1-richardbgobert@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240430143555.126083-1-richardbgobert@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: gro: add flush check in udp_gro_receive_segment · 5babae77
      Richard Gobert authored
      GRO-GSO path is supposed to be transparent and as such L3 flush checks are
      relevant to all UDP flows merging in GRO. This patch uses the same logic
      and code from tcp_gro_receive, terminating merge if flush is non zero.
      
      Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.")
      Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb · 5ef31ea5
      Richard Gobert authored
      Commits a6024562 ("udp: Add GRO functions to UDP socket") and 57c67ff4 ("udp:
      additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the
      complete phase of gro. The functions always return skb->network_header,
      which in the case of encapsulated packets at the gro complete phase, is
      always set to the innermost L3 of the packet. That means that calling
      {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in
      gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_
      L3/L4 may return an unexpected value.
      
      This incorrect usage leads to a bug in GRO's UDP socket lookup.
      udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These
      *_hdr functions return network_header which will point to the innermost L3,
      resulting in the wrong offset being used in __udp{4,6}_lib_lookup with
      encapsulated packets.
      
      This patch adds network_offset and inner_network_offset to napi_gro_cb, and
      makes sure both are set correctly.
      
      To fix the issue, network_offsets union is used inside napi_gro_cb, in
      which both the outer and the inner network offsets are saved.
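A rough sketch of how such a union might look, under stated assumptions: the field names `network_offset`, `inner_network_offset` and `network_offsets` follow the commit text, while the `gro_cb` struct, the helper names, and the use of an `encap_mark` flag as the array index are illustrative guesses rather than a copy of the kernel's `napi_gro_cb`.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for the GRO control block. */
struct gro_cb {
    union {
        struct {
            uint16_t network_offset;       /* offset of the outer L3 header */
            uint16_t inner_network_offset; /* offset of the inner L3 header */
        };
        uint16_t network_offsets[2];       /* same storage, indexable */
    };
    uint8_t encap_mark; /* 0 while parsing outer headers, 1 inside a tunnel */
};

/* Record where the L3 header currently being parsed starts. */
void gro_record_l3(struct gro_cb *cb, uint16_t off)
{
    assert(cb->encap_mark <= 1);
    cb->network_offsets[cb->encap_mark] = off;
}

/* Called once the parser descends into an encapsulated packet. */
void gro_enter_tunnel(struct gro_cb *cb)
{
    cb->encap_mark = 1;
}

/* An outer-level socket lookup would use the outer offset, never the inner. */
uint16_t gro_outer_l3(const struct gro_cb *cb)
{
    return cb->network_offset;
}
```

Keeping both offsets means that a lookup for the outer UDP socket no longer depends on a single "current" network header that has already been advanced to the innermost L3 header.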
      
      Reproduction example:
      
      Endpoint configuration example (fou + local address bind)
      
          # ip fou add port 6666 ipproto 4
          # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip
          # ip link set tun1 up
          # ip a add 1.1.1.2/24 dev tun1
      
      Netperf TCP_STREAM result on net-next before patch is applied:
      
      net-next main, GRO enabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.28        2.37
      
      net-next main, GRO disabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.01     2745.06
      
      patch applied, GRO enabled:
          $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
          Recv   Send    Send
          Socket Socket  Message  Elapsed
          Size   Size    Size     Time     Throughput
          bytes  bytes   bytes    secs.    10^6bits/sec
      
          131072  16384  16384    5.01     2877.38
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ipv4: Fix uninit-value access in __ip_make_skb() · fc1092f5
      Shigeru Yoshida authored
      KMSAN reported uninit-value access in __ip_make_skb() [1].  __ip_make_skb()
      tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
      race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
      while __ip_make_skb() is running, the function will access icmphdr in the
      skb even if it is not included. This causes the issue reported by KMSAN.
      
      Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
      on the socket.
      
      Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not explicitly
      initialized. These are union members in struct flowi4 and are implicitly
      initialized by flowi4_init_output(), but we should not rely on a specific
      union layout.
      
      Initialize these explicitly in raw_sendmsg().
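The general pattern behind this fix can be sketched as below: read the mutable socket option once, record the decision in per-request state, and have later code consult only that snapshot. All names are hypothetical (`sock_like`, `flow_like`, `FLAG_KNOWN_NH` as a stand-in for FLOWI_FLAG_KNOWN_NH), and the assumption that the flag is derived from the header-include setting when the flow is built is mine, not something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_KNOWN_NH 0x1u /* hypothetical stand-in for FLOWI_FLAG_KNOWN_NH */

/* Socket state that another thread may change at any time. */
struct sock_like {
    volatile bool hdrincl;
};

/* Per-request flow state, private to the sender once initialized. */
struct flow_like {
    unsigned int flags;
    int icmp_type;
    int icmp_code;
};

/* Read the socket option exactly once and freeze the result in the flow. */
void flow_init(struct flow_like *fl, const struct sock_like *sk)
{
    bool hdrincl;

    assert(fl != NULL && sk != NULL);
    hdrincl = sk->hdrincl;
    fl->flags = hdrincl ? FLAG_KNOWN_NH : 0;
    /* Initialize explicitly instead of relying on union layout. */
    fl->icmp_type = 0;
    fl->icmp_code = 0;
}

/* Later stages ask the flow, not the socket, how the packet was built. */
bool flow_built_with_hdrincl(const struct flow_like *fl)
{
    return (fl->flags & FLAG_KNOWN_NH) != 0;
}
```

Since the later stage never re-reads the socket, a concurrent option change can only affect the next request, not one that is already being assembled.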
      
      [1]
      BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
       __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
       ip_finish_skb include/net/ip.h:243 [inline]
       ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
       raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
       inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x274/0x3c0 net/socket.c:745
       __sys_sendto+0x62c/0x7b0 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x130/0x200 net/socket.c:2199
       do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was created at:
       slab_post_alloc_hook mm/slub.c:3804 [inline]
       slab_alloc_node mm/slub.c:3845 [inline]
       kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
       kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
       __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
       alloc_skb include/linux/skbuff.h:1318 [inline]
       __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
       ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
       raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
       inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x274/0x3c0 net/socket.c:745
       __sys_sendto+0x62c/0x7b0 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x130/0x200 net/socket.c:2199
       do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb #25
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
      
      Fixes: 99e5acae ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • s390/qeth: Fix kernel panic after setting hsuid · 8a2e4d37
      Alexandra Winter authored
      Symptom:
      When the hsuid attribute is set for the first time on an IQD Layer3
      device while the corresponding network interface is already UP,
      the kernel will try to execute a napi function pointer that is NULL.
      
      Example:
      ---------------------------------------------------------------------------
      [ 2057.572696] illegal operation: 0001 ilc:1 [#1] SMP
      [ 2057.572702] Modules linked in: af_iucv qeth_l3 zfcp scsi_transport_fc sunrpc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink ghash_s390 prng xts aes_s390 des_s390 des_generic sha3_512_s390 sha3_256_s390 sha512_s390 vfio_ccw vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio ext4 mbcache jbd2 qeth_l2 bridge stp llc dasd_eckd_mod qeth dasd_mod qdio ccwgroup pkey zcrypt
      [ 2057.572739] CPU: 6 PID: 60182 Comm: stress_client Kdump: loaded Not tainted 4.18.0-541.el8.s390x #1
      [ 2057.572742] Hardware name: IBM 3931 A01 704 (LPAR)
      [ 2057.572744] Krnl PSW : 0704f00180000000 0000000000000002 (0x2)
      [ 2057.572748]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
      [ 2057.572751] Krnl GPRS: 0000000000000004 0000000000000000 00000000a3b008d8 0000000000000000
      [ 2057.572754]            00000000a3b008d8 cb923a29c779abc5 0000000000000000 00000000814cfd80
      [ 2057.572756]            000000000000012c 0000000000000000 00000000a3b008d8 00000000a3b008d8
      [ 2057.572758]            00000000bab6d500 00000000814cfd80 0000000091317e46 00000000814cfc68
      [ 2057.572762] Krnl Code:#0000000000000000: 0000                illegal
                               >0000000000000002: 0000                illegal
                                0000000000000004: 0000                illegal
                                0000000000000006: 0000                illegal
                                0000000000000008: 0000                illegal
                                000000000000000a: 0000                illegal
                                000000000000000c: 0000                illegal
                                000000000000000e: 0000                illegal
      [ 2057.572800] Call Trace:
      [ 2057.572801] ([<00000000ec639700>] 0xec639700)
      [ 2057.572803]  [<00000000913183e2>] net_rx_action+0x2ba/0x398
      [ 2057.572809]  [<0000000091515f76>] __do_softirq+0x11e/0x3a0
      [ 2057.572813]  [<0000000090ce160c>] do_softirq_own_stack+0x3c/0x58
      [ 2057.572817] ([<0000000090d2cbd6>] do_softirq.part.1+0x56/0x60)
      [ 2057.572822]  [<0000000090d2cc60>] __local_bh_enable_ip+0x80/0x98
      [ 2057.572825]  [<0000000091314706>] __dev_queue_xmit+0x2be/0xd70
      [ 2057.572827]  [<000003ff803dd6d6>] afiucv_hs_send+0x24e/0x300 [af_iucv]
      [ 2057.572830]  [<000003ff803dd88a>] iucv_send_ctrl+0x102/0x138 [af_iucv]
      [ 2057.572833]  [<000003ff803de72a>] iucv_sock_connect+0x37a/0x468 [af_iucv]
      [ 2057.572835]  [<00000000912e7e90>] __sys_connect+0xa0/0xd8
      [ 2057.572839]  [<00000000912e9580>] sys_socketcall+0x228/0x348
      [ 2057.572841]  [<0000000091514e1a>] system_call+0x2a6/0x2c8
      [ 2057.572843] Last Breaking-Event-Address:
      [ 2057.572844]  [<0000000091317e44>] __napi_poll+0x4c/0x1d8
      [ 2057.572846]
      [ 2057.572847] Kernel panic - not syncing: Fatal exception in interrupt
      -------------------------------------------------------------------------------------------
      
      Analysis:
      There is one napi structure per out_q: card->qdio.out_qs[i].napi
      The napi.poll functions are set during qeth_open().
      
      Since
      commit 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)")
      qeth_set_offline()/qeth_set_online() no longer call dev_close()/
      dev_open(). So if qeth_free_qdio_queues() cleared
      card->qdio.out_qs[i].napi.poll while the network interface was UP and the
      card was offline, they are not set again.
      
      Reproduction:
      chzdev -e $devno layer2=0
      ip link set dev $network_interface up
      echo 0 > /sys/bus/ccwgroup/devices/0.0.$devno/online
      echo foo > /sys/bus/ccwgroup/devices/0.0.$devno/hsuid
      echo 1 > /sys/bus/ccwgroup/devices/0.0.$devno/online
      -> Crash (can be enforced e.g. by af_iucv connect(), ip link down/up, ...)
      
      Note that a Completion Queue (CQ) is only enabled or disabled, when hsuid
      is set for the first time or when it is removed.
      
      Workarounds:
      - Set hsuid before setting the device online for the first time
      or
      - Use chzdev -d $devno; chzdev $devno hsuid=xxx; chzdev -e $devno;
      to set hsuid on an existing device. (this will remove and recreate the
      network interface)
      
      Fix:
      There is no need to free the output queues when a completion queue is
      added or removed.
      card->qdio.state now indicates whether the inbound buffer pool and the
      outbound queues are allocated.
      card->qdio.c_q indicates whether a CQ is allocated.
      
      Fixes: 1cfef80d ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240430091004.2265683-1-wintera@linux.ibm.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • vxlan: Pull inner IP header in vxlan_rcv(). · f7789419
      Guillaume Nault authored
      Ensure the inner IP header is part of skb's linear data before reading
      its ECN bits. Otherwise we might read garbage.
      One symptom is the system erroneously logging errors like
      "vxlan: non-ECT from xxx.xxx.xxx.xxx with TOS=xxxx".
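As a userspace illustration of "pull before read", the sketch below models a packet whose bytes are split between a directly readable linear area and an external fragment. The `pkt` struct and both helpers are hypothetical and only imitate the idea of making a header linear before dereferencing it; they are not the kernel's skb API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPV4_MIN_HDR 20   /* minimum IPv4 header length in bytes */
#define ECN_MASK     0x03 /* low two bits of the TOS byte */

/* Hypothetical two-part packet: only `linear` may be read directly. */
struct pkt {
    uint8_t linear[64];
    size_t linear_len;   /* valid bytes in `linear` */
    const uint8_t *frag; /* remaining bytes held outside the linear area */
    size_t frag_len;
};

/*
 * Ensure the first `len` bytes are in the linear area, copying from the
 * fragment if needed. Returns false if the packet is too short or the
 * request does not fit.
 */
bool pkt_may_pull(struct pkt *p, size_t len)
{
    size_t need;

    assert(p->linear_len <= sizeof(p->linear));
    if (len <= p->linear_len)
        return true;
    need = len - p->linear_len;
    if (len > sizeof(p->linear) || need > p->frag_len)
        return false;
    memcpy(p->linear + p->linear_len, p->frag, need);
    p->linear_len += need;
    p->frag += need;
    p->frag_len -= need;
    return true;
}

/* Returns the ECN bits (0..3) of the inner IPv4 header, or -1 on failure. */
int inner_ecn(struct pkt *p, size_t inner_off)
{
    if (!pkt_may_pull(p, inner_off + IPV4_MIN_HDR))
        return -1;
    return p->linear[inner_off + 1] & ECN_MASK;
}
```

Without the pull step, `p->linear[inner_off + 1]` could be read while the header still lives in the fragment, which corresponds to the "garbage" ECN value described above.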
      
      Similar bugs have been fixed in geneve, ip_tunnel and ip6_tunnel (see
      commit 1ca1ba46 ("geneve: make sure to pull inner header in
      geneve_rx()") for example). So let's reuse the same code structure for
      consistency. Maybe we can add a common helper in the future.
      
      Fixes: d342894c ("vxlan: virtual extensible lan")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/1239c8db54efec341dd6455c77e0380f58923a3c.1714495737.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>