1. 29 Mar, 2024 25 commits
    • selftest: tcp: Add v4-v4 and v6-v6 bind() conflict tests. · 5e9e9afd
      Kuniyuki Iwashima authored
      We don't have bind() conflict tests for the same protocol pairs.
      
      Let's add them except for the same address pair, which will be
      covered by the following patch adding 6 more bind() calls for
      each test case.
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240326204251.51301-6-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftest: tcp: Define the reverse order bind() tests explicitly. · 6f9bc755
      Kuniyuki Iwashima authored
      Currently, bind_wildcard.c calls bind() twice for two addresses and
      checks the pre-defined errno against the 2nd call.  Also, the two
      bind() calls are swapped to cover the various ways in which bind buckets
      are created.
      
      However, only testing two addresses is insufficient to detect regression.
      So, we will add more bind() calls, and then, we need to define different
      errno for each bind() per test case.
      
      As a preparation, let's define the reverse order bind() test cases as
      fixtures.
      
      No functional changes are intended.
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240326204251.51301-5-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftest: tcp: Make bind() selftest flexible. · c48baf56
      Kuniyuki Iwashima authored
      Currently, bind_wildcard.c tests only (IPv4, IPv6) pairs, but we will
      add more tests for the same protocol pairs.
      
      This patch makes it possible by changing the address pointer to void.
      
      No functional changes are intended.
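
To illustrate what "changing the address pointer to void" makes possible, here is a minimal user-space sketch. It is not the code from bind_wildcard.c: the struct `toy_bind_case` and the helper `toy_family()` are hypothetical names, and the only point is that one `const void *` plus a length can carry either a `sockaddr_in` or a `sockaddr_in6`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical test-case descriptor: the address is type-erased so the
 * same field can point at an IPv4 or an IPv6 socket address. */
struct toy_bind_case {
	const void *addr;
	socklen_t addrlen;
};

/* Recover the address family from the type-erased pointer. Every
 * sockaddr_* variant starts with the family field, so viewing it as a
 * generic struct sockaddr is enough for this purpose. */
static int toy_family(const struct toy_bind_case *c)
{
	assert(c != NULL && c->addr != NULL);
	return ((const struct sockaddr *)c->addr)->sa_family;
}
```

A test table built this way can mix (IPv4, IPv4), (IPv4, IPv6) and (IPv6, IPv6) pairs without a separate field per protocol.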
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240326204251.51301-4-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Fix bind() regression for v6-only wildcard and v4(-mapped-v6) non-wildcard addresses. · d91ef1e1
      Kuniyuki Iwashima authored
      Jianguo Wu reported another bind() regression introduced by bhash2.
      
      Calling bind() for the following 3 addresses on the same port, the
      3rd one should fail but now succeeds.
      
        1. 0.0.0.0 or ::ffff:0.0.0.0
        2. [::] w/ IPV6_V6ONLY
        3. IPv4 non-wildcard address or v4-mapped-v6 non-wildcard address
      
      The first two bind() create tb2 like this:
      
        bhash2 -> tb2(:: w/ IPV6_V6ONLY) -> tb2(0.0.0.0)
      
      The 3rd bind() matches the IPv6-only wildcard address bucket in
      inet_bind2_bucket_match_addr_any(), but no conflicting socket exists
      in that bucket.  So, inet_bhash2_conflict() returns false, and
      consequently inet_bhash2_addr_any_conflict() returns false as well.
      
      As a result, the 3rd bind() bypasses conflict check, which should be
      done against the IPv4 wildcard address bucket.
      
      So, in inet_bhash2_addr_any_conflict(), we must iterate over all buckets.
      
      Note that we cannot add an ipv6_only flag to inet_bind2_bucket as it
      would confuse the following pattern.
      
        1. [::] w/ SO_REUSE{ADDR,PORT} and IPV6_V6ONLY
        2. [::] w/ SO_REUSE{ADDR,PORT}
        3. IPv4 non-wildcard address or v4-mapped-v6 non-wildcard address
      
      The first bind() would create a bucket with ipv6_only flag true,
      the second bind() would add the [::] socket into the same bucket,
      and the third bind() could succeed based on the wrong assumption
      that ipv6_only bucket would not conflict with v4(-mapped-v6) address.
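
The reasoning above can be sketched with a toy model in user space. This is an assumption-laden simplification, not the kernel code: `struct toy_bucket` and all `toy_*` functions are hypothetical, reuse options are ignored, and each bucket only records how many of its sockets set IPV6_V6ONLY. It shows why inspecting a single matching bucket misses the conflict while walking all wildcard buckets finds it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a wildcard bucket on one port; not the kernel's
 * struct inet_bind2_bucket. */
struct toy_bucket {
	bool is_v6;     /* true: [::] bucket, false: 0.0.0.0 bucket */
	int nr_socks;   /* sockets bound in this bucket */
	int nr_v6only;  /* how many of them set IPV6_V6ONLY */
};

/* Would a new IPv4 (or v4-mapped-v6) non-wildcard bind conflict with
 * some socket in this bucket? Reuse options are ignored here. */
static bool toy_bucket_conflicts_v4(const struct toy_bucket *b)
{
	assert(b->nr_v6only <= b->nr_socks);
	if (!b->is_v6)
		return b->nr_socks > 0;
	/* Only sockets without IPV6_V6ONLY also cover IPv4. */
	return b->nr_socks > b->nr_v6only;
}

/* Behaviour described as buggy: look at one matching bucket only. */
static bool toy_conflict_first_match(const struct toy_bucket *tb, size_t n)
{
	return n > 0 && toy_bucket_conflicts_v4(&tb[0]);
}

/* Behaviour described as the fix: walk every wildcard bucket. */
static bool toy_conflict_all_buckets(const struct toy_bucket *tb, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (toy_bucket_conflicts_v4(&tb[i]))
			return true;
	return false;
}
```

In this model a bucket holding one v6-only socket and one dual-stack socket still conflicts, which is why a single per-bucket ipv6_only flag could not describe it.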
      
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Diagnosed-by: Jianguo Wu <wujianguo106@163.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240326204251.51301-3-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Fix bind() regression for v6-only wildcard and v4-mapped-v6 non-wildcard addresses. · ea111449
      Kuniyuki Iwashima authored
      Commit 5e07e672 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard
      address.") introduced bind() regression for v4-mapped-v6 address.
      
      When we bind() the following two addresses on the same port, the 2nd
      bind() should succeed but fails now.
      
        1. [::] w/ IPV6_V6ONLY
        2. ::ffff:127.0.0.1
      
      After the change, v4-mapped-v6 uses bhash2 instead of bhash to
      detect conflict faster, but I forgot to add a necessary change.
      
      During the 2nd bind(), inet_bind2_bucket_match_addr_any() returns
      the tb2 bucket of [::], and inet_bhash2_conflict() finally calls
      inet_bind_conflict(), which returns true, meaning conflict.
      
        inet_bhash2_addr_any_conflict
        |- inet_bind2_bucket_match_addr_any  <-- return [::] bucket
        `- inet_bhash2_conflict
           `- __inet_bhash2_conflict <-- checks IPV6_V6ONLY for AF_INET
              |                          but not for v4-mapped-v6 address
              `- inet_bind_conflict  <-- does not check address
      
      inet_bind_conflict() does not check socket addresses because
      __inet_bhash2_conflict() is expected to do so.
      
      However, it checks IPV6_V6ONLY attribute only against AF_INET
      socket, and not for v4-mapped-v6 address.
      
      As a result, v4-mapped-v6 address conflicts with v6-only wildcard
      address.
      
      To avoid that, let's add the missing test to use bhash2 for
      v4-mapped-v6 address.
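
As a rough illustration of the missing test, the sketch below models the decision in user space. The function `toy_conflicts_with_v6only_wildcard()` is hypothetical and ignores reuse options; it only assumes that a [::] socket with IPV6_V6ONLY must not conflict with an AF_INET address or with a v4-mapped-v6 address. `IN6_IS_ADDR_V4MAPPED` is the standard macro from `<netinet/in.h>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical model: does a socket bound to [::] with IPV6_V6ONLY
 * conflict with a new bind to the given address on the same port? */
static bool toy_conflicts_with_v6only_wildcard(int family,
					       const struct in6_addr *addr6)
{
	/* The check that already existed: plain IPv4 never conflicts. */
	if (family == AF_INET)
		return false;

	assert(family == AF_INET6 && addr6 != NULL);

	/* The check described as missing: a v4-mapped-v6 address is
	 * IPv4 traffic too, so it must not conflict either. */
	if (IN6_IS_ADDR_V4MAPPED(addr6))
		return false;

	/* Any other IPv6 address is covered by the [::] wildcard. */
	return true;
}
```

With the second check removed, `::ffff:127.0.0.1` would be reported as conflicting, which matches the regression described above.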
      
      Fixes: 5e07e672 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard address.")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240326204251.51301-2-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • erspan: make sure erspan_base_hdr is present in skb->head · 17af4205
      Eric Dumazet authored
      syzbot reported a problem in ip6erspan_rcv() [1]
      
      Issue is that ip6erspan_rcv() (and erspan_rcv()) no longer make
      sure erspan_base_hdr is present in skb linear part (skb->head)
      before getting @ver field from it.
      
      Add the missing pskb_may_pull() calls.
      
      v2: Reload iph pointer in erspan_rcv() after pskb_may_pull()
          because skb->head might have changed.
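
The "pull, then reload pointers" pattern can be sketched in user space. This is a toy model under stated assumptions, not the sk_buff API: `struct toy_pkt`, `toy_may_pull()` and `toy_read_byte()` are hypothetical, and the "fragment" is just a second byte array. The point is that a header byte is read only after enough bytes are known to be linear, and that the header pointer is taken from `head` after the pull because the pull may reallocate it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy packet: `head` owns the linear bytes, `frag` holds the rest. */
struct toy_pkt {
	unsigned char *head;
	size_t linear_len;
	const unsigned char *frag;
	size_t frag_len;
};

/* Make sure at least `len` bytes are linear. May reallocate `head`.
 * Returns false if the packet is too short or memory runs out. */
static bool toy_may_pull(struct toy_pkt *p, size_t len)
{
	assert(p != NULL);
	if (len <= p->linear_len)
		return true;
	if (len > p->linear_len + p->frag_len)
		return false;

	size_t need = len - p->linear_len;
	unsigned char *nh = realloc(p->head, len);
	if (!nh)
		return false;

	memcpy(nh + p->linear_len, p->frag, need);
	p->head = nh;
	p->linear_len = len;
	p->frag += need;
	p->frag_len -= need;
	return true;
}

/* Read the byte at `off`, pulling first. Returns -1 on a short packet. */
static int toy_read_byte(struct toy_pkt *p, size_t off)
{
	if (!toy_may_pull(p, off + 1))
		return -1;
	/* Reload from p->head: a pointer saved before the pull may be stale. */
	const unsigned char *hdr = p->head;
	return hdr[off];
}
```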
      
      [1]
      
       BUG: KMSAN: uninit-value in pskb_may_pull_reason include/linux/skbuff.h:2742 [inline]
       BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2756 [inline]
       BUG: KMSAN: uninit-value in ip6erspan_rcv net/ipv6/ip6_gre.c:541 [inline]
       BUG: KMSAN: uninit-value in gre_rcv+0x11f8/0x1930 net/ipv6/ip6_gre.c:610
        pskb_may_pull_reason include/linux/skbuff.h:2742 [inline]
        pskb_may_pull include/linux/skbuff.h:2756 [inline]
        ip6erspan_rcv net/ipv6/ip6_gre.c:541 [inline]
        gre_rcv+0x11f8/0x1930 net/ipv6/ip6_gre.c:610
        ip6_protocol_deliver_rcu+0x1d4c/0x2ca0 net/ipv6/ip6_input.c:438
        ip6_input_finish net/ipv6/ip6_input.c:483 [inline]
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip6_input+0x15d/0x430 net/ipv6/ip6_input.c:492
        ip6_mc_input+0xa7e/0xc80 net/ipv6/ip6_input.c:586
        dst_input include/net/dst.h:460 [inline]
        ip6_rcv_finish+0x955/0x970 net/ipv6/ip6_input.c:79
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ipv6_rcv+0xde/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5538 [inline]
        __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5652
        netif_receive_skb_internal net/core/dev.c:5738 [inline]
        netif_receive_skb+0x58/0x660 net/core/dev.c:5798
        tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1549
        tun_get_user+0x5566/0x69e0 drivers/net/tun.c:2002
        tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
        call_write_iter include/linux/fs.h:2108 [inline]
        new_sync_write fs/read_write.c:497 [inline]
        vfs_write+0xb63/0x1520 fs/read_write.c:590
        ksys_write+0x20f/0x4c0 fs/read_write.c:643
        __do_sys_write fs/read_write.c:655 [inline]
        __se_sys_write fs/read_write.c:652 [inline]
        __x64_sys_write+0x93/0xe0 fs/read_write.c:652
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:3804 [inline]
        slab_alloc_node mm/slub.c:3845 [inline]
        kmem_cache_alloc_node+0x613/0xc50 mm/slub.c:3888
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:577
        __alloc_skb+0x35b/0x7a0 net/core/skbuff.c:668
        alloc_skb include/linux/skbuff.h:1318 [inline]
        alloc_skb_with_frags+0xc8/0xbf0 net/core/skbuff.c:6504
        sock_alloc_send_pskb+0xa81/0xbf0 net/core/sock.c:2795
        tun_alloc_skb drivers/net/tun.c:1525 [inline]
        tun_get_user+0x209a/0x69e0 drivers/net/tun.c:1846
        tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
        call_write_iter include/linux/fs.h:2108 [inline]
        new_sync_write fs/read_write.c:497 [inline]
        vfs_write+0xb63/0x1520 fs/read_write.c:590
        ksys_write+0x20f/0x4c0 fs/read_write.c:643
        __do_sys_write fs/read_write.c:655 [inline]
        __se_sys_write fs/read_write.c:652 [inline]
        __x64_sys_write+0x93/0xe0 fs/read_write.c:652
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      CPU: 1 PID: 5045 Comm: syz-executor114 Not tainted 6.9.0-rc1-syzkaller-00021-g96249052 #0
      
      Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup")
      Reported-by: syzbot+1c1cf138518bf0c53d68@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/000000000000772f2c0614b66ef7@google.com/
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Lorenzo Bianconi <lorenzo@kernel.org>
      Link: https://lore.kernel.org/r/20240328112248.1101491-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • r8169: skip DASH fw status checks when DASH is disabled · 5e864d90
      Atlas Yu authored
      On devices that support DASH, the current code in the "rtl_loop_wait" function
      raises false alarms when DASH is disabled. This occurs because the function
      attempts to wait for the DASH firmware to be ready, even though it's not
      relevant in this case.
      
      r8169 0000:0c:00.0 eth0: RTL8168ep/8111ep, 38:7c:76:49:08:d9, XID 502, IRQ 86
      r8169 0000:0c:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
      r8169 0000:0c:00.0 eth0: DASH disabled
      ...
      r8169 0000:0c:00.0 eth0: rtl_ep_ocp_read_cond == 0 (loop: 30, delay: 10000).
      
      This patch modifies the driver start/stop functions to skip checking the DASH
      firmware status when DASH is explicitly disabled. This prevents unnecessary
      delays and false alarms.
      
      The patch has been tested on several ThinkStation P8/PX workstations.
      
      Fixes: 0ab0c45d ("r8169: add handling DASH when DASH is disabled")
      Signed-off-by: Atlas Yu <atlas.yu@canonical.com>
      Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
      Link: https://lore.kernel.org/r/20240328055152.18443-1-atlas.yu@canonical.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeontx2-pf: check negative error code in otx2_open() · e709acbd
      Su Hui authored
      otx2_rxtx_enable() returns a negative error code such as -EIO,
      check -EIO rather than EIO to fix this problem.
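
A small user-space sketch of the sign mistake follows. The function `toy_rxtx_enable()` is a hypothetical stand-in that follows the kernel convention of returning 0 or a negative errno; it is not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel-style call: 0 on success,
 * negative errno on failure. */
static int toy_rxtx_enable(bool hw_ok)
{
	return hw_ok ? 0 : -EIO;
}

/* Buggy comparison: a negative return value never equals positive EIO. */
static bool toy_is_io_error_buggy(int err)
{
	assert(err <= 0);
	return err == EIO;
}

/* Corrected comparison against the negative error code. */
static bool toy_is_io_error_fixed(int err)
{
	assert(err <= 0);
	return err == -EIO;
}
```

The buggy variant compiles cleanly and simply never takes the error branch, which is why this class of mistake is easy to miss.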
      
      Fixes: c9262522 ("octeontx2-pf: Disable packet I/O for graceful exit")
      Signed-off-by: Su Hui <suhui@nfschina.com>
      Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Link: https://lore.kernel.org/r/20240328020620.4054692-1-suhui@nfschina.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: do not consume a cacheline for system_page_pool · 5086f0fe
      Eric Dumazet authored
      There is no reason to consume a full cacheline to store system_page_pool.
      
      We can eventually move it to softnet_data later for full locality control.
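
To make the cost concrete, the sketch below compares a pointer slot that is aligned to a cacheline with a plain pointer slot. It is a user-space illustration only: the 64-byte line size, `struct toy_pool` and both slot types are assumptions, and the kernel's per-CPU macros are not used here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHELINE 64	/* assumed cacheline size in bytes */

struct toy_pool;		/* opaque, only pointers are stored */

/* One pointer forced onto its own cacheline. */
struct toy_slot_aligned {
	alignas(TOY_CACHELINE) struct toy_pool *pool;
};

/* The same pointer with natural alignment. */
struct toy_slot_plain {
	struct toy_pool *pool;
};

static_assert(sizeof(struct toy_slot_aligned) == TOY_CACHELINE,
	      "aligned slot is padded to a full line");

/* Bytes saved per CPU by dropping the cacheline alignment. */
static size_t toy_bytes_saved_per_cpu(void)
{
	return sizeof(struct toy_slot_aligned) - sizeof(struct toy_slot_plain);
}
```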
      
      Fixes: 2b0cfa6e ("net: add generic percpu page_pool allocator")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Lorenzo Bianconi <lorenzo@kernel.org>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Link: https://lore.kernel.org/r/20240328173448.2262593-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 50ba9d7e
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-03-26 (i40e)
      
      This series contains updates to i40e driver only.
      
      Ivan Vecera resolves an issue where descriptors could be missed when
      exiting busy poll.
      
      Aleksandr corrects counting of MAC filters to only include new or active
      filters and resolves possible use of incorrect/stale 'vf' variable.
      
      * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        i40e: fix vf may be used uninitialized in this function warning
        i40e: fix i40e_count_filters() to count only active/new filters
        i40e: Enforce software interrupt during busy-poll exit
      ====================
      
      Link: https://lore.kernel.org/r/20240326162358.1224145-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/rds: fix possible cp null dereference · 62fc3357
      Mahmoud Adam authored
      cp might be NULL; calling cp->cp_conn would then produce a NULL pointer dereference.
      
      [Simon Horman adds:]
      
      Analysis:
      
      * cp is a parameter of __rds_rdma_map and is not reassigned.
      
      * The following call-sites pass a NULL cp argument to __rds_rdma_map()
      
        - rds_get_mr()
        - rds_get_mr_for_dest
      
      * Prior to the code above, the following assumes that cp may be NULL
        (which is indicative, but could itself be unnecessary)
      
      	trans_private = rs->rs_transport->get_mr(
      		sg, nents, rs, &mr->r_key, cp ? cp->cp_conn : NULL,
      		args->vec.addr, args->vec.bytes,
      		need_odp ? ODP_ZEROBASED : ODP_NOT_NEEDED);
      
      * The code modified by this patch is guarded by IS_ERR(trans_private),
        where trans_private is assigned as per the previous point in this analysis.
      
        The only implementation of get_mr that I could locate is rds_ib_get_mr()
        which can return an ERR_PTR if the conn (5th) argument is NULL.
      
      * ret is set to PTR_ERR(trans_private).
        rds_ib_get_mr can return ERR_PTR(-ENODEV) if the conn (5th) argument is NULL.
        Thus ret may be -ENODEV in which case the code in question will execute.
      
      Conclusion:
      * cp may be NULL at the point where this patch adds a check;
        this patch does seem to address a possible bug
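
The analysis above boils down to guarding the dereference on the error path. The sketch below is a user-space model with hypothetical types (`toy_conn`, `toy_conn_path`) and a hypothetical handler; it mirrors the `cp ? cp->cp_conn : NULL` idiom quoted above and adds the NULL check before `cp` is used.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the connection and connection path. */
struct toy_conn {
	int reconnects;
};

struct toy_conn_path {
	struct toy_conn *cp_conn;
};

/* Same idiom as in the quoted call: tolerate a NULL path. */
static struct toy_conn *toy_conn_of(struct toy_conn_path *cp)
{
	return cp ? cp->cp_conn : NULL;
}

/* Hypothetical error path. Returns 1 when a reconnect was requested.
 * Without the `cp` test, a NULL path with ret == -ENODEV would be
 * dereferenced. */
static int toy_handle_get_mr_error(struct toy_conn_path *cp, int ret)
{
	if (ret == -ENODEV && cp) {
		assert(cp->cp_conn != NULL);
		cp->cp_conn->reconnects++;
		return 1;
	}
	return 0;
}
```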
      
      Fixes: c055fc00 ("net/rds: fix WARNING in rds_conn_connect_if_down")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Mahmoud Adam <mngyadam@amazon.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240326153132.55580-1-mngyadam@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: fix usable ports on 88e6020 · 625aefac
      Michael Krummsdorf authored
      The switch has 4 ports with 2 internal PHYs, but ports are numbered up
      to 6, with ports 0, 1, 5 and 6 being usable.
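
One way to express "numbered up to 6, but only 0, 1, 5 and 6 usable" is a port count plus a mask of invalid ports. The sketch below is illustrative only; the macro and function names are hypothetical and this is not the driver's table entry.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PORTS 7u	/* ports are numbered 0..6 */
#define TOY_INVALID_PORT_MASK ((1u << 2) | (1u << 3) | (1u << 4))

static_assert(TOY_NUM_PORTS <= 32, "port mask must fit in 32 bits");

/* A port is usable when it is in range and not masked out. */
static bool toy_port_usable(unsigned int port)
{
	if (port >= TOY_NUM_PORTS)
		return false;
	return (TOY_INVALID_PORT_MASK & (1u << port)) == 0;
}

/* Count the usable ports; the text above says there are four. */
static unsigned int toy_count_usable(void)
{
	unsigned int n = 0;

	for (unsigned int p = 0; p < TOY_NUM_PORTS; p++)
		n += toy_port_usable(p);
	return n;
}
```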
      
      Fixes: 71d94a43 ("net: dsa: mv88e6xxx: add support for MV88E6020 switch")
      Signed-off-by: Michael Krummsdorf <michael.krummsdorf@tq-group.com>
      Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240326123655.40666-1-matthias.schiffer@ew.tq-group.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxbf_gige: stop interface during shutdown · 09ba28e1
      David Thompson authored
      The mlxbf_gige driver intermittently encounters a NULL pointer
      exception while the system is shutting down via the "reboot" command.
      The mlxbf_driver will experience an exception right after executing
      its shutdown() method.  One example of this exception is:
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000070
      Mem abort info:
        ESR = 0x0000000096000004
        EC = 0x25: DABT (current EL), IL = 32 bits
        SET = 0, FnV = 0
        EA = 0, S1PTW = 0
        FSC = 0x04: level 0 translation fault
      Data abort info:
        ISV = 0, ISS = 0x00000004
        CM = 0, WnR = 0
      user pgtable: 4k pages, 48-bit VAs, pgdp=000000011d373000
      [0000000000000070] pgd=0000000000000000, p4d=0000000000000000
      Internal error: Oops: 96000004 [#1] SMP
      CPU: 0 PID: 13 Comm: ksoftirqd/0 Tainted: G S         OE     5.15.0-bf.6.gef6992a #1
      Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS 4.0.2.12669 Apr 21 2023
      pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : mlxbf_gige_handle_tx_complete+0xc8/0x170 [mlxbf_gige]
      lr : mlxbf_gige_poll+0x54/0x160 [mlxbf_gige]
      sp : ffff8000080d3c10
      x29: ffff8000080d3c10 x28: ffffcce72cbb7000 x27: ffff8000080d3d58
      x26: ffff0000814e7340 x25: ffff331cd1a05000 x24: ffffcce72c4ea008
      x23: ffff0000814e4b40 x22: ffff0000814e4d10 x21: ffff0000814e4128
      x20: 0000000000000000 x19: ffff0000814e4a80 x18: ffffffffffffffff
      x17: 000000000000001c x16: ffffcce72b4553f4 x15: ffff80008805b8a7
      x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
      x11: 7f7f7f7f7f7f7f7f x10: c2ac898b17576267 x9 : ffffcce720fa5404
      x8 : ffff000080812138 x7 : 0000000000002e9a x6 : 0000000000000080
      x5 : ffff00008de3b000 x4 : 0000000000000000 x3 : 0000000000000001
      x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
      Call trace:
       mlxbf_gige_handle_tx_complete+0xc8/0x170 [mlxbf_gige]
       mlxbf_gige_poll+0x54/0x160 [mlxbf_gige]
       __napi_poll+0x40/0x1c8
       net_rx_action+0x314/0x3a0
       __do_softirq+0x128/0x334
       run_ksoftirqd+0x54/0x6c
       smpboot_thread_fn+0x14c/0x190
       kthread+0x10c/0x110
       ret_from_fork+0x10/0x20
      Code: 8b070000 f9000ea0 f95056c0 f86178a1 (b9407002)
      ---[ end trace 7cc3941aa0d8e6a4 ]---
      Kernel panic - not syncing: Oops: Fatal exception in interrupt
      Kernel Offset: 0x4ce722520000 from 0xffff800008000000
      PHYS_OFFSET: 0x80000000
      CPU features: 0x000005c1,a3330e5a
      Memory Limit: none
      ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
      
      During system shutdown, the mlxbf_gige driver's shutdown() is always executed.
      However, the driver's stop() method will only execute if networking interface
      configuration logic within the Linux distribution has been set up to do so.
      
      If shutdown() executes but stop() does not execute, NAPI remains enabled
      and this can lead to an exception if NAPI is scheduled while the hardware
      interface has only been partially deinitialized.
      
      The networking interface managed by the mlxbf_gige driver must be properly
      stopped during system shutdown so that IFF_UP is cleared, the hardware
      interface is put into a clean state, and NAPI is fully deinitialized.
      
      Fixes: f92e1869 ("Add Mellanox BlueField Gigabit Ethernet driver")
      Signed-off-by: David Thompson <davthompson@nvidia.com>
      Link: https://lore.kernel.org/r/20240325210929.25362-1-davthompson@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeontx2-af: Fix issue with loading coalesced KPU profiles · 0ba80d96
      Hariprasad Kelam authored
      The current implementation for loading coalesced KPU profiles has
      a limitation.  The "offset" field, which is used to locate profiles
      within the profile is restricted to a u16.
      
      This restricts the number of profiles that can be loaded. This patch
      addresses this limitation by increasing the size of the "offset" field.
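
The effect of a 16-bit offset can be shown with plain integer arithmetic. In this sketch the helper names are hypothetical; it only demonstrates that an offset beyond 65535 silently wraps when stored in a `uint16_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static_assert(UINT16_MAX == 65535, "u16 holds at most 65535");

/* Store an offset in 16 bits; larger values wrap modulo 65536. */
static uint16_t toy_store_offset_u16(size_t off)
{
	return (uint16_t)off;
}

/* Check whether an offset survives the narrowing unchanged. */
static bool toy_offset_fits_u16(size_t off)
{
	return off <= UINT16_MAX;
}
```

For example, an offset of 70000 is stored as 70000 - 65536 = 4464, so a lookup would land in the wrong place without any error being reported.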
      
      Fixes: 11c730bf ("octeontx2-af: support for coalescing KPU profiles")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'gro-fixes' · ad69a730
      David S. Miller authored
      Antoine Tenart says:
      
      ====================
      gro: various fixes related to UDP tunnels
      
      We found issues when a UDP tunnel endpoint is in a different netns than
      where UDP GRO happens. This kind of setup is actually quite diverse,
      from having one leg of the tunnel on a remote host, to having a tunnel
      between netns (eg. being bridged in another one or on the host). In our
      case that UDP tunnel was geneve.
      
      UDP tunnel packets should not be GROed at the UDP level. The fundamental
      issue here is such packet can't be detected in a foolproof way: we can't
      know by looking at a packet alone and the current logic of looking up
      UDP sockets is fragile (socket could be in another netns, packet could
      be modified in between, etc). Because there is no way to make the GRO
      code correctly handle those packets in all cases, this series aims at
      two things: making the net stack behave correctly (as in, no crash
      and no invalid packet) when such a thing happens, and in some cases
      preventing this "early GRO" from happening.
      
      The first three patches fix issues when a "UDP tunneled" packet is being
      GROed too early by rx-udp-gro-forwarding or rx-gro-list.
      
      Last patch is preventing locally generated UDP tunnel packets from being
      GROed. This turns out to be more complex than this patch alone as it
      relies on skb->encapsulation which is currently untrustworthy in some cases
      (see iptunnel_handle_offloads); but that should fix things in practice
      and is acceptable for a fix. Future work is required to improve things
      (prevent all locally generated UDP tunnel packets from being GROed),
      such as fixing the misuse of skb->encapsulation in drivers; but that
      would be net-next material.
      
      Thanks!
      Antoine
      
      Since v3:
        - Fixed the udpgro_fwd selftest in patch 5 (Jakub Kicinski feedback).
        - Improved commit message on patch 3 (Willem de Bruijn feedback).
      
      Since v2:
        - Fixed a build issue with IPv6=m in patch 1 (Jakub Kicinski
          feedback).
        - Fixed typo in patch 1 (Nikolay Aleksandrov feedback).
        - Added Reviewed-by tag on patch 2 (Willem de Bruijn feedback).
        - Added back conversion to CHECKSUM_UNNECESSARY but only from non
          CHECKSUM_PARTIAL in patch 3 (Paolo Abeni & Willem de Bruijn
          feedback).
        - Reworded patch 3 commit msg.
      
      Since v1:
        - Fixed a build issue with IPv6 disabled in patch 1.
        - Reworked commit log in patch 2 (Willem de Bruijn feedback).
        - Added Reviewed-by tags on patches 1 & 4 (Willem de Bruijn feedback).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: gro fwd: update vxlan GRO test expectations · 0fb101be
      Antoine Tenart authored
      UDP tunnel packets can't be GROed in between their endpoints as this
      causes various issues. The UDP GRO fwd vxlan tests were relying on
      this and their expectations have to be fixed.
      
      We keep both vxlan tests and expect no GRO to happen. The vxlan
      UDP GRO bench test was removed as it no longer provides any valuable
      information.
      
      Fixes: a062260a ("selftests: net: add UDP GRO forwarding self-tests")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: prevent local UDP tunnel packets from being GROed · 64235eab
      Antoine Tenart authored
      GRO has a fundamental issue with UDP tunnel packets as it can't detect
      those in a foolproof way and GRO could happen before they reach the
      tunnel endpoint. Previous commits have fixed issues when UDP tunnel
      packets come from a remote host, but if those packets are issued locally
      they could run into checksum issues.
      
      If the inner packet has a partial checksum the information will be lost
      in the GRO logic, either in udp4/6_gro_complete or in
      udp_gro_complete_segment and packets will have an invalid checksum when
      leaving the host.
      
      Prevent local UDP tunnel packets from ever being GROed at the outer UDP
      level.
      
      Due to skb->encapsulation being wrongly used in some drivers this is
      actually only preventing UDP tunnel packets with a partial checksum from
      being GROed (see iptunnel_handle_offloads) but those were also the packets
      triggering issues so in practice this should be sufficient.
      
      Fixes: 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.")
      Fixes: 36707061 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: do not transition UDP GRO fraglist partial checksums to unnecessary · f0b8c303
      Antoine Tenart authored
      UDP GRO validates checksums and in udp4/6_gro_complete fraglist packets
      are converted to CHECKSUM_UNNECESSARY to avoid later checks. However
      this is an issue for CHECKSUM_PARTIAL packets as they can be looped in
      an egress path and then their partial checksums are not fixed.
      
      Different issues can be observed, from invalid checksum on packets to
      traces like:
      
        gen01: hw csum failure
        skb len=3008 headroom=160 headlen=1376 tailroom=0
        mac=(106,14) net=(120,40) trans=160
        shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
        csum(0xffff232e ip_summed=2 complete_sw=0 valid=0 level=0)
        hash(0x77e3d716 sw=1 l4=1) proto=0x86dd pkttype=0 iif=12
        ...
      
      Fix this by only converting CHECKSUM_NONE packets to
      CHECKSUM_UNNECESSARY by reusing __skb_incr_checksum_unnecessary. All
      other checksum types are kept as-is, including CHECKSUM_COMPLETE as
      fraglist packets being segmented back would have their skb->csum valid.
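
The change in checksum-state handling can be summarised as a small transition function. This is a hedged model, not kernel code: the enum values are local labels rather than the kernel's CHECKSUM_* constants, csum_level handling is left out, and both function names are hypothetical.

```c
#include <assert.h>

/* Local labels for the four checksum states discussed above. */
enum toy_csum {
	TOY_CSUM_NONE,
	TOY_CSUM_UNNECESSARY,
	TOY_CSUM_COMPLETE,
	TOY_CSUM_PARTIAL,
};

/* Behaviour described as problematic: every fraglist packet ends up
 * marked as "unnecessary", including partial checksums. */
static enum toy_csum toy_flist_complete_old(enum toy_csum s)
{
	assert(s <= TOY_CSUM_PARTIAL);
	return TOY_CSUM_UNNECESSARY;
}

/* Behaviour described as the fix: only NONE is upgraded, all other
 * states are kept as they are. */
static enum toy_csum toy_flist_complete_new(enum toy_csum s)
{
	assert(s <= TOY_CSUM_PARTIAL);
	return s == TOY_CSUM_NONE ? TOY_CSUM_UNNECESSARY : s;
}
```

Keeping the partial state means a packet that is looped into an egress path still gets its checksum completed there.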
      
      Fixes: 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gro: fix ownership transfer · ed4cccef
      Antoine Tenart authored
      If packets are GROed with fraglist they might be segmented later on and
      continue their journey in the stack. In skb_segment_list those skbs can
      be reused as-is. This is an issue as their destructor was removed in
      skb_gro_receive_list but not the reference to their socket, and then
      they can't be orphaned. Fix this by also removing the reference to the
      socket.
      
      For example this could be observed,
      
        kernel BUG at include/linux/skbuff.h:3131!  (skb_orphan)
        RIP: 0010:ip6_rcv_core+0x11bc/0x19a0
        Call Trace:
         ipv6_list_rcv+0x250/0x3f0
         __netif_receive_skb_list_core+0x49d/0x8f0
         netif_receive_skb_list_internal+0x634/0xd40
         napi_complete_done+0x1d2/0x7d0
         gro_cell_poll+0x118/0x1f0
      
      A similar construction is found in skb_gro_receive, apply the same
      change there.
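
The inconsistent state behind the quoted BUG can be modelled in a few lines. Everything here is a toy: `struct toy_skb`, `struct toy_sock` and the `toy_*` helpers are hypothetical, and the model only assumes what the text states, namely that a buffer with a socket reference but no destructor cannot be orphaned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock {
	int dummy;
};

/* Toy buffer with the two fields the commit message talks about. */
struct toy_skb {
	void (*destructor)(struct toy_skb *skb);
	struct toy_sock *sk;
};

static void toy_noop_destructor(struct toy_skb *skb)
{
	(void)skb;
}

/* The state the BUG corresponds to in this model: a socket reference
 * is still set although no destructor is left to release it. */
static bool toy_orphan_would_bug(const struct toy_skb *skb)
{
	assert(skb != NULL);
	return skb->destructor == NULL && skb->sk != NULL;
}

/* Described as the bug: only the destructor is removed. */
static void toy_merge_buggy(struct toy_skb *skb)
{
	skb->destructor = NULL;
}

/* Described as the fix: the socket reference is removed as well. */
static void toy_merge_fixed(struct toy_skb *skb)
{
	skb->destructor = NULL;
	skb->sk = NULL;
}
```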
      
      Fixes: 5e10da53 ("skbuff: allow 'slow_gro' for skb carring sock reference")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: do not accept non-tunnel GSO skbs landing in a tunnel · 3d010c80
      Antoine Tenart authored
      When rx-udp-gro-forwarding is enabled UDP packets might be GROed when
      being forwarded. If such packets might land in a tunnel this can cause
      various issues and udp_gro_receive makes sure this isn't the case by
      looking for a matching socket. This is performed in
      udp4/6_gro_lookup_skb but only in the current netns. This is an issue
      with tunneled packets when the endpoint is in another netns. In such
      cases the packets will be GROed at the UDP level, which leads to various
      issues later on. The same thing can happen with rx-gro-list.
      
      We saw this with geneve packets being GROed at the UDP level. In such
      case gso_size is set; later the packet goes through the geneve rx path,
      the geneve header is pulled, the offsets are adjusted and frag_list skbs
      are not adjusted with regard to geneve. When those skbs hit
      skb_fragment, it will misbehave. Different outcomes are possible
      depending on what the GROed skbs look like; from corrupted packets to
      kernel crashes.
      
      One example is a BUG_ON[1] triggered in skb_segment while processing the
      frag_list. Because gso_size is wrong (geneve header was pulled)
      skb_segment thinks there is "geneve header size" of data in frag_list,
      although it's in fact the next packet. The BUG_ON itself has nothing to
      do with the issue. This is only one of the potential issues.
      
      Looking up a matching socket in udp_gro_receive is fragile: the
      lookup could be extended to all netns (not speaking about performance)
      but nothing prevents those packets from being modified in between and we
      could still not find a matching socket. It's OK to keep the current
      logic there as it should cover most cases but we also need to make sure
      we handle tunnel packets being GROed too early.
      
      This is done by extending the checks in udp_unexpected_gso: GSO packets
      lacking the SKB_GSO_UDP_TUNNEL/_CSUM bits and landing in a tunnel must
      be segmented.
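
The extended check described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct names, the flag values and must_segment_before_tunnel() are hypothetical stand-ins, not the kernel's definitions, and the real udp_unexpected_gso performs further checks that are not modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's GSO type bits (values are illustrative). */
enum {
    GSO_UDP_L4          = 1 << 0,
    GSO_FRAGLIST        = 1 << 1,
    GSO_UDP_TUNNEL      = 1 << 2,
    GSO_UDP_TUNNEL_CSUM = 1 << 3,
};

/* Simplified view of a received packet. */
struct pkt_model {
    bool is_gso;            /* packet was aggregated (gso_size is set) */
    unsigned int gso_type;  /* bitmask of the GSO_* values above */
};

/* Simplified view of the socket the packet landed on. */
struct sock_model {
    bool is_tunnel;         /* socket belongs to a UDP tunnel endpoint */
};

/*
 * Returns true when the packet must be segmented before the tunnel
 * receive path sees it: it is a GSO packet, it landed on a tunnel
 * socket, and neither tunnel bit is set.
 */
static bool must_segment_before_tunnel(const struct sock_model *sk,
                                       const struct pkt_model *pkt)
{
    if (!pkt->is_gso)
        return false;
    if (!sk->is_tunnel)
        return false;
    return !(pkt->gso_type & (GSO_UDP_TUNNEL | GSO_UDP_TUNNEL_CSUM));
}
```

In this model a packet aggregated at the UDP level (only GSO_UDP_L4 or GSO_FRAGLIST set) that reaches a tunnel socket is flagged for segmentation, while a packet carrying one of the tunnel bits is left alone.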
      
      [1] kernel BUG at net/core/skbuff.c:4408!
          RIP: 0010:skb_segment+0xd2a/0xf70
          __udp_gso_segment+0xaa/0x560
      
      Fixes: 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.")
      Fixes: 36707061 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d010c80
    • net: hsr: Use full string description when opening HSR network device · 10e52ad5
      Lukasz Majewski authored
      Until now, only a single character ('A' or 'B') was used to report
      the status of an HSR slave network device.
      
      Since an Interlink network device may validly be supported as well,
      the description must be more verbose. As a result, the full string
      description is now used.
      Signed-off-by: Lukasz Majewski <lukma@denx.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10e52ad5
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 1ae289b0
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-03-27 (e1000e)
      
      This series contains updates to e1000e driver only.
      
      Vitaly adds a retry mechanism for some PHY operations to work around an
      MDI error and moves SMBus configuration to avoid possible PHY loss.
      
      * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        e1000e: move force SMBUS from enable ulp function to avoid PHY loss issue
        e1000e: Workaround for sporadic MDI error on Meteor Lake systems
      ====================
      
      Link: https://lore.kernel.org/r/20240327185517.2587564-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1ae289b0
    • xen-netfront: Add missing skb_mark_for_recycle · 03796540
      Jesper Dangaard Brouer authored
      Note that skb_mark_for_recycle() was introduced later than the commit in
      the Fixes tag, in commit 6a5bcd84 ("page_pool: Allow drivers to hint on
      SKB recycling").
      
      It is believed that the commit in the Fixes tag was missing a call to
      page_pool_release_page() between v5.9 and v5.14, after which it should
      have used skb_mark_for_recycle(). Since v6.6 the call to
      page_pool_release_page() has been removed (in commit 535b9c61 ("net:
      page_pool: hide page_pool_release_page()")) and the remaining callers
      converted (in commit 6bfef2ec ("Merge branch
      'net-page_pool-remove-page_pool_release_page'")).
      
      This leak became visible in v6.8 via commit dba1b8a7 ("mm/page_pool: catch
      page_pool memory leaks").
      
      Cc: stable@vger.kernel.org
      Fixes: 6c5aa6fc ("xen networking: add basic XDP support for xen-netfront")
      Reported-by: Leonidas Spyropoulos <artafinde@archlinux.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=218654
      Reported-by: Arthur Borsboom <arthurborsboom@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Link: https://lore.kernel.org/r/171154167446.2671062.9127105384591237363.stgit@firesoul
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      03796540
    • ptp: MAINTAINERS: drop Jeff Sipek · fa845139
      Krzysztof Kozlowski authored
      Emails to Jeff Sipek bounce:
      
        Your message to jsipek@vmware.com couldn't be delivered.
        Recipient is not authorized to accept external mail
        Status code: 550 5.7.1_ETR
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20240327081413.306054-1-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fa845139
    • Documentation: Add documentation for eswitch attribute · 931ec1e4
      William Tu authored
      Provide devlink documentation for three eswitch attributes:
      mode, inline-mode, and encap-mode.
      Signed-off-by: William Tu <witu@nvidia.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20240325181228.6244-1-witu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      931ec1e4
  2. 28 Mar, 2024 15 commits
    • Merge tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 50108c35
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bpf, WiFi and netfilter.
      
        Current release - regressions:
      
         - ipv6: fix address dump when IPv6 is disabled on an interface
      
        Current release - new code bugs:
      
         - bpf: temporarily disable atomic operations in BPF arena
      
         - nexthop: fix uninitialized variable in nla_put_nh_group_stats()
      
        Previous releases - regressions:
      
         - bpf: protect against int overflow for stack access size
      
         - hsr: fix the promiscuous mode in offload mode
      
         - wifi: don't always use FW dump trig
      
         - tls: adjust recv return with async crypto and failed copy to
           userspace
      
         - tcp: properly terminate timers for kernel sockets
      
         - ice: fix memory corruption bug with suspend and rebuild
      
         - at803x: fix kernel panic with at8031_probe
      
         - qeth: handle deferred cc1
      
        Previous releases - always broken:
      
         - bpf: fix bug in BPF_LDX_MEMSX
      
         - netfilter: reject table flag and netdev basechain updates
      
         - inet_defrag: prevent sk release while still in use
      
         - wifi: pick the version of SESSION_PROTECTION_NOTIF
      
         - wwan: t7xx: split 64bit accesses to fix alignment issues
      
         - mlxbf_gige: call request_irq() after NAPI initialized
      
         - hns3: fix kernel crash when devlink reload during pf
           initialization"
      
      * tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
        inet: inet_defrag: prevent sk release while still in use
        Octeontx2-af: fix pause frame configuration in GMP mode
        net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips
        net: bcmasp: Remove phy_{suspend/resume}
        net: bcmasp: Bring up unimac after PHY link up
        net: phy: qcom: at803x: fix kernel panic with at8031_probe
        netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
        netfilter: nf_tables: skip netdev hook unregistration if table is dormant
        netfilter: nf_tables: reject table flag and netdev basechain updates
        netfilter: nf_tables: reject destroy command to remove basechain hooks
        bpf: update BPF LSM designated reviewer list
        bpf: Protect against int overflow for stack access size
        bpf: Check bloom filter map value size
        bpf: fix warning for crash_kexec
        selftests: netdevsim: set test timeout to 10 minutes
        net: wan: framer: Add missing static inline qualifiers
        mlxbf_gige: call request_irq() after NAPI initialized
        tls: get psock ref after taking rxlock to avoid leak
        selftests: tls: add test with a partially invalid iov
        tls: adjust recv return with async crypto and failed copy to userspace
        ...
      50108c35
    • inet: inet_defrag: prevent sk release while still in use · 18685451
      Florian Westphal authored
      ip_local_out() and other functions can pass skb->sk as function argument.
      
      If the skb is a fragment and reassembly happens before such function call
      returns, the sk must not be released.
      
      This affects skb fragments reassembled via netfilter or similar
      modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
      
      Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
        Calling ip_defrag() in output path is also implying skb_orphan(),
        which is buggy because output path relies on sk not disappearing.
      
        A relevant old patch about the issue was :
        8282f274 ("inet: frag: Always orphan skbs inside ip_defrag()")
      
        [..]
      
        net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
        inet socket, not an arbitrary one.
      
        If we orphan the packet in ipvlan, then downstream things like FQ
        packet scheduler will not work properly.
      
        We need to change ip_defrag() to only use skb_orphan() when really
        needed, ie whenever frag_list is going to be used.
      
      Eric suggested stashing sk in the fragment queue and made an initial patch.
      However there is a problem with this:
      
      If skb is refragmented again right after, ip_do_fragment() will copy
      head->sk to the new fragments, and set up the destructor to sock_wfree.
      IOW, we have no choice but to fix up sk_wmem accounting to reflect the
      fully reassembled skb, else wmem will underflow.
      
      This change moves the orphan down into the core, to last possible moment.
      As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
      offset into the FRAG_CB, else skb->sk gets clobbered.
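
The aliasing problem mentioned above can be illustrated with a minimal userspace sketch. The two structs below are hypothetical, heavily simplified stand-ins for struct sk_buff before and after the change; the field names only echo the ones in the text and the layout is not the kernel's.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical "before" layout: the offset shares storage with sk. */
struct skb_aliased {
    union {
        void *sk;              /* owning socket */
        int ip_defrag_offset;  /* reuses the same bytes as sk */
    } u;
};

/* Hypothetical "after" layout: the offset lives in a separate control block. */
struct skb_separate {
    void *sk;
    struct {
        int offset;
    } frag_cb;
};

/* Returns true if sk still equals owner after the offset is stored. */
static bool sk_survives_aliased(void *owner, int offset)
{
    struct skb_aliased skb;
    void *after;

    skb.u.sk = owner;
    skb.u.ip_defrag_offset = offset;      /* overwrites bytes of sk */
    memcpy(&after, &skb.u, sizeof(after));
    return after == owner;
}

/* Same question for the layout with a separate control block. */
static bool sk_survives_separate(void *owner, int offset)
{
    struct skb_separate skb;

    skb.sk = owner;
    skb.frag_cb.offset = offset;          /* does not touch sk */
    return skb.sk == owner;
}
```

With the aliased layout the socket pointer is lost as soon as an offset is stored, which is why the orphaning could not be delayed until the offset had a home of its own.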
      
      This allows delaying the orphaning long enough to learn if the skb has
      to be queued or if the skb is completing the reasm queue.
      
      In the former case, things work as before, skb is orphaned.  This is
      safe because skb gets queued/stolen and won't continue past reasm engine.
      
      In the latter case, we will steal the skb->sk reference, reattach it to
      the head skb, and fix up wmem accounting when inet_frag inflates truesize.
      
      Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().")
      Diagnosed-by: Eric Dumazet <edumazet@google.com>
      Reported-by: xingwei lee <xrivendell7@gmail.com>
      Reported-by: yue sun <samsun1006219@gmail.com>
      Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240326101845.30836-1-fw@strlen.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      18685451
    • Octeontx2-af: fix pause frame configuration in GMP mode · 40d4b480
      Hariprasad Kelam authored
      The Octeontx2 MAC block (CGX) has separate data paths (SMU and GMP) for
      different speeds, allowing for efficient data transfer.
      
      The previous patch which added pause frame configuration has a bug
      that prevents the pause frame feature from working in GMP mode.
      
      This patch fixes the issue by configuring the appropriate registers.
      
      Fixes: f7e086e7 ("octeontx2-af: Pause frame configuration at cgx")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240326052720.4441-1-hkelam@marvell.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      40d4b480
    • net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips · e4a58989
      Raju Lakkaraju authored
      PCI11x1x Rev B0 devices might drop packets when receiving back-to-back frames
      at 2.5G link speed. Change the B0 Rev device's Receive Filtering Engine FIFO
      threshold parameter from its hardware default of 4 to 3 dwords to prevent the
      problem. Rev C0 and later hardware already defaults to 3 dwords.
      
      Fixes: bb4f6bff ("net: lan743x: Add PCI11010 / PCI11414 device IDs")
      Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240326065805.686128-1-Raju.Lakkaraju@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      e4a58989
    • Merge branch 'net-bcmasp-phy-managements-fixes' · eb67cdb3
      Paolo Abeni authored
      Justin Chen says:
      
      ====================
      net: bcmasp: phy managements fixes
      
      Fix two issues.
      
      - The unimac may be put in a bad state if PHY RX clk doesn't exist
        during reset. Work around this by bringing the unimac out of reset
        during phy up.
      
      - Remove redundant phy_{suspend/resume}
      ====================
      
      Link: https://lore.kernel.org/r/20240325193025.1540737-1-justin.chen@broadcom.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      eb67cdb3
    • net: bcmasp: Remove phy_{suspend/resume} · 4494c10e
      Justin Chen authored
      phy_{suspend/resume} is redundant. It gets called from phy_{stop/start}.
      
      Fixes: 490cb412 ("net: bcmasp: Add support for ASP2.0 Ethernet controller")
      Signed-off-by: Justin Chen <justin.chen@broadcom.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4494c10e
    • net: bcmasp: Bring up unimac after PHY link up · dfd222e2
      Justin Chen authored
      The unimac requires the PHY RX clk during reset or it may be put
      into a bad state. Bring up the unimac after link up to ensure the
      PHY RX clk exists.
      
      Fixes: 490cb412 ("net: bcmasp: Add support for ASP2.0 Ethernet controller")
      Signed-off-by: Justin Chen <justin.chen@broadcom.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      dfd222e2
    • net: phy: qcom: at803x: fix kernel panic with at8031_probe · 6a4aee27
      Christian Marangi authored
      While reworking and splitting the at803x driver, a NULL dereference
      bug was introduced in the function split out for the at8031 PHY: priv
      is referenced before it is actually allocated, and the is_1000basex
      and is_fiber variables are then written through it, writing to the
      wrong address.
      
      Fix this by correctly setting priv local variable only after
      at803x_probe is called and actually allocates priv in the phydev struct.
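
The ordering fix described above can be sketched in userspace as follows. This is a simplified model, not the driver code: the struct names, common_probe() and at8031_like_probe() are hypothetical, and calloc() stands in for whatever allocation the real probe performs.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's private data and the PHY device. */
struct phy_priv_model {
    bool is_fiber;
    bool is_1000basex;
};

struct phy_dev_model {
    struct phy_priv_model *priv;  /* NULL until the common probe has run */
};

/* Stand-in for the common probe, which allocates priv. Returns 0 on success. */
static int common_probe(struct phy_dev_model *phydev)
{
    phydev->priv = calloc(1, sizeof(*phydev->priv));
    return phydev->priv ? 0 : -1;
}

/*
 * Fixed ordering: phydev->priv is read only after common_probe() has
 * allocated it. Reading it at the top of the function, before the call,
 * would capture a NULL pointer and the writes below would go through it.
 */
static int at8031_like_probe(struct phy_dev_model *phydev,
                             bool fiber, bool basex)
{
    struct phy_priv_model *priv;
    int ret;

    ret = common_probe(phydev);
    if (ret)
        return ret;

    priv = phydev->priv;          /* valid only from this point on */
    priv->is_fiber = fiber;
    priv->is_1000basex = basex;
    return 0;
}
```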
      Reported-by: William Wortel <wwortel@dorpstraat.com>
      Cc: <stable@vger.kernel.org>
      Fixes: 25d2ba94 ("net: phy: at803x: move specific at8031 probe mode check to dedicated probe")
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20240325190621.2665-1-ansuelsmth@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6a4aee27
    • Merge tag 'nf-24-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 005e528c
      Paolo Abeni authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      Patch #1 reject destroy chain command to delete device hooks in netdev
               family, hence, only delchain commands are allowed.
      
      Patch #2 reject table flag update interference with netdev basechain
      	 hook updates, this can leave hooks in inconsistent
      	 registration/unregistration state.
      
      Patch #3 do not unregister netdev basechain hooks if table is dormant.
      	 Otherwise, splat with double unregistration is possible.
      
      Patch #4 fixes Kconfig to allow restoring IP_NF_ARPTABLES,
      	 from Kuniyuki Iwashima.
      
      There are more fixes still in progress on my side that need more work.
      
      * tag 'nf-24-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
        netfilter: nf_tables: skip netdev hook unregistration if table is dormant
        netfilter: nf_tables: reject table flag and netdev basechain updates
        netfilter: nf_tables: reject destroy command to remove basechain hooks
      ====================
      
      Link: https://lore.kernel.org/r/20240328031855.2063-1-pablo@netfilter.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      005e528c
    • Merge tag 'for-net' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 7e6f4b2a
      Paolo Abeni authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2024-03-27
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 4 non-merge commits during the last 1 day(s) which contain
      a total of 5 files changed, 26 insertions(+), 3 deletions(-).
      
      The main changes are:
      
      1) Fix bloom filter value size validation and protect the verifier
         against such mistakes, from Andrei.
      
      2) Fix build due to CONFIG_KEXEC_CORE/CRASH_DUMP split, from Hari.
      
      3) Update bpf_lsm maintainers entry, from Matt.
      
      * tag 'for-net' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf: update BPF LSM designated reviewer list
        bpf: Protect against int overflow for stack access size
        bpf: Check bloom filter map value size
        bpf: fix warning for crash_kexec
      ====================
      
      Link: https://lore.kernel.org/r/20240328012938.24249-1-alexei.starovoitov@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7e6f4b2a
    • Merge tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 8d025e20
      Linus Torvalds authored
      Pull erofs fixes from Gao Xiang:
      
       - Add a new reviewer Sandeep Dhavale to build a healthier community
      
       - Drop experimental warning for FSDAX
      
      * tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        MAINTAINERS: erofs: add myself as reviewer
        erofs: drop experimental warning for FSDAX
      8d025e20
    • netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c · 15fba562
      Kuniyuki Iwashima authored
      syzkaller started to report a warning below [0] after consuming the
      commit 4654467d ("netfilter: arptables: allow xtables-nft only
      builds").
      
      The change accidentally removed the dependency on NETFILTER_FAMILY_ARP
      from IP_NF_ARPTABLES.
      
      If NF_TABLES_ARP is not enabled on Kconfig, NETFILTER_FAMILY_ARP will
      be removed and some code necessary for arptables will not be compiled.
      
        $ grep -E "(NETFILTER_FAMILY_ARP|IP_NF_ARPTABLES|NF_TABLES_ARP)" .config
        CONFIG_NETFILTER_FAMILY_ARP=y
        # CONFIG_NF_TABLES_ARP is not set
        CONFIG_IP_NF_ARPTABLES=y
      
        $ make olddefconfig
      
        $ grep -E "(NETFILTER_FAMILY_ARP|IP_NF_ARPTABLES|NF_TABLES_ARP)" .config
        # CONFIG_NF_TABLES_ARP is not set
        CONFIG_IP_NF_ARPTABLES=y
      
      So, when nf_register_net_hooks() is called for arptables, it will
      trigger the splat below.
      
      Now IP_NF_ARPTABLES is only enabled by IP_NF_ARPFILTER, so let's
      restore the dependency on NETFILTER_FAMILY_ARP in IP_NF_ARPFILTER.
      
      [0]:
      WARNING: CPU: 0 PID: 242 at net/netfilter/core.c:316 nf_hook_entry_head+0x1e1/0x2c0 net/netfilter/core.c:316
      Modules linked in:
      CPU: 0 PID: 242 Comm: syz-executor.0 Not tainted 6.8.0-12821-g537c2e91 #10
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:nf_hook_entry_head+0x1e1/0x2c0 net/netfilter/core.c:316
      Code: 83 fd 04 0f 87 bc 00 00 00 e8 5b 84 83 fd 4d 8d ac ec a8 0b 00 00 e8 4e 84 83 fd 4c 89 e8 5b 5d 41 5c 41 5d c3 e8 3f 84 83 fd <0f> 0b e8 38 84 83 fd 45 31 ed 5b 5d 4c 89 e8 41 5c 41 5d c3 e8 26
      RSP: 0018:ffffc90000b8f6e8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff83c42164
      RDX: ffff888106851180 RSI: ffffffff83c42321 RDI: 0000000000000005
      RBP: 0000000000000000 R08: 0000000000000005 R09: 000000000000000a
      R10: 0000000000000003 R11: ffff8881055c2f00 R12: ffff888112b78000
      R13: 0000000000000000 R14: ffff8881055c2f00 R15: ffff8881055c2f00
      FS:  00007f377bd78800(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000496068 CR3: 000000011298b003 CR4: 0000000000770ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       __nf_register_net_hook+0xcd/0x7a0 net/netfilter/core.c:428
       nf_register_net_hook+0x116/0x170 net/netfilter/core.c:578
       nf_register_net_hooks+0x5d/0xc0 net/netfilter/core.c:594
       arpt_register_table+0x250/0x420 net/ipv4/netfilter/arp_tables.c:1553
       arptable_filter_table_init+0x41/0x60 net/ipv4/netfilter/arptable_filter.c:39
       xt_find_table_lock+0x2e9/0x4b0 net/netfilter/x_tables.c:1260
       xt_request_find_table_lock+0x2b/0xe0 net/netfilter/x_tables.c:1285
       get_info+0x169/0x5c0 net/ipv4/netfilter/arp_tables.c:808
       do_arpt_get_ctl+0x3f9/0x830 net/ipv4/netfilter/arp_tables.c:1444
       nf_getsockopt+0x76/0xd0 net/netfilter/nf_sockopt.c:116
       ip_getsockopt+0x17d/0x1c0 net/ipv4/ip_sockglue.c:1777
       tcp_getsockopt+0x99/0x100 net/ipv4/tcp.c:4373
       do_sock_getsockopt+0x279/0x360 net/socket.c:2373
       __sys_getsockopt+0x115/0x1e0 net/socket.c:2402
       __do_sys_getsockopt net/socket.c:2412 [inline]
       __se_sys_getsockopt net/socket.c:2409 [inline]
       __x64_sys_getsockopt+0xbd/0x150 net/socket.c:2409
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
      RIP: 0033:0x7f377beca6fe
      Code: 1f 44 00 00 48 8b 15 01 97 0a 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 c9
      RSP: 002b:00000000005df728 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
      RAX: ffffffffffffffda RBX: 00000000004966e0 RCX: 00007f377beca6fe
      RDX: 0000000000000060 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 000000000042938a R08: 00000000005df73c R09: 00000000005df800
      R10: 00000000004966e8 R11: 0000000000000246 R12: 0000000000000003
      R13: 0000000000496068 R14: 0000000000000003 R15: 00000000004bc9d8
       </TASK>
      
      Fixes: 4654467d ("netfilter: arptables: allow xtables-nft only builds")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      15fba562
    • netfilter: nf_tables: skip netdev hook unregistration if table is dormant · 216e7bf7
      Pablo Neira Ayuso authored
      Skip hook unregistration when adding or deleting devices from an
      existing netdev basechain if the table is dormant. Otherwise, the
      commit/abort path tries to unregister hooks which are not enabled.
      
      Fixes: b9703ed4 ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
      Fixes: 7d937b10 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      216e7bf7
    • netfilter: nf_tables: reject table flag and netdev basechain updates · 1e1fb6f0
      Pablo Neira Ayuso authored
      netdev basechain updates are stored in the transaction object hook list.
      When setting the table dormant flag, the code iterates over the existing
      hooks in the basechain, thus skipping the hooks that are being
      added/deleted in this transaction, which leaves hook registration in
      an inconsistent state.
      
      Reject table flag updates in combination with netdev basechain updates
      in the same batch:
      
      - Update table flags and add/delete basechain: Check from basechain update
        path if there are pending flag updates for this table.
      - add/delete basechain and update table flags: Iterate over the transaction
        list to search for basechain updates from the table update path.
      
      In both cases, the batch is rejected. Based on suggestion from Florian Westphal.
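
The rule above can be modelled with a short sketch. It is an assumption-laden simplification: the commit checks from two different paths depending on the order of operations, whereas this model validates a whole batch in one pass, and the type names and batch_is_acceptable() are hypothetical rather than nf_tables code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kinds of operations queued in one batch. */
enum trans_kind {
    TRANS_TABLE_FLAG_UPDATE,
    TRANS_NETDEV_BASECHAIN_UPDATE,
    TRANS_OTHER,
};

/* One queued operation and the table it targets. */
struct trans_model {
    enum trans_kind kind;
    int table_id;
};

/*
 * Returns false if any table has both a flag update and a netdev
 * basechain update queued in the same batch, whatever their order.
 */
static bool batch_is_acceptable(const struct trans_model *batch, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (batch[i].kind != TRANS_TABLE_FLAG_UPDATE)
            continue;
        for (size_t j = 0; j < n; j++) {
            if (batch[j].kind == TRANS_NETDEV_BASECHAIN_UPDATE &&
                batch[j].table_id == batch[i].table_id)
                return false;
        }
    }
    return true;
}
```

Updates that target different tables do not conflict in this model, so a flag update on one table and a basechain update on another are accepted together.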
      
      Fixes: b9703ed4 ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
      Fixes: 7d937b10 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      1e1fb6f0
    • netfilter: nf_tables: reject destroy command to remove basechain hooks · b32ca27f
      Pablo Neira Ayuso authored
      Report EOPNOTSUPP if NFT_MSG_DESTROYCHAIN is used to delete hooks in an
      existing netdev basechain, thus, only NFT_MSG_DELCHAIN is allowed.
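
As a rough illustration of the rule above, the check can be modelled like this. The enum, the boolean parameters and check_hook_removal() are hypothetical; only the EOPNOTSUPP result and the distinction between the two message types come from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NFT_MSG_DELCHAIN and NFT_MSG_DESTROYCHAIN. */
enum chain_msg {
    MSG_DELCHAIN,
    MSG_DESTROYCHAIN,
};

/*
 * Returns -EOPNOTSUPP when a destroy message is used to delete hooks
 * from an existing netdev basechain, and 0 otherwise.
 */
static int check_hook_removal(enum chain_msg msg,
                              bool existing_netdev_basechain,
                              bool removes_hooks)
{
    if (msg == MSG_DESTROYCHAIN && existing_netdev_basechain && removes_hooks)
        return -EOPNOTSUPP;
    return 0;
}
```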
      
      Fixes: 7d937b10 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b32ca27f