1. 19 Mar, 2024 13 commits
  2. 18 Mar, 2024 7 commits
    • Abhishek Chauhan's avatar
      Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets" · 35c3e279
      Abhishek Chauhan authored
      This reverts commit 885c36e5.
      
      The patch currently broke the bpf selftest test_tc_dtime because
      uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which
      does not necessarily mean mono with the original fix as the bit was re-used
      for userspace timestamp as well to avoid tstamp reset in the forwarding
      path. To solve this we need to keep mono_delivery_time as is and
      introduce another bit called user_delivery_time and fall back to the
      initial proposal of setting the user_delivery_time bit based on
      sk_clockid set from userspace.
      
      Fixes: 885c36e5 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
      Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/Signed-off-by: default avatarAbhishek Chauhan <quic_abchauha@quicinc.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35c3e279
    • Arınç ÜNAL's avatar
      net: dsa: mt7530: prevent possible incorrect XTAL frequency selection · f490c492
      Arınç ÜNAL authored
      On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
      value that represents the frequency of the crystal oscillator connected to
      the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
      ESW_P4_LED_0 pins, which is done right after reset is deasserted.
      
        ESW_P4_LED_0    ESW_P3_LED_0    Frequency
        -----------------------------------------
        0               0               Reserved
        0               1               20MHz
        1               0               40MHz
        1               1               25MHz
      
      On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
      pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
      it's low.
      
      These pins are also used with LEDs, therefore, their state can be set to
      something other than the bootstrapping configuration. For example, a link
      may be established on port 3 before the DSA subdriver takes control of the
      switch which would set ESW_P3_LED_0 to high.
      
      Currently on mt7530_setup() and mt7531_setup(), 1000 - 1100 usec delay is
      described between reset assertion and deassertion. Some switch ICs in real
      life conditions cannot always have these pins set back to the bootstrapping
      configuration before reset deassertion in this amount of delay. This causes
      wrong crystal frequency to be selected which puts the switch in a
      nonfunctional state after reset deassertion.
      
      The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
      by Justin Swartz.
      
      With a cable from an active peer connected to port 3 before reset, an
      incorrect crystal frequency (0b11 = 25MHz) is selected:
      
                            [1]                  [3]     [5]
                            :                    :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    _____________________________
      ESW_P3_LED_0                               |__________________________
      
                             :                  : :     :
                             :                  : [4]...:
                             :                  :
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 1000 - 1100 usec.
      [3] Reset is deasserted.
      [4] Period of 315 usec. HWTRAP register is populated with incorrect
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      Increase the delay between reset_control_assert() and
      reset_control_deassert(), and gpiod_set_value_cansleep(priv->reset, 0) and
      gpiod_set_value_cansleep(priv->reset, 1) to 5000 - 5100 usec. This amount
      ensures a higher possibility that the switch IC will have these pins back
      to the bootstrapping configuration before reset deassertion.
      
      With a cable from an active peer connected to port 3 before reset, the
      correct crystal frequency (0b10 = 40MHz) is selected:
      
                            [1]        [2-1]     [3]     [5]
                            :          :         :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    ___________________           _______
      ESW_P3_LED_0                     |_________|       |__________________
      
                             :          :       : :     :
                             :          [2-2]...: [4]...:
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 5000 - 5100 usec.
      [2-1] ESW_P3_LED_0 goes low.
      [2-2] Remaining period of 5000 - 5100 usec.
      [3] Reset is deasserted.
      [4] Period of 310 usec. HWTRAP register is populated with bootstrapped
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      ESW_P3_LED_0 low period before reset deassertion:
      
                    5000 usec
                  - 5100 usec
          TEST     RESET HOLD
             #         (usec)
        ---------------------
             1           5410
             2           5440
             3           4375
             4           5490
             5           5475
             6           4335
             7           4370
             8           5435
             9           4205
            10           4335
            11           3750
            12           3170
            13           4395
            14           4375
            15           3515
            16           4335
            17           4220
            18           4175
            19           4175
            20           4350
      
           Min           3170
           Max           5490
      
        Median       4342.500
           Avg       4466.500
      
      Revert commit 2920dd92 ("net: dsa: mt7530: disable LEDs before reset").
      Changing the state of pins via reset assertion is simpler and more
      efficient than doing so by setting the LED controller off.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Fixes: c288575f ("net: dsa: mt7530: Add the support of MT7531 switch")
      Co-developed-by: default avatarJustin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: default avatarJustin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: default avatarArınç ÜNAL <arinc.unal@arinc9.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f490c492
    • David S. Miller's avatar
      Merge branch 'veth-xdp-gro' · ba77f6e2
      David S. Miller authored
      Ignat Korchagin says:
      
      ====================
      net: veth: ability to toggle GRO and XDP independently
      
      It is rather confusing that GRO is automatically enabled, when an XDP program
      is attached to a veth interface. Moreover, it is not possible to disable GRO
      on a veth, if an XDP program is attached (which might be desirable in some use
      cases).
      
      Make GRO and XDP independent for a veth interface.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba77f6e2
    • Ignat Korchagin's avatar
      selftests: net: veth: test the ability to independently manipulate GRO and XDP · ba5a6476
      Ignat Korchagin authored
      We should be able to independently flip either XDP or GRO states and toggling
      one should not affect the other.
      
      Adjust other tests as well that had implicit expectation that GRO would be
      automatically enabled.
      Signed-off-by: default avatarIgnat Korchagin <ignat@cloudflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba5a6476
    • Ignat Korchagin's avatar
      net: veth: do not manipulate GRO when using XDP · d7db7775
      Ignat Korchagin authored
      Commit d3256efd ("veth: allow enabling NAPI even without XDP") tried to fix
      the fact that GRO was not possible without XDP, because veth did not use NAPI
      without XDP. However, it also introduced the behaviour that GRO is always
      enabled, when XDP is enabled.
      
      While it might be desired for most cases, it is confusing for the user at best
      as the GRO flag suddenly changes, when an XDP program is attached. It also
      introduces some complexities in state management as was partially addressed in
      commit fe9f8013 ("net: veth: clear GRO when clearing XDP even when down").
      
      But the biggest problem is that it is not possible to disable GRO at all, when
      an XDP program is attached, which might be needed for some use cases.
      
      Fix this by not touching the GRO flag on XDP enable/disable as the code already
      supports switching to NAPI if either GRO or XDP is requested.
      
      Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/
      Fixes: d3256efd ("veth: allow enabling NAPI even without XDP")
      Fixes: fe9f8013 ("net: veth: clear GRO when clearing XDP even when down")
      Signed-off-by: default avatarIgnat Korchagin <ignat@cloudflare.com>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7db7775
    • Eric Dumazet's avatar
      packet: annotate data-races around ignore_outgoing · 6ebfad33
      Eric Dumazet authored
      ignore_outgoing is read locklessly from dev_queue_xmit_nit()
      and packet_getsockopt()
      
      Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
      
      syzbot reported:
      
      BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
      
      write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
       packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
       do_sock_setsockopt net/socket.c:2311 [inline]
       __sys_setsockopt+0x1d8/0x250 net/socket.c:2334
       __do_sys_setsockopt net/socket.c:2343 [inline]
       __se_sys_setsockopt net/socket.c:2340 [inline]
       __x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
       do_syscall_64+0xd3/0x1d0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
       dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
       xmit_one net/core/dev.c:3527 [inline]
       dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
       __dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
       dev_queue_xmit include/linux/netdevice.h:3091 [inline]
       batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
       batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
       batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
       batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
       batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
       process_one_work kernel/workqueue.c:3254 [inline]
       process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
       worker_thread+0x526/0x730 kernel/workqueue.c:3416
       kthread+0x1d1/0x210 kernel/kthread.c:388
       ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
      
      value changed: 0x00 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G        W          6.8.0-syzkaller-08073-g480e035f #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
      
      Fixes: fa788d98 ("packet: add sockopt to ignore outgoing packets")
      Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#tSigned-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarJason Xing <kerneljasonxing@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ebfad33
    • Herve Codina's avatar
      net: wan: fsl_qmc_hdlc: Fix module compilation · badc9e33
      Herve Codina authored
      The fsl_qmc_driver does not compile as module:
        error: ‘qmc_hdlc_driver’ undeclared here (not in a function);
          405 | MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
              |                         ^~~~~~~~~~~~~~~
      
      Fix the typo.
      
      Fixes: b40f00ecd463 ("net: wan: Add support for QMC HDLC")
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Closes: https://lore.kernel.org/linux-kernel/87ttl93f7i.fsf@mail.lhotse/Signed-off-by: default avatarHerve Codina <herve.codina@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      badc9e33
  3. 15 Mar, 2024 2 commits
  4. 14 Mar, 2024 9 commits
    • Jens Axboe's avatar
      net: remove {revc,send}msg_copy_msghdr() from exports · e54e09c0
      Jens Axboe authored
      The only user of these was io_uring, and it's not using them anymore.
      Make them static and remove them from the socket header file.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Link: https://lore.kernel.org/r/1b6089d3-c1cf-464a-abd3-b0f0b6bb2523@kernel.dkSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e54e09c0
    • Duanqiang Wen's avatar
      net: txgbe: fix clk_name exceed MAX_DEV_ID limits · e30cef00
      Duanqiang Wen authored
      txgbe register clk which name is i2c_designware.pci_dev_id(),
      clk_name will be stored in clk_lookup_alloc. If PCIe bus number
      is larger than 0x39, clk_name size will be larger than 20 bytes.
      It exceeds clk_lookup_alloc MAX_DEV_ID limits. So the driver
      shortened clk_name.
      
      Fixes: b63f2048 ("net: txgbe: Register fixed rate clock")
      Signed-off-by: default avatarDuanqiang Wen <duanqiangwen@net-swift.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Link: https://lore.kernel.org/r/20240313080634.459523-1-duanqiangwen@net-swift.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      e30cef00
    • Jakub Kicinski's avatar
      docs: networking: fix indentation errors in multi-pf-netdev · 1c636867
      Jakub Kicinski authored
      Stephen reports new warnings in the docs:
      
      Documentation/networking/multi-pf-netdev.rst:94: ERROR: Unexpected indentation.
      Documentation/networking/multi-pf-netdev.rst:106: ERROR: Unexpected indentation.
      
      Fixes: 77d9ec3f ("Documentation: networking: Add description for multi-pf netdev")
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Link: https://lore.kernel.org/all/20240312153304.0ef1b78e@canb.auug.org.au/Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240313032329.3919036-1-kuba@kernel.orgSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      1c636867
    • Paolo Abeni's avatar
      Merge branch 'rxrpc-fixes-for-af_rxrpc' · 7278c70a
      Paolo Abeni authored
      David Howells says:
      
      ====================
      rxrpc: Fixes for AF_RXRPC
      
      Here are a couple of fixes for the AF_RXRPC changes[1] in net-next.
      
       (1) Fix a runtime warning introduced by a patch that changed how
           page_frag_alloc_align() works.
      
       (2) Fix an is-NULL vs IS_ERR error handling bug.
      
      The patches are tagged here:
      
      	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/rxrpc-iothread-20240312
      
      And can be found on this branch:
      
      	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-iothread
      
      Link: https://lore.kernel.org/r/20240306000655.1100294-1-dhowells@redhat.com/ [1]
      ====================
      
      Link: https://lore.kernel.org/r/20240312233723.2984928-1-dhowells@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      7278c70a
    • David Howells's avatar
      rxrpc: Fix error check on ->alloc_txbuf() · 89e43541
      David Howells authored
      rxrpc_alloc_*_txbuf() and ->alloc_txbuf() return NULL to indicate no
      memory, but rxrpc_send_data() uses IS_ERR().
      
      Fix rxrpc_send_data() to check for NULL only and set -ENOMEM if it sees
      that.
      
      Fixes: 49489bb0 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      89e43541
    • David Howells's avatar
      rxrpc: Fix use of changed alignment param to page_frag_alloc_align() · 6b253646
      David Howells authored
      Commit 411c5f36 ("mm/page_alloc: modify page_frag_alloc_align() to
      accept align as an argument") changed the way page_frag_alloc_align()
      worked, but it didn't fix AF_RXRPC as that use of that allocator function
      hadn't been merged yet at the time.  Now, when the AFS filesystem is used,
      this results in:
      
        WARNING: CPU: 4 PID: 379 at include/linux/gfp.h:323 rxrpc_alloc_data_txbuf+0x9d/0x2b0 [rxrpc]
      
      Fix this by using __page_frag_alloc_align() instead.
      
      Note that it might be better to use an order-based alignment rather than a
      mask-based alignment.
      
      Fixes: 49489bb0 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      cc: Yunsheng Lin <linyunsheng@huawei.com>
      cc: Alexander Duyck <alexander.duyck@gmail.com>
      cc: Michael S. Tsirkin <mst@redhat.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      6b253646
    • Shigeru Yoshida's avatar
      hsr: Fix uninit-value access in hsr_get_node() · ddbec99f
      Shigeru Yoshida authored
      KMSAN reported the following uninit-value access issue [1]:
      
      =====================================================
      BUG: KMSAN: uninit-value in hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
       hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
       fill_frame_info net/hsr/hsr_forward.c:577 [inline]
       hsr_forward_skb+0xe12/0x30e0 net/hsr/hsr_forward.c:615
       hsr_dev_xmit+0x1a1/0x270 net/hsr/hsr_device.c:223
       __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
       netdev_start_xmit include/linux/netdevice.h:4954 [inline]
       xmit_one net/core/dev.c:3548 [inline]
       dev_hard_start_xmit+0x247/0xa10 net/core/dev.c:3564
       __dev_queue_xmit+0x33b8/0x5130 net/core/dev.c:4349
       dev_queue_xmit include/linux/netdevice.h:3134 [inline]
       packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
       packet_snd net/packet/af_packet.c:3087 [inline]
       packet_sendmsg+0x8b1d/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
       slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768
       slab_alloc_node mm/slub.c:3478 [inline]
       kmem_cache_alloc_node+0x5e9/0xb10 mm/slub.c:3523
       kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
       __alloc_skb+0x318/0x740 net/core/skbuff.c:651
       alloc_skb include/linux/skbuff.h:1286 [inline]
       alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6334
       sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2787
       packet_alloc_skb net/packet/af_packet.c:2936 [inline]
       packet_snd net/packet/af_packet.c:3030 [inline]
       packet_sendmsg+0x70e8/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      =====================================================
      
      If the packet type ID field in the Ethernet header is either ETH_P_PRP or
      ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
      reads an invalid value as a sequence number. This causes the above issue.
      
      This patch fixes the issue by returning NULL if the Ethernet header is not
      followed by an HSR tag.
      
      Fixes: f266a683 ("net/hsr: Better frame dispatch")
      Reported-and-tested-by: syzbot+2ef3a8ce8e91b5a50098@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2ef3a8ce8e91b5a50098 [1]
      Signed-off-by: default avatarShigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20240312152719.724530-1-syoshida@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      ddbec99f
    • William Tu's avatar
      vmxnet3: Fix missing reserved tailroom · e127ce76
      William Tu authored
      Use rbi->len instead of rcd->len for non-dataring packet.
      
      Found issue:
        XDP_WARN: xdp_update_frame_from_buff(line:278): Driver BUG: missing reserved tailroom
        WARNING: CPU: 0 PID: 0 at net/core/xdp.c:586 xdp_warn+0xf/0x20
        CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O       6.5.1 #1
        RIP: 0010:xdp_warn+0xf/0x20
        ...
        ? xdp_warn+0xf/0x20
        xdp_do_redirect+0x15f/0x1c0
        vmxnet3_run_xdp+0x17a/0x400 [vmxnet3]
        vmxnet3_process_xdp+0xe4/0x760 [vmxnet3]
        ? vmxnet3_tq_tx_complete.isra.0+0x21e/0x2c0 [vmxnet3]
        vmxnet3_rq_rx_complete+0x7ad/0x1120 [vmxnet3]
        vmxnet3_poll_rx_only+0x2d/0xa0 [vmxnet3]
        __napi_poll+0x20/0x180
        net_rx_action+0x177/0x390
      Reported-by: default avatarMartin Zaharinov <micron10@gmail.com>
      Tested-by: default avatarMartin Zaharinov <micron10@gmail.com>
      Link: https://lore.kernel.org/netdev/74BF3CC8-2A3A-44FF-98C2-1E20F110A92E@gmail.com/
      Fixes: 54f00cce ("vmxnet3: Add XDP support.")
      Signed-off-by: default avatarWilliam Tu <witu@nvidia.com>
      Link: https://lore.kernel.org/r/20240309183147.28222-1-witu@nvidia.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      e127ce76
    • Kuniyuki Iwashima's avatar
      tcp: Fix refcnt handling in __inet_hash_connect(). · 04d9d1fc
      Kuniyuki Iwashima authored
      syzbot reported a warning in sk_nulls_del_node_init_rcu().
      
      The commit 66b60b0c ("dccp/tcp: Unhash sk from ehash for tb2 alloc
      failure after check_estalblished().") tried to fix an issue that an
      unconnected socket occupies an ehash entry when bhash2 allocation fails.
      
      In such a case, we need to revert changes done by check_established(),
      which does not hold refcnt when inserting socket into ehash.
      
      So, to revert the change, we need to __sk_nulls_add_node_rcu() instead
      of sk_nulls_add_node_rcu().
      
      Otherwise, sock_put() will cause refcnt underflow and leak the socket.
      
      [0]:
      WARNING: CPU: 0 PID: 23948 at include/net/sock.h:799 sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
      Modules linked in:
      CPU: 0 PID: 23948 Comm: syz-executor.2 Not tainted 6.8.0-rc6-syzkaller-00159-gc055fc00 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      RIP: 0010:sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
      Code: e8 7f 71 c6 f7 83 fb 02 7c 25 e8 35 6d c6 f7 4d 85 f6 0f 95 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 1b 6d c6 f7 90 <0f> 0b 90 eb b2 e8 10 6d c6 f7 4c 89 e7 be 04 00 00 00 e8 63 e7 d2
      RSP: 0018:ffffc900032d7848 EFLAGS: 00010246
      RAX: ffffffff89cd0035 RBX: 0000000000000001 RCX: 0000000000040000
      RDX: ffffc90004de1000 RSI: 000000000003ffff RDI: 0000000000040000
      RBP: 1ffff1100439ac26 R08: ffffffff89ccffe3 R09: 1ffff1100439ac28
      R10: dffffc0000000000 R11: ffffed100439ac29 R12: ffff888021cd6140
      R13: dffffc0000000000 R14: ffff88802a9bf5c0 R15: ffff888021cd6130
      FS:  00007f3b823f16c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3b823f0ff8 CR3: 000000004674a000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       __inet_hash_connect+0x140f/0x20b0 net/ipv4/inet_hashtables.c:1139
       dccp_v6_connect+0xcb9/0x1480 net/dccp/ipv6.c:956
       __inet_stream_connect+0x262/0xf30 net/ipv4/af_inet.c:678
       inet_stream_connect+0x65/0xa0 net/ipv4/af_inet.c:749
       __sys_connect_file net/socket.c:2048 [inline]
       __sys_connect+0x2df/0x310 net/socket.c:2065
       __do_sys_connect net/socket.c:2075 [inline]
       __se_sys_connect net/socket.c:2072 [inline]
       __x64_sys_connect+0x7a/0x90 net/socket.c:2072
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f3b8167dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f3b823f10c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 00007f3b817abf80 RCX: 00007f3b8167dda9
      RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
      RBP: 00007f3b823f1120 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      R13: 000000000000000b R14: 00007f3b817abf80 R15: 00007ffd3beb57b8
       </TASK>
      
      Reported-by: syzbot+12c506c1aae251e70449@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=12c506c1aae251e70449
      Fixes: 66b60b0c ("dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished().")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240308201623.65448-1-kuniyu@amazon.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      04d9d1fc
  5. 13 Mar, 2024 9 commits
    • Shay Drory's avatar
      devlink: Fix devlink parallel commands processing · d7d75124
      Shay Drory authored
      Commit 870c7ad4 ("devlink: protect devlink->dev by the instance
      lock") added devlink instance locking inside a loop that iterates over
      all the registered devlink instances on the machine in the pre-doit
      phase. This can lead to serialization of devlink commands over
      different devlink instances.
      
      For example: While the first devlink instance is executing firmware
      flash, all commands to other devlink instances on the machine are
      forced to wait until the first devlink finishes.
      
      Therefore, in the pre-doit phase, take the devlink instance lock only
      for the devlink instance the command is targeting. Devlink layer is
      taking a reference on the devlink instance, ensuring the devlink->dev
      pointer is valid. This reference taking was introduced by commit
      a3806872 ("devlink: take device reference for devlink object").
      Without this commit, it would not be safe to access devlink->dev
      lockless.
      
      Fixes: 870c7ad4 ("devlink: protect devlink->dev by the instance lock")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7d75124
    • Eric Dumazet's avatar
      net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX check · 343041b5
      Eric Dumazet authored
      taprio_parse_tc_entry() is not correctly checking
      TCA_TAPRIO_TC_ENTRY_INDEX attribute:
      
      	int tc; // Signed value
      
      	tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
      	if (tc >= TC_QOPT_MAX_QUEUE) {
      		NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
      		return -ERANGE;
      	}
      
      syzbot reported that it could fed arbitary negative values:
      
      UBSAN: shift-out-of-bounds in net/sched/sch_taprio.c:1722:18
      shift exponent -2147418108 is negative
      CPU: 0 PID: 5066 Comm: syz-executor367 Not tainted 6.8.0-rc7-syzkaller-00136-gc8a5c731 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
        ubsan_epilogue lib/ubsan.c:217 [inline]
        __ubsan_handle_shift_out_of_bounds+0x3c7/0x420 lib/ubsan.c:386
        taprio_parse_tc_entry net/sched/sch_taprio.c:1722 [inline]
        taprio_parse_tc_entries net/sched/sch_taprio.c:1768 [inline]
        taprio_change+0xb87/0x57d0 net/sched/sch_taprio.c:1877
        taprio_init+0x9da/0xc80 net/sched/sch_taprio.c:2134
        qdisc_create+0x9d4/0x1190 net/sched/sch_api.c:1355
        tc_modify_qdisc+0xa26/0x1e40 net/sched/sch_api.c:1776
        rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6617
        netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
        netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
        netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
        netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x221/0x270 net/socket.c:745
        ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
        ___sys_sendmsg net/socket.c:2638 [inline]
        __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f1b2dea3759
      Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffd4de452f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f1b2def0390 RCX: 00007f1b2dea3759
      RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000004
      RBP: 0000000000000003 R08: 0000555500000000 R09: 0000555500000000
      R10: 0000555500000000 R11: 0000000000000246 R12: 00007ffd4de45340
      R13: 00007ffd4de45310 R14: 0000000000000001 R15: 00007ffd4de45340
      
      Fixes: a54fc09e ("net/sched: taprio: allow user input of per-tc max SDU")
      Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      343041b5
    • Linu Cherian's avatar
      octeontx2-af: Use matching wake_up API variant in CGX command interface · e642921d
      Linu Cherian authored
      Use wake_up API instead of wake_up_interruptible, since
      wait_event_timeout API is used for waiting on command completion.
      
      Fixes: 1463f382 ("octeontx2-af: Add support for CGX link management")
      Signed-off-by: default avatarLinu Cherian <lcherian@marvell.com>
      Signed-off-by: default avatarSunil Goutham <sgoutham@marvell.com>
      Signed-off-by: default avatarSubbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e642921d
    • Sean Anderson's avatar
      soc: fsl: qbman: Use raw spinlock for cgr_lock · fbec4e7f
      Sean Anderson authored
      smp_call_function always runs its callback in hard IRQ context, even on
      PREEMPT_RT, where spinlocks can sleep. So we need to use a raw spinlock
      for cgr_lock to ensure we aren't waiting on a sleeping task.
      
      Although this bug has existed for a while, it was not apparent until
      commit ef2a8d54 ("net: dpaa: Adjust queue depth on rate change")
      which invokes smp_call_function_single via qman_update_cgr_safe every
      time a link goes up or down.
      
      Fixes: 96f413f4 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
      CC: stable@vger.kernel.org
      Reported-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Closes: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/Reported-by: default avatarSteffen Trumtrar <s.trumtrar@pengutronix.de>
      Closes: https://lore.kernel.org/linux-arm-kernel/87wmsyvclu.fsf@pengutronix.de/Signed-off-by: default avatarSean Anderson <sean.anderson@linux.dev>
      Reviewed-by: default avatarCamelia Groza <camelia.groza@nxp.com>
      Tested-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fbec4e7f
    • Sean Anderson's avatar
      soc: fsl: qbman: Always disable interrupts when taking cgr_lock · 584c2a91
      Sean Anderson authored
      smp_call_function_single disables IRQs when executing the callback. To
      prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
      This is already done by qman_update_cgr and qman_delete_cgr; fix the
      other lockers.
      
      Fixes: 96f413f4 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarSean Anderson <sean.anderson@linux.dev>
      Reviewed-by: default avatarCamelia Groza <camelia.groza@nxp.com>
      Tested-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      584c2a91
    • Jakub Kicinski's avatar
      Merge branch 'tcp-rds-fix-use-after-free-around-kernel-tcp-reqsk' · 67072c31
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      tcp/rds: Fix use-after-free around kernel TCP reqsk.
      
      syzkaller reported an warning of netns ref tracker for RDS TCP listener,
      which commit 740ea3c4 ("tcp: Clean up kernel listener's reqsk in
      inet_twsk_purge()") fixed for per-netns ehash.
      
      This series fixes the bug in the partial fix and fixes the reported bug
      in the global ehash.
      
      v4: https://lore.kernel.org/netdev/20240307232151.55963-1-kuniyu@amazon.com/
      v3: https://lore.kernel.org/netdev/20240307224423.53315-1-kuniyu@amazon.com/
      v2: https://lore.kernel.org/netdev/20240227011041.97375-1-kuniyu@amazon.com/
      v1: https://lore.kernel.org/netdev/20240223172448.94084-1-kuniyu@amazon.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240308200122.64357-1-kuniyu@amazon.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      67072c31
    • Kuniyuki Iwashima's avatar
      rds: tcp: Fix use-after-free of net in reqsk_timer_handler(). · 2a750d6a
      Kuniyuki Iwashima authored
      syzkaller reported a warning of netns tracker [0] followed by KASAN
      splat [1] and another ref tracker warning [1].
      
      syzkaller could not find a repro, but in the log, the only suspicious
      sequence was as follows:
      
        18:26:22 executing program 1:
        r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
        ...
        connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
      
      The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
      
      So, the scenario would be:
      
        1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
            rds_tcp_listen_init().
        2. syz-executor connect()s to it and creates a reqsk.
        3. syz-executor exit()s immediately.
        4. netns is dismantled.  [0]
        5. reqsk timer is fired, and UAF happens while freeing reqsk.  [1]
        6. listener is freed after RCU grace period.  [2]
      
      Basically, reqsk assumes that the listener guarantees netns safety
      until all reqsk timers are expired by holding the listener's refcount.
      However, this was not the case for kernel sockets.
      
      Commit 740ea3c4 ("tcp: Clean up kernel listener's reqsk in
      inet_twsk_purge()") fixed this issue only for per-netns ehash.
      
      Let's apply the same fix for the global ehash.
      
      [0]:
      ref_tracker: net notrefcnt@0000000065449cc3 has 1/1 users at
           sk_alloc (./include/net/net_namespace.h:337 net/core/sock.c:2146)
           inet6_create (net/ipv6/af_inet6.c:192 net/ipv6/af_inet6.c:119)
           __sock_create (net/socket.c:1572)
           rds_tcp_listen_init (net/rds/tcp_listen.c:279)
           rds_tcp_init_net (net/rds/tcp.c:577)
           ops_init (net/core/net_namespace.c:137)
           setup_net (net/core/net_namespace.c:340)
           copy_net_ns (net/core/net_namespace.c:497)
           create_new_namespaces (kernel/nsproxy.c:110)
           unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
           ksys_unshare (kernel/fork.c:3429)
           __x64_sys_unshare (kernel/fork.c:3496)
           do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
           entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      ...
      WARNING: CPU: 0 PID: 27 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179)
      
      [1]:
      BUG: KASAN: slab-use-after-free in inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
      Read of size 8 at addr ffff88801b370400 by task swapper/0/0
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
       print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
       kasan_report (mm/kasan/report.c:603)
       inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
       reqsk_timer_handler (net/ipv4/inet_connection_sock.c:979 net/ipv4/inet_connection_sock.c:1092)
       call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701)
       __run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2038)
       run_timer_softirq (kernel/time/timer.c:2053)
       __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
       irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
       sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
       </IRQ>
      
      Allocated by task 258 on cpu 0 at 83.612050s:
       kasan_save_stack (mm/kasan/common.c:48)
       kasan_save_track (mm/kasan/common.c:68)
       __kasan_slab_alloc (mm/kasan/common.c:343)
       kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
       copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
       create_new_namespaces (kernel/nsproxy.c:110)
       unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
       ksys_unshare (kernel/fork.c:3429)
       __x64_sys_unshare (kernel/fork.c:3496)
       do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      
      Freed by task 27 on cpu 0 at 329.158864s:
       kasan_save_stack (mm/kasan/common.c:48)
       kasan_save_track (mm/kasan/common.c:68)
       kasan_save_free_info (mm/kasan/generic.c:643)
       __kasan_slab_free (mm/kasan/common.c:265)
       kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
       cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
       process_one_work (kernel/workqueue.c:2638)
       worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
       kthread (kernel/kthread.c:388)
       ret_from_fork (arch/x86/kernel/process.c:153)
       ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
      
      The buggy address belongs to the object at ffff88801b370000
       which belongs to the cache net_namespace of size 4352
      The buggy address is located 1024 bytes inside of
       freed 4352-byte region [ffff88801b370000, ffff88801b371100)
      
      [2]:
      WARNING: CPU: 0 PID: 95 at lib/ref_tracker.c:228 ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
      Modules linked in:
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
      ...
      Call Trace:
      <IRQ>
       __sk_destruct (./include/net/net_namespace.h:353 net/core/sock.c:2204)
       rcu_core (./arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2165 kernel/rcu/tree.c:2433)
       __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
       irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
       sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
      </IRQ>
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 467fa153 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240308200122.64357-3-kuniyu@amazon.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2a750d6a
    • Eric Dumazet's avatar
      tcp: Fix NEW_SYN_RECV handling in inet_twsk_purge() · 1c4e97dd
      Eric Dumazet authored
      inet_twsk_purge() uses rcu to find TIME_WAIT and NEW_SYN_RECV
      objects to purge.
      
      These objects use SLAB_TYPESAFE_BY_RCU semantic and need special
      care. We need to use refcount_inc_not_zero(&sk->sk_refcnt).
      
      Reuse the existing correct logic I wrote for TIME_WAIT,
      because both structures have common locations for
      sk_state, sk_family, and netns pointer.
      
      If after the refcount_inc_not_zero() the object fields longer match
      the keys, use sock_gen_put(sk) to release the refcount.
      
      Then we can call inet_twsk_deschedule_put() for TIME_WAIT,
      inet_csk_reqsk_queue_drop_and_put() for NEW_SYN_RECV sockets,
      with BH disabled.
      
      Then we need to restart the loop because we had drop rcu_read_lock().
      
      Fixes: 740ea3c4 ("tcp: Clean up kernel listener's reqsk in inet_twsk_purge()")
      Link: https://lore.kernel.org/netdev/CANn89iLvFuuihCtt9PME2uS1WJATnf5fKjDToa1WzVnRzHnPfg@mail.gmail.com/T/#uSigned-off-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240308200122.64357-2-kuniyu@amazon.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1c4e97dd
    • Linus Torvalds's avatar
      Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · 9187210e
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "Core & protocols:
      
         - Large effort by Eric to lower rtnl_lock pressure and remove locks:
      
            - Make commonly used parts of rtnetlink (address, route dumps
              etc) lockless, protected by RCU instead of rtnl_lock.
      
            - Add a netns exit callback which already holds rtnl_lock,
              allowing netns exit to take rtnl_lock once in the core instead
              of once for each driver / callback.
      
            - Remove locks / serialization in the socket diag interface.
      
            - Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
      
            - Remove the dev_base_lock, depend on RCU where necessary.
      
         - Support busy polling on a per-epoll context basis. Poll length and
           budget parameters can be set independently of system defaults.
      
         - Introduce struct net_hotdata, to make sure read-mostly global
           config variables fit in as few cache lines as possible.
      
         - Add optional per-nexthop statistics to ease monitoring / debug of
           ECMP imbalance problems.
      
         - Support TCP_NOTSENT_LOWAT in MPTCP.
      
         - Ensure that IPv6 temporary addresses' preferred lifetimes are long
           enough, compared to other configured lifetimes, and at least 2 sec.
      
         - Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
      
         - Add support for the independent control state machine for bonding
           per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
           control state machine.
      
         - Add "network ID" to MCTP socket APIs to support hosts with multiple
           disjoint MCTP networks.
      
         - Re-use the mono_delivery_time skbuff bit for packets which user
           space wants to be sent at a specified time. Maintain the timing
           information while traversing veth links, bridge etc.
      
         - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
      
         - Simplify many places iterating over netdevs by using an xarray
           instead of a hash table walk (hash table remains in place, for use
           on fastpaths).
      
         - Speed up scanning for expired routes by keeping a dedicated list.
      
         - Speed up "generic" XDP by trying harder to avoid large allocations.
      
         - Support attaching arbitrary metadata to netconsole messages.
      
        Things we sprinkled into general kernel code:
      
         - Enforce VM_IOREMAP flag and range in ioremap_page_range and
           introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
           bpf_arena).
      
         - Rework selftest harness to enable the use of the full range of ksft
           exit code (pass, fail, skip, xfail, xpass).
      
        Netfilter:
      
         - Allow userspace to define a table that is exclusively owned by a
           daemon (via netlink socket aliveness) without auto-removing this
           table when the userspace program exits. Such table gets marked as
           orphaned and a restarting management daemon can re-attach/regain
           ownership.
      
         - Speed up element insertions to nftables' concatenated-ranges set
           type. Compact a few related data structures.
      
        BPF:
      
         - Add BPF token support for delegating a subset of BPF subsystem
           functionality from privileged system-wide daemons such as systemd
           through special mount options for userns-bound BPF fs to a trusted
           & unprivileged application.
      
         - Introduce bpf_arena which is sparse shared memory region between
           BPF program and user space where structures inside the arena can
           have pointers to other areas of the arena, and pointers work
           seamlessly for both user-space programs and BPF programs.
      
         - Introduce may_goto instruction that is a contract between the
           verifier and the program. The verifier allows the program to loop
           assuming it's behaving well, but reserves the right to terminate
           it.
      
         - Extend the BPF verifier to enable static subprog calls in spin lock
           critical sections.
      
         - Support registration of struct_ops types from modules which helps
           projects like fuse-bpf that seeks to implement a new struct_ops
           type.
      
         - Add support for retrieval of cookies for perf/kprobe multi links.
      
         - Support arbitrary TCP SYN cookie generation / validation in the TC
           layer with BPF to allow creating SYN flood handling in BPF
           firewalls.
      
         - Add code generation to inline the bpf_kptr_xchg() helper which
           improves performance when stashing/popping the allocated BPF
           objects.
      
        Wireless:
      
         - Add SPP (signaling and payload protected) AMSDU support.
      
         - Support wider bandwidth OFDMA, as required for EHT operation.
      
        Driver API:
      
         - Major overhaul of the Energy Efficient Ethernet internals to
           support new link modes (2.5GE, 5GE), share more code between
           drivers (especially those using phylib), and encourage more
           uniform behavior. Convert and clean up drivers.
      
         - Define an API for querying per netdev queue statistics from
           drivers.
      
         - IPSec: account in global stats for fully offloaded sessions.
      
         - Create a concept of Ethernet PHY Packages at the Device Tree level,
           to allow parameterizing the existing PHY package code.
      
         - Enable Rx hashing (RSS) on GTP protocol fields.
      
        Misc:
      
         - Improvements and refactoring all over networking selftests.
      
         - Create uniform module aliases for TC classifiers, actions, and
           packet schedulers to simplify creating modprobe policies.
      
         - Address all missing MODULE_DESCRIPTION() warnings in networking.
      
         - Extend the Netlink descriptions in YAML to cover message
           encapsulation or "Netlink polymorphism", where interpretation of
           nested attributes depends on link type, classifier type or some
           other "class type".
      
        Drivers:
      
         - Ethernet high-speed NICs:
            - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
            - Intel (100G, ice, idpf):
               - support E825-C devices
            - nVidia/Mellanox:
               - support devices with one port and multiple PCIe links
            - Broadcom (bnxt):
               - support n-tuple filters
               - support configuring the RSS key
            - Wangxun (ngbe/txgbe):
               - implement irq_domain for TXGBE's sub-interrupts
            - Pensando/AMD:
               - support XDP
               - optimize queue submission and wakeup handling (+17% bps)
               - optimize struct layout, saving 28% of memory on queues
      
         - Ethernet NICs embedded and virtual:
            - Google cloud vNIC:
               - refactor driver to perform memory allocations for new queue
                 config before stopping and freeing the old queue memory
            - Synopsys (stmmac):
               - obey queueMaxSDU and implement counters required by 802.1Qbv
            - Renesas (ravb):
               - support packet checksum offload
               - suspend to RAM and runtime PM support
      
         - Ethernet switches:
            - nVidia/Mellanox:
               - support for nexthop group statistics
            - Microchip:
               - ksz8: implement PHY loopback
               - add support for KSZ8567, a 7-port 10/100Mbps switch
      
         - PTP:
            - New driver for RENESAS FemtoClock3 Wireless clock generator.
            - Support OCP PTP cards designed and built by Adva.
      
         - CAN:
            - Support recvmsg() flags for own, local and remote traffic on CAN
              BCM sockets.
            - Support for esd GmbH PCIe/402 CAN device family.
            - m_can:
               - Rx/Tx submission coalescing
               - wake on frame Rx
      
         - WiFi:
            - Intel (iwlwifi):
               - enable signaling and payload protected A-MSDUs
               - support wider-bandwidth OFDMA
               - support for new devices
               - bump FW API to 89 for AX devices; 90 for BZ/SC devices
            - MediaTek (mt76):
               - mt7915: newer ADIE version support
               - mt7925: radio temperature sensor support
            - Qualcomm (ath11k):
               - support 6 GHz station power modes: Low Power Indoor (LPI),
                 Standard Power) SP and Very Low Power (VLP)
               - QCA6390 & WCN6855: support 2 concurrent station interfaces
               - QCA2066 support
            - Qualcomm (ath12k):
               - refactoring in preparation for Multi-Link Operation (MLO)
                 support
               - 1024 Block Ack window size support
               - firmware-2.bin support
               - support having multiple identical PCI devices (firmware needs
                 to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
               - QCN9274: support split-PHY devices
               - WCN7850: enable Power Save Mode in station mode
               - WCN7850: P2P support
            - RealTek:
               - rtw88: support for more rtw8811cu and rtw8821cu devices
               - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
               - rtlwifi: speed up USB firmware initialization
               - rtwl8xxxu:
                   - RTL8188F: concurrent interface support
                   - Channel Switch Announcement (CSA) support in AP mode
            - Broadcom (brcmfmac):
               - per-vendor feature support
               - per-vendor SAE password setup
               - DMI nvram filename quirk for ACEPC W5 Pro"
      
      * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
        nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
        nexthop: Fix out-of-bounds access during attribute validation
        nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
        nexthop: Only parse NHA_OP_FLAGS for get messages that require it
        bpf: move sleepable flag from bpf_prog_aux to bpf_prog
        bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
        selftests/bpf: Add kprobe multi triggering benchmarks
        ptp: Move from simple ida to xarray
        vxlan: Remove generic .ndo_get_stats64
        vxlan: Do not alloc tstats manually
        devlink: Add comments to use netlink gen tool
        nfp: flower: handle acti_netdevs allocation failure
        net/packet: Add getsockopt support for PACKET_COPY_THRESH
        net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
        selftests/bpf: Add bpf_arena_htab test.
        selftests/bpf: Add bpf_arena_list test.
        selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
        bpf: Add helper macro bpf_addr_space_cast()
        libbpf: Recognize __arena global variables.
        bpftool: Recognize arena map type
        ...
      9187210e