1. 10 Jul, 2024 6 commits
    • Merge branch 'fixes-for-bpf-timer-lockup-and-uaf' · 0c237341
      Alexei Starovoitov authored
      Kumar Kartikeya Dwivedi says:
      
      ====================
      Fixes for BPF timer lockup and UAF
      
      The following patches contain fixes for timer lockups and a
      use-after-free scenario.
      
      This set proposes to fix the following lockup situation for BPF timers.
      
      CPU 1					CPU 2
      
      bpf_timer_cb				bpf_timer_cb
        timer_cb1				  timer_cb2
          bpf_timer_cancel(timer_cb2)		    bpf_timer_cancel(timer_cb1)
            hrtimer_cancel			      hrtimer_cancel
      
      In this case, both callbacks will continue waiting for each other to
      finish synchronously, causing a lockup.
      
      The proposed fix adds support for tracking in-flight cancellations
      *begun by other timer callbacks* for a particular BPF timer.  Whenever
      preparing to call hrtimer_cancel, a callback will increment the target
      timer's counter, then inspect its in-flight cancellations, and if
      non-zero, return -EDEADLK to avoid situations where the target timer's
      callback is waiting for its completion.
      
      This does mean that in cases where a callback is fired and cancelled, it
      will be unable to cancel any timers in that execution. This can be
      alleviated by maintaining the list of waiting callbacks in bpf_hrtimer
      and searching through it to avoid interdependencies, but this may
      introduce additional delays in bpf_timer_cancel, in addition to
      requiring extra state at runtime which may need to be allocated or
      reused from bpf_hrtimer storage. Moreover, extra synchronization is
      needed to delete these elements from the list of waiting callbacks once
      hrtimer_cancel has finished.
      
      The second patch is for a deadlock situation similar to above in
      bpf_timer_cancel_and_free, but also a UAF scenario that can occur if
      timer is armed before entering it, if hrtimer_running check causes the
      hrtimer_cancel call to be skipped.
      
      As seen above, synchronous hrtimer_cancel would lead to deadlock (if
      same callback tries to free its timer, or two timers free each other),
      therefore we queue work onto the global workqueue to ensure outstanding
      timers are cancelled before bpf_hrtimer state is freed.
      
      Further details are in the patches.
      ====================
      
      Link: https://lore.kernel.org/r/20240709185440.1104957-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Defer work in bpf_timer_cancel_and_free · a6fcd19d
      Kumar Kartikeya Dwivedi authored
      Currently, the same case as previous patch (two timer callbacks trying
      to cancel each other) can be invoked through bpf_map_update_elem as
      well, or more precisely, freeing map elements containing timers. Since
      this relies on hrtimer_cancel as well, it is prone to the same deadlock
      situation as the previous patch.
      
      It would be sufficient to use hrtimer_try_to_cancel to fix this problem,
      as the timer cannot be enqueued after async_cancel_and_free. Once
      async_cancel_and_free has been done, the timer must be reinitialized
      before it can be armed again. The callback running in parallel trying to
      arm the timer will fail, and freeing bpf_hrtimer without waiting is
      sufficient (given kfree_rcu), and bpf_timer_cb will return
      HRTIMER_NORESTART, preventing the timer from being rearmed again.
      
      However, there exists a UAF scenario where the callback arms the timer
      before entering this function and cancellation then fails (due to the
      timer callback invoking this routine, or the target timer callback
      running concurrently). In such a case, if the timer expiration is
      significantly far in the future, the RCU grace period expiring before
      it will free the bpf_hrtimer state and, along with it, the struct
      hrtimer that is still enqueued.
      
      Hence, it is clear cancellation needs to occur after
      async_cancel_and_free, and yet it cannot be done inline due to deadlock
      issues. We thus modify bpf_timer_cancel_and_free to defer work to the
      global workqueue, adding a work_struct alongside rcu_head (both used at
      _different_ points of time, so can share space).
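
A minimal userspace sketch of the space-sharing idea, not the kernel's actual layout: `fake_work`, `fake_rcu_head` and `timer_state` are hypothetical stand-ins for work_struct, rcu_head and the timer state. Because the deferred-cancel work item and the RCU head are needed at different times, they can overlap in a union.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's work_struct and rcu_head. */
struct fake_work { void (*func)(void *); void *data; long pending; };
struct fake_rcu_head { void *next; void (*func)(void *); };

/* The two members are used at different points in the object's lifetime
 * (first the deferred cancel, later the RCU free), so a union lets them
 * overlap instead of growing the struct. */
struct timer_state {
	int armed;
	union {
		struct fake_work delete_work;
		struct fake_rcu_head rcu;
	};
};

/* Size of the shared region: the larger of the two members. */
static size_t shared_region_size(void)
{
	size_t a = sizeof(struct fake_work), b = sizeof(struct fake_rcu_head);

	return a > b ? a : b;
}
```

Sharing is only safe here because the work item has finished with its storage before the RCU head is written.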
      
      Update existing code comments to reflect the new state of affairs.
      
      Fixes: b00628b1 ("bpf: Introduce bpf timers.")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20240709185440.1104957-3-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fail bpf_timer_cancel when callback is being cancelled · d4523831
      Kumar Kartikeya Dwivedi authored
      Given a schedule:
      
      timer1 cb			timer2 cb
      
      bpf_timer_cancel(timer2);	bpf_timer_cancel(timer1);
      
      Both bpf_timer_cancel calls would wait for the other callback to finish
      executing, introducing a lockup.
      
      Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps
      track of all in-flight cancellation requests for a given BPF timer.
      Whenever cancelling a BPF timer, we must check if we have outstanding
      cancellation requests, and if so, we must fail the operation with an
      error (-EDEADLK) since cancellation is synchronous and waits for the
      callback to finish executing. This implies that we can enter a deadlock
      situation involving two or more timer callbacks executing in parallel
      and attempting to cancel one another.
      
      Note that we avoid incrementing the cancelling counter for the target
      timer (the one being cancelled) if bpf_timer_cancel is not invoked from
      a callback, to avoid spurious errors. The whole point of detecting
      cur->cancelling and returning -EDEADLK is to not enter a busy wait loop
      (which may or may not lead to a lockup). This does not apply in case the
      caller is in a non-callback context, the other side can continue to
      cancel as it sees fit without running into errors.
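
The counter protocol described above can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel code: `sketch_timer`, `sketch_cancel_begin` and `sketch_cancel_end` are hypothetical names, and the blocking hrtimer_cancel step is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for bpf_hrtimer; only the counter is modeled. */
struct sketch_timer {
	atomic_int cancelling; /* in-flight cancellations targeting this timer */
};

/*
 * cur is the timer whose callback is calling cancel (NULL when the caller
 * is not a timer callback); target is the timer being cancelled.
 * Returns 0 if waiting for target is allowed, -EDEADLK otherwise.
 */
static int sketch_cancel_begin(struct sketch_timer *cur, struct sketch_timer *target)
{
	if (cur == NULL)
		return 0; /* non-callback context: no counter update, no check */

	atomic_fetch_add(&target->cancelling, 1);
	if (atomic_load(&cur->cancelling) > 0) {
		/* Someone is already waiting for our callback: back out. */
		atomic_fetch_sub(&target->cancelling, 1);
		return -EDEADLK;
	}
	return 0;
}

/* Called once the (omitted) synchronous cancel has finished. */
static void sketch_cancel_end(struct sketch_timer *cur, struct sketch_timer *target)
{
	if (cur != NULL)
		atomic_fetch_sub(&target->cancelling, 1);
}
```

With two callbacks cancelling each other, whichever one checks its own counter after the other has published its request gets -EDEADLK instead of waiting.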
      
      Background on prior attempts:
      
      Earlier versions of this patch used a bool 'cancelling' bit and used the
      following pattern under timer->lock to publish cancellation status.
      
      lock(t->lock);
      t->cancelling = true;
      mb();
      if (cur->cancelling)
      	return -EDEADLK;
      unlock(t->lock);
      hrtimer_cancel(t->timer);
      t->cancelling = false;
      
      The store outside the critical section could overwrite a parallel
      request's t->cancelling assignment to true, which is meant to ensure
      the callback executing in parallel observes its cancellation status.
      
      It would be necessary to clear this cancelling bit once hrtimer_cancel
      is done, but lack of serialization introduced races. Another option was
      explored where bpf_timer_start would clear the bit when (re)starting the
      timer under timer->lock. This would ensure serialized access to the
      cancelling bit, but may allow it to be cleared before in-flight
      hrtimer_cancel has finished executing, such that lockups can occur
      again.
      
      Thus, we choose an atomic counter to keep track of all outstanding
      cancellation requests and use it to prevent lockups in case callbacks
      attempt to cancel each other while executing in parallel.
      
      Reported-by: Dohyun Kim <dohyunkim@google.com>
      Reported-by: Neel Natu <neelnatu@google.com>
      Fixes: b00628b1 ("bpf: Introduce bpf timers.")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20240709185440.1104957-2-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix order of args in call to bpf_map_kvcalloc · af253aef
      Mohammad Shehar Yaar Tausif authored
      The original function call passed size of smap->bucket before the number of
      buckets which raises the error 'calloc-transposed-args' on compilation.
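
For illustration, the expected argument order can be shown with the standard calloc(), which takes the element count first and the element size second. `sketch_bucket` and `sketch_alloc_buckets` are hypothetical names, not the kernel's types.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bucket type, standing in for the elements of smap->buckets. */
struct sketch_bucket { void *head; int lock; };

/* calloc-style allocators take (element count, element size), in that order. */
static struct sketch_bucket *sketch_alloc_buckets(size_t nbuckets)
{
	return calloc(nbuckets, sizeof(struct sketch_bucket));
}
```

Swapping the two arguments allocates the same number of bytes, which is why the mistake went unnoticed until the compiler began warning about it.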
      
      Vlastimil Babka added:
      
      The order of parameters can be traced back all the way to 6ac99e8f
      ("bpf: Introduce bpf sk local storage") across several refactorings,
      and that's why the commit is used as a Fixes: tag.
      
      In v6.10-rc1, a different commit 2c321f3f ("mm: change inlined
      allocation helpers to account at the call site") however exposed the
      order of args in a way that gcc-14 has enough visibility to start
      warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
      then a macro alias for kvcalloc instead of a static inline wrapper.
      
      To sum up, the warning happens when the following conditions are all met:
      
      - gcc-14 is used (didn't see it with gcc-13)
      - commit 2c321f3f is present
      - CONFIG_MEMCG is not enabled in .config
      - CONFIG_WERROR turns this from a compiler warning to error
      
      Fixes: 6ac99e8f ("bpf: Introduce bpf sk local storage")
      Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Christian Kujau <lists@nerdbynature.de>
      Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: ethernet: lantiq_etop: fix double free in detach · e1533b63
      Aleksander Jan Bajkowski authored
      The number of the currently released descriptor is never incremented
      which results in the same skb being released multiple times.
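
A minimal sketch of the corrected release loop, assuming a hypothetical ring layout (`sketch_ring`, `SKETCH_NUM_DESC`) rather than the driver's real structures. The point is that the descriptor index advances on every iteration, so each buffer is released exactly once.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NUM_DESC 4 /* hypothetical ring size */

struct sketch_ring {
	void *skb[SKETCH_NUM_DESC]; /* one buffer per descriptor */
	int desc;                   /* index of the descriptor being released */
};

/* Release every buffer exactly once; returns how many slots were visited. */
static int sketch_free_ring(struct sketch_ring *ring)
{
	int freed = 0;

	/* The index is incremented each pass; without the increment the loop
	 * would hand the same buffer to free() again and again. */
	for (ring->desc = 0; ring->desc < SKETCH_NUM_DESC; ring->desc++) {
		free(ring->skb[ring->desc]);
		ring->skb[ring->desc] = NULL;
		freed++;
	}
	return freed;
}
```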
      
      Fixes: 504d4721 ("MIPS: Lantiq: Add ethernet driver")
      Reported-by: Joe Perches <joe@perches.com>
      Closes: https://lore.kernel.org/all/fc1bf93d92bb5b2f99c6c62745507cc22f3a7b2d.camel@perches.com/
      Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://patch.msgid.link/20240708205826.5176-1-olek2@wp.pl
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • i40e: Fix XDP program unloading while removing the driver · 01fc5142
      Michal Kubiak authored
      The commit 6533e558 ("i40e: Fix reset path while removing
      the driver") introduced a new PF state "__I40E_IN_REMOVE" to block
      modifying the XDP program while the driver is being removed.
      Unfortunately, such a change is useful only if the ".ndo_bpf()"
      callback is called outside the rmmod context, because unloading the
      existing XDP program is also a part of the driver removal procedure.
      In other words, from the rmmod context the driver is expected to
      unload the XDP program without reporting any errors. Otherwise,
      a kernel warning with a call stack is printed to dmesg.
      
      Example failing scenario:
       1. Load the i40e driver.
       2. Load the XDP program.
       3. Unload the i40e driver (using "rmmod" command).
      
      The example kernel warning log:
      
      [  +0.004646] WARNING: CPU: 94 PID: 10395 at net/core/dev.c:9290 unregister_netdevice_many_notify+0x7a9/0x870
      [...]
      [  +0.010959] RIP: 0010:unregister_netdevice_many_notify+0x7a9/0x870
      [...]
      [  +0.002726] Call Trace:
      [  +0.002457]  <TASK>
      [  +0.002119]  ? __warn+0x80/0x120
      [  +0.003245]  ? unregister_netdevice_many_notify+0x7a9/0x870
      [  +0.005586]  ? report_bug+0x164/0x190
      [  +0.003678]  ? handle_bug+0x3c/0x80
      [  +0.003503]  ? exc_invalid_op+0x17/0x70
      [  +0.003846]  ? asm_exc_invalid_op+0x1a/0x20
      [  +0.004200]  ? unregister_netdevice_many_notify+0x7a9/0x870
      [  +0.005579]  ? unregister_netdevice_many_notify+0x3cc/0x870
      [  +0.005586]  unregister_netdevice_queue+0xf7/0x140
      [  +0.004806]  unregister_netdev+0x1c/0x30
      [  +0.003933]  i40e_vsi_release+0x87/0x2f0 [i40e]
      [  +0.004604]  i40e_remove+0x1a1/0x420 [i40e]
      [  +0.004220]  pci_device_remove+0x3f/0xb0
      [  +0.003943]  device_release_driver_internal+0x19f/0x200
      [  +0.005243]  driver_detach+0x48/0x90
      [  +0.003586]  bus_remove_driver+0x6d/0xf0
      [  +0.003939]  pci_unregister_driver+0x2e/0xb0
      [  +0.004278]  i40e_exit_module+0x10/0x5f0 [i40e]
      [  +0.004570]  __do_sys_delete_module.isra.0+0x197/0x310
      [  +0.005153]  do_syscall_64+0x85/0x170
      [  +0.003684]  ? syscall_exit_to_user_mode+0x69/0x220
      [  +0.004886]  ? do_syscall_64+0x95/0x170
      [  +0.003851]  ? exc_page_fault+0x7e/0x180
      [  +0.003932]  entry_SYSCALL_64_after_hwframe+0x71/0x79
      [  +0.005064] RIP: 0033:0x7f59dc9347cb
      [  +0.003648] Code: 73 01 c3 48 8b 0d 65 16 0c 00 f7 d8 64 89 01 48 83
      c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f
      05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 16 0c 00 f7 d8 64 89 01 48
      [  +0.018753] RSP: 002b:00007ffffac99048 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      [  +0.007577] RAX: ffffffffffffffda RBX: 0000559b9bb2f6e0 RCX: 00007f59dc9347cb
      [  +0.007140] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000559b9bb2f748
      [  +0.007146] RBP: 00007ffffac99070 R08: 1999999999999999 R09: 0000000000000000
      [  +0.007133] R10: 00007f59dc9a5ac0 R11: 0000000000000206 R12: 0000000000000000
      [  +0.007141] R13: 00007ffffac992d8 R14: 0000559b9bb2f6e0 R15: 0000000000000000
      [  +0.007151]  </TASK>
      [  +0.002204] ---[ end trace 0000000000000000 ]---
      
      Fix this by checking if the XDP program is being loaded or unloaded.
      Then, block only loading a new program while "__I40E_IN_REMOVE" is set.
      Also, move testing "__I40E_IN_REMOVE" flag to the beginning of XDP_SETUP
      callback to avoid unnecessary operations and checks.
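
The decision described above can be sketched as a small predicate. This is a hypothetical model, not the driver code: `sketch_xdp_setup` is an invented name, `in_remove` mirrors the "__I40E_IN_REMOVE" state, and the -EINVAL return value is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* in_remove models the "__I40E_IN_REMOVE" state; new_prog is NULL when
 * the request unloads the current XDP program. */
static int sketch_xdp_setup(bool in_remove, const void *new_prog)
{
	/* Test the removal state first, and reject only *loading*. */
	if (in_remove && new_prog != NULL)
		return -EINVAL; /* assumed error code */

	/* ... the real callback would swap programs here ... */
	return 0;
}
```

Unloading (a NULL program) succeeds even during removal, which is what the rmmod path needs.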
      
      Fixes: 6533e558 ("i40e: Fix reset path while removing the driver")
      Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://patch.msgid.link/20240708230750.625986-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 09 Jul, 2024 7 commits
    • net: fix rc7's __skb_datagram_iter() · f1538310
      Hugh Dickins authored
      X would not start in my old 32-bit partition (and the "n"-handling looks
      just as wrong on 64-bit, but for whatever reason did not show up there):
      "n" must be accumulated over all pages before it's added to "offset" and
      compared with "copy", immediately after the skb_frag_foreach_page() loop.
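
A simplified sketch of the accumulation rule, using a hypothetical array of per-page copy results in place of the real skb_frag_foreach_page() loop. `sketch_copy_frag` and its parameters are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* page_copied[i] is the number of bytes actually copied from page i. */
static size_t sketch_copy_frag(const size_t *page_copied, size_t npages,
			       size_t *offset, size_t copy, int *short_copy)
{
	size_t n = 0;

	for (size_t i = 0; i < npages; i++)
		n += page_copied[i]; /* accumulate, do not overwrite */

	/* Only after the loop: advance offset and compare the total with copy. */
	*offset += n;
	*short_copy = (n != copy);
	return n;
}
```

If "n" held only the last page's count, a fragment spanning several pages would be reported as a short copy even when every byte had been copied.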
      
      Fixes: d2d30a37 ("net: allow skb_datagram_iter to be called from any context")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Link: https://patch.msgid.link/fef352e8-b89a-da51-f8ce-04bc39ee6481@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 528269fe
      Paolo Abeni authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2024-07-09
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 3 non-merge commits during the last 1 day(s) which contain
      a total of 5 files changed, 81 insertions(+), 11 deletions(-).
      
      The main changes are:
      
      1) Fix a use-after-free in a corner case where tcx_entry got released too
         early. Also add BPF test coverage along with the fix, from Daniel Borkmann.
      
      2) Fix a kernel panic on Loongarch in sk_msg_recvmsg() which got triggered
         by running BPF sockmap selftests, from Geliang Tang.
      
      bpf-for-netdev
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        skmsg: Skip zero length skb in sk_msg_recvmsg
        selftests/bpf: Extend tcx tests to cover late tcx_entry release
        bpf: Fix too early release of tcx_entry
      ====================
      
      Link: https://patch.msgid.link/20240709091452.27840-1-daniel@iogearbox.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: ks8851: Fix deadlock with the SPI chip variant · 0913ec33
      Ronald Wahl authored
      When SMP is enabled and spinlocks are actually functional then there is
      a deadlock with the 'statelock' spinlock between ks8851_start_xmit_spi
      and ks8851_irq:
      
          watchdog: BUG: soft lockup - CPU#0 stuck for 27s!
          call trace:
            queued_spin_lock_slowpath+0x100/0x284
            do_raw_spin_lock+0x34/0x44
            ks8851_start_xmit_spi+0x30/0xb8
            ks8851_start_xmit+0x14/0x20
            netdev_start_xmit+0x40/0x6c
            dev_hard_start_xmit+0x6c/0xbc
            sch_direct_xmit+0xa4/0x22c
            __qdisc_run+0x138/0x3fc
            qdisc_run+0x24/0x3c
            net_tx_action+0xf8/0x130
            handle_softirqs+0x1ac/0x1f0
            __do_softirq+0x14/0x20
            ____do_softirq+0x10/0x1c
            call_on_irq_stack+0x3c/0x58
            do_softirq_own_stack+0x1c/0x28
            __irq_exit_rcu+0x54/0x9c
            irq_exit_rcu+0x10/0x1c
            el1_interrupt+0x38/0x50
            el1h_64_irq_handler+0x18/0x24
            el1h_64_irq+0x64/0x68
            __netif_schedule+0x6c/0x80
            netif_tx_wake_queue+0x38/0x48
            ks8851_irq+0xb8/0x2c8
            irq_thread_fn+0x2c/0x74
            irq_thread+0x10c/0x1b0
            kthread+0xc8/0xd8
            ret_from_fork+0x10/0x20
      
      This issue has not been identified earlier because tests were done on
      a device with SMP disabled and so spinlocks were actually NOPs.
      
      Now use spin_(un)lock_bh for TX queue related locking to avoid execution
      of softirq work synchronously that would lead to a deadlock.
      
      Fixes: 3dc5d445 ("net: ks8851: Fix TX stall caused by TX buffer overrun")
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Simon Horman <horms@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20240706101337.854474-1-rwahl@gmx.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability() · 442e26af
      Aleksandr Mishin authored
      In rvu_check_rsrc_availability(), in case of an invalid SSOW request,
      incorrect data is printed to the error log: the 'req->sso' value is
      printed instead of 'req->ssow'. This looks like a copy-paste mistake.
      
      Fix this mistake by replacing 'req->sso' with 'req->ssow'.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Fixes: 746ea742 ("octeontx2-af: Add RVU block LF provisioning support")
      Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20240705095317.12640-1-amishin@t-argos.ru
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bnxt: fix crashes when reducing ring count with active RSS contexts · 0d1b7d6c
      Jakub Kicinski authored
      bnxt doesn't check if a ring is used by RSS contexts when reducing
      ring count. Core performs a similar check for the drivers for
      the main context, but core doesn't know about additional contexts,
      so it can't validate them. bnxt_fill_hw_rss_tbl_p5() uses ring
      id to index bp->rx_ring[], which without the check may end up
      being out of bounds.
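
The missing validation can be sketched as follows. This is a hypothetical model, not the driver's data structures: `sketch_rss_ctx` and `sketch_check_ring_count` are invented names, and the -EINVAL return value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One additional RSS context: its indirection table lists the RX ring id
 * that each hash bucket maps to. */
struct sketch_rss_ctx {
	const unsigned int *indir;
	size_t indir_len;
};

/* Refuse a new ring count if any context still points at a ring >= new_count. */
static int sketch_check_ring_count(const struct sketch_rss_ctx *ctxs, size_t nctx,
				   unsigned int new_count)
{
	for (size_t c = 0; c < nctx; c++)
		for (size_t i = 0; i < ctxs[c].indir_len; i++)
			if (ctxs[c].indir[i] >= new_count)
				return -EINVAL; /* assumed error code */
	return 0;
}
```

Rejecting the request up front keeps every ring id in the table within the bounds of the ring array that it later indexes.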
      
        BUG: KASAN: slab-out-of-bounds in __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
        Read of size 2 at addr ffff8881c5809618 by task ethtool/31525
        Call Trace:
        __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
         bnxt_hwrm_vnic_rss_cfg_p5+0xf7/0x460
         __bnxt_setup_vnic_p5+0x12e/0x270
         __bnxt_open_nic+0x2262/0x2f30
         bnxt_open_nic+0x5d/0xf0
         ethnl_set_channels+0x5d4/0xb30
         ethnl_default_set_doit+0x2f1/0x620
      
      Core does track the additional contexts in net-next, so we can
      move this validation out of the driver as a follow up there.
      
      Fixes: b3d0083c ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Link: https://patch.msgid.link/20240705020005.681746-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • skmsg: Skip zero length skb in sk_msg_recvmsg · f0c18025
      Geliang Tang authored
      When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
      platform, the following kernel panic occurs:
      
        [...]
        Oops[#1]:
        CPU: 22 PID: 2824 Comm: test_progs Tainted: G           OE  6.10.0-rc2+ #18
        Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
           ... ...
           ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
          ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
         CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
         PRMD: 0000000c (PPLV0 +PIE +PWE)
         EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
         ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
        ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
         BADV: 0000000000000040
         PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
        Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
        Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
        Stack : ...
        Call Trace:
        [<9000000004162774>] copy_page_to_iter+0x74/0x1c0
        [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
        [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
        [<90000000049aae34>] inet_recvmsg+0x54/0x100
        [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
        [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
        [<900000000481e27c>] sys_recvfrom+0x1c/0x40
        [<9000000004c076ec>] do_syscall+0x8c/0xc0
        [<9000000003731da4>] handle_syscall+0xc4/0x160
        Code: ...
        ---[ end trace 0000000000000000 ]---
        Kernel panic - not syncing: Fatal exception
        Kernel relocated by 0x3510000
         .text @ 0x9000000003710000
         .data @ 0x9000000004d70000
         .bss  @ 0x9000000006469400
        ---[ end Kernel panic - not syncing: Fatal exception ]---
        [...]
      
      This crash happens every time the sockmap_skb_verdict_shutdown subtest
      in sockmap_basic is run.
      
      The crash occurs because a NULL pointer is passed to page_address() in
      sk_msg_recvmsg(). Due to the different implementations depending on the
      architecture, page_address(NULL) will trigger a panic on Loongarch
      platform but not on x86 platform. So this bug was hidden on x86 platform
      for a while, but now it is exposed on Loongarch platform. The root cause
      is that a zero length skb (skb->len == 0) was put on the queue.
      
      This zero length skb is a TCP FIN packet, which was sent by shutdown(),
      invoked in test_sockmap_skb_verdict_shutdown():
      
      	shutdown(p1, SHUT_WR);
      
      In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
      page is put to this sge (see sg_set_page in sg_set_page), but this empty
      sge is queued into ingress_msg list.
      
      And in sk_msg_recvmsg(), this empty sge is used, and sg_page(sge) returns
      a NULL page. This NULL page is passed to copy_page_to_iter(), which passes
      it to kmap_local_page() and to page_address(), and then the kernel panics.
      
      To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
      if copy is zero, that means it's a zero length skb, skip invoking
      copy_page_to_iter(). We are using the EFAULT return triggered by
      copy_page_to_iter to check for is_fin in tcp_bpf.c.
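
A userspace sketch of the skip, with memcpy() standing in for copy_page_to_iter() and a hypothetical `sketch_sge` element type. It only illustrates the "do not touch the page when copy is zero" rule, not the rest of sk_msg_recvmsg().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scatterlist element: page may be NULL when length is 0. */
struct sketch_sge { const char *page; size_t length; };

/* Copy up to len bytes from the elements into dst, skipping empty ones. */
static size_t sketch_recv(const struct sketch_sge *sge, size_t nsge,
			  char *dst, size_t len)
{
	size_t copied = 0;

	for (size_t i = 0; i < nsge && copied < len; i++) {
		size_t copy = sge[i].length;

		if (copy > len - copied)
			copy = len - copied;
		if (copy == 0)
			continue; /* zero-length entry: never touch its (NULL) page */
		memcpy(dst + copied, sge[i].page, copy);
		copied += copy;
	}
	return copied;
}
```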
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Suggested-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn
    • net: phy: microchip: lan87xx: reinit PHY after cable test · 30f747b8
      Oleksij Rempel authored
      Reinit PHY after cable test, otherwise link can't be established on
      tested port. This issue is reproducible on LAN9372 switches with
      integrated 100BaseT1 PHYs.
      
      Fixes: 78805025 ("net: phy: microchip_t1: add cable test support for lan87xx phy")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://patch.msgid.link/20240705084954.83048-1-o.rempel@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 08 Jul, 2024 3 commits
    • selftests/bpf: Extend tcx tests to cover late tcx_entry release · 5f1d18de
      Daniel Borkmann authored
      Add a test case which replaces an active ingress qdisc while keeping the
      miniq intact during the transition period to the new clsact qdisc.
      
        # ./vmtest.sh -- ./test_progs -t tc_link
        [...]
        ./test_progs -t tc_link
        [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
        [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
        #332     tc_links_after:OK
        #333     tc_links_append:OK
        #334     tc_links_basic:OK
        #335     tc_links_before:OK
        #336     tc_links_chain_classic:OK
        #337     tc_links_chain_mixed:OK
        #338     tc_links_dev_chain0:OK
        #339     tc_links_dev_cleanup:OK
        #340     tc_links_dev_mixed:OK
        #341     tc_links_ingress:OK
        #342     tc_links_invalid:OK
        #343     tc_links_prepend:OK
        #344     tc_links_replace:OK
        #345     tc_links_revision:OK
        Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240708133130.11609-2-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Fix too early release of tcx_entry · 1cb6f0ba
      Daniel Borkmann authored
      Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
      an issue that the tcx_entry can be released too early leading to a use
      after free (UAF) when an active old-style ingress or clsact qdisc with a
      shared tc block is later replaced by another ingress or clsact instance.
      
      Essentially, the sequence to trigger the UAF (one example) can be as follows:
      
        1. A network namespace is created
        2. An ingress qdisc is created. This allocates a tcx_entry, and
           &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
           same time, a tcf block with index 1 is created.
        3. chain0 is attached to the tcf block. chain0 must be connected to
           the block linked to the ingress qdisc to later reach the function
           tcf_chain0_head_change_cb_del() which triggers the UAF.
        4. Create and graft a clsact qdisc. This causes the ingress qdisc
           created in step 2 to be removed, thus freeing the previously linked
           tcx_entry:
      
           rtnetlink_rcv_msg()
             => tc_modify_qdisc()
               => qdisc_create()
                 => clsact_init() [a]
               => qdisc_graft()
                 => qdisc_destroy()
                   => __qdisc_destroy()
                     => ingress_destroy() [b]
                       => tcx_entry_free()
                         => kfree_rcu() // tcx_entry freed
      
        5. Finally, the network namespace is closed. This registers the
           cleanup_net worker, and during the process of releasing the
           remaining clsact qdisc, it accesses the tcx_entry that was
           already freed in step 4, causing the UAF to occur:
      
           cleanup_net()
             => ops_exit_list()
               => default_device_exit_batch()
                 => unregister_netdevice_many()
                   => unregister_netdevice_many_notify()
                     => dev_shutdown()
                       => qdisc_put()
                         => clsact_destroy() [c]
                           => tcf_block_put_ext()
                             => tcf_chain0_head_change_cb_del()
                               => tcf_chain_head_change_item()
                                 => clsact_chain_head_change()
                                   => mini_qdisc_pair_swap() // UAF
      
      There are also other variants, the gist is to add an ingress (or clsact)
      qdisc with a specific shared block, then to replace that qdisc, waiting
      for the tcx_entry kfree_rcu() to be executed and subsequently accessing
      the current active qdisc's miniq one way or another.
      
      The correct fix is to turn the miniq_active boolean into a counter. As
      can be observed, at step 2 above the counter transitions from 0->1, at
      step [a] from 1->2 (in order for the miniq object to remain active during
      the replacement), then in [b] from 2->1 and finally [c] 1->0 with the
      eventual release. The reference counter in general ranges from [0,2] and
      it does not need to be atomic since all access to the counter is protected
      by the rtnl mutex. With this in place, there is no longer a UAF happening
      and the tcx_entry is freed at the correct time.
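
The counter transitions above can be sketched with a plain integer. `sketch_entry`, `sketch_miniq_inc` and `sketch_miniq_dec` are hypothetical names, and the `freed` flag stands in for the actual release of the tcx_entry. No atomics are used, matching the statement that all accesses are serialized by the rtnl mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for tcx_entry. */
struct sketch_entry {
	int miniq_active; /* counter, formerly a boolean */
	bool freed;
};

/* Take one reference on the miniq (a qdisc starts using the entry). */
static void sketch_miniq_inc(struct sketch_entry *e)
{
	e->miniq_active++;
}

/* Drop one reference; the entry is released only when none remain. */
static void sketch_miniq_dec(struct sketch_entry *e)
{
	if (--e->miniq_active == 0)
		e->freed = true; /* stands in for freeing the tcx_entry */
}
```

Replaying the sequence gives 0->1 (step 2), 1->2 at [a], 2->1 at [b] with the entry still alive, and 1->0 at [c] where it is finally released.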
      
      Fixes: e420bed0 ("bpf: Add fd-based tcx multi-prog infra with link support")
      Reported-by: Pedro Pinto <xten@osec.io>
      Co-developed-by: Pedro Pinto <xten@osec.io>
      Signed-off-by: Pedro Pinto <xten@osec.io>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Hyunwoo Kim <v4bel@theori.io>
      Cc: Wongi Lee <qwerty@theori.io>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240708133130.11609-1-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      1cb6f0ba
    • Chris Packham's avatar
      docs: networking: devlink: capitalise length value · 83c36e7c
      Chris Packham authored
      Correct the example to match the help text from the devlink utility.
      Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 Jul, 2024 7 commits
    • Neal Cardwell's avatar
      tcp: fix incorrect undo caused by DSACK of TLP retransmit · 0ec986ed
      Neal Cardwell authored
      Loss recovery undo_retrans bookkeeping had a long-standing bug where a
      DSACK from a spurious TLP retransmit packet could cause an erroneous
      undo of a fast recovery or RTO recovery that repaired a single
      really-lost packet (in a sequence range outside that of the TLP
      retransmit). Basically, because the loss recovery state machine didn't
      account for the fact that it sent a TLP retransmit, the DSACK for the
      TLP retransmit could erroneously be implicitly interpreted as
      corresponding to the normal fast recovery or RTO recovery retransmit
      that plugged a real hole, thus resulting in an improper undo.
      
      For example, consider the following buggy scenario where there is a
      real packet loss but the congestion control response is improperly
      undone because of this bug:
      
      + send packets P1, P2, P3, P4
      + P1 is really lost
      + send TLP retransmit of P4
      + receive SACK for original P2, P3, P4
      + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1
      + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!)
      + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)
      
      The fix: when we initialize undo machinery in tcp_init_undo(), if
      there is a TLP retransmit in flight, then increment tp->undo_retrans
      so that we make sure that we receive a DSACK corresponding to the TLP
      retransmit, as well as DSACKs for all later normal retransmits, before
      triggering a loss recovery undo. Note that we also have to move the
      line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO
      we remember the tp->tlp_high_seq value until tcp_init_undo() and clear
      it only afterward.
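
      The effect of the extra increment can be replayed with a small
      userspace model of the scenario listed above. This is a simplified
      sketch, not the kernel code: struct undo_model and its helpers are
      hypothetical, and the counter handling is reduced to "increment per
      retransmit, decrement per DSACK, undo when it reaches zero".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the undo bookkeeping. */
struct undo_model {
	int undo_retrans;   /* retransmits not yet matched by a DSACK */
	bool tlp_in_flight; /* a TLP retransmit has been sent */
	bool undone;        /* congestion response was reverted */
};

/* Called when loss recovery starts. */
static void init_undo(struct undo_model *m, bool with_fix)
{
	m->undo_retrans = 0;
	m->undone = false;
	/* The fix: also wait for the DSACK of the in-flight TLP retransmit. */
	if (with_fix && m->tlp_in_flight)
		m->undo_retrans++;
}

static void on_retransmit(struct undo_model *m)
{
	m->undo_retrans++;
}

static void on_dsack(struct undo_model *m)
{
	if (m->undo_retrans > 0 && --m->undo_retrans == 0)
		m->undone = true;
}

/* Replays the scenario; returns true if recovery was undone. */
static bool replay_scenario(bool with_fix)
{
	struct undo_model m = { 0, false, false };

	m.tlp_in_flight = true;  /* TLP retransmit of P4 */
	init_undo(&m, with_fix); /* enter fast recovery */
	on_retransmit(&m);       /* fast-retransmit P1 */
	on_dsack(&m);            /* DSACK for TLP P4 */
	return m.undone;
}
```

      In this model, replay_scenario(false) reports an undo after the single
      DSACK for the TLP retransmit, while replay_scenario(true) does not,
      because one more DSACK is still outstanding.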
      
      Also note that the bug dates back to the original 2013 TLP
      implementation, commit 6ba8a3b1 ("tcp: Tail loss probe (TLP)").
      
      However, this patch will only compile and work correctly with kernels
      that have tp->tlp_retrans, which was added only in v5.8 in 2020 in
      commit 76be93fc ("tcp: allow at most one TLP probe per flight").
      So we associate this fix with that later commit.
      
      Fixes: 76be93fc ("tcp: allow at most one TLP probe per flight")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Kevin Yang <yyd@google.com>
      Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Jakub Kicinski's avatar
      Merge branch 'wireguard-fixes-for-6-10-rc7' · 842c361b
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 6.10-rc7
      
      These are four small fixes for WireGuard, which are all marked for
      stable:
      
      1) A QEMU command line fix to remove deprecated flags.
      
      2) Use of proper unaligned helpers to avoid unaligned memory access on
         some systems, from Helge.
      
      3) Two patches to annotate intentional data races, so KCSAN and syzbot
         don't get upset.
      ====================
      
      Link: https://patch.msgid.link/20240704154517.1572127-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Jason A. Donenfeld's avatar
      wireguard: send: annotate intentional data race in checking empty queue · 381a7d45
      Jason A. Donenfeld authored
      KCSAN reports a race in wg_packet_send_keepalive, which is intentional:
      
          BUG: KCSAN: data-race in wg_packet_send_keepalive / wg_packet_send_staged_packets
      
          write to 0xffff88814cd91280 of 8 bytes by task 3194 on cpu 0:
           __skb_queue_head_init include/linux/skbuff.h:2162 [inline]
           skb_queue_splice_init include/linux/skbuff.h:2248 [inline]
           wg_packet_send_staged_packets+0xe5/0xad0 drivers/net/wireguard/send.c:351
           wg_xmit+0x5b8/0x660 drivers/net/wireguard/device.c:218
           __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
           netdev_start_xmit include/linux/netdevice.h:4954 [inline]
           xmit_one net/core/dev.c:3548 [inline]
           dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3564
           __dev_queue_xmit+0xeff/0x1d80 net/core/dev.c:4349
           dev_queue_xmit include/linux/netdevice.h:3134 [inline]
           neigh_connected_output+0x231/0x2a0 net/core/neighbour.c:1592
           neigh_output include/net/neighbour.h:542 [inline]
           ip6_finish_output2+0xa66/0xce0 net/ipv6/ip6_output.c:137
           ip6_finish_output+0x1a5/0x490 net/ipv6/ip6_output.c:222
           NF_HOOK_COND include/linux/netfilter.h:303 [inline]
           ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:243
           dst_output include/net/dst.h:451 [inline]
           NF_HOOK include/linux/netfilter.h:314 [inline]
           ndisc_send_skb+0x4a2/0x670 net/ipv6/ndisc.c:509
           ndisc_send_rs+0x3ab/0x3e0 net/ipv6/ndisc.c:719
           addrconf_dad_completed+0x640/0x8e0 net/ipv6/addrconf.c:4295
           addrconf_dad_work+0x891/0xbc0
           process_one_work kernel/workqueue.c:2633 [inline]
           process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
           worker_thread+0x525/0x730 kernel/workqueue.c:2787
           kthread+0x1d7/0x210 kernel/kthread.c:388
           ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
          read to 0xffff88814cd91280 of 8 bytes by task 3202 on cpu 1:
           skb_queue_empty include/linux/skbuff.h:1798 [inline]
           wg_packet_send_keepalive+0x20/0x100 drivers/net/wireguard/send.c:225
           wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
           wg_packet_handshake_receive_worker+0x445/0x5e0 drivers/net/wireguard/receive.c:213
           process_one_work kernel/workqueue.c:2633 [inline]
           process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
           worker_thread+0x525/0x730 kernel/workqueue.c:2787
           kthread+0x1d7/0x210 kernel/kthread.c:388
           ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
          value changed: 0xffff888148fef200 -> 0xffff88814cd91280
      
      Mark this race as intentional by replacing skb_queue_empty() with
      skb_queue_empty_lockless(), which uses READ_ONCE() internally to
      annotate the race.
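
      The difference between the two checks can be sketched in userspace C.
      The names below (struct queue_head, queue_empty(),
      queue_empty_lockless()) are hypothetical, and the volatile load is only
      an approximation of what READ_ONCE() provides in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical circular list head, loosely modelled on a packet queue. */
struct queue_head {
	struct queue_head *next;
	struct queue_head *prev;
};

static void queue_init(struct queue_head *q)
{
	q->next = q;
	q->prev = q;
}

/* Insert n right after the head. */
static void queue_add(struct queue_head *q, struct queue_head *n)
{
	n->next = q->next;
	n->prev = q;
	q->next->prev = n;
	q->next = n;
}

/* Plain check: an ordinary load the compiler is free to repeat or merge. */
static bool queue_empty(const struct queue_head *q)
{
	return q->next == q;
}

/* Lockless check: one volatile load of the pointer, which marks the
 * unsynchronized read as deliberate. */
static bool queue_empty_lockless(const struct queue_head *q)
{
	struct queue_head *next = *(struct queue_head *const volatile *)&q->next;

	return next == q;
}
```

      Both functions return the same answer on a quiescent queue; the
      lockless variant only changes how the read is expressed.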
      
      Cc: stable@vger.kernel.org
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://patch.msgid.link/20240704154517.1572127-5-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Jason A. Donenfeld's avatar
      wireguard: queueing: annotate intentional data race in cpu round robin · 2fe3d6d2
      Jason A. Donenfeld authored
      KCSAN reports a race in the CPU round robin function, which, as the
      comment points out, is intentional:
      
          BUG: KCSAN: data-race in wg_packet_send_staged_packets / wg_packet_send_staged_packets
      
          read to 0xffff88811254eb28 of 4 bytes by task 3160 on cpu 1:
           wg_cpumask_next_online drivers/net/wireguard/queueing.h:127 [inline]
           wg_queue_enqueue_per_device_and_peer drivers/net/wireguard/queueing.h:173 [inline]
           wg_packet_create_data drivers/net/wireguard/send.c:320 [inline]
           wg_packet_send_staged_packets+0x60e/0xac0 drivers/net/wireguard/send.c:388
           wg_packet_send_keepalive+0xe2/0x100 drivers/net/wireguard/send.c:239
           wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
           wg_packet_handshake_receive_worker+0x449/0x5f0 drivers/net/wireguard/receive.c:213
           process_one_work kernel/workqueue.c:3248 [inline]
           process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3329
           worker_thread+0x526/0x720 kernel/workqueue.c:3409
           kthread+0x1d1/0x210 kernel/kthread.c:389
           ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
          write to 0xffff88811254eb28 of 4 bytes by task 3158 on cpu 0:
           wg_cpumask_next_online drivers/net/wireguard/queueing.h:130 [inline]
           wg_queue_enqueue_per_device_and_peer drivers/net/wireguard/queueing.h:173 [inline]
           wg_packet_create_data drivers/net/wireguard/send.c:320 [inline]
           wg_packet_send_staged_packets+0x6e5/0xac0 drivers/net/wireguard/send.c:388
           wg_packet_send_keepalive+0xe2/0x100 drivers/net/wireguard/send.c:239
           wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
           wg_packet_handshake_receive_worker+0x449/0x5f0 drivers/net/wireguard/receive.c:213
           process_one_work kernel/workqueue.c:3248 [inline]
           process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3329
           worker_thread+0x526/0x720 kernel/workqueue.c:3409
           kthread+0x1d1/0x210 kernel/kthread.c:389
           ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
          value changed: 0xffffffff -> 0x00000000
      
      Mark this race as intentional by using READ/WRITE_ONCE().
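
      A round-robin picker with marked accesses might look like the following
      userspace sketch. READ_ONCE_INT(), WRITE_ONCE_INT() and next_slot() are
      hypothetical stand-ins written for int only; they are not the kernel
      macros, and slots here are plain indices rather than online CPUs.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel macros, for int objects only. */
#define READ_ONCE_INT(x)     (*(const volatile int *)&(x))
#define WRITE_ONCE_INT(x, v) (*(volatile int *)&(x) = (v))

/* Pick the slot after *last, wrapping to 0, and remember the choice.
 * Concurrent callers may occasionally pick the same slot; that is
 * tolerated because the counter only spreads load. */
static int next_slot(int *last, int nr_slots)
{
	int slot = READ_ONCE_INT(*last) + 1;

	if (slot >= nr_slots)
		slot = 0;
	WRITE_ONCE_INT(*last, slot);
	return slot;
}
```

      Starting from -1, as the 0xffffffff -> 0x00000000 transition in the
      report suggests, the first call yields slot 0.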
      
      Cc: stable@vger.kernel.org
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://patch.msgid.link/20240704154517.1572127-4-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Helge Deller's avatar
      wireguard: allowedips: avoid unaligned 64-bit memory accesses · 948f991c
      Helge Deller authored
      On the parisc platform, the kernel issues warnings because
      swap_endian() tries to load a 128-bit IPv6 address from an unaligned
      memory location:
      
       Kernel: unaligned access to 0x55f4688c in wg_allowedips_insert_v6+0x2c/0x80 [wireguard] (iir 0xf3010df)
       Kernel: unaligned access to 0x55f46884 in wg_allowedips_insert_v6+0x38/0x80 [wireguard] (iir 0xf2010dc)
      
      Avoid such unaligned memory accesses by instead using the
      get_unaligned_be64() helper macro.
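
      The idea behind an unaligned big-endian load can be sketched in
      userspace C by assembling the value byte by byte. load_be64_unaligned()
      and load_ipv6_halves() are hypothetical names for this illustration and
      are not the kernel helper itself.

```c
#include <assert.h>
#include <stdint.h>

/* Read a big-endian 64-bit value from any address using byte loads only,
 * so no alignment requirement applies. */
static uint64_t load_be64_unaligned(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Split a 128-bit address in network byte order into two host-order
 * 64-bit halves, reading the second half from src + 8. */
static void load_ipv6_halves(const uint8_t *src, uint64_t out[2])
{
	out[0] = load_be64_unaligned(src);
	out[1] = load_be64_unaligned(src + 8);
}
```

      The source pointer can be at any byte offset, since the sketch never
      dereferences it as a wider type.
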
      Signed-off-by: Helge Deller <deller@gmx.de>
      [Jason: replace src[8] in original patch with src+8]
      Cc: stable@vger.kernel.org
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://patch.msgid.link/20240704154517.1572127-3-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use acpi=off instead of -no-acpi for recent QEMU · 2cb489eb
      Jason A. Donenfeld authored
      QEMU 9.0 removed -no-acpi, in favor of machine properties, so update the
      Makefile to use the correct QEMU invocation.
      
      Cc: stable@vger.kernel.org
      Fixes: b83fdcd9 ("wireguard: selftests: use microvm on x86")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://patch.msgid.link/20240704154517.1572127-2-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Dan Carpenter's avatar
      net: bcmasp: Fix error code in probe() · 0c754d9d
      Dan Carpenter authored
      Return an error code if bcmasp_interface_create() fails.  Don't return
      success.
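
      The class of bug being fixed can be illustrated with a generic probe
      routine that shares one unwind path. Everything below is hypothetical:
      interface_create() and probe_sketch() are made-up names, and the choice
      of -ENOMEM is an assumption for the sketch rather than a statement
      about the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical creation helper: returns NULL on failure. */
static void *interface_create(int fail)
{
	static int dummy;

	return fail ? NULL : &dummy;
}

static int probe_sketch(int fail_create)
{
	int ret = 0; /* earlier setup steps succeeded and left ret at 0 */
	void *intf;

	intf = interface_create(fail_create);
	if (!intf) {
		/* Without this assignment ret would still be 0 and the
		 * caller would see success despite the failure. */
		ret = -ENOMEM;
		goto cleanup;
	}

cleanup:
	/* shared unwind path */
	return ret;
}
```

      The pitfall is specific to helpers that signal failure with NULL
      instead of an error code, since there is nothing to propagate.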
      
      Fixes: 490cb412 ("net: bcmasp: Add support for ASP2.0 Ethernet controller")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Justin Chen <justin.chen@broadcom.com>
      Link: https://patch.msgid.link/ZoWKBkHH9D1fqV4r@stanley.mountain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 05 Jul, 2024 1 commit
  6. 04 Jul, 2024 16 commits