1. 07 Mar, 2024 5 commits
  2. 06 Mar, 2024 11 commits
    • Edward Adam Davis's avatar
      net/rds: fix WARNING in rds_conn_connect_if_down · c055fc00
      Edward Adam Davis authored
      If the connection isn't established yet, get_mr() will fail, so trigger the
      connection after get_mr().
      
      Fixes: 584a8279 ("RDS: RDMA: return appropriate error on rdma map failures")
      Reported-and-tested-by: syzbot+d4faee732755bba9838e@syzkaller.appspotmail.com
      Signed-off-by: Edward Adam Davis <eadavis@qq.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c055fc00
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · f287d6aa
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-03-05 (idpf, ice, i40e, igc, e1000e)
      
      This series contains updates to idpf, ice, i40e, igc and e1000e drivers.
      
      Emil disables local BH on NAPI schedule for proper handling of softirqs
      on idpf.
      
      Jake stops reporting of virtchannel RSS option which is unsupported on
      ice.
      
      Rand Deeb adds null check to prevent possible null pointer dereference
      on ice.
      
      Michal Schmidt moves DPLL mutex initialization to resolve uninitialized
      mutex usage for ice.
      
      Jesse fixes incorrect variable usage for calculating Tx stats on ice.
      
      Ivan Vecera corrects logic for firmware equals check on i40e.
      
      Florian Kauer prevents memory corruption for XDP_REDIRECT on igc.
      
      Sasha reverts an incorrect use of FIELD_GET which caused a regression
      for Wake on LAN on e1000e.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f287d6aa
    • Steffen Klassert's avatar
      Merge branch 'Improve packet offload for dual stack' · 2ce0eae6
      Steffen Klassert authored
      Mike Yu says:
      ====================
      In the XFRM stack, whether a packet is forwarded to the IPv4
      or IPv6 stack depends on the family field of the matched SA.
      This does not completely work for IPsec packet offload in some
      scenarios, for example, sending an IPv6 packet that will be
      encrypted and encapsulated as an IPv4 packet in HW.
      
      Here are the patches to make IPsec packet offload work in the
      mentioned scenario.
      ====================
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      2ce0eae6
    • Tobias Jakobi (Compleo)'s avatar
      net: dsa: microchip: fix register write order in ksz8_ind_write8() · b7fb7729
      Tobias Jakobi (Compleo) authored
      This bug was noticed while re-implementing parts of the kernel
      driver in userspace using spidev. The goal was to enable some
      of the errata workarounds that Microchip describes in their
      errata sheet [1].
      
      Both the errata sheet and the regular datasheet of e.g. the KSZ8795
      imply that you need to do this for indirect register accesses:
      - write a 16-bit value to a control register pair (this value
        consists of the indirect register table, and the offset inside
        the table)
      - either read or write an 8-bit value from the data storage
        register (indicated by REG_IND_BYTE in the kernel)
      
      The current implementation has the order swapped. It can be
      proven, by reading back some indirect register with known content
      (the EEE register modified in ksz8_handle_global_errata() is one of
      these), that this implementation does not work.
      
      Private discussion with Oleksij Rempel of Pengutronix has revealed
      that the workaround was apparently never tested on actual hardware.
      
      [1] https://ww1.microchip.com/downloads/aemDocuments/documents/OTH/ProductDocuments/Errata/KSZ87xx-Errata-DS80000687C.pdf
      
      Signed-off-by: Tobias Jakobi (Compleo) <tobias.jakobi.compleo@gmail.com>
      Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Fixes: 7b6e6235 ("net: dsa: microchip: ksz8795: handle eee specif erratum")
      Link: https://lore.kernel.org/r/20240304154135.161332-1-tobias.jakobi.compleo@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b7fb7729
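
The ordering described in the ksz8_ind_write8() commit above can be illustrated with a toy model. This is only a sketch: `toy_switch`, its fields, and the `toy_*` helpers are hypothetical names invented here, and the model simply assumes that a write to the data register lands in whichever slot the control pair currently selects. It is not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TABLE_SIZE 256u   /* hypothetical table size for the model */

/* Toy model of a device with indirect register access. */
struct toy_switch {
    uint16_t ctrl;                   /* last value written to the control pair */
    uint8_t  table[TOY_TABLE_SIZE];  /* indirect storage selected by ctrl */
};

/* Writing the control pair latches the table/offset selector. */
static void toy_write_ctrl(struct toy_switch *sw, uint16_t ctrl)
{
    assert(sw);
    sw->ctrl = ctrl;
}

/* Writing the data register stores into the currently selected slot. */
static void toy_write_data(struct toy_switch *sw, uint8_t data)
{
    assert(sw);
    sw->table[sw->ctrl % TOY_TABLE_SIZE] = data;
}

/* Order described in the commit message: selector first, then data. */
static void toy_ind_write8(struct toy_switch *sw, uint16_t ctrl, uint8_t data)
{
    toy_write_ctrl(sw, ctrl);
    toy_write_data(sw, data);
}

/* Swapped order: the data lands in whatever slot was selected before. */
static void toy_ind_write8_swapped(struct toy_switch *sw, uint16_t ctrl,
                                   uint8_t data)
{
    toy_write_data(sw, data);
    toy_write_ctrl(sw, ctrl);
}
```

In this model the swapped variant leaves the intended slot untouched, which matches the read-back check the commit message uses as evidence.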
    • Jakub Kicinski's avatar
      dpll: move all dpll<>netdev helpers to dpll code · 289e9225
      Jakub Kicinski authored
      Older versions of GCC really want to know the full definition
      of the type involved in rcu_assign_pointer().
      
      struct dpll_pin is defined in a local header, net/core can't
      reach it. Move all the netdev <> dpll code into dpll, where
      the type is known. Otherwise we'd need multiple function calls
      to jump between the compilation units.
      
      This is the same problem the commit under fixes was trying to address,
      but with rcu_assign_pointer() not rcu_dereference().
      
      Some of the exports are not needed, networking core can't
      be a module, we only need exports for the helpers used by
      drivers.
      
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
      Fixes: 640f41ed ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      289e9225
    • Toke Høiland-Jørgensen's avatar
      cpumap: Zero-initialise xdp_rxq_info struct before running XDP program · 2487007a
      Toke Høiland-Jørgensen authored
      When running an XDP program that is attached to a cpumap entry, we don't
      initialise the xdp_rxq_info data structure being used in the xdp_buff
      that backs the XDP program invocation. Tobias noticed that this leads to
      random values being returned as the xdp_md->rx_queue_index value for XDP
      programs running in a cpumap.
      
      This means we're basically returning the contents of the uninitialised
      memory, which is bad. Fix this by zero-initialising the rxq data
      structure before running the XDP program.
      
      Fixes: 92164774 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
      Reported-by: Tobias Böhm <tobias@aibor.de>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20240305213132.11955-1-toke@redhat.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      2487007a
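
A minimal sketch of the zero-initialisation idea from the cpumap commit above. The `toy_rxq_info` and `toy_xdp_buff` types and the `toy_*` helpers are hypothetical, heavily simplified stand-ins rather than the kernel's definitions; the point is only that clearing the structure before handing it to the program makes the observed queue index well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_rxq_info {
    uint32_t queue_index;
};

struct toy_xdp_buff {
    struct toy_rxq_info *rxq;
};

/* What a program would observe as its rx_queue_index in this model. */
static uint32_t toy_rx_queue_index(const struct toy_xdp_buff *xdp)
{
    assert(xdp && xdp->rxq);
    return xdp->rxq->queue_index;
}

/* Prepare the per-invocation context. The rxq storage is cleared first so
 * that no stale bytes from earlier use can leak into the program's view. */
static void toy_prepare(struct toy_xdp_buff *xdp, struct toy_rxq_info *rxq)
{
    memset(rxq, 0, sizeof(*rxq));
    xdp->rxq = rxq;
}
```

For a structure declared on the stack, an empty or `{0}` initialiser at the declaration achieves the same effect as the `memset()` used here.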
    • Daniel Borkmann's avatar
      selftests/bpf: Fix up xdp bonding test wrt feature flags · 0bfc0336
      Daniel Borkmann authored
      Adjust the XDP feature flags for the bond device when no bond slave
      devices are attached. After 9b0ed890 ("bonding: do not report
      NETDEV_XDP_ACT_XSK_ZEROCOPY"), the empty bond device must report 0
      as flags instead of NETDEV_XDP_ACT_MASK.
      
        # ./vmtest.sh -- ./test_progs -t xdp_bond
        [...]
        [    3.983311] bond1 (unregistering): (slave veth1_1): Releasing backup interface
        [    3.995434] bond1 (unregistering): Released all slaves
        [    4.022311] bond2: (slave veth2_1): Releasing backup interface
        #507/1   xdp_bonding/xdp_bonding_attach:OK
        #507/2   xdp_bonding/xdp_bonding_nested:OK
        #507/3   xdp_bonding/xdp_bonding_features:OK
        #507/4   xdp_bonding/xdp_bonding_roundrobin:OK
        #507/5   xdp_bonding/xdp_bonding_activebackup:OK
        #507/6   xdp_bonding/xdp_bonding_xor_layer2:OK
        #507/7   xdp_bonding/xdp_bonding_xor_layer23:OK
        #507/8   xdp_bonding/xdp_bonding_xor_layer34:OK
        #507/9   xdp_bonding/xdp_bonding_redirect_multi:OK
        #507     xdp_bonding:OK
        Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED
        [    4.185255] bond2 (unregistering): Released all slaves
        [...]
      
      Fixes: 9b0ed890 ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Message-ID: <20240305090829.17131-2-daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0bfc0336
    • Daniel Borkmann's avatar
      xdp, bonding: Fix feature flags when there are no slave devs anymore · f267f262
      Daniel Borkmann authored
      Commit 9b0ed890 ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY")
      changed the driver from reporting everything as supported before a device
      was bonded into having the driver report that no XDP feature is supported
      until a real device is bonded as it seems to be more truthful given
      eventually real underlying devices decide what XDP features are supported.
      
      The change however did not take into account when all slave devices get
      removed from the bond device. In this case after 9b0ed890, the driver
      keeps reporting a feature mask of 0x77, that is, NETDEV_XDP_ACT_MASK &
      ~NETDEV_XDP_ACT_XSK_ZEROCOPY whereas it should have reported a feature
      mask of 0.
      
      Fix it by resetting XDP feature flags in the same way as if no XDP program
      is attached to the bond device. This was uncovered by the XDP bond selftest
      which caused BPF CI to fail. After adjusting the starting masks on the latter
      to 0 instead of NETDEV_XDP_ACT_MASK the test passes again together with
      this fix.
      
      Fixes: 9b0ed890 ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Prashant Batra <prbatra.mail@gmail.com>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Message-ID: <20240305090829.17131-1-daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f267f262
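
The mask arithmetic in the bonding commit above can be checked with a small sketch. The `TOY_*` constants are illustrative values picked so that they agree with the 0x77 figure quoted in the commit message, and `toy_bond_xdp_features()` is a hypothetical helper, not the driver's function. It assumes the bond reports the features common to all slaves, minus zero-copy, and reports 0 when no slave is attached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants, chosen to agree with the 0x77 mask quoted above. */
#define TOY_XDP_ACT_XSK_ZEROCOPY 0x08u
#define TOY_XDP_ACT_MASK         0x7fu

/* Feature mask a bond-like device would report for its current slaves. */
static uint32_t toy_bond_xdp_features(const uint32_t *slave_features, size_t n)
{
    uint32_t val = TOY_XDP_ACT_MASK;
    size_t i;

    assert(n == 0 || slave_features != NULL);

    /* No slaves left (or none attached yet): nothing is supported. */
    if (n == 0)
        return 0;

    /* Otherwise only the features common to all slaves survive. */
    for (i = 0; i < n; i++)
        val &= slave_features[i];

    /* Zero-copy AF_XDP is not advertised by the bond in this model. */
    return val & ~TOY_XDP_ACT_XSK_ZEROCOPY;
}
```

With these values, `TOY_XDP_ACT_MASK & ~TOY_XDP_ACT_XSK_ZEROCOPY` evaluates to 0x77, the stale mask the commit message describes, while the empty-bond case returns 0.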
    • Alexei Starovoitov's avatar
      Merge branch 'check-bpf_func_state-callback_depth-when-pruning-states' · 399eca1b
      Alexei Starovoitov authored
      Eduard Zingerman says:
      
      ====================
      check bpf_func_state->callback_depth when pruning states
      
      This patch-set fixes a bug in the state pruning logic hit in the mailing
      list discussion [0]. The details of the fix are in patch #1.
      
      The main idea for the fix belongs to Yonghong Song;
      my contribution is merely in review and test cases.
      
      There are some changes in verification performance:
      
      File                       Program        Insns    (DIFF)  States  (DIFF)
      -------------------------  -------------  ---------------  --------------
      pyperf600_bpf_loop.bpf.o   on_event          +15 (+0.42%)     +0 (+0.00%)
      strobemeta_bpf_loop.bpf.o  on_event        +857 (+37.95%)   +60 (+38.96%)
      xdp_synproxy_kern.bpf.o    syncookie_tc   +2892 (+30.39%)  +109 (+36.33%)
      xdp_synproxy_kern.bpf.o    syncookie_xdp  +2892 (+30.01%)  +109 (+36.09%)
      
      (when tested on a subset of selftests identified by
       selftests/bpf/veristat.cfg and Cilium bpf object files from [4])
      
      Changelog:
      v2 [2] -> v3:
      - fixes for verifier.c commit message as suggested by Yonghong;
      - patch-set re-routed to 'bpf' tree as suggested in [2];
      - patch for test_tcp_custom_syncookie is sent separately to 'bpf-next' [3].
      - veristat results updated using 'bpf' tree as baseline and clang 16.
      
      v1 [1] -> v2:
      - patch #2 commit message updated to better reflect verifier behavior
        with regards to checkpoints tree (suggested by Yonghong);
      - veristat results added (suggested by Andrii).
      
      [0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
      [1] https://lore.kernel.org/bpf/20240212143832.28838-1-eddyz87@gmail.com/
      [2] https://lore.kernel.org/bpf/20240216150334.31937-1-eddyz87@gmail.com/
      [3] https://lore.kernel.org/bpf/20240222150300.14909-1-eddyz87@gmail.com/
      [4] https://github.com/anakryiko/cilium
      ====================
      
      Link: https://lore.kernel.org/r/20240222154121.6991-1-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      399eca1b
    • Eduard Zingerman's avatar
      selftests/bpf: test case for callback_depth states pruning logic · 5c2bc5e2
      Eduard Zingerman authored
      The test case was minimized from mailing list discussion [0].
      It is equivalent to the following C program:
      
          struct iter_limit_bug_ctx { __u64 a; __u64 b; __u64 c; };
      
          static __naked void iter_limit_bug_cb(void)
          {
          	switch (bpf_get_prandom_u32()) {
          	case 1:  ctx->a = 42; break;
          	case 2:  ctx->b = 42; break;
          	default: ctx->c = 42; break;
          	}
          }
      
          int iter_limit_bug(struct __sk_buff *skb)
          {
          	struct iter_limit_bug_ctx ctx = { 7, 7, 7 };
      
          	bpf_loop(2, iter_limit_bug_cb, &ctx, 0);
          	if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
          	  asm volatile("r1 /= 0;":::"r1");
          	return 0;
          }
      
      The main idea is that each loop iteration changes one of the state
      variables in a non-deterministic manner. Hence it is premature to
      prune the states that have two iterations left comparing them to
      states with one iteration left.
      E.g. {{7,7,7}, callback_depth=0} can reach state {42,42,7},
      while {{7,7,7}, callback_depth=1} can't.
      
      [0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
      
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240222154121.6991-3-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5c2bc5e2
    • Eduard Zingerman's avatar
      bpf: check bpf_func_state->callback_depth when pruning states · e9a8e5a5
      Eduard Zingerman authored
      When comparing current and cached states, the verifier should consider
      bpf_func_state->callback_depth. The current state cannot be pruned against
      a cached state when the current state has more iterations left than the
      cached state. The current state has more iterations left when its
      callback_depth is smaller.
      
      Below is an example illustrating this bug, minimized from mailing list
      discussion [0] (assume that BPF_F_TEST_STATE_FREQ is set).
      The example is not a safe program: if loop_cb point (1) is followed by
      loop_cb point (2), then division by zero is possible at point (4).
      
          struct ctx {
          	__u64 a;
          	__u64 b;
          	__u64 c;
          };
      
          static int loop_cb(int i, struct ctx *ctx)
          {
          	/* assume that generated code is "fallthrough-first":
          	 * if ... == 1 goto
          	 * if ... == 2 goto
          	 * <default>
          	 */
          	switch (bpf_get_prandom_u32()) {
          	case 1:  /* 1 */ ctx->a = 42; return 0; break;
          	case 2:  /* 2 */ ctx->b = 42; return 0; break;
          	default: /* 3 */ ctx->c = 42; return 0; break;
          	}
          }
      
          SEC("tc")
          __failure
          __flag(BPF_F_TEST_STATE_FREQ)
          int test(struct __sk_buff *skb)
          {
          	struct ctx ctx = { 7, 7, 7 };
      
          	bpf_loop(2, loop_cb, &ctx, 0);              /* 0 */
          	/* assume generated checks are in-order: .a first */
          	if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
          		asm volatile("r0 /= 0;":::"r0");    /* 4 */
          	return 0;
          }
      
      Prior to this commit verifier built the following checkpoint tree for
      this example:
      
       .------------------------------------- Checkpoint / State name
       |    .-------------------------------- Code point number
       |    |   .---------------------------- Stack state {ctx.a,ctx.b,ctx.c}
       |    |   |        .------------------- Callback depth in frame #0
       v    v   v        v
         - (0) {7P,7P,7},depth=0
           - (3) {7P,7P,7},depth=1
             - (0) {7P,7P,42},depth=1
               - (3) {7P,7,42},depth=2
                 - (0) {7P,7,42},depth=2      loop terminates because of depth limit
                   - (4) {7P,7,42},depth=0    predicted false, ctx.a marked precise
                   - (6) exit
      (a)      - (2) {7P,7,42},depth=2
                 - (0) {7P,42,42},depth=2     loop terminates because of depth limit
                   - (4) {7P,42,42},depth=0   predicted false, ctx.a marked precise
                   - (6) exit
      (b)      - (1) {7P,7P,42},depth=2
                 - (0) {42P,7P,42},depth=2    loop terminates because of depth limit
                   - (4) {42P,7P,42},depth=0  predicted false, ctx.{a,b} marked precise
                   - (6) exit
           - (2) {7P,7,7},depth=1             considered safe, pruned using checkpoint (a)
      (c)  - (1) {7P,7P,7},depth=1            considered safe, pruned using checkpoint (b)
      
      Here checkpoint (b) has callback_depth of 2, meaning that it would
      never reach state {42,42,7}.
      While checkpoint (c) has callback_depth of 1, and thus
      could yet explore the state {42,42,7} if not pruned prematurely.
      This commit forbids such premature pruning,
      allowing verifier to explore states sub-tree starting at (c):
      
      (c)  - (1) {7,7,7P},depth=1
             - (0) {42P,7,7P},depth=1
               ...
               - (2) {42,7,7},depth=2
                 - (0) {42,42,7},depth=2      loop terminates because of depth limit
                   - (4) {42,42,7},depth=0    predicted true, ctx.{a,b,c} marked precise
                     - (5) division by zero
      
      [0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
      
      Fixes: bb124da6 ("bpf: keep track of max number of bpf_loop callback iterations")
      Suggested-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240222154121.6991-2-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e9a8e5a5
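
The pruning rule stated in the commit above (a current state with a smaller callback_depth has more iterations left and must not be pruned against a deeper cached state) can be written as a one-line predicate. This is a sketch under that reading only: `toy_func_state` and `toy_depth_allows_pruning()` are hypothetical names, and the real verifier compares much more than this single field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a per-frame verifier state. */
struct toy_func_state {
    int callback_depth;   /* callback iterations already accounted for */
};

/* Returns true when callback depth alone does not forbid pruning the
 * current state against the cached one. A smaller depth means more
 * iterations are still ahead, so the current state may reach stack
 * contents that the cached state never could. */
static bool toy_depth_allows_pruning(const struct toy_func_state *cached,
                                     const struct toy_func_state *cur)
{
    assert(cached && cur);
    return cur->callback_depth >= cached->callback_depth;
}
```

Applied to the checkpoint tree above, checkpoint (b) has depth 2 and state (c) has depth 1, so the predicate returns false and (c) is explored instead of being pruned.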
  3. 05 Mar, 2024 14 commits
    • Eric Dumazet's avatar
      net/ipv6: avoid possible UAF in ip6_route_mpath_notify() · 685f7d53
      Eric Dumazet authored
      syzbot found another use-after-free in ip6_route_mpath_notify() [1]
      
      Commit f7225172 ("net/ipv6: prevent use after free in
      ip6_route_mpath_notify") was not able to fix the root cause.
      
      We need to defer the fib6_info_release() calls until after
      ip6_route_mpath_notify(), in the cleanup phase.
      
      [1]
      BUG: KASAN: slab-use-after-free in rt6_fill_node+0x1460/0x1ac0
      Read of size 4 at addr ffff88809a07fc64 by task syz-executor.2/23037
      
      CPU: 0 PID: 23037 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-01035-gea7f3cfa #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
        print_address_description mm/kasan/report.c:377 [inline]
        print_report+0x167/0x540 mm/kasan/report.c:488
        kasan_report+0x142/0x180 mm/kasan/report.c:601
       rt6_fill_node+0x1460/0x1ac0
        inet6_rt_notify+0x13b/0x290 net/ipv6/route.c:6184
        ip6_route_mpath_notify net/ipv6/route.c:5198 [inline]
        ip6_route_multipath_add net/ipv6/route.c:5404 [inline]
        inet6_rtm_newroute+0x1d0f/0x2300 net/ipv6/route.c:5517
        rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
        netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
        netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
        netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
        netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x221/0x270 net/socket.c:745
        ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
        ___sys_sendmsg net/socket.c:2638 [inline]
        __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f73dd87dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f73de6550c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f73dd9ac050 RCX: 00007f73dd87dda9
      RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000005
      RBP: 00007f73dd8ca47a R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000006e R14: 00007f73dd9ac050 R15: 00007ffdbdeb7858
       </TASK>
      
      Allocated by task 23037:
        kasan_save_stack mm/kasan/common.c:47 [inline]
        kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
        poison_kmalloc_redzone mm/kasan/common.c:372 [inline]
        __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:389
        kasan_kmalloc include/linux/kasan.h:211 [inline]
        __do_kmalloc_node mm/slub.c:3981 [inline]
        __kmalloc+0x22e/0x490 mm/slub.c:3994
        kmalloc include/linux/slab.h:594 [inline]
        kzalloc include/linux/slab.h:711 [inline]
        fib6_info_alloc+0x2e/0xf0 net/ipv6/ip6_fib.c:155
        ip6_route_info_create+0x445/0x12b0 net/ipv6/route.c:3758
        ip6_route_multipath_add net/ipv6/route.c:5298 [inline]
        inet6_rtm_newroute+0x744/0x2300 net/ipv6/route.c:5517
        rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
        netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
        netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
        netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
        netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x221/0x270 net/socket.c:745
        ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
        ___sys_sendmsg net/socket.c:2638 [inline]
        __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      Freed by task 16:
        kasan_save_stack mm/kasan/common.c:47 [inline]
        kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
        kasan_save_free_info+0x4e/0x60 mm/kasan/generic.c:640
        poison_slab_object+0xa6/0xe0 mm/kasan/common.c:241
        __kasan_slab_free+0x34/0x70 mm/kasan/common.c:257
        kasan_slab_free include/linux/kasan.h:184 [inline]
        slab_free_hook mm/slub.c:2121 [inline]
        slab_free mm/slub.c:4299 [inline]
        kfree+0x14a/0x380 mm/slub.c:4409
        rcu_do_batch kernel/rcu/tree.c:2190 [inline]
        rcu_core+0xd76/0x1810 kernel/rcu/tree.c:2465
        __do_softirq+0x2bb/0x942 kernel/softirq.c:553
      
      Last potentially related work creation:
        kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
        __kasan_record_aux_stack+0xae/0x100 mm/kasan/generic.c:586
        __call_rcu_common kernel/rcu/tree.c:2715 [inline]
        call_rcu+0x167/0xa80 kernel/rcu/tree.c:2829
        fib6_info_release include/net/ip6_fib.h:341 [inline]
        ip6_route_multipath_add net/ipv6/route.c:5344 [inline]
        inet6_rtm_newroute+0x114d/0x2300 net/ipv6/route.c:5517
        rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
        netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
        netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
        netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
        netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x221/0x270 net/socket.c:745
        ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
        ___sys_sendmsg net/socket.c:2638 [inline]
        __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      The buggy address belongs to the object at ffff88809a07fc00
       which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 100 bytes inside of
       freed 512-byte region [ffff88809a07fc00, ffff88809a07fe00)
      
      The buggy address belongs to the physical page:
      page:ffffea0002681f00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9a07c
      head:ffffea0002681f00 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      page_type: 0xffffffff()
      raw: 00fff00000000840 ffff888014c41c80 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 2, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 23028, tgid 23027 (syz-executor.4), ts 2340253595219, free_ts 2339107097036
        set_page_owner include/linux/page_owner.h:31 [inline]
        post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533
        prep_new_page mm/page_alloc.c:1540 [inline]
        get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311
        __alloc_pages+0x255/0x680 mm/page_alloc.c:4567
        __alloc_pages_node include/linux/gfp.h:238 [inline]
        alloc_pages_node include/linux/gfp.h:261 [inline]
        alloc_slab_page+0x5f/0x160 mm/slub.c:2190
        allocate_slab mm/slub.c:2354 [inline]
        new_slab+0x84/0x2f0 mm/slub.c:2407
        ___slab_alloc+0xd17/0x13e0 mm/slub.c:3540
        __slab_alloc mm/slub.c:3625 [inline]
        __slab_alloc_node mm/slub.c:3678 [inline]
        slab_alloc_node mm/slub.c:3850 [inline]
        __do_kmalloc_node mm/slub.c:3980 [inline]
        __kmalloc+0x2e0/0x490 mm/slub.c:3994
        kmalloc include/linux/slab.h:594 [inline]
        kzalloc include/linux/slab.h:711 [inline]
        new_dir fs/proc/proc_sysctl.c:956 [inline]
        get_subdir fs/proc/proc_sysctl.c:1000 [inline]
        sysctl_mkdir_p fs/proc/proc_sysctl.c:1295 [inline]
        __register_sysctl_table+0xb30/0x1440 fs/proc/proc_sysctl.c:1376
        neigh_sysctl_register+0x416/0x500 net/core/neighbour.c:3859
        devinet_sysctl_register+0xaf/0x1f0 net/ipv4/devinet.c:2644
        inetdev_init+0x296/0x4d0 net/ipv4/devinet.c:286
        inetdev_event+0x338/0x15c0 net/ipv4/devinet.c:1555
        notifier_call_chain+0x18f/0x3b0 kernel/notifier.c:93
        call_netdevice_notifiers_extack net/core/dev.c:1987 [inline]
        call_netdevice_notifiers net/core/dev.c:2001 [inline]
        register_netdevice+0x15b2/0x1a20 net/core/dev.c:10340
        br_dev_newlink+0x27/0x100 net/bridge/br_netlink.c:1563
        rtnl_newlink_create net/core/rtnetlink.c:3497 [inline]
        __rtnl_newlink net/core/rtnetlink.c:3717 [inline]
        rtnl_newlink+0x158f/0x20a0 net/core/rtnetlink.c:3730
      page last free pid 11583 tgid 11583 stack trace:
        reset_page_owner include/linux/page_owner.h:24 [inline]
        free_pages_prepare mm/page_alloc.c:1140 [inline]
        free_unref_page_prepare+0x968/0xa90 mm/page_alloc.c:2346
        free_unref_page+0x37/0x3f0 mm/page_alloc.c:2486
        kasan_depopulate_vmalloc_pte+0x74/0x90 mm/kasan/shadow.c:415
        apply_to_pte_range mm/memory.c:2619 [inline]
        apply_to_pmd_range mm/memory.c:2663 [inline]
        apply_to_pud_range mm/memory.c:2699 [inline]
        apply_to_p4d_range mm/memory.c:2735 [inline]
        __apply_to_page_range+0x8ec/0xe40 mm/memory.c:2769
        kasan_release_vmalloc+0x9a/0xb0 mm/kasan/shadow.c:532
        __purge_vmap_area_lazy+0x163f/0x1a10 mm/vmalloc.c:1770
        drain_vmap_area_work+0x40/0xd0 mm/vmalloc.c:1804
        process_one_work kernel/workqueue.c:2633 [inline]
        process_scheduled_works+0x913/0x1420 kernel/workqueue.c:2706
        worker_thread+0xa5f/0x1000 kernel/workqueue.c:2787
        kthread+0x2ef/0x390 kernel/kthread.c:388
        ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
        ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
      
      Memory state around the buggy address:
       ffff88809a07fb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff88809a07fb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88809a07fc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
       ffff88809a07fc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88809a07fd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3b1137fe ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20240303144801.702646-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      685f7d53
    • Sasha Neftin's avatar
      intel: legacy: Partial revert of field get conversion · ba54b1a2
      Sasha Neftin authored
      Refactoring of the field get conversion introduced a regression in the
      legacy Wake On Lan from a magic packet with i219 devices. The Rx address
      was not copied correctly from the MAC to the PHY with the FIELD_GET macro.
      
      Fixes: b9a45254 ("intel: legacy: field get conversion")
      Suggested-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
      Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      ba54b1a2
    • Florian Kauer's avatar
      igc: avoid returning frame twice in XDP_REDIRECT · ef27f655
      Florian Kauer authored
      When a frame can not be transmitted in XDP_REDIRECT
      (e.g. due to a full queue), it is necessary to free
      it by calling xdp_return_frame_rx_napi.
      
      However, this is the responsibility of the caller of
      the ndo_xdp_xmit (see for example bq_xmit_all in
      kernel/bpf/devmap.c) and thus calling it inside
      igc_xdp_xmit (which is the ndo_xdp_xmit of the igc
      driver) as well will lead to memory corruption.
      
      In fact, bq_xmit_all expects that it can return all
      frames after the last successfully transmitted one.
      Therefore, break for the first not transmitted frame,
      but do not call xdp_return_frame_rx_napi in igc_xdp_xmit.
      This is equally implemented in other Intel drivers
      such as the igb.
      
      There are two alternatives to this that were rejected:
      1. Return num_frames as if all the frames had been
         transmitted and release them inside igc_xdp_xmit.
         While it might work technically, it is not what
         the return value is meant to represent (i.e. the
         number of SUCCESSFULLY transmitted packets).
      2. Rework kernel/bpf/devmap.c and all drivers to
         support non-consecutively dropped packets.
         Besides being complex, it likely has a negative
         performance impact without a significant gain
         since it is anyway unlikely that the next frame
         can be transmitted if the previous one was dropped.
      
      The memory corruption can be reproduced with
      the following script which leads to a kernel panic
      after a few seconds.  It basically generates more
      traffic than an i225 NIC can transmit and pushes it
      via XDP_REDIRECT from a virtual interface to the
      physical interface where frames get dropped.
      
         #!/bin/bash
         INTERFACE=enp4s0
         INTERFACE_IDX=`cat /sys/class/net/$INTERFACE/ifindex`
      
         sudo ip link add dev veth1 type veth peer name veth2
         sudo ip link set up $INTERFACE
         sudo ip link set up veth1
         sudo ip link set up veth2
      
         cat << EOF > redirect.bpf.c
         #include <linux/bpf.h>
         #include <bpf/bpf_helpers.h>

         SEC("prog")
         int redirect(struct xdp_md *ctx)
         {
             return bpf_redirect($INTERFACE_IDX, 0);
         }
      
         char _license[] SEC("license") = "GPL";
         EOF
         clang -O2 -g -Wall -target bpf -c redirect.bpf.c -o redirect.bpf.o
         sudo ip link set veth2 xdp obj redirect.bpf.o
      
         cat << EOF > pass.bpf.c
         #include <linux/bpf.h>
         #include <bpf/bpf_helpers.h>

         SEC("prog")
         int pass(struct xdp_md *ctx)
         {
             return XDP_PASS;
         }
      
         char _license[] SEC("license") = "GPL";
         EOF
         clang -O2 -g -Wall -target bpf -c pass.bpf.c -o pass.bpf.o
         sudo ip link set $INTERFACE xdp obj pass.bpf.o
      
         cat << EOF > trafgen.cfg
         #define ETH_P_IP 0x0800

         {
           /* Ethernet Header */
           0xe8, 0x6a, 0x64, 0x41, 0xbf, 0x46,
           0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
           const16(ETH_P_IP),
      
           /* IPv4 Header */
           0b01000101, 0,   # IPv4 version, IHL, TOS
           const16(1028),   # IPv4 total length (UDP length + 20 bytes (IP header))
           const16(2),      # IPv4 ident
           0b01000000, 0,   # IPv4 flags, fragmentation off
           64,              # IPv4 TTL
           17,              # Protocol UDP
           csumip(14, 33),  # IPv4 checksum
           10,  0, 1, 1,    # IP Src - adapt as needed
           10,  0, 1, 2,    # IP Dest - adapt as needed

           /* UDP Header */
           const16(6666),   # UDP Src Port
           const16(6666),   # UDP Dest Port
           const16(1008),   # UDP length (UDP header 8 bytes + payload length)
           csumudp(14, 34), # UDP checksum
      
           /* Payload */
           fill('W', 1000),
         }
         EOF
      
         sudo trafgen -i trafgen.cfg -b3000MB -o veth1 --cpp
      
      Fixes: 4ff32036 ("igc: Add support for XDP_REDIRECT action")
      Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      ef27f655
    • Ivan Vecera's avatar
      i40e: Fix firmware version comparison function · 36c824ca
      Ivan Vecera authored
      The helper i40e_is_fw_ver_eq() compares the given firmware version
      incorrectly: it returns true when the major version of the running
      firmware is greater than the given major version. That is wrong and
      causes a failure when getting the DCB configuration, where this helper
      is used. Fix the check and return true only if the running FW version
      exactly equals the given version.
      
      Reproducer:
      1. Load i40e driver
      2. Check dmesg output
      
      [root@host ~]# modprobe i40e
      [root@host ~]# dmesg | grep 'i40e.*DCB'
      [   74.750642] i40e 0000:02:00.0: Query for DCB configuration failed, err -EIO aq_err I40E_AQ_RC_EINVAL
      [   74.759770] i40e 0000:02:00.0: DCB init failed -5, disabled
      [   74.966550] i40e 0000:02:00.1: Query for DCB configuration failed, err -EIO aq_err I40E_AQ_RC_EINVAL
      [   74.975683] i40e 0000:02:00.1: DCB init failed -5, disabled
      
      Fixes: cf488e13 ("i40e: Add other helpers to check version of running firmware and AQ API")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      36c824ca
    • Jesse Brandeburg's avatar
      ice: fix typo in assignment · 6c5b6ca7
      Jesse Brandeburg authored
      Fix an obviously incorrect assignment, created with a typo or cut-n-paste
      error.
      
      Fixes: 5995ef88 ("ice: realloc VSI stats arrays")
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      6c5b6ca7
    • Michal Schmidt's avatar
      ice: fix uninitialized dplls mutex usage · 9224fc86
      Michal Schmidt authored
      The pf->dplls.lock mutex is initialized too late, after its first use.
      Move it to the top of ice_dpll_init.
      Note that the "err_exit" error path destroys the mutex. And the mutex is
      the last thing destroyed in ice_dpll_deinit.
      This fixes the following warning with CONFIG_DEBUG_MUTEXES:
      
       ice 0000:10:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.36.0
       ice 0000:10:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
       ice 0000:10:00.0: PTP init successful
       ------------[ cut here ]------------
       DEBUG_LOCKS_WARN_ON(lock->magic != lock)
       WARNING: CPU: 0 PID: 410 at kernel/locking/mutex.c:587 __mutex_lock+0x773/0xd40
       Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ice(+) nvme nvme_c>
       CPU: 0 PID: 410 Comm: kworker/0:4 Not tainted 6.8.0-rc5+ #3
       Hardware name: HPE ProLiant DL110 Gen10 Plus/ProLiant DL110 Gen10 Plus, BIOS U56 10/19/2023
       Workqueue: events work_for_cpu_fn
       RIP: 0010:__mutex_lock+0x773/0xd40
       Code: c0 0f 84 1d f9 ff ff 44 8b 35 0d 9c 69 01 45 85 f6 0f 85 0d f9 ff ff 48 c7 c6 12 a2 a9 85 48 c7 c7 12 f1 a>
       RSP: 0018:ff7eb1a3417a7ae0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: ffffffff85ac2bff RDI: 00000000ffffffff
       RBP: ff7eb1a3417a7b80 R08: 0000000000000000 R09: 00000000ffffbfff
       R10: ff7eb1a3417a7978 R11: ff32b80f7fd2e568 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: ff32b7f02c50e0d8
       FS:  0000000000000000(0000) GS:ff32b80efe800000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b5852cc000 CR3: 000000003c43a004 CR4: 0000000000771ef0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        <TASK>
        ? __warn+0x84/0x170
        ? __mutex_lock+0x773/0xd40
        ? report_bug+0x1c7/0x1d0
        ? prb_read_valid+0x1b/0x30
        ? handle_bug+0x42/0x70
        ? exc_invalid_op+0x18/0x70
        ? asm_exc_invalid_op+0x1a/0x20
        ? __mutex_lock+0x773/0xd40
        ? rcu_is_watching+0x11/0x50
        ? __kmalloc_node_track_caller+0x346/0x490
        ? ice_dpll_lock_status_get+0x28/0x50 [ice]
        ? __pfx_ice_dpll_lock_status_get+0x10/0x10 [ice]
        ? ice_dpll_lock_status_get+0x28/0x50 [ice]
        ice_dpll_lock_status_get+0x28/0x50 [ice]
        dpll_device_get_one+0x14f/0x2e0
        dpll_device_event_send+0x7d/0x150
        dpll_device_register+0x124/0x180
        ice_dpll_init_dpll+0x7b/0xd0 [ice]
        ice_dpll_init+0x224/0xa40 [ice]
        ? _dev_info+0x70/0x90
        ice_load+0x468/0x690 [ice]
        ice_probe+0x75b/0xa10 [ice]
        ? _raw_spin_unlock_irqrestore+0x4f/0x80
        ? process_one_work+0x1a3/0x500
        local_pci_probe+0x47/0xa0
        work_for_cpu_fn+0x17/0x30
        process_one_work+0x20d/0x500
        worker_thread+0x1df/0x3e0
        ? __pfx_worker_thread+0x10/0x10
        kthread+0x103/0x140
        ? __pfx_kthread+0x10/0x10
        ret_from_fork+0x31/0x50
        ? __pfx_kthread+0x10/0x10
        ret_from_fork_asm+0x1b/0x30
        </TASK>
       irq event stamp: 125197
       hardirqs last  enabled at (125197): [<ffffffff8416409d>] finish_task_switch.isra.0+0x12d/0x3d0
       hardirqs last disabled at (125196): [<ffffffff85134044>] __schedule+0xea4/0x19f0
       softirqs last  enabled at (105334): [<ffffffff84e1e65a>] napi_get_frags_check+0x1a/0x60
       softirqs last disabled at (105332): [<ffffffff84e1e65a>] napi_get_frags_check+0x1a/0x60
       ---[ end trace 0000000000000000 ]---
      
      Fixes: d7999f5e ("ice: implement dpll interface to control cgu")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      9224fc86
    • Rand Deeb's avatar
      net: ice: Fix potential NULL pointer dereference in ice_bridge_setlink() · 06e456a0
      Rand Deeb authored
      The function ice_bridge_setlink() may encounter a NULL pointer dereference
      if nlmsg_find_attr() returns NULL and br_spec is dereferenced subsequently
      in nla_for_each_nested(). To address this issue, add a check to ensure that
      br_spec is not NULL before proceeding with the nested attribute iteration.
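
      The pattern can be sketched in plain C, assuming hypothetical
      find_attr() and get_attr_value() helpers in place of the netlink
      functions named above. The lookup may return NULL, so the result is
      checked before it is dereferenced. The -1 error value is an arbitrary
      choice for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute record. */
struct attr {
    int type;
    int value;
};

/* Hypothetical lookup: returns NULL when no attribute of that type exists. */
static const struct attr *find_attr(const struct attr *attrs, size_t n,
                                    int type)
{
    assert(attrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (attrs[i].type == type)
            return &attrs[i];
    return NULL;
}

/* Returns the attribute value, or -1 when the attribute is absent,
 * instead of dereferencing a NULL pointer. */
static int get_attr_value(const struct attr *attrs, size_t n, int type)
{
    const struct attr *spec = find_attr(attrs, n, type);

    if (!spec)
        return -1;
    return spec->value;
}
```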
      
      Fixes: b1edc14a ("ice: Implement ice_bridge_getlink and ice_bridge_setlink")
      Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      06e456a0
    • Jacob Keller's avatar
      ice: virtchnl: stop pretending to support RSS over AQ or registers · 2652b99e
      Jacob Keller authored
      The E800 series hardware uses the same iAVF driver as older devices,
      including the virtchnl negotiation scheme.
      
      This negotiation scheme includes a mechanism to determine what type of RSS
      should be supported, including RSS over PF virtchnl messages, RSS over
      firmware AdminQ messages, and RSS via direct register access.
      
      The PF driver will always prefer VIRTCHNL_VF_OFFLOAD_RSS_PF if it's
      supported by the VF driver. However, if an older VF driver is loaded, it
      may request only VIRTCHNL_VF_OFFLOAD_RSS_REG or VIRTCHNL_VF_OFFLOAD_RSS_AQ.
      
      The ice driver happily agrees to support these methods. Unfortunately, the
      underlying hardware does not support these mechanisms. The E800 series VFs
      don't have the appropriate registers for RSS_REG. The mailbox queue used by
      VFs for VF to PF communication blocks messages which do not have the
      VF-to-PF opcode.
      
      Stop lying to the VF that it could support RSS over AdminQ or registers, as
      these interfaces do not work when the hardware is operating on an E800
      series device.
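
      Capability negotiation of this kind can be sketched as a bit mask
      intersection. This is an illustration only: the CAP_* bit values and
      negotiate_caps() are hypothetical and do not match the real
      VIRTCHNL_VF_OFFLOAD_* definitions. The point is that the PF grants a
      flag only when the VF requested it and the hardware can honor it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; the real VIRTCHNL_VF_OFFLOAD_* flags differ. */
#define CAP_RSS_PF  (1u << 0)
#define CAP_RSS_AQ  (1u << 1)
#define CAP_RSS_REG (1u << 2)

/* Grant only what the VF asked for AND the hardware actually supports. */
static uint32_t negotiate_caps(uint32_t requested, uint32_t hw_supported)
{
    uint32_t granted = requested & hw_supported;

    /* Invariant: nothing is advertised that the hardware lacks. */
    assert((granted & ~hw_supported) == 0);
    return granted;
}
```

      With hw_supported set to CAP_RSS_PF only, an old VF asking for
      CAP_RSS_AQ or CAP_RSS_REG is granted nothing and can report a clear
      error instead of using an interface that does not work.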
      
      In practice this is unlikely to be hit by any normal user. The iAVF driver
      has supported RSS over PF virtchnl commands since 2016, and always defaults
      to using RSS_PF if possible.
      
      In principle, nothing actually stops the existing VF from attempting to
      access the registers or send an AQ command. However a properly coded VF
      will check the capability flags and will report a more useful error if it
      detects a case where the driver does not support the RSS offloads that it
      does.
      
      Fixes: 1071a835 ("ice: Implement virtchnl commands for AVF support")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      2652b99e
    • Emil Tantilov's avatar
      idpf: disable local BH when scheduling napi for marker packets · 33006858
      Emil Tantilov authored
      Fix softirqs not being handled during the napi_schedule() call when
      receiving marker packets for queue disable, by disabling local bottom
      halves.
      
      The issue can be seen on ifdown:
      NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
      
      Using ftrace to catch the failing scenario:
      ifconfig   [003] d.... 22739.830624: softirq_raise: vec=3 [action=NET_RX]
      <idle>-0   [003] ..s.. 22739.831357: softirq_entry: vec=3 [action=NET_RX]
      
      No interrupt and CPU is idle.
      
      After the patch when disabling local BH before calling napi_schedule:
      ifconfig   [003] d.... 22993.928336: softirq_raise: vec=3 [action=NET_RX]
      ifconfig   [003] ..s1. 22993.928337: softirq_entry: vec=3 [action=NET_RX]
      
      Fixes: c2d548ca ("idpf: add TX splitq napi poll support")
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      33006858
    • Mike Yu's avatar
      xfrm: set skb control buffer based on packet offload as well · 8688ab21
      Mike Yu authored
      In packet offload, packets are not encrypted in XFRM stack, so
      the next network layer which the packets will be forwarded to
      should depend on where the packet came from (either xfrm4_output
      or xfrm6_output) rather than the matched SA's family type.
      
      Test: verified IPv6-in-IPv4 packets on Android device with
            IPsec packet offload enabled
      Signed-off-by: Mike Yu <yumike@google.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      8688ab21
    • Mike Yu's avatar
      xfrm: fix xfrm child route lookup for packet offload · d4872d70
      Mike Yu authored
      In current code, xfrm_bundle_create() always uses the matched
      SA's family type to look up a xfrm child route for the skb.
      The route returned by xfrm_dst_lookup() will eventually be
      used in xfrm_output_resume() (skb_dst(skb)->ops->local_out()).
      
      If packet offload is used, the above behavior can lead to
      calling ip_local_out() for an IPv6 packet or calling
      ip6_local_out() for an IPv4 packet, which is likely to fail.
      
      This change fixes the behavior by checking if the matched SA
      has packet offload enabled. If not, keep the same behavior;
      if yes, use the matched SP's family type for the lookup.
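
      The selection rule in the last paragraph can be sketched as follows.
      The enum and child_route_family() are hypothetical names for this
      illustration, not the xfrm code: with packet offload the policy (SP)
      family decides, otherwise the SA family is kept.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical address family tags for this sketch. */
enum family { FAMILY_IPV4, FAMILY_IPV6 };

/* Pick the family used for the child route lookup. */
static enum family child_route_family(bool packet_offload,
                                      enum family sa_family,
                                      enum family sp_family)
{
    assert(sa_family == FAMILY_IPV4 || sa_family == FAMILY_IPV6);
    assert(sp_family == FAMILY_IPV4 || sp_family == FAMILY_IPV6);
    return packet_offload ? sp_family : sa_family;
}
```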
      
      Test: verified IPv6-in-IPv4 packets on Android device with
            IPsec packet offload enabled
      Signed-off-by: Mike Yu <yumike@google.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      d4872d70
    • Jakub Kicinski's avatar
      Merge tag 'mlx5-fixes-2024-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 4daa8731
      Jakub Kicinski authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2024-03-01
      
      This series provides bug fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      
      * tag 'mlx5-fixes-2024-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5e: Switch to using _bh variant of of spinlock API in port timestamping NAPI poll context
        net/mlx5e: Use a memory barrier to enforce PTP WQ xmit submission tracking occurs after populating the metadata_map
        net/mlx5e: Fix MACsec state loss upon state update in offload path
        net/mlx5e: Change the warning when ignore_flow_level is not supported
        net/mlx5: Check capability for fw_reset
        net/mlx5: Fix fw reporter diagnose output
        net/mlx5: E-switch, Change flow rule destination checking
        Revert "net/mlx5e: Check the number of elements before walk TC rhashtable"
        Revert "net/mlx5: Block entering switchdev mode with ns inconsistency"
      ====================
      
      Link: https://lore.kernel.org/r/20240302070318.62997-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4daa8731
    • Jakub Kicinski's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 47fe2fc1
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-03-01 (ixgbe, i40e, ice)
      
      This series contains updates to ixgbe, i40e, and ice drivers.
      
      Maciej corrects disable flow for ixgbe, i40e, and ice drivers which could
      cause non-functional interface with AF_XDP.
      
      Michal restores host configuration when changing MSI-X count for VFs on
      ice driver.
      
      * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        ice: reconfig host after changing MSI-X on VF
        ice: reorder disabling IRQ and NAPI in ice_qp_dis
        i40e: disable NAPI right after disabling irqs when handling xsk_pool
        ixgbe: {dis, en}able irqs in ixgbe_txrx_ring_{dis, en}able
      ====================
      
      Link: https://lore.kernel.org/r/20240301192549.2993798-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      47fe2fc1
    • Horatiu Vultur's avatar
      net: sparx5: Fix use after free inside sparx5_del_mact_entry · 89d72d41
      Horatiu Vultur authored
      Based on static analysis of the code, it looks like when an entry
      was removed from the MAC table, the entry was still used after being
      freed. More precisely, the vid of the mac_entry was used after calling
      devm_kfree on the mac_entry.
      The fix is to first use the vid of the mac_entry to delete the
      entry from the HW, and only then free it.
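
      The ordering can be sketched in userspace C. The names here
      (struct mac_entry as trimmed below, hw_forget(), del_entry()) are
      hypothetical stand-ins, and free() replaces devm_kfree for the sake of
      a runnable example. The field is read while the entry is still alive,
      and the memory is released last.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down table entry. */
struct mac_entry {
    unsigned short vid;
};

/* Records the last vid passed to the hypothetical hardware removal. */
static unsigned short last_forgotten_vid;

/* Hypothetical stand-in for removing the entry from the hardware table. */
static void hw_forget(unsigned short vid)
{
    last_forgotten_vid = vid;
}

static void del_entry(struct mac_entry *e)
{
    assert(e != NULL);
    hw_forget(e->vid);  /* use the field while the entry is still valid */
    free(e);            /* only then release the memory */
}
```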
      
      Fixes: b37a1bae ("net: sparx5: add mactable support")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240301080608.3053468-1-horatiu.vultur@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      89d72d41
  4. 04 Mar, 2024 6 commits
    • David S. Miller's avatar
      Merge branch 'mptcp-test-fixes' · 948abb59
      David S. Miller authored
      Matthieu Baerts says:
      
      ====================
      selftests: mptcp: fixes for diag.sh
      
      Here are two patches fixing issues in MPTCP diag.sh kselftest:
      
      - Patch 1 makes sure the exit code is '1' in case of error, and not the
        test ID, so as not to return an exit code that the kselftests
        framework would misinterpret, e.g. '4' means 'skip'.
      
      - Patch 2 avoids waiting for unnecessary conditions, which can cause
        timeouts in some very slow environments.
      ====================
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      948abb59
    • Matthieu Baerts (NGI0)'s avatar
      selftests: mptcp: diag: avoid extra waiting · f05d2283
      Matthieu Baerts (NGI0) authored
      When creating a lot of listener sockets, it is enough to wait only for
      the last one, as diag.sh already does for other subtests.

      If we do a check for each listener socket, each time listing all
      available sockets, it can take a very long time in very slow
      environments, to the point where we can hit a timeout.
      
      When using the debug kconfig, the waiting time drops from more than
      8 sec to 0.1 sec on my side. In slow/busy environments, and with a poll
      timeout set to 30 ms, the waiting time could go up to ~100 sec because
      the listener sockets would time out and stop while the script was still
      checking one by one whether all sockets were ready. The result is that
      after having waited for everything to be ready, all sockets have been
      stopped due to a timeout, and it is too late for the script to check how
      many there were.
      
      While at it, also remove ss options we don't need: we only need the
      filtering options, to count how many listener sockets have been created.
      We don't need to ask ss to display internal TCP information and memory
      usage, since the output is dropped by the 'wc -l' command anyway.
      
      Fixes: b4b51d36 ("selftests: mptcp: explicitly trigger the listener diag code-path")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Closes: https://lore.kernel.org/r/20240301063754.2ecefecf@kernel.org
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f05d2283
    • Geliang Tang's avatar
      selftests: mptcp: diag: return KSFT_FAIL not test_cnt · 45bcc034
      Geliang Tang authored
      The test counter 'test_cnt' should not be returned in diag.sh. For
      example, what if only the 4th test fails? The script will do 'exit 4',
      which is 'exit ${KSFT_SKIP}', and the whole test will be marked as
      skipped instead of 'failed'!
      
      So we should do ret=${KSFT_FAIL} instead.
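
      The same idea, sketched in C rather than shell for illustration.
      exit_status() is a hypothetical helper; the KSFT_* values below mirror
      the kselftest exit codes referred to above (fail is 1, skip is 4).
      Whatever the position of the failing test, the result is KSFT_FAIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Exit codes as used by the kselftest framework. */
enum { KSFT_PASS = 0, KSFT_FAIL = 1, KSFT_SKIP = 4 };

/* Hypothetical helper: map per-test results to one exit status. */
static int exit_status(const bool *failed, size_t ntests)
{
    assert(failed != NULL || ntests == 0);
    for (size_t i = 0; i < ntests; i++)
        if (failed[i])
            return KSFT_FAIL;  /* never the index of the failing test */
    return KSFT_PASS;
}
```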
      
      Fixes: df62f2ec ("selftests/mptcp: add diag interface tests")
      Cc: stable@vger.kernel.org
      Fixes: 42fb6cdd ("selftests: mptcp: more stable diag tests")
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45bcc034
    • Jakub Kicinski's avatar
      page_pool: fix netlink dump stop/resume · 429679dc
      Jakub Kicinski authored
      If the message fills up we need to stop writing. 'break' will
      only get us out of the iteration over pools of a single
      netdev; we also need to stop walking netdevs.

      This results in either an infinite dump or missing pools,
      depending on whether the message fills up on the last
      netdev (infinite dump) or a non-last one (missing pools).
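
      A userspace model of a resumable nested dump, under stated
      assumptions: dump() and its parameters are hypothetical, devs[i] is the
      number of pools on device i, and room is how many records fit in one
      message. When the message is full, the function records the resume
      position and leaves both loops, not only the inner one.

```c
#include <assert.h>
#include <stddef.h>

/* Writes up to `room` records, starting at (*dev_pos, *pool_pos).
 * Returns the number of records written and updates the resume position. */
static size_t dump(const size_t *devs, size_t ndevs, size_t room,
                   size_t *dev_pos, size_t *pool_pos)
{
    size_t written = 0;

    assert(devs && dev_pos && pool_pos);
    for (size_t d = *dev_pos; d < ndevs; d++) {
        /* Resume mid-device only for the device we stopped on. */
        size_t p = (d == *dev_pos) ? *pool_pos : 0;

        for (; p < devs[d]; p++) {
            if (written == room) {
                /* Message full: remember the position, exit BOTH loops. */
                *dev_pos = d;
                *pool_pos = p;
                return written;
            }
            written++;
        }
    }
    *dev_pos = ndevs;   /* everything has been dumped */
    *pool_pos = 0;
    return written;
}
```

      Calling dump() repeatedly until *dev_pos equals ndevs visits every
      pool exactly once, which is the behavior a stop/resume dump needs.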
      
      Fixes: 950ab53b ("net: page_pool: implement GET in the netlink API")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      429679dc
    • Eric Dumazet's avatar
      geneve: make sure to pull inner header in geneve_rx() · 1ca1ba46
      Eric Dumazet authored
      syzbot triggered a bug in geneve_rx() [1]
      
      Issue is similar to the one I fixed in commit 8d975c15
      ("ip6_tunnel: make sure to pull inner header in __ip6_tnl_rcv()")
      
      We have to save skb->network_header in a temporary variable
      in order to be able to recompute the network_header pointer
      after a pskb_inet_may_pull() call.
      
      pskb_inet_may_pull() makes sure the needed headers are in skb->head.
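
      The "save the offset, then recompute the pointer" idea can be
      illustrated with a plain heap buffer. This is an analogy, not skb code:
      struct buf, buf_grow() and header_byte_after_grow() are hypothetical,
      and realloc() stands in for any operation that may move the underlying
      storage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer whose storage may move when it grows. */
struct buf {
    unsigned char *head;
    size_t len;
};

/* Grows the buffer; head may point somewhere else afterwards. */
static int buf_grow(struct buf *b, size_t new_len)
{
    unsigned char *p = realloc(b->head, new_len);

    if (!p)
        return -1;
    memset(p + b->len, 0, new_len - b->len);
    b->head = p;
    b->len = new_len;
    return 0;
}

/* Reads the byte `hdr` points at, after growing the buffer. */
static int header_byte_after_grow(struct buf *b, const unsigned char *hdr,
                                  size_t new_len)
{
    size_t off;

    assert(b && hdr >= b->head && hdr < b->head + b->len);
    assert(new_len >= b->len);
    off = (size_t)(hdr - b->head);  /* save the offset, not the pointer */
    if (buf_grow(b, new_len))
        return -1;
    hdr = b->head + off;            /* recompute from the current head */
    return *hdr;
}
```

      Dereferencing the original hdr after buf_grow() would read memory
      that may already have been released, which is the class of bug the
      temporary variable avoids.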
      
      [1]
      BUG: KMSAN: uninit-value in IP_ECN_decapsulate include/net/inet_ecn.h:302 [inline]
       BUG: KMSAN: uninit-value in geneve_rx drivers/net/geneve.c:279 [inline]
       BUG: KMSAN: uninit-value in geneve_udp_encap_recv+0x36f9/0x3c10 drivers/net/geneve.c:391
        IP_ECN_decapsulate include/net/inet_ecn.h:302 [inline]
        geneve_rx drivers/net/geneve.c:279 [inline]
        geneve_udp_encap_recv+0x36f9/0x3c10 drivers/net/geneve.c:391
        udp_queue_rcv_one_skb+0x1d39/0x1f20 net/ipv4/udp.c:2108
        udp_queue_rcv_skb+0x6ae/0x6e0 net/ipv4/udp.c:2186
        udp_unicast_rcv_skb+0x184/0x4b0 net/ipv4/udp.c:2346
        __udp4_lib_rcv+0x1c6b/0x3010 net/ipv4/udp.c:2422
        udp_rcv+0x7d/0xa0 net/ipv4/udp.c:2604
        ip_protocol_deliver_rcu+0x264/0x1300 net/ipv4/ip_input.c:205
        ip_local_deliver_finish+0x2b8/0x440 net/ipv4/ip_input.c:233
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
        dst_input include/net/dst.h:461 [inline]
        ip_rcv_finish net/ipv4/ip_input.c:449 [inline]
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip_rcv+0x46f/0x760 net/ipv4/ip_input.c:569
        __netif_receive_skb_one_core net/core/dev.c:5534 [inline]
        __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5648
        process_backlog+0x480/0x8b0 net/core/dev.c:5976
        __napi_poll+0xe3/0x980 net/core/dev.c:6576
        napi_poll net/core/dev.c:6645 [inline]
        net_rx_action+0x8b8/0x1870 net/core/dev.c:6778
        __do_softirq+0x1b7/0x7c5 kernel/softirq.c:553
        do_softirq+0x9a/0xf0 kernel/softirq.c:454
        __local_bh_enable_ip+0x9b/0xa0 kernel/softirq.c:381
        local_bh_enable include/linux/bottom_half.h:33 [inline]
        rcu_read_unlock_bh include/linux/rcupdate.h:820 [inline]
        __dev_queue_xmit+0x2768/0x51c0 net/core/dev.c:4378
        dev_queue_xmit include/linux/netdevice.h:3171 [inline]
        packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
        packet_snd net/packet/af_packet.c:3081 [inline]
        packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg net/socket.c:745 [inline]
        __sys_sendto+0x735/0xa10 net/socket.c:2191
        __do_sys_sendto net/socket.c:2203 [inline]
        __se_sys_sendto net/socket.c:2199 [inline]
        __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:3819 [inline]
        slab_alloc_node mm/slub.c:3860 [inline]
        kmem_cache_alloc_node+0x5cb/0xbc0 mm/slub.c:3903
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
        __alloc_skb+0x352/0x790 net/core/skbuff.c:651
        alloc_skb include/linux/skbuff.h:1296 [inline]
        alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6394
        sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2783
        packet_alloc_skb net/packet/af_packet.c:2930 [inline]
        packet_snd net/packet/af_packet.c:3024 [inline]
        packet_sendmsg+0x70c2/0x9f10 net/packet/af_packet.c:3113
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg net/socket.c:745 [inline]
        __sys_sendto+0x735/0xa10 net/socket.c:2191
        __do_sys_sendto net/socket.c:2203 [inline]
        __se_sys_sendto net/socket.c:2199 [inline]
        __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Fixes: 2d07dc79 ("geneve: add initial netdev driver for GENEVE tunnels")
      Reported-and-tested-by: syzbot+6a1423ff3f97159aae64@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ca1ba46
    • Steven Rostedt (Google)'s avatar
      tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string · 51270d57
      Steven Rostedt (Google) authored
      I'm updating __assign_str() and will be removing the second parameter. To
      make sure that it does not break anything, I make sure that it matches the
      __string() field, as that is where the string is actually going to be
      saved in. To make sure there's nothing that breaks, I added a WARN_ON() to
      make sure that what was used in __string() is the same that is used in
      __assign_str().
      
      In doing this change, an error was triggered as __assign_str() now expects
      the string passed in to be a char * value. I instead had the following
      warning:
      
      include/trace/events/qdisc.h: In function ‘trace_event_raw_event_qdisc_reset’:
      include/trace/events/qdisc.h:91:35: error: passing argument 1 of 'strcmp' from incompatible pointer type [-Werror=incompatible-pointer-types]
         91 |                 __assign_str(dev, qdisc_dev(q));
      
      That's because the qdisc_enqueue() and qdisc_reset() pass in qdisc_dev(q)
      to __assign_str() and to __string(). But that function returns a pointer
      to struct net_device and not a string.
      
      It appears that these events are just saving the pointer as a string and
      then reading it as a string as well.
      
      Use qdisc_dev(q)->name to save the device instead.
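
      A small userspace analogy, with hypothetical names
      (struct net_device_model, record_dev_name()): the string to record is
      the name member of the device structure, not the structure pointer
      itself reinterpreted as a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down device structure. */
struct net_device_model {
    char name[16];
};

/* Copies the device name into dst; dev itself is never treated as text. */
static void record_dev_name(char *dst, size_t dst_len,
                            const struct net_device_model *dev)
{
    assert(dst && dst_len > 0 && dev);
    snprintf(dst, dst_len, "%s", dev->name);
}
```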
      
      Fixes: a34dac0b ("net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51270d57
  5. 02 Mar, 2024 4 commits