1. 02 Aug, 2024 10 commits
    • sctp: Fix null-ptr-deref in reuseport_add_sock(). · 9ab0faa7
      Kuniyuki Iwashima authored
      syzbot reported a null-ptr-deref while accessing sk2->sk_reuseport_cb in
      reuseport_add_sock(). [0]
      
      The repro first creates a listener with SO_REUSEPORT.  Then, it creates
      another listener on the same port and concurrently closes the first
      listener.
      
      The second listen() calls reuseport_add_sock() with the first listener as
      sk2, where sk2->sk_reuseport_cb is not expected to be cleared concurrently,
      but the close() does clear it by reuseport_detach_sock().
      
      The problem is SCTP does not properly synchronise reuseport_alloc(),
      reuseport_add_sock(), and reuseport_detach_sock().
      
      The caller of reuseport_alloc() and reuseport_{add,detach}_sock() must
      provide synchronisation for sockets that are classified into the same
      reuseport group.
      
      Otherwise, such sockets form multiple identical reuseport groups, and
      all groups except one would be silently dead.
      
        1. Two sockets call listen() concurrently
        2. No socket in the same group found in sctp_ep_hashtable[]
        3. Two sockets call reuseport_alloc() and form two reuseport groups
        4. Only one group hit first in __sctp_rcv_lookup_endpoint() receives
            incoming packets
      
      Also, the reported null-ptr-deref could occur.
      
      TCP/UDP guarantees that would not happen by holding the hash bucket lock.
      
      Let's apply the locking strategy to __sctp_hash_endpoint() and
      __sctp_unhash_endpoint().
      
      [0]:
      Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
      KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
      CPU: 1 UID: 0 PID: 10230 Comm: syz-executor119 Not tainted 6.10.0-syzkaller-12585-g301927d2 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
      RIP: 0010:reuseport_add_sock+0x27e/0x5e0 net/core/sock_reuseport.c:350
      Code: 00 0f b7 5d 00 bf 01 00 00 00 89 de e8 1b a4 ff f7 83 fb 01 0f 85 a3 01 00 00 e8 6d a0 ff f7 49 8d 7e 12 48 89 f8 48 c1 e8 03 <42> 0f b6 04 28 84 c0 0f 85 4b 02 00 00 41 0f b7 5e 12 49 8d 7e 14
      RSP: 0018:ffffc9000b947c98 EFLAGS: 00010202
      RAX: 0000000000000002 RBX: ffff8880252ddf98 RCX: ffff888079478000
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000012
      RBP: 0000000000000001 R08: ffffffff8993e18d R09: 1ffffffff1fef385
      R10: dffffc0000000000 R11: fffffbfff1fef386 R12: ffff8880252ddac0
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f24e45b96c0(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffcced5f7b8 CR3: 00000000241be000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       __sctp_hash_endpoint net/sctp/input.c:762 [inline]
       sctp_hash_endpoint+0x52a/0x600 net/sctp/input.c:790
       sctp_listen_start net/sctp/socket.c:8570 [inline]
       sctp_inet_listen+0x767/0xa20 net/sctp/socket.c:8625
       __sys_listen_socket net/socket.c:1883 [inline]
       __sys_listen+0x1b7/0x230 net/socket.c:1894
       __do_sys_listen net/socket.c:1902 [inline]
       __se_sys_listen net/socket.c:1900 [inline]
       __x64_sys_listen+0x5a/0x70 net/socket.c:1900
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f24e46039b9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f24e45b9228 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 00007f24e468e428 RCX: 00007f24e46039b9
      RDX: 00007f24e46039b9 RSI: 0000000000000003 RDI: 0000000000000004
      RBP: 00007f24e468e420 R08: 00007f24e45b96c0 R09: 00007f24e45b96c0
      R10: 00007f24e45b96c0 R11: 0000000000000246 R12: 00007f24e468e42c
      R13: 00007f24e465a5dc R14: 0020000000000001 R15: 00007ffcced5f7d8
       </TASK>
      Modules linked in:
      
      Fixes: 6ba84574 ("sctp: process sk_reuseport in sctp_get_port_local")
      Reported-by: syzbot+e6979a5d2f10ecb700e4@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=e6979a5d2f10ecb700e4
      Tested-by: syzbot+e6979a5d2f10ecb700e4@syzkaller.appspotmail.com
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Xin Long <lucien.xin@gmail.com>
      Link: https://patch.msgid.link/20240731234624.94055-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • MAINTAINERS: update status of sky2 and skge drivers · eeef5f18
      Stephen Hemminger authored
      The old SysKonnect NICs are not used or actively maintained anymore.
      My sky2 NICs are all in a box in the back corner of the attic.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Link: https://patch.msgid.link/20240801162930.212299-1-stephen@networkplumber.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mptcp-fix-endpoints-with-signal-and-subflow-flags' · 16dc75e5
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: fix endpoints with 'signal' and 'subflow' flags
      
      When looking at improving the user experience around the MPTCP endpoints
      setup, I noticed that setting an endpoint with both the 'signal' and the
      'subflow' flags -- as it has been done in the past by users according to
      bug reports we got -- resulted in only announcing the endpoint, but
      not using it to create subflows: the 'subflow' flag was then ignored.
      
      My initial thought was to modify IPRoute2 to warn the user when the two
      flags were set, but it doesn't sound normal to ignore one of them. I
      then looked at modifying the kernel not to allow having the two flags
      set, but when discussing that with Mat, we thought it was maybe
      not ideal to do that, as there might be use-cases, and we might break some
      configs. Then I saw it was working before v5.17. So instead, I fixed the
      support on the kernel side (patch 5) using Paolo's suggestion. This also
      includes a fix on the options side (patch 1: for v5.11+), an explicit
      deny of some options combinations (patch 2: for v5.18+), and some
      refactoring (patches 3 and 4) to ease the inclusion of the patch 5.
      
      While at it, I added a new selftest (patch 7) to validate this case --
      including a modification of the chk_add_nr helper to invert the sides
      where the counters are checked (patch 6) -- and allowed ADD_ADDR echo
      just after the MP_JOIN 3WHS.
      
      The selftests modifications have the same Fixes tag as the previous
      commit, but no 'Cc: Stable': if the backport can work, that's good --
      but it still needs to be verified by running the selftests -- if not, no
      need to worry, many CIs will use the selftests from the last stable
      version to validate previous stable releases.
      ====================
      
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-0-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: mptcp: join: test both signal & subflow · 4d2868b5
      Matthieu Baerts (NGI0) authored
      It should be quite uncommon to set both the subflow and the signal
      flags: the initiator of the connection is typically the one creating new
      subflows, not the other peer, then no need to announce additional local
      addresses, and use it to create subflows.
      
      But some people might be confused about the flags, and set both "just to
      be sure at least the right one is set". To verify the previous fix, and
      avoid future regressions, this specific case is now validated: the
      client announces a new address, and initiates a new subflow from the
      same address.
      
      While working on this, another bug was noticed, where the client
      reset the new subflow because an ADD_ADDR echo was received as the 3rd
      ACK: this new test also explicitly checks that no RSTs have been sent by
      the client and server.
      
      The 'Fixes' tag here below is the same as the one from the previous
      commit: this patch here is not fixing anything wrong in the selftests,
      but it validates the previous fix for an issue introduced by this commit
      ID.
      
      Fixes: 86e39e04 ("mptcp: keep track of local endpoint still available for each msk")
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-7-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: mptcp: join: ability to invert ADD_ADDR check · bec1f3b1
      Matthieu Baerts (NGI0) authored
      In the following commit, the client will initiate the ADD_ADDR, instead
      of the server. We need a way to verify the ADD_ADDR has been correctly
      sent.
      
      Note: the default expected counters for when the port number is given
      are never changed by the caller, no need to accept them as parameter
      then.
      
      The 'Fixes' tag here below is the same as the one from the previous
      commit: this patch here is not fixing anything wrong in the selftests,
      but it validates the previous fix for an issue introduced by this commit
      ID.
      
      Fixes: 86e39e04 ("mptcp: keep track of local endpoint still available for each msk")
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-6-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set · 85df533a
      Matthieu Baerts (NGI0) authored
      Up to the 'Fixes' commit, having an endpoint with both the 'signal' and
      'subflow' flags, resulted in the creation of a subflow and an address
      announcement using the address linked to this endpoint. After this
      commit, only the address announcement was done, ignoring the 'subflow'
      flag.
      
      That's because the same bitmap is used for the two flags. It is OK to
      keep this single bitmap: the already selected local endpoint simply has
      to be re-used, but not via select_local_address(), so as not to look at
      the just modified bitmap.
      
      Note that it is unusual to set the two flags together: creating a new
      subflow using a new local address will implicitly advertise it to the
      other peer. So in theory, no need to advertise it explicitly as well.
      Maybe there are use-cases -- the subflow might not reach the other peer
      that way, we can ask the other peer to try initiating the new subflow
      without delay -- or very likely the user is confused, and put both flags
      "just to be sure at least the right one is set". Still, if it is
      allowed, the kernel should do what has been asked: using this endpoint
      to announce the address and to create a new subflow from it.
      
      An alternative is to forbid the use of the two flags together, but
      that's probably too late, there are maybe use-cases, and it was working
      before. This patch will avoid people complaining subflows are not
      created using the endpoint they added with the 'subflow' and 'signal'
      flags.
      
      Note that with the current patch, the subflow might not be created in
      some corner cases, e.g. if the 'subflows' limit was reached when sending
      the ADD_ADDR, but changed later on. It is probably not worth splitting
      id_avail_bitmap per target ('signal', 'subflow'), which will add another
      large field to the msk "just" to track (again) endpoints. Anyway,
      currently when the limits are changed, the kernel doesn't check if new
      subflows can be created or removed, because we would need to keep track
      of the received ADD_ADDR, and more. It sounds OK to assume that the
      limits should be properly configured before establishing new
      connections.
      
      Fixes: 86e39e04 ("mptcp: keep track of local endpoint still available for each msk")
      Cc: stable@vger.kernel.org
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-5-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: pm: don't try to create sf if alloc failed · cd7c957f
      Matthieu Baerts (NGI0) authored
      It sounds better to avoid wasting cycles and / or put extreme memory
      pressure on the system by trying to create new subflows if it was not
      possible to add a new item in the announce list.
      
      While at it, a warning is now printed if the entry was already in the
      list as it should not happen with the in-kernel path-manager. With this
      PM, mptcp_pm_alloc_anno_list() should only fail in case of memory
      pressure.
      
      Fixes: b6c08380 ("mptcp: remove addr and subflow in PM netlink")
      Cc: stable@vger.kernel.org
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-4-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Matthieu Baerts (NGI0)'s avatar
      c95eb32c
    • mptcp: pm: deny endp with signal + subflow + port · 8af1f118
      Matthieu Baerts (NGI0) authored
      As mentioned in the 'Fixes' commit, the port flag is only supported by
      the 'signal' flag, and not by the 'subflow' one. Then if both the
      'signal' and 'subflow' flags are set, the problem is the same: the
      feature cannot work with the 'subflow' flag.
      
      Technically, if both the 'signal' and 'subflow' flags are set, it will
      be possible to create the listening socket, but not to establish a
      subflow using this source port. So better to explicitly deny it, to avoid
      confusion, because the expected behaviour is not possible.
      
      Fixes: 09f12c3a ("mptcp: allow to use port and non-signal in set_flags")
      Cc: stable@vger.kernel.org
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-2-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: fully established after ADD_ADDR echo on MPJ · d67c5649
      Matthieu Baerts (NGI0) authored
      Before this patch, receiving an ADD_ADDR echo on the just connected
      MP_JOIN subflow -- initiator side, after the MP_JOIN 3WHS -- was
      resulting in an MP_RESET. That's because only ACKs with a DSS or
      ADD_ADDRs without the echo bit were allowed.
      
      Not allowing the ADD_ADDR echo after an MP_CAPABLE 3WHS makes sense, as
      we are not supposed to send an ADD_ADDR before because it requires being
      in fully established mode first. For the MP_JOIN 3WHS, that's different:
      the ADD_ADDR can be sent on a previous subflow, and the ADD_ADDR echo
      can be received on the recently created one. The other peer will already
      be in fully established mode, so it is allowed to send that.
      
      We can then relax the conditions here to accept the ADD_ADDR echo for
      MPJ subflows.
      
      Fixes: 67b12f79 ("mptcp: full fully established support after ADD_ADDR")
      Cc: stable@vger.kernel.org
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-endp-subflow-signal-v1-1-c8a9b036493b@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 Aug, 2024 22 commits
  3. 31 Jul, 2024 8 commits
    • netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init(). · c22921df
      Kuniyuki Iwashima authored
      ip6table_nat_table_init() accesses net->gen->ptr[ip6table_nat_net_ops.id],
      but the function is exposed to user space before the entry is allocated
      via register_pernet_subsys().
      
      Let's call register_pernet_subsys() before xt_register_template().
      
      Fixes: fdacd57c ("netfilter: x_tables: never register tables by default")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init(). · 5830aa86
      Kuniyuki Iwashima authored
      We had a report that iptables-restore sometimes triggered null-ptr-deref
      at boot time. [0]
      
      The problem is that iptable_nat_table_init() is exposed to user space
      before the kernel fully initialises netns.
      
      In the small race window, a user could call iptable_nat_table_init()
      that accesses net_generic(net, iptable_nat_net_id), which is available
      only after registering iptable_nat_net_ops.
      
      Let's call register_pernet_subsys() before xt_register_template().
      
      [0]:
      bpfilter: Loaded bpfilter_umh pid 11702
      Started bpfilter
      BUG: kernel NULL pointer dereference, address: 0000000000000013
       PF: supervisor write access in kernel mode
       PF: error_code(0x0002) - not-present page
      PGD 0 P4D 0
      PREEMPT SMP NOPTI
      CPU: 2 PID: 11879 Comm: iptables-restor Not tainted 6.1.92-99.174.amzn2023.x86_64 #1
      Hardware name: Amazon EC2 c6i.4xlarge/, BIOS 1.0 10/16/2017
      RIP: 0010:iptable_nat_table_init (net/ipv4/netfilter/iptable_nat.c:87 net/ipv4/netfilter/iptable_nat.c:121) iptable_nat
      Code: 10 4c 89 f6 48 89 ef e8 0b 19 bb ff 41 89 c4 85 c0 75 38 41 83 c7 01 49 83 c6 28 41 83 ff 04 75 dc 48 8b 44 24 08 48 8b 0c 24 <48> 89 08 4c 89 ef e8 a2 3b a2 cf 48 83 c4 10 44 89 e0 5b 5d 41 5c
      RSP: 0018:ffffbef902843cd0 EFLAGS: 00010246
      RAX: 0000000000000013 RBX: ffff9f4b052caa20 RCX: ffff9f4b20988d80
      RDX: 0000000000000000 RSI: 0000000000000064 RDI: ffffffffc04201c0
      RBP: ffff9f4b29394000 R08: ffff9f4b07f77258 R09: ffff9f4b07f77240
      R10: 0000000000000000 R11: ffff9f4b09635388 R12: 0000000000000000
      R13: ffff9f4b1a3c6c00 R14: ffff9f4b20988e20 R15: 0000000000000004
      FS:  00007f6284340000(0000) GS:ffff9f51fe280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000013 CR3: 00000001d10a6005 CR4: 00000000007706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
       ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
       ? xt_find_table_lock (net/netfilter/x_tables.c:1259)
       ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420)
       ? page_fault_oops (arch/x86/mm/fault.c:727)
       ? exc_page_fault (./arch/x86/include/asm/irqflags.h:40 ./arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1470 arch/x86/mm/fault.c:1518)
       ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)
       ? iptable_nat_table_init (net/ipv4/netfilter/iptable_nat.c:87 net/ipv4/netfilter/iptable_nat.c:121) iptable_nat
       xt_find_table_lock (net/netfilter/x_tables.c:1259)
       xt_request_find_table_lock (net/netfilter/x_tables.c:1287)
       get_info (net/ipv4/netfilter/ip_tables.c:965)
       ? security_capable (security/security.c:809 (discriminator 13))
       ? ns_capable (kernel/capability.c:376 kernel/capability.c:397)
       ? do_ipt_get_ctl (net/ipv4/netfilter/ip_tables.c:1656)
       ? bpfilter_send_req (net/bpfilter/bpfilter_kern.c:52) bpfilter
       nf_getsockopt (net/netfilter/nf_sockopt.c:116)
       ip_getsockopt (net/ipv4/ip_sockglue.c:1827)
       __sys_getsockopt (net/socket.c:2327)
       __x64_sys_getsockopt (net/socket.c:2342 net/socket.c:2339 net/socket.c:2339)
       do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:81)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
      RIP: 0033:0x7f62844685ee
      Code: 48 8b 0d 45 28 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 09
      RSP: 002b:00007ffd1f83d638 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
      RAX: ffffffffffffffda RBX: 00007ffd1f83d680 RCX: 00007f62844685ee
      RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 0000000000000004 R08: 00007ffd1f83d670 R09: 0000558798ffa2a0
      R10: 00007ffd1f83d680 R11: 0000000000000246 R12: 00007ffd1f83e3b2
      R13: 00007f628455baa0 R14: 00007ffd1f83d7b0 R15: 00007f628457a008
       </TASK>
      Modules linked in: iptable_nat(+) bpfilter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache veth xt_state xt_connmark xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 vfat fat ghash_clmulni_intel aesni_intel ena crypto_simd ptp cryptd i8042 pps_core serio button sunrpc sch_fq_codel configfs loop dm_mod fuse dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
      CR2: 0000000000000013
      
      Fixes: fdacd57c ("netfilter: x_tables: never register tables by default")
      Reported-by: Takahiro Kawahara <takawaha@amazon.co.jp>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • minmax: fix up min3() and max3() too · 21b136cc
      Linus Torvalds authored
      David Laight pointed out that we should deal with the min3() and max3()
      mess too, which still does excessive expansion.
      
      And our current macros are actually rather broken.
      
      In particular, the macros did this:
      
        #define min3(x, y, z) min((typeof(x))min(x, y), z)
        #define max3(x, y, z) max((typeof(x))max(x, y), z)
      
      and that not only is a nested expansion of possibly very complex
      arguments with all that involves, the typing with that "typeof()" cast
      is completely wrong.
      
      For example, imagine what happens in max3() if 'x' happens to be a
      'unsigned char', but 'y' and 'z' are 'unsigned long'.  The types are
      compatible, and there's no warning - but the result is just random
      garbage.
      
      No, I don't think we've ever hit that issue in practice, but since we
      now have sane infrastructure for doing this right, let's just use it.
      It fixes any excessive expansion, and also avoids these kinds of broken
      type issues.
      
      Requested-by: David Laight <David.Laight@aculab.com>
      Acked-by: Arnd Bergmann <arnd@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · e4fc196f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix regression in extent map rework when handling insertion of
         overlapping compressed extent
      
       - fix unexpected file length when appending to a file using direct io
         and buffer not faulted in
      
       - in zoned mode, fix accounting of unusable space when flipping
         read-only block group back to read-write
      
       - fix page locking when COWing an inline range, assertion failure found
         by syzbot
      
       - fix calculation of space info in debugging print
      
       - tree-checker, add validation of data reference item
      
       - fix a few -Wmaybe-uninitialized build warnings
      
      * tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
        btrfs: fix corruption after buffer fault in during direct IO append write
        btrfs: zoned: fix zone_unusable accounting on making block group read-write again
        btrfs: do not subtract delalloc from avail bytes
        btrfs: make cow_file_range_inline() honor locked_page on error
        btrfs: fix corrupt read due to bad offset of a compressed extent map
        btrfs: tree-checker: validate dref root and objectid
    • Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of... · e254e0c5
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
      
      Pull perf tools fixes from Namhyung Kim:
       "Some more build fixes and a random crash fix:
      
         - Fix cross-build by setting pkg-config env according to the arch
      
         - Fix static build for missing library dependencies
      
         - Fix Segfault when callchain has no symbols"
      
      * tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
        perf docs: Document cross compilation
        perf: build: Link lib 'zstd' for static build
        perf: build: Link lib 'lzma' for static build
        perf: build: Only link libebl.a for old libdw
        perf: build: Set Python configuration for cross compilation
        perf: build: Setup PKG_CONFIG_LIBDIR for cross compilation
        perf tool: fix dereferencing NULL al->maps
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 0bf50cea
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      ice: fix AF_XDP ZC timeout and concurrency issues
      
      Maciej Fijalkowski says:
      
      Changes included in this patchset address an issue that a customer has
      been facing when AF_XDP ZC Tx sockets were used in combination with flow
      control and regular Tx traffic.
      
      After executing:
      ethtool --set-priv-flags $dev link-down-on-close on
      ethtool -A $dev rx on tx on
      
      launching multiple ZC Tx sockets on $dev + pinging remote interface (so
      that regular Tx traffic is present) and then going through down/up of
      $dev, Tx timeout occurred and then most of the time ice driver was unable
      to recover from that state.
      
      These patches combined together solve the issue described above on the
      customer side. The main focus here is to forbid producing Tx descriptors
      when either the carrier is not yet initialized or the process of bringing
      the interface down has already started.
      
      v1: https://lore.kernel.org/netdev/20240708221416.625850-1-anthony.l.nguyen@intel.com/
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        ice: xsk: fix txq interrupt mapping
        ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog
        ice: improve updating ice_{t,r}x_ring::xsk_pool
        ice: toggle netif_carrier when setting up XSK pool
        ice: modify error handling when setting XSK pool in ndo_bpf
        ice: replace synchronize_rcu with synchronize_net
        ice: don't busy wait for Rx queue disable in ice_qp_dis()
        ice: respect netif readiness in AF_XDP ZC related ndo's
      ====================
      
      Link: https://patch.msgid.link/20240729200716.681496-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: drop bad gso csum_start and offset in virtio_net_hdr · 89add400
      Willem de Bruijn authored
      Tighten csum_start and csum_offset checks in virtio_net_hdr_to_skb
      for GSO packets.
      
      The function already checks that a checksum requested with
      VIRTIO_NET_HDR_F_NEEDS_CSUM is in skb linear. But for GSO packets
      this might not hold for segs after segmentation.
      
      Syzkaller demonstrated that this warning in skb_checksum_help can be reached:
      
      	offset = skb_checksum_start_offset(skb);
      	ret = -EINVAL;
      	if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
      
      By injecting a TSO packet:
      
      WARNING: CPU: 1 PID: 3539 at net/core/dev.c:3284 skb_checksum_help+0x3d0/0x5b0
       ip_do_fragment+0x209/0x1b20 net/ipv4/ip_output.c:774
       ip_finish_output_gso net/ipv4/ip_output.c:279 [inline]
       __ip_finish_output+0x2bd/0x4b0 net/ipv4/ip_output.c:301
       iptunnel_xmit+0x50c/0x930 net/ipv4/ip_tunnel_core.c:82
       ip_tunnel_xmit+0x2296/0x2c70 net/ipv4/ip_tunnel.c:813
       __gre_xmit net/ipv4/ip_gre.c:469 [inline]
       ipgre_xmit+0x759/0xa60 net/ipv4/ip_gre.c:661
       __netdev_start_xmit include/linux/netdevice.h:4850 [inline]
       netdev_start_xmit include/linux/netdevice.h:4864 [inline]
       xmit_one net/core/dev.c:3595 [inline]
       dev_hard_start_xmit+0x261/0x8c0 net/core/dev.c:3611
       __dev_queue_xmit+0x1b97/0x3c90 net/core/dev.c:4261
       packet_snd net/packet/af_packet.c:3073 [inline]
      
      The geometry of the bad input packet at tcp_gso_segment:
      
      [   52.003050][ T8403] skb len=12202 headroom=244 headlen=12093 tailroom=0
      [   52.003050][ T8403] mac=(168,24) mac_len=24 net=(192,52) trans=244
      [   52.003050][ T8403] shinfo(txflags=0 nr_frags=1 gso(size=1552 type=3 segs=0))
      [   52.003050][ T8403] csum(0x60000c7 start=199 offset=1536
      ip_summed=3 complete_sw=0 valid=0 level=0)
      
      Mitigate with stricter input validation.
      
      csum_offset: for GSO packets, deduce the correct value from gso_type.
      This is already done for USO. Extend it to TSO. Let UFO be:
      udp[46]_ufo_fragment ignores these fields and always computes the
      checksum in software.
      
      csum_start: finding the real offset requires parsing to the transport
      header. Do not add a parser, use existing segmentation parsing. Thanks
      to SKB_GSO_DODGY, that also catches bad packets that are hw offloaded.
      Again test both TSO and USO. Do not test UFO for the above reason, and
      do not test UDP tunnel offload.
      
      GSO packets are almost always CHECKSUM_PARTIAL. USO packets may be
      CHECKSUM_NONE since commit 10154dbd ("udp: Allow GSO transmit
      from devices with no checksum offload"), but then still these fields
      are initialized correctly in udp4_hwcsum/udp6_hwcsum_outgoing. So no
      need to test for ip_summed == CHECKSUM_PARTIAL first.
      
      This revises an existing fix mentioned in the Fixes tag, which broke
      small packets with GSO offload, as detected by kselftests.
      
      Link: https://syzkaller.appspot.com/bug?extid=e1db31216c789f552871
      Link: https://lore.kernel.org/netdev/20240723223109.2196886-1-kuba@kernel.org
      Fixes: e269d79c ("net: missing check virtio")
      Cc: stable@vger.kernel.org
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Link: https://patch.msgid.link/20240729201108.1615114-1-willemdebruijn.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and aqr115c · a7f3abcf
      Bartosz Golaszewski authored
      Commit 708405f3 ("net: phy: aquantia: wait for the GLOBAL_CFG to
      start returning real values") introduced a workaround for an issue
      observed on aqr115c. However there were never any reports of it
      happening on other models and the workaround has been reported to cause
      an issue on aqr113c (and it may cause the same on any other model not
      supporting 10M mode).
      
      Let's limit the impact of the workaround to aqr113, aqr113c and aqr115c
      and poll the 100M GLOBAL_CFG register instead as both models are known
      to support it correctly.
      
      Reported-by: Jon Hunter <jonathanh@nvidia.com>
      Closes: https://lore.kernel.org/lkml/7c0140be-4325-4005-9068-7e0fc5ff344d@nvidia.com/
      Fixes: 708405f3 ("net: phy: aquantia: wait for the GLOBAL_CFG to start returning real values")
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Reviewed-by: Antoine Tenart <atenart@kernel.org>
      Link: https://patch.msgid.link/20240729150315.65798-1-brgl@bgdev.pl
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>