1. 18 Mar, 2022 10 commits
    • Yonglong Li's avatar
      mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skb · 3ef3905a
      Yonglong Li authored
      Got crash when doing pressure test of mptcp:
      
      ===========================================================================
      dst_release: dst:ffffa06ce6e5c058 refcnt:-1
      kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
      BUG: unable to handle kernel paging request at ffffa06ce6e5c058
      PGD 190a01067 P4D 190a01067 PUD 43fffb067 PMD 22e403063 PTE 8000000226e5c063
      Oops: 0011 [#1] SMP PTI
      CPU: 7 PID: 7823 Comm: kworker/7:0 Kdump: loaded Tainted: G            E
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.2.1 04/01/2014
      Call Trace:
       ? skb_release_head_state+0x68/0x100
       ? skb_release_all+0xe/0x30
       ? kfree_skb+0x32/0xa0
       ? mptcp_sendmsg_frag+0x57e/0x750
       ? __mptcp_retrans+0x21b/0x3c0
       ? __switch_to_asm+0x35/0x70
       ? mptcp_worker+0x25e/0x320
       ? process_one_work+0x1a7/0x360
       ? worker_thread+0x30/0x390
       ? create_worker+0x1a0/0x1a0
       ? kthread+0x112/0x130
       ? kthread_flush_work_fn+0x10/0x10
       ? ret_from_fork+0x35/0x40
      ===========================================================================
      
      In __mptcp_alloc_tx_skb skb was allocated and skb->tcp_tsorted_anchor will
      be initialized, in under memory pressure situation sk_wmem_schedule will
      return false and then kfree_skb. In this case skb->_skb_refdst is not null
      because_skb_refdst and tcp_tsorted_anchor are stored in the same mem, and
      kfree_skb will try to release dst and cause crash.
      
      Fixes: f70cad10 ("mptcp: stop relying on tcp_tx_skb_cache")
      Reviewed-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarYonglong Li <liyonglong@chinatelecom.cn>
      Signed-off-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Link: https://lore.kernel.org/r/20220317220953.426024-1-mathew.j.martineau@linux.intel.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3ef3905a
    • Jakub Kicinski's avatar
      Merge branch 'ipv4-handle-tos-and-scope-properly-for-icmp-redirects-and-pmtu-updates' · 03e2777c
      Jakub Kicinski authored
      Guillaume Nault says:
      
      ====================
      ipv4: Handle TOS and scope properly for ICMP redirects and PMTU updates
      
      ICMPv4 PMTU and redirect handlers didn't properly initialise the
      struct flowi4 they used for route lookups:
      
        * ECN bits sometimes weren't cleared from ->flowi4_tos.
        * The RTO_ONLINK flag wasn't taken into account for ->flowi4_scope.
      
      In some special cases, this resulted in ICMP redirects and PMTU updates
      not being taken into account because fib_lookup() couldn't retrieve the
      correct route.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1647519748.git.gnault@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      03e2777c
    • Guillaume Nault's avatar
      selftest: net: Test IPv4 PMTU exceptions with DSCP and ECN · ec730c3e
      Guillaume Nault authored
      Add two tests to pmtu.sh, for verifying that PMTU exceptions get
      properly created for routes that don't belong to the main table.
      
      A fib-rule based on the packet's DSCP field is used to jump to the
      correct table. ECN shouldn't interfere with this process, so each test
      has two components: one that only sets DSCP and one that sets both DSCP
      and ECN.
      
      One of the test triggers PMTU exceptions using ICMP Echo Requests, the
      other using UDP packets (to test different handlers in the kernel).
      
      A few adjustments are necessary in the rest of the script to allow
      policy routing scenarios:
      
        * Add global variable rt_table that allows setup_routing_*() to
          add routes to a specific routing table. By default rt_table is set
          to "main", so existing tests don't need to be modified.
      
        * Another global variable, policy_mark, is used to define which
          dsfield value is used for policy routing. This variable has no
          effect on tests that don't use policy routing.
      
        * The UDP version of the test uses socat. So cleanup() now also need
          to kill socat PIDs.
      
        * route_get_dst_pmtu_from_exception() and route_get_dst_exception()
          now take an optional third argument specifying the dsfield. If
          not specified, 0 is used, so existing users don't need to be
          modified.
      Signed-off-by: default avatarGuillaume Nault <gnault@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ec730c3e
    • Guillaume Nault's avatar
      ipv4: Fix route lookups when handling ICMP redirects and PMTU updates · 544b4dd5
      Guillaume Nault authored
      The PMTU update and ICMP redirect helper functions initialise their fl4
      variable with either __build_flow_key() or build_sk_flow_key(). These
      initialisation functions always set ->flowi4_scope with
      RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
      not a problem when the route lookup is later done via
      ip_route_output_key_hash(), which properly clears the ECN bits from
      ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
      flag. However, some helpers call fib_lookup() directly, without
      sanitising the tos and scope fields, so the route lookup can fail and,
      as a result, the ICMP redirect or PMTU update aren't taken into
      account.
      
      Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
      code into ip_rt_fix_tos(), then use this function in handlers that call
      fib_lookup() directly.
      
      Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
      place (like __build_flow_key() or flowi4_init_output()), because
      ip_route_output_key_hash() expects non-sanitised values. When called
      with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
      RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
      sanitise the values only for those paths that don't call
      ip_route_output_key_hash().
      
      Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
      ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
      RT_SCOPE_LINK probably wasn't really a problem: sockets with the
      SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
      normally shouldn't receive ICMP redirects or PMTU updates.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: default avatarGuillaume Nault <gnault@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      544b4dd5
    • Jakub Kicinski's avatar
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6bd0c76b
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-03-18
      
      We've added 2 non-merge commits during the last 18 day(s) which contain
      a total of 2 files changed, 50 insertions(+), 20 deletions(-).
      
      The main changes are:
      
      1) Fix a race in XSK socket teardown code that can lead to a NULL pointer
         dereference, from Magnus.
      
      2) Small MAINTAINERS doc update to remove Lorenz from sockmap, from Lorenz.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        xsk: Fix race at socket teardown
        bpf: Remove Lorenz Bauer from L7 BPF maintainers
      ====================
      
      Link: https://lore.kernel.org/r/20220318152418.28638-1-daniel@iogearbox.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6bd0c76b
    • David S. Miller's avatar
      Merge branch 'af_unix-OOB-fixes' · 9905eed4
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix some OOB implementation.
      
      This series fixes some data-races and adds a missing feature around the
      commit 314001f0 ("af_unix: Add OOB support").
      
      Changelog:
        - v3:
          - Add the first patch
      
        - v2: https://lore.kernel.org/netdev/20220315054801.72035-1-kuniyu@amazon.co.jp/
          - Add READ_ONCE() to avoid a race reported by KCSAN (Eric)
          - Add IS_ENABLED(CONFIG_AF_UNIX_OOB) (Shoaib)
      
        - v1: https://lore.kernel.org/netdev/20220314052110.53634-1-kuniyu@amazon.co.jp/
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9905eed4
    • Kuniyuki Iwashima's avatar
      af_unix: Support POLLPRI for OOB. · d9a232d4
      Kuniyuki Iwashima authored
      The commit 314001f0 ("af_unix: Add OOB support") introduced OOB for
      AF_UNIX, but it lacks some changes for POLLPRI.  Let's add the missing
      piece.
      
      In the selftest, normal datagrams are sent followed by OOB data, so this
      commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test
      case.
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d9a232d4
    • Kuniyuki Iwashima's avatar
      af_unix: Fix some data-races around unix_sk(sk)->oob_skb. · e82025c6
      Kuniyuki Iwashima authored
      Out-of-band data automatically places a "mark" showing wherein the
      sequence the out-of-band data would have been.  If the out-of-band data
      implies cancelling everything sent so far, the "mark" is helpful to flush
      them.  When the socket's read pointer reaches the "mark", the ioctl() below
      sets a non zero value to the arg `atmark`:
      
      The out-of-band data is queued in sk->sk_receive_queue as well as ordinary
      data and also saved in unix_sk(sk)->oob_skb.  It can be used to test if the
      head of the receive queue is the out-of-band data meaning the socket is at
      the "mark".
      
      While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly.
      Thus, all accesses to oob_skb need some basic protection to avoid
      load/store tearing which KCSAN detects when these are called concurrently:
      
        - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark))
        - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB)
      
      BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg
      
      write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1:
       unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191)
       sock_sendmsg (net/socket.c:705 net/socket.c:725)
       __sys_sendto (net/socket.c:2040)
       __x64_sys_sendto (net/socket.c:2048)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0:
       unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1))
       sock_do_ioctl (net/socket.c:1128)
       sock_ioctl (net/socket.c:1242)
       __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      value changed: 0xffff888003da0c00 -> 0xffff888003da0d00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2a #12
      Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e82025c6
    • Sukadev Bhattiprolu's avatar
      ibmvnic: fix race between xmit and reset · 4219196d
      Sukadev Bhattiprolu authored
      There is a race between reset and the transmit paths that can lead to
      ibmvnic_xmit() accessing an scrq after it has been freed in the reset
      path. It can result in a crash like:
      
      	Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
      	BUG: Kernel NULL pointer dereference on read at 0x00000000
      	Faulting instruction address: 0xc0080000016189f8
      	Oops: Kernel access of bad area, sig: 11 [#1]
      	...
      	NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic]
      	LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	Call Trace:
      	[c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable)
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c9cfcc] sch_direct_xmit+0xec/0x330
      	[c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0
      	[c000000000c00ad4] __dev_queue_xmit+0x394/0x730
      	[c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding]
      	[c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding]
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c00ca4] __dev_queue_xmit+0x564/0x730
      	[c000000000cf97e0] neigh_hh_output+0xd0/0x180
      	[c000000000cfa69c] ip_finish_output2+0x31c/0x5c0
      	[c000000000cfd244] __ip_queue_xmit+0x194/0x4f0
      	[c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0
      	[c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0
      	[c000000000d2d984] tcp_retransmit_skb+0x34/0x130
      	[c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0
      	[c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330
      	[c000000000d317bc] tcp_write_timer+0x5c/0x200
      	[c000000000243270] call_timer_fn+0x50/0x1c0
      	[c000000000243704] __run_timers.part.0+0x324/0x460
      	[c000000000243894] run_timer_softirq+0x54/0xa0
      	[c000000000ea713c] __do_softirq+0x15c/0x3e0
      	[c000000000166258] __irq_exit_rcu+0x158/0x190
      	[c000000000166420] irq_exit+0x20/0x40
      	[c00000000002853c] timer_interrupt+0x14c/0x2b0
      	[c000000000009a00] decrementer_common_virt+0x210/0x220
      	--- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
      
      The immediate cause of the crash is the access of tx_scrq in the following
      snippet during a reset, where the tx_scrq can be either NULL or an address
      that will soon be invalid:
      
      	ibmvnic_xmit()
      	{
      		...
      		tx_scrq = adapter->tx_scrq[queue_num];
      		txq = netdev_get_tx_queue(netdev, queue_num);
      		ind_bufp = &tx_scrq->ind_buf;
      
      		if (test_bit(0, &adapter->resetting)) {
      		...
      	}
      
      But beyond that, the call to ibmvnic_xmit() itself is not safe during a
      reset and the reset path attempts to avoid this by stopping the queue in
      ibmvnic_cleanup(). However just after the queue was stopped, an in-flight
      ibmvnic_complete_tx() could have restarted the queue even as the reset is
      progressing.
      
      Since the queue was restarted we could get a call to ibmvnic_xmit() which
      can then access the bad tx_scrq (or other fields).
      
      We cannot however simply have ibmvnic_complete_tx() check the ->resetting
      bit and skip starting the queue. This can race at the "back-end" of a good
      reset which just restarted the queue but has not cleared the ->resetting
      bit yet. If we skip restarting the queue due to ->resetting being true,
      the queue would remain stopped indefinitely potentially leading to transmit
      timeouts.
      
      IOW ->resetting is too broad for this purpose. Instead use a new flag
      that indicates whether or not the queues are active. Only the open/
      reset paths control when the queues are active. ibmvnic_complete_tx()
      and others wake up the queue only if the queue is marked active.
      
      So we will have:
      	A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open()
      
      		->resetting = true
      		->tx_queues_active = false
      		disable tx queues
      		...
      		->tx_queues_active = true
      		start tx queues
      
      	B. Tx interrupt in ibmvnic_complete_tx():
      
      		if (->tx_queues_active)
      			netif_wake_subqueue();
      
      To ensure that ->tx_queues_active and state of the queues are consistent,
      we need a lock which:
      
      	- must also be taken in the interrupt path (ibmvnic_complete_tx())
      	- shared across the multiple queues in the adapter (so they don't
      	  become serialized)
      
      Use rcu_read_lock() and have the reset thread synchronize_rcu() after
      updating the ->tx_queues_active state.
      
      While here, consolidate a few boolean fields in ibmvnic_adapter for
      better alignment.
      
      Based on discussions with Brian King and Dany Madden.
      
      Fixes: 7ed5b31f ("net/ibmvnic: prevent more than one thread from running in reset")
      Reported-by: default avatarVaishnavi Bhat <vaish123@in.ibm.com>
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4219196d
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 4fa331b4
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix PPPoE and QinQ with flowtable inet family.
      
      2) Missing register validation in nf_tables.
      
      3) Initialize registers to avoid stack memleak to userspace.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4fa331b4
  2. 17 Mar, 2022 16 commits
  3. 16 Mar, 2022 9 commits
  4. 15 Mar, 2022 5 commits