1. 09 Jun, 2021 4 commits
    • udp: fix race between close() and udp_abort() · a8b897c7
      Paolo Abeni authored
      Kaustubh reported and diagnosed a panic in udp_lib_lookup().
      The root cause is udp_abort() racing with close(). Both
      racing functions acquire the socket lock, but udp{v6}_destroy_sock()
      releases it before performing destructive actions.
      
      We can't easily extend the socket lock scope to avoid the race,
      so instead use the SOCK_DEAD flag to prevent udp_abort() from taking
      any action when the critical race happens.
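The flag check described above can be pictured with a small user-space model. This is only a sketch: `struct model_sock` and the `model_*` functions are hypothetical stand-ins, not the kernel's `struct sock` or its helpers, and the locking itself is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket; not the kernel's struct sock. */
struct model_sock {
	bool dead;    /* plays the role of the SOCK_DEAD flag */
	bool hashed;  /* still reachable through lookups */
	int  err;     /* error that an abort would report */
};

/* close() side: mark the socket dead before any destructive work
 * (assumed to happen while the socket lock is still held). */
void model_close(struct model_sock *sk)
{
	sk->dead = true;
}

/* abort side: back off if close() already owns the teardown.
 * Returns true if the abort acted, false if it did nothing. */
bool model_abort(struct model_sock *sk, int err)
{
	if (sk->dead)
		return false;
	sk->err = err;
	sk->hashed = false;
	return true;
}
```

Calling `model_abort()` after `model_close()` leaves the socket untouched, which is the behaviour the commit message asks for.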
      Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org>
      Fixes: 5d77dca8 ("net: diag: support SOCK_DESTROY for UDP sockets")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: annotate data race in inet_send_prepare() and inet_dgram_connect() · dcd01eea
      Eric Dumazet authored
      Both functions are known to be racy when reading inet_num,
      as we do not want to grab locks in the common case where the socket
      has already been bound. The race is resolved in inet_autobind()
      by reading inet_num again under the socket lock.
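The pattern described above (an unlocked read on the fast path, then a second read under the lock before binding) might look like this in user-space C11. It is a sketch under assumptions: the `model_*` names are hypothetical, a relaxed atomic load stands in for the kernel's annotated read, and an `atomic_flag` spin lock stands in for the socket lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of a socket's local port; 0 means "not bound yet". */
struct model_inet {
	_Atomic unsigned short num;
	atomic_flag lock;           /* stand-in for the socket lock */
};

void model_init(struct model_inet *s)
{
	atomic_init(&s->num, 0);
	atomic_flag_clear(&s->lock);
}

/* Slow path: take the lock and look again before picking a port.
 * Returns 1 if this call bound the socket, 0 if someone else had. */
int model_autobind(struct model_inet *s, unsigned short candidate)
{
	int bound_here = 0;

	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;  /* spin until the lock is free */
	if (atomic_load_explicit(&s->num, memory_order_relaxed) == 0) {
		atomic_store_explicit(&s->num, candidate, memory_order_relaxed);
		bound_here = 1;
	}
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
	return bound_here;
}

/* Fast path: a marked lockless read; no lock when already bound. */
int model_send_prepare(struct model_inet *s, unsigned short candidate)
{
	if (atomic_load_explicit(&s->num, memory_order_relaxed) != 0)
		return 0;
	return model_autobind(s, candidate);
}
```

A stale zero on the fast path only costs a trip through the slow path, where the value is checked again, so the unlocked read is harmless in this model.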
      
      syzbot reported:
      BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port
      
      write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
       udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
       udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
       inet_autobind net/ipv4/af_inet.c:183 [inline]
       inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
       inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x9db4
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethtool: clear heap allocations for ethtool function · 80ec82e3
      Austin Kim authored
      Several ethtool functions allocate heap buffers that drivers may
      fill only partially. The unused portion of the buffer then keeps its
      previous contents, and the full buffer might be copied back to userspace.
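A user-space analogue of the problem and its remedy, as a sketch only: `calloc()` stands in for a zeroing kernel allocator, and `model_driver_fill()` is a hypothetical driver callback that writes just part of the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical driver callback: writes only the first two counters. */
void model_driver_fill(uint64_t *buf, size_t n)
{
	if (n > 0)
		buf[0] = 11;
	if (n > 1)
		buf[1] = 22;
}

/* Allocate a zeroed buffer before handing it to the driver, so slots
 * the driver skips read back as 0 instead of stale heap contents. */
uint64_t *model_get_stats(size_t n)
{
	uint64_t *buf = calloc(n, sizeof(*buf));

	if (!buf)
		return NULL;
	model_driver_fill(buf, n);
	return buf;
}
```

With an uncleared allocation the last slots would hold whatever the heap contained before, and copying the whole buffer out would expose it.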
      Signed-off-by: Austin Kim <austindh.kim@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: lantiq: disable interrupt before sheduling NAPI · f2386cf7
      Aleksander Jan Bajkowski authored
      This patch fixes TX hangs with threaded NAPI enabled. The scheduled
      NAPI seems to be executed in parallel with the interrupt on a second
      thread. Sometimes ltq_dma_disable_irq() is executed
      after xrx200_tx_housekeeping(). The symptom is that TX interrupts
      are disabled in the DMA controller. As a result, TX hangs after
      a few seconds of the iperf test. Scheduling NAPI after disabling
      interrupts fixes this issue.
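The ordering problem can be made deterministic in a tiny model. This is a sketch with assumptions stated up front: the worst-case interleaving (the poll finishing before the interrupt handler continues) is replayed sequentially, the poll is assumed to re-enable the interrupt when it is done, and all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical DMA channel state. */
struct model_chan {
	bool irq_enabled;
};

/* Poll: finish housekeeping, then re-enable the interrupt (assumption). */
void model_napi_poll(struct model_chan *ch)
{
	ch->irq_enabled = true;
}

/* Old order: schedule first. If the poll completes on another thread
 * right away, the late disable leaves the interrupt off for good. */
void model_irq_buggy(struct model_chan *ch)
{
	model_napi_poll(ch);      /* scheduled NAPI runs to completion */
	ch->irq_enabled = false;  /* disable arrives too late */
}

/* New order: disable first, then schedule. */
void model_irq_fixed(struct model_chan *ch)
{
	ch->irq_enabled = false;
	model_napi_poll(ch);
}
```

In the old order the channel ends with its interrupt disabled and nothing left to re-enable it, which matches the hang described above.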
      
      Tested on Lantiq xRX200 (BT Home Hub 5A).
      
      Fixes: 9423361d ("net: lantiq: Disable IRQs only if NAPI gets scheduled ")
      Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
      Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Jun, 2021 8 commits
    • net: ena: fix DMA mapping function issues in XDP · 504fd6a5
      Shay Agroskin authored
      This patch fixes several bugs found when (DMA/LLQ) mapping a packet for
      transmission. The mapping procedure makes the transmitted packet
      accessible by the device.
      When using LLQ, this requires copying the packet's header to push header
      (which would be passed to LLQ) and creating DMA mapping for the payload
      (if the packet doesn't fit the maximum push length).
      When not using LLQ, we map the whole packet with DMA.
      
      The following bugs are fixed in the code:
          1. Add support for non-LLQ machines:
             The ena_xdp_tx_map_frame() function assumed that LLQ is
             supported, and never mapped the whole packet using DMA. On some
             instances, which don't support LLQ, this causes loss of traffic.
      
          2. Wrong DMA buffer length passed to device:
             When using LLQ, the first 'tx_max_header_size' bytes of the
             packet would be copied to push header. The rest of the packet
             would be copied to a DMA'd buffer.
      
          3. Freeing the XDP buffer twice in case of a mapping error:
             In case a buffer DMA mapping fails, the function uses
             xdp_return_frame_rx_napi() to free the RX buffer and returns from
             the function with an error. XDP frames that fail to xmit get
             freed by the kernel and so there is no need for this call.
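The split between push header and DMA-mapped payload described at the top of this message can be sketched as plain length arithmetic. The names below are hypothetical, and the header size passed in is an arbitrary example value, not the device's real limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of mapping one frame for transmission. */
struct model_tx_map {
	size_t push_len;  /* bytes copied into the push header (LLQ only) */
	size_t dma_len;   /* bytes covered by the DMA mapping */
};

struct model_tx_map model_map_frame(size_t frame_len, bool llq,
				    size_t max_header)
{
	/* Without LLQ the whole frame is DMA-mapped. */
	struct model_tx_map m = { 0, frame_len };

	if (llq) {
		/* With LLQ the leading bytes go to the push header and
		 * only the remainder, if any, needs a DMA mapping. */
		m.push_len = frame_len < max_header ? frame_len : max_header;
		m.dma_len = frame_len - m.push_len;
	}
	return m;
}
```

For example, a 1500-byte frame with a 96-byte push limit yields a 96-byte push header and a 1404-byte mapping, while a 64-byte frame needs no mapping at all.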
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: re-enable TX flow control in ocelot_port_flush() · 1650bdb1
      Vladimir Oltean authored
      Because flow control is set up statically in ocelot_init_port(), and not
      in phylink_mac_link_up(), what happens is that after the blamed commit,
      the flow control remains disabled after the port flushing procedure.
      
      Fixes: eb4733d7 ("net: dsa: felix: implement port flushing on .phylink_mac_link_down")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rds: fix memory leak in rds_recvmsg · 49bfcbfd
      Pavel Skripkin authored
      Syzbot reported a memory leak in rds. The cause
      is a refcount that is never put on the error path.
      
      int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
      		int msg_flags)
      {
      ...
      
      	if (!rds_next_incoming(rs, &inc)) {
      		...
      	}
      
      After this "if", the refcount of inc has been incremented, and
      
      	if (rds_cmsg_recv(inc, msg, rs)) {
      		ret = -EFAULT;
      		goto out;
      	}
      ...
      out:
      	return ret;
      }
      
      if rds_cmsg_recv() fails, the refcount won't be
      decremented. It's easy to see from the ftrace log that
      rds_inc_addref() has no matching rds_inc_put() in
      rds_recvmsg() after rds_cmsg_recv():
      
       1)               |  rds_recvmsg() {
       1)   3.721 us    |    rds_inc_addref();
       1)   3.853 us    |    rds_message_inc_copy_to_user();
       1) + 10.395 us   |    rds_cmsg_recv();
       1) + 34.260 us   |  }
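The balanced form of that error path can be sketched in user space as follows. This is an illustration, not the patch itself: `struct model_inc` and the `model_*` functions are hypothetical, and the only point is that every exit after the addref passes through the put.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical refcounted message. */
struct model_inc {
	int refcount;
};

void model_inc_addref(struct model_inc *inc)
{
	inc->refcount++;
}

void model_inc_put(struct model_inc *inc)
{
	assert(inc->refcount > 0);
	inc->refcount--;
}

/* cmsg_fails simulates the control-message step returning an error. */
int model_recvmsg(struct model_inc *inc, int cmsg_fails)
{
	int ret = 0;

	model_inc_addref(inc);  /* reference held while the message is used */
	if (cmsg_fails) {
		ret = -EFAULT;
		goto out_put;   /* error exit still drops the reference */
	}
	/* ... deliver the payload ... */
out_put:
	model_inc_put(inc);
	return ret;
}
```

After a failing call the refcount is back at its starting value, so the addref seen in the trace now has its matching put.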
      
      Fixes: bdbe6fbc ("RDS: recv.c")
      Reported-and-tested-by: syzbot+5134cdf021c4ed5aaa5f@syzkaller.appspotmail.com
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'batadv-net-pullrequest-20210608' of git://git.open-mesh.org/linux-merge · df693f13
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      Here is a batman-adv bugfix:
      
       - Avoid WARN_ON timing related checks, by Sven Eckelmann
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vrf: fix maximum MTU · 9bb392f6
      Nicolas Dichtel authored
      My initial goal was to fix the default MTU, which is set to 65536, i.e. above
      the maximum defined in the driver: 65535 (ETH_MAX_MTU).
      
      In fact, it seems more consistent, wrt min_mtu, to set the max_mtu to
      IP6_MAX_MTU (65535 + sizeof(struct ipv6hdr)) and use it by default.
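The numbers involved work out as below. The sketch uses hypothetical `MODEL_*` constants and takes the fixed IPv6 header size of 40 bytes for `sizeof(struct ipv6hdr)`.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ETH_MAX_MTU   65535u
#define MODEL_IPV6_HDR_LEN  40u   /* fixed IPv6 header size */
#define MODEL_IP6_MAX_MTU   (65535u + MODEL_IPV6_HDR_LEN)

/* Returns 1 if the requested MTU fits under the device maximum. */
int model_mtu_ok(uint32_t mtu, uint32_t max_mtu)
{
	return mtu <= max_mtu;
}
```

With a maximum of 65535 the old default of 65536 is out of range, while a maximum of 65535 + 40 = 65575 accepts it.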
      
      Let's also, for consistency, set the mtu in vrf_setup(). This function
      calls ether_setup(), which sets the mtu to 1500. Thus, the whole mtu config
      is done in the same function.
      
      Before the patch:
      $ ip link add blue type vrf table 1234
      $ ip link list blue
      9: blue: <NOARP,MASTER> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/ether fa:f5:27:70:24:2a brd ff:ff:ff:ff:ff:ff
      $ ip link set dev blue mtu 65535
      $ ip link set dev blue mtu 65536
      Error: mtu greater than device maximum.
      
      Fixes: 5055376a ("net: vrf: Fix ping failed when vrf mtu is set to 0")
      CC: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: appletalk: fix the usage of preposition · d439aa33
      gushengxian authored
      The preposition "for" should be changed to preposition "of".
      Signed-off-by: gushengxian <gushengxian@yulong.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: Remove unneed BUG() function · 5ac6b198
      Zheng Yongjun authored
      When nla_parse_nested_deprecated() fails, there is no need to
      BUG() here; returning -EINVAL is sufficient.
      Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: fix memory leak in netlbl_cipsov4_add_std · d612c3f3
      Nanyong Sun authored
      Reported by syzkaller:
      BUG: memory leak
      unreferenced object 0xffff888105df7000 (size 64):
      comm "syz-executor842", pid 360, jiffies 4294824824 (age 22.546s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<00000000e67ed558>] kmalloc include/linux/slab.h:590 [inline]
      [<00000000e67ed558>] kzalloc include/linux/slab.h:720 [inline]
      [<00000000e67ed558>] netlbl_cipsov4_add_std net/netlabel/netlabel_cipso_v4.c:145 [inline]
      [<00000000e67ed558>] netlbl_cipsov4_add+0x390/0x2340 net/netlabel/netlabel_cipso_v4.c:416
      [<0000000006040154>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 net/netlink/genetlink.c:739
      [<00000000204d7a1c>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
      [<00000000204d7a1c>] genl_rcv_msg+0x2bf/0x4f0 net/netlink/genetlink.c:800
      [<00000000c0d6a995>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
      [<00000000d78b9d2c>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
      [<000000009733081b>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
      [<000000009733081b>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
      [<00000000d5fd43b8>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929
      [<000000000a2d1e40>] sock_sendmsg_nosec net/socket.c:654 [inline]
      [<000000000a2d1e40>] sock_sendmsg+0x139/0x170 net/socket.c:674
      [<00000000321d1969>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
      [<00000000964e16bc>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
      [<000000001615e288>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433
      [<000000004ee8b6a5>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47
      [<00000000171c7cee>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The memory that doi_def->map.std points to is allocated in
      netlbl_cipsov4_add_std(), but nothing ever frees it. It should be
      freed in cipso_v4_doi_free(), which frees the cipso DOI resource.
      
      Fixes: 96cb8e33 ("[NetLabel]: CIPSOv4 and Unlabeled packet integration")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 07 Jun, 2021 7 commits
  4. 04 Jun, 2021 19 commits
    • cxgb4: avoid link re-train during TC-MQPRIO configuration · 3822d067
      Rahul Lakkireddy authored
      When configuring TC-MQPRIO offload, only turn off netdev carrier and
      don't bring physical link down in hardware. Otherwise, when the
      physical link is brought up again after configuration, it gets
      re-trained and stalls ongoing traffic.
      
      Also, when firmware is no longer accessible or has crashed, avoid sending
      FLOWC and waiting for a reply that will never come.
      
      Fix the following hung_task_timeout_secs trace seen in these cases.
      
      INFO: task tc:20807 blocked for more than 122 seconds.
            Tainted: G S                5.13.0-rc3+ #122
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:tc   state:D stack:14768 pid:20807 ppid: 19366 flags:0x00000000
      Call Trace:
       __schedule+0x27b/0x6a0
       schedule+0x37/0xa0
       schedule_preempt_disabled+0x5/0x10
       __mutex_lock.isra.14+0x2a0/0x4a0
       ? netlink_lookup+0x120/0x1a0
       ? rtnl_fill_ifinfo+0x10f0/0x10f0
       __netlink_dump_start+0x70/0x250
       rtnetlink_rcv_msg+0x28b/0x380
       ? rtnl_fill_ifinfo+0x10f0/0x10f0
       ? rtnl_calcit.isra.42+0x120/0x120
       netlink_rcv_skb+0x4b/0xf0
       netlink_unicast+0x1a0/0x280
       netlink_sendmsg+0x216/0x440
       sock_sendmsg+0x56/0x60
       __sys_sendto+0xe9/0x150
       ? handle_mm_fault+0x6d/0x1b0
       ? do_user_addr_fault+0x1c5/0x620
       __x64_sys_sendto+0x1f/0x30
       do_syscall_64+0x3c/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f7f73218321
      RSP: 002b:00007ffd19626208 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 000055b7c0a8b240 RCX: 00007f7f73218321
      RDX: 0000000000000028 RSI: 00007ffd19626210 RDI: 0000000000000003
      RBP: 000055b7c08680ff R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000055b7c085f5f6
      R13: 000055b7c085f60a R14: 00007ffd19636470 R15: 00007ffd196262a0
      
      Fixes: b1396c2b ("cxgb4: parse and configure TC-MQPRIO offload")
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_htb: fix refcount leak in htb_parent_to_leaf_offload · 944d671d
      Yunjian Wang authored
      The commit ae81feb7 ("sch_htb: fix null pointer dereference
      on a null new_q") fixes a NULL pointer dereference bug, but it
      is not correct.
      
      htb_graft_helper properly handles the case when new_q
      is NULL, and skipping this call, as the previous patch does,
      creates an inconsistency: dev_queue->qdisc will still
      point to the old qdisc, but cl->parent->leaf.q will point to
      the new one (which will be noop_qdisc, because new_q was NULL).
      The code is based on an assumption that these two pointers are
      the same, so it can lead to refcount leaks.
      
      The correct fix is to add a NULL pointer check to protect
      qdisc_refcount_inc inside htb_parent_to_leaf_offload.
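The shape of that check, reduced to a user-space sketch: only the reference count is guarded, and the rest of the operation runs whether or not a new qdisc was supplied. `struct model_qdisc` and the `model_*` functions are hypothetical and do not reproduce the scheduler's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted queueing discipline. */
struct model_qdisc {
	int refcnt;
};

void model_refcount_inc(struct model_qdisc *q)
{
	q->refcnt++;
}

/* Returns the qdisc that ends up installed: new_q when present,
 * otherwise the shared no-op fallback, which takes no extra ref. */
struct model_qdisc *model_parent_to_leaf(struct model_qdisc *new_q,
					 struct model_qdisc *noop)
{
	if (new_q)                        /* guard only the increment */
		model_refcount_inc(new_q);
	/* the grafting step would run here in both cases */
	return new_q ? new_q : noop;
}
```

Keeping the graft unconditional is what keeps the two pointers mentioned above in step; the NULL check exists only so the increment never touches a NULL pointer.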
      
      Fixes: ae81feb7 ("sch_htb: fix null pointer dereference on a null new_q")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Suggested-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 26821ecd
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2021-06-04
      
      This series contains updates to virtchnl header file and ice driver.
      
      Brett fixes VF being unable to request a different number of queues than
      allocated and adds clearing of VF_MBX_ATQLEN register for VF reset.
      
      Haiyue handles error of rebuilding VF VSI during reset.
      
      Paul fixes reporting of autoneg to use the PHY capabilities.
      
      Dave allows LLDP packets without priority of TC_PRIO_CONTROL to be
      transmitted.
      
      Geert Uytterhoeven adds explicit padding to virtchnl_proto_hdrs
      structure in the virtchnl header file.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'wireguard-fixes' · 6fd815bb
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 5.13-rc5
      
      Here are bug fixes to WireGuard for 5.13-rc5:
      
      1-2,6) These are small, trivial tweaks to our test harness.
      
      3) Linus thinks -O3 is still dangerous to enable. The code gen wasn't so
         much different with -O2 either.
      
      4) We were accidentally calling synchronize_rcu instead of
         synchronize_net while holding the rtnl_lock, resulting in some rather
         large stalls that hit production machines.
      
      5) Peer allocation was wasting literally hundreds of megabytes on real
         world deployments, due to oddly sized large objects not fitting
         nicely into a kmalloc slab.
      
      7-9) We move from an insanely expensive O(n) algorithm to a fast O(1)
           algorithm, and clean up a massive memory leak in the process, in
           which allowed ips churn would leave dangling nodes hanging around
           without cleanup until the interface was removed. The O(1) algorithm
           eliminates packet stalls and high latency issues, in addition to
           bringing operations that took as much as 10 minutes down to less
           than a second.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: allowedips: free empty intermediate nodes when removing single node · bf7b042d
      Jason A. Donenfeld authored
      When removing single nodes, it's possible that that node's parent is an
      empty intermediate node, in which case, it too should be removed.
      Otherwise the trie fills up and never is fully emptied, leading to
      gradual memory leaks over time for tries that are modified often. There
      was originally code to do this, but was removed during refactoring in
      2016 and never reworked. Now that we have proper parent pointers from
      the previous commits, we can implement this properly.
      
      In order to reduce branching and expensive comparisons, we want to keep
      the double pointer for parent assignment (which lets us easily chain up
      to the root), but we still need to actually get the parent's base
      address. So encode the bit number into the last two bits of the pointer,
      and pack and unpack it as needed. This is a little bit clumsy but is the
      fastest and least memory-wasteful of the compromises. Note that we align
      the root struct here to a minimum of 4, because it's embedded into a
      larger struct, and we're relying on having the bottom two bits for our
      flag, which would only be 16-bit aligned on m68k.
      
      The existing macro-based helpers were a bit unwieldy for adding the bit
      packing to, so this commit replaces them with safer and clearer ordinary
      functions.
      
      We add a test to the randomized/fuzzer part of the selftests, to free
      the randomized tries by-peer, refuzz it, and repeat, until it's supposed
      to be empty, and then see if that actually resulted in the whole
      thing being emptied. That combined with kmemcheck should hopefully make
      sure this commit is doing what it should. Along the way this resulted in
      various other cleanups of the tests and fixes for recent graphviz.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: allowedips: allocate nodes in kmem_cache · dc680de2
      Jason A. Donenfeld authored
      The previous commit moved from O(n) to O(1) for removal, but in the
      process introduced an additional pointer member to a struct that
      increased the size from 60 to 68 bytes, putting nodes in the 128-byte
      slab. With deployed systems having as many as 2 million nodes, this
      represents a significant doubling in memory usage (128 MiB -> 256 MiB).
      Fix this by using our own kmem_cache, that's sized exactly right. This
      also makes wireguard's memory usage more transparent in tools like
      slabtop and /proc/slabinfo.
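The doubling follows from rounding each allocation up to a power of two. The sketch below models only that rounding, as the message describes it; real allocators may offer other size classes, the `model_*` names are hypothetical, and "2 million" is taken as 2^21 nodes so the totals come out as exact MiB values.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (n >= 1). */
size_t model_round_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Total bytes used by `nodes` objects of `per_node` bytes each. */
size_t model_total_bytes(size_t nodes, size_t per_node)
{
	return nodes * per_node;
}
```

A 60-byte node rounds to 64 bytes and a 68-byte node to 128, so 2^21 nodes go from 128 MiB to 256 MiB; a cache sized at exactly 68 bytes would need 136 MiB for the same nodes.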
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: allowedips: remove nodes in O(1) · f634f418
      Jason A. Donenfeld authored
      Previously, deleting peers would require traversing the entire trie in
      order to rebalance nodes and safely free them. This meant that removing
      1000 peers from a trie with a half million nodes would take an extremely
      long time, during which we're holding the rtnl lock. Large-scale users
      were reporting 200ms latencies added to the networking stack as a whole
      every time their userspace software would queue up significant removals.
      That's a serious situation.
      
      This commit fixes that by maintaining a double pointer to the parent's
      bit pointer for each node, and then using the already existing node list
      belonging to each peer to go directly to the node, fix up its pointers,
      and free it with RCU. This means removal is O(1) instead of O(n), and we
      don't use gobs of stack.
      
      The removal algorithm has the same downside as the code that it fixes:
      it won't collapse needlessly long runs of fillers.  We can enhance that
      in the future if it ever becomes a problem. This commit documents that
      limitation with a TODO comment in code, a small but meaningful
      improvement over the prior situation.
      
      Currently the biggest flaw, which the next commit addresses, is that
      this increases the node size on 64-bit machines from 60 bytes to
      68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up
      using twice as much memory per node, because of power-of-two
      allocations, which is a big bummer. We'll need to figure something out
      there.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: allowedips: initialize list head in selftest · 46cfe8ee
      Jason A. Donenfeld authored
      The randomized trie tests weren't initializing the dummy peer list head,
      resulting in a NULL pointer dereference when used. Fix this by
      initializing it in the randomized trie test, just like we do for the
      static unit test.
      
      While we're at it, all of the other strings like this have the word
      "self-test", so add it to the missing place here.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: peer: allocate in kmem_cache · a4e9f8e3
      Jason A. Donenfeld authored
      With deployments having upwards of 600k peers now, this somewhat heavy
      structure could benefit from more fine-grained allocations.
      Specifically, instead of using a 2048-byte slab for a 1544-byte object,
      we can now use 1544-byte objects directly, thus saving almost 25%
      per-peer, or with 600k peers, that's a savings of 303 MiB. This also
      makes wireguard's memory usage more transparent in tools like slabtop
      and /proc/slabinfo.
      
      Fixes: 8b5553ac ("wireguard: queueing: get rid of per-peer ring buffers")
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: use synchronize_net rather than synchronize_rcu · 24b70eee
      Jason A. Donenfeld authored
      Many of the synchronization points are sometimes called under the rtnl
      lock, which means we should use synchronize_net rather than
      synchronize_rcu. Under the hood, this expands to using the expedited
      flavor of function in the event that rtnl is held, in order to not stall
      other concurrent changes.
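The selection described above amounts to a single branch. The sketch below only models the decision; the enum and function are hypothetical, and no actual grace period is waited for.

```c
#include <assert.h>
#include <stdbool.h>

/* Which flavor of grace-period wait would be chosen. */
enum model_grace {
	MODEL_GRACE_NORMAL,
	MODEL_GRACE_EXPEDITED
};

/* Pick the expedited flavor when the caller holds rtnl, so that other
 * configuration changes are not stalled behind a slow wait. */
enum model_grace model_synchronize_net(bool rtnl_held)
{
	return rtnl_held ? MODEL_GRACE_EXPEDITED : MODEL_GRACE_NORMAL;
}
```

A caller that always used the normal flavor would take the slow path even with rtnl held, which is the situation this commit removes.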
      
      This fixes some very, very long delays when removing multiple peers at
      once, which would cause some operations to take several minutes.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: do not use -O3 · cc5060ca
      Jason A. Donenfeld authored
      Apparently, various versions of gcc have O3-related miscompiles. Looking
      at the difference between -O2 and -O3 for gcc 11 doesn't indicate
      miscompiles, but the difference also doesn't seem so significant for
      performance that it's worth risking.
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/
      Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: selftests: make sure rp_filter is disabled on vethc · f8873d11
      Jason A. Donenfeld authored
      Some distros may enable strict rp_filter by default, which will prevent
      vethc from receiving the packets with an unrouteable reverse path address.
      Reported-by: Hangbin Liu <liuhangbin@gmail.com>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: selftests: remove old conntrack kconfig value · acf2492b
      Jason A. Donenfeld authored
      On recent kernels, this config symbol is no longer used.
      Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtchnl: Add missing padding to virtchnl_proto_hdrs · 519d8ab1
      Geert Uytterhoeven authored
      On m68k (Coldfire M547x):
      
            CC      drivers/net/ethernet/intel/i40e/i40e_main.o
          In file included from drivers/net/ethernet/intel/i40e/i40e_prototype.h:9,
      		     from drivers/net/ethernet/intel/i40e/i40e.h:41,
      		     from drivers/net/ethernet/intel/i40e/i40e_main.c:12:
          include/linux/avf/virtchnl.h:153:36: warning: division by zero [-Wdiv-by-zero]
            153 |  { virtchnl_static_assert_##X = (n)/((sizeof(struct X) == (n)) ? 1 : 0) }
      	  |                                    ^
          include/linux/avf/virtchnl.h:844:1: note: in expansion of macro ‘VIRTCHNL_CHECK_STRUCT_LEN’
            844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs);
      	  | ^~~~~~~~~~~~~~~~~~~~~~~~~
          include/linux/avf/virtchnl.h:844:33: error: enumerator value for ‘virtchnl_static_assert_virtchnl_proto_hdrs’ is not an integer constant
            844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs);
      	  |                                 ^~~~~~~~~~~~~~~~~~~
      
      On m68k, integers are aligned on addresses that are multiples of two,
      not four, bytes.  Hence the size of a structure containing integers may
      not be divisible by 4.
      
      Fix this by adding explicit padding.
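How explicit padding makes a struct's size independent of the ABI's integer alignment can be sketched with a size check shaped like the macro in the log above. The struct below is hypothetical and far smaller than the real one, and the macro and enumerator names are renamed to make clear they are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time size check: division by zero (a build error) unless
 * sizeof(struct X) equals n. */
#define MODEL_CHECK_STRUCT_LEN(n, X) \
	enum model_static_assert_enum_##X { \
		model_static_assert_##X = (n) / ((sizeof(struct X) == (n)) ? 1 : 0) \
	}

/* Hypothetical message layout. Without pad2, the size would be 12 where
 * 32-bit integers are 4-byte aligned but 10 where they are only 2-byte
 * aligned; with it, the size is 12 in both cases. */
struct model_hdrs {
	uint8_t  level;
	uint8_t  pad[3];  /* explicit padding before the 32-bit field */
	uint32_t count;
	uint16_t tail;
	uint16_t pad2;    /* explicit padding up to a multiple of 4 */
};

MODEL_CHECK_STRUCT_LEN(12, model_hdrs);
```

Spelling the padding out as real members means the compiler never has to add any, so the checked length holds on every target.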
      
      Fixes: 1f7ea1cd ("ice: Enable FDIR Configure for AVF")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarJesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      519d8ab1
    • ice: Allow all LLDP packets from PF to Tx · f9f83202
      Dave Ertman authored
      Currently in the ice driver, the check whether to
      allow an LLDP packet to egress the interface from the
      PF_VSI is based on the SKB's priority field.
      It checks whether the packet's priority is equal to
      TC_PRIO_CONTROL.  Injected LLDP packets do not always
      meet this condition.
      
      SCAPY defaults to a sk_buff->protocol value of ETH_P_ALL
      (0x0003) and does not set the priority field.  There will
      be other injection methods (even ones used by end users)
      that will not correctly configure the socket so that
      SKB fields are correctly populated.
      
      The ethernet header has to have the correct value for
      the protocol, though.
      
      Add a check to also allow packets whose ethhdr->h_proto
      matches ETH_P_LLDP (0x88CC).
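
      The combined check can be sketched in plain userspace C as follows.
      This is not the driver code: demo_allow_lldp_tx() and the DEMO_*
      constants are hypothetical names, the frame is assumed to be an
      untagged Ethernet frame, and the value 7 mirrors TC_PRIO_CONTROL
      from linux/pkt_sched.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TC_PRIO_CONTROL	7	/* mirrors TC_PRIO_CONTROL */
#define DEMO_ETH_P_LLDP		0x88CC	/* EtherType quoted in the commit message */
#define DEMO_ETH_HLEN		14	/* dst MAC (6) + src MAC (6) + EtherType (2) */

/*
 * Hypothetical helper: allow the frame if either the socket priority
 * marks it as control traffic, or the Ethernet header itself carries
 * the LLDP EtherType. VLAN-tagged frames are not handled in this sketch.
 */
bool demo_allow_lldp_tx(const uint8_t *frame, size_t len, uint32_t priority)
{
	uint16_t ethertype;

	if (priority == DEMO_TC_PRIO_CONTROL)
		return true;

	if (!frame || len < DEMO_ETH_HLEN)
		return false;

	/* The EtherType is stored in network byte order at offset 12. */
	ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	return ethertype == DEMO_ETH_P_LLDP;
}
```

      With this shape, a frame injected with priority 0 still passes as
      long as its header carries 0x88CC, which is the case the priority
      check alone misses.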
      
      Fixes: 0c3a6101 ("ice: Allow egress control packets from PF_VSI")
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      f9f83202
    • ice: report supported and advertised autoneg using PHY capabilities · 5cd349c3
      Paul Greenwalt authored
      Ethtool incorrectly reported supported and advertised auto-negotiation
      settings for a backplane PHY image which did not support auto-negotiation.
      This can occur when using media or PHY type for reporting ethtool
      supported and advertised auto-negotiation settings.
      
      Remove setting supported and advertised auto-negotiation settings based
      on PHY type in ice_phy_type_to_ethtool(), and MAC type in
      ice_get_link_ksettings().
      
      Ethtool supported and advertised auto-negotiation settings should be
      based on the PHY image using the AQ command get PHY capabilities with
      media. Add setting supported and advertised auto-negotiation settings
      based on get PHY capabilities with media in ice_get_link_ksettings().
      
      Fixes: 48cb27f2 ("ice: Implement handlers for ethtool PHY/link operations")
      Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      5cd349c3
    • ice: handle the VF VSI rebuild failure · c7ee6ce1
      Haiyue Wang authored
      VSI rebuild can fail during LAN queue configuration, in which case the
      VF's VSI will be NULL. When that happens, the VF reset should be
      stopped, with the VF entering the disabled state.
      
      Fixes: 12bb018c ("ice: Refactor VF reset")
      Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      c7ee6ce1
    • ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared · 8679f07a
      Brett Creeley authored
      Some AVF drivers expect the VF_MBX_ATQLEN register to be cleared for any
      type of VFR/VFLR. Fix this by clearing the VF_MBX_ATQLEN register at the
      same time as VF_MBX_ARQLEN.
      
      Fixes: 82ba0128 ("ice: clear VF ARQLEN register on reset")
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      8679f07a
    • ice: Fix allowing VF to request more/less queues via virtchnl · f0457690
      Brett Creeley authored
      Commit 12bb018c ("ice: Refactor VF reset") caused a regression
      that removes the ability for a VF to request a different number of
      queues via VIRTCHNL_OP_REQUEST_QUEUES. This prevents VF drivers from
      either increasing or decreasing the number of queue pairs they are
      allocated. Fix this by using the variable vf->num_req_qs when
      determining the vf->num_vf_qs during VF VSI creation.
      
      Fixes: 12bb018c ("ice: Refactor VF reset")
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      f0457690
  5. 03 Jun, 2021 2 commits
    • Merge tag 'for-net-2021-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · 579028de
      David S. Miller authored
      bluetooth pull request for net:
      
       - Fixes UAF and CVE-2021-3564
       - Fix VIRTIO_ID_BT to use an unassigned ID
       - Fix firmware loading on some Intel Controllers
      Signed-off-by: David S. Miller <davem@davemloft.net>
      579028de
    • virtio-net: fix for skb_over_panic inside big mode · 1a802423
      Xuan Zhuo authored
      In virtio-net's large packet mode, there is a hole in the space behind
      buf.
      
          hdr_padded_len - hdr_len
      
      We must take this into account when calculating tailroom.
      
      [   44.544385] skb_put.cold (net/core/skbuff.c:5254 (discriminator 1) net/core/skbuff.c:5252 (discriminator 1))
      [   44.544864] page_to_skb (drivers/net/virtio_net.c:485)
      [   44.545361] receive_buf (drivers/net/virtio_net.c:849 drivers/net/virtio_net.c:1131)
      [   44.545870] ? netif_receive_skb_list_internal (net/core/dev.c:5714)
      [   44.546628] ? dev_gro_receive (net/core/dev.c:6103)
      [   44.547135] ? napi_complete_done (./include/linux/list.h:35 net/core/dev.c:5867 net/core/dev.c:5862 net/core/dev.c:6565)
      [   44.547672] virtnet_poll (drivers/net/virtio_net.c:1427 drivers/net/virtio_net.c:1525)
      [   44.548251] __napi_poll (net/core/dev.c:6985)
      [   44.548744] net_rx_action (net/core/dev.c:7054 net/core/dev.c:7139)
      [   44.549264] __do_softirq (./arch/x86/include/asm/jump_label.h:19 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:560)
      [   44.549762] irq_exit_rcu (kernel/softirq.c:433 kernel/softirq.c:637 kernel/softirq.c:649)
      [   44.551384] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 13))
      [   44.551991] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)
      [   44.552654] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)
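
      The tailroom accounting described above can be modeled with a little
      arithmetic. The sketch below is a simplified picture, not the driver
      code: it assumes the buffer is laid out as header, hole, data, then
      tailroom, and demo_hole() and demo_tailroom() are hypothetical
      helpers.

```c
#include <stddef.h>

/*
 * Simplified model of one receive buffer of truesize bytes:
 *
 *   [ hdr (hdr_len) | hole | data (data_len) | tailroom ]
 *   |<------ hdr_padded_len ------>|
 *
 * Both helpers are hypothetical and exist only for this sketch.
 */

/* Size of the gap between the end of the header and the start of the data. */
size_t demo_hole(size_t hdr_len, size_t hdr_padded_len)
{
	return hdr_padded_len > hdr_len ? hdr_padded_len - hdr_len : 0;
}

/*
 * Space left behind the data. Subtracting the padded header length,
 * rather than the bare header length, is what accounts for the hole.
 * Returns 0 instead of wrapping if the buffer is already over-full.
 */
size_t demo_tailroom(size_t truesize, size_t hdr_padded_len, size_t data_len)
{
	size_t used = hdr_padded_len + data_len;

	return used > truesize ? 0 : truesize - used;
}
```

      In this model, computing the tailroom from hdr_len instead of
      hdr_padded_len overestimates the free space by exactly demo_hole()
      bytes, and a later skb_put() sized against that estimate can run
      past the end of the buffer.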
      
      Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reported-by: Corentin Noël <corentin.noel@collabora.com>
      Tested-by: Corentin Noël <corentin.noel@collabora.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a802423