1. 06 Apr, 2021 5 commits
    • bpf, sockmap: Fix incorrect fwd_alloc accounting · 144748eb
      John Fastabend authored
      Incorrect fwd_alloc accounting can result in a warning when the socket
      is torn down:
      
       [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
       [...]
       [18455.319543] Call Trace:
       [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
       [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
       [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
       [18455.319631]  ? sk_reset_timer+0x15/0x70
       [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
       [18455.319663]  ? lock_release+0xb2/0x3f0
       [18455.319676]  ? __release_sock+0x8a/0x1b0
       [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319704]  ? lock_release+0x3f0/0x3f0
       [18455.319717]  ? __tcp_close+0x2c6/0x790
       [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
       [18455.319750]  tcp_v4_do_rcv+0x168/0x370
       [18455.319767]  __release_sock+0xbc/0x1b0
       [18455.319785]  __tcp_close+0x2ee/0x790
       [18455.319805]  tcp_close+0x20/0x80
      
      This currently happens because in the redirect case we do skb_set_owner_r()
      with the original sock. This increments the fwd_alloc memory accounting
      on the original sock. Then on redirect we may push this skb into the queue
      of the psock we are redirecting to. When the skb is flushed from that
      queue we give the memory back to the original sock. The problem is that if
      the original sock is destroyed/closed with skbs still on another psock's
      queue, the original sock has no way to reclaim the memory before
      being destroyed. Then the above warning is thrown:
      
        sockA                          sockB
      
        sk_psock_strp_read()
         sk_psock_verdict_apply()
           -- SK_REDIRECT --
           sk_psock_skb_redirect()
                                      skb_queue_tail(psock_other->ingress_skb..)
      
        sk_close()
         sock_map_unref()
           sk_psock_put()
             sk_psock_drop()
               sk_psock_zap_ingress()
      
      At this point we have torn down our own psock, but have the outstanding
      skb in psock_other. Note that SK_PASS doesn't have this problem because
      the sk_psock_drop() logic releases the skb; it's still associated with
      our own psock.
      
      To resolve this, let's only account for skbs on the ingress queue that are
      still associated with the current socket. In the redirect case we will
      check memory limits per 6fa9201a, but will omit fwd_alloc accounting
      until the skb is actually enqueued. When the skb is sent via skb_send_sock_locked
      or received with sk_psock_skb_ingress, memory will be claimed on psock_other.
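The intended accounting discipline can be sketched with a small user-space model (plain C; toy_sock, toy_skb, and the charge/uncharge helpers are hypothetical stand-ins for fwd_alloc and skb_set_owner_r(), not kernel code): memory is only charged to the socket whose ingress queue actually holds the skb, so a redirected skb never leaves a stale charge on the source socket.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel's fwd_alloc accounting (hypothetical). */
struct toy_sock { int fwd_alloc; };
struct toy_skb  { int truesize; struct toy_sock *owner; };

/* Charge only when the skb is actually enqueued on sk's own queue. */
static void toy_charge(struct toy_sock *sk, struct toy_skb *skb)
{
	skb->owner = sk;
	sk->fwd_alloc += skb->truesize;
}

static void toy_uncharge(struct toy_skb *skb)
{
	if (skb->owner) {
		skb->owner->fwd_alloc -= skb->truesize;
		skb->owner = NULL;
	}
}

/* Redirect path after the fix: sockA is never charged; the charge is
 * taken on the destination socket at enqueue time.
 */
int sock_a_balance_after_redirect(void)
{
	struct toy_sock sock_a = { 0 }, sock_b = { 0 };
	struct toy_skb skb = { .truesize = 256, .owner = NULL };

	toy_charge(&sock_b, &skb);  /* enqueue on psock_other's queue */
	toy_uncharge(&skb);         /* flushed from sockB's queue */
	return sock_a.fwd_alloc;    /* sockA closes with nothing to reclaim */
}
```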
      
      Fixes: 6fa9201a ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370
    • bpf, sockmap: Fix sk->prot unhash op reset · 1c84b331
      John Fastabend authored
      In commit 4da6a196 we fixed a potential unhash loop caused when
      a TLS socket in a sockmap was removed from the sockmap. This
      happened because the unhash operation on the TLS ctx continued
      to point at the sockmap implementation of unhash even though the
      psock had already been removed. The sockmap unhash handler does
      the following when a psock is removed:
      
       void sock_map_unhash(struct sock *sk)
       {
      	void (*saved_unhash)(struct sock *sk);
      	struct sk_psock *psock;
      
      	rcu_read_lock();
      	psock = sk_psock(sk);
      	if (unlikely(!psock)) {
      		rcu_read_unlock();
      		if (sk->sk_prot->unhash)
      			sk->sk_prot->unhash(sk);
      		return;
      	}
              [...]
       }
      
      The unlikely() case is there to handle the case where the psock is
      detached but the proto ops have not been updated yet. But in the above
      case with TLS and a removed psock we never fixed sk_prot->unhash(), and
      unhash() points back to sock_map_unhash, resulting in a loop. To fix
      this we added this bit of code:
      
       static inline void sk_psock_restore_proto(struct sock *sk,
                                                struct sk_psock *psock)
       {
             sk->sk_prot->unhash = psock->saved_unhash;
      
      This will set the sk_prot->unhash back to its saved value. This is the
      correct callback for a TLS socket that has been removed from the sock_map.
      Unfortunately, this also overwrites the unhash pointer for all psocks,
      effectively breaking sockmap unhash handling for any future socks.
      Omitting the unhash operation will leave stale entries in the map if
      a socket transitions through unhash but does not do the close() op.
      
      To fix this, set unhash correctly before calling into tls_update. This
      way the TLS-enabled socket will point to the saved unhash() handler.
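The ordering matters: the saved callback must be put back before control is handed to the TLS layer. A toy function-pointer model (plain C, hypothetical names; not the kernel's struct proto) shows the shape of the fix:

```c
/* Toy model of the saved-callback restore (hypothetical names). */
struct toy_proto { void (*unhash)(struct toy_proto *p); };

static int tcp_unhash_calls;
static void toy_tcp_unhash(struct toy_proto *p)
{
	(void)p;
	tcp_unhash_calls++;         /* the original, correct handler */
}

static void toy_map_unhash(struct toy_proto *p)
{
	/* With a stale pointer this would call back into itself: the loop
	 * described above. The restore below prevents ever reaching here
	 * once the psock is gone.
	 */
	p->unhash(p);
}

/* The fix's shape: restore the saved handler *before* calling into the
 * TLS update path, so any later unhash() hits the original callback.
 */
int restore_demo(void)
{
	struct toy_proto p = { .unhash = toy_map_unhash };

	p.unhash = toy_tcp_unhash;  /* restore first */
	p.unhash(&p);               /* later caller: no loop, one call */
	return tcp_unhash_calls;
}
```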
      
      Fixes: 4da6a196 ("bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop")
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: Lorenz Bauer <lmb@cloudflare.com>
      Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/161731441904.68884.15593917809745631972.stgit@john-XPS-13-9370
    • tipc: increment the tmp aead refcnt before attaching it · 2a2403ca
      Xin Long authored
      Li Shuang found a NULL pointer dereference crash in her testing:
      
        [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
        [] RIP: 0010:tipc_crypto_rcv_complete+0xc8/0x7e0 [tipc]
        [] Call Trace:
        []  <IRQ>
        []  tipc_crypto_rcv+0x2d9/0x8f0 [tipc]
        []  tipc_rcv+0x2fc/0x1120 [tipc]
        []  tipc_udp_recv+0xc6/0x1e0 [tipc]
        []  udpv6_queue_rcv_one_skb+0x16a/0x460
        []  udp6_unicast_rcv_skb.isra.35+0x41/0xa0
        []  ip6_protocol_deliver_rcu+0x23b/0x4c0
        []  ip6_input+0x3d/0xb0
        []  ipv6_rcv+0x395/0x510
        []  __netif_receive_skb_core+0x5fc/0xc40
      
      This is caused by a NULL returned by tipc_aead_get(), which then crashes
      when dereferenced later in tipc_crypto_rcv_complete(). This might happen
      when tipc_crypto_rcv_complete() is called by two threads at the same time:
      the tmp aead attached by tipc_crypto_key_attach() in one thread may be
      released by the one attached in the other thread.
      
      This patch fixes it by incrementing the tmp's refcnt before attaching
      it, instead of calling tipc_aead_get() after attaching it.
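The ordering fix can be modeled with a plain counter in user space (hypothetical toy types; the kernel uses refcount_t and RCU here): take the extra reference before publishing tmp to the shared slot, so a concurrent release through the slot cannot drop the last reference.

```c
/* Toy refcount model (hypothetical; not TIPC's real structures). */
struct toy_aead { int refcnt; };

static void toy_get(struct toy_aead *a) { a->refcnt++; }
static int  toy_put(struct toy_aead *a) { return --a->refcnt == 0; }

/* Fixed ordering: our reference is taken *before* tmp becomes visible,
 * so another thread replacing the attached aead cannot free it under us.
 */
int attach_demo(void)
{
	struct toy_aead tmp = { .refcnt = 1 };  /* reference held by the slot */
	struct toy_aead *attached;

	toy_get(&tmp);       /* increment before attaching */
	attached = &tmp;     /* publish (tipc_crypto_key_attach) */
	(void)attached;

	/* Another thread's attach releases the slot's reference... */
	return toy_put(&tmp);  /* 0: tmp survives thanks to our early get */
}
```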
      
      Fixes: fc1b6d6d ("tipc: introduce TIPC encryption & authentication")
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: Do not pull payload in skb->head · 0f6925b3
      Eric Dumazet authored
      Xuan Zhuo reported that commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") brought a ~10% performance drop.
      
      The reason for the performance drop was that GRO was forced
      to chain sk_buffs (using skb_shinfo(skb)->frag_list), which
      uses more memory and also causes packet consumers to incur
      a lot of overhead handling all the tiny skbs.
      
      It turns out that virtio_net page_to_skb() has a wrong strategy:
      it allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
      copies 128 bytes from the page, before feeding the packet to the GRO stack.
      
      This was suboptimal before commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
      meaning we were not packing MSS with 100% efficiency.
      
      The fix is to pull only the Ethernet header in page_to_skb().
      
      Then, we change virtio_net_hdr_to_skb() to pull the missing
      headers, instead of assuming they were already pulled by callers.
      
      This fixes the performance regression, but could also allow virtio_net
      to accept packets with more than 128 bytes of headers.
      
      Many thanks to Xuan Zhuo for his report, and his tests/help.
      
      Fixes: 3226b158 ("net: avoid 32 x truesize under-estimation for tiny skbs")
      Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Link: https://www.spinics.net/lists/netdev/msg731397.html
      Co-developed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: broadcom: bcm4908enet: Fix a double free in bcm4908_enet_dma_alloc · b25b343d
      Lv Yunlong authored
      In bcm4908_enet_dma_alloc, if the callee bcm4908_dma_alloc_buf_descs()
      fails, it frees ring->cpu_addr via dma_free_coherent() and returns an
      error. Then bcm4908_enet_dma_free() is called and frees the same
      cpu_addr with dma_free_coherent() again.
      
      My patch sets ring->cpu_addr to NULL after it is freed in
      bcm4908_dma_alloc_buf_descs() to avoid the double free.
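The general pattern is worth spelling out: when a helper frees a buffer that the caller's error path will also free, clearing the pointer turns the second free into a no-op. A minimal user-space sketch (free() stands in for dma_free_coherent(); all names are hypothetical):

```c
#include <stdlib.h>

struct toy_ring { void *cpu_addr; };

/* Failing helper: free what was allocated and clear the pointer so the
 * caller's cleanup path cannot free it a second time (the fix).
 */
static int toy_alloc_buf_descs(struct toy_ring *ring)
{
	ring->cpu_addr = malloc(64);
	/* ... descriptor allocation fails ... */
	free(ring->cpu_addr);
	ring->cpu_addr = NULL;   /* no dangling pointer left behind */
	return -1;
}

static void toy_ring_free(struct toy_ring *ring)
{
	free(ring->cpu_addr);    /* free(NULL) is defined to do nothing */
	ring->cpu_addr = NULL;
}

int double_free_demo(void)
{
	struct toy_ring ring = { 0 };

	if (toy_alloc_buf_descs(&ring) < 0)
		toy_ring_free(&ring);   /* safe: sees NULL, not stale memory */
	return ring.cpu_addr == NULL;
}
```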
      
      Fixes: 4feffead ("net: broadcom: bcm4908enet: add BCM4908 controller driver")
      Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 Apr, 2021 6 commits
  3. 02 Apr, 2021 3 commits
  4. 01 Apr, 2021 17 commits
  5. 31 Mar, 2021 9 commits
    • xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory model · 622d1369
      Ong Boon Leong authored
      xdp_return_frame() may be called outside of NAPI context to return
      an xdpf back to the page_pool. xdp_return_frame() calls __xdp_return()
      with napi_direct = false. For the page_pool memory model, __xdp_return()
      calls xdp_return_frame_no_direct() unconditionally, and the spurious
      kernel BUG below was thrown under a preempt-rt build:
      
      [  430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884
      [  430.451678] caller is __xdp_return+0x1ff/0x2e0
      [  430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G     U      E     5.12.0-rc2+ #45
      
      Changes in v2:
       - This patch fixes the issue by making sure xdp_return_frame_no_direct()
         is only called if napi_direct = true, as recommended by
         Jesper Dangaard Brouer. Thanks!
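The shape of the fix is a simple gate (toy sketch in plain C with hypothetical helper names; the real code lives in __xdp_return()): the per-CPU bookkeeping helper is only touched when the caller really is in NAPI context.

```c
#include <stdbool.h>

/* Toy model of the gating fix (hypothetical names). */
static int no_direct_calls;
static void toy_return_frame_no_direct(void) { no_direct_calls++; }

static void toy_xdp_return(bool napi_direct)
{
	/* Fixed: only call the per-CPU helper in NAPI context, so
	 * smp_processor_id() is never used from preemptible code.
	 */
	if (napi_direct)
		toy_return_frame_no_direct();
	/* ... return the page to the page_pool either way ... */
}

int xdp_gate_demo(void)
{
	toy_xdp_return(false);  /* outside NAPI: helper untouched */
	toy_xdp_return(true);   /* inside NAPI: helper used once */
	return no_direct_calls;
}
```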
      
      Fixes: 2539650f ("xdp: Helpers for disabling napi_direct of xdp_return_frame")
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: correct sk_acceptq_is_full()" · c609e6aa
      Eric Dumazet authored
      This reverts commit f211ac15.
      
      We had similar attempt in the past, and we reverted it.
      
      History:
      
      64a14651 [NET]: Revert incorrect accept queue backlog changes.
      8488df89 [NET]: Fix bugs in "Whether sock accept queue is full" checking
      
      I am adding a fat comment so that future attempts will
      be much harder.
      
      Fixes: f211ac15 ("net: correct sk_acceptq_is_full()")
      Cc: iuyacan <yacanliu@163.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-fixes-2021-03-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9dc22c0d
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2021-03-31
      
      This series introduces some fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · c9170f13
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2021-03-31
      
      1) Fix ipv4 pmtu checks for xfrm and vti interfaces.
         From Eyal Birger.
      
      2) There are situations where the socket passed to
         xfrm_output_resume() is not the same as the one
         attached to the skb. Use the socket passed to
         xfrm_output_resume() to avoid lookup failures
         when xfrm is used with VRFs.
         From Evan Nimmo.
      
      3) Make the xfrm_state_hash_generation sequence counter per
         network namespace, because its write serialization
         lock is also per network namespace. Write protection
         is insufficient otherwise.
         From Ahmed S. Darwish.
      
      4) Fix up sctp feature flags when used with esp offload.
         From Xin Long.
      
      5) xfrm BEET mode doesn't support fragments for inner packets.
         This is a limitation of the protocol, so no fix possible.
         Warn at least to notify the user about that situation.
         From Xin Long.
      
      6) Fix NULL pointer dereference on policy lookup when
         namespaces are used in combination with esp offload.
      
      7) Fix incorrect transformation on esp offload when
         packets get segmented at layer 3.
      
      8) Fix some user-triggered usages of WARN_ONCE in
         the xfrm compat layer.
         From Dmitry Safonov.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/rds: Fix a use after free in rds_message_map_pages · bdc2ab5c
      Lv Yunlong authored
      In rds_message_map_pages, the rm is freed by rds_message_put(rm),
      but rm->data.op_sg is still used in the return value.
      
      My patch assigns ERR_CAST(rm->data.op_sg) to err before the rm is
      freed, to avoid the use-after-free.
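The fix follows the standard save-before-free ordering. A user-space sketch (free() stands in for rds_message_put(); the toy types and names are hypothetical):

```c
#include <stdlib.h>

struct toy_msg { void *op_sg; };

/* Error path after the fix: capture the value derived from rm *before*
 * dropping the last reference, then return only the saved copy.
 */
static void *toy_map_pages_err(struct toy_msg *rm)
{
	void *err = rm->op_sg;   /* stands in for ERR_CAST(rm->data.op_sg) */

	free(rm);                /* stands in for rds_message_put(rm) */
	return err;              /* rm is never dereferenced after free */
}

int uaf_demo(void)
{
	static int sentinel;
	struct toy_msg *rm = malloc(sizeof(*rm));

	rm->op_sg = &sentinel;   /* any recognizable value */
	return toy_map_pages_err(rm) == (void *)&sentinel;
}
```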
      
      Fixes: 7dba9203 ("net/rds: Use ERR_PTR for rds_message_alloc_sgs()")
      Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
      Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbour: Disregard DEAD dst in neigh_update · d47ec7a0
      Tong Zhu authored
      After a short network outage, the dst_entry is timed out and put
      in DST_OBSOLETE_DEAD. We reach this code because an arp reply comes
      from this neighbour after the network recovers. There is a potential
      race condition in which the dst_entry is still in DST_OBSOLETE_DEAD.
      In that case, another neighbour lookup causes more harm than good.
      
      In the best case, all packets in the arp_queue are lost. This is
      counterproductive to the original goal of finding a better path
      for those packets.
      
      I observed a worst case with 4.x kernel where a dst_entry in
      DST_OBSOLETE_DEAD state is associated with loopback net_device.
      It leads to an ethernet header with all zero addresses.
      A packet with all zero source MAC address is quite deadly with
      mac80211, ath9k and 802.11 block ack.  It fails
      ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx
      queue (ath_tx_complete_aggr). BAW (block ack window) is not
      updated. BAW logic is damaged and ath9k transmission is disabled.

      Signed-off-by: Tong Zhu <zhutong@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Guarantee room for XSK wakeup NOP on async ICOSQ · 3ff3874f
      Tariq Toukan authored
      XSK wakeup flow triggers an IRQ by posting a NOP WQE and hitting
      the doorbell on the async ICOSQ.
      It maintains its state so that it doesn't issue another NOP WQE
      if it has an outstanding one already.
      
      For this flow to work properly, the NOP post must not fail.
      Make sure to reserve room for the NOP WQE in all WQE posts to the
      async ICOSQ.
      
      Fixes: 8d94b590 ("net/mlx5e: Turn XSK ICOSQ into a general asynchronous one")
      Fixes: 1182f365 ("net/mlx5e: kTLS, Add kTLS RX HW offload support")
      Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Consider geneve_opts for encap contexts · 929a2fad
      Dima Chumak authored
      The current algorithm for encap keys is a legacy of the initial vxlan
      implementation and doesn't take all possible fields of a tunnel into
      account. For example, a Geneve tunnel may carry additional TLV
      options; these are ignored when comparing encap keys, and a rule can
      be attached to an incorrect encap entry.
      
      Fix that by introducing encap_info_equal() operation in
      struct mlx5e_tc_tunnel. Geneve tunnel type uses custom implementation,
      which extends generic algorithm and considers options if they are set.
      
      Fixes: 7f1a546e ("net/mlx5e: Consider tunnel type for encap contexts")
      Signed-off-by: Dima Chumak <dchumak@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Don't request more than supported EQs · a7b76002
      Daniel Jurgens authored
      Calculating the number of completion EQs based on the number of
      available IRQ vectors doesn't work now that all async EQs share one IRQ.
      Thus the max number of EQs can be exceeded on systems with more than
      approximately 256 CPUs. Take this into account when calculating the
      number of available completion EQs.
      
      Fixes: 81bfa206 ("net/mlx5: Use a single IRQ for all async EQs")
      Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>