1. 19 Sep, 2022 3 commits
  2. 16 Sep, 2022 9 commits
    • tcp: Use WARN_ON_ONCE() in tcp_read_skb() · 96628951
      Peilin Ye authored
      Prevent tcp_read_skb() from flooding the syslog.
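      As a rough userspace illustration of the warn-once idea (this is not
      the kernel's WARN_ON_ONCE() implementation; WARN_ONCE_SKETCH,
      warnings_emitted and simulate_reads are hypothetical names), a
      per-call-site flag keeps a repeatedly hit condition from producing
      more than one message:

      ```c
      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical counter so the effect can be observed from outside. */
      static int warnings_emitted;

      /* Sketch: report a condition only the first time it is hit at this call site. */
      #define WARN_ONCE_SKETCH(cond, msg)                            \
              do {                                                   \
                      static bool warned_;                           \
                      if ((cond) && !warned_) {                      \
                              warned_ = true;                        \
                              warnings_emitted++;                    \
                              fprintf(stderr, "WARNING: %s\n", msg); \
                      }                                              \
              } while (0)

      /* Hypothetical stand-in for a hot path that keeps hitting the same condition. */
      static void simulate_reads(int n)
      {
              for (int i = 0; i < n; i++)
                      WARN_ONCE_SKETCH(true, "unexpected state in read path");
      }
      ```

      Calling simulate_reads() any number of times prints a single warning,
      because the flag is static to the one call site inside the loop.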
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-unsync-addresses-from-ports' · 34d2d336
      David S. Miller authored
      From: Benjamin Poirier <bpoirier@nvidia.com>
      To: netdev@vger.kernel.org
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>,
      	Veaceslav Falico <vfalico@gmail.com>,
      	Andy Gospodarek <andy@greyhouse.net>,
      	"David S. Miller" <davem@davemloft.net>,
      	Eric Dumazet <edumazet@google.com>,
      	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
      	Jiri Pirko <jiri@resnulli.us>, Shuah Khan <shuah@kernel.org>,
      	Jonathan Toppins <jtoppins@redhat.com>,
      	linux-kselftest@vger.kernel.org
      Subject: [PATCH net v3 0/4] Unsync addresses from ports when stopping aggregated devices
      Date: Wed,  7 Sep 2022 16:56:38 +0900
      Message-ID: <20220907075642.475236-1-bpoirier@nvidia.com>
      
      This series fixes similar problems in the bonding and team drivers.
      
      Because of missing dev_{uc,mc}_unsync() calls, addresses added to
      underlying devices may be left over after the aggregated device is deleted.
      Add the missing calls and a few related tests.
      
      v2:
      * fix selftest installation, see patch 3
      
      v3:
      * Split lacpdu_multicast changes to their own patch, #1
      * In ndo_{add,del}_slave methods, only perform address list changes when
        the aggregated device is up (patches 2 & 3)
      * Add selftest function related to the above change (patch 4)
      ====================
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add tests for bonding and team address list management · bbb774d9
      Benjamin Poirier authored
      Test that the bonding and team drivers clean up an underlying device's
      address lists (dev->uc, dev->mc) when the aggregated device is deleted.
      
      Test addition and removal of the LACPDU multicast address on underlying
      devices by the bonding driver.
      
      v2:
      * add lag_lib.sh to TEST_FILES
      
      v3:
      * extend bond_listen_lacpdu_multicast test to init_state up and down cases
      * remove some superfluous shell syntax and 'set dev ... up' commands
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: team: Unsync device addresses on ndo_stop · bd602342
      Benjamin Poirier authored
      Netdev drivers are expected to call dev_{uc,mc}_sync() in their
      ndo_set_rx_mode method and dev_{uc,mc}_unsync() in their ndo_stop method.
      This is mentioned in the kerneldoc for those dev_* functions.
      
      The team driver calls dev_{uc,mc}_unsync() during ndo_uninit instead of
      ndo_stop. This is ineffective because address lists (dev->{uc,mc}) have
      already been emptied in unregister_netdevice_many() before ndo_uninit is
      called. This mistake can result in addresses being left over on former team
      ports after a team device has been deleted; see test_LAG_cleanup() in the
      last patch in this series.
      
      Add unsync calls at their expected location, team_close().
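      The ordering problem can be sketched with a toy userspace model. This
      is not kernel code: struct sketch_dev and the sketch_* helpers are
      hypothetical, and the model assumes only what is described above,
      namely that unsync walks the upper device's address list, which has
      already been emptied by the time ndo_uninit runs:

      ```c
      #include <stddef.h>

      #define SKETCH_MAX_ADDRS 8

      /* Hypothetical, heavily simplified stand-in for a device address list. */
      struct sketch_dev {
              int addrs[SKETCH_MAX_ADDRS];
              size_t count;
      };

      static void sketch_add(struct sketch_dev *dev, int addr)
      {
              if (dev->count < SKETCH_MAX_ADDRS)
                      dev->addrs[dev->count++] = addr;
      }

      static void sketch_del(struct sketch_dev *dev, int addr)
      {
              for (size_t i = 0; i < dev->count; i++) {
                      if (dev->addrs[i] == addr) {
                              /* Swap-remove: move the last entry into the hole. */
                              dev->addrs[i] = dev->addrs[--dev->count];
                              return;
                      }
              }
      }

      /* Copy every address of the upper device to the port. */
      static void sketch_sync(struct sketch_dev *port, const struct sketch_dev *upper)
      {
              for (size_t i = 0; i < upper->count; i++)
                      sketch_add(port, upper->addrs[i]);
      }

      /* Remove from the port whatever the upper device still lists. */
      static void sketch_unsync(struct sketch_dev *port, const struct sketch_dev *upper)
      {
              for (size_t i = 0; i < upper->count; i++)
                      sketch_del(port, upper->addrs[i]);
      }

      /* Returns how many addresses are left on the port after teardown. */
      static size_t sketch_teardown(int unsync_on_stop)
      {
              struct sketch_dev upper = { .count = 0 }, port = { .count = 0 };

              sketch_add(&upper, 0x01);
              sketch_add(&upper, 0x02);
              sketch_sync(&port, &upper);             /* rx_mode: push addresses down */

              if (unsync_on_stop)
                      sketch_unsync(&port, &upper);   /* stop: lists still populated */

              upper.count = 0;                        /* unregister: lists are flushed */

              if (!unsync_on_stop)
                      sketch_unsync(&port, &upper);   /* uninit: nothing left to walk */

              return port.count;
      }
      ```

      In this model sketch_teardown(1) leaves the port clean, while
      sketch_teardown(0) leaves both addresses behind on the port.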
      
      v3:
      * When adding or deleting a port, only sync/unsync addresses if the team
        device is up. In other cases, it is taken care of at the right time by
        ndo_open/ndo_set_rx_mode/ndo_stop.
      
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bonding: Unsync device addresses on ndo_stop · 86247aba
      Benjamin Poirier authored
      Netdev drivers are expected to call dev_{uc,mc}_sync() in their
      ndo_set_rx_mode method and dev_{uc,mc}_unsync() in their ndo_stop method.
      This is mentioned in the kerneldoc for those dev_* functions.
      
      The bonding driver calls dev_{uc,mc}_unsync() during ndo_uninit instead of
      ndo_stop. This is ineffective because address lists (dev->{uc,mc}) have
      already been emptied in unregister_netdevice_many() before ndo_uninit is
      called. This mistake can result in addresses being left over on former bond
      slaves after a bond has been deleted; see test_LAG_cleanup() in the last
      patch in this series.
      
      Add unsync calls, via bond_hw_addr_flush(), at their expected location,
      bond_close().
      Add dev_mc_add() call to bond_open() to match the above change.
      
      v3:
      * When adding or deleting a slave, only sync/unsync, add/del addresses if
        the bond is up. In other cases, it is taken care of at the right time by
        ndo_open/ndo_set_rx_mode/ndo_stop.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bonding: Share lacpdu_mcast_addr definition · 1d9a143e
      Benjamin Poirier authored
      There are already a few definitions of arrays containing
      MULTICAST_LACPDU_ADDR and the next patch will add one more use. These all
      contain the same constant data so define one common instance for all
      bonding code.
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 21be1ad6
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-09-08 (ice, iavf)
      
      This series contains updates to ice and iavf drivers.
      
      Dave removes extra unplug of auxiliary bus on reset which caused a
      scheduling while atomic to be reported for ice.
      
      Ding Hui defers setting of queues for TCs to ensure valid configuration
      and restores old config if invalid for ice.
      
      Sylwester fixes a check of setting MAC address to occur after result is
      received from PF for iavf driver.
      
      Brett changes check of ring tail to use software cached value as not all
      devices have access to register tail for iavf driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: marvell: prestera: add support for Aldrin2 · 9124dbcc
      Oleksandr Mazur authored
      Aldrin2 (98DX8525) is a Marvell Prestera PP, with 100G support.
      Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>

      V2:
        - retarget to net tree instead of net-next;
        - fix missed colon in patch subject ('net marvell' vs 'net: marvell');
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ieee802154: fix uninit value bug in dgram_sendmsg · 94160108
      Haimin Zhang authored
      There is an uninit-value bug in the dgram_sendmsg() function in
      net/ieee802154/socket.c: the length of the valid data pointed to by
      msg->msg_name isn't verified.

      Introduce a helper function, ieee802154_sockaddr_check_size(), to
      check namelen. First, check that addr_type is present in
      ieee802154_addr_sa. Then, check namelen according to addr_type.
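      The two-step length check can be sketched as follows. The layout and
      all sketch_* names are hypothetical simplifications for illustration,
      not the kernel's struct ieee802154_addr_sa or the actual helper:

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical, simplified address layout; not the kernel's definition. */
      enum sketch_addr_type { SKETCH_ADDR_NONE, SKETCH_ADDR_SHORT, SKETCH_ADDR_LONG };

      struct sketch_addr {
              int      addr_type;
              uint16_t pan_id;
              union {
                      uint8_t  hwaddr[8];
                      uint16_t short_addr;
              } u;
      };

      /* Return true only if namelen covers every field implied by addr_type. */
      static bool sketch_sockaddr_check_size(const struct sketch_addr *addr,
                                             size_t namelen)
      {
              /* Step 1: addr_type itself must lie inside the supplied buffer. */
              if (namelen < offsetof(struct sketch_addr, addr_type) +
                            sizeof(addr->addr_type))
                      return false;

              /* Step 2: the required length depends on the address type. */
              switch (addr->addr_type) {
              case SKETCH_ADDR_NONE:
                      return true;
              case SKETCH_ADDR_SHORT:
                      return namelen >= offsetof(struct sketch_addr, u) +
                                        sizeof(addr->u.short_addr);
              case SKETCH_ADDR_LONG:
                      return namelen >= offsetof(struct sketch_addr, u) +
                                        sizeof(addr->u.hwaddr);
              default:
                      return false;   /* unknown type: reject (a choice made for this sketch) */
              }
      }
      ```

      The point of step 1 is that addr_type must not be read at all unless
      the caller-supplied length proves it was actually provided.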
      
      Also fixed in raw_bind, dgram_bind, dgram_connect.
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 Sep, 2022 3 commits
    • Documentation: mptcp: fix pm_type formatting · 0727a9a5
      Matthieu Baerts authored
      When looking at the rendered HTML version, we can see 'pm_type' is not
      displayed with a bold font:
      
        https://docs.kernel.org/5.19/networking/mptcp-sysctl.html
      
      The empty line under 'pm_type' is then removed to have the same style as
      the others.
      
      Fixes: 6bb63ccc ("mptcp: Add a per-namespace sysctl to set the default path manager type")
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20220906180404.1255873-2-matthieu.baerts@tessares.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mptcp: fix fwd memory accounting on coalesce · 7288ff6e
      Paolo Abeni authored
      The intel bot reported a memory accounting related splat:
      
      [  240.473094] ------------[ cut here ]------------
      [  240.478507] page_counter underflow: -4294828518 nr_pages=4294967290
      [  240.485500] WARNING: CPU: 2 PID: 14986 at mm/page_counter.c:56 page_counter_cancel+0x96/0xc0
      [  240.570849] CPU: 2 PID: 14986 Comm: mptcp_connect Tainted: G S                5.19.0-rc4-00739-gd24141fe #1
      [  240.581637] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
      [  240.590600] RIP: 0010:page_counter_cancel+0x96/0xc0
      [  240.596179] Code: 00 00 00 45 31 c0 48 89 ef 5d 4c 89 c6 41 5c e9 40 fd ff ff 4c 89 e2 48 c7 c7 20 73 39 84 c6 05 d5 b1 52 04 01 e8 e7 95 f3 01 <0f> 0b eb a9 48 89 ef e8 1e 25 fc ff eb c3 66 66 2e 0f 1f 84 00 00
      [  240.615639] RSP: 0018:ffffc9000496f7c8 EFLAGS: 00010082
      [  240.621569] RAX: 0000000000000000 RBX: ffff88819c9c0120 RCX: 0000000000000000
      [  240.629404] RDX: 0000000000000027 RSI: 0000000000000004 RDI: fffff5200092deeb
      [  240.637239] RBP: ffff88819c9c0120 R08: 0000000000000001 R09: ffff888366527a2b
      [  240.645069] R10: ffffed106cca4f45 R11: 0000000000000001 R12: 00000000fffffffa
      [  240.652903] R13: ffff888366536118 R14: 00000000fffffffa R15: ffff88819c9c0000
      [  240.660738] FS:  00007f3786e72540(0000) GS:ffff888366500000(0000) knlGS:0000000000000000
      [  240.669529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  240.675974] CR2: 00007f966b346000 CR3: 0000000168cea002 CR4: 00000000003706e0
      [  240.683807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  240.691641] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  240.699468] Call Trace:
      [  240.702613]  <TASK>
      [  240.705413]  page_counter_uncharge+0x29/0x80
      [  240.710389]  drain_stock+0xd0/0x180
      [  240.714585]  refill_stock+0x278/0x580
      [  240.718951]  __sk_mem_reduce_allocated+0x222/0x5c0
      [  240.729248]  __mptcp_update_rmem+0x235/0x2c0
      [  240.734228]  __mptcp_move_skbs+0x194/0x6c0
      [  240.749764]  mptcp_recvmsg+0xdfa/0x1340
      [  240.763153]  inet_recvmsg+0x37f/0x500
      [  240.782109]  sock_read_iter+0x24a/0x380
      [  240.805353]  new_sync_read+0x420/0x540
      [  240.838552]  vfs_read+0x37f/0x4c0
      [  240.842582]  ksys_read+0x170/0x200
      [  240.864039]  do_syscall_64+0x5c/0x80
      [  240.872770]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  240.878526] RIP: 0033:0x7f3786d9ae8e
      [  240.882805] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 18 0a 00 e8 89 e8 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
      [  240.902259] RSP: 002b:00007fff7be81e08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [  240.910533] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f3786d9ae8e
      [  240.918368] RDX: 0000000000002000 RSI: 00007fff7be87ec0 RDI: 0000000000000005
      [  240.926206] RBP: 0000000000000005 R08: 00007f3786e6a230 R09: 00007f3786e6a240
      [  240.934046] R10: fffffffffffff288 R11: 0000000000000246 R12: 0000000000002000
      [  240.941884] R13: 00007fff7be87ec0 R14: 00007fff7be87ec0 R15: 0000000000002000
      [  240.949741]  </TASK>
      [  240.952632] irq event stamp: 27367
      [  240.956735] hardirqs last  enabled at (27366): [<ffffffff81ba50ea>] mem_cgroup_uncharge_skmem+0x6a/0x80
      [  240.966848] hardirqs last disabled at (27367): [<ffffffff81b8fd42>] refill_stock+0x282/0x580
      [  240.976017] softirqs last  enabled at (27360): [<ffffffff83a4d8ef>] mptcp_recvmsg+0xaf/0x1340
      [  240.985273] softirqs last disabled at (27364): [<ffffffff83a4d30c>] __mptcp_move_skbs+0x18c/0x6c0
      [  240.994872] ---[ end trace 0000000000000000 ]---
      
      After commit d24141fe ("mptcp: drop SK_RECLAIM_* macros"),
      if rmem_fwd_alloc becomes negative, mptcp_rmem_uncharge() can
      try to reclaim a negative amount of pages, since the expression:
      
      	reclaimable >= PAGE_SIZE
      
      will evaluate to true for any negative value of the int
      'reclaimable': 'PAGE_SIZE' is an unsigned long and
      the negative integer will be promoted to a (very large)
      unsigned long value.
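      The conversion can be reproduced in a few lines of standalone C.
      SKETCH_PAGE_SIZE and both function names are assumptions made for
      this illustration, and the guarded variant is only one possible way
      to avoid the trap, not necessarily what the patch does:

      ```c
      #include <stdbool.h>

      /* Assumed page size for illustration; an unsigned long, as described above. */
      #define SKETCH_PAGE_SIZE 4096UL

      /* Mixed comparison: the int is converted to unsigned long before comparing. */
      static bool reclaim_check_mixed(int reclaimable)
      {
              return reclaimable >= SKETCH_PAGE_SIZE;
      }

      /* Possible guard: reject non-positive values before the unsigned comparison. */
      static bool reclaim_check_guarded(int reclaimable)
      {
              return reclaimable > 0 &&
                     (unsigned long)reclaimable >= SKETCH_PAGE_SIZE;
      }
      ```

      With these definitions reclaim_check_mixed(-1) is true, because -1
      becomes the largest unsigned long value, while
      reclaim_check_guarded(-1) is false.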
      
      Also since the mentioned commit, kfree_skb_partial()
      in mptcp_try_coalesce() reclaims most of the just-released fwd
      memory, so that the subsequent charging of the skb delta size
      leads to negative fwd memory values.

      At that point a racing recvmsg() can trigger the splat.

      Address the issue by switching the order of the memory accounting
      operations. The fwd memory can still transiently reach negative
      values, but that happens in an atomic scope where no other code
      path can touch or use such a value.
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: d24141fe ("mptcp: drop SK_RECLAIM_* macros")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20220906180404.1255873-1-matthieu.baerts@tessares.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: phy: aquantia: wait for the suspend/resume operations to finish · ca2dccde
      Ioana Ciornei authored
      The Aquantia datasheet notes that after issuing a Processor-Intensive
      MDIO operation, like changing the low-power state of the device, the
      driver should wait for the operation to finish before issuing a new MDIO
      command.
      
      Add a new aqr107_wait_processor_intensive_op() function, which can
      be used after this kind of MDIO operation. At the moment, it is only
      called at the end of the suspend/resume calls.
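      The general shape of such a wait is a bounded poll on a status bit.
      The sketch below is a userspace model only: the bit value, the
      callback type and every sketch_* name are hypothetical, and a real
      driver would sleep between reads and use the PHY library's accessors:

      ```c
      #include <stdint.h>

      /* Hypothetical status bit; the real register layout is not given here. */
      #define SKETCH_OP_IN_PROGRESS 0x8000u

      /* Hypothetical register reader supplied by the caller. */
      typedef uint16_t (*sketch_read_fn)(void *ctx);

      /*
       * Poll until the in-progress bit clears or max_polls reads have been made.
       * Returns 0 on completion and -1 on timeout.
       */
      static int sketch_wait_op_done(sketch_read_fn read_status, void *ctx,
                                     int max_polls)
      {
              for (int i = 0; i < max_polls; i++) {
                      if (!(read_status(ctx) & SKETCH_OP_IN_PROGRESS))
                              return 0;
              }
              return -1;
      }

      /* Fake device for demonstration: busy for *ctx more reads, then idle. */
      static uint16_t sketch_fake_status(void *ctx)
      {
              int *busy_reads = ctx;

              if (*busy_reads > 0) {
                      (*busy_reads)--;
                      return SKETCH_OP_IN_PROGRESS;
              }
              return 0;
      }
      ```

      Bounding the loop matters: if the device never clears the bit, the
      caller gets an error back instead of hanging.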
      
      The issue was identified on a board featuring the AQR113C PHY, on
      which commands like 'ip link (..) up / down' issued without any delay
      between them would leave the link on the PHY down.
      The issue was easy to reproduce with a one-liner:
       $ ip link set dev ethX down; ip link set dev ethX up; \
       ip link set dev ethX down; ip link set dev ethX up;
      
      Fixes: ac9e81c2 ("net: phy: aquantia: add suspend / resume callbacks for AQR107 family")
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220906130451.1483448-1-ioana.ciornei@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. 09 Sep, 2022 3 commits
    • net: core: fix flow symmetric hash · 64ae13ed
      Ludovic Cintrat authored
      __flow_hash_consistentify() wrongly swaps IPv4 addresses in a few cases.
      This function is indirectly used by __skb_get_hash_symmetric(), which is
      used to fan out packets in AF_PACKET.
      Intrusion detection systems may be impacted by this issue.

      __flow_hash_consistentify() computes the difference between the addresses
      and swaps them if the difference is negative. In a few cases, src - dst
      and dst - src are both negative.
      
      The following snippet mimics __flow_hash_consistentify():
      
      ```
       #include <stdio.h>
       #include <stdint.h>
      
       int main(int argc, char** argv) {
      
           int diffs_d, diffd_s;
           uint32_t dst  = 0xb225a8c0; /* 178.37.168.192 --> 192.168.37.178 */
           uint32_t src  = 0x3225a8c0; /*  50.37.168.192 --> 192.168.37.50  */
           uint32_t dst2 = 0x3325a8c0; /*  51.37.168.192 --> 192.168.37.51  */
      
           diffs_d = src - dst;
           diffd_s = dst - src;
      
           printf("src:%08x dst:%08x, diff(s-d)=%d(0x%x) diff(d-s)=%d(0x%x)\n",
                   src, dst, diffs_d, diffs_d, diffd_s, diffd_s);
      
           diffs_d = src - dst2;
           diffd_s = dst2 - src;
      
           printf("src:%08x dst:%08x, diff(s-d)=%d(0x%x) diff(d-s)=%d(0x%x)\n",
                   src, dst2, diffs_d, diffs_d, diffd_s, diffd_s);
      
           return 0;
       }
      ```
      
      Results:
      
      src:3225a8c0 dst:b225a8c0, \
          diff(s-d)=-2147483648(0x80000000) \
          diff(d-s)=-2147483648(0x80000000)
      
      src:3225a8c0 dst:3325a8c0, \
          diff(s-d)=-16777216(0xff000000) \
          diff(d-s)=16777216(0x1000000)
      
      In the first case both address differences are < 0, therefore
      __flow_hash_consistentify() always swaps, and thus dst->src and src->dst
      packets have different hashes.
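      One way to get a direction-independent ordering is to compare the
      unsigned values directly instead of testing the sign of their
      difference. The sketch below contrasts the two; struct sketch_pair
      and both functions are hypothetical, and this is an illustration of
      the idea rather than a copy of the patch:

      ```c
      #include <stdint.h>

      /* Hypothetical pair of IPv4 addresses after canonical ordering. */
      struct sketch_pair {
              uint32_t src;
              uint32_t dst;
      };

      /* Order by the sign of the difference stored in an int. */
      static struct sketch_pair order_by_diff(uint32_t src, uint32_t dst)
      {
              struct sketch_pair p = { src, dst };
              /* Out-of-range conversion to int is implementation-defined;
               * common compilers wrap, which yields a negative value here. */
              int diff = (int)(dst - src);

              if (diff < 0) {
                      p.src = dst;
                      p.dst = src;
              }
              return p;
      }

      /* Order by comparing the unsigned values directly. */
      static struct sketch_pair order_by_compare(uint32_t src, uint32_t dst)
      {
              struct sketch_pair p = { src, dst };

              if (dst < src) {
                      p.src = dst;
                      p.dst = src;
              }
              return p;
      }
      ```

      For the pair 0x3225a8c0 / 0xb225a8c0 from the results above,
      order_by_diff() swaps in both directions and so yields two different
      orderings, whereas order_by_compare() yields the same ordering for
      both directions.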
      
      Fixes: c3f83241 ("net: Add full IPv6 addresses to flow_keys")
      Signed-off-by: Ludovic Cintrat <ludovic.cintrat@gatewatcher.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipvlan: Fix out-of-bound bugs caused by unset skb->mac_header · 81225b2e
      Lu Wei authored
      If an AF_PACKET socket is used to send packets through ipvlan and the
      default xmit function of the AF_PACKET socket is changed from
      dev_queue_xmit() to packet_direct_xmit() via setsockopt() with the
      PACKET_QDISC_BYPASS option, skb->mac_header may not be reset and
      may keep its initial value of 65535. This can trigger slab-out-of-bounds
      bugs such as the following:
      
      =================================================================
      BUG: KASAN: slab-out-of-bounds in ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
      CPU: 2 PID: 1768 Comm: raw_send Kdump: loaded Not tainted 6.0.0-rc4+ #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33
      Call Trace:
      print_address_description.constprop.0+0x1d/0x160
      print_report.cold+0x4f/0x112
      kasan_report+0xa3/0x130
      ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
      ipvlan_start_xmit+0x29/0xa0 [ipvlan]
      __dev_direct_xmit+0x2e2/0x380
      packet_direct_xmit+0x22/0x60
      packet_snd+0x7c9/0xc40
      sock_sendmsg+0x9a/0xa0
      __sys_sendto+0x18a/0x230
      __x64_sys_sendto+0x74/0x90
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The root cause is:
        1. packet_snd() only resets skb->mac_header when sock->type is SOCK_RAW
           and skb->protocol is not specified, as in packet_parse_headers()

        2. packet_direct_xmit() doesn't reset skb->mac_header the way
           dev_queue_xmit() does

      In this case, skb->mac_header is 65535 when ipvlan_xmit_mode_l2() is
      called. So when ipvlan_xmit_mode_l2() gets the MAC header with eth_hdr(),
      which uses "skb->head + skb->mac_header", an out-of-bounds access occurs.

      This patch replaces eth_hdr() with skb_eth_hdr() in ipvlan_xmit_mode_l2()
      and resets the MAC header in the multicast case to fix this
      out-of-bounds bug.
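      Why an unset offset is dangerous can be shown with a cut-down model
      of the buffer. struct sketch_skb and the sketch_* helpers are
      hypothetical; the model assumes that an eth_hdr()-style lookup adds
      mac_header to the buffer start, as quoted above, and that a
      skb_eth_hdr()-style lookup starts from the current data pointer
      instead:

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical, cut-down packet buffer; only the fields discussed above. */
      struct sketch_skb {
              uint8_t  head[64];      /* start of the allocated buffer */
              size_t   data_off;      /* offset of the current data pointer from head */
              uint16_t mac_header;    /* offset of the MAC header, or the unset sentinel */
      };

      #define SKETCH_MAC_UNSET ((uint16_t)~0U)        /* 65535, as in the description */

      static bool sketch_mac_header_was_set(const struct sketch_skb *skb)
      {
              return skb->mac_header != SKETCH_MAC_UNSET;
      }

      /* Offset used by a lookup based on head + mac_header. */
      static size_t sketch_hdr_off_from_mac(const struct sketch_skb *skb)
      {
              return skb->mac_header;
      }

      /* Offset used by a lookup based on the data pointer; ignores mac_header. */
      static size_t sketch_hdr_off_from_data(const struct sketch_skb *skb)
      {
              return skb->data_off;
      }

      static bool sketch_off_in_bounds(const struct sketch_skb *skb, size_t off)
      {
              return off < sizeof(skb->head);
      }
      ```

      With mac_header left at the sentinel, the first lookup lands far
      outside the 64-byte buffer, while the data-based lookup stays inside.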
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · df2a6017
      David S. Miller authored
      Florian Westphal says:
      
      ====================
      netfilter: bugfixes for net
      
      The following set contains four netfilter patches for your *net* tree.
      
      When there are multiple Contact headers in a SIP message it's possible
      the next headers won't be found because the SIP helper confuses relative
      and absolute offsets in the message.  From Igor Ryzhov.
      
      Make the nft_concat_range self-test support socat, this makes the
      selftest pass on my test VM, from myself.
      
      nf_conntrack_irc helper can be tricked into opening a local port forward
      that the client never requested by embedding a DCC message in a PING
      request sent to the client.  Fix from David Leadbeater.
      
      Both have been broken since the kernel 2.6.x days.
      
      The 'osf' match might indicate success while it could not find
      anything, broken since 5.2.  Fix from Pablo Neira.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 08 Sep, 2022 9 commits
    • iavf: Fix cached head and tail value for iavf_get_tx_pending · 809f23c0
      Brett Creeley authored
      The underlying hardware may or may not allow reading of the head or tail
      registers and it really makes no difference if we use the software
      cached values. So, always use the software cached values.
      
      Fixes: 9c6c1259 ("i40e: Detection and recovery of TX queue hung logic moved to service_task from tx_timeout")
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Co-developed-by: Norbert Zulinski <norbertx.zulinski@intel.com>
      Signed-off-by: Norbert Zulinski <norbertx.zulinski@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: Fix change VF's mac address · f66b98c8
      Sylwester Dziedziuch authored
      Previously, changing the MAC address gave a false negative:
      ip link set <interface> address <MAC> returned
      RTNLINK: Permission denied.
      iavf_set_mac() checked whether the PF had handled our MAC set request
      even before the filter was added to the list, so the check always
      returned true and never waited for the PF's response.

      Use iavf_is_mac_handled as the condition of
      wait_event_interruptible_timeout instead of false. It now waits for
      the PF's response and then checks whether the address was added or
      rejected.
      
      Fixes: 35a2443d ("iavf: Add waiting for response from PF in set mac")
      Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
      Co-developed-by: Norbert Zulinski <norbertx.zulinski@intel.com>
      Signed-off-by: Norbert Zulinski <norbertx.zulinski@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Fix crash by keep old cfg when update TCs more than queues · a509702c
      Ding Hui authored
      There are problems if fewer queues are allocated than Traffic Classes.

      Commit a632b2a4 ("ice: ethtool: Prohibit improper channel config
      for DCB") already disallows setting fewer queues than TCs.

      Another case is when we first set fewer queues and later update to a
      config with more TCs due to LLDP: ice_vsi_cfg_tc() fails but leaves dirty
      num_txq/rxq and tc_cfg in the vsi, which causes an invalid pointer access.
      
      [   95.968089] ice 0000:3b:00.1: More TCs defined than queues/rings allocated.
      [   95.968092] ice 0000:3b:00.1: Trying to use more Rx queues (8), than were allocated (1)!
      [   95.968093] ice 0000:3b:00.1: Failed to config TC for VSI index: 0
      [   95.969621] general protection fault: 0000 [#1] SMP NOPTI
      [   95.969705] CPU: 1 PID: 58405 Comm: lldpad Kdump: loaded Tainted: G     U  W  O     --------- -t - 4.18.0 #1
      [   95.969867] Hardware name: O.E.M/BC11SPSCB10, BIOS 8.23 12/30/2021
      [   95.969992] RIP: 0010:devm_kmalloc+0xa/0x60
      [   95.970052] Code: 5c ff ff ff 31 c0 5b 5d 41 5c c3 b8 f4 ff ff ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 89 d1 <8b> 97 60 02 00 00 48 8d 7e 18 48 39 f7 72 3f 55 89 ce 53 48 8b 4c
      [   95.970344] RSP: 0018:ffffc9003f553888 EFLAGS: 00010206
      [   95.970425] RAX: dead000000000200 RBX: ffffea003c425b00 RCX: 00000000006080c0
      [   95.970536] RDX: 00000000006080c0 RSI: 0000000000000200 RDI: dead000000000200
      [   95.970648] RBP: dead000000000200 R08: 00000000000463c0 R09: ffff888ffa900000
      [   95.970760] R10: 0000000000000000 R11: 0000000000000002 R12: ffff888ff6b40100
      [   95.970870] R13: ffff888ff6a55018 R14: 0000000000000000 R15: ffff888ff6a55460
      [   95.970981] FS:  00007f51b7d24700(0000) GS:ffff88903ee80000(0000) knlGS:0000000000000000
      [   95.971108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   95.971197] CR2: 00007fac5410d710 CR3: 0000000f2c1de002 CR4: 00000000007606e0
      [   95.971309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   95.971419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   95.971530] PKRU: 55555554
      [   95.971573] Call Trace:
      [   95.971622]  ice_setup_rx_ring+0x39/0x110 [ice]
      [   95.971695]  ice_vsi_setup_rx_rings+0x54/0x90 [ice]
      [   95.971774]  ice_vsi_open+0x25/0x120 [ice]
      [   95.971843]  ice_open_internal+0xb8/0x1f0 [ice]
      [   95.971919]  ice_ena_vsi+0x4f/0xd0 [ice]
      [   95.971987]  ice_dcb_ena_dis_vsi.constprop.5+0x29/0x90 [ice]
      [   95.972082]  ice_pf_dcb_cfg+0x29a/0x380 [ice]
      [   95.972154]  ice_dcbnl_setets+0x174/0x1b0 [ice]
      [   95.972220]  dcbnl_ieee_set+0x89/0x230
      [   95.972279]  ? dcbnl_ieee_del+0x150/0x150
      [   95.972341]  dcb_doit+0x124/0x1b0
      [   95.972392]  rtnetlink_rcv_msg+0x243/0x2f0
      [   95.972457]  ? dcb_doit+0x14d/0x1b0
      [   95.972510]  ? __kmalloc_node_track_caller+0x1d3/0x280
      [   95.972591]  ? rtnl_calcit.isra.31+0x100/0x100
      [   95.972661]  netlink_rcv_skb+0xcf/0xf0
      [   95.972720]  netlink_unicast+0x16d/0x220
      [   95.972781]  netlink_sendmsg+0x2ba/0x3a0
      [   95.975891]  sock_sendmsg+0x4c/0x50
      [   95.979032]  ___sys_sendmsg+0x2e4/0x300
      [   95.982147]  ? kmem_cache_alloc+0x13e/0x190
      [   95.985242]  ? __wake_up_common_lock+0x79/0x90
      [   95.988338]  ? __check_object_size+0xac/0x1b0
      [   95.991440]  ? _copy_to_user+0x22/0x30
      [   95.994539]  ? move_addr_to_user+0xbb/0xd0
      [   95.997619]  ? __sys_sendmsg+0x53/0x80
      [   96.000664]  __sys_sendmsg+0x53/0x80
      [   96.003747]  do_syscall_64+0x5b/0x1d0
      [   96.006862]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      Only update num_txq/rxq when the check has passed, and restore tc_cfg if
      setting up the queue map failed.
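      The "check first, commit afterwards, roll back on failure" pattern
      can be sketched as below. struct sketch_vsi, its fields and both
      functions are hypothetical simplifications that only loosely echo
      the names in the description; they are not the driver's code:

      ```c
      #include <stdbool.h>

      /* Hypothetical, simplified VSI state. */
      struct sketch_vsi {
              int alloc_rxq;  /* queues actually allocated */
              int num_rxq;    /* queues in use */
              int num_tc;     /* traffic classes configured */
      };

      /* Hypothetical setup step that fails when TCs exceed the allocated queues. */
      static bool sketch_setup_q_map(const struct sketch_vsi *vsi, int new_tc,
                                     int *new_rxq)
      {
              if (new_tc > vsi->alloc_rxq)
                      return false;
              *new_rxq = new_tc;      /* one queue per TC, purely illustrative */
              return true;
      }

      /* Apply a new TC count; on failure the previous state is restored. */
      static int sketch_cfg_tc(struct sketch_vsi *vsi, int new_tc)
      {
              struct sketch_vsi old = *vsi;   /* snapshot for rollback */
              int new_rxq;

              vsi->num_tc = new_tc;
              if (!sketch_setup_q_map(vsi, new_tc, &new_rxq)) {
                      *vsi = old;             /* restore instead of leaving dirty values */
                      return -1;
              }
              vsi->num_rxq = new_rxq;         /* update only after the check passed */
              return 0;
      }
      ```

      After a failed call the structure is identical to what it was before
      the call, so later code never sees a queue count that was not
      validated.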
      
      Fixes: a632b2a4 ("ice: ethtool: Prohibit improper channel config for DCB")
      Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
      Reviewed-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com>
      Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Don't double unplug aux on peer initiated reset · 23c61919
      Dave Ertman authored
      In the IDC callback that is accessed when the aux drivers request a reset,
      the function to unplug the aux devices is called.  This function is also
      called in the ice_prepare_for_reset function. This double call is causing
      a "scheduling while atomic" BUG.
      
      [  662.676430] ice 0000:4c:00.0 rocep76s0: cqp opcode = 0x1 maj_err_code = 0xffff min_err_code = 0x8003
      [  662.676609] ice 0000:4c:00.0 rocep76s0: [Modify QP Cmd Error][op_code=8] status=-29 waiting=1 completion_err=1 maj=0xffff min=0x8003
      [  662.815006] ice 0000:4c:00.0 rocep76s0: ICE OICR event notification: oicr = 0x10000003
      [  662.815014] ice 0000:4c:00.0 rocep76s0: critical PE Error, GLPE_CRITERR=0x00011424
      [  662.815017] ice 0000:4c:00.0 rocep76s0: Requesting a reset
      [  662.815475] BUG: scheduling while atomic: swapper/37/0/0x00010002
      [  662.815477] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill 8021q garp mrp stp llc vfat fat rpcrdma intel_rapl_msr intel_rapl_common sunrpc i10nm_edac rdma_ucm nfit ib_srpt libnvdimm ib_isert iscsi_target_mod x86_pkg_temp_thermal intel_powerclamp coretemp target_core_mod snd_hda_intel ib_iser snd_intel_dspcfg libiscsi snd_intel_sdw_acpi scsi_transport_iscsi kvm_intel iTCO_wdt rdma_cm snd_hda_codec kvm iw_cm ipmi_ssif iTCO_vendor_support snd_hda_core irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_seq snd_seq_device rapl snd_pcm snd_timer isst_if_mbox_pci pcspkr isst_if_mmio irdma intel_uncore idxd acpi_ipmi joydev isst_if_common snd mei_me idxd_bus ipmi_si soundcore i2c_i801 mei ipmi_devintf i2c_smbus i2c_ismt ipmi_msghandler acpi_power_meter acpi_pad rv(OE) ib_uverbs ib_cm ib_core xfs libcrc32c ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm
      [  662.815546]  nvme nvme_core ice drm crc32c_intel i40e t10_pi wmi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod fuse
      [  662.815557] Preemption disabled at:
      [  662.815558] [<0000000000000000>] 0x0
      [  662.815563] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Tainted: G S         OE     5.17.1 #2
      [  662.815566] Hardware name: Intel Corporation D50DNP/D50DNP, BIOS SE5C6301.86B.6624.D18.2111021741 11/02/2021
      [  662.815568] Call Trace:
      [  662.815572]  <IRQ>
      [  662.815574]  dump_stack_lvl+0x33/0x42
      [  662.815581]  __schedule_bug.cold.147+0x7d/0x8a
      [  662.815588]  __schedule+0x798/0x990
      [  662.815595]  schedule+0x44/0xc0
      [  662.815597]  schedule_preempt_disabled+0x14/0x20
      [  662.815600]  __mutex_lock.isra.11+0x46c/0x490
      [  662.815603]  ? __ibdev_printk+0x76/0xc0 [ib_core]
      [  662.815633]  device_del+0x37/0x3d0
      [  662.815639]  ice_unplug_aux_dev+0x1a/0x40 [ice]
      [  662.815674]  ice_schedule_reset+0x3c/0xd0 [ice]
      [  662.815693]  irdma_iidc_event_handler.cold.7+0xb6/0xd3 [irdma]
      [  662.815712]  ? bitmap_find_next_zero_area_off+0x45/0xa0
      [  662.815719]  ice_send_event_to_aux+0x54/0x70 [ice]
      [  662.815741]  ice_misc_intr+0x21d/0x2d0 [ice]
      [  662.815756]  __handle_irq_event_percpu+0x4c/0x180
      [  662.815762]  handle_irq_event_percpu+0xf/0x40
      [  662.815764]  handle_irq_event+0x34/0x60
      [  662.815766]  handle_edge_irq+0x9a/0x1c0
      [  662.815770]  __common_interrupt+0x62/0x100
      [  662.815774]  common_interrupt+0xb4/0xd0
      [  662.815779]  </IRQ>
      [  662.815780]  <TASK>
      [  662.815780]  asm_common_interrupt+0x1e/0x40
      [  662.815785] RIP: 0010:cpuidle_enter_state+0xd6/0x380
      [  662.815789] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 65 d7 95 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 64 02 00 00 31 ff e8 ae c5 9c ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
      [  662.815791] RSP: 0018:ff2c2c4f18edbe80 EFLAGS: 00000202
      [  662.815793] RAX: ff280805df140000 RBX: 0000000000000002 RCX: 000000000000001f
      [  662.815795] RDX: 0000009a52da2d08 RSI: ffffffff93f8240b RDI: ffffffff93f53ee7
      [  662.815796] RBP: ff5e2bd11ff41928 R08: 0000000000000000 R09: 000000000002f8c0
      [  662.815797] R10: 0000010c3f18e2cf R11: 000000000000000f R12: 0000009a52da2d08
      [  662.815798] R13: ffffffff94ad7e20 R14: 0000000000000002 R15: 0000000000000000
      [  662.815801]  cpuidle_enter+0x29/0x40
      [  662.815803]  do_idle+0x261/0x2b0
      [  662.815807]  cpu_startup_entry+0x19/0x20
      [  662.815809]  start_secondary+0x114/0x150
      [  662.815813]  secondary_startup_64_no_verify+0xd5/0xdb
      [  662.815818]  </TASK>
      [  662.815846] bad: scheduling from the idle thread!
      [  662.815849] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Tainted: G S      W  OE     5.17.1 #2
      [  662.815852] Hardware name: Intel Corporation D50DNP/D50DNP, BIOS SE5C6301.86B.6624.D18.2111021741 11/02/2021
      [  662.815853] Call Trace:
      [  662.815855]  <IRQ>
      [  662.815856]  dump_stack_lvl+0x33/0x42
      [  662.815860]  dequeue_task_idle+0x20/0x30
      [  662.815863]  __schedule+0x1c3/0x990
      [  662.815868]  schedule+0x44/0xc0
      [  662.815871]  schedule_preempt_disabled+0x14/0x20
      [  662.815873]  __mutex_lock.isra.11+0x3a8/0x490
      [  662.815876]  ? __ibdev_printk+0x76/0xc0 [ib_core]
      [  662.815904]  device_del+0x37/0x3d0
      [  662.815909]  ice_unplug_aux_dev+0x1a/0x40 [ice]
      [  662.815937]  ice_schedule_reset+0x3c/0xd0 [ice]
      [  662.815961]  irdma_iidc_event_handler.cold.7+0xb6/0xd3 [irdma]
      [  662.815979]  ? bitmap_find_next_zero_area_off+0x45/0xa0
      [  662.815985]  ice_send_event_to_aux+0x54/0x70 [ice]
      [  662.816011]  ice_misc_intr+0x21d/0x2d0 [ice]
      [  662.816033]  __handle_irq_event_percpu+0x4c/0x180
      [  662.816037]  handle_irq_event_percpu+0xf/0x40
      [  662.816039]  handle_irq_event+0x34/0x60
      [  662.816042]  handle_edge_irq+0x9a/0x1c0
      [  662.816045]  __common_interrupt+0x62/0x100
      [  662.816048]  common_interrupt+0xb4/0xd0
      [  662.816052]  </IRQ>
      [  662.816053]  <TASK>
      [  662.816054]  asm_common_interrupt+0x1e/0x40
      [  662.816057] RIP: 0010:cpuidle_enter_state+0xd6/0x380
      [  662.816060] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 65 d7 95 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 64 02 00 00 31 ff e8 ae c5 9c ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
      [  662.816063] RSP: 0018:ff2c2c4f18edbe80 EFLAGS: 00000202
      [  662.816065] RAX: ff280805df140000 RBX: 0000000000000002 RCX: 000000000000001f
      [  662.816067] RDX: 0000009a52da2d08 RSI: ffffffff93f8240b RDI: ffffffff93f53ee7
      [  662.816068] RBP: ff5e2bd11ff41928 R08: 0000000000000000 R09: 000000000002f8c0
      [  662.816070] R10: 0000010c3f18e2cf R11: 000000000000000f R12: 0000009a52da2d08
      [  662.816071] R13: ffffffff94ad7e20 R14: 0000000000000002 R15: 0000000000000000
      [  662.816075]  cpuidle_enter+0x29/0x40
      [  662.816077]  do_idle+0x261/0x2b0
      [  662.816080]  cpu_startup_entry+0x19/0x20
      [  662.816083]  start_secondary+0x114/0x150
      [  662.816087]  secondary_startup_64_no_verify+0xd5/0xdb
      [  662.816091]  </TASK>
      [  662.816169] bad: scheduling from the idle thread!
      
      The correct place to unplug the aux devices for a reset is in the
      prepare_for_reset function, as this is a common place for all reset flows.
      It also has built in protection from being called twice in a single reset
      instance before the aux devices are replugged.
      
      Fixes: f9f5301e ("ice: Register auxiliary device to provide RDMA")
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Tested-by: Helena Anna Dubel <helena.anna.dubel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      23c61919
    • Linus Torvalds's avatar
      Merge tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 26b12249
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from rxrpc, netfilter, wireless and bluetooth
        subtrees.
      
        Current release - regressions:
      
         - skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
      
         - bluetooth: fix regression preventing ACL packet transmission
      
        Current release - new code bugs:
      
         - dsa: microchip: fix kernel oops on ksz8 switches
      
         - dsa: qca8k: fix NULL pointer dereference for
           of_device_get_match_data
      
        Previous releases - regressions:
      
         - netfilter: clean up hook list when offload flags check fails
      
         - wifi: mt76: fix crash in chip reset fail
      
         - rxrpc: fix ICMP/ICMP6 error handling
      
         - ice: fix DMA mappings leak
      
         - i40e: fix kernel crash during module removal
      
        Previous releases - always broken:
      
         - ipv6: sr: fix out-of-bounds read when setting HMAC data.
      
         - tcp: TX zerocopy should not sense pfmemalloc status
      
         - sch_sfb: don't assume the skb is still around after
           enqueueing to child
      
         - netfilter: drop dst references before setting
      
         - wifi: wilc1000: fix DMA on stack objects
      
         - rxrpc: fix an insufficiently large sglist in
           rxkad_verify_packet_2()
      
         - fec: use a spinlock to guard `fep->ptp_clk_on`
      
        Misc:
      
         - usb: qmi_wwan: add Quectel RM520N"
      
      * tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
        sch_sfb: Also store skb len before calling child enqueue
        net: phy: lan87xx: change interrupt src of link_up to comm_ready
        net/smc: Fix possible access to freed memory in link clear
        net: ethernet: mtk_eth_soc: check max allowed hash in mtk_ppe_check_skb
        net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
        net: ethernet: mtk_eth_soc: fix typo in __mtk_foe_entry_clear
        net: dsa: felix: access QSYS_TAG_CONFIG under tas_lock in vsc9959_sched_speed_set
        net: dsa: felix: disable cut-through forwarding for frames oversized for tc-taprio
        net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet
        net: usb: qmi_wwan: add Quectel RM520N
        net: dsa: qca8k: fix NULL pointer dereference for of_device_get_match_data
        tcp: fix early ETIMEDOUT after spurious non-SACK RTO
        stmmac: intel: Simplify intel_eth_pci_remove()
        net: mvpp2: debugfs: fix memory leak when using debugfs_lookup()
        ipv6: sr: fix out-of-bounds read when setting HMAC data.
        bonding: accept unsolicited NA message
        bonding: add all node mcast address when slave up
        bonding: use unspecified address if no available link local address
        wifi: use struct_group to copy addresses
        wifi: mac80211_hwsim: check length for virtio packets
        ...
      26b12249
    • Linus Torvalds's avatar
      fs: only do a memory barrier for the first set_buffer_uptodate() · 2f79cdfe
      Linus Torvalds authored
      Commit d4252071 ("add barriers to buffer_uptodate and
      set_buffer_uptodate") added proper memory barriers to the buffer head
      BH_Uptodate bit, so that anybody who tests a buffer for being up-to-date
      will be guaranteed to actually see initialized state.
      
      However, that commit didn't _just_ add the memory barrier, it also ended
      up dropping the "was it already set" logic that the BUFFER_FNS() macro
      had.
      
      That's conceptually the right thing for a generic "this is a memory
      barrier" operation, but in the case of the buffer contents, we really
      only care about the memory barrier for the _first_ time we set the bit,
      in that the only memory ordering protection we need is to avoid anybody
      seeing uninitialized memory contents.
      
      Any other access ordering wouldn't be about the BH_Uptodate bit anyway,
      and would require some other proper lock (typically BH_Lock or the folio
      lock).  A reader that races with somebody invalidating the buffer head
      isn't an issue wrt the memory ordering, it's a serialization issue.
      
      Now, you'd think that the buffer head operations don't matter in this
      day and age (and I certainly thought so), but apparently some loads
      still end up being heavy users of buffer heads.  In particular, the
      kernel test robot reported that not having this bit access optimization
      in place caused a noticeable direct IO performance regression on ext4:
      
        fxmark.ssd_ext4_no_jnl_DWTL_54_directio.works/sec -26.5% regression
      
      although you presumably need a fast disk and a lot of cores to actually
      notice.
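      
      A minimal userspace sketch of the "barrier only for the first set"
      idea, written with C11 atomics instead of the kernel's bit operations.
      The names buffer_sketch, UPTODATE_BIT, set_uptodate_once() and
      test_uptodate() are hypothetical and are not the kernel API:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define UPTODATE_BIT 0x1UL /* hypothetical flag bit, not the kernel's BH_Uptodate */

struct buffer_sketch {
    atomic_ulong state; /* stand-in for the buffer head state word */
    int data;           /* stand-in for the buffer contents */
};

/*
 * Set the flag, paying for release ordering only when the bit is not
 * already visible as set. Returns true if this call changed the bit.
 */
static bool set_uptodate_once(struct buffer_sketch *b)
{
    unsigned long old;

    /* Fast path: whoever set the bit earlier already did the ordering. */
    if (atomic_load_explicit(&b->state, memory_order_relaxed) & UPTODATE_BIT)
        return false;

    /* Slow path: publish earlier writes to the contents with the bit. */
    old = atomic_fetch_or_explicit(&b->state, UPTODATE_BIT,
                                   memory_order_release);
    return !(old & UPTODATE_BIT);
}

/* Reader side: an acquire load pairs with the release above. */
static bool test_uptodate(struct buffer_sketch *b)
{
    return atomic_load_explicit(&b->state, memory_order_acquire) &
           UPTODATE_BIT;
}
```

      In this sketch a repeated set_uptodate_once() on an already up-to-date
      buffer costs only a relaxed load, which is the kind of saving the
      message above describes; the real kernel helpers may differ in detail.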
      
      Link: https://lore.kernel.org/all/Yw8L7HTZ%2FdE2%2Fo9C@xsang-OptiPlex-9020/
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Fengwei Yin <fengwei.yin@intel.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f79cdfe
    • Linus Torvalds's avatar
      Merge tag 'efi-urgent-for-v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi · f280b987
      Linus Torvalds authored
      Pull EFI fixes from Ard Biesheuvel:
       "A couple of low-priority EFI fixes:
      
         - prevent the randstruct plugin from re-ordering EFI protocol
           definitions
      
         - fix a use-after-free in the capsule loader
      
         - drop unused variable"
      
      * tag 'efi-urgent-for-v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
        efi: capsule-loader: Fix use-after-free in efi_capsule_write
        efi/x86: libstub: remove unused variable
        efi: libstub: Disable struct randomization
      f280b987
    • Toke Høiland-Jørgensen's avatar
      sch_sfb: Also store skb len before calling child enqueue · 2f09707d
      Toke Høiland-Jørgensen authored
      Cong Wang noticed that the previous fix for sch_sfb accessing the queued
      skb after enqueueing it to a child qdisc was incomplete: the SFB enqueue
      function was also calling qdisc_qstats_backlog_inc() after enqueue, which
      reads the pkt len from the skb cb field. Fix this by also storing the skb
      len, and using the stored value to increment the backlog after enqueueing.
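      
      The pattern can be sketched in plain C as follows. This is only an
      illustration under assumptions: struct pkt, struct queue_stats,
      child_enqueue_and_free() and parent_enqueue() are hypothetical
      stand-ins, not the sch_sfb code or the qdisc API:

```c
#include <stddef.h>
#include <stdlib.h>

struct pkt {
    unsigned int len;
};

struct queue_stats {
    unsigned int backlog; /* bytes queued */
    unsigned int qlen;    /* packets queued */
};

/* Hypothetical child that takes ownership and may free the packet. */
static int child_enqueue_and_free(struct pkt *p)
{
    free(p);
    return 0; /* 0 means success in this sketch */
}

/*
 * Snapshot everything needed from the packet before handing it to the
 * child, then update statistics from the snapshot only.
 */
static int parent_enqueue(struct pkt *p, struct queue_stats *st,
                          int (*child)(struct pkt *))
{
    unsigned int len = p->len; /* read while p is still ours */
    int ret = child(p);        /* p must not be touched after this */

    if (ret == 0) {
        st->backlog += len;
        st->qlen++;
    }
    return ret;
}
```

      The point of the sketch is the ordering: the length is copied out
      before the child call, so nothing dereferences the packet afterwards.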
      
      Fixes: 9efd2329 ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Acked-by: Cong Wang <cong.wang@bytedance.com>
      Link: https://lore.kernel.org/r/20220905192137.965549-1-toke@toke.dk
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      2f09707d
    • Arun Ramadoss's avatar
      net: phy: lan87xx: change interrupt src of link_up to comm_ready · 5382033a
      Arun Ramadoss authored
      Currently phy link up/down interrupt is enabled using the
      LAN87xx_INTERRUPT_MASK register. In the lan87xx_read_status function,
      phy link is determined using the T1_MODE_STAT_REG register comm_ready bit.
      comm_ready bit is set using the loc_rcvr_status & rem_rcvr_status.
      Whenever the phy link is up, LAN87xx_INTERRUPT_SOURCE link_up bit is set
      first but comm_ready bit takes some time to set based on local and
      remote receiver status.
      As per the current implementation, interrupt is triggered using link_up
      but the comm_ready bit is still cleared in the read_status function. So,
      link is always down.  Initially tested with the shared interrupt
      mechanism with switch and internal phy which is working, but after
      implementing interrupt controller it is not working.
      It can be fixed either by updating the read_status function to read from
      the LAN87XX_INTERRUPT_SOURCE register or by enabling the interrupt mask
      for the comm_ready bit. But the validation team recommends the use of
      comm_ready for link detection.
      This patch fixes it by enabling the comm_ready bit for link_up in the
      LAN87XX_INTERRUPT_MASK_2 register (MISC Bank) and link_down in the
      LAN87xx_INTERRUPT_MASK register.
      
      Fixes: 8a1b415d ("net: phy: added ethtool master-slave configuration support")
      Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220905152750.5079-1-arun.ramadoss@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      5382033a
  6. 07 Sep, 2022 13 commits
    • Hyunwoo Kim's avatar
      efi: capsule-loader: Fix use-after-free in efi_capsule_write · 9cb636b5
      Hyunwoo Kim authored
      A race condition may occur if the user calls close() on another thread
      during a write() operation on the device node of the efi capsule.
      
      This is a race condition that occurs between the efi_capsule_write() and
      efi_capsule_flush() functions of efi_capsule_fops, which ultimately
      results in UAF.
      
      So, the page freeing process is modified to be done in
      efi_capsule_release() instead of efi_capsule_flush().
      
      Cc: <stable@vger.kernel.org> # v4.9+
      Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
      Link: https://lore.kernel.org/all/20220907102920.GA88602@ubuntu/
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      9cb636b5
    • Yacan Liu's avatar
      net/smc: Fix possible access to freed memory in link clear · e9b1a4f8
      Yacan Liu authored
      After modifying the QP to the Error state, all RX WRs will be completed
      with a WC in IB_WC_WR_FLUSH_ERR status. The current implementation does
      not wait for this to finish, but destroys the QP and frees the link group
      directly. So there is a risk of accessing freed memory in tasklet context.
      
      Here is a crash example:
      
       BUG: unable to handle page fault for address: ffffffff8f220860
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD f7300e067 P4D f7300e067 PUD f7300f063 PMD 8c4e45063 PTE 800ffff08c9df060
       Oops: 0002 [#1] SMP PTI
       CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G S         OE     5.10.0-0607+ #23
       Hardware name: Inspur NF5280M4/YZMB-00689-101, BIOS 4.1.20 07/09/2018
       RIP: 0010:native_queued_spin_lock_slowpath+0x176/0x1b0
       Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 c8 02 00 48 03 04 f5 00 09 98 8e <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
       RSP: 0018:ffffb3b6c001ebd8 EFLAGS: 00010086
       RAX: ffffffff8f220860 RBX: 0000000000000246 RCX: 0000000000080000
       RDX: ffff91db1f86c800 RSI: 000000000000173c RDI: ffff91db62bace00
       RBP: ffff91db62bacc00 R08: 0000000000000000 R09: c00000010000028b
       R10: 0000000000055198 R11: ffffb3b6c001ea58 R12: ffff91db80e05010
       R13: 000000000000000a R14: 0000000000000006 R15: 0000000000000040
       FS:  0000000000000000(0000) GS:ffff91db1f840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffff8f220860 CR3: 00000001f9580004 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <IRQ>
        _raw_spin_lock_irqsave+0x30/0x40
        mlx5_ib_poll_cq+0x4c/0xc50 [mlx5_ib]
        smc_wr_rx_tasklet_fn+0x56/0xa0 [smc]
        tasklet_action_common.isra.21+0x66/0x100
        __do_softirq+0xd5/0x29c
        asm_call_irq_on_stack+0x12/0x20
        </IRQ>
        do_softirq_own_stack+0x37/0x40
        irq_exit_rcu+0x9d/0xa0
        sysvec_call_function_single+0x34/0x80
        asm_sysvec_call_function_single+0x12/0x20
      
      Fixes: bd4ad577 ("smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR")
      Signed-off-by: Yacan Liu <liuyacan@corp.netease.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9b1a4f8
    • Lorenzo Bianconi's avatar
      net: ethernet: mtk_eth_soc: check max allowed hash in mtk_ppe_check_skb · f27b405e
      Lorenzo Bianconi authored
      Even if max hash configured in hw in mtk_ppe_hash_entry is
      MTK_PPE_ENTRIES - 1, check theoretical OOB accesses in
      mtk_ppe_check_skb routine
      
      Fixes: c4f033d9 ("net: ethernet: mtk_eth_soc: rework hardware flow table management")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f27b405e
    • Menglong Dong's avatar
      net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM · 9cb252c4
      Menglong Dong authored
      As Eric reported, the 'reason' field is not shown when tracing the
      kfree_skb event with perf:
      
      $ perf record -e skb:kfree_skb -a sleep 10
      $ perf script
        ip_defrag 14605 [021]   221.614303:   skb:kfree_skb:
        skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
        reason:
      
      The cause seems to be passing kernel address directly to TP_printk(),
      which is not right. As the enum 'skb_drop_reason' is not exported to
      user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason
      string from the 'reason' field, which is a number.
      
      Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
      to define the trace enum by TRACE_DEFINE_ENUM(). With the help of
      DEFINE_DROP_REASON(), we can now remove the auto-generation that we
      introduced in the commit ec43908d
      ("net: skb: use auto-generation to convert skb drop reason to string"),
      and define the string array 'drop_reasons'.
      
      Hmm, this brings us back to the situation where drop reasons have to be
      maintained in both enum skb_drop_reason and DEFINE_DROP_REASON. But they
      are both in dropreason.h, which makes it easier.
      
      After this commit, now the format of kfree_skb is like this:
      
      $ cat /tracing/events/skb/kfree_skb/format
      name: kfree_skb
      ID: 1524
      format:
              field:unsigned short common_type;       offset:0;       size:2; signed:0;
              field:unsigned char common_flags;       offset:2;       size:1; signed:0;
              field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
              field:int common_pid;   offset:4;       size:4; signed:1;
      
              field:void * skbaddr;   offset:8;       size:8; signed:0;
              field:void * location;  offset:16;      size:8; signed:0;
              field:unsigned short protocol;  offset:24;      size:2; signed:0;
              field:enum skb_drop_reason reason;      offset:28;      size:4; signed:0;
      
      print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......
      
      Fixes: ec43908d ("net: skb: use auto-generation to convert skb drop reason to string")
      Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cb252c4
    • Lorenzo Bianconi's avatar
      net: ethernet: mtk_eth_soc: fix typo in __mtk_foe_entry_clear · 0e80707d
      Lorenzo Bianconi authored
      Set ib1 state to MTK_FOE_STATE_UNBIND in __mtk_foe_entry_clear routine.
      
      Fixes: 33fc42de ("net: ethernet: mtk_eth_soc: support creating mac address based offload entries")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e80707d
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_osf: fix possible bogus match in nf_osf_find() · 559c36c5
      Pablo Neira Ayuso authored
      nf_osf_find() incorrectly returns true on mismatch, this leads to
      copying uninitialized memory area in nft_osf which can be used to leak
      stale kernel stack data to userspace.
      
      Fixes: 22c7652c ("netfilter: nft_osf: Add version option support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      559c36c5
    • David Leadbeater's avatar
      netfilter: nf_conntrack_irc: Tighten matching on DCC message · e8d5dfd1
      David Leadbeater authored
      CTCP messages should only be at the start of an IRC message, not
      anywhere within it.
      
      While the helper only decodes packets in the ORIGINAL direction, it's
      possible to make a client send a CTCP message back by embedding one into
      a PING request.  As-is, that's enough to make the helper believe that it
      saw a CTCP message.
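      
      A rough sketch of what "only at the start of an IRC message" can mean
      for a DCC request. It assumes the usual "PRIVMSG <target> :<text>" line
      layout with the CTCP delimiter byte 0x01; dcc_at_message_start() is a
      hypothetical helper and not the conntrack helper's actual parser:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Accept a line only if the CTCP DCC marker immediately follows
 * "PRIVMSG <target> :", rather than appearing anywhere in the text.
 */
static bool dcc_at_message_start(const char *line)
{
    static const char cmd[] = "PRIVMSG ";
    static const char ctcp[] = ":\001DCC ";
    const char *p;

    if (strncmp(line, cmd, sizeof(cmd) - 1) != 0)
        return false;

    /* Skip the target (nick or channel) up to the next space. */
    p = strchr(line + sizeof(cmd) - 1, ' ');
    if (p == NULL)
        return false;
    p++;

    /* The message text must begin with the CTCP DCC marker. */
    return strncmp(p, ctcp, sizeof(ctcp) - 1) == 0;
}
```

      With a check like this, a DCC marker embedded later in the text of a
      message, or inside a non-PRIVMSG command, is not treated as a match.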
      
      Fixes: 869f37d8 ("[NETFILTER]: nf_conntrack/nf_nat: add IRC helper port")
      Signed-off-by: David Leadbeater <dgl@dgl.cx>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      e8d5dfd1
    • Florian Westphal's avatar
      selftests: nft_concat_range: add socat support · 25b327d4
      Florian Westphal authored
      There are different flavors of 'nc' around, this script fails on
      my test vm because 'nc' is 'nmap-ncat' which isn't 100% compatible.
      
      Add socat support and use it if available.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      25b327d4
    • Igor Ryzhov's avatar
      netfilter: nf_conntrack_sip: fix ct_sip_walk_headers · 39aebede
      Igor Ryzhov authored
      ct_sip_next_header and ct_sip_get_header return an absolute
      value of matchoff, not a shift from current dataoff.
      So dataoff should be assigned matchoff, not incremented by it.
      
      This issue can be seen in the scenario when there are multiple
      Contact headers and the first one is using a hostname and other headers
      use IP addresses. In this case, ct_sip_walk_headers will work as follows:
      
      The first ct_sip_get_header call will find the first Contact header
      but will return -1 as the header uses a hostname. But matchoff will
      be changed to the offset of this header. After that, dataoff should be
      set to matchoff, so that the next ct_sip_get_header call finds the next
      Contact header. But instead of assigning matchoff to dataoff, dataoff
      is incremented by it, which is not correct, as matchoff is an absolute
      value of the offset. So on the next call to ct_sip_get_header,
      dataoff will be incorrect, and the next Contact header may not be
      found at all.
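      
      The difference between assigning and incrementing can be shown with a
      small standalone sketch. find_next() and count_headers() are
      hypothetical helpers written for this illustration; they only mimic
      the "matchoff is an absolute offset" convention described above and
      are not the conntrack SIP functions:

```c
#include <stddef.h>
#include <string.h>

/*
 * Search buf[dataoff..len) for needle. On success, store the ABSOLUTE
 * offset just past the match in *matchoff and return 1; else return 0.
 */
static int find_next(const char *buf, size_t len, size_t dataoff,
                     const char *needle, size_t *matchoff)
{
    size_t nlen = strlen(needle);

    for (size_t i = dataoff; i + nlen <= len; i++) {
        if (memcmp(buf + i, needle, nlen) == 0) {
            *matchoff = i + nlen;
            return 1;
        }
    }
    return 0;
}

/* Count matches; buggy != 0 reproduces the "dataoff += matchoff" mistake. */
static int count_headers(const char *buf, size_t len, const char *needle,
                         int buggy)
{
    size_t dataoff = 0, matchoff = 0;
    int n = 0;

    while (find_next(buf, len, dataoff, needle, &matchoff)) {
        n++;
        if (buggy)
            dataoff += matchoff; /* treats an absolute offset as relative */
        else
            dataoff = matchoff;  /* continue right after the match */
    }
    return n;
}
```

      On a buffer with three "Contact:" lines of 12 bytes each, the assigning
      variant visits all three, while the incrementing variant jumps from
      offset 8 to offset 28 and skips the header at offset 24.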
      
      Fixes: 05e3ced2 ("[NETFILTER]: nf_conntrack_sip: introduce SIP-URI parsing helper")
      Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      39aebede
    • David S. Miller's avatar
      Merge branch 'dsa-felix-fixes' · 0f51fa2a
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fixes for Felix DSA driver calculation of tc-taprio guard bands
      
      This series fixes some bugs which are not quite new, but date from v5.13
      when static guard bands were enabled by Michael Walle to prevent
      tc-taprio overruns.
      
      The investigation started when Xiaoliang asked privately what is the
      expected max SDU for a traffic class when its minimum gate interval is
      10 us. The answer, as it turns out, is not an L1 size of 1250 octets,
      but 1245 octets, since otherwise, the switch will not consider frames
      for egress scheduling, because the static guard band is exactly as large
      as the time interval. The switch needs a minimum of 33 ns outside of the
      guard band to consider a frame for scheduling, and the reduction of the
      max SDU by 5 provides exactly for that.
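      
      The arithmetic behind the 1250 versus 1245 figures can be checked with
      a short sketch. It assumes gigabit speed (8 ns per octet) and the
      33 ns minimum quoted above; max_sdu_l1_octets() is a hypothetical
      helper, not the driver's function:

```c
/*
 * Largest L1 frame size (in octets) whose guard band still leaves
 * min_open_ns of the gate interval free for scheduling.
 */
static unsigned int max_sdu_l1_octets(unsigned int gate_ns,
                                      unsigned int min_open_ns,
                                      unsigned int ns_per_octet)
{
    if (ns_per_octet == 0 || gate_ns <= min_open_ns)
        return 0;

    /* Integer division rounds down, so the result never overshoots. */
    return (gate_ns - min_open_ns) / ns_per_octet;
}
```

      With a 10000 ns gate this gives (10000 - 33) / 8 = 1245 octets, while
      ignoring the 33 ns requirement gives 10000 / 8 = 1250 octets.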
      
      The fix for that (patch 1/3) is relatively small, but during testing, it
      became apparent that cut-through forwarding prevents oversized frame
      dropping from working properly. This is solved through the larger patch
      2/3. Finally, patch 3/3 fixes one more tc-taprio locking problem found
      through code inspection.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f51fa2a
    • Vladimir Oltean's avatar
      net: dsa: felix: access QSYS_TAG_CONFIG under tas_lock in vsc9959_sched_speed_set · a4bb481a
      Vladimir Oltean authored
      The read-modify-write of QSYS_TAG_CONFIG from vsc9959_sched_speed_set()
      runs unlocked with respect to the other functions that access it, which
      are vsc9959_tas_guard_bands_update(), vsc9959_qos_port_tas_set() and
      vsc9959_tas_clock_adjust(). All the others are under ocelot->tas_lock,
      so move the vsc9959_sched_speed_set() access under that lock as well, to
      resolve the concurrency.
      
      Fixes: 55a515b1 ("net: dsa: felix: drop oversized frames with tc-taprio instead of hanging the port")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4bb481a
    • Vladimir Oltean's avatar
      net: dsa: felix: disable cut-through forwarding for frames oversized for tc-taprio · 843794bb
      Vladimir Oltean authored
      Experimentally, it looks like when QSYS_QMAXSDU_CFG_7 is set to 605,
      frames even way larger than 601 octets are transmitted even though these
      should be considered as oversized, according to the documentation, and
      dropped.
      
      Since oversized frame dropping depends on frame size, which is only
      known at the EOF stage, and therefore not at SOF when cut-through
      forwarding begins, it means that the switch cannot take QSYS_QMAXSDU_CFG_*
      into consideration for traffic classes that are cut-through.
      
      Since cut-through forwarding has no UAPI to control it, and the driver
      enables it based on the mantra "if we can, then why not", the strategy
      is to alter vsc9959_cut_through_fwd() to take into consideration which
      tc's have oversize frame dropping enabled, and disable cut-through for
      them. Then, from vsc9959_tas_guard_bands_update(), we re-trigger the
      cut-through determination process.
      
      There are 2 strategies for vsc9959_cut_through_fwd() to determine
      whether a tc has oversized dropping enabled or not. One is to keep a bit
      mask of traffic classes per port, and the other is to read back from the
      hardware registers (a non-zero value of QSYS_QMAXSDU_CFG_* means the
      feature is enabled). We choose reading back from registers, because
      struct ocelot_port is shared with drivers (ocelot, seville) that
      support neither cut-through nor tc-taprio, and we don't have a felix
      specific extension of struct ocelot_port. Furthermore, reading registers
      from the Felix hardware is quite cheap, since they are memory-mapped.
      
      Fixes: 55a515b1 ("net: dsa: felix: drop oversized frames with tc-taprio instead of hanging the port")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      843794bb
    • Vladimir Oltean's avatar
      net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet · 11afdc65
      Vladimir Oltean authored
      The blamed commit broke tc-taprio schedules such as this one:
      
      tc qdisc replace dev $swp1 root taprio \
              num_tc 8 \
              map 0 1 2 3 4 5 6 7 \
              queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
              base-time 0 \
              sched-entry S 0x7f 990000 \
              sched-entry S 0x80  10000 \
              flags 0x2
      
      because the gate entry for TC 7 (S 0x80 10000 ns) now has a static guard
      band added earlier than its 'gate close' event, such that packet
      overruns won't occur in the worst case of the largest packet possible.
      
      Since guard bands are statically determined based on the per-tc
      QSYS_QMAXSDU_CFG_* with a fallback on the port-based QSYS_PORT_MAX_SDU,
      we need to discuss what happens with TC 7 depending on kernel version,
      since the driver, prior to commit 55a515b1 ("net: dsa: felix: drop
      oversized frames with tc-taprio instead of hanging the port"), did not
      touch QSYS_QMAXSDU_CFG_*, and therefore relied on QSYS_PORT_MAX_SDU.
      
      1 (before vsc9959_tas_guard_bands_update): QSYS_PORT_MAX_SDU defaults to
        1518, and at gigabit this introduces a static guard band (independent
        of packet sizes) of 12144 ns, plus QSYS::HSCH_MISC_CFG.FRM_ADJ (bit
        time of 20 octets => 160 ns). But this is larger than the time window
        itself, of 10000 ns. So, the queue system never considers a frame with
        TC 7 as eligible for transmission, since the gate practically never
        opens, and these frames are forever stuck in the TX queues and hang
        the port.
      
      2 (after vsc9959_tas_guard_bands_update): Under the sole goal of
        enabling oversized frame dropping, we make an effort to set
        QSYS_QMAXSDU_CFG_7 to 1230 bytes. But QSYS_QMAXSDU_CFG_7 plays
        one more role, which we did not take into account: per-tc static guard
        band, expressed in L2 byte time (auto-adjusted for FCS and L1 overhead).
        There is a discrepancy between what the driver thinks (that there is
        no guard band, and 100% of min_gate_len[tc] is available for egress
        scheduling) and what the hardware actually does (crops the equivalent
        of QSYS_QMAXSDU_CFG_7 ns out of min_gate_len[tc]). In practice, this
        means that the hardware thinks it has exactly 0 ns for scheduling tc 7.
      
      In both cases, even minimum sized Ethernet frames are stuck on egress
      rather than being considered for scheduling on TC 7, even if they would
      fit given a proper configuration. Considering the current situation,
      with vsc9959_tas_guard_bands_update(), frames between 60 octets and 1230
      octets in size are not eligible for oversized dropping (because they are
      smaller than QSYS_QMAXSDU_CFG_7), but won't be considered as eligible
      for scheduling either, because the min_gate_len[7] (10000 ns) minus the
      guard band determined by QSYS_QMAXSDU_CFG_7 (1230 octets * 8 ns per
      octet == 9840 ns) minus the guard band auto-added for L1 overhead by
      QSYS::HSCH_MISC_CFG.FRM_ADJ (20 octets * 8 ns per octet == 160 ns)
      leaves 0 ns for scheduling in the queue system proper.
      
      Investigating the hardware behavior, it becomes apparent that the queue
      system needs precisely 33 ns of 'gate open' time in order to consider a
      frame as eligible for scheduling to a tc. So the solution to this
      problem is to amend vsc9959_tas_guard_bands_update(), by giving the
      per-tc guard bands less space by exactly 33 ns, just enough for one
      frame to be scheduled in that interval. This allows the queue system to
      make forward progress for that port-tc, and prevents it from hanging.
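      
      The time budget discussed above can be reproduced with a few lines of
      C. This is a sketch under the stated assumptions (gigabit, 8 ns per
      octet, 20 octets of L1 overhead); remaining_gate_ns() is a
      hypothetical helper, and the 1225 value used below is only an
      illustration (1230 reduced by 5 octets), not a value taken from the
      driver:

```c
/*
 * Gate time left for scheduling after subtracting the static guard band
 * (max SDU in L2 octets) and the L1 overhead the hardware adds itself.
 * A result of zero or less means the gate effectively never opens.
 */
static long remaining_gate_ns(long gate_ns, long max_sdu_l2_octets,
                              long l1_overhead_octets, long ns_per_octet)
{
    long guard_ns = max_sdu_l2_octets * ns_per_octet;
    long l1_ns = l1_overhead_octets * ns_per_octet;

    return gate_ns - guard_ns - l1_ns;
}
```

      For the 10000 ns gate, a max SDU of 1518 leaves -2304 ns and 1230
      leaves exactly 0 ns, matching cases 1 and 2. Shrinking the guard band
      to 1225 octets leaves 40 ns, which is above the 33 ns the queue system
      needs.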
      
      Fixes: 297c4de6 ("net: dsa: felix: re-enable TAS guard band mode")
      Reported-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11afdc65