1. 29 Apr, 2022 3 commits
  2. 28 Apr, 2022 10 commits
    • tcp: fix F-RTO may not work correctly when receiving DSACK · d9157f68
      Pengcheng Yang authored
      Currently DSACK is regarded as a dupack, which may cause
      F-RTO to incorrectly enter "loss was real" when receiving
      DSACK.
      
      Packetdrill to demonstrate:
      
      // Enable F-RTO and TLP
          0 `sysctl -q net.ipv4.tcp_frto=2`
          0 `sysctl -q net.ipv4.tcp_early_retrans=3`
          0 `sysctl -q net.ipv4.tcp_congestion_control=cubic`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
      // RTT 10ms, RTO 210ms
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 2 data segments
         +0 write(4, ..., 2000) = 2000
         +0 > P. 1:2001(2000) ack 1
      
      // TLP
      +.022 > P. 1001:2001(1000) ack 1
      
      // Continue to send 8 data segments
         +0 write(4, ..., 10000) = 10000
         +0 > P. 2001:10001(8000) ack 1
      
      // RTO
      +.188 > . 1:1001(1000) ack 1
      
      // The original data is acked and new data is sent(F-RTO step 2.b)
         +0 < . 1:1(0) ack 2001 win 257
         +0 > P. 10001:12001(2000) ack 1
      
      // D-SACK caused by TLP is regarded as a dupack, this results in
      // the incorrect judgment of "loss was real"(F-RTO step 3.a)
      +.022 < . 1:1(0) ack 2001 win 257 <sack 1001:2001,nop,nop>
      
      // Never-retransmitted data(3001:4001) are acked and
      // expect to switch to open state(F-RTO step 3.b)
         +0 < . 1:1(0) ack 4001 win 257
      +0 %{ assert tcpi_ca_state == 0, tcpi_ca_state }%
      
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/1650967419-2150-1-git-send-email-yangpc@wangsu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · c26d0d98
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      1) Fix incorrect TCP connection tracking window reset for non-syn
         packets, from Florian Westphal.
      
      2) Incorrect dependency on CONFIG_NFT_FLOW_OFFLOAD, from Volodymyr Mytnyk.
      
      3) Fix nft_socket from the output path, from Florian Westphal.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nft_socket: only do sk lookups when indev is available
        netfilter: conntrack: fix udp offload timeout sysctl
        netfilter: nf_conntrack_tcp: re-init for syn packets only
      ====================
      
      Link: https://lore.kernel.org/r/20220428142109.38726-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits" · aeaf59b7
      Dany Madden authored
      This reverts commit 723ad916.
      
      When the client requests a channel or ring size larger than what the
      server can support, the server caps the request to the supported
      maximum. So the client cannot successfully request resources that
      exceed the server limit.
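
The capping behaviour described above can be pictured with a minimal sketch. The function name `granted_queue_count` and its parameters are hypothetical and are not taken from the ibmvnic driver; the sketch only assumes that the server clamps a request to its own maximum.

```c
#include <stdint.h>

/*
 * Hypothetical model of the server side: a request above the supported
 * maximum is silently reduced to that maximum, so the client can never
 * obtain more than server_max.
 */
static uint32_t granted_queue_count(uint32_t requested, uint32_t server_max)
{
    return requested > server_max ? server_max : requested;
}
```

Under this assumption a client-side switch for larger limits has nothing to act on, which is consistent with the rationale given for the revert.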
      
      Fixes: 723ad916 ("ibmvnic: Add ethtool private flag for driver-defined queue limits")
      Signed-off-by: Dany Madden <drt@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220427235146.23189-1-drt@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: allow tc-etf offload even with NETIF_F_CSUM_MASK · 66a2f5ef
      Vladimir Oltean authored
      The Time-Specified Departure feature is indeed mutually exclusive with
      TX IP checksumming in ENETC, but TX checksumming in itself is broken and
      was removed from this driver in commit 82728b91 ("enetc: Remove Tx
      checksumming offload code").
      
      The blamed commit declared NETIF_F_HW_CSUM in dev->features to comply
      with software TSO's expectations, and still did the checksumming in
      software by calling skb_checksum_help(). So there isn't any restriction
      for the Time-Specified Departure feature.
      
      However, enetc_setup_tc_txtime() doesn't understand that, and blindly
      looks for NETIF_F_CSUM_MASK.
      
      Instead of checking for things which can literally never happen in the
      current code base, just remove the check and let the driver offload
      tc-etf qdiscs.
      
      Fixes: acede3c5 ("net: enetc: declare NETIF_F_HW_CSUM and do it in software")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220427203017.1291634-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ixgbe: ensure IPsec VF<->PF compatibility · f049efc7
      Leon Romanovsky authored
      The VF driver can forward any IPsec flags, which makes the function
      not extendable and prone to backward/forward incompatibility.
      
      If new software runs on VF, it won't know that PF configured something
      completely different as it "knows" only XFRM_OFFLOAD_INBOUND flag.
      
      Fixes: eda0333a ("ixgbe: add VF IPsec management")
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Shannon Nelson <snelson@pensando.io>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20220427173152.443102-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • MAINTAINERS: Update BNXT entry with firmware files · 126858db
      Florian Fainelli authored
      There appears to be a maintainer gap for BNXT TEE firmware files which
      causes some patches to be missed. Update the entry for the BNXT Ethernet
      controller with its companion firmware files.
      
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20220427163606.126154-1-f.fainelli@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: nft_socket: only do sk lookups when indev is available · 743b83f1
      Florian Westphal authored
      Check if the incoming interface is available and NFT_BREAK
      in case neither skb->sk nor input device are set.
      
      Because nf_sk_lookup_slow*() assume packet headers are in the
      'in' direction, use in postrouting is not going to yield a meaningful
      result.  Same is true for the forward chain, so restrict the use
      to prerouting, input and output.
      
      Use in output works if a socket is already attached to the skb.
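
The two rules in this message (which chains may use the expression, and when a lookup is attempted at all) can be sketched as follows. The enum and function names are hypothetical stand-ins and do not come from nft_socket; only the rules themselves are taken from the text above.

```c
#include <stdbool.h>

/* Hypothetical hook identifiers, not the kernel's NF_INET_* constants. */
enum toy_hook {
    TOY_PREROUTING,
    TOY_INPUT,
    TOY_FORWARD,
    TOY_OUTPUT,
    TOY_POSTROUTING
};

/* Use is restricted to prerouting, input and output. */
static bool toy_socket_hook_allowed(enum toy_hook hook)
{
    return hook == TOY_PREROUTING || hook == TOY_INPUT || hook == TOY_OUTPUT;
}

/*
 * A lookup is only meaningful if a socket is already attached to the
 * packet or an input device is available; otherwise evaluation breaks.
 */
static bool toy_socket_can_evaluate(bool has_sk, bool has_indev)
{
    return has_sk || has_indev;
}
```

In this model a packet on the output path with an attached socket passes both checks, while one without a socket and without an input device does not.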
      
      Fixes: 554ced0a ("netfilter: nf_tables: add support for native socket matching")
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge tag 'for-net-2022-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · febb2d2f
      Paolo Abeni authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - Fix regression causing some HCI events to be discarded when they
         shouldn't.
      
      * tag 'for-net-2022-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted
        Bluetooth: hci_event: Fix creating hci_conn object on error status
        Bluetooth: hci_event: Fix checking for invalid handle on error status
      ====================
      
      Link: https://lore.kernel.org/r/20220427234031.1257281-1-luiz.dentz@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: fec: add missing of_node_put() in fec_enet_init_stop_mode() · d2b52ec0
      Yang Yingliang authored
      Put device node in error path in fec_enet_init_stop_mode().
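
The pattern behind this fix, releasing a reference on every exit path including the error path, can be sketched with a toy reference counter. `toy_node`, `toy_node_get`, `toy_node_put` and `toy_init_stop_mode` are hypothetical stand-ins and do not reproduce the driver code or the device-tree API.

```c
#include <stddef.h>

/* Hypothetical reference-counted node, standing in for a device node. */
struct toy_node {
    int refcount;
};

static struct toy_node *toy_node_get(struct toy_node *n)
{
    if (n)
        n->refcount++;
    return n;
}

static void toy_node_put(struct toy_node *n)
{
    if (n)
        n->refcount--;
}

/*
 * Returns 0 on success and -1 on error. Both paths leave through the
 * same label, so the reference taken at the top is always dropped.
 */
static int toy_init_stop_mode(struct toy_node *parent, int args_valid)
{
    struct toy_node *gpr = toy_node_get(parent);
    int ret = 0;

    if (!gpr)
        return 0;

    if (!args_valid) {
        ret = -1;
        goto out;       /* error path still reaches toy_node_put() */
    }

    /* ... the node would be used here ... */

out:
    toy_node_put(gpr);
    return ret;
}
```

After either outcome the reference count is back at its starting value, which is the property the missing put violated.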
      
      Fixes: 8a448bf8 ("net: ethernet: fec: move GPR register offset and bit into DT")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Link: https://lore.kernel.org/r/20220426125231.375688-1-yangyingliang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnx2x: fix napi API usage sequence · af68656d
      Manish Chopra authored
      While handling PCI errors (AER flow), the driver tries to
      disable NAPI [napi_disable()] after NAPI is deleted
      [__netif_napi_del()], which causes an unexpected system
      hang/crash.
      
      System message log shows the following:
      =======================================
      [ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000
      [ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
      [ 3222.537512] EEH: Notify device drivers to shutdown
      [ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)'
      [ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->error_detected(IO frozen)
      [ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected
      [ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports: 'need reset'
      [ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking bnx2x->error_detected(IO frozen)
      [ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected
      [ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports: 'need reset'
      [ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
      [ 3222.537890] EEH: Collect temporary log
      [ 3222.583481] EEH: of node=0384:80:00.0
      [ 3222.583519] EEH: PCI device/vendor: 168e14e4
      [ 3222.583557] EEH: PCI cmd/status register: 00100140
      [ 3222.583557] EEH: PCI-E capabilities and status follow:
      [ 3222.583744] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82
      [ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
      [ 3222.583893] EEH: PCI-E 20: 00000000
      [ 3222.583893] EEH: PCI-E AER capability register set follows:
      [ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030
      [ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000
      [ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
      [ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000
      [ 3222.584416] EEH: of node=0384:80:00.1
      [ 3222.584454] EEH: PCI device/vendor: 168e14e4
      [ 3222.584491] EEH: PCI cmd/status register: 00100140
      [ 3222.584492] EEH: PCI-E capabilities and status follow:
      [ 3222.584677] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82
      [ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
      [ 3222.584826] EEH: PCI-E 20: 00000000
      [ 3222.584826] EEH: PCI-E AER capability register set follows:
      [ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030
      [ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000
      [ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
      [ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000
      [ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2
      [ 3222.586873] EEH: Reset without hotplug activity
      [ 3224.762767] EEH: Beginning: 'slot_reset'
      [ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->slot_reset()
      [ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing...
      [ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142)
      [ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset --> driver unload
      
      Uninterruptible tasks
      =====================
      crash> ps | grep UN
           213      2  11  c000000004c89e00  UN   0.0       0      0  [eehd]
           215      2   0  c000000004c80000  UN   0.0       0      0  [kworker/0:2]
          2196      1  28  c000000004504f00  UN   0.1   15936  11136  wickedd
          4287      1   9  c00000020d076800  UN   0.0    4032   3008  agetty
          4289      1  20  c00000020d056680  UN   0.0    7232   3840  agetty
         32423      2  26  c00000020038c580  UN   0.0       0      0  [kworker/26:3]
         32871   4241  27  c0000002609ddd00  UN   0.1   18624  11648  sshd
         32920  10130  16  c00000027284a100  UN   0.1   48512  12608  sendmail
         33092  32987   0  c000000205218b00  UN   0.1   48512  12608  sendmail
         33154   4567  16  c000000260e51780  UN   0.1   48832  12864  pickup
         33209   4241  36  c000000270cb6500  UN   0.1   18624  11712  sshd
         33473  33283   0  c000000205211480  UN   0.1   48512  12672  sendmail
         33531   4241  37  c00000023c902780  UN   0.1   18624  11648  sshd
      
      EEH handler hung while bnx2x sleeping and holding RTNL lock
      ===========================================================
      crash> bt 213
      PID: 213    TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
        #0 [c000000004d477e0] __schedule at c000000000c70808
        #1 [c000000004d478b0] schedule at c000000000c70ee0
        #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
        #3 [c000000004d479c0] msleep at c0000000002120cc
        #4 [c000000004d479f0] napi_disable at c000000000a06448
                                              ^^^^^^^^^^^^^^^^
        #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
        #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
        #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
        #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
        #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64
      
      And the sleeping source code
      ============================
      crash> dis -ls c000000000a06448
      FILE: ../net/core/dev.c
      LINE: 6702
      
         6697  {
         6698          might_sleep();
         6699          set_bit(NAPI_STATE_DISABLE, &n->state);
         6700
         6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
      * 6702                  msleep(1);
         6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
         6704                  msleep(1);
         6705
         6706          hrtimer_cancel(&n->timer);
         6707
         6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
         6709  }
      
      EEH calls into bnx2x twice based on the system log above, first through
      bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes
      the following call chains:
      
      bnx2x_io_error_detected()
        +-> bnx2x_eeh_nic_unload()
             +-> bnx2x_del_all_napi()
                  +-> __netif_napi_del()
      
      bnx2x_io_slot_reset()
        +-> bnx2x_netif_stop()
             +-> bnx2x_napi_disable()
                  +->napi_disable()
      
      Fix this by correcting the order of the NAPI API calls,
      that is, delete the NAPI only after disabling it.
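
The ordering constraint can be made explicit with a toy lifecycle model. The states and functions below are hypothetical and are not the kernel's NAPI state bits or API; in particular, where this model returns an error for a disable after delete, the trace above shows the real driver sleeping in napi_disable() instead.

```c
#include <stdbool.h>

/* Hypothetical lifecycle states for a toy polling context. */
enum toy_napi_state { TOY_ENABLED, TOY_DISABLED, TOY_DELETED };

struct toy_napi {
    enum toy_napi_state state;
};

/* Disabling is only valid while the instance still exists. */
static bool toy_napi_disable(struct toy_napi *n)
{
    if (n->state == TOY_DELETED)
        return false;   /* wrong order: disable after delete */
    n->state = TOY_DISABLED;
    return true;
}

/* Toy rule: deleting is only valid once the instance is disabled. */
static bool toy_napi_del(struct toy_napi *n)
{
    if (n->state != TOY_DISABLED)
        return false;
    n->state = TOY_DELETED;
    return true;
}

/* Teardown in the corrected order: disable first, then delete. */
static bool toy_teardown(struct toy_napi *n)
{
    return toy_napi_disable(n) && toy_napi_del(n);
}
```

Calling `toy_teardown()` on an enabled instance succeeds, and any later `toy_napi_disable()` on the deleted instance is rejected, which mirrors the sequence the two call chains above ran in the wrong order.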
      
      Fixes: 7fa6f340 ("bnx2x: AER revised")
      Reported-by: David Christensen <drc@linux.vnet.ibm.com>
      Tested-by: David Christensen <drc@linux.vnet.ibm.com>
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: Ariel Elior <aelior@marvell.com>
      Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 27 Apr, 2022 8 commits
    • tls: Skip tls_append_frag on zero copy size · a0df7194
      Maxim Mikityanskiy authored
      Calling tls_append_frag when max_open_record_len == record->len might
      add an empty fragment to the TLS record if the call happens to be on the
      page boundary. Normally tls_append_frag coalesces the zero-sized
      fragment to the previous one, but not if it's on page boundary.
      
      If a resync happens then, the mlx5 driver posts dump WQEs in
      tx_post_resync_dump, and the empty fragment may become a data segment
      with byte_count == 0, which will confuse the NIC and lead to a CQE
      error.
      
      This commit fixes the described issue by skipping tls_append_frag on
      zero size to avoid adding empty fragments. The fix is not in the driver,
      because an empty fragment is hardly the desired behavior.
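
The guard described above can be sketched in isolation. The `toy_record` structure and both functions are hypothetical and much simpler than the TLS offload code; the sketch only assumes that the amount to copy is the smaller of the requested size and the room left in the record.

```c
#include <stddef.h>

/* Hypothetical record: total payload length and number of fragments. */
struct toy_record {
    size_t len;
    size_t nfrags;
};

/* Unconditionally adds one fragment of the given size. */
static void toy_append_frag(struct toy_record *r, size_t size)
{
    r->nfrags++;
    r->len += size;
}

/*
 * Copies up to 'want' bytes into the record without exceeding max_len.
 * When the record is already full, copy is 0 and the append is skipped,
 * so no empty fragment is ever created.
 */
static size_t toy_push_data(struct toy_record *r, size_t max_len, size_t want)
{
    size_t room = max_len - r->len;
    size_t copy = want < room ? want : room;

    if (copy)
        toy_append_frag(r, copy);
    return copy;
}
```

With a full record the fragment count stays unchanged, whereas a record with spare room gains exactly one non-empty fragment.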
      
      Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20220426154949.159055-1-maximmi@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 347cb5de
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-04-27
      
      We've added 5 non-merge commits during the last 20 day(s) which contain
      a total of 6 files changed, 34 insertions(+), 12 deletions(-).
      
      The main changes are:
      
      1) Fix xsk sockets when rx and tx are separately bound to the same umem, also
         fix xsk copy mode combined with busy poll, from Maciej Fijalkowski.
      
      2) Fix BPF tunnel/collect_md helpers with bpf_xmit lwt hook usage which triggered
         a crash due to invalid metadata_dst access, from Eyal Birger.
      
      3) Fix release of page pool in XDP live packet mode, from Toke Høiland-Jørgensen.
      
      4) Fix potential NULL pointer dereference in kretprobes, from Adam Zabrocki.
      
         (Masami & Steven preferred this small fix to be routed via bpf tree given it's
          follow-up fix to Masami's rethook work that went via bpf earlier, too.)
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        xsk: Fix possible crash when multiple sockets are created
        kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is set
        bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
        bpf: Fix release of page_pool in BPF_PROG_RUN in test runner
        xsk: Fix l2fwd for copy mode + busy poll combo
      ====================
      
      Link: https://lore.kernel.org/r/20220427212748.9576-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: conntrack: fix udp offload timeout sysctl · 626873c4
      Volodymyr Mytnyk authored
      The `nf_flowtable_udp_timeout` sysctl option is available only
      if CONFIG_NFT_FLOW_OFFLOAD is enabled. But the infrastructure for
      this flow offload UDP timeout was added under the
      CONFIG_NF_FLOW_TABLE config option. So, if you have
      CONFIG_NFT_FLOW_OFFLOAD disabled and CONFIG_NF_FLOW_TABLE enabled,
      `nf_flowtable_udp_timeout` is not present in sysfs.
      Please note that the TCP flow offload timeout sysctl option
      is present even if CONFIG_NFT_FLOW_OFFLOAD is disabled.
      
      I suppose it was a typo in the commit that added the UDP flow offload
      timeout, and CONFIG_NF_FLOW_TABLE should be used instead.
      
      Fixes: 975c5750 ("netfilter: conntrack: Introduce udp offload timeout configuration")
      Signed-off-by: Volodymyr Mytnyk <volodymyr.mytnyk@plvision.eu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_conntrack_tcp: re-init for syn packets only · c7aab4f1
      Florian Westphal authored
      Jaco Kroon reported TCP problems that Eric Dumazet and Neal Cardwell
      pinpointed to a bug in nf_conntrack's tcp_in_window().
      
      tcp trace shows following sequence:
      
      I > R Flags [S], seq 3451342529, win 62580, options [.. tfo [|tcp]>
      R > I Flags [S.], seq 2699962254, ack 3451342530, win 65535, options [..]
      R > I Flags [P.], seq 1:89, ack 1, [..]
      
      Note that the 3rd packet is an ACK from responder to initiator, so the following branch is taken:
          } else if (((state->state == TCP_CONNTRACK_SYN_SENT
                     && dir == IP_CT_DIR_ORIGINAL)
                     || (state->state == TCP_CONNTRACK_SYN_RECV
                     && dir == IP_CT_DIR_REPLY))
                     && after(end, sender->td_end)) {
      
      ... because state == TCP_CONNTRACK_SYN_RECV and dir is REPLY.
      This causes the scaling factor to be reset to 0: window scale option
      is only present in syn(ack) packets.  This in turn makes nf_conntrack
      mark valid packets as out-of-window.
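
Why a scaling factor of 0 turns valid packets into out-of-window ones can be shown with a small arithmetic sketch. The helper names and the single-comparison window check are simplifying assumptions and are not the logic of tcp_in_window(); the only fact relied on is that the advertised 16-bit window is shifted left by the negotiated scale.

```c
#include <stdbool.h>
#include <stdint.h>

/* Effective receive window: raw 16-bit field shifted by the scale. */
static uint32_t toy_effective_window(uint16_t raw_win, uint8_t wscale)
{
    return (uint32_t)raw_win << wscale;
}

/*
 * Simplified check: a segment ending at seq_end is accepted if it lies
 * within the effective window that starts at 'base'.
 */
static bool toy_in_window(uint32_t seq_end, uint32_t base,
                          uint16_t raw_win, uint8_t wscale)
{
    return seq_end - base <= toy_effective_window(raw_win, wscale);
}
```

For a raw window of 512 and a negotiated scale of 7, the effective window is 65536 bytes and a segment ending 2000 bytes past the base is accepted. With the scale reset to 0 the same segment exceeds the 512-byte window and is rejected.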
      
      This was always broken; it exists even in the original commit where
      window tracking was added to ip_conntrack (the nf_conntrack
      predecessor) in the 2.6.9-rc1 kernel.
      
      Restrict to 'tcph->syn', just like the 3rd conditional added in
      commit 82b72cb9 ("netfilter: conntrack: re-init state for retransmitted syn-ack").
      
      Upon closer look, those conditionals/branches can be merged:
      
      Because earlier checks prevent syn-ack from showing up in
      original direction, the 'dir' checks in the conditional quoted above are
      redundant, remove them. Return early for pure syn retransmitted in reply
      direction (simultaneous open).
      
      Fixes: 9fb9cbb1 ("[NETFILTER]: Add nf_conntrack subsystem.")
      Reported-by: Jaco Kroon <jaco@uls.co.za>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · a1bde8c9
      David S. Miller authored
      
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-04-26
      
      This series contains updates to ice driver only.
      
      Ivan Vecera removes races related to VF message processing by changing
      mutex_trylock() call to mutex_lock() and moving additional operations
      to occur under mutex.
      
      Petr Oros increases wait time after firmware flash as current time is
      not sufficient.
      
      Jake resolves a use-after-free issue for mailbox snapshot.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK · 71cffebf
      Martin Blumenstingl authored
      Commit 4b592324 ("net: dsa: lantiq_gswip: Configure all remaining
      GSWIP_MII_CFG bits") added all known bits in the GSWIP_MII_CFGp
      register. It helped bring this register into a well-defined state so the
      driver has to rely less on the bootloader to do things right.
      Unfortunately it also sets the GSWIP_MII_CFG_RMII_CLK bit without any
      possibility to configure it. Upon further testing it turns out that all
      boards which are supported by the GSWIP driver in OpenWrt which use an
      RMII PHY have a dedicated oscillator on the board which provides the
      50MHz RMII reference clock.
      
      Don't set the GSWIP_MII_CFG_RMII_CLK bit (but keep the code which always
      clears it) to fix support for the Fritz!Box 7362 SL in OpenWrt. This is
      a board with two Atheros AR8030 RMII PHYs. With the "RMII clock" bit set
      the MAC also generates the RMII reference clock whose signal then
      conflicts with the signal from the oscillator on the board. This results
      in a constant cycle of the PHY detecting link up/down (and as a result
      of that: the two ports using the AR8030 PHYs are not working).
      
      At the time of writing this patch there's no known board where the MAC
      (GSWIP) has to generate the RMII reference clock. If needed this can be
      implemented in future by providing a device-tree flag so the
      GSWIP_MII_CFG_RMII_CLK bit can be toggled per port.
      
      Fixes: 4b592324 ("net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits")
      Tested-by: default avatarJan Hoffmann <jan@3e8.eu>
      Signed-off-by: default avatarMartin Blumenstingl <martin.blumenstingl@googlemail.com>
      Acked-by: default avatarHauke Mehrtens <hauke@hauke-m.de>
      Link: https://lore.kernel.org/r/20220425152027.2220750-1-martin.blumenstingl@googlemail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      71cffebf
    • Sebastian Andrzej Siewior's avatar
      net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Sebastian Andrzej Siewior authored
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already disabled preemption. This results in fewer
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
      
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 26 Apr, 2022 14 commits
    • Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted · 9b3628d7
      Luiz Augusto von Dentz authored
      This attempts to clean up the hci_conn if it cannot be aborted, as
      otherwise the controller and host stack would likely end up out of
      sync with respect to the connection handle.
      
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • Bluetooth: hci_event: Fix creating hci_conn object on error status · aef2aa4f
      Luiz Augusto von Dentz authored
      It is useless to create a hci_conn object on error status, as the
      object would just be freed in the process, and the error is in any
      case likely the result of the controller and host stack being out of
      sync.
      
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • Bluetooth: hci_event: Fix checking for invalid handle on error status · c86cc5a3
      Luiz Augusto von Dentz authored
      Commit d5ebaa7c introduces checks for handle range
      (e.g. HCI_CONN_HANDLE_MAX) but controllers like Intel AX200 don't seem
      to respect the valid range in case of error status:
      
      > HCI Event: Connect Complete (0x03) plen 11
              Status: Page Timeout (0x04)
              Handle: 65535
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
              Link type: ACL (0x01)
              Encryption: Disabled (0x00)
      [1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle
      
      Because of this, it is impossible to clean up the connections
      properly, since the stack would attempt to cancel a connection that
      is no longer in progress, causing the following trace:
      
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      = bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice
      	gateway SDP record: Connection timed out
      > HCI Event: Command Complete (0x0e) plen 10
            Create Connection Cancel (0x01|0x0008) ncmd 1
              Status: Unknown Connection Identifier (0x02)
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      
      Fixes: d5ebaa7c ("Bluetooth: hci_event: Ignore multiple conn complete events")
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • ice: fix use-after-free when deinitializing mailbox snapshot · b668f4cd
      Jacob Keller authored
      During ice_sriov_configure, if num_vfs is 0, we are being asked by the
      kernel to remove all VFs.
      
      The driver first de-initializes the snapshot before freeing all the VFs.
      This results in a use-after-free BUG detected by KASAN. The bug occurs
      because the snapshot can still be accessed until all VFs are removed.
      
      Fix this by freeing all the VFs first before calling
      ice_mbx_deinit_snapshot.
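
The teardown order described above can be sketched with a toy model in which per-VF message handling writes into a shared snapshot buffer. All names (`toy_pf`, `toy_handle_vf_msg`, and so on) are hypothetical and do not reproduce the ice driver; the sketch only assumes that the snapshot is reachable for as long as any VF exists.

```c
#include <stdlib.h>

/* Hypothetical PF state: a snapshot buffer with one counter per VF. */
struct toy_pf {
    int *snapshot;
    int num_vfs;
};

static int toy_pf_init(struct toy_pf *pf, int num_vfs)
{
    pf->snapshot = calloc((size_t)num_vfs, sizeof(*pf->snapshot));
    if (!pf->snapshot)
        return -1;
    pf->num_vfs = num_vfs;
    return 0;
}

/* Message handling touches the snapshot only for VFs that still exist. */
static int toy_handle_vf_msg(struct toy_pf *pf, int vf_id)
{
    if (vf_id < 0 || vf_id >= pf->num_vfs)
        return -1;
    pf->snapshot[vf_id]++;
    return 0;
}

static void toy_free_vfs(struct toy_pf *pf)
{
    pf->num_vfs = 0;
}

static void toy_deinit_snapshot(struct toy_pf *pf)
{
    free(pf->snapshot);
    pf->snapshot = NULL;
}

/* Corrected order: remove all VFs first, then release the snapshot. */
static void toy_teardown(struct toy_pf *pf)
{
    toy_free_vfs(pf);
    toy_deinit_snapshot(pf);
}
```

Because no VF remains once the buffer is released, a late message in this model is rejected by the range check rather than writing into freed memory.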
      
      [  +0.032591] ==================================================================
      [  +0.000021] BUG: KASAN: use-after-free in ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
      [  +0.000315] Write of size 28 at addr ffff889908eb6f28 by task kworker/55:2/1530996
      
      [  +0.000029] CPU: 55 PID: 1530996 Comm: kworker/55:2 Kdump: loaded Tainted: G S        I       5.17.0-dirty #1
      [  +0.000022] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 1.6.13 12/17/2018
      [  +0.000013] Workqueue: ice ice_service_task [ice]
      [  +0.000279] Call Trace:
      [  +0.000012]  <TASK>
      [  +0.000011]  dump_stack_lvl+0x33/0x42
      [  +0.000030]  print_report.cold.13+0xb2/0x6b3
      [  +0.000028]  ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
      [  +0.000295]  kasan_report+0xa5/0x120
      [  +0.000026]  ? __switch_to_asm+0x21/0x70
      [  +0.000024]  ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
      [  +0.000298]  kasan_check_range+0x183/0x1e0
      [  +0.000019]  memset+0x1f/0x40
      [  +0.000018]  ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
      [  +0.000304]  ? ice_conv_link_speed_to_virtchnl+0x160/0x160 [ice]
      [  +0.000297]  ? ice_vsi_dis_spoofchk+0x40/0x40 [ice]
      [  +0.000305]  ice_is_malicious_vf+0x1aa/0x250 [ice]
      [  +0.000303]  ? ice_restore_all_vfs_msi_state+0x160/0x160 [ice]
      [  +0.000297]  ? __mutex_unlock_slowpath.isra.15+0x410/0x410
      [  +0.000022]  ? ice_debug_cq+0xb7/0x230 [ice]
      [  +0.000273]  ? __kasan_slab_alloc+0x2f/0x90
      [  +0.000022]  ? memset+0x1f/0x40
      [  +0.000017]  ? do_raw_spin_lock+0x119/0x1d0
      [  +0.000022]  ? rwlock_bug.part.2+0x60/0x60
      [  +0.000024]  __ice_clean_ctrlq+0x3a6/0xd60 [ice]
      [  +0.000273]  ? newidle_balance+0x5b1/0x700
      [  +0.000026]  ? ice_print_link_msg+0x2f0/0x2f0 [ice]
      [  +0.000271]  ? update_cfs_group+0x1b/0x140
      [  +0.000018]  ? load_balance+0x1260/0x1260
      [  +0.000022]  ? ice_process_vflr_event+0x27/0x130 [ice]
      [  +0.000301]  ice_service_task+0x136e/0x1470 [ice]
      [  +0.000281]  process_one_work+0x3b4/0x6c0
      [  +0.000030]  worker_thread+0x65/0x660
      [  +0.000023]  ? __kthread_parkme+0xe4/0x100
      [  +0.000021]  ? process_one_work+0x6c0/0x6c0
      [  +0.000020]  kthread+0x179/0x1b0
      [  +0.000018]  ? kthread_complete_and_exit+0x20/0x20
      [  +0.000022]  ret_from_fork+0x22/0x30
      [  +0.000026]  </TASK>
      
      [  +0.000018] Allocated by task 10742:
      [  +0.000013]  kasan_save_stack+0x1c/0x40
      [  +0.000018]  __kasan_kmalloc+0x84/0xa0
      [  +0.000016]  kmem_cache_alloc_trace+0x16c/0x2e0
      [  +0.000015]  intel_iommu_probe_device+0xeb/0x860
      [  +0.000015]  __iommu_probe_device+0x9a/0x2f0
      [  +0.000016]  iommu_probe_device+0x43/0x270
      [  +0.000015]  iommu_bus_notifier+0xa7/0xd0
      [  +0.000015]  blocking_notifier_call_chain+0x90/0xc0
      [  +0.000017]  device_add+0x5f3/0xd70
      [  +0.000014]  pci_device_add+0x404/0xa40
      [  +0.000015]  pci_iov_add_virtfn+0x3b0/0x550
      [  +0.000016]  sriov_enable+0x3bb/0x600
      [  +0.000013]  ice_ena_vfs+0x113/0xa79 [ice]
      [  +0.000293]  ice_sriov_configure.cold.17+0x21/0xe0 [ice]
      [  +0.000291]  sriov_numvfs_store+0x160/0x200
      [  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
      [  +0.000018]  new_sync_write+0x21d/0x330
      [  +0.000013]  vfs_write+0x376/0x410
      [  +0.000013]  ksys_write+0xba/0x150
      [  +0.000012]  do_syscall_64+0x3a/0x80
      [  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [  +0.000028] Freed by task 10742:
      [  +0.000011]  kasan_save_stack+0x1c/0x40
      [  +0.000015]  kasan_set_track+0x21/0x30
      [  +0.000016]  kasan_set_free_info+0x20/0x30
      [  +0.000012]  __kasan_slab_free+0x104/0x170
      [  +0.000016]  kfree+0x9b/0x470
      [  +0.000013]  devres_destroy+0x1c/0x20
      [  +0.000015]  devm_kfree+0x33/0x40
      [  +0.000012]  ice_mbx_deinit_snapshot+0x39/0x70 [ice]
      [  +0.000295]  ice_sriov_configure+0xb0/0x260 [ice]
      [  +0.000295]  sriov_numvfs_store+0x1bc/0x200
      [  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
      [  +0.000016]  new_sync_write+0x21d/0x330
      [  +0.000012]  vfs_write+0x376/0x410
      [  +0.000012]  ksys_write+0xba/0x150
      [  +0.000012]  do_syscall_64+0x3a/0x80
      [  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [  +0.000024] Last potentially related work creation:
      [  +0.000010]  kasan_save_stack+0x1c/0x40
      [  +0.000016]  __kasan_record_aux_stack+0x98/0xa0
      [  +0.000013]  insert_work+0x34/0x160
      [  +0.000015]  __queue_work+0x20e/0x650
      [  +0.000016]  queue_work_on+0x4c/0x60
      [  +0.000015]  nf_nat_masq_schedule+0x297/0x2e0 [nf_nat]
      [  +0.000034]  masq_device_event+0x5a/0x60 [nf_nat]
      [  +0.000031]  raw_notifier_call_chain+0x5f/0x80
      [  +0.000017]  dev_close_many+0x1d6/0x2c0
      [  +0.000015]  unregister_netdevice_many+0x4e3/0xa30
      [  +0.000015]  unregister_netdevice_queue+0x192/0x1d0
      [  +0.000014]  iavf_remove+0x8f9/0x930 [iavf]
      [  +0.000058]  pci_device_remove+0x65/0x110
      [  +0.000015]  device_release_driver_internal+0xf8/0x190
      [  +0.000017]  pci_stop_bus_device+0xb5/0xf0
      [  +0.000014]  pci_stop_and_remove_bus_device+0xe/0x20
      [  +0.000016]  pci_iov_remove_virtfn+0x19c/0x230
      [  +0.000015]  sriov_disable+0x4f/0x170
      [  +0.000014]  ice_free_vfs+0x9a/0x490 [ice]
      [  +0.000306]  ice_sriov_configure+0xb8/0x260 [ice]
      [  +0.000294]  sriov_numvfs_store+0x1bc/0x200
      [  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
      [  +0.000016]  new_sync_write+0x21d/0x330
      [  +0.000012]  vfs_write+0x376/0x410
      [  +0.000012]  ksys_write+0xba/0x150
      [  +0.000012]  do_syscall_64+0x3a/0x80
      [  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [  +0.000025] The buggy address belongs to the object at ffff889908eb6f00
                     which belongs to the cache kmalloc-96 of size 96
      [  +0.000016] The buggy address is located 40 bytes inside of
                     96-byte region [ffff889908eb6f00, ffff889908eb6f60)
      
      [  +0.000026] The buggy address belongs to the physical page:
      [  +0.000010] page:00000000b7e99a2e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1908eb6
      [  +0.000016] flags: 0x57ffffc0000200(slab|node=1|zone=2|lastcpupid=0x1fffff)
      [  +0.000024] raw: 0057ffffc0000200 ffffea0069d9fd80 dead000000000002 ffff88810004c780
      [  +0.000015] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      [  +0.000009] page dumped because: kasan: bad access detected
      
      [  +0.000016] Memory state around the buggy address:
      [  +0.000012]  ffff889908eb6e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  +0.000014]  ffff889908eb6e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  +0.000014] >ffff889908eb6f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  +0.000011]                                   ^
      [  +0.000013]  ffff889908eb6f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  +0.000013]  ffff889908eb7000: fa fb fb fb fb fb fb fb fc fc fc fc fa fb fb fb
      [  +0.000012] ==================================================================
      
      Fixes: 0891c896 ("ice: warn about potentially malicious VFs")
      Reported-by: Slawomir Laba <slawomirx.laba@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      b668f4cd
    • Petr Oros's avatar
      ice: wait 5 s for EMP reset after firmware flash · b537752e
      Petr Oros authored
      We need to wait 5 s for EMP reset after firmware flash. Code was extracted
      from OOT driver (ice v1.8.3 downloaded from sourceforge). Without this
      wait, fw_activate leaves the card in an inconsistent state that is
      recoverable only by a second flash/activate. Flash was tested on these
      firmware versions:
      From -> To
       3.00 -> 3.10/3.20
       3.10 -> 3.00/3.20
       3.20 -> 3.00/3.10
      
      Reproducer:
      [root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin
      Preparing to flash
      [fw.mgmt] Erasing
      [fw.mgmt] Erasing done
      [fw.mgmt] Flashing 100%
      [fw.mgmt] Flashing done 100%
      [fw.undi] Erasing
      [fw.undi] Erasing done
      [fw.undi] Flashing 100%
      [fw.undi] Flashing done 100%
      [fw.netlist] Erasing
      [fw.netlist] Erasing done
      [fw.netlist] Flashing 100%
      [fw.netlist] Flashing done 100%
      Activate new firmware by devlink reload
      [root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate
      reload_actions_performed:
          fw_activate
      [root@host ~]# ip link show ens7f0
      71: ens7f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
          link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff
          altname enp202s0f0
      
      dmesg after flash:
      [   55.120788] ice: Copyright (c) 2018, Intel Corporation.
      [   55.274734] ice 0000:ca:00.0: Get PHY capabilities failed status = -5, continuing anyway
      [   55.569797] ice 0000:ca:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.28.0
      [   55.603629] ice 0000:ca:00.0: Get PHY capability failed.
      [   55.608951] ice 0000:ca:00.0: ice_init_nvm_phy_type failed: -5
      [   55.647348] ice 0000:ca:00.0: PTP init successful
      [   55.675536] ice 0000:ca:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
      [   55.685365] ice 0000:ca:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
      [   55.692179] ice 0000:ca:00.0: Commit DCB Configuration to the hardware
      [   55.701382] ice 0000:ca:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:c9:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
      A reboot doesn't help; only a second flash/activate with the OOT or patched
      driver puts the card back in a consistent state.
      
      After patch:
      [root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin
      Preparing to flash
      [fw.mgmt] Erasing
      [fw.mgmt] Erasing done
      [fw.mgmt] Flashing 100%
      [fw.mgmt] Flashing done 100%
      [fw.undi] Erasing
      [fw.undi] Erasing done
      [fw.undi] Flashing 100%
      [fw.undi] Flashing done 100%
      [fw.netlist] Erasing
      [fw.netlist] Erasing done
      [fw.netlist] Flashing 100%
      [fw.netlist] Flashing done 100%
      Activate new firmware by devlink reload
      [root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate
      reload_actions_performed:
          fw_activate
      [root@host ~]# ip link show ens7f0
      19: ens7f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff
          altname enp202s0f0
      
      Fixes: 399e27db ("ice: support immediate firmware activation via devlink reload")
      Signed-off-by: Petr Oros <poros@redhat.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      b537752e
    • Ivan Vecera's avatar
      ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg() · 77d64d28
      Ivan Vecera authored
      The previous patch, "ice: Fix incorrect locking in
      ice_vc_process_vf_msg()", fixed an issue with ignored messages
      sent by the VF driver, but a small race window still remains.
      
      Recently caught trace during 'ip link set ... vf 0 vlan ...' operation:
      
      [ 7332.995625] ice 0000:3b:00.0: Clearing port VLAN on VF 0
      [ 7333.001023] iavf 0000:3b:01.0: Reset indication received from the PF
      [ 7333.007391] iavf 0000:3b:01.0: Scheduling reset task
      [ 7333.059575] iavf 0000:3b:01.0: PF returned error -5 (IAVF_ERR_PARAM) to our request 3
      [ 7333.059626] ice 0000:3b:00.0: Invalid message from VF 0, opcode 3, len 4, error -1
      
      Setting of VLAN for VF causes a reset of the affected VF using
      ice_reset_vf() function that runs with cfg_lock taken:
      
      1. ice_notify_vf_reset() informs IAVF driver that reset is needed and
         IAVF schedules its own reset procedure
      2. Bit ICE_VF_STATE_DIS is set in vf->vf_state
      3. Misc initialization steps
      4. ice_sriov_post_vsi_rebuild() -> ice_vf_set_initialized() and that
         clears ICE_VF_STATE_DIS in vf->vf_state
      
      Step 3 is the race window mentioned above: the IAVF reset procedure runs
      in parallel, and one of its steps sends a VIRTCHNL_OP_GET_VF_RESOURCES
      message (opcode==3). This message is handled in ice_vc_process_vf_msg(),
      and if it is received during the race window it is marked as invalid and
      an error is returned to the VF driver.
      
      Protect vf_state check in ice_vc_process_vf_msg() by cfg_lock to avoid
      this race condition.
      
      Fixes: e6ba5273 ("ice: Fix race conditions between virtchnl handling and VF ndo ops")
      Tested-by: Fei Liu <feliu@redhat.com>
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      77d64d28
    • Ivan Vecera's avatar
      ice: Fix incorrect locking in ice_vc_process_vf_msg() · aaf461af
      Ivan Vecera authored
      Usage of mutex_trylock() in ice_vc_process_vf_msg() is incorrect
      because a message sent from a VF is ignored and never processed when
      the lock is contended.
      
      Use mutex_lock() instead to fix the issue. It is safe because this
      mutex is used to prevent races between VF related NDOs and
      handlers processing request messages from VF and these handlers
      are running in ice_service_task() context. Additionally move this
      mutex lock prior to the ice_vc_is_opcode_allowed() call to avoid potential
      races during allowlist access.
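
The difference between the two locking calls can be sketched with POSIX mutexes in user space. This is an analogy under stated assumptions, not the driver code: `struct vf_ctx` and both handler functions are hypothetical, and `pthread_mutex_trylock()`/`pthread_mutex_lock()` stand in for the kernel's `mutex_trylock()`/`mutex_lock()`.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical per-VF context; cfg_lock mirrors the mutex named above. */
struct vf_ctx {
    pthread_mutex_t cfg_lock;
    int handled;                /* number of messages actually processed */
};

/* Old pattern: if the lock is busy, the message is dropped for good. */
int process_msg_trylock(struct vf_ctx *vf)
{
    if (pthread_mutex_trylock(&vf->cfg_lock) != 0)
        return -EBUSY;          /* caller never retries, message is lost */
    vf->handled++;
    pthread_mutex_unlock(&vf->cfg_lock);
    return 0;
}

/* Fixed pattern: wait for the lock, then always process the message. */
int process_msg_lock(struct vf_ctx *vf)
{
    pthread_mutex_lock(&vf->cfg_lock);
    vf->handled++;
    pthread_mutex_unlock(&vf->cfg_lock);
    return 0;
}
```

With a default (non-recursive) mutex, calling `process_msg_trylock()` while the lock is held returns `-EBUSY` and leaves `handled` unchanged, which is the "ignored message" symptom; `process_msg_lock()` would instead block until the holder releases the lock.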
      
      Fixes: e6ba5273 ("ice: Fix race conditions between virtchnl handling and VF ndo ops")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      aaf461af
    • Maciej Fijalkowski's avatar
      xsk: Fix possible crash when multiple sockets are created · ba3beec2
      Maciej Fijalkowski authored
      Fix a crash that happens if an Rx only socket is created first, then a
      second socket is created that is Tx only and bound to the same umem as
      the first socket and also the same netdev and queue_id together with the
      XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page
      pool was not created by the first socket as it was an Rx only socket.
      When the second socket is bound it needs this tx_descs array of this
      shared page pool as it has a Tx component, but unfortunately it was
      never allocated, leading to a crash. Note that this array is only used
      for zero-copy drivers using the batched Tx APIs, currently only ice and
      i40e.
      
      [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [ 5511.158419] #PF: supervisor write access in kernel mode
      [ 5511.164472] #PF: error_code(0x0002) - not-present page
      [ 5511.170416] PGD 0 P4D 0
      [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
      [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E     5.18.0-rc1+ #97
      [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
      [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
      [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
      [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
      [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
      [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
      [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
      [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
      [ 5511.274387] FS:  0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
      [ 5511.283746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
      [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5511.315142] Call Trace:
      [ 5511.317972]  <IRQ>
      [ 5511.320301]  ice_xmit_zc+0x68/0x2f0 [ice]
      [ 5511.324977]  ? ktime_get+0x38/0xa0
      [ 5511.328913]  ice_napi_poll+0x7a/0x6a0 [ice]
      [ 5511.333784]  __napi_poll+0x2c/0x160
      [ 5511.337821]  net_rx_action+0xdd/0x200
      [ 5511.342058]  __do_softirq+0xe6/0x2dd
      [ 5511.346198]  irq_exit_rcu+0xb5/0x100
      [ 5511.350339]  common_interrupt+0xa4/0xc0
      [ 5511.354777]  </IRQ>
      [ 5511.357201]  <TASK>
      [ 5511.359625]  asm_common_interrupt+0x1e/0x40
      [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
      [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
      [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
      [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
      [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
      [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
      [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
      [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
      [ 5511.480300]  cpuidle_enter+0x29/0x40
      [ 5511.494329]  do_idle+0x1c7/0x250
      [ 5511.507610]  cpu_startup_entry+0x19/0x20
      [ 5511.521394]  start_kernel+0x649/0x66e
      [ 5511.534626]  secondary_startup_64_no_verify+0xc3/0xcb
      [ 5511.549230]  </TASK>
      
      Detect such case during bind() and allocate this memory region via newly
      introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as
      for other buffer pool allocations, so that it matches the kvfree() from
      xp_destroy().
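
The bind-time check can be sketched in user-space C. This is a simplified model under stated assumptions: `struct buf_pool`, `pool_alloc_tx_descs()` and `bind_shared()` are hypothetical names, and `calloc()`/`free()` stand in for the kernel's `kvcalloc()`/`kvfree()`.

```c
#include <stdlib.h>

/* Descriptor layout modelled on a Tx descriptor: address, length, options. */
struct tx_desc {
    unsigned long long addr;
    unsigned int len;
    unsigned int options;
};

/* Hypothetical shared buffer pool. */
struct buf_pool {
    struct tx_desc *tx_descs;   /* NULL until a Tx-capable socket needs it */
    unsigned int tx_entries;
};

/* Allocate the Tx descriptor scratch array; returns 0, or -1 on failure. */
int pool_alloc_tx_descs(struct buf_pool *pool, unsigned int entries)
{
    pool->tx_descs = calloc(entries, sizeof(*pool->tx_descs));
    if (!pool->tx_descs)
        return -1;
    pool->tx_entries = entries;
    return 0;
}

/* Bind one more socket to an already existing shared pool. */
int bind_shared(struct buf_pool *pool, int has_tx, unsigned int tx_entries)
{
    /* An Rx-only first socket left tx_descs unset, so allocate it now. */
    if (has_tx && !pool->tx_descs)
        return pool_alloc_tx_descs(pool, tx_entries);
    return 0;
}

void pool_destroy(struct buf_pool *pool)
{
    free(pool->tx_descs);       /* must pair with the allocator used above */
    pool->tx_descs = NULL;
    pool->tx_entries = 0;
}
```

Binding an Rx-only socket leaves `tx_descs` as NULL; a later Tx-capable bind on the same pool triggers the allocation instead of dereferencing the missing array.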
      
      Fixes: d1bc532e ("i40e: xsk: Move tmp desc array from driver to pool")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
      ba3beec2
    • Adam Zabrocki's avatar
      kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is set · 1d661ed5
      Adam Zabrocki authored
      The recent kernel change in 73f9b911 ("kprobes: Use rethook for kretprobe
      if possible") introduced a potential NULL pointer dereference bug in the
      KRETPROBE mechanism. The official Kprobes documentation states that "Any or
      all handlers can be NULL". Unfortunately, the return handler is not checked
      for NULL before it is called, which violates this requirement and can
      result in a NULL pointer dereference.
      
      This patch adds such verification in kretprobe_rethook_handler() function.
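
The kind of check involved can be sketched as follows. This is a user-space illustration under stated assumptions: `struct ret_probe`, `on_function_return()` and `record_retval()` are hypothetical and far simpler than the kernel's kretprobe structures.

```c
#include <stddef.h>

struct ret_probe;
typedef int (*ret_handler_t)(struct ret_probe *rp, long retval);

/* Hypothetical probe: the return handler is optional and may be NULL. */
struct ret_probe {
    ret_handler_t handler;
    int hits;
};

/* Invoked when the probed function returns. */
int on_function_return(struct ret_probe *rp, long retval)
{
    rp->hits++;
    if (!rp->handler)           /* verification: never call through NULL */
        return 0;
    return rp->handler(rp, retval);
}

/* Example handler that simply reports the observed return value. */
int record_retval(struct ret_probe *rp, long retval)
{
    (void)rp;
    return (int)retval;
}
```

A probe registered without a return handler is still counted as hit, but the call through the function pointer is skipped.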
      
      Fixes: 73f9b911 ("kprobes: Use rethook for kretprobe if possible")
      Signed-off-by: Adam Zabrocki <pi3@pi3.com.pl>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Anil S. Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Link: https://lore.kernel.org/bpf/20220422164027.GA7862@pi3.com.pl
      1d661ed5
    • Nikolay Aleksandrov's avatar
      virtio_net: fix wrong buf address calculation when using xdp · acb16b39
      Nikolay Aleksandrov authored
      We received a report[1] of kernel crashes when Cilium is used in XDP
      mode with virtio_net after updating to newer kernels. After
      investigating the reason it turned out that when using mergeable bufs
      with an XDP program which adjusts xdp.data or xdp.data_meta, page_to_skb()
      calculates the build_skb address wrong because the offset can become less
      than the headroom, so it gets an address in the previous page (-X bytes,
      depending on how much lower the offset is):
       page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
      
      This is a pr_err() I added in the beginning of page_to_skb which clearly
      shows offset that is less than headroom by adding 4 bytes of metadata
      via an xdp prog. The calculations done are:
       receive_mergeable():
       headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
       offset = xdp.data - page_address(xdp_page) -
                vi->hdr_len - metasize;
      
       page_to_skb():
       p = page_address(page) + offset;
       ...
       buf = p - headroom;
      
      Now buf goes -4 bytes from the page's starting address as can be seen
      above which is set as skb->head and skb->data by build_skb later. Depending
      on what's done with the skb (when it's freed most often) we get all kinds
      of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
      the new headroom after the xdp program has run, similar to how offset
      and len are recalculated. Headroom is directly related to
      data_hard_start, data and data_meta, so we use them to get the new size.
      The result is correct (similar pr_err() in page_to_skb, one case of
      xdp_page and one case of virtnet buf):
       a) Case with 4 bytes of metadata
       [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
       [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
       b) Case of pushing data +32 bytes
       [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
       [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
       c) Case of pushing data -33 bytes
       [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
       [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
      
      Offset and headroom are equal because offset points to the start of
      reserved bytes for the virtio_net header which are at buf start +
      headroom, while data points at buf start + vnet hdr size + headroom so
      when data or data_meta are adjusted by the xdp prog both the headroom size
      and the offset change equally. We can use data_hard_start to compute the
      new headroom after the xdp prog (linearized / page start case, the
      virtnet buf case is similar just with bigger base offset):
       xdp.data_hard_start = page_address + vnet_hdr
       xdp.data = page_address + vnet_hdr + headroom
       new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize
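
The arithmetic above can be checked with a small model in which every pointer is an offset from the page start (linearized xdp_page case). It is a sketch under stated assumptions: `struct xdp_layout` and the helper names are hypothetical, and a vnet header length of 12 bytes is taken from the `hdr_len 12` value printed in one of the traces below.

```c
#include <stddef.h>

#define XDP_HEADROOM 256        /* VIRTIO_XDP_HEADROOM in the text above */

/* Offsets relative to the page start. */
struct xdp_layout {
    ptrdiff_t data_hard_start;  /* page start + vnet header */
    ptrdiff_t data;             /* start of packet data */
    ptrdiff_t metasize;         /* metadata bytes placed before data */
};

/* Layout before the XDP program runs. */
struct xdp_layout layout_init(ptrdiff_t hdr_len)
{
    struct xdp_layout l = { hdr_len, hdr_len + XDP_HEADROOM, 0 };
    return l;
}

/* Offset as computed in receive_mergeable(). */
ptrdiff_t buf_offset(const struct xdp_layout *l, ptrdiff_t hdr_len)
{
    return l->data - hdr_len - l->metasize;
}

/* Headroom recomputed after the XDP program has run. */
ptrdiff_t new_headroom(const struct xdp_layout *l)
{
    return l->data - l->data_hard_start - l->metasize;
}

/* buf start relative to the page, as in page_to_skb(): p - headroom.
 * A negative result means the address falls in the previous page. */
ptrdiff_t buf_start(ptrdiff_t offset, ptrdiff_t headroom)
{
    return offset - headroom;
}
```

With 4 bytes of metadata the offset is 252, so the fixed 256-byte headroom yields a buf start of -4 (the `...1ffc` address in the first log line), while the recomputed headroom of 252 yields 0. Moving `data` by +32 or -33 bytes gives 288 and 223 for both offset and headroom, matching cases b) and c).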
      
      An example reproducer xdp prog[3] is below.
      
      [1] https://github.com/cilium/cilium/issues/19453
      
      [2] Two of the many traces:
       [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
       [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
       [   41.300891] kernel BUG at include/linux/mm.h:720!
       [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
       [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
       [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
       [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
       [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
       [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
       [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
       [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
       [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
       [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
       [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
       [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
       [   41.321387] Call Trace:
       [   41.321819]  <TASK>
       [   41.322193]  skb_release_data+0x13f/0x1c0
       [   41.322902]  __kfree_skb+0x20/0x30
       [   41.343870]  tcp_recvmsg_locked+0x671/0x880
       [   41.363764]  tcp_recvmsg+0x5e/0x1c0
       [   41.384102]  inet_recvmsg+0x42/0x100
       [   41.406783]  ? sock_recvmsg+0x1d/0x70
       [   41.428201]  sock_read_iter+0x84/0xd0
       [   41.445592]  ? 0xffffffffa3000000
       [   41.462442]  new_sync_read+0x148/0x160
       [   41.479314]  ? 0xffffffffa3000000
       [   41.496937]  vfs_read+0x138/0x190
       [   41.517198]  ksys_read+0x87/0xc0
       [   41.535336]  do_syscall_64+0x3b/0x90
       [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [   41.568050] RIP: 0033:0x48765b
       [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
       [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
       [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
       [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
       [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
       [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
       [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
       [   41.744254]  </TASK>
       [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
      
       and
      
       [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
       [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
       [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
       [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
       [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
       [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
       [   33.532471] page dumped because: nonzero mapcount
       [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
       [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
       [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
       [   33.532484] Call Trace:
       [   33.532496]  <TASK>
       [   33.532500]  dump_stack_lvl+0x45/0x5a
       [   33.532506]  bad_page.cold+0x63/0x94
       [   33.532510]  free_pcp_prepare+0x290/0x420
       [   33.532515]  free_unref_page+0x1b/0x100
       [   33.532518]  skb_release_data+0x13f/0x1c0
       [   33.532524]  kfree_skb_reason+0x3e/0xc0
       [   33.532527]  ip6_mc_input+0x23c/0x2b0
       [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
       [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
      
      [3] XDP program to reproduce(xdp_pass.c):
       #include <linux/bpf.h>
       #include <bpf/bpf_helpers.h>
      
       SEC("xdp_pass")
       int xdp_pkt_pass(struct xdp_md *ctx)
       {
                bpf_xdp_adjust_head(ctx, -(int)32);
                return XDP_PASS;
       }
      
       char _license[] SEC("license") = "GPL";
      
       compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
       load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
      
      CC: stable@vger.kernel.org
      CC: Jason Wang <jasowang@redhat.com>
      CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: "Michael S. Tsirkin" <mst@redhat.com>
      CC: virtualization@lists.linux-foundation.org
      Fixes: 8fb7da9e ("virtio_net: get build_skb() buf by data ptr")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220425103703.3067292-1-razor@blackwall.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      acb16b39
    • Nathan Rossi's avatar
      net: dsa: mv88e6xxx: Fix port_hidden_wait to account for port_base_addr · 24cbdb91
      Nathan Rossi authored
      The other port_hidden functions rely on the port_read/port_write
      functions to access the hidden control port. These functions apply the
      offset for port_base_addr where applicable. Update port_hidden_wait to
      use the port_wait_bit so that port_base_addr offsets are accounted for
      when waiting for the busy bit to change.
      
      Without the offset the port_hidden_wait function would timeout on
      devices that have a non-zero port_base_addr (e.g. MV88E6141), however
      devices that have a zero port_base_addr would operate correctly (e.g.
      MV88E6390).
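
The effect of the missing offset can be sketched as plain address arithmetic. This is an illustration under stated assumptions: `struct chip_info` and both helpers are hypothetical, and the base value 0x10 used in the note below is only an example of a non-zero `port_base_addr`.

```c
#include <stdint.h>

/* Hypothetical chip description; only the field discussed above is modelled. */
struct chip_info {
    uint8_t port_base_addr;     /* device address of port 0 */
};

/* Address used by the port accessors: base plus port index. */
uint8_t port_dev_addr(const struct chip_info *info, uint8_t port)
{
    return (uint8_t)(info->port_base_addr + port);
}

/* Address used by a wait helper that ignores the base (the bug). */
uint8_t port_dev_addr_no_base(uint8_t port)
{
    return port;
}
```

For a base of 0x10, port 4 is reached at 0x14 by the accessors but polled at 0x04 by the helper that ignores the base, so the busy bit never changes and the wait times out. For a base of 0 both helpers agree, which is why such devices were unaffected.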
      
      Fixes: 60907013 ("net: dsa: mv88e6xxx: update code operating on hidden registers")
      Signed-off-by: Nathan Rossi <nathan@nathanrossi.com>
      Reviewed-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220425070454.348584-1-nathan@nathanrossi.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      24cbdb91
    • Baruch Siach's avatar
      net: phy: marvell10g: fix return value on error · 0ed9704b
      Baruch Siach authored
      Return the error value that we get from phy_read_mmd().
      
      Fixes: c84786fa ("net: phy: marvell10g: read copper results from CSSR1")
      Signed-off-by: Baruch Siach <baruch.siach@siklu.com>
      Reviewed-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/f47cb031aeae873bb008ba35001607304a171a20.1650868058.git.baruch@tkos.co.il
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      0ed9704b
    • Jonathan Lemon's avatar
      net: bcmgenet: hide status block before TX timestamping · acac0541
      Jonathan Lemon authored
      The hardware checksum offloading requires use of a transmit
      status block inserted before the outgoing frame data. This was
      updated in 9a9ba2a4 ("net: bcmgenet: always enable status blocks").
      
      However, skb_tx_timestamp() assumes that it is passed a raw frame
      and PTP parsing chokes on this status block.
      
      Fix this by calling __skb_pull(), which hides the TSB before calling
      skb_tx_timestamp(), so an outgoing PTP packet is parsed correctly.
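
What "hiding" a leading block means can be sketched with a simplified buffer. This is an illustration only: `struct pkt`, `pkt_pull()` and the 64-byte `TSB_LEN` are assumptions made for the example and are not the kernel's `struct sk_buff` or the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified packet buffer; not the kernel's struct sk_buff. */
struct pkt {
    unsigned char *data; /* start of the bytes visible to parsers */
    size_t len;          /* number of visible bytes */
};

#define TSB_LEN 64 /* assumed status block size, for illustration only */

/* Hide the first n bytes by advancing data and shrinking len.
 * Nothing is copied or freed; the underlying memory is untouched. */
static unsigned char *pkt_pull(struct pkt *p, size_t n)
{
    assert(n <= p->len);
    p->data += n;
    p->len -= n;
    return p->data;
}
```

After the pull, a parser that starts at `p->data` sees the frame header first. An address captured before the pull, which is the role the separately stored DMA address plays in the text, still points at the status block.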
      
      As the data in the skb has already been set up for DMA, and the
      dma_unmap_* calls use a separately stored address, there is
      no effective change in the data transmission.
      
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20220424165307.591145-1-jonathan.lemon@gmail.com
      Fixes: d03825fb ("net: bcmgenet: add skb_tx_timestamp call")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      acac0541
    • Lin Ma's avatar
      mctp: defer the kfree of object mdev->addrs · b561275d
      Lin Ma authored
      The function mctp_unregister() reclaims the device's relevant resources
      when a netcard detaches. However, a running routine may be unaware of
      this and cause a use-after-free of the mdev->addrs object.
      
      The race condition can be demonstrated below:
      
       cleanup thread               another thread
                                |
      unregister_netdev()       |  mctp_sendmsg()
      ...                       |    ...
        mctp_unregister()       |    rt = mctp_route_lookup()
          ...                   |    mctp_local_output()
          kfree(mdev->addrs)    |      ...
                                |      saddr = rt->dev->addrs[0];
                                |
      
      An attacker can use the (recently provided) mctp-serial driver with a pty
      to fake the device detaching and use userfaultfd to increase the
      chance of winning the race (in mctp_sendmsg). The KASAN report for such
      a PoC is shown below:
      
      [   86.051955] ==================================================================
      [   86.051955] BUG: KASAN: use-after-free in mctp_local_output+0x4e9/0xb7d
      [   86.051955] Read of size 1 at addr ffff888005f298c0 by task poc/295
      [   86.051955]
      [   86.051955] Call Trace:
      [   86.051955]  <TASK>
      [   86.051955]  dump_stack_lvl+0x33/0x42
      [   86.051955]  print_report.cold.13+0xb2/0x6b3
      [   86.051955]  ? preempt_schedule_irq+0x57/0x80
      [   86.051955]  ? mctp_local_output+0x4e9/0xb7d
      [   86.051955]  kasan_report+0xa5/0x120
      [   86.051955]  ? mctp_local_output+0x4e9/0xb7d
      [   86.051955]  mctp_local_output+0x4e9/0xb7d
      [   86.051955]  ? mctp_dev_set_key+0x79/0x79
      [   86.051955]  ? copyin+0x38/0x50
      [   86.051955]  ? _copy_from_iter+0x1b6/0xf20
      [   86.051955]  ? sysvec_apic_timer_interrupt+0x97/0xb0
      [   86.051955]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
      [   86.051955]  ? mctp_local_output+0x1/0xb7d
      [   86.051955]  mctp_sendmsg+0x64d/0xdb0
      [   86.051955]  ? mctp_sk_close+0x20/0x20
      [   86.051955]  ? __fget_light+0x2fd/0x4f0
      [   86.051955]  ? mctp_sk_close+0x20/0x20
      [   86.051955]  sock_sendmsg+0xdd/0x110
      [   86.051955]  __sys_sendto+0x1cc/0x2a0
      [   86.051955]  ? __ia32_sys_getpeername+0xa0/0xa0
      [   86.051955]  ? new_sync_write+0x335/0x550
      [   86.051955]  ? alloc_file+0x22f/0x500
      [   86.051955]  ? __ip_do_redirect+0x820/0x1820
      [   86.051955]  ? vfs_write+0x44d/0x7b0
      [   86.051955]  ? vfs_write+0x44d/0x7b0
      [   86.051955]  ? fput_many+0x15/0x120
      [   86.051955]  ? ksys_write+0x155/0x1b0
      [   86.051955]  ? __ia32_sys_read+0xa0/0xa0
      [   86.051955]  __x64_sys_sendto+0xd8/0x1b0
      [   86.051955]  ? exit_to_user_mode_prepare+0x2f/0x120
      [   86.051955]  ? syscall_exit_to_user_mode+0x12/0x20
      [   86.051955]  do_syscall_64+0x3a/0x80
      [   86.051955]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   86.051955] RIP: 0033:0x7f82118a56b3
      [   86.051955] RSP: 002b:00007ffdb154b110 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
      [   86.051955] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f82118a56b3
      [   86.051955] RDX: 0000000000000010 RSI: 00007f8211cd4000 RDI: 0000000000000007
      [   86.051955] RBP: 00007ffdb154c1d0 R08: 00007ffdb154b164 R09: 000000000000000c
      [   86.051955] R10: 0000000000000000 R11: 0000000000000293 R12: 000055d779800db0
      [   86.051955] R13: 00007ffdb154c2b0 R14: 0000000000000000 R15: 0000000000000000
      [   86.051955]  </TASK>
      [   86.051955]
      [   86.051955] Allocated by task 295:
      [   86.051955]  kasan_save_stack+0x1c/0x40
      [   86.051955]  __kasan_kmalloc+0x84/0xa0
      [   86.051955]  mctp_rtm_newaddr+0x242/0x610
      [   86.051955]  rtnetlink_rcv_msg+0x2fd/0x8b0
      [   86.051955]  netlink_rcv_skb+0x11c/0x340
      [   86.051955]  netlink_unicast+0x439/0x630
      [   86.051955]  netlink_sendmsg+0x752/0xc00
      [   86.051955]  sock_sendmsg+0xdd/0x110
      [   86.051955]  __sys_sendto+0x1cc/0x2a0
      [   86.051955]  __x64_sys_sendto+0xd8/0x1b0
      [   86.051955]  do_syscall_64+0x3a/0x80
      [   86.051955]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   86.051955]
      [   86.051955] Freed by task 301:
      [   86.051955]  kasan_save_stack+0x1c/0x40
      [   86.051955]  kasan_set_track+0x21/0x30
      [   86.051955]  kasan_set_free_info+0x20/0x30
      [   86.051955]  __kasan_slab_free+0x104/0x170
      [   86.051955]  kfree+0x8c/0x290
      [   86.051955]  mctp_dev_notify+0x161/0x2c0
      [   86.051955]  raw_notifier_call_chain+0x8b/0xc0
      [   86.051955]  unregister_netdevice_many+0x299/0x1180
      [   86.051955]  unregister_netdevice_queue+0x210/0x2f0
      [   86.051955]  unregister_netdev+0x13/0x20
      [   86.051955]  mctp_serial_close+0x6d/0xa0
      [   86.051955]  tty_ldisc_kill+0x31/0xa0
      [   86.051955]  tty_ldisc_hangup+0x24f/0x560
      [   86.051955]  __tty_hangup.part.28+0x2ce/0x6b0
      [   86.051955]  tty_release+0x327/0xc70
      [   86.051955]  __fput+0x1df/0x8b0
      [   86.051955]  task_work_run+0xca/0x150
      [   86.051955]  exit_to_user_mode_prepare+0x114/0x120
      [   86.051955]  syscall_exit_to_user_mode+0x12/0x20
      [   86.051955]  do_syscall_64+0x46/0x80
      [   86.051955]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   86.051955]
      [   86.051955] The buggy address belongs to the object at ffff888005f298c0
      [   86.051955]  which belongs to the cache kmalloc-8 of size 8
      [   86.051955] The buggy address is located 0 bytes inside of
      [   86.051955]  8-byte region [ffff888005f298c0, ffff888005f298c8)
      [   86.051955]
      [   86.051955] The buggy address belongs to the physical page:
      [   86.051955] flags: 0x100000000000200(slab|node=0|zone=1)
      [   86.051955] raw: 0100000000000200 dead000000000100 dead000000000122 ffff888005c42280
      [   86.051955] raw: 0000000000000000 0000000080660066 00000001ffffffff 0000000000000000
      [   86.051955] page dumped because: kasan: bad access detected
      [   86.051955]
      [   86.051955] Memory state around the buggy address:
      [   86.051955]  ffff888005f29780: 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc 00
      [   86.051955]  ffff888005f29800: fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc 00 fc
      [   86.051955] >ffff888005f29880: fc fc fc fb fc fc fc fc fa fc fc fc fc fa fc fc
      [   86.051955]                                            ^
      [   86.051955]  ffff888005f29900: fc fc 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc
      [   86.051955]  ffff888005f29980: fc 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc
      [   86.051955] ==================================================================
      
      To this end, just like commit e0448092 ("Bluetooth: defer
      cleanup of resources in hci_unregister_dev()"), this patch defers the
      destructive kfree(mdev->addrs) from mctp_unregister to mctp_dev_put,
      where the refcount of mdev is zero and the entire device is reclaimed.
      This prevents the use-after-free because the sendmsg thread holds the
      reference of mdev in the mctp_route object.
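
The deferral idea can be sketched in user space as follows. This is a simplified model, not the kernel code: `struct dev` and the `dev_*()` helpers are hypothetical, and the counter is a plain `int` rather than an atomic reference count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device object; names are illustrative, not the kernel's. */
struct dev {
    int refs;             /* reference count (not atomic in this sketch) */
    unsigned char *addrs; /* address array that senders may dereference */
    int registered;
};

static struct dev *dev_create(unsigned char addr)
{
    struct dev *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    d->addrs = malloc(1);
    if (!d->addrs) {
        free(d);
        return NULL;
    }
    d->addrs[0] = addr;
    d->refs = 1; /* reference owned by the registration */
    d->registered = 1;
    return d;
}

static void dev_hold(struct dev *d)
{
    d->refs++;
}

/* Drop one reference. Returns 1 if it was the last one, in which case
 * addrs and the object itself are freed here and nowhere else. */
static int dev_put(struct dev *d)
{
    assert(d->refs > 0);
    if (--d->refs > 0)
        return 0;
    free(d->addrs);
    free(d);
    return 1;
}

/* Unregister marks the device gone and drops the registration reference.
 * It deliberately does not free addrs while other holders may exist. */
static int dev_unregister(struct dev *d)
{
    d->registered = 0;
    return dev_put(d);
}
```

A sender that took its own reference before the unregister can still read `d->addrs[0]` safely; the memory is released only when that sender calls `dev_put()`.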
      
      Fixes: 583be982 ("mctp: Add device handling and netlink interface")
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Link: https://lore.kernel.org/r/20220422114340.32346-1-linma@zju.edu.cn
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b561275d
  5. 25 Apr, 2022 5 commits
    • Jakub Kicinski's avatar
      Merge branch 'net-smc-two-fixes-for-smc-fallback' · c3e8d5a4
      Jakub Kicinski authored
      Wen Gu says:
      
      ====================
      net/smc: Two fixes for smc fallback
      
      This patch set includes two fixes for smc fallback:
      
      Patch 1/2 introduces some simple helpers to wrap the replacement
      and restore of clcsock's callback functions. Make sure that only
      the original callbacks will be saved and not overwritten.
      
      Patch 2/2 fixes a slab-out-of-bounds issue reported by syzbot, where
      smc_fback_error_report() accesses the already freed smc sock (see
      https://lore.kernel.org/r/00000000000013ca8105d7ae3ada@google.com/).
      The patch fixes it by resetting sk_user_data and restoring the clcsock
      callback functions in time when falling back.
      
      But it should be noted that although patch 2/2 can fix the issue
      of 'slab-out-of-bounds/use-after-free in smc_fback_error_report',
      it can't pass the syzbot reproducer test, because after applying
      these two patches upstream, the syzbot reproducer triggered another
      known issue like this:
      
      ==================================================================
      BUG: KASAN: use-after-free in tcp_retransmit_timer+0x2ef3/0x3360 net/ipv4/tcp_timer.c:511
      Read of size 8 at addr ffff888020328380 by task udevd/4158
      
      CPU: 1 PID: 4158 Comm: udevd Not tainted 5.18.0-rc3-syzkaller-00074-gb05a5683-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_address_description.constprop.0.cold+0xeb/0x467 mm/kasan/report.c:313
        print_report mm/kasan/report.c:429 [inline]
        kasan_report.cold+0xf4/0x1c6 mm/kasan/report.c:491
        tcp_retransmit_timer+0x2ef3/0x3360 net/ipv4/tcp_timer.c:511
        tcp_write_timer_handler+0x5e6/0xbc0 net/ipv4/tcp_timer.c:622
        tcp_write_timer+0xa2/0x2b0 net/ipv4/tcp_timer.c:642
        call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
        expire_timers kernel/time/timer.c:1466 [inline]
        __run_timers.part.0+0x679/0xa80 kernel/time/timer.c:1737
        __run_timers kernel/time/timer.c:1715 [inline]
        run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1750
        __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
        invoke_softirq kernel/softirq.c:432 [inline]
        __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
        irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
        sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       ...
      (the detailed report can be found at https://syzkaller.appspot.com/text?tag=CrashReport&x=15406b44f00000)
      
      IMHO, the above issue is the same as this known one: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed,
      and it doesn't seem to be related to SMC. The discussion about this known issue is ongoing and can be found at
      https://lore.kernel.org/bpf/000000000000f75af905d3ba0716@google.com/T/.
      
      When I added the temporary solution mentioned in the above discussion on
      top of my two patches, the syzbot reproducer of 'slab-out-of-bounds/
      use-after-free in smc_fback_error_report' no longer triggered any issue.
      ====================
      
      Link: https://lore.kernel.org/r/1650614179-11529-1-git-send-email-guwen@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c3e8d5a4
    • Wen Gu's avatar
      net/smc: Fix slab-out-of-bounds issue in fallback · 0558226c
      Wen Gu authored
      syzbot reported a slab-out-of-bounds/use-after-free issue,
      which was caused by accessing an already freed smc sock in
      fallback-specific callback functions of clcsock.
      
      This patch fixes the issue by restoring fallback-specific
      callback functions to original ones and resetting clcsock
      sk_user_data to NULL before freeing smc sock.
      
      Meanwhile, this patch introduces sk_callback_lock to make
      the access and assignment to sk_user_data mutually exclusive.
      
      Reported-by: syzbot+b425899ed22c6943e00b@syzkaller.appspotmail.com
      Fixes: 341adeec ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
      Link: https://lore.kernel.org/r/00000000000013ca8105d7ae3ada@google.com/
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0558226c
    • Wen Gu's avatar
      net/smc: Only save the original clcsock callback functions · 97b9af7a
      Wen Gu authored
      Both the listen and the fallback process save the current clcsock
      callback functions and establish new ones. But if both of them
      happen, the saved callback functions will be overwritten.
      
      So this patch introduces some helpers to ensure that only
      the original callback functions of clcsock are saved.
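
A save-once helper pair along these lines can be sketched in user space. The names `cb_replace()` and `cb_restore()` and the three dummy callbacks are hypothetical; the sketch only illustrates the idea of never overwriting an already saved original.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*cb_fn)(void);

/* Install new_cb in *target. The previous value is remembered in *saved
 * only on the first replacement, so a second replacement cannot overwrite
 * the original callback with an intermediate one. */
static void cb_replace(cb_fn *target, cb_fn new_cb, cb_fn *saved)
{
    assert(target && saved && new_cb);
    if (*saved == NULL)
        *saved = *target;
    *target = new_cb;
}

/* Put the original callback back and clear the saved slot. */
static void cb_restore(cb_fn *target, cb_fn *saved)
{
    assert(target && saved);
    if (*saved == NULL)
        return; /* nothing was replaced */
    *target = *saved;
    *saved = NULL;
}

/* Dummy callbacks used to tell the three states apart. */
static void original_cb(void) {}
static void listen_cb(void) {}
static void fallback_cb(void) {}
```

If the listen path and then the fallback path both call `cb_replace()` on the same slot, the saved value is still `original_cb`, so a later `cb_restore()` reinstates the true original rather than `listen_cb`.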
      
      Fixes: 341adeec ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      97b9af7a
    • Eric Dumazet's avatar
      tcp: make sure treq->af_specific is initialized · ba5a4fdd
      Eric Dumazet authored
      syzbot complained about a recent change in the TCP stack,
      hitting a NULL pointer [1].
      
      TCP request sockets have an af_specific pointer, which
      was used before the blamed change only for SYNACK generation
      in non-SYNCOOKIE mode.
      
      TCP request sockets momentarily created when the third packet
      arrives from the client in SYNCOOKIE mode were not using
      treq->af_specific.
      
      Make sure this field is populated, in the same way normal
      TCP request sockets do in tcp_conn_request().
      
      [1]
      TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
      general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
      CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe48 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
      RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
      RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
      RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
      R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
       tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
       cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
       tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
       tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
       ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
       ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
       dst_input include/net/dst.h:461 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
       __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
       __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
       process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
       __napi_poll+0xb3/0x6e0 net/core/dev.c:6413
       napi_poll net/core/dev.c:6480 [inline]
       net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
      
      Fixes: 5b0b9e4c ("tcp: md5: incorrect tcp_header_len for incoming connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba5a4fdd
    • Eric Dumazet's avatar
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Eric Dumazet authored
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that TCP stack currently only generates
      EPOLLOUT from input path, when tp->snd_una has advanced
      and skb(s) cleaned from rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check if POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from output path.
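
The check described above can be sketched as a small predicate. This is a simplified model under assumptions: `struct snd_state` and the three helpers are hypothetical, only the field names `write_seq` and `snd_nxt` come from the text, and the real stack's wakeup condition may include further terms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified sender state; field names mirror the text. */
struct snd_state {
    uint32_t write_seq;     /* end of data queued by the application */
    uint32_t snd_nxt;       /* next sequence number to be sent */
    uint32_t notsent_lowat; /* wake the writer when notsent drops below this */
};

/* Queued-but-unsent bytes; unsigned arithmetic handles sequence wraparound. */
static uint32_t notsent_bytes(const struct snd_state *s)
{
    return s->write_seq - s->snd_nxt;
}

static bool should_wake_writer(const struct snd_state *s)
{
    return notsent_bytes(s) < s->notsent_lowat;
}

/* Output path: advance snd_nxt after sending len bytes and re-evaluate the
 * condition right away, instead of waiting for an ACK to move snd_una.
 * Returns true if a writable event should be raised. */
static bool on_bytes_sent(struct snd_state *s, uint32_t len)
{
    assert(len <= notsent_bytes(s));
    s->snd_nxt += len;
    return should_wake_writer(s);
}
```

In this model the writer is woken as soon as the unsent backlog drops below the threshold on transmit, even if no ACK arrives for a full RTT.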
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). The fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Doug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bfe744f