1. 27 Apr, 2022 14 commits
    • Merge branch 'mptcp-MP_FAIL-timeout' · 124de271
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: Timeout for MP_FAIL response
      
      When one peer sends an infinite mapping to coordinate fallback from
      MPTCP to regular TCP, the other peer is expected to send a packet with
      the MPTCP MP_FAIL option to acknowledge the infinite mapping. Rather
      than leave the connection in some half-fallback state, this series adds
      a timeout after which the infinite mapping sender will reset the
      connection.
      
      Patch 1 adds a fallback self test.
      
      Patches 2-5 make use of the MPTCP socket's retransmit timer to reset the
      MPTCP connection if no MP_FAIL was received.
      
      Patches 6 and 7 extend the self test to check MP_FAIL-related MIBs.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
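The timeout behaviour described in the series summary can be modelled as a small state machine. This is only an illustrative sketch, not the kernel code: the class, state names, and the `timeout` value are hypothetical, and the real implementation reuses the MPTCP socket timer rather than explicit timestamps.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    MPTCP = auto()             # normal multipath operation
    AWAITING_MP_FAIL = auto()  # infinite mapping sent, no acknowledgement yet
    FALLBACK_TCP = auto()      # peer acknowledged with MP_FAIL
    RESET = auto()             # no acknowledgement before the deadline


@dataclass
class InfiniteMapSender:
    timeout: float  # hypothetical timeout in seconds
    state: State = State.MPTCP
    deadline: Optional[float] = None

    def send_infinite_mapping(self, now: float) -> None:
        # Sending the infinite mapping arms the timer.
        self.state = State.AWAITING_MP_FAIL
        self.deadline = now + self.timeout

    def on_mp_fail(self, now: float) -> None:
        # An MP_FAIL that arrives in time completes the fallback.
        if self.state is State.AWAITING_MP_FAIL and now < self.deadline:
            self.state = State.FALLBACK_TCP
            self.deadline = None

    def on_timer(self, now: float) -> None:
        # Timer expiry without a response resets the connection
        # instead of leaving it half-fallen-back.
        if self.state is State.AWAITING_MP_FAIL and now >= self.deadline:
            self.state = State.RESET
            self.deadline = None
```

The point of the model is that `AWAITING_MP_FAIL` is never a resting state: it always ends in either `FALLBACK_TCP` or `RESET`.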
    • selftests: mptcp: print extra msg in chk_csum_nr · 53f368bf
      Geliang Tang authored
      When multiple checksum errors occur in chk_csum_nr(), print the
      number of errors as an extra message.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: check MP_FAIL response mibs · 1f7d325f
      Geliang Tang authored
      This patch extends chk_fail_nr to check the MP_FAIL response mibs.
      
      Add a new argument, invert, to chk_fail_nr so that it can check the
      MP_FAIL TX and RX mibs from the opposite direction.
      
      When the infinite map is received before the MP_FAIL response, the
      response will be lost. A '-' can be added into fail_tx or fail_rx to
      represent that MP_FAIL response TX or RX can be lost when doing the
      checks.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
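The '-' convention for fail_tx and fail_rx can be illustrated with a short sketch. The selftest itself is a shell script; the Python below is not taken from it, the function names are hypothetical, and the reading that "-N" accepts any count from 0 up to N is an assumption about what "can be lost" means for the check.

```python
from typing import Tuple


def parse_expected(spec: str) -> Tuple[int, bool]:
    """Split a spec such as "1" or "-1" into (count, may_be_lost)."""
    may_be_lost = spec.startswith("-")
    count = int(spec[1:]) if may_be_lost else int(spec)
    return count, may_be_lost


def counter_ok(spec: str, actual: int) -> bool:
    """Check a MIB counter against an expected spec.

    Assumption: a leading '-' means the MP_FAIL response may have been
    lost, so any value from 0 up to the expected count is accepted.
    """
    expected, may_be_lost = parse_expected(spec)
    if may_be_lost:
        return 0 <= actual <= expected
    return actual == expected
```

Under this reading, `counter_ok("-1", 0)` passes while `counter_ok("1", 0)` does not, which matches the case where the infinite map arrives before the MP_FAIL response.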
    • mptcp: reset subflow when MP_FAIL doesn't respond · 49fa1919
      Geliang Tang authored
      This patch adds a new msk->flags bit MPTCP_FAIL_NO_RESPONSE, then reuses
      sk_timer to trigger a check if we have not received a response from the
      peer after sending MP_FAIL. If the peer doesn't respond properly, reset
      the subflow.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add MP_FAIL response support · 9c81be0d
      Geliang Tang authored
      This patch adds a new struct member mp_fail_response_expect in struct
      mptcp_subflow_context to support MP_FAIL response. In the special case
      of a single subflow with a checksum error and contiguous data, an
      MP_FAIL is sent in response to another MP_FAIL.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add data lock for sk timers · 4293248c
      Geliang Tang authored
      mptcp_data_lock() needs to be held when manipulating the msk
      retransmit_timer or the sk sk_timer. This patch adds the data
      lock for both timers.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: use mptcp_stop_timer · bcf3cf93
      Geliang Tang authored
      Use the helper mptcp_stop_timer() instead of using sk_stop_timer() to
      stop icsk_retransmit_timer directly.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: add infinite map testcase · b6e074e1
      Geliang Tang authored
      Add the single subflow test case for MP_FAIL, to test the infinite
      mapping case. Use the test_linkfail value to make 128KB test files.
      
      Add a new function reset_with_fail(), which uses 'iptables' and 'tc
      action pedit' rules to produce the bit flips that trigger the checksum
      failures. Set validate_checksum to enable checksums for the MP_FAIL
      tests without passing the '-C' argument. Set the check_invert flag to
      enable the inverted-bytes check for the output data in check_transfer().
      Instead of the file mismatch error, this test prints out the inverted
      bytes.
      
      Add a new function pedit_action_pkts() to get the number of packets
      edited by the tc pedit actions. Print this number in the output.
      
      Also add the needed kernel config options to the selftests config file.
      Suggested-by: Davide Caratti <dcaratti@redhat.com>
      Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: dwmac-imx: comment spelling fix · b1190d51
      Marcel Ziswiler authored
      Fix spelling in comment.
      
      Fixes: 94abdad6 ("net: ethernet: dwmac: add ethernet glue logic for NXP imx8 chip")
      Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com>
      Link: https://lore.kernel.org/r/20220425154856.169499-1-marcel@ziswiler.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: remove comments that mention obsolete __SLOW_DOWN_IO · e39f63fe
      Bjorn Helgaas authored
      The only remaining definitions of __SLOW_DOWN_IO (for alpha and ia64) do
      nothing, and the only mentions in networking are in comments.  Remove these
      mentions.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: wan: atp: remove unused eeprom_delay() · dac173db
      Bjorn Helgaas authored
      atp.h is included only by atp.c, which does not use eeprom_delay().  Remove
      the unused definition.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: tls: fix async vs NIC crypto offload · c706b2b5
      Jakub Kicinski authored
      When NIC takes care of crypto (or the record has already
      been decrypted) we forget to update darg->async. ->async
      is supposed to mean whether record is async capable on
      input and whether record has been queued for async crypto
      on output.
      Reported-by: Gal Pressman <gal@nvidia.com>
      Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument")
      Tested-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
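The in-out contract for ->async described above can be sketched as follows. This is a simplified model and not the TLS code: `DecryptArgs`, `decrypt_record`, and the `already_decrypted` parameter are hypothetical names, and the software path is reduced to "queue for async crypto whenever the caller allows it".

```python
from dataclasses import dataclass


@dataclass
class DecryptArgs:
    # On input: may this record be handled asynchronously?
    # On output: was this record actually queued for async crypto?
    async_: bool


def decrypt_record(already_decrypted: bool, darg: DecryptArgs) -> None:
    if already_decrypted:
        # The NIC (or an earlier pass) did the work, so nothing is queued.
        # Leaving async_ at its input value here is the kind of stale
        # output the fix addresses.
        darg.async_ = False
        return
    # Simplified software path: the record is queued for async crypto
    # exactly when the caller said it was async capable, so the input
    # value is also the correct output value.
```

A caller that reads `darg.async_` after the call to decide whether to wait for completion would otherwise wait for work that was never queued.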
    • net: dsa: mt753x: fix pcs conversion regression · fae46308
      Russell King (Oracle) authored
      Daniel Golle reports that the conversion of mt753x to phylink PCS caused
      an oops as below.
      
      The problem is with the placement of the PCS initialisation, which
      occurs after mt7531_setup() has been called. However, buried in this
      function is a call to set up the CPU port, which requires the PCS
      structure to be already set up.
      
      Fix this by changing the initialisation order.
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      Mem abort info:
        ESR = 0x96000005
        EC = 0x25: DABT (current EL), IL = 32 bits
        SET = 0, FnV = 0
        EA = 0, S1PTW = 0
        FSC = 0x05: level 1 translation fault
      Data abort info:
        ISV = 0, ISS = 0x00000005
        CM = 0, WnR = 0
      user pgtable: 4k pages, 39-bit VAs, pgdp=0000000046057000
      [0000000000000020] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
      Internal error: Oops: 96000005 [#1] SMP
      Modules linked in:
      CPU: 0 PID: 32 Comm: kworker/u4:1 Tainted: G S 5.18.0-rc3-next-20220422+ #0
      Hardware name: Bananapi BPI-R64 (DT)
      Workqueue: events_unbound deferred_probe_work_func
      pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : mt7531_cpu_port_config+0xcc/0x1b0
      lr : mt7531_cpu_port_config+0xc0/0x1b0
      sp : ffffffc008d5b980
      x29: ffffffc008d5b990 x28: ffffff80060562c8 x27: 00000000f805633b
      x26: ffffff80001a8880 x25: 00000000000009c4 x24: 0000000000000016
      x23: ffffff8005eb6470 x22: 0000000000003600 x21: ffffff8006948080
      x20: 0000000000000000 x19: 0000000000000006 x18: 0000000000000000
      x17: 0000000000000001 x16: 0000000000000001 x15: 02963607fcee069e
      x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
      x11: ffffffc037302000 x10: 0000000000000870 x9 : ffffffc008d5b800
      x8 : ffffff800028f950 x7 : 0000000000000001 x6 : 00000000662b3000
      x5 : 00000000000002f0 x4 : 0000000000000000 x3 : ffffff800028f080
      x2 : 0000000000000000 x1 : ffffff800028f080 x0 : 0000000000000000
      Call trace:
       mt7531_cpu_port_config+0xcc/0x1b0
       mt753x_cpu_port_enable+0x24/0x1f0
       mt7531_setup+0x49c/0x5c0
       mt753x_setup+0x20/0x31c
       dsa_register_switch+0x8bc/0x1020
       mt7530_probe+0x118/0x200
       mdio_probe+0x30/0x64
       really_probe.part.0+0x98/0x280
       __driver_probe_device+0x94/0x140
       driver_probe_device+0x40/0x114
       __device_attach_driver+0xb0/0x10c
       bus_for_each_drv+0x64/0xa0
       __device_attach+0xa8/0x16c
       device_initial_probe+0x10/0x20
       bus_probe_device+0x94/0x9c
       deferred_probe_work_func+0x80/0xb4
       process_one_work+0x200/0x3a0
       worker_thread+0x260/0x4c0
       kthread+0xd4/0xe0
       ret_from_fork+0x10/0x20
      Code: 9409e911 937b7e60 8b0002a0 f9405800 (f9401005)
      ---[ end trace 0000000000000000 ]---
      Reported-by: Daniel Golle <daniel@makrotopia.org>
      Tested-by: Daniel Golle <daniel@makrotopia.org>
      Fixes: cbd1f243 ("net: dsa: mt7530: partially convert to phylink_pcs")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/E1nj6FW-007WZB-5Y@rmk-PC.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet authored
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows to move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick the skb before the recvmsg() thread. This issue is more
      visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
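The per-cpu deferral scheme described in this commit can be modelled in a few lines. This is a toy simulation, not the kernel data structure: the `Skb` and `DeferredFree` classes and their method names are hypothetical, and locking, the IPI path, and the lockless read of the list are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Skb:
    ident: int
    alloc_cpu: int  # cpu that originally allocated the buffer


class DeferredFree:
    def __init__(self) -> None:
        self.defer_list = defaultdict(list)  # cpu -> skbs waiting to be freed
        self.freed_on = {}                   # skb ident -> cpu that freed it

    def free_or_defer(self, skb: Skb, current_cpu: int) -> None:
        # No point deferring when we already run on the allocating cpu
        # (the v2 note "do not defer if alloc_cpu == smp_processor_id()").
        if skb.alloc_cpu == current_cpu:
            self.freed_on[skb.ident] = current_cpu
            return
        # Otherwise hand the skb back to the cpu that allocated it.
        self.defer_list[skb.alloc_cpu].append(skb)

    def drain(self, cpu: int) -> int:
        # Called at the end of that cpu's receive processing.
        pending, self.defer_list[cpu] = self.defer_list[cpu], []
        for skb in pending:
            self.freed_on[skb.ident] = cpu
        return len(pending)
```

In this model the cpu running recvmsg() only pays for a list append, while the actual free happens on the allocating cpu, which is the effect the before and after profiles above illustrate.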
  2. 26 Apr, 2022 4 commits
  3. 25 Apr, 2022 19 commits
  4. 23 Apr, 2022 3 commits
    • Merge branch 'dsa-selftests' · cfc1d91a
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA selftests
      
      When working on complex new features or reworks it becomes increasingly
      difficult to ensure there aren't regressions being introduced, and
      therefore it would be nice if we could go over the functionality we
      already have and write some tests for it.
      
      Verbally I know from Tobias Waldekranz that he has been working on some
      selftests for DSA, yet I have never seen them, so here I am adding some
      tests I have written which have been useful for me. The list is by no
      means complete (it only covers elementary functionality), but it's still
      good to have as a starting point. I also borrowed some refactoring
      changes from Joachim Wiberg that he submitted for his "net: bridge:
      forwarding of unknown IPv4/IPv6/MAC BUM traffic" series, but not the
      entirety of his selftests. I now think that his selftests have some
      overlap with bridge_vlan_unaware.sh and bridge_vlan_aware.sh and they
      should be more tightly integrated with each other - yet I didn't do that
      either :). Another issue I had with his selftests was that they jumped
      straight ahead to configure brport flags on br0 (a radical new idea
      still at RFC status) while we have bigger problems, and we don't have
      nearly enough coverage for the *existing* functionality.
      
      One idea introduced here which I haven't seen before is the symlinking
      of relevant forwarding selftests to the selftests/drivers/net/<my-driver>/
      folder, plus a forwarding.config file. I think there's some value in
      having things structured this way, since the forwarding dir has so many
      selftests that aren't relevant to DSA that it is a bit difficult to find
      the ones that are.
      
      While searching for applications that I could use for multicast testing
      (not my domain of interest/knowledge really), I found Joachim Wiberg's
      mtools, mcjoin and omping, and I tried them all with various degrees of
      success. In particular, I was going to use mcjoin, but I faced some
      issues getting IPv6 multicast traffic to work in a VRF, and I bothered
      David Ahern about it here:
      https://lore.kernel.org/netdev/97eaffb8-2125-834e-641f-c99c097b6ee2@gmail.com/t/
      It seems that the problem is that this application should use
      SO_BINDTODEVICE, yet it doesn't.
      
      So I ended up patching the bare-bones mtools (msend, mreceive) forked by
      Joachim from the University of Virginia's Multimedia Networks Group to
      include IPv6 support, and to use SO_BINDTODEVICE. This is what I'm using
      now for IPv6.
      
      Note that mausezahn doesn't appear to do a particularly good job of
      supporting IPv6 really, and I needed a program to emit the actual
      IP_ADD_MEMBERSHIP calls, for dev_mc_add(), so I could test RX filtering.
      Crafting the IGMP/MLD reports by hand doesn't really do the trick.
      While extremely bare-bones, the mreceive application now seems to do
      what I need it to.
      
      Feedback appreciated, it is very likely that I could have done things in
      a better way.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
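For readers unfamiliar with SO_BINDTODEVICE, a minimal sketch of binding a socket to an interface is shown below. It is not taken from mtools or mcjoin; the helper names are hypothetical, the option is Linux-specific, and the call may require elevated privileges depending on the kernel version.

```python
import socket

IFNAMSIZ = 16  # Linux interface-name buffer size, including the trailing NUL


def device_option(ifname: str) -> bytes:
    """Build the NUL-terminated option value passed to SO_BINDTODEVICE."""
    raw = ifname.encode("ascii")
    if not raw or len(raw) >= IFNAMSIZ:
        raise ValueError("interface name must be 1 to 15 characters")
    return raw + b"\0"


def open_bound_udp_socket(ifname: str, family: int = socket.AF_INET6) -> socket.socket:
    """Open a UDP socket whose traffic is restricted to one interface."""
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        device_option(ifname))
    except OSError:
        sock.close()
        raise
    return sock
```

Binding to an interface enslaved to a VRF is what lets the socket's traffic follow that VRF's routing table, which is the situation described above.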
    • selftests: drivers: dsa: add a subset of forwarding selftests · 07c8a2dd
      Vladimir Oltean authored
      This adds an initial subset of forwarding selftests which I considered
      to be relevant for DSA drivers, along with a forwarding.config that
      makes it easier to run them (disables veth pair creation, makes sure MAC
      addresses are unique and stable).
      
      The intention is to request driver writers to run these selftests during
      review and make sure that the tests pass, or at least that the problems
      are known.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add a test for local_termination.sh · 90b9566a
      Vladimir Oltean authored
      This tests the capability of switch ports to filter out undesired
      traffic. Different drivers are expected to have different capabilities
      here (so some may fail and some may pass), yet the test still has some
      value, for example to check for regressions.
      
      There are 2 kinds of failures, one is when a packet which should have
      been accepted isn't (and that should be fixed), and the other "failure"
      (as reported by the test) is when a packet could have been filtered out
      (for being unnecessary) yet it was received.
      
      The bridge driver fares particularly badly at this test:
      
      TEST: br0: Unicast IPv4 to primary MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to macvlan MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address                      [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Unicast IPv4 to unknown MAC address, promisc             [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address, allmulti            [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to joined group                           [ OK ]
      TEST: br0: Multicast IPv4 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv4 to unknown group, allmulti                [ OK ]
      TEST: br0: Multicast IPv6 to joined group                           [ OK ]
      TEST: br0: Multicast IPv6 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv6 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv6 to unknown group, allmulti                [ OK ]
      
      mainly because it does not implement IFF_UNICAST_FLT. Yet I still think
      having the test (with the failures) is useful in case somebody wants to
      tackle that problem in the future, to make an easy before-and-after
      comparison.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
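The expectations implied by the test names and results above can be written down as a small predicate. This is an inference from the listed output, not code from local_termination.sh; the function names and the exact rule are assumptions.

```python
def should_receive(is_multicast: bool, address_known: bool,
                   promisc: bool = False, allmulti: bool = False) -> bool:
    """Inferred expectation: should the port deliver this frame locally?"""
    if address_known or promisc:
        # Known unicast MAC, joined group, or promiscuous mode.
        return True
    # allmulti only widens reception for multicast, never for unicast.
    return is_multicast and allmulti


def classify(expected: bool, received: bool) -> str:
    """Map one test outcome onto the two kinds of failure described above."""
    if expected and not received:
        return "missing"     # a wanted packet was dropped: should be fixed
    if received and not expected:
        return "unfiltered"  # an unnecessary packet got through
    return "ok"
```

With this rule, the four br0 failures in the listing are all of the "unfiltered" kind: the frames went to an unknown address without promisc (or, for unicast, with only allmulti) and were still received.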