1. 22 May, 2022 36 commits
    • eth: mtk_eth_soc: silence the GCC 12 array-bounds warning · 06da3e8f
      Jakub Kicinski authored
      GCC 12 gets upset because in mtk_foe_entry_commit_subflow()
      this driver allocates a partial structure. The writes are
      within bounds.
      
      Silence these warnings for now; our build bot runs GCC 12,
      so we won't allow any new instances.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: offload tc action "ok" using an empty action vector · 4149af28
      Vladimir Oltean authored
      The "ok" tc action is useful when placed in front of a more generic
      filter to exclude some more specific rules from matching it.
      
      The ocelot switches can offload this tc action by creating an empty
      action vector (no _ENA fields set to 1). This makes sense for all of
      VCAP IS1, IS2 and ES0 (but not for PSFP).
      
      Add support for this action. Note that this makes the
      gact_drop_and_ok_test() selftest pass, where "action ok" is used in
      front of an "action drop" rule, both offloaded to VCAP IS2.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ocelot-selftests' · cb7f2d05
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Streamline Ocelot tc-chains selftest
      
      This series changes the output and the argument format of the Ocelot
      switch selftest so that it is more similar to what can be found in
      tools/testing/selftests/net/forwarding/.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: ocelot: tc_flower_chains: reorder interfaces · 4ea1396a
      Vladimir Oltean authored
      Use the standard interface order h1, swp1, swp2, h2 that is used by the
      forwarding selftest framework. The previous order was confusing even
      with the ASCII drawing. That isn't needed anymore.
      
      This also drops the fixed MAC addresses and uses STABLE_MAC_ADDRS, which
      ensures the MAC addresses are unique.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: ocelot: tc_flower_chains: use conventional interface names · 93196ef9
      Vladimir Oltean authored
      This is a robotic rename as follows:
      
      eth0 -> swp1
      eth1 -> swp2
      eth2 -> h2
      eth3 -> h1
      
      This brings the selftest more in line with the other forwarding
      selftests, where h1 is connected to swp1, and h2 to swp2.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: ocelot: tc_flower_chains: streamline test output · 980e74ca
      Vladimir Oltean authored
      Bring this driver-specific selftest output in line with the other
      selftests.
      
      Before:
      
      Testing VLAN pop..                      OK
      Testing VLAN push..                     OK
      Testing ingress VLAN modification..             OK
      Testing egress VLAN modification..              OK
      Testing frame prioritization..          OK
      
      After:
      
      TEST: VLAN pop                                                      [ OK ]
      TEST: VLAN push                                                     [ OK ]
      TEST: Ingress VLAN modification                                     [ OK ]
      TEST: Egress VLAN modification                                      [ OK ]
      TEST: Frame prioritization                                          [ OK ]
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: wrap the wireless pointers in struct net_device in an ifdef · c304eddc
      Jakub Kicinski authored
      Most protocol-specific pointers in struct net_device are under
      a respective ifdef. Wireless is the notable exception. Since
      there's a sizable number of custom-built kernels for datacenter
      workloads which don't build wireless it seems reasonable to
      ifdefy those pointers as well.
      
      While at it move IPv4 and IPv6 pointers up, those are special
      for obvious reasons.
      Acked-by: Johannes Berg <johannes@sipsolutions.net>
      Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154
      Acked-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fec: Do proper error checking for enet_out clk · 5ff851b7
      Uwe Kleine-König authored
      An error code returned by devm_clk_get() might have other meanings than
      "This clock doesn't exist". So use devm_clk_get_optional() and handle
      all remaining errors as fatal.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: DP83822: enable rgmii mode if phy_interface_is_rgmii · 621427fb
      Tommaso Merciai authored
      RGMII mode can be enabled through the DP83822 straps, and also by writing
      bit 9 of register 0x17 - RMII and Status Register (RCSR).
      When phy_interface_is_rgmii() is true, RGMII mode must be enabled, and it
      must be disabled otherwise; this guards against misconfigured hw straps.
      
      References:
       - https://www.ti.com/lit/gpn/dp83822i p66
      Signed-off-by: Tommaso Merciai <tommaso.merciai@amarulasolutions.com>
      Co-developed-by: Michael Trimarchi <michael@amarulasolutions.com>
      Suggested-by: Alberto Bianchi <alberto.bianchi@amarulasolutions.com>
      Tested-by: Tommaso Merciai <tommaso.merciai@amarulasolutions.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: selftests: Add stress_reuseport_listen to .gitignore · a3f7404c
      Muhammad Usama Anjum authored
      Add newly added stress_reuseport_listen object to .gitignore file.
      
      Fixes: ec8cb4f6 ("net: selftests: Stress reuseport listen")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rxrpc-misc' · baea40de
      David S. Miller authored
      David Howells says:
      
      ====================
      rxrpc: Miscellaneous changes
      
      Here are some miscellaneous changes for AF_RXRPC:
      
       (1) Allow the list of local endpoints to be viewed through /proc.
      
       (2) Switch to using refcount_t for refcounting.
      
       (3) Fix a locking issue found by lockdep.
      
       (4) Autogenerate tracing symbol enums from symbol->string maps to make it
           easier to keep them in sync.
      
       (5) Return an error to sendmsg() if a call it tried to set up failed.
           Because it failed at this point, no notification will be generated for
           recvmsg to pick up - but userspace still needs to know about the
           failure.
      
       (6) Fix the selection of abort codes generated by internal events.  In
           particular, rxrpc and kafs shouldn't be generating RX_USER_ABORT
           unless it's because userspace did something to cancel a call.
      
       (7) Adjust the interpretation and handling of certain ACK types to try and
           detect NAT changes causing a call to seem to start mid-flow from a
           different peer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • afs: Adjust ACK interpretation to try and cope with NAT · adc9613f
      David Howells authored
      If a client's address changes, say if it is NAT'd, this can disrupt an in
      progress operation.  For most operations, this is not much of a problem,
      but StoreData can be different as some servers modify the target file as
      the data comes in, so if a store request is disrupted, the file can get
      corrupted on the server.
      
      The problem is that the server doesn't recognise packets that come after
      the change of address as belonging to the original client and will bounce
      them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if
      the packet number falls within the initial sequence number window of a call
      or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting
      it.  In both cases, firstPacket will be 1 and previousPacket will be 0 in
      the ACK information.
      
      Fix this by the following means:
      
       (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1
           and previousPacket as 0, assume this indicates that the server saw the
           incoming packets from a different peer and thus as a different call.
           Fail the call with error -ENETRESET.
      
       (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the
           first packet has been hard-ACK'd.  If it hasn't been hard-ACK'd, the
           ACK packet will cause it to get retransmitted, so the call will just
           be repeated.
      
       (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of
           the operation.
      
       (4) Prioritise the error code over things like -ECONNRESET as the server
           did actually respond.
      
       (5) Make writeback treat -ENETRESET as a retryable error and make it
           redirty all the pages involved in a write so that the VM will retry.
      
      Note that there is still a circumstance that I can't easily deal with: if
      the operation is fully received and processed by the server, but the reply
      is lost due to address change.  There's no way to know if the op happened.
      We can examine the server, but a conflicting change could have been made by
      a third party - and we can't tell the difference.  In such a case, a
      message like:
      
          kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a)
      
      will be logged to dmesg on the next op to touch the file and the client
      will reset the inode state, including invalidating clean parts of the
      pagecache.
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1
      Signed-off-by: David S. Miller <davem@davemloft.net>
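Rules (1) and (2) of the afs/rxrpc NAT commit above can be read as a small decision function. The following is a user-space sketch only: the enum, struct, and function names are hypothetical and are not taken from the rxrpc sources. Only the conditions themselves (firstPacket being 1, previousPacket being 0, and whether the first packet was hard-ACK'd) come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ACK reasons; only the two named in the commit message. */
enum demo_ack_reason {
	DEMO_ACK_OTHER,
	DEMO_ACK_OUT_OF_SEQUENCE,
	DEMO_ACK_EXCEEDS_WINDOW,
};

/* Hypothetical summary of the ACK fields the rules look at. */
struct demo_ack_info {
	enum demo_ack_reason reason;
	uint32_t first_packet;     /* firstPacket from the ACK */
	uint32_t previous_packet;  /* previousPacket from the ACK */
};

/*
 * Return -ENETRESET if the ACK suggests the server saw our packets as a
 * new call from a different peer, or 0 to carry on with normal handling.
 */
int demo_check_nat_ack(const struct demo_ack_info *ack, bool first_hard_acked)
{
	assert(ack != NULL);

	/* Both rules require firstPacket == 1 and previousPacket == 0. */
	if (ack->first_packet != 1 || ack->previous_packet != 0)
		return 0;

	switch (ack->reason) {
	case DEMO_ACK_EXCEEDS_WINDOW:
		/* Rule (1): fail the call. */
		return -ENETRESET;
	case DEMO_ACK_OUT_OF_SEQUENCE:
		/* Rule (2): fail only if packet 1 was already hard-ACK'd;
		 * otherwise retransmission just repeats the call. */
		return first_hard_acked ? -ENETRESET : 0;
	default:
		return 0;
	}
}
```

A caller would then treat -ENETRESET as described in rules (3) to (5): fail the operation outright, or redirty the pages for writeback.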
    • rxrpc, afs: Fix selection of abort codes · de696c47
      David Howells authored
      The RX_USER_ABORT code should really only be used to indicate that the user
      of the rxrpc service (ie. userspace) implicitly caused a call to be aborted
      - for instance if the AF_RXRPC socket is closed whilst the call was in
      progress.  (The user may also explicitly abort a call and specify the abort
      code to use).
      
      Change some of the points of generation to use other abort codes instead:
      
       (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see
           ENOMEM and EFAULT during received data delivery and abort with
           RX_CALL_DEAD in the default case.
      
       (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a
           reply.
      
       (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had
           heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't.
      
       (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not
           completed successfully or been aborted.
      Reported-by: Jeffrey Altman <jaltman@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
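Points (1) and (3) of the abort-code commit above amount to a mapping from a local condition to an abort code. A minimal sketch of that mapping follows. The function names and the DEMO_ enumerators are hypothetical stand-ins, and wire values are deliberately left out. It also assumes that the SS_ variant applies when the local end is the service side and the CC_ variant when it is the client side, which the commit message does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Symbolic stand-ins for the abort codes named in the commit message. */
enum demo_abort_code {
	DEMO_RX_CALL_DEAD,
	DEMO_RX_CALL_TIMEOUT,
	DEMO_RXGEN_CC_UNMARSHAL,
	DEMO_RXGEN_SS_UNMARSHAL,
};

/*
 * Point (1): pick an abort code for an error hit while delivering
 * received data. Assumption: SS_ = we are the service, CC_ = the client.
 */
enum demo_abort_code demo_delivery_abort(int err, bool is_service)
{
	assert(err < 0);  /* expects a negative errno value */

	switch (err) {
	case -ENOMEM:
	case -EFAULT:
		return is_service ? DEMO_RXGEN_SS_UNMARSHAL
				  : DEMO_RXGEN_CC_UNMARSHAL;
	default:
		return DEMO_RX_CALL_DEAD;
	}
}

/* Point (3): the peer has gone quiet. */
enum demo_abort_code demo_silence_abort(bool heard_from_peer)
{
	return heard_from_peer ? DEMO_RX_CALL_DEAD : DEMO_RX_CALL_TIMEOUT;
}
```

Neither helper ever yields RX_USER_ABORT, which matches the commit's intent of reserving that code for aborts caused by the user of the service.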
    • rxrpc: Return an error to sendmsg if call failed · 4ba68c51
      David Howells authored
      If at the end of rxrpc sendmsg() or rxrpc_kernel_send_data() the call that
      was being given data was aborted remotely or otherwise failed, return an
      error rather than returning the amount of data buffered for transmission.
      
      The call (presumably) did not complete, so there's not much point
      continuing with it.  AF_RXRPC considers it "complete" and so will be
      unwilling to do anything else with it - and won't send a notification for
      it, deeming the return from sendmsg sufficient.
      
      Not returning an error causes afs to incorrectly handle a StoreData
      operation that gets interrupted by a change of address due to NAT
      reconfiguration.
      
      This doesn't normally affect most operations since their request parameters
      tend to fit into a single UDP packet and afs_make_call() returns before the
      server responds; StoreData is different as it involves transmission of a
      lot of data.
      
      This can be triggered on a client by doing something like:
      
      	dd if=/dev/zero of=/afs/example.com/foo bs=1M count=512
      
      at one prompt, and then changing the network address at another prompt,
      e.g.:
      
      	ifconfig enp6s0 inet 192.168.6.2 && route add 192.168.6.1 dev enp6s0
      
      Tracing packets on an Auristor fileserver looks something like:
      
      192.168.6.1 -> 192.168.6.3  RX 107 ACK Idle  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
      192.168.6.3 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(64538) (64538)
      192.168.6.3 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(64538) (64538)
      192.168.6.1 -> 192.168.6.3  RX 107 ACK Idle  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
      <ARP exchange for 192.168.6.2>
      192.168.6.2 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(0) (0)
      192.168.6.2 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(0) (0)
      192.168.6.1 -> 192.168.6.2  RX 107 ACK Exceeds Window  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
      192.168.6.1 -> 192.168.6.2  RX 74 ABORT  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
      192.168.6.1 -> 192.168.6.2  RX 74 ABORT  Seq: 29321  Call: 4  Source Port: 7000  Destination Port: 7001
      
      The Auristor fileserver logs code -453 (RXGEN_SS_UNMARSHAL), but the abort
      code received by kafs is -5 (RX_PROTOCOL_ERROR) as the rx layer sees the
      condition and generates an abort first and the unmarshal error is a
      consequence of that at the application layer.
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004810.html # v1
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Automatically generate trace tag enums · dc9fd093
      David Howells authored
      Automatically generate trace tag enums from the symbol -> string mapping
      tables rather than having the enums as well, thereby reducing duplicated
      data.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
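One common way to derive an enum from a symbol -> string table in C is the X-macro pattern, where the table is written once and expanded twice. The sketch below illustrates that general technique for the trace-tag commit above. The macro names and tags are made up for the example and do not reproduce the actual rxrpc trace headers.

```c
#include <assert.h>

/* Single source of truth: hypothetical (symbol, string) pairs. */
#define DEMO_CALL_TRACES(X)          \
	X(demo_call_got,     "GOT")  \
	X(demo_call_put,     "PUT")  \
	X(demo_call_release, "RLS")

/* Expand the table once to produce the enum... */
#define DEMO_AS_ENUM(sym, str) sym,
enum demo_call_trace {
	DEMO_CALL_TRACES(DEMO_AS_ENUM)
	demo_call_trace__nr  /* number of entries in the table */
};

/* ...and once more to produce the matching string table. */
#define DEMO_AS_STRING(sym, str) [sym] = str,
static const char *const demo_call_trace_names[] = {
	DEMO_CALL_TRACES(DEMO_AS_STRING)
};

/* Look up the tag string for a trace value. */
const char *demo_call_trace_name(enum demo_call_trace t)
{
	assert(t < demo_call_trace__nr);
	return demo_call_trace_names[t];
}
```

Adding a new tag then means adding one line to DEMO_CALL_TRACES; the enum and the string table cannot drift apart.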
    • rxrpc: Fix locking issue · ad25f5cb
      David Howells authored
      There's a locking issue with the per-netns list of calls in rxrpc.  The
      pieces of code that add and remove a call from the list use write_lock()
      and the calls procfile uses read_lock() to access it.  However, the timer
      callback function may trigger a removal by trying to queue a call for
      processing and finding that it's already queued - at which point it has a
      spare refcount that it has to do something with.  If it puts the call
      and this reduces the refcount to 0, the call will be removed from the
      list.  Since the _bh variants of the locking functions aren't used, this
      can deadlock.
      
      ================================
      WARNING: inconsistent lock state
      5.18.0-rc3-build4+ #10 Not tainted
      --------------------------------
      inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes:
      ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b
      {SOFTIRQ-ON-W} state was registered at:
      ...
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&rxnet->call_lock);
        <Interrupt>
          lock(&rxnet->call_lock);
      
       *** DEADLOCK ***
      
      1 lock held by ksoftirqd/2/25:
       #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d
      
      Changes
      =======
      ver #2)
       - Changed to using list_next_rcu() rather than rcu_dereference() directly.
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Use refcount_t rather than atomic_t · a0575429
      David Howells authored
      Move to using refcount_t rather than atomic_t for refcounts in rxrpc.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
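The practical difference between a plain atomic counter and a dedicated reference-count type is that the latter refuses to wrap around or to resurrect an object whose count has reached zero. The following user-space model illustrates that idea with C11 atomics. It is a simplified sketch with hypothetical names, not the kernel's refcount_t implementation, and the saturation value is chosen only for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating reference count. */
struct demo_refcount {
	atomic_uint refs;
};

#define DEMO_SATURATED UINT_MAX  /* once reached, the count stays pinned */

/* Take a reference unless the count is already zero. */
bool demo_refcount_inc_not_zero(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;  /* object is already on its way out */
		if (old == DEMO_SATURATED)
			return true;   /* pinned: never wrap back to zero */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));

	return true;
}

/* Drop a reference; return true when the caller dropped the last one. */
bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		assert(old != 0);      /* underflow would be a caller bug */
		if (old == DEMO_SATURATED)
			return false;  /* leak the object rather than free it */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));

	return old == 1;
}
```

With a bare atomic counter, an overflowing increment wraps to zero and a later put can free an object that is still in use; the saturating variant trades that for a leak.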
    • rxrpc: Allow list of in-use local UDP endpoints to be viewed in /proc · 33912c26
      David Howells authored
      Allow the list of in-use local UDP endpoints in the current network
      namespace to be viewed in /proc.
      
      To aid with this, the endpoint list is converted to an hlist and RCU-safe
      manipulation is used so that the list can be read with only the RCU
      read lock held.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipa-next' · 0598cec9
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: ipa: a few more small items
      
      This series consists of three small sets of changes.  Version 2 adds
      a patch that avoids a warning that occurs when handling a modem
      crash (I unfortunately didn't notice it earlier).  All other patches
      are the same--just rebased.
      
      The first three patches allow a few endpoint features to be
      specified.  At this time, currently-defined endpoints retain the
      same configuration, but when the monitor functionality is added in
      the next cycle these options will be required.
      
      The fourth patch simply removes an unused function, explaining also
      why it would likely never be used.
      
      The fifth patch is new.  It counts the number of modem TX endpoints
      and uses it to determine how many TREs a transaction needs when
      handling a modem crash.  It is needed to avoid exceeding the
      limited number of commands imposed by the last four patches.
      
      And the last four patches refactor code related to IPA immediate
      commands, eliminating an unused field and then simplifying and
      removing some unneeded code.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: use data space for command opcodes · a224bd4b
      Alex Elder authored
      The 64-bit data field in a transaction is not used for commands.
      And the opcode array is *only* used for commands.  They're
      (currently) the same size; save a little space in the transaction
      structure by enclosing the two fields in a union.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
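The space saving described in the "use data space for command opcodes" commit above can be pictured with a small stand-alone structure. This is a sketch under stated assumptions: the struct, field, and function names are hypothetical and far simpler than the real transaction structure; the limit of 8 opcodes follows the "never use more than 8 TREs" remark in the neighbouring "remove command info pool" commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on TREs in one command transaction (8 per the log). */
#define DEMO_COMMAND_TRANS_TRE_MAX 8

/* Hypothetical, heavily reduced transaction structure. */
struct demo_trans {
	uint8_t used_count;  /* number of TREs added so far */
	union {
		uint64_t data;  /* used by non-command transactions */
		uint8_t cmd_opcode[DEMO_COMMAND_TRANS_TRE_MAX];  /* commands only */
	};
};

/* The union costs nothing extra only while both members are the same size. */
_Static_assert(sizeof(uint64_t) == DEMO_COMMAND_TRANS_TRE_MAX,
	       "opcode array must fit in the data field");

/* Record the opcode for the next command TRE; -1 if the array is full. */
int demo_trans_cmd_add(struct demo_trans *trans, uint8_t opcode)
{
	assert(trans != NULL);

	if (trans->used_count >= DEMO_COMMAND_TRANS_TRE_MAX)
		return -1;

	trans->cmd_opcode[trans->used_count++] = opcode;
	return 0;
}
```

Because a transaction is either a command transaction or not, the two members are never live at the same time, which is what makes the overlay safe.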
    • net: ipa: remove command info pool · 8797972a
      Alex Elder authored
      The ipa_cmd_info structure now contains only one field, and it's an
      enumerated type whose values all fit in 8 bits.  Currently we'll
      never use more than 8 TREs in a command transaction, and we can
      represent that number of command opcodes in the same space as a 64
      bit pointer to an ipa_cmd_info structure.
      
      Define IPA_COMMAND_TRANS_TRE_MAX as the maximum number of TREs that
      can be in a command transaction.  Replace the info pointer in a
      transaction with a fixed-size array named cmd_opcode[] of that many
      bytes.  Store the opcode in this array when adding a command TRE to
      a transaction, as was done previously for the info array.  This
      makes the ipa_cmd_info unused, so get rid of it.
      
      When committing an immediate command transaction, use the channel's
      Boolean command flag to determine whether to fill in the opcode,
      which will be taken (as before) from the array in the transaction.
      
      This makes the command info pool unnecessary, so get rid of it.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: remove command direction argument · 4de284b7
      Alex Elder authored
      We no longer use the direction argument for gsi_trans_cmd_add(), so
      get rid of it in its definition, and in its seven callers.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: get rid of ipa_cmd_info->direction · 7ffba3bd
      Alex Elder authored
      The direction field of the ipa_cmd_info structure is set, but never
      used.  It seems it might have been used for the DMA_SHARED_MEM
      immediate command, but the DIRECTION flag is set based on the value
      of the passed-in direction flag there.
      
      Anyway, remove this unused field from the ipa_cmd_info structure.
      This is done as a separate patch to make it very obvious that it's
      not required.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: count the number of modem TX endpoints · 2091c79a
      Alex Elder authored
      In ipa_endpoint_modem_exception_reset_all(), a high estimate was
      made of the number of endpoints that need their status register
      updated.  We only used what was needed, so the high estimate didn't
      matter much.
      
      However the next few patches are going to limit the number of
      commands in a single transaction, and the overestimate would exceed
      that.  So count the number of modem TX endpoints at initialization
      time, and use it in ipa_endpoint_modem_exception_reset_all().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: kill gsi_trans_commit_wait_timeout() · d15180b4
      Alex Elder authored
      Since the beginning gsi_trans_commit_wait_timeout() has existed to
      provide a way to allow waiting a limited time for a transaction
      to complete.  But that function has never been used.
      
      In fact, there is no use for this function, because a transaction
      committed to hardware should *always* complete.  The only reason it
      might not complete is if there were a hardware failure, or perhaps a
      system configuration error.
      
      Furthermore, if a timeout ever did occur, the IPA hardware would be
      in an indeterminate state, from which there is no recovery.  It
      would require some sort of complete IPA reset, and would require the
      participation of the modem, and at this time there is no such
      sequence defined.
      
      So get rid of the definition of gsi_trans_commit_wait_timeout(), and
      update a few comments accordingly.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: specify RX aggregation time limit in config data · beb90cba
      Alex Elder authored
      Don't assume that a 500 microsecond time limit should be used for
      all receive endpoints that support aggregation.  Instead, specify
      the time limit to use in the configuration data.
      
      Set a 500 microsecond limit for all existing RX endpoints, as before.
      
      Checking for overflow for the time limit field is a bit complicated.
      Rather than duplicate a lot of code in ipa_endpoint_data_valid_one(),
      call WARN() if any value is found to be too large when encoding it.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: support hard aggregation limits · 3cebb7c2
      Alex Elder authored
      Add a new flag for AP receive endpoints that indicates whether
      a "hard limit" is used as a criterion for closing aggregation.
      Add comments explaining the difference between "hard" and "soft"
      aggregation limits.
      
      Pass a flag to ipa_aggr_size_kb() so it computes the proper
      aggregation size value whether using hard or soft limits.  Move
      that function earlier in "ipa_endpoint.c" so it can be used
      without a forward-reference.
      
      Update ipa_endpoint_data_valid_one() so it validates endpoints whose
      data indicate a hard aggregation limit is used, and so it reports
      set aggregation flags for endpoints without aggregation enabled.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: make endpoint HOLB drop configurable · 153213f0
      Alex Elder authored
      Add a new Boolean flag for RX endpoints defining whether HOLB drop
      is initially enabled or disabled for the endpoint.  All existing AP
      endpoints should have HOLB drop disabled.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: fix typos in comments · 60f243ad
      Julia Lawall authored
      Spelling mistakes (triple letters) in comments.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: flower: fix typo in comment · b993e72c
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: marvell: prestera: fix typo in comment · 878e2eb2
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cirrus: cs89x0: fix typo in comment · 3f660c18
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qed: fix typos in comments · cc4e7fa5
      Julia Lawall authored
      Spelling mistakes (triple letters) in comments.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: fix typo in comment · b0ea505b
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: fix typo in comment · e34be16b
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: switchdev: fix typo in comment · 1f36a72a
      Julia Lawall authored
      Spelling mistake (triple letters) in comment.
      Detected with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 21 May, 2022 4 commits
    • Merge branch 'add-a-bhash2-table-hashed-by-port-address' · aa5334b1
      Jakub Kicinski authored
      Joanne Koong says:
      
      ====================
      Add a bhash2 table hashed by port + address
      
      This patchset proposes adding a bhash2 table that hashes by port and address.
      The motivation behind bhash2 is to expedite bind requests in situations where
      the port has many sockets in its bhash table entry, which makes checking bind
      conflicts costly especially given that we acquire the table entry spinlock
      while doing so, which can cause softirq cpu lockups and can prevent new tcp
      connections.
      
      We ran into this problem at Meta, where the traffic team binds a large
      number of IPs to port 443. The bind() call took a significant amount of
      time, which led to cpu softirq lockups, which in turn caused packet drops
      and other failures on the machine.
      
      The patches are as follows:
      1/2 - Adds a second bhash table (bhash2) hashed by port and address
      2/2 - Adds a test for timing how long an additional bind request takes when
      the bhash entry is populated
      
      When experimentally testing this on a local server for ~24k sockets bound to
      the port, the results seen were:
      
      ipv4:
      before - 0.002317 seconds
      with bhash2 - 0.000018 seconds
      
      ipv6:
      before - 0.002431 seconds
      with bhash2 - 0.000021 seconds
      ====================
      
      Link: https://lore.kernel.org/r/20220520001834.2247810-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aa5334b1
    • Joanne Koong's avatar
      selftests: Add test for timing a bind request to a port with a populated bhash entry · 538aaf9b
      Joanne Koong authored
      This test populates the bhash table for a given port with
      MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
      a bind request on the port takes.
      
      When populating the bhash table, we create the sockets and then bind
      the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
      are set). When timing how long a bind on the port takes, we bind on a
      different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
      because we are interested in the case where the bind request does not
      go through the tb->fastreuseport path, which is fragile (e.g., the
      tb->fastreuseport path does not work if binding with a different uid).
      
      To run the test locally, I did:
      * ulimit -n 65535000
      * ip addr add 2001:0db8:0:f101::1 dev eth0
      * ./bind_bhash_test 443
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      538aaf9b
    • Joanne Koong's avatar
      net: Add a second bind table hashed by port and address · d5a42de8
      Joanne Koong authored
      We currently have one tcp bind table (bhash) which hashes by port
      number only. In the socket bind path, we check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
      
      In instances where there are tons of sockets hashed to the same port
      at different addresses, checking for a bind conflict is time-intensive
      and can cause softirq cpu lockups; it also stalls new tcp connections,
      since __inet_inherit_port() also contends for the spinlock.
      
      This patch proposes adding a second bind table, bhash2, that hashes by
      port and ip address. Searching the bhash2 table leads to significantly
      faster conflict resolution and less time holding the spinlock.
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d5a42de8
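The idea behind the second table in the commit above can be illustrated with a toy user-space model. It is a sketch under loud assumptions, not the kernel implementation: every name here (`toy_sock`, `toy_tables`, `toy_hash_port`, `toy_hash_portaddr`) is hypothetical, the hash functions are arbitrary, and the conflict rule is reduced to "same port and same address", ignoring wildcard addresses and the socket reuse options. It shows only why a lookup keyed by port and address examines fewer entries than one keyed by port alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKETS 1024u /* arbitrary table size for this sketch */

/* Hypothetical stand-in for a bound socket; not a kernel structure. */
struct toy_sock {
    uint32_t addr;                  /* IPv4 address, host byte order */
    uint16_t port;
    struct toy_sock *next_port;     /* chain in the port-only table */
    struct toy_sock *next_portaddr; /* chain in the port+address table */
};

struct toy_tables {
    struct toy_sock *by_port[TOY_BUCKETS];
    struct toy_sock *by_portaddr[TOY_BUCKETS];
};

static unsigned toy_hash_port(uint16_t port)
{
    return port % TOY_BUCKETS;
}

static unsigned toy_hash_portaddr(uint16_t port, uint32_t addr)
{
    /* Multiplicative mixing; the top 10 bits select one of 1024 buckets. */
    uint32_t h = (addr ^ port) * 2654435761u;

    return h >> 22;
}

/* Link the socket into both tables. */
static void toy_insert(struct toy_tables *t, struct toy_sock *s,
                       uint32_t addr, uint16_t port)
{
    unsigned p, pa;

    assert(t != NULL && s != NULL);
    p = toy_hash_port(port);
    pa = toy_hash_portaddr(port, addr);
    s->addr = addr;
    s->port = port;
    s->next_port = t->by_port[p];
    t->by_port[p] = s;
    s->next_portaddr = t->by_portaddr[pa];
    t->by_portaddr[pa] = s;
}

/* Returns 1 on conflict; *visited receives the number of entries examined. */
static int toy_conflict_by_port(const struct toy_tables *t, uint32_t addr,
                                uint16_t port, size_t *visited)
{
    const struct toy_sock *s;

    *visited = 0;
    for (s = t->by_port[toy_hash_port(port)]; s; s = s->next_port) {
        (*visited)++;
        if (s->port == port && s->addr == addr)
            return 1;
    }
    return 0;
}

/* Same check, but walking only the port+address bucket. */
static int toy_conflict_by_portaddr(const struct toy_tables *t, uint32_t addr,
                                    uint16_t port, size_t *visited)
{
    const struct toy_sock *s;

    *visited = 0;
    for (s = t->by_portaddr[toy_hash_portaddr(port, addr)]; s;
         s = s->next_portaddr) {
        (*visited)++;
        if (s->port == port && s->addr == addr)
            return 1;
    }
    return 0;
}
```

In this model, when many sockets share one port at different addresses, the port-only lookup walks every one of them before it can report "no conflict" for a new address, while the port+address lookup walks only the entries that happen to share the new address's bucket.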
    • Jakub Kicinski's avatar
      wwan: iosm: use a flexible array rather than allocate short objects · eac67d83
      Jakub Kicinski authored
      GCC array-bounds warns that ipc_coredump_get_list() under-allocates
      the size of struct iosm_cd_table *cd_table.
      
      This is avoidable - we just need a flexible array. Nothing calls
      sizeof() on struct iosm_cd_list or anything that contains it.
      Reviewed-by: M Chetan Kumar <m.chetan.kumar@intel.com>
      Link: https://lore.kernel.org/r/20220520060013.2309497-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      eac67d83
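The general technique named in the commit above, a flexible array member in place of a short allocation of a fixed-size structure, might look like the following in plain C. The types `toy_entry` and `toy_list` and the helper `toy_list_alloc` are hypothetical and are not the iosm driver's definitions; this is a minimal user-space sketch of the pattern only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical element type; the real driver's layout is not shown here. */
struct toy_entry {
    char name[32];
    unsigned size;
};

/*
 * The trailing array has no declared length, so the compiler does not
 * assume a fixed number of elements; the allocation decides how many exist.
 */
struct toy_list {
    size_t num_entries;
    struct toy_entry entry[]; /* flexible array member, must be last */
};

/* Allocate a zeroed list with room for exactly n entries. */
static struct toy_list *toy_list_alloc(size_t n)
{
    struct toy_list *l;

    /* Reject sizes whose byte count would overflow size_t. */
    if (n > (SIZE_MAX - sizeof(*l)) / sizeof(l->entry[0]))
        return NULL;

    l = calloc(1, sizeof(*l) + n * sizeof(l->entry[0]));
    if (l)
        l->num_entries = n;
    return l;
}
```

Because `sizeof(struct toy_list)` covers only the header, code that depends on `sizeof()` of the structure or of anything embedding it needs auditing before such a conversion, which is the point the commit message makes about struct iosm_cd_list.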