1. 02 Feb, 2021 2 commits
  2. 30 Jan, 2021 38 commits
    • tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size · 14e8e0f6
      Neal Cardwell authored
      This commit shrinks inet_connection_sock by 4 bytes, by shrinking
      icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
      icsk_mtup.probe_size from s32 to an unsigned 31-bit field.
      
      This is to save space to compensate for the recent introduction of a
      new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
      fix commit 9d9b1ee0 ("tcp: fix TCP_USER_TIMEOUT with zero window").
      
      This should not change functionality, since icsk_mtup.enabled is only
      ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
      or a positive MTU value returned by tcp_mss_to_mtu().
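The space saving described above can be illustrated with a small stand-alone sketch. The two structs below are hypothetical, simplified stand-ins (not the kernel's actual icsk_mtup layout), and the result assumes a compiler such as GCC or Clang that packs adjacent uint32_t bit-fields into one 32-bit storage unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" layout: a full int for the flag, a signed size. */
struct mtup_before {
	int      enabled;          /* only ever 0 or 1 */
	int32_t  probe_size;       /* 0 or a positive MTU */
	uint32_t probe_timestamp;
};

/* Hypothetical "after" layout: both fields share one 32-bit unit. */
struct mtup_after {
	uint32_t probe_size:31,
		 enabled:1;
	uint32_t probe_timestamp;
};

/* Store an MTU in the 31-bit field and read it back. */
static uint32_t mtup_roundtrip(uint32_t mtu)
{
	struct mtup_after m = { .probe_size = 0, .enabled = 1,
				.probe_timestamp = 0 };

	assert(mtu < (UINT32_C(1) << 31));	/* must fit in 31 bits */
	m.probe_size = mtu;
	return m.probe_size;
}
```

Under those assumptions, sizeof(struct mtup_before) exceeds sizeof(struct mtup_after) by 4 bytes, and any non-negative MTU below 2^31 survives the round trip unchanged.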
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210129185438.1813237-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-bridge-drop-hosts-limit-sysfs-and-add-a-comment' · 4e146def
      Jakub Kicinski authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: drop hosts limit sysfs and add a comment
      
      As recently discussed[1] we should stop extending the bridge sysfs
      support for new options and move to using netlink only, so patch 01
      drops the recently added hosts limit sysfs support which is still in
      net-next only and patch 02 adds comments in br_sysfs_br/if.c to warn
      against adding new sysfs options.
      
      [1] https://lore.kernel.org/netdev/20210128105201.7c6bed82@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/T/#mda7265b2e57b52bdab863f286efa85291cf83822
      ====================
      
      Link: https://lore.kernel.org/r/20210129115142.188455-2-razor@blackwall.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: bridge: add warning comments to avoid extending sysfs · 1e16f382
      Nikolay Aleksandrov authored
      We're moving to netlink-only options, so add comments in the bridge's
      sysfs files to warn against adding any new sysfs entries.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: bridge: mcast: drop hosts limit sysfs support · 7d0888d5
      Nikolay Aleksandrov authored
      We decided to stop adding new sysfs bridge options and continue with
      netlink only, so remove hosts limit sysfs support.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'tag_8021q-for-ocelot-switches' · 56435d91
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      tag_8021q for Ocelot switches
      
      The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port,
      and the external ports, in the majority of use cases, run at 1G. This
      means that, when the CPU injects traffic into the switch, it is very
      easy to run into congestion. This is not to say that it is impossible to
      enter congestion even with all ports running at the same speed, just
      that the default configuration is already very prone to that by design.
      
      Normally, the way to deal with that is using Ethernet flow control
      (PAUSE frames).
      
      However, this functionality is not working today with the ENETC - Felix
      switch pair. The hardware issue is undergoing documentation right now as
      an erratum within NXP, but several customers have been requesting a
      reasonable workaround for it.
      
      In truth, the LS1028A has 2 internal port pairs. The lack of flow control
      is an issue only when NPI mode (Node Processor Interface, aka the mode
      where the "CPU port module", which carries DSA-style tagged packets, is
      connected to a regular Ethernet port) is used, and NPI mode is supported
      by Felix on a single port.
      
      In past BSPs, we have had setups where both internal port pairs were
      enabled. We were advertising the following setup:
      
      "data port"     "control port"
        (2.5G)            (1G)
      
         eno2             eno3
          ^                ^
          |                |
          | regular        | DSA-tagged
          | frames         | frames
          |                |
          v                v
         swp4             swp5
      
      This works but is highly impractical, due to NXP shifting the task of
      designing a functional system (choosing which port to use, depending on
      type of traffic required) up to the end user. The swpN interfaces would
      have to be bridged with swp4, in order for the eno2 "data port" to have
      access to the outside network. And the swpN interfaces would still be
      capable of IP networking. So running a DHCP client would give us two IP
      interfaces from the same subnet, one assigned to eno2, and the other to
      swpN (0, 1, 2, 3).
      
      Also, the dual port design doesn't scale. When attaching another DSA
      switch to a Felix port, the end result is that the "data port" cannot
      carry any meaningful data to the external world, since it lacks the DSA
      tags required to traverse the sja1105 switches below. All that traffic
      needs to go through the "control port".
      
      So in newer BSPs there was a desire to simplify that setup, and only
      have one internal port pair:
      
         eno2            eno3
          ^
          |
          | DSA-tagged    x disabled
          | frames
          |
          v
         swp4            swp5
      
      However, this setup only exacerbates the issue of not having flow
      control on the NPI port, since that is the only port now. Also, there
      are use cases that still require the "data port", such as IEEE 802.1CB
      (TSN stream identification doesn't work over an NPI port), source
      MAC address learning over NPI, etc.
      
      Again, there is a desire to keep the simplicity of the single internal
      port setup, while regaining the benefits of having a dedicated data port
      as well. And this series attempts to deliver just that.
      
      So the NPI functionality is disabled conditionally. Its purpose was:
      - To ensure individually addressable ports on TX. This can be replaced
        by using some designated VLAN tags which are pushed by the DSA tagger
        code, then removed by the switch (so they are invisible to the outside
        world and to the user).
      - To ensure source port identification on RX. Again, this can be
        replaced by using some designated VLAN tags to encapsulate all RX
        traffic (each VLAN uniquely identifies a source port). The DSA tagger
        determines which port it was based on the VLAN number, then removes
        that header.
      - To deliver PTP timestamps. This cannot be obtained through VLAN
        headers, so we need to take a step back and see how else we can do
        that. The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
        injection/extraction from the CPU port module using register-based
        MMIO, and not over Ethernet. We will need to do the same from DSA,
        which makes this tagger a sort of hybrid between DSA and pure
        switchdev.
      ====================
      
      Link: https://lore.kernel.org/r/20210129010009.3959398-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: perform switch setup for tag_8021q · e21268ef
      Vladimir Oltean authored
      Unlike sja1105, the only other user of the software-defined tag_8021q.c
      tagger format, the implementation we choose for the Felix DSA switch
      driver preserves full functionality under a vlan_filtering bridge
      (i.e. IP termination works through the DSA user ports under all
      circumstances).
      
      The tag_8021q protocol just wants:
      - Identifying the ingress switch port based on the RX VLAN ID, as seen
        by the CPU. We achieve this by using the TCAM engines (which are also
        used for tc-flower offload) to push the RX VLAN as a second, outer
        tag, on egress towards the CPU port.
      - Steering traffic injected into the switch from the network stack
        towards the correct front port based on the TX VLAN, and consuming
        (popping) that header on the switch's egress.
      
      A tc-flower pseudocode of the static configuration done by the driver
      would look like this:
      
      $ tc qdisc add dev <cpu-port> clsact
      $ for eth in swp0 swp1 swp2 swp3; do \
      	tc filter add dev <cpu-port> egress flower indev ${eth} \
      		action vlan push id <rxvlan> protocol 802.1ad; \
      	tc filter add dev <cpu-port> ingress protocol 802.1Q flower \
      		vlan_id <txvlan> action vlan pop \
      		action mirred egress redirect dev ${eth}; \
      done
      
      but of course since DSA does not register network interfaces for the CPU
      port, this configuration would be impossible for the user to do. Also,
      due to the same reason, it is impossible for the user to inadvertently
      delete these rules using tc. These rules do not collide in any way with
      tc-flower, they just consume some TCAM space, which is something we can
      live with.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: add a second tagger for Ocelot switches based on tag_8021q · 7c83a7c5
      Vladimir Oltean authored
      There are use cases for which the existing tagger, based on the NPI
      (Node Processor Interface) functionality, is insufficient.
      
      Namely:
      - Frames injected through the NPI port bypass the frame analyzer, so no
        source address learning is performed, no TSN stream classification,
        etc.
      - Flow control is not functional over an NPI port (PAUSE frames are
        encapsulated in the same Extraction Frame Header as all other frames)
      - There can be at most one NPI port configured for an Ocelot switch. But
        in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
        port is currently either disabled, or operated as a plain user port
        (albeit an internally-facing one). Having the ability to configure the
        two CPU ports symmetrically could pave the way for e.g. creating a LAG
        between them, to increase bandwidth seamlessly for the system.
      
      So there is a desire to have an alternative to the NPI mode. This change
      keeps the default tagger for the Seville and Felix switches as "ocelot",
      but it can be changed via the following device attribute:
      
      echo ocelot-8021q > /sys/class/net/<dsa-master>/dsa/tagging
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: convert to the new .change_tag_protocol DSA API · adb3dccf
      Vladimir Oltean authored
      In expectation of the new tag_ocelot_8021q tagger implementation, we
      need to be able to do runtime switchover between one tagger and another.
      So we must structure the existing code for the current NPI-based tagger
      in a certain way.
      
      We move the felix_npi_port_init function in expectation of the future
      driver configuration necessary for tag_ocelot_8021q: we would like to
      not have the NPI-related bits interspersed with the tag_8021q bits.
      
      The conversion from this:
      
      	ocelot_write_rix(ocelot,
      			 ANA_PGID_PGID_PGID(GENMASK(ocelot->num_phys_ports, 0)),
      			 ANA_PGID_PGID, PGID_UC);
      
      to this:
      
      	cpu_flood = ANA_PGID_PGID_PGID(BIT(ocelot->num_phys_ports));
      	ocelot_rmw_rix(ocelot, cpu_flood, cpu_flood, ANA_PGID_PGID, PGID_UC);
      
      is perhaps non-trivial, but it is nonetheless not a functional change. The PGID_UC
      (replicator for unknown unicast) is already configured out of hardware
      reset to flood to all ports except ocelot->num_phys_ports (the CPU port
      module). All we change is that we use a read-modify-write to only add
      the CPU port module to the unknown unicast replicator, as opposed to
      doing a full write to the register.
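The equivalence argued above can be checked with a small user-space model. This is only a sketch: the register is modelled as a plain 32-bit value, BIT and GENMASK are local re-definitions, the ANA_PGID_PGID_PGID field wrapper is left out, and the reset value is taken from the description in the text (all front ports set, CPU port module clear).

```c
#include <stdint.h>

#define BIT(n)        (UINT32_C(1) << (n))
#define GENMASK(h, l) ((UINT32_MAX >> (31 - (h))) & ~(BIT(l) - 1))

/* Old approach: overwrite the register with ports 0..num_phys_ports. */
static uint32_t pgid_uc_full_write(unsigned int num_phys_ports)
{
	return GENMASK(num_phys_ports, 0);
}

/* New approach: read-modify-write that only sets the CPU port bit. */
static uint32_t pgid_uc_rmw(uint32_t reg, unsigned int num_phys_ports)
{
	uint32_t cpu_flood = BIT(num_phys_ports);

	return (reg & ~cpu_flood) | cpu_flood;
}

/* Reset value as described in the text: front ports only, no CPU port. */
static uint32_t pgid_uc_reset_value(unsigned int num_phys_ports)
{
	return BIT(num_phys_ports) - 1;
}
```

Starting from that reset value, both approaches end with the same register contents, which is why the conversion does not change behavior.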
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: allow changing the tag protocol via the "tagging" device attribute · 53da0eba
      Vladimir Oltean authored
      Currently DSA exposes the following sysfs:
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      
      which is a read-only device attribute, introduced in the kernel as
      commit 98cdb480 ("net: dsa: Expose tagging protocol to user-space"),
      and used by libpcap since its commit 993db3800d7d ("Add support for DSA
      link-layer types").
      
      It would be nice if we could extend this device attribute by making it
      writable:
      $ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
      
      This is useful with DSA switches that can make use of more than one
      tagging protocol. It may be useful in dsa_loop in the future too, to
      perform offline testing of various taggers, or for changing between dsa
      and edsa on Marvell switches, if that is desirable.
      
      In terms of implementation, drivers can support this feature by
      implementing .change_tag_protocol, which should always leave the switch
      in a consistent state: either with the new protocol if things went well,
      or with the old one if something failed. Teardown of the old protocol,
      if necessary, must be handled by the driver.
      
      Some things remain as before:
      - The .get_tag_protocol is currently only called at probe time, to load
        the initial tagging protocol driver. Nonetheless, new drivers should
        report the tagging protocol in current use now.
      - The driver should manage by itself the initial setup of tagging
        protocol, no later than the .setup() method, as well as destroying
        resources used by the last tagger in use, no earlier than the
        .teardown() method.
      
      For multi-switch DSA trees, error handling is a bit more complicated,
      since e.g. the 5th out of 7 switches may fail to change the tag
      protocol. When that happens, a revert to the original tag protocol is
      attempted, but that may fail too, leaving the tree in an inconsistent
      state despite each individual switch implementing .change_tag_protocol
      transactionally. Since the intersection between drivers that implement
      .change_tag_protocol and drivers that support D in DSA is currently the
      empty set, the possibility for this error to happen is ignored for now.
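The multi-switch error handling described above can be sketched as follows. Everything here is hypothetical and simplified: struct fake_switch, the function names and the return codes are illustrative stand-ins, not the DSA core's data structures or API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a switch; not the kernel's struct dsa_switch. */
struct fake_switch {
	int proto;        /* tagging protocol currently in use */
	int fail_change;  /* nonzero: this switch rejects any change */
};

/* Per-switch change: transactional, the old protocol is kept on failure. */
static int switch_change_tag_proto(struct fake_switch *sw, int proto)
{
	if (sw->fail_change)
		return -1;
	sw->proto = proto;
	return 0;
}

/*
 * Tree-wide change: stop at the first failure and try to restore the old
 * protocol on the switches already converted. Returns 0 on success, -1 if
 * the change failed but the revert worked, and -2 if the revert failed as
 * well, leaving the tree inconsistent.
 */
static int tree_change_tag_proto(struct fake_switch *sw, size_t n,
				 int old_proto, int new_proto)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		if (switch_change_tag_proto(&sw[i], new_proto)) {
			ret = -1;
			break;
		}
	}
	if (!ret)
		return 0;

	/* Walk back over the switches that were already converted. */
	while (i-- > 0)
		if (switch_change_tag_proto(&sw[i], old_proto))
			ret = -2;
	return ret;
}
```

The -2 case corresponds to the inconsistent state mentioned in the text: each switch behaves transactionally, yet the tree as a whole can still end up split between two protocols.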
      
      Testing:
      
      $ insmod mscc_felix.ko
      [   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
      [   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
      $ insmod tag_ocelot.ko
      $ rmmod mscc_felix.ko
      $ insmod mscc_felix.ko
      [   97.261724] libphy: VSC9959 internal MDIO bus: probed
      [   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
      [   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
      [   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
      [   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
      [   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
      [   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
      [   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
      [   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
      [   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
      [   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
      [   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
      [   98.527948] device eno2 entered promiscuous mode
      [   98.544755] DSA: tree 0 setup
      
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
      ^C
      --- 10.0.0.1 ping statistics ---
      2 packets transmitted, 2 packets received, 0% packet loss
      round-trip min/avg/max = 0.754/1.545/2.337 ms
      
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot
      $ cat ./test_ocelot_8021q.sh
              #!/bin/bash
      
              ip link set swp0 down
              ip link set swp1 down
              ip link set swp2 down
              ip link set swp3 down
              ip link set swp5 down
              ip link set eno2 down
              echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
              ip link set eno2 up
              ip link set swp0 up
              ip link set swp1 up
              ip link set swp2 up
              ip link set swp3 up
              ip link set swp5 up
      $ ./test_ocelot_8021q.sh
      ./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
      $ rmmod tag_ocelot.ko
      rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
      $ insmod tag_ocelot_8021q.ko
      $ ./test_ocelot_8021q.sh
      $ cat /sys/class/net/eno2/dsa/tagging
      ocelot-8021q
      $ rmmod tag_ocelot.ko
      $ rmmod tag_ocelot_8021q.ko
      rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
      $ ping 10.0.0.1
      PING 10.0.0.1 (10.0.0.1): 56 data bytes
      64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
      64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
      64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
      $ rmmod mscc_felix.ko
      [  645.544426] mscc_felix 0000:00:00.5: Link is Down
      [  645.838608] DSA: tree 0 torn down
      $ rmmod tag_ocelot_8021q.ko
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: keep a copy of the tagging protocol in the DSA switch tree · 357f203b
      Vladimir Oltean authored
      Cascading DSA switches can be done multiple ways. There is the brute
      force approach / tag stacking, where one upstream switch, located
      between leaf switches and the host Ethernet controller, will just
      happily transport the DSA header of those leaf switches as payload.
      For this kind of setup, DSA works without any special kind of treatment
      compared to a single switch - they just aren't aware of each other.
      Then there's the approach where the upstream switch understands the tags
      it transports from its leaves below, as it doesn't push a tag of its own,
      but it routes based on the source port & switch id information present
      in that tag (as opposed to DMAC & VID) and it strips the tag when
      egressing a front-facing port. Currently only Marvell implements the
      latter, and Marvell DSA trees contain only Marvell switches.
      
      So it is safe to say that DSA trees already have a single tag protocol
      shared by all switches, and in fact this is what makes the switches able
      to understand each other. This fact is also implied by the fact that
      currently, the tagging protocol is reported as part of a sysfs installed
      on the DSA master and not per port, so it must be the same for all the
      ports connected to that DSA master regardless of the switch that they
      belong to.
      
      It's time to make this official and enforce it (yes, this also means we
      won't have any "switch understands tag to some extent but is not able to
      speak it" hardware oddities that we'll support in the future).
      
      This is needed due to the imminent introduction of the dsa_switch_ops::
      change_tag_protocol driver API. When that is introduced, we'll have
      to notify switches of the tagging protocol that they're configured to
      use. Currently the tag_ops structure pointer is held only for CPU ports.
      But there are switches which don't have CPU ports and nonetheless still
      need to be configured. These would be Marvell leaf switches whose
      upstream port is just a DSA link. How do we inform these of their
      tagging protocol setup/deletion?
      
      One answer to the above would be: iterate through the DSA switch tree's
      ports once, list the CPU ports, get their tag_ops, then iterate again
      now that we have it, and notify everybody of that tag_ops. But what to
      do if conflicts appear between one cpu_dp->tag_ops and another? There's
      no escaping the fact that conflict resolution needs to be done, so we
      can be upfront about it.
      
      Ease our work and just keep the master copy of the tag_ops inside the
      struct dsa_switch_tree. Reference counting is now moved to be per-tree
      too, instead of per-CPU port.
      
      There are many places in the data path that access master->dsa_ptr->tag_ops
      and we would introduce unnecessary performance penalty going through yet
      another indirection, so keep those right where they are.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: document the existing switch tree notifiers and add a new one · 886f8e26
      Vladimir Oltean authored
      The existence of dsa_broadcast has generated some confusion in the past:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html
      
      So let's document the existing dsa_port_notify and dsa_broadcast
      functions and explain when each of them should be used.
      
      Also, in fact, the in-between function has always been there but was
      lacking a name, and is the main reason for this patch: dsa_tree_notify.
      Refactor dsa_broadcast to use it.
      
      This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
      where it really belonged in the first place, but had no companion so it
      stood with dsa_port_notify.
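A miniature model may help picture the three notification scopes. It is a sketch under assumptions: the types and mini_* names are hypothetical, a simple counter replaces the kernel's notifier chains, and it assumes that a port-level notification reaches every switch of the tree the port belongs to.

```c
#include <stddef.h>

#define MAX_SWITCHES 4

/* Hypothetical miniature model; the kernel uses notifier chains instead. */
struct mini_switch {
	int events_seen;
};

struct mini_tree {
	struct mini_switch *sw[MAX_SWITCHES];
	size_t num_sw;
};

struct mini_port {
	struct mini_tree *tree;
};

/* Middle level: deliver one event to every switch of a single tree. */
static void mini_tree_notify(struct mini_tree *t)
{
	for (size_t i = 0; i < t->num_sw; i++)
		t->sw[i]->events_seen++;
}

/* Port level: notify the tree that this port belongs to. */
static void mini_port_notify(const struct mini_port *p)
{
	mini_tree_notify(p->tree);
}

/* Top level: notify every tree in the system. */
static void mini_broadcast(struct mini_tree *trees, size_t n)
{
	for (size_t i = 0; i < n; i++)
		mini_tree_notify(&trees[i]);
}
```

In this model the tree-level helper is the shared building block, which mirrors the refactoring of dsa_broadcast on top of dsa_tree_notify.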
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: don't use NPI tag prefix for the CPU port module · cacea62f
      Vladimir Oltean authored
      Context: Ocelot switches put the injection/extraction frame header in
      front of the Ethernet header. When used in NPI mode, a DSA master would
      see junk instead of the destination MAC address, and it would most
      likely drop the packets. So the Ocelot frame header can have an optional
      prefix, which is just "ff:ff:ff:ff:ff:fe > ff:ff:ff:ff:ff:ff" padding
      put before the actual tag (still before the real Ethernet header) such
      that the DSA master thinks it's looking at a broadcast frame with a
      strange EtherType.
      
      Unfortunately, a lesson learned in commit 69df578c ("net: mscc:
      ocelot: eliminate confusion between CPU and NPI port") seems to have
      been forgotten in the meanwhile.
      
      The CPU port module and the NPI port have independent settings for the
      length of the tag prefix. However, the driver is using the same variable
      to program both of them.
      
      There is no reason really to use any tag prefix with the CPU port
      module, since that is not connected to any Ethernet port. So this patch
      makes the inj_prefix and xtr_prefix variables apply only to the NPI
      port (which the switchdev ocelot_vsc7514 driver does not use).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: reapply bridge forwarding mask on bonding join/leave · 9b521250
      Vladimir Oltean authored
      Applying the bridge forwarding mask currently is done only on the STP
      state changes for any port. But it depends on both STP state changes,
      and bonding interface state changes. Export the bit that recalculates
      the forwarding mask so that it could be reused, and call it when a port
      starts and stops offloading a bonding interface.
      
      Now that the logic is split into a separate function, we can rename "p"
      into "port", since the "port" variable was already taken in
      ocelot_bridge_stp_state_set. Also, we can rename "i" into "lag", to make
      it clearer what it is that we're iterating through.
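The dependency on both STP state and bonding membership can be sketched in user space. This is an illustration, not the driver's code: NUM_PORTS, the function name and the array layout are assumptions, with fwd_ports holding one bit per port in forwarding state and lags[lag] holding the port mask of each bond.

```c
#include <stdint.h>

#define NUM_PORTS 4
#define BIT(n) (UINT32_C(1) << (n))

/*
 * Hypothetical inputs: fwd_ports has one bit per port in STP forwarding
 * state, and lags[lag] is the mask of ports that are members of bond
 * number lag (0 when that bond is unused).
 */
static void recalc_fwd_masks(uint32_t fwd_ports,
			     const uint32_t lags[NUM_PORTS],
			     uint32_t out[NUM_PORTS])
{
	for (int port = 0; port < NUM_PORTS; port++) {
		uint32_t mask = 0;

		if (fwd_ports & BIT(port)) {
			/* Forward to every other forwarding port... */
			mask = fwd_ports & ~BIT(port);
			/* ...except members of this port's own bond. */
			for (int lag = 0; lag < NUM_PORTS; lag++)
				if (lags[lag] & BIT(port))
					mask &= ~lags[lag];
		}
		out[port] = mask;
	}
}
```

Because the result depends on lags[] as well as fwd_ports, the masks have to be recomputed whenever either input changes, which is the point of calling the helper on bonding join and leave.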
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: store a namespaced VCAP filter ID · 50c6cc5b
      Vladimir Oltean authored
      We will be adding some private VCAP filters that should not interfere in
      any way with the filters added using tc-flower. So we need to allocate
      some IDs which will not be used by tc.
      
      Currently ocelot uses a u32 id derived from the flow cookie, which in
      itself is an unsigned long. This is a problem in itself, since on 64 bit
      systems, sizeof(unsigned long)=8, so the driver is already truncating
      these.
      
      Create a struct ocelot_vcap_id which contains the full unsigned long
      cookie from tc, as well as a boolean that is supposed to namespace the
      filters added by tc with the ones that aren't.
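The truncation problem and the namespacing idea can be shown with a short sketch. The struct and field names below are assumptions made for illustration, and uint64_t stands in for the 64-bit unsigned long cookie so that the example behaves the same on every platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a namespaced filter ID; field names are assumptions. */
struct vcap_id {
	uint64_t cookie;      /* full-width cookie, no truncation */
	bool     tc_offload;  /* true: added via tc, false: driver-private */
};

/* Old scheme: the cookie is squeezed into 32 bits. */
static uint32_t legacy_id(uint64_t cookie)
{
	return (uint32_t)cookie;
}

/* New scheme: IDs match only if cookie and namespace both match. */
static bool vcap_id_equal(struct vcap_id a, struct vcap_id b)
{
	return a.cookie == b.cookie && a.tc_offload == b.tc_offload;
}
```

Two cookies that differ only in their upper 32 bits collide under legacy_id() but stay distinct as struct vcap_id values, and a driver-private filter never matches a tc filter even when the cookie values happen to be equal.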
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: export VCAP structures to include/soc/mscc · 0e9bb4e9
      Vladimir Oltean authored
      The Felix driver will need to preinstall some VCAP filters for its
      tag_8021q implementation (outside of the tc-flower offload logic), so
      these need to be exported to the common includes.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN · 9c7caf28
      Vladimir Oltean authored
      The sja1105 implementation can be blind about this, but the felix driver
      doesn't do exactly what it's being told, so it needs to know whether it
      is a TX or an RX VLAN, so it can install the appropriate type of TCAM
      rule.
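Helpers of this kind might look like the sketch below. The bit layout is an assumption chosen for illustration (a two-bit direction field in bits 11:10 of the VLAN ID and the port number in the low four bits), and the macro and function names are hypothetical rather than the ones added by this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: bits 11:10 of the VID hold a direction. */
#define DIR_SHIFT 10
#define DIR_MASK  (UINT16_C(3) << DIR_SHIFT)
#define DIR_RX    (UINT16_C(1) << DIR_SHIFT)
#define DIR_TX    (UINT16_C(2) << DIR_SHIFT)
#define PORT_MASK UINT16_C(0xf)

/* Build the RX or TX VLAN ID for a given port. */
static uint16_t make_rx_vid(unsigned int port)
{
	return DIR_RX | (port & PORT_MASK);
}

static uint16_t make_tx_vid(unsigned int port)
{
	return DIR_TX | (port & PORT_MASK);
}

/* Classify a VLAN ID by looking only at its direction bits. */
static bool vid_is_rx(uint16_t vid)
{
	return (vid & DIR_MASK) == DIR_RX;
}

static bool vid_is_tx(uint16_t vid)
{
	return (vid & DIR_MASK) == DIR_TX;
}

/* Recover the port number encoded in the low bits. */
static unsigned int vid_to_port(uint16_t vid)
{
	return vid & PORT_MASK;
}
```

With a classification like this, a driver can choose between an RX-type and a TX-type TCAM rule for each VLAN it is asked to install, while ordinary VLAN IDs such as 1 match neither direction.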
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • vmxnet3: Remove buf_info from device accessible structures · de1da8bc
      Ronak Doshi authored
      buf_info structures in RX & TX queues are private driver data that
      do not need to be visible to the device.  Although there is physical
      address and length in the queue descriptor that points to these
      structures, their layout is not standardized, and the device never looks
      at them.
      
      So let's allocate these structures in non-DMA-able memory, and fill
      physical address as all-ones and length as zero in the queue
      descriptor.
      
      That should alleviate worries brought by Martin Radev in
      https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210104/022829.html
      that a malicious vmxnet3 device could subvert SVM/TDX guarantees.
      Signed-off-by: Petr Vandrovec <petr@vmware.com>
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: hellcreek: Add missing TAPRIO dependency · 6c13d75b
      Kurt Kanzenbach authored
      Add missing dependency to TAPRIO to avoid build failures such as:
      
      |ERROR: modpost: "taprio_offload_get" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!
      |ERROR: modpost: "taprio_offload_free" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!
      
      Fixes: 24dfc6eb ("net: dsa: hellcreek: Add TAPRIO offloading support")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Link: https://lore.kernel.org/r/20210128163338.22665-1-kurt@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: proc: speedup /proc/net/netstat · 0d6cd689
      Eric Dumazet authored
      Use cache friendly helpers to better use cpu caches
      while reading /proc/net/netstat
      
      Tested on a platform with 256 threads (AMD Rome)
      
      Before: 305 usec spent in netstat_seq_show()
      After: 130 usec spent in netstat_seq_show()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Remove redundant calls of sk_tx_queue_clear(). · df610cd9
      Kuniyuki Iwashima authored
      The commit 41b14fb8 ("net: Do not clear the sock TX queue in
      sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
      it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
      the commit e022f0b4 ("net: Introduce sk_tx_queue_mapping"). On the
      other hand, the original commit had already put sk_tx_queue_clear() in
      sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
      sk_tx_queue_clear() is called twice in each path.
      
      If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
      currently works well because (i) sk_tx_queue_mapping is defined between
      sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
      sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
      However, if we move sk_tx_queue_mapping out of the no copy area, it
      introduces a bug unintentionally.
      
      Therefore, this patch adds a compile-time check to take care of the order
      of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
      sk_prot_alloc() so that it does the only allocation and its callers
      initialize fields.
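
To make the ordering concern concrete, here is a minimal userspace sketch, not the kernel code. It assumes a toy `struct mini_sock` with a "do not copy" region, a copy helper that skips that region, and a C11 `static_assert` that fails the build if the queue-mapping field is ever moved out of the region. All names (`mini_sock`, `NOCOPY_BEGIN`, `NOCOPY_END`, `mini_sock_copy`, `mini_sock_clone`) are hypothetical.

```c
#include <assert.h>   /* assert, static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <string.h>   /* memcpy */

/* Hypothetical miniature socket: refcnt and tx_queue_mapping form the
 * region that the copy helper must leave untouched. */
struct mini_sock {
    int family;            /* copied */
    int refcnt;            /* first no-copy field */
    int tx_queue_mapping;  /* no-copy field */
    int state;             /* first field copied again */
};

#define NOCOPY_BEGIN offsetof(struct mini_sock, refcnt)
#define NOCOPY_END   offsetof(struct mini_sock, state)

/* Build fails if tx_queue_mapping leaves the no-copy region, because
 * the clear-then-copy order below would then silently break. */
static_assert(offsetof(struct mini_sock, tx_queue_mapping) >= NOCOPY_BEGIN &&
              offsetof(struct mini_sock, tx_queue_mapping) +
                  sizeof(((struct mini_sock *)0)->tx_queue_mapping) <= NOCOPY_END,
              "tx_queue_mapping must stay inside the no-copy region");

/* Copy everything except the no-copy region. */
void mini_sock_copy(struct mini_sock *dst, const struct mini_sock *src)
{
    memcpy(dst, src, NOCOPY_BEGIN);
    memcpy((char *)dst + NOCOPY_END, (const char *)src + NOCOPY_END,
           sizeof(*dst) - NOCOPY_END);
}

/* Clear the mapping first, then copy: safe only while the assertion holds. */
void mini_sock_clone(struct mini_sock *newsk, const struct mini_sock *sk)
{
    newsk->tx_queue_mapping = -1;
    mini_sock_copy(newsk, sk);
}
```

The point of the assertion is that a future layout change fails at build time rather than reintroducing a stale queue mapping at run time.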
      
      CC: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      df610cd9
    • Jakub Kicinski's avatar
      Merge branch 'net-hns3-updates-for-next' · 77609b1d
      Jakub Kicinski authored
      Huazhong Tan says:
      
      ====================
      net: hns3: updates for -next
      
      This patchset adds debugfs dumps of tm info for nodes, priority and qset.
      Three debugfs files, tm_nodes, tm_priority and tm_qset, are created in a
      new tm directory and can be read with the cat command, for example:
      
      $ cat tm_nodes
             BASE_ID  MAX_NUM
      PG         0         8
      PRI        0         8
      QSET       0         8
      QUEUE      0      1024
      
      $ cat tm_priority
      ID    MODE  DWRR  C_IR_B  C_IR_U  C_IR_S  C_BS_B  C_BS_S  C_FLAG  C_RATE(Mbps)  P_IR_B  P_IR_U  P_IR_S  P_BS_B  P_BS_S  P_FLAG  P_RATE(Mbps)
      0000  dwrr  100     0       0       0       5      20       0          0        150       7       0       5      20       0          0
      0001    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0002    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0003    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0004    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0005    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0006    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      0007    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
      
      $ cat tm_qset
      ID    MAP_PRI  LINK_VLD  MODE  DWRR
      0000     0        1      dwrr  100
      0001     0        0        sp    0
      0002     0        0        sp    0
      0003     0        0        sp    0
      0004     0        0        sp    0
      0005     0        0        sp    0
      0006     0        0        sp    0
      
      change log:
      V2: add readonly files for dump all nodes, priority and qset info
          suggested by Jakub Kicinski.
      
      previous version:
      V1: https://patchwork.kernel.org/project/netdevbpf/patch/1610694569-43099-1-git-send-email-tanhuazhong@huawei.com/
      ====================
      
      Link: https://lore.kernel.org/r/1611834696-56207-1-git-send-email-tanhuazhong@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      77609b1d
    • Guangbin Huang's avatar
      net: hns3: add debugfs support for tm nodes, priority and qset info · 04987ca1
      Guangbin Huang authored
      In order to query tm info of nodes, priority and qset
      for debugging, add three debugfs files, tm_nodes,
      tm_priority and tm_qset, in a newly created tm directory.
      
      Unlike previous debugfs commands, these three files
      support only read ops, so their info is dumped with the
      cat command.
      
      The new tm file style follows Jakub Kicinski's suggestion
      in https://lkml.org/lkml/2020/9/29/2101.
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04987ca1
    • Guangbin Huang's avatar
      net: hns3: add interfaces to query information of tm priority/qset · 2bbad0aa
      Guangbin Huang authored
      Add some interfaces to get information about tm priority and qset,
      so that debugfs can use them.
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2bbad0aa
    • Jakub Kicinski's avatar
      Merge branch 'net-add-support-for-ip-generic-checksum-offload-for-gre' · 2d88296a
      Jakub Kicinski authored
      Xin Long says:
      
      ====================
      net: add support for ip generic checksum offload for gre
      
      This patchset first adds ip generic csum processing to
      skb_csum_hwoffload_help() in Patch 1/2 and then adds csum
      offload support for the GRE header in Patch 2/2.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1611825446.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2d88296a
    • Xin Long's avatar
      ip_gre: add csum offload support for gre header · efa1a65c
      Xin Long authored
      This patch is to add csum offload support for gre header:
      
      On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
      for inner proto, it will calculate the csum for outer proto, and
      inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
      and csum_start/offset will be set for outer proto, and the outer
      csum will be offloaded later.
      
      On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
      not set for inner proto and the hardware supports csum offload,
      CHECKSUM_PARTIAL and csum_start/offset will be set for outer
      proto, and outer csum will be offloaded later. Otherwise, it
      will do csum for outer proto by calling gso_make_checksum().
      
      Note that SCTP has to do the csum by itself for non GSO path in
      sctp_packet_pack(), as gre_build_header() can't handle the csum
      with CHECKSUM_PARTIAL set for SCTP CRC csum offload.
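
The two paths described above reduce to a small decision table. The sketch below only models which side computes the outer GRE checksum; it does not touch real skbs, and the names `outer_csum`, `gre_tx_outer_csum` and `gre_gso_outer_csum` are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Who ends up computing the outer (GRE) checksum. */
enum outer_csum {
    OUTER_CSUM_SOFTWARE,   /* computed by the stack right away */
    OUTER_CSUM_OFFLOADED   /* CHECKSUM_PARTIAL set, left for later offload */
};

/* TX path: if the inner protocol already uses CHECKSUM_PARTIAL, the
 * outer checksum is computed in software and the inner one is
 * offloaded later. Otherwise the outer checksum is the offloaded one. */
enum outer_csum gre_tx_outer_csum(bool inner_partial)
{
    return inner_partial ? OUTER_CSUM_SOFTWARE : OUTER_CSUM_OFFLOADED;
}

/* GSO path: the outer checksum is offloaded only when the inner
 * protocol does not use CHECKSUM_PARTIAL and the hardware supports
 * checksum offload. Every other case falls back to software. */
enum outer_csum gre_gso_outer_csum(bool inner_partial, bool hw_csum_offload)
{
    if (!inner_partial && hw_csum_offload)
        return OUTER_CSUM_OFFLOADED;
    return OUTER_CSUM_SOFTWARE;
}
```

In both paths, at most one checksum per packet is left in the CHECKSUM_PARTIAL state.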
      
      v1->v2:
        - remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
          and it will always do checksum for SCTP in sctp_packet_pack()
          when it's not a GSO packet.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      efa1a65c
    • Xin Long's avatar
      net: support ip generic csum processing in skb_csum_hwoffload_help · 62fafcd6
      Xin Long authored
      NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
      while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
      for HW, which includes not only for TCP/UDP csum, but also for other
      protocols' csum like GRE's.
      
      However, skb_csum_hwoffload_help() only checks features against
      NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
      packet and the features don't support NETIF_F_HW_CSUM, but support
      NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
      to do csum.
      
      This patch is to support ip generic csum processing by checking
      NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
      NETIF_F_IPV6_CSUM) only for TCP and UDP.
      
      Note that we're using skb->csum_offset to check if it's a TCP/UDP
      protocol, which might be fragile. However, as Alex said, for now only
      a few L4 protocols request Tx csum offload, so we can defer fixing
      this until a new protocol comes with the same csum offset.
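
The rule described above can be sketched in userspace as follows. The feature bits and the function name `hw_can_checksum` are hypothetical stand-ins, not the kernel's NETIF_F_* definitions. The offsets are the positions of the checksum field within the TCP header (byte 16) and the UDP header (byte 6).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the NETIF_F_* feature bits. */
#define F_IP_CSUM   (1u << 0)  /* TCP/UDP over IPv4 only */
#define F_IPV6_CSUM (1u << 1)  /* TCP/UDP over IPv6 only */
#define F_HW_CSUM   (1u << 2)  /* generic checksum for any protocol */

/* Byte offset of the checksum field inside each L4 header. */
#define TCP_CSUM_OFFSET 16u
#define UDP_CSUM_OFFSET 6u

/* Returns true if the device can be left to compute the checksum. */
bool hw_can_checksum(unsigned int features, unsigned int csum_offset)
{
    /* Generic offload covers every protocol. */
    if (features & F_HW_CSUM)
        return true;

    /* Protocol-specific offload covers only TCP and UDP, which are
     * recognised here by where their checksum field sits. */
    if (features & (F_IP_CSUM | F_IPV6_CSUM))
        return csum_offset == TCP_CSUM_OFFSET ||
               csum_offset == UDP_CSUM_OFFSET;

    return false;
}
```

With this rule, a GRE packet (checksum at byte 4 of the GRE header) on a device that only advertises the IP/IPv6 bits is no longer handed to the hardware.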
      
      v1->v2:
        - not extend skb->csum_not_inet, but use skb->csum_offset to tell
          if it's an UDP/TCP csum packet.
      v2->v3:
        - add a note in the changelog, as Willem suggested.
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      62fafcd6
    • Jakub Kicinski's avatar
      Merge tag 'linux-can-next-for-5.12-20210129' of... · fd3d3755
      Jakub Kicinski authored
      Merge tag 'linux-can-next-for-5.12-20210129' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      linux-can-next-for-5.12-20210129
      
      All patches are by me and target the mcp251xfd driver. The first 4
      patches update the information regarding the "85% of (FSYSCLK/2)"
      errata. The other 4 are misc cleanups: unify error messages, add a
      missing postfix to a macro, simplify the return of a function, and
      make use of dev_err_probe() in the mcp251xfd_probe() function.
      ====================
      
      Link: https://lore.kernel.org/r/20210129084302.3040284-1-mkl@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fd3d3755
    • Loic Poulain's avatar
      net: mhi: Get rid of local rx queue count · 6e10785e
      Loic Poulain authored
      Use the new mhi_get_free_desc_count helper to track queue usage
      instead of relying on the locally maintained rx_queued count.
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6e10785e
    • Loic Poulain's avatar
      net: mhi: Get RX queue size from MHI core · e6ec3ccd
      Loic Poulain authored
      The RX queue size can be determined at runtime by retrieving the
      number of available transfer descriptors.
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e6ec3ccd
    • Jakub Kicinski's avatar
      2bca263c
    • Jan Luebbe's avatar
      docs: networking: timestamping: fix section title markup · 5daf8384
      Jan Luebbe authored
      This section was missed during the conversion to ReST, so convert it in the
      same style as the surrounding section titles.
      Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
      Link: https://lore.kernel.org/r/20210128111930.29473-1-jlu@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5daf8384
    • dingsenjie's avatar
      net/ethernet: convert to use module_platform_driver in octeon_mgmt.c · afa4f675
      dingsenjie authored
      Simplify the code by using module_platform_driver macro
      for octeon_mgmt.
      Signed-off-by: dingsenjie <dingsenjie@yulong.com>
      Link: https://lore.kernel.org/r/20210128035330.17676-1-dingsenjie@163.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      afa4f675
    • Emil Renner Berthing's avatar
      net: atm: pppoatm: use new API for wakeup tasklet · a5874597
      Emil Renner Berthing authored
      This converts the driver to use the new tasklet API introduced in
      commit 12cc923f ("tasklet: Introduce new initialization API")
      Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
      Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a5874597
    • Emil Renner Berthing's avatar
      net: atm: pppoatm: use tasklet_init to initialize wakeup tasklet · a5b88632
      Emil Renner Berthing authored
      Previously a temporary tasklet structure was initialized on the stack
      using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing
      else in the kernel seems to use this pattern, so let's just call
      tasklet_init() like everyone else.
      Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
      Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a5b88632
    • Jakub Kicinski's avatar
      Merge branch 'net-sched-cls_flower-add-support-for-matching-on-ct_state-reply-flag' · 810e754c
      Jakub Kicinski authored
      Paul Blakey says:
      
      ====================
      net/sched: cls_flower: Add support for matching on ct_state reply flag
      
      This patchset adds software match support and offload of flower
      match ct_state reply flag (+/-rpl).
      
      The first patch adds the definition for the flag and match to flower.
      
      Second patch gives the direction of the connection to the offloading
      drivers via ct_metadata flow offload action.
      
      The last patch does offload of this new ct_state by using the supplied
      connection's direction.
      ====================
      
      Link: https://lore.kernel.org/r/1611757967-18236-1-git-send-email-paulb@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      810e754c
    • Paul Blakey's avatar
      net/mlx5: CT: Add support for matching on ct_state reply flag · 6895cb3a
      Paul Blakey authored
      Add support for matching on ct_state reply flag.
      
      Example:
      $ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
        ct_state +trk+est+rpl \
        action mirred egress redirect dev ens1f0_1
      $ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
        ct_state +trk+est-rpl \
        action mirred egress redirect dev ens1f0_0
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Acked-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6895cb3a
    • Paul Blakey's avatar
      net: flow_offload: Add original direction flag to ct_metadata · 941eff5a
      Paul Blakey authored
      Give offloading drivers the direction of the offloaded ct flow.
      This will be used for matches on direction (ct_state +/-rpl).
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      941eff5a
    • Paul Blakey's avatar
      net/sched: cls_flower: Add match on the ct_state reply flag · 8c85d18c
      Paul Blakey authored
      Add match on the ct_state reply flag.
      
      Example:
      $ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
        ct_state +trk+est+rpl \
        action mirred egress redirect dev ens1f0_1
      $ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
        ct_state +trk+est-rpl \
        action mirred egress redirect dev ens1f0_0
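
One way to read the `+flag` / `-flag` syntax is as a key/mask pair: `+rpl` requires the bit to be set, `-rpl` requires it to be clear, and an unmentioned flag is ignored. The sketch below illustrates that reading in plain C. The bit values and the names `CT_TRK`, `CT_EST`, `CT_RPL`, `ct_state_match` and `ct_state_matches` are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit assignments for the connection-tracking state. */
#define CT_TRK (1u << 0)  /* tracked */
#define CT_EST (1u << 1)  /* established */
#define CT_RPL (1u << 2)  /* packet travels in the reply direction */

/* A filter: mask selects the bits we care about, key gives the
 * value those bits must have. */
struct ct_state_match {
    unsigned int key;
    unsigned int mask;
};

/* "+flag" sets the bit in both key and mask;
 * "-flag" sets it in the mask only, so it must be clear. */
bool ct_state_matches(struct ct_state_match m, unsigned int pkt_state)
{
    return (pkt_state & m.mask) == m.key;
}
```

Under these assumptions, `+trk+est+rpl` becomes key = mask = TRK|EST|RPL, while `+trk+est-rpl` keeps the same mask but drops RPL from the key, so the two filters in the example split established traffic by direction.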
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8c85d18c