1. 06 Feb, 2023 28 commits
    • Merge branch 'bridge-mdb-limit' · cb3086ce
      David S. Miller authored
      Petr Machata says:
      
      ====================
      bridge: Limit number of MDB entries per port, port-vlan
      
      The MDB maintained by the bridge is limited. When the bridge is configured
      for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
      capacity. In SW datapath, the capacity is configurable through the
      IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
      similar limit exists in the HW datapath for purposes of offloading.
      
      In order to prevent the issue of unilateral exhaustion of MDB resources,
      introduce two parameters in each of two contexts:
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN number of MDB entries that the port is a member of.
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN maximum permitted number of MDB entries, or 0 for
        no limit.
      
      Per-port number of entries keeps track of the total number of MDB entries
      configured on a given port. The per-port-VLAN value then keeps track of the
      subset of MDB entries configured specifically for the given VLAN, on that
      port. The number is adjusted as port_groups are created and deleted, and
      therefore under multicast lock.
      
      A maximum value, if non-zero, then places a limit on the number of entries
      that can be configured in a given context. Attempts to add entries above
      the maximum are rejected.
      
      Rejection reason of netlink-based requests to add MDB entries is
      communicated through extack. This channel is unavailable for rejections
      triggered from the control path. To address this lack of visibility, the
      patchset adds a tracepoint, bridge:br_mdb_full:
      
      	# perf record -e bridge:br_mdb_full &
      	# [...]
      	# perf script | cut -d: -f4-
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
      	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
      	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
      	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
      
      Another option to consume the tracepoint is e.g. through the bpftrace tool:
      
      	# bpftrace -e ' tracepoint:bridge:br_mdb_full /args->af != 0/ {
      			    printf("dev %s src %s grp %s vid %u\n",
      				   str(args->dev), ntop(args->src),
      				   ntop(args->grp), args->vid);
      			}
      			tracepoint:bridge:br_mdb_full /args->af == 0/ {
      			    printf("dev %s grp %s vid %u\n",
      				   str(args->dev),
      				   macaddr(args->grpmac), args->vid);
      			}'
      
      This tracepoint is triggered for mcast_hash_max exhaustions as well.
      
      The following is an example of how the feature is used. A more extensive
      example is available in patch #8:
      
      	# bridge vlan set dev v1 vid 1 mcast_max_groups 1
      	# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
      	# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
      	Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
      
      The patchset progresses as follows:
      
      - In patch #1, set strict_start_type at two bridge-related policies. The
        reason is we are adding a new attribute to one of these, and want the new
        attribute to be parsed strictly. The other was adjusted for completeness'
        sake.
      
      - In patches #2 to #5, br_mdb and br_multicast code is adjusted to make the
        following additions smoother.
      
      - In patch #6, add the tracepoint.
      
      - In patch #7, the code to maintain number of MDB entries is added as
        struct net_bridge_mcast_port::mdb_n_entries. The maximum is added, too,
        as struct net_bridge_mcast_port::mdb_max_entries, however at this point
        there is no way to set the value yet, and since 0 is treated as "no
        limit", the functionality doesn't change at this point. Note however,
        that mcast_hash_max violations already do trigger at this point.
      
      - In patch #8, netlink plumbing is added: reading of number of entries, and
        reading and writing of maximum.
      
        The per-port values are passed through RTM_NEWLINK / RTM_GETLINK messages
        in IFLA_BRPORT_MCAST_N_GROUPS and _MAX_GROUPS, inside IFLA_PROTINFO nest.
      
        The per-port-vlan values are passed through RTM_GETVLAN / RTM_NEWVLAN
        messages in BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS, _MAX_GROUPS, inside
        BRIDGE_VLANDB_ENTRY.
      
      The following patches deal with the selftest:
      
      - Patches #9 and #10 clean up and move around some selftest code.
      
      - Patches #11 to #14 add helpers and generalize the existing IGMP / MLD
        support to allow generating packets with configurable group addresses and
        varying source lists for (S,G) memberships.
      
      - Patch #15 adds code to generate IGMP leave and MLD done packets.
      
      - Patch #16 finally adds the selftest itself.
      
      v3:
      - Patch #7:
          - Access mdb_max_/_n_entries through READ_/WRITE_ONCE
          - Move extack setting to br_multicast_port_ngroups_inc_one().
            Since we use NL_SET_ERR_MSG_FMT_MOD, the correct context
            (port / port-vlan) can be passed through an argument.
            This also removes the need for more READ/WRITE_ONCE's
            at the extack-setting site.
      - Patch #8:
          - Move the br_multicast_port_ctx_vlan_disabled() check
            out to the _vlan_ helpers callers. Thus these helpers
            cannot fail, which makes them very similar to the
            _port_ helpers. Have them take the MC context directly
            and unify them.
      
      v2:
      - Cover letter:
          - Add an example of a bpftrace-based probe script
      - Patch #6:
          - Report IPv4 as an IPv6-mapped address through the IPv6 buffer
            as well, to save ring buffer space.
      - Patch #7:
          - In br_multicast_port_ngroups_inc_one(), bounce
            if n>=max, not if n==max
          - Adjust extack messages to mention ngroups, now
            that the bounces appear when n>=max, not n==max
          - In __br_multicast_enable_port_ctx(), do not reset
            max to 0. Also do not count number of entries by
            going through _inc, as that would end up incorrectly
            bouncing the entries.
      - Patch #8:
          - Drop locks around accesses in
            br_multicast_{port,vlan}_ngroups_{get,set_max}(),
          - Drop bounces due to max<n in
            br_multicast_{port,vlan}_ngroups_set_max().
      - Patch #12:
          - In the comment at payload_template_calc_checksum(),
            s/%#02x/%02x/, that's the mausezahn payload format.
      - Patch #16:
          - Adjust the tests that check setting max below n and
            reset of max on VLAN snooping enablement
          - Make test naming uniform
          - Enable testing of control path (IGMP/MLD) in
            mcast_vlan_snooping bridge
          - Reorganize the code so that test instances (per bridge
            type and configuration type) always come right after
            the test, in order of {d,q,qvs}{4,6}{cfg,ctl}.
            Then groups of selftests are at the end of the file.
            Similarly adjust invocation order of the tests.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: bridge_mdb_max: Add a new selftest · 3446dcd7
      Petr Machata authored
      Add a suite covering mcast_n_groups and mcast_max_groups bridge features.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Add helpers to build IGMP/MLD leave packets · 9ae85469
      Petr Machata authored
      The testsuite that checks for mcast_max_groups functionality will need to
      wipe the added groups as well. Add helpers to build IGMP or MLD packets
      announcing that a host is leaving a given group.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Allow list of IPs for IGMPv3/MLDv2 · 705d4bc7
      Petr Machata authored
      The testsuite that checks for mcast_max_groups functionality will need
      to generate IGMP and MLD packets with configurable number of (S,G)
      addresses. To that end, further extend igmpv3_is_in_get() and
      mldv2_is_in_get() to allow a list of IP addresses instead of one
      address.
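As an illustration of what "a list of IP addresses instead of one address" means for the generated payload, here is a minimal Python sketch (the selftest library itself is shell) that builds an IGMPv3 membership report with a single MODE_IS_INCLUDE group record, laid out per RFC 3376. The function names `igmpv3_is_in_template` and `u16_to_bytes` are Python stand-ins chosen for this example, and the checksum is left as a placeholder token whose spelling is an assumption.

```python
import ipaddress
from typing import Sequence


def u16_to_bytes(value: int) -> str:
    """Render a 16-bit value as two colon-separated hex bytes (0x0102 -> '01:02')."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    return f"{value >> 8:02x}:{value & 0xFF:02x}"


def ipv4_to_bytes(addr: str) -> str:
    """Render an IPv4 address as colon-separated hex bytes."""
    return ":".join(f"{b:02x}" for b in ipaddress.IPv4Address(addr).packed)


def igmpv3_is_in_template(group: str, sources: Sequence[str],
                          placeholder: str = "PAYLOAD") -> str:
    """Build an IGMPv3 report with one MODE_IS_INCLUDE record (RFC 3376).

    The checksum field is emitted as `placeholder` (an assumed token name),
    to be filled in by a separate checksum step.
    """
    fields = [
        "22",                        # type: IGMPv3 Membership Report
        "00",                        # reserved
        placeholder,                 # checksum, filled in later
        "00:00",                     # reserved
        u16_to_bytes(1),             # number of group records
        "01",                        # record type: MODE_IS_INCLUDE
        "00",                        # auxiliary data length
        u16_to_bytes(len(sources)),  # number of sources in this record
        ipv4_to_bytes(group),        # multicast group address
    ]
    fields += [ipv4_to_bytes(src) for src in sources]
    return ":".join(fields)


if __name__ == "__main__":
    print(igmpv3_is_in_template("239.1.1.1", ["192.0.2.1", "192.0.2.2"]))
```

Each extra source adds four bytes to the record and bumps the 16-bit source count, which is why the count has to be rendered in payload format as well.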
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Parameterize IGMPv3/MLDv2 generation · 506a1ac9
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, the
      functions that generate these packets need to be able to generate
      packets for different groups and different sources. Generating MLDv2
      packets further needs the source address of the packet for purposes of
      checksum calculation. Add the necessary parameters, and generate the
      payload accordingly by dispatching to helpers added in the previous
      patches.
      
      Adjust the sole client, bridge_mdb.sh, as well.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Add helpers for checksum handling · 952e0ee3
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, we will need
      helpers to calculate the packet checksum.
      
      The approach presented in this patch revolves around payload templates
      for mausezahn. These are mausezahn-like payload strings (01:23:45:...)
      with possibly one 2-byte sequence replaced with the word PAYLOAD. The
      main function is payload_template_calc_checksum(), which calculates
      RFC 1071 checksum of the message. There are further helpers to then
      convert the checksum to the payload format, and to expand it.
      
      For IPv6, MLDv2 message checksum is computed using a pseudoheader that
      differs from the header used in the payload itself. The fact that the
      two messages are different means that the checksum needs to be
      returned as a separate quantity, instead of being expanded in-place in
      the payload itself. Furthermore, the pseudoheader includes a length of
      the message. Much like the checksum, this needs to be expanded in
      mausezahn format. And likewise for number of addresses for (S,G)
      entries. Thus we have several places where a computed quantity needs
      to be presented in the payload format. Add a helper u16_to_bytes(),
      which will be used in all these cases.
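The template approach described above can be sketched in Python as follows. This is not the shell code from the patch: `template_checksum` and `expand_checksum` are hypothetical names, and the placeholder token is passed as a parameter because its exact spelling in the library is an assumption here. The checksum is the RFC 1071 one's-complement sum over 16-bit big-endian words, with the checksum field treated as zero.

```python
def u16_to_bytes(value: int) -> str:
    """Render a 16-bit value as two colon-separated hex bytes (0x68f5 -> '68:f5')."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    return f"{value >> 8:02x}:{value & 0xFF:02x}"


def template_checksum(template: str, placeholder: str = "PAYLOAD") -> int:
    """Compute the RFC 1071 checksum of a colon-separated hex payload.

    The placeholder (if present) stands for the 2-byte checksum field and is
    counted as zero, as the RFC requires.
    """
    hex_bytes = template.replace(placeholder, "00:00").split(":")
    data = bytes(int(b, 16) for b in hex_bytes)
    if len(data) % 2:
        data += b"\x00"              # pad odd-length messages with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def expand_checksum(template: str, placeholder: str = "PAYLOAD") -> str:
    """Replace the placeholder with the computed checksum in payload format."""
    checksum = template_checksum(template, placeholder)
    return template.replace(placeholder, u16_to_bytes(checksum))


if __name__ == "__main__":
    tmpl = "22:00:PAYLOAD:00:00:00:01:01:00:00:02:ef:01:01:01:c0:00:02:01:c0:00:02:02"
    print(expand_checksum(tmpl))
```

A quick sanity check for any RFC 1071 implementation is that recomputing the checksum over the expanded message yields zero. For MLDv2 the same summation would run over a pseudoheader-prefixed message, which is why the patch returns the checksum as a separate quantity rather than always expanding it in place.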
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Add helpers for IP address handling · fcf49276
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, we will need
      helpers to expand IPv4 and IPv6 addresses given as parameters in
      mausezahn payload notation. Add helpers that do it.
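For illustration, expanding an address into colon-separated hex bytes might look like this in Python. The actual helpers are shell functions; `ip_to_bytes` is a hypothetical name used only in this sketch.

```python
import ipaddress


def ip_to_bytes(addr: str) -> str:
    """Expand an IPv4 or IPv6 address into colon-separated hex bytes.

    IPv4 yields 4 bytes and IPv6 yields 16 bytes, in network byte order.
    """
    packed = ipaddress.ip_address(addr).packed
    return ":".join(f"{b:02x}" for b in packed)


if __name__ == "__main__":
    print(ip_to_bytes("192.0.2.1"))
    print(ip_to_bytes("2001:db8:1::1"))
```

Note that compressed IPv6 forms such as `::` must be fully expanded, since every byte of the address occupies a fixed position in the packet.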
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: bridge_mdb: Fix a typo · f7ccf60c
      Petr Machata authored
      Add the letter missing from the word "INCLUDE".
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Move IGMP- and MLD-related functions to lib · 344dd2c9
      Petr Machata authored
      These functions will be helpful for other testsuites as well. Extract them
      to a common place.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add netlink knobs for number / maximum MDB entries · a1aee20d
      Petr Machata authored
      The previous patch added accounting for number of MDB entries per port and
      per port-VLAN, and the logic to verify that these values stay within
      configured bounds. However it didn't provide means to actually configure
      those bounds or read the occupancy. This patch does that.
      
      Two new netlink attributes are added for the MDB occupancy:
      IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
      BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
      And another two for the maximum number of MDB entries:
      IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
      BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
      
      Note that the two new IFLA_BRPORT_ attributes prompt bumping of
      RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
      
      The new attributes are used like this:
      
       # ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
                                            mcast_vlan_snooping 1 mcast_querier 1
       # ip link set dev v1 master br
       # bridge vlan add dev v1 vid 2
      
       # bridge vlan set dev v1 vid 1 mcast_max_groups 1
       # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
       # bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
       Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
      
       # bridge link set dev v1 mcast_max_groups 1
       # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
       Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.
      
       # bridge -d link show
       5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
           [...] mcast_n_groups 1 mcast_max_groups 1
      
       # bridge -d vlan show
       port              vlan-id
       br                1 PVID Egress Untagged
                           state forwarding mcast_router 1
       v1                1 PVID Egress Untagged
                           [...] mcast_n_groups 1 mcast_max_groups 1
                         2
                           [...] mcast_n_groups 0 mcast_max_groups 0
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Maintain number of MDB entries in net_bridge_mcast_port · b57e8d87
      Petr Machata authored
      The MDB maintained by the bridge is limited. When the bridge is configured
      for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
      capacity. In SW datapath, the capacity is configurable through the
      IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
      similar limit exists in the HW datapath for purposes of offloading.
      
      In order to prevent the issue of unilateral exhaustion of MDB resources,
      introduce two parameters in each of two contexts:
      
      - Per-port and per-port-VLAN number of MDB entries that the port
        is a member of.
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN maximum permitted number of MDB entries, or 0 for
        no limit.
      
      The per-port multicast context is used for tracking of MDB entries for the
      port as a whole. This is available for all bridges.
      
      The per-port-VLAN multicast context is then only available on
      VLAN-filtering bridges on VLANs that have multicast snooping on.
      
      With these changes in place, it will be possible to configure an MDB limit for
      the bridge as a whole, for any one port as a whole, or for any single port-VLAN.
      
      Note that unlike the global limit, exhaustion of the per-port and
      per-port-VLAN maximums does not cause disablement of multicast snooping.
      It is also permitted to configure the local limit larger than hash_max,
      even though that is not useful.
      
      In this patch, introduce only the accounting for number of entries, and the
      max field itself, but not the means to toggle the max. The next patch
      introduces the netlink APIs to toggle and read the values.
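The accounting rule can be modelled in a few lines of Python. This is a toy model, not the kernel code: `McastCtx`, `add_entry` and `del_entry` are hypothetical names, and the order in which the two contexts are checked is an assumption. What it does take from the series is that a maximum of 0 means "no limit", that an add is bounced when n >= max, and that an entry on a snooping-enabled VLAN counts against both the port and the port-VLAN context.

```python
from dataclasses import dataclass
from typing import Optional


class MdbFull(Exception):
    """Raised when adding an entry would exceed a configured maximum."""


@dataclass
class McastCtx:
    """Hypothetical stand-in for one multicast context (port or port-VLAN)."""
    label: str            # "Port" or "Port-VLAN", used only in messages
    n_entries: int = 0    # current number of MDB entries
    max_entries: int = 0  # 0 means "no limit"

    def is_full(self) -> bool:
        return self.max_entries != 0 and self.n_entries >= self.max_entries


def add_entry(port: McastCtx, port_vlan: Optional[McastCtx] = None) -> None:
    """Account one new entry, checking every context before touching any."""
    contexts = [port] if port_vlan is None else [port, port_vlan]
    for ctx in contexts:
        if ctx.is_full():
            raise MdbFull(f"{ctx.label} is already in {ctx.n_entries} groups, "
                          f"and mcast_max_groups={ctx.max_entries}.")
    for ctx in contexts:
        ctx.n_entries += 1


def del_entry(port: McastCtx, port_vlan: Optional[McastCtx] = None) -> None:
    """Account the removal of one entry."""
    for ctx in ([port] if port_vlan is None else [port, port_vlan]):
        ctx.n_entries -= 1


if __name__ == "__main__":
    port = McastCtx("Port")
    vlan1 = McastCtx("Port-VLAN", max_entries=1)
    add_entry(port, vlan1)
    try:
        add_entry(port, vlan1)
    except MdbFull as err:
        print("Error:", err)
```

Lowering `max_entries` below the current count is simply an assignment in this model: existing entries stay, and only further additions are rejected.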
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add a tracepoint for MDB overflows · d47230a3
      Petr Machata authored
      The following patch will add two more maximum MDB allowances to the global
      one, mcast_hash_max, that exists today. In all these cases, attempts to add
      MDB entries above the configured maximums through netlink, fail noisily and
      obviously. Such visibility is missing when adding entries through the
      control plane traffic, by IGMP or MLD packets.
      
      To improve visibility in those cases, add a trace point that reports the
      violation, including the relevant netdevice (be it a slave or the bridge
      itself), and the MDB entry parameters:
      
      	# perf record -e bridge:br_mdb_full &
      	# [...]
      	# perf script | cut -d: -f4-
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
      	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
      	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
      	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
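Because IPv4 addresses are reported as IPv4-mapped IPv6 addresses, a consumer of this output has to look at `af` to decide how to present `src` and `grp`. The following Python sketch parses one line in the format shown above; `parse_mdb_full` is a hypothetical helper, the field layout is inferred only from the sample lines, and the treatment of `af 0` (L2 groups identified by MAC) is an assumption.

```python
import ipaddress

AF_INET, AF_INET6 = 2, 10  # Linux address-family numbers seen in the "af" field


def parse_mdb_full(line: str) -> dict:
    """Parse one 'dev ... af ... src ... grp .../MAC vid ...' line."""
    tokens = line.split()
    fields = dict(zip(tokens[0::2], tokens[1::2]))  # alternating key/value pairs
    af = int(fields["af"])
    grp_ip, grp_mac = fields["grp"].split("/")

    def unmap(text: str) -> str:
        """Turn ::ffff:a.b.c.d back into a.b.c.d for IPv4 entries."""
        addr = ipaddress.IPv6Address(text)
        if af == AF_INET and addr.ipv4_mapped is not None:
            return str(addr.ipv4_mapped)
        return str(addr)

    is_ip = af in (AF_INET, AF_INET6)
    return {
        "dev": fields["dev"],
        "af": af,
        "src": unmap(fields["src"]) if is_ip else None,
        "grp": unmap(grp_ip) if is_ip else grp_mac,
        "vid": int(fields["vid"]),
    }


if __name__ == "__main__":
    sample = ("dev v2 af 2 src ::ffff:192.0.2.1 "
              "grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10")
    print(parse_mdb_full(sample))
```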
      
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: linux-trace-kernel@vger.kernel.org
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Change a cleanup in br_multicast_new_port_group() to goto · eceb3085
      Petr Machata authored
      This function is getting more to clean up in the following patches.
      Structuring the cleanups in one labeled block will allow reusing the same
      cleanup from several places.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add br_multicast_del_port_group() · 976b3858
      Petr Machata authored
      Since cleaning up the effects of br_multicast_new_port_group() just
      consists of delisting and freeing the memory, the function
      br_mdb_add_group_star_g() inlines the corresponding code. In the following
      patches, number of per-port and per-port-VLAN MDB entries is going to be
      maintained, and that counter will have to be updated. Because that logic
      is going to be hidden in the br_multicast module, introduce a new hook
      intended to again remove a newly-created group.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Move extack-setting to br_multicast_new_port_group() · 1c85b80b
      Petr Machata authored
      Now that br_multicast_new_port_group() takes an extack argument, move
      setting the extack there. The downside is that the error messages end
      up being less specific (the function cannot distinguish between (S,G)
      and (*,G) groups). However, the alternative is to check in the caller
      whether the callee set the extack, and if it didn't, set it. But that
      is only done when the callee is not exactly known. (E.g. in case of a
      notifier invocation.)
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add extack to br_multicast_new_port_group() · 60977a0c
      Petr Machata authored
      Make it possible to set an extack in br_multicast_new_port_group().
      Eventually, this function will check for per-port and per-port-vlan
      MDB maximums, and will use the extack to communicate the reason for
      the bounce.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Set strict_start_type at two policies · c00041cf
      Petr Machata authored
      Make any attributes newly-added to br_port_policy or vlan_tunnel_policy
      parsed strictly, to prevent userspace from passing garbage. Note that this
      patchset only touches the former policy. The latter was adjusted for
      completeness' sake. There do not appear to be other _deprecated calls
      with non-NULL policies.
      Suggested-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sparx5-PSFP-support' · 8b7018fa
      David S. Miller authored
      Daniel Machon says:
      
      ====================
      net: Add support for PSFP in Sparx5
      
      ================================================================================
      Add support for Per-Stream Filtering and Policing (802.1Q-2018, 8.6.5.1).
      ================================================================================
      
      The VCAP CLM (VCAP IS0 ingress classifier) classifies streams,
      identified by ISDX (Ingress Service Index, frame metadata), and maps
      ISDX to streams.
      
      Flow meters are also classified by ISDX, and implemented using service
      policers (Service Dual Leaky Buckets, SDLB). Leaky buckets are linked
      together in a leak chain of a leak group. Leak groups are preconfigured to serve
      buckets within a certain rate interval.
      
      Stream gates are time-based policers used by PSFP. Frames are dropped
      based on the gate state (OPEN/CLOSE), whose state will be altered based
      on the Gate Control List (GCL) and current PTP time. Apart from
      time-based policing, stream gates can alter egress queue selection for
      the frames that pass through the Gate. This is done through Internal
      Priority Selector (IPS). Stream gates are mapped from stream filters.
      
      Support for the tc actions gate and police has been added to the VCAP IS0 set of
      supported actions.
      
      Examples:
      
      // tc filter with gate action
      $ tc filter add dev eth1 ingress chain 1100000 prio 1 handle 1001 protocol \
      802.1q flower skip_sw vlan_id 100 action gate base-time 0 sched-entry open \
      700000 7 8m sched-entry close 300000 action goto chain 1200000
      
      // tc filter with police action
      $ tc filter add dev eth1 ingress chain 1100000 prio 1 handle 1002 protocol \
      802.1q flower skip_sw vlan_id 100 action police rate 1gbit burst 8096      \
      conform-exceed drop action goto chain 1200000
      
      ================================================================================
      Patches
      ================================================================================
      Patch #1:  Adds new registers needed for PSFP.
      Patch #2:  Adds resource pools to control chip resources needed by PSFP.
      Patch #3:  Adds support for SDLB's needed for flow-meters.
      Patch #4:  Adds support for service policers.
      Patch #5:  Adds support for PSFP flow-meters, using service policers.
      Patch #6:  Adds a new function to calculate basetime, required by stream gates.
      Patch #7:  Adds support for PSFP stream gates.
      Patch #8:  Adds support for PSFP stream filters.
      Patch #9:  Adds a function to initialize flow-meters, stream gates and stream
                 filters.
      Patch #10: Adds the required flower code to configure PSFP using the tc command.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sparx5: add support for configuring PSFP via tc · 6ebf182b
      Daniel Machon authored
      Add support for tc actions gate and police, in order to implement
      support for configuring PSFP through tc.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: initialize PSFP · e116b19d
      Daniel Machon authored
      Initialize the SDLB's, stream gates and stream filters.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add support for PSFP stream filters · ae3e691f
      Daniel Machon authored
      Add support for configuring PSFP stream filters (IEEE 802.1Q-2018,
      8.6.5.1.1).
      
      The VCAP CLM (VCAP IS0 ingress classifier) classifies streams,
      identified by ISDX (Ingress Service Index, frame metadata), and maps
      ISDX to streams.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add support for PSFP stream gates · c70a5e2c
      Daniel Machon authored
      Add support for configuring PSFP stream gates (IEEE 802.1Q-2018,
      8.6.5.1.2).
      
      Stream gates are time-based policers used by PSFP. Frames are dropped
      based on the gate state (OPEN/CLOSE), whose state will be altered based
      on the Gate Control List (GCL) and current PTP time. Apart from
      time-based policing, stream gates can alter egress queue selection for
      the frames that pass through the Gate. This is done through Internal
      Priority Selector (IPS). Stream gates are mapped from stream filters.
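To make the relationship between the Gate Control List, the base time and the current time concrete, here is a small Python model of how a gate state could be derived. It is a software illustration only (on Sparx5 the hardware evaluates the gate); `gate_state` is a hypothetical function, and the cycle time is assumed to be the sum of the entry intervals.

```python
from typing import List, Tuple

GateEntry = Tuple[str, int]  # (state, interval in nanoseconds)


def gate_state(now_ns: int, base_time_ns: int, gcl: List[GateEntry]) -> str:
    """Return the gate state ("open" or "close") in effect at now_ns.

    Assumes the schedule has started (now_ns >= base_time_ns) and repeats
    with a cycle equal to the sum of all entry intervals.
    """
    if not gcl:
        raise ValueError("gate control list is empty")
    if now_ns < base_time_ns:
        raise ValueError("schedule has not started yet")
    cycle_ns = sum(interval for _, interval in gcl)
    offset = (now_ns - base_time_ns) % cycle_ns  # position within the cycle
    for state, interval in gcl:
        if offset < interval:
            return state
        offset -= interval
    raise AssertionError("unreachable: offset is always inside the cycle")


if __name__ == "__main__":
    # Mirrors "sched-entry open 700000 ... sched-entry close 300000".
    gcl = [("open", 700_000), ("close", 300_000)]
    for t in (0, 699_999, 700_000, 1_000_000):
        print(t, gate_state(t, 0, gcl))
```

With a 700 µs open window followed by a 300 µs closed window, frames arriving in the last 300 µs of every millisecond would be dropped by the gate.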
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add function for calculating PTP basetime · 9e02131e
      Daniel Machon authored
      Add a new function for calculating PTP basetime, required by the stream
      gate scheduler to calculate gate state (open / close).
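The commit message does not spell out the calculation. A common approach in time-aware schedulers, shown here purely as an assumption, is to advance a base time that lies in the past by whole cycles until it reaches the future, so that the schedule keeps its phase. `next_base_time` is a hypothetical name.

```python
def next_base_time(base_time_ns: int, cycle_ns: int, now_ns: int) -> int:
    """Return a start time at or after now_ns that preserves the schedule phase.

    If the requested base time is still in the future it is used as-is;
    otherwise it is advanced by a whole number of cycles to the first
    cycle boundary strictly after now_ns.
    """
    if cycle_ns <= 0:
        raise ValueError("cycle time must be positive")
    if base_time_ns >= now_ns:
        return base_time_ns
    cycles_elapsed = (now_ns - base_time_ns) // cycle_ns + 1
    return base_time_ns + cycles_elapsed * cycle_ns


if __name__ == "__main__":
    # A schedule configured with base-time 0 and a 1 ms cycle, applied at 2.5 ms.
    print(next_base_time(0, 1_000_000, 2_500_000))
```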
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add support for PSFP flow-meters · d2185e79
      Daniel Machon authored
      Add support for configuring PSFP flow-meters (IEEE 802.1Q-2018,
      8.6.5.1.3).
      
      The VCAP CLM (VCAP IS0 ingress classifier) classifies streams,
      identified by ISDX (Ingress Service Index, frame metadata), and maps
      ISDX to flow-meters. SDLB's provide the flow-meter parameters.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add support for service policers · 1db82abf
      Daniel Machon authored
      Add initial API for configuring policers. This patch adds support for
      service policers.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add support for Service Dual Leacky Buckets · 9bf50889
      Daniel Machon authored
      Add support for Service Dual Leaky Buckets (SDLB), used to implement
      PSFP flow-meters. Buckets are linked together in a leak chain of a leak
      group. Leak groups are preconfigured to serve buckets within a certain
      rate interval.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: microchip: sparx5: add resource pools · bb535c0d
      Daniel Machon authored
      Add resource pools and accessor functions. These pools can be queried by
      the driver, whenever a finite resource is required. Some resources can
      be reused, in which case an index and a reference count are used to keep
      track of users.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb535c0d
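The index-plus-reference-count scheme described in the commit above can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the driver's code: `pool_entry`, `pool_get()`, `pool_put()` and `POOL_SIZE` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* hypothetical pool size, for illustration only */

/* One slot of a finite resource; refcount == 0 means the slot is free. */
struct pool_entry {
	int refcount;
	unsigned int idx;   /* key that identifies a shareable resource */
};

/*
 * Take a reference on the entry keyed by idx, or claim the first free
 * slot if no entry with that key is in use. Returns a 1-based slot id,
 * or -1 when the pool is exhausted.
 */
static int pool_get(struct pool_entry *pool, size_t size, unsigned int idx)
{
	int free_slot = -1;

	assert(pool != NULL);

	for (size_t i = 0; i < size; i++) {
		if (pool[i].refcount > 0 && pool[i].idx == idx) {
			pool[i].refcount++;   /* reuse: one more user */
			return (int)i + 1;
		}
		if (pool[i].refcount == 0 && free_slot < 0)
			free_slot = (int)i;   /* remember the first free slot */
	}
	if (free_slot < 0)
		return -1;

	pool[free_slot].refcount = 1;
	pool[free_slot].idx = idx;
	return free_slot + 1;
}

/*
 * Drop one reference. The slot becomes free again when the count
 * reaches 0. Returns the remaining count, or -1 for an invalid id.
 */
static int pool_put(struct pool_entry *pool, size_t size, int id)
{
	if (id < 1 || (size_t)id > size || pool[id - 1].refcount == 0)
		return -1;
	return --pool[id - 1].refcount;
}
```

Two callers that ask for the same `idx` share one slot, and the slot is only handed out again once both have called `pool_put()`.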
    • Daniel Machon's avatar
      net: microchip: add registers needed for PSFP · edad83e2
      Daniel Machon authored
      Add registers needed for PSFP. This patch also renames a single
      register, shortening its name (SYS_CLK_PER_100PS). Uses have been updated
      accordingly.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edad83e2
  2. 04 Feb, 2023 12 commits
    • David S. Miller's avatar
      Merge branch 'net-smc-parallelism' · 042b7858
      David S. Miller authored
      D. Wythe says:
      
      ====================
      net/smc: optimize the parallelism of SMC-R connections
      
      This patch set attempts to optimize the parallelism of SMC-R connections,
      mainly to reduce unnecessary blocking on locks, and to fix exceptions that
      occur after those optimizations.
      
      According to the off-CPU graph, the SMC worker's off-CPU time breaks down as follows:
      
      smc_close_passive_work                  (1.09%)
              smcr_buf_unuse                  (1.08%)
                      smc_llc_flow_initiate   (1.02%)
      
      smc_listen_work                         (48.17%)
              __mutex_lock.isra.11            (47.96%)
      
      An ideal SMC-R connection process should only block on the IO events
      of the network, but it is clear that SMC-R connections currently spend
      most of their time queued on locks.
      
      The goal of this patch set is to reach the ideal situation, in which
      connections block mainly on network IO events over their lifetime.
      
      There are three big locks here:
      
      1. smc_client_lgr_pending & smc_server_lgr_pending
      
      2. llc_conf_mutex
      
      3. rmbs_lock & sndbufs_lock
      
      And an implementation issue:
      
      1. confirm/delete rkey messages can't be sent concurrently, although
      the protocol allows it.
      
      Unfortunately, the above problems together limit the parallelism of
      SMC-R connections. If any one of them is left unsolved, our goal cannot
      be achieved.
      
      After this patch set, we get a nearly ideal off-CPU graph, as
      follows:
      
      smc_close_passive_work                                  (41.58%)
              smcr_buf_unuse                                  (41.57%)
                      smc_llc_do_delete_rkey                  (41.57%)
      
      smc_listen_work                                         (39.10%)
              smc_clc_wait_msg                                (13.18%)
                      tcp_recvmsg_locked                      (13.18%)
              smc_listen_find_device                          (25.87%)
                      smcr_lgr_reg_rmbs                       (25.87%)
                              smc_llc_do_confirm_rkey         (25.87%)
      
      We can see that most of the waiting time is now spent on network IO
      events. This also yields a measurable performance improvement in our
      short-lived connection wrk/nginx benchmark:
      
      +--------------+------+------+-------+--------+------+--------+
      |conns/qps     |c4    | c8   |  c16  |  c32   | c64  |  c200  |
      +--------------+------+------+-------+--------+------+--------+
      |SMC-R before  |9.7k  | 10k  |  10k  |  9.9k  | 9.1k |  8.9k  |
      +--------------+------+------+-------+--------+------+--------+
      |SMC-R now     |13k   | 19k  |  18k  |  16k   | 15k  |  12k   |
      +--------------+------+------+-------+--------+------+--------+
      |TCP           |15k   | 35k  |  51k  |  80k   | 100k |  162k  |
      +--------------+------+------+-------+--------+------+--------+
      
      The reason why the benefit is less pronounced as the number of connections
      increases is the workqueue. If we change the workqueue to UNBOUND,
      we can obtain at least a 4-5x performance improvement, reaching up to half
      of TCP. However, this is not an elegant solution, and optimizing it properly
      will be much more complicated. In any case, we will submit the relevant
      optimization patches as soon as possible.
      
      Please note that the premise here is that the lock related problem
      must be solved first, otherwise, no matter how we optimize the workqueue,
      there won't be much improvement.
      
      Because there are a lot of related changes to the code, if you have
      any questions or suggestions, please let me know.
      
      Thanks
      D. Wythe
      
      v1 -> v2:
      
      1. Fix panic in SMC-D scenario
      2. Fix lnkc related hashfn calculation exception, caused by operator
      priority
      3. Only wake up one connection if the lnk is not active
      4. Delete obsolete unlock logic in smc_listen_work()
      5. PATCH format, do Reverse Christmas tree
      6. PATCH format, change all xxx_lnk_xxx function to xxx_link_xxx
      7. PATCH format, add correct fix tag for the patches for fixes.
      8. PATCH format, fix some spelling errors
      9. PATCH format, rename slow to do_slow
      
      v2 -> v3:
      
      1. add SMC-D support, remove the concept of link cluster since SMC-D has
      no link at all. Replace it by lgr decision maker, who provides suggestions
      to SMC-D and SMC-R on whether to create new link group.
      
      2. Fix the corruption problem described by PATCH 'fix application
      data exception' on SMC-D.
      
      v3 -> v4:
      
      1. Fix panic caused by uninitialized map.
      
      v4 -> v5:
      
      1. Serialize SMC-D buf creation to avoid potential errors
      2. Add a flag to synchronize the success of the first contact
      with the readiness of the link group, including SMC-D and SMC-R.
      3. Fixed possible reference count leak in smc_llc_flow_start().
      4. Reorder the patches so that the bugfix patches come first.
      
      v5 -> v6:
      
      1. Separate the bugfix patches to make them independent.
      2. Merge patch 'fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending'
      with patch 'remove locks smc_client_lgr_pending and smc_server_lgr_pending'
      3. Format code styles, including alignment and reverse christmas tree
      style.
      4. Fix a possible memory leak in smc_llc_rmt_delete_rkey()
      and smc_llc_rmt_conf_rkey().
      
      v6 -> v7:
      
      1. Discard patch attempting to remove global locks
      2. Discard patch attempting to make confirm/delete rkey process concurrently
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      042b7858
    • D. Wythe's avatar
      net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore · aff7bfed
      D. Wythe authored
      It's clear that rmbs_lock and sndbufs_lock aim to protect the
      rmbs list and the sndbufs list, respectively.
      
      During connection establishment, smc_buf_get_slot() is always
      invoked, and it only performs read semantics on the rmbs list and
      the sndbufs list.
      
      Based on the above considerations, we replace the mutexes with rw_semaphores.
      Only smc_buf_get_slot() uses down_read(), which allows smc_buf_get_slot()
      to run concurrently; the other parts use down_write() to keep exclusive
      semantics.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aff7bfed
    • D. Wythe's avatar
      net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() · 4da68744
      D. Wythe authored
      Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
      exclusive when the assigned rmb_desc is not registered yet, although it
      can be executed in parallel when the assigned rmb_desc is already
      registered, because it then only performs read semantics on it. Hence,
      we cannot simply replace the lock with a read semaphore.
      
      The idea here is that if the assigned rmb_desc is already registered,
      we use the read semaphore to protect the critical section; if the assigned
      rmb_desc is not registered, we keep using the write semaphore
      to preserve its exclusivity.
      
      Thanks to the reusability of rmb_desc, this allows us to execute
      in parallel in most cases.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4da68744
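The lock-mode choice described in the commit above can be modelled without real locking. The following is a single-threaded sketch under stated assumptions: `rmb_model`, `lock_mode`, `reg_rmb_lock_mode()` and `reg_rmb()` are hypothetical names, and no rw_semaphore is actually taken. It only shows which side of the semaphore each call would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which side of a reader/writer lock a caller would need. */
enum lock_mode { LOCK_SHARED, LOCK_EXCLUSIVE };

/* Hypothetical stand-in for a reusable receive buffer descriptor. */
struct rmb_model {
	bool registered;
};

/* An already registered buffer is only read, so a shared lock is enough. */
static enum lock_mode reg_rmb_lock_mode(const struct rmb_model *rmb)
{
	assert(rmb != NULL);
	return rmb->registered ? LOCK_SHARED : LOCK_EXCLUSIVE;
}

/*
 * Model one connection picking up the buffer: report the lock mode it
 * would take, and mark the buffer registered if this call had to do
 * the registration (which happens on the exclusive path only).
 */
static enum lock_mode reg_rmb(struct rmb_model *rmb)
{
	enum lock_mode mode = reg_rmb_lock_mode(rmb);

	if (mode == LOCK_EXCLUSIVE)
		rmb->registered = true;
	return mode;
}
```

Because a descriptor is reused across connections, only its first user takes the exclusive path in this model; every later user takes the shared one.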
    • D. Wythe's avatar
      net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() · f6421014
      D. Wythe authored
      The following is part of the off-CPU graph during frequent processing of
      short-lived SMC-R connections:
      
      process_one_work				(51.19%)
      smc_close_passive_work			(28.36%)
      	smcr_buf_unuse				(28.34%)
      	rwsem_down_write_slowpath		(28.22%)
      
      smc_listen_work				(22.83%)
      	smc_clc_wait_msg			(1.84%)
      	smc_buf_create				(20.45%)
      		smcr_buf_map_usable_links
      		rwsem_down_write_slowpath	(20.43%)
      	smcr_lgr_reg_rmbs			(0.53%)
      		rwsem_down_write_slowpath	(0.43%)
      		smc_llc_do_confirm_rkey		(0.08%)
      
      We can clearly see that during connection establishment,
      connections wait not on IO, but on llc_conf_mutex.
      
      More importantly, the core critical areas (smcr_buf_unuse() &
      smc_buf_create()) only perform read semantics on links, so we can
      easily switch them to the read semaphore.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6421014
    • D. Wythe's avatar
      net/smc: llc_conf_mutex refactor, replace it with rw_semaphore · b5dd4d69
      D. Wythe authored
      llc_conf_mutex was used to protect links and link related configurations
      in the same link group, for example, add or delete links. However,
      in most cases, the protected critical area has only read semantics and
      with no write semantics at all, such as obtaining a usable link or an
      available rmb_desc.
      
      This patch is a simple code refactoring: it replaces the mutex with an
      rw_semaphore, mutex_lock with down_write, and mutex_unlock with
      up_write.
      
      Theoretically, this replacement is equivalent, but after this patch,
      we can distinguish lock granularity according to different semantics
      of critical areas.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5dd4d69
    • Jakub Kicinski's avatar
      Merge branch 'updates-to-enetc-txq-management' · 88c940cc
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Updates to ENETC TXQ management
      
      The set ensures that the number of TXQs given by enetc to the network
      stack (mqprio or TX hashing) + the number of TXQs given to XDP never
      exceeds the number of available TXQs.
      
      These are the first 4 patches of series "[v5,net-next,00/17] ENETC
      mqprio/taprio cleanup" from here:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230202003621.2679603-1-vladimir.oltean@nxp.com/
      
      There is no change in this version compared to there. I split them off
      because this contains a fix for net-next and it would be good if it
      could go in quickly. I also did it to reduce the patch count of that
      other series, if I need to respin it again.
      ====================
      
      Link: https://lore.kernel.org/r/20230203001116.3814809-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      88c940cc
    • Vladimir Oltean's avatar
      net: enetc: ensure we always have a minimum number of TXQs for stack · 800db2d1
      Vladimir Oltean authored
      Currently it can happen that an mqprio qdisc is installed with num_tc 8,
      and this will reserve 8 (out of 8) TXQs for the network stack. Then we
      can attach an XDP program, and this will crop 2 TXQs, leaving just 6 for
      mqprio. That's not what the user requested, and we should fail it.
      
      On the other hand, if mqprio isn't requested, we still give the 8 TXQs
      to the network stack (with hashing among a single traffic class), but
      then, cropping 2 TXQs for XDP is fine, because the user didn't
      explicitly ask for any number of TXQs, so no expectations are violated.
      
      Simply put, the logic that mqprio should impose a minimum number of TXQs
      for the network never existed. Let's say (more or less arbitrarily) that
      without mqprio, the driver expects a minimum number of TXQs equal to the
      number of CPUs (on NXP LS1028A, that is either 1, or 2). And with mqprio,
      mqprio gives the minimum required number of TXQs.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      800db2d1
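The minimum-TXQ rule from the commit above can be written down as a small check. This is a sketch under stated assumptions rather than the driver's code: `min_stack_txqs()` and `check_xdp_txqs()` are hypothetical helpers, and returning `-EBUSY` is an arbitrary choice made for this example.

```c
#include <assert.h>
#include <errno.h>

/*
 * Minimum number of TXQs the network stack must keep: the number of
 * traffic classes when mqprio is installed (num_tc > 0), otherwise
 * the number of CPUs.
 */
static int min_stack_txqs(int num_tc, int num_cpus)
{
	return num_tc > 0 ? num_tc : num_cpus;
}

/*
 * Return 0 if reserving xdp_txqs for XDP still leaves the stack its
 * minimum, or a negative errno value (hypothetical choice) if not.
 */
static int check_xdp_txqs(int total_txqs, int xdp_txqs, int num_tc,
			  int num_cpus)
{
	assert(total_txqs >= 0 && xdp_txqs >= 0);

	if (total_txqs - xdp_txqs < min_stack_txqs(num_tc, num_cpus))
		return -EBUSY;
	return 0;
}
```

With the numbers from the commit message, `check_xdp_txqs(8, 2, 8, 2)` fails because mqprio asked for all 8 TXQs, while `check_xdp_txqs(8, 2, 0, 2)` succeeds because no mqprio request is violated.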
    • Vladimir Oltean's avatar
      net: enetc: recalculate num_real_tx_queues when XDP program attaches · 4ea1dd74
      Vladimir Oltean authored
      Since the blamed net-next commit, enetc_setup_xdp_prog() no longer goes
      through enetc_open(), and therefore, the function which was supposed to
      detect whether a BPF program exists (in order to crop some TX queues
      from network stack usage), enetc_num_stack_tx_queues(), no longer gets
      called.
      
      We can move the netif_set_real_num_rx_queues() call to enetc_alloc_msix()
      (probe time), since it is a runtime invariant. We can do the same thing
      with netif_set_real_num_tx_queues(), and let enetc_reconfigure_xdp_cb()
      explicitly recalculate and change the number of stack TX queues.
      
      Fixes: c33bfaf9 ("net: enetc: set up XDP program under enetc_reconfigure()")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4ea1dd74
    • Vladimir Oltean's avatar
      net: enetc: allow the enetc_reconfigure() callback to fail · 46a0ecf9
      Vladimir Oltean authored
      enetc_reconfigure() was modified in commit c33bfaf9 ("net: enetc:
      set up XDP program under enetc_reconfigure()") to take an optional
      callback that runs while the netdev is down, but this callback currently
      cannot fail.
      
      Code up the error handling so that the interface is restarted with the
      old resources if the callback fails.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      46a0ecf9
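The error handling described in the commit above (restart with the old resources if the callback fails) follows a common rollback pattern. The sketch below is a user-space model under stated assumptions: `dev_model`, `reconfigure()`, `set_txqs_cb()` and `failing_cb()` are hypothetical, and an `int` stands in for the real ring resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a network device. */
struct dev_model {
	bool running;
	int resources;   /* stand-in for the ring resources in use */
	int num_txqs;    /* example of state a callback may change */
};

typedef int (*reconfigure_cb)(struct dev_model *dev, void *ctx);

/*
 * Stop the device, switch to new_resources, run the optional callback
 * while the device is down, and restart. If the callback fails, the
 * old resources are restored before the restart.
 */
static int reconfigure(struct dev_model *dev, int new_resources,
		       reconfigure_cb cb, void *ctx)
{
	int old_resources;
	int err = 0;

	assert(dev != NULL);
	old_resources = dev->resources;

	dev->running = false;
	dev->resources = new_resources;
	if (cb)
		err = cb(dev, ctx);
	if (err)
		dev->resources = old_resources;   /* roll back */
	dev->running = true;                      /* restart either way */
	return err;
}

/* Example callback that succeeds: apply a new TXQ count from ctx. */
static int set_txqs_cb(struct dev_model *dev, void *ctx)
{
	dev->num_txqs = *(int *)ctx;
	return 0;
}

/* Example callback that always fails. */
static int failing_cb(struct dev_model *dev, void *ctx)
{
	(void)dev;
	(void)ctx;
	return -1;
}
```

In this model the device always ends up running again; only the set of resources it runs with depends on whether the callback succeeded.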
    • Vladimir Oltean's avatar
      net: enetc: simplify enetc_num_stack_tx_queues() · 1c81a9b3
      Vladimir Oltean authored
      We keep a pointer to the xdp_prog in the private netdev structure as
      well; what's replicated per RX ring is done so just for more convenient
      access from the NAPI poll procedure.
      
      Simplify enetc_num_stack_tx_queues() by looking at priv->xdp_prog rather
      than iterating through the information replicated per RX ring.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1c81a9b3
    • Jakub Kicinski's avatar
      Merge branch 'raw-add-drop-reasons-and-use-another-hash-function' · 8788260e
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      raw: add drop reasons and use another hash function
      
      The first two patches add drop reasons to raw input processing.
      
      The last patch spreads RAW sockets in the shared hash tables
      to avoid long hash buckets in some cases.
      ====================
      
      Link: https://lore.kernel.org/r/20230202094100.3083177-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8788260e
    • Eric Dumazet's avatar
      raw: use net_hash_mix() in hash function · 6579f5ba
      Eric Dumazet authored
      Some applications seem to rely on RAW sockets.
      
      If they use private netns, we can avoid piling all RAW
      sockets bound to a given protocol into a single bucket.
      
      Also place (struct raw_hashinfo).lock into its own
      cache line to limit false sharing.
      
      An alternative would be to have per-netns hashtables,
      but this seems too expensive for most netns
      where RAW sockets are not used.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6579f5ba
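The effect of mixing a per-netns value into the hash, as the commit above describes, can be illustrated in user-space C. This is a sketch under stated assumptions: `raw_bucket()` is a hypothetical function, the 256-bucket table size and the multiplicative hash constant are choices made for this example, and `netns_mix` stands in for whatever `net_hash_mix()` would return for a namespace.

```c
#include <assert.h>
#include <stdint.h>

#define RAW_TABLE_BITS 8               /* assumed: 256 buckets */
#define GOLDEN_RATIO_32 0x61C88647u    /* multiplicative hash constant */

/*
 * Fold a per-netns mix value into the protocol number before hashing,
 * so that sockets bound to the same protocol in different namespaces
 * are no longer forced into one bucket.
 */
static uint32_t raw_bucket(uint32_t netns_mix, uint32_t protocol)
{
	uint32_t v = netns_mix ^ protocol;
	uint32_t bucket = (uint32_t)(v * GOLDEN_RATIO_32) >> (32 - RAW_TABLE_BITS);

	assert(bucket < (1u << RAW_TABLE_BITS));
	return bucket;
}
```

Without the mix, the bucket depends on the protocol alone, so every namespace's sockets for a given protocol share one chain. With it, the bucket is a function of both inputs and the chains can spread across the table.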