1. 31 May, 2023 9 commits
    • Merge branch 'add-layer-2-miss-indication-and-filtering' · e180a33c
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      Add layer 2 miss indication and filtering
      
      tl;dr
      =====
      
      This patchset adds a single bit to the tc skb extension to indicate that
      a packet encountered a layer 2 miss in the bridge and extends flower to
      match on this metadata. This is required for non-DF (Designated
      Forwarder) filtering in EVPN multi-homing, which prevents decapsulated
      BUM packets from being forwarded multiple times to the same multi-homed
      host.
      
      Background
      ==========
      
      In a typical EVPN multi-homing setup each host is multi-homed using a
      set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
      switches in a rack. These switches act as VTEPs and are not directly
      connected (as opposed to MLAG), but can communicate with each other (as
      well as with VTEPs in remote racks) via spine switches over L3.
      
      When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
      to other VTEPs in the network, including those connected to the host
      over ES1. The receiving VTEPs must drop the packet and not forward it
      back to the host. This is called "split-horizon filtering" (SPH) [1].
      
      FRR configures SPH filtering using two tc filters. The first is an
      ingress filter that matches on packets received from VTEP1 and marks
      them using a fwmark (firewall mark). The second is an egress filter,
      configured on the LAG interface connected to the host, that matches on
      the fwmark and drops the packets. Example:
      
       # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
       # tc filter add dev bond0 egress pref 1 handle 101 fw action drop
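
The interplay of the two SPH filters can be illustrated with a small Python model. This is only a sketch of the mark-then-drop idea, not how tc evaluates filters: the function names are hypothetical, and the address stands in for $VTEP1_IP.

```python
SPH_MARK = 101            # fwmark used by both filters in the example
VTEP1_IP = "192.0.2.1"    # placeholder for $VTEP1_IP (documentation address)


def ingress_vxlan0(enc_src_ip: str, mark: int = 0) -> int:
    """Model of the ingress filter: mark packets tunnelled from VTEP1."""
    if enc_src_ip == VTEP1_IP:
        return SPH_MARK
    return mark


def egress_bond0(mark: int) -> str:
    """Model of the egress 'fw' filter: drop packets carrying the SPH mark."""
    return "drop" if mark == SPH_MARK else "pass"


if __name__ == "__main__":
    for src in (VTEP1_IP, "192.0.2.2"):
        print(src, egress_bond0(ingress_vxlan0(src)))
```

Only packets whose outer source address is VTEP1 pick up the mark, so traffic from other VTEPs still reaches the host over the LAG.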
      
      Motivation
      ==========
      
      For each ES, only one VTEP is elected by the control plane as the DF.
      The DF is responsible for forwarding decapsulated BUM traffic to the
      host over the ES. The non-DF VTEPs must drop such traffic as otherwise
      the host will receive multiple copies of BUM traffic. This is called
      "non-DF filtering" [2].
      
      Filtering of multicast and broadcast traffic can be achieved using the
      following flower filter:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
      
      Unlike broadcast and multicast traffic, it is not currently possible to
      filter unknown unicast traffic. The classification into unknown unicast
      is performed by the bridge driver, but is not visible to other layers.
      
      Implementation
      ==============
      
      The proposed solution is to add a single bit to the tc skb extension
      that is set by the bridge for packets that encountered an FDB or MDB
      miss. The flower classifier is extended to be able to match on this new
      metadata bit in a similar fashion to existing metadata options such as
      'indev'.
      
      A bit that is set for every flooded packet would also work, but it does
      not allow us to differentiate between registered and unregistered
      multicast traffic which might be useful in the future.
      
      A relatively generic name is chosen for this bit - 'l2_miss' - to allow
      its use to be extended to other layer 2 devices such as VXLAN, should a
      use case arise.
      
      With the above, the control plane can implement a non-DF filter using
      the following tc filters:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
       # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop
      
      The first drops broadcast and multicast traffic and the second drops
      unknown unicast traffic.
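
The match logic of these two filters can be sketched in Python as follows. This is an illustrative model only, not the flower implementation: `Packet` and `non_df_verdict` are hypothetical names, and the `l2_miss` field stands in for the metadata bit set by the bridge.

```python
from dataclasses import dataclass

# Mask from the first filter: only the group (multicast/broadcast) bit
# of the destination MAC is compared.
GROUP_BIT_MASK = 0x01_00_00_00_00_00


def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


@dataclass
class Packet:
    indev: str      # device the packet was received on
    dst_mac: str
    l2_miss: bool   # stand-in for the bit carried in the tc skb extension


def non_df_verdict(pkt: Packet) -> str:
    """Return 'drop' or 'pass' for a packet egressing bond0 on a non-DF VTEP."""
    if pkt.indev != "vxlan0":
        return "pass"
    # pref 1: dst_mac 01:00:00:00:00:00/01:00:00:00:00:00
    if (mac_to_int(pkt.dst_mac) & GROUP_BIT_MASK) == GROUP_BIT_MASK:
        return "drop"
    # pref 2: l2_miss true (unknown unicast)
    if pkt.l2_miss:
        return "drop"
    return "pass"


if __name__ == "__main__":
    print(non_df_verdict(Packet("vxlan0", "ff:ff:ff:ff:ff:ff", False)))
    print(non_df_verdict(Packet("vxlan0", "00:11:22:33:44:55", True)))
    print(non_df_verdict(Packet("vxlan0", "00:11:22:33:44:55", False)))
```

Known unicast traffic received from the tunnel is the only decapsulated traffic that still reaches the host in this model, which is the behaviour the non-DF filter aims for.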
      
      Testing
      =======
      
      A test exercising the different permutations of the 'l2_miss' bit is
      added in patch #8.
      
      Patchset overview
      =================
      
      Patch #1 adds the new bit to the tc skb extension and sets it in the
      bridge driver for packets that encountered a miss. The marking of the
      packets and the use of this extension is protected by the
      'tc_skb_ext_tc' static key in order to keep performance impact to a
      minimum when the feature is not in use.
      
      Patch #2 extends the flow dissector to dissect this information from the
      tc skb extension into the 'FLOW_DISSECTOR_KEY_META' key.
      
      Patch #3 extends the flower classifier to be able to match on the new
      layer 2 miss metadata. The classifier enables the 'tc_skb_ext_tc' static
      key upon the installation of the first filter that matches on 'l2_miss'
      and disables the key upon the removal of the last filter that matches on
      it.
      
      Patch #4 rejects matching on the new metadata in drivers that already
      support the 'FLOW_DISSECTOR_KEY_META' key.
      
      Patches #5-#6 are small preparations in mlxsw.
      
      Patch #7 extends mlxsw to be able to match on layer 2 miss.
      
      Patch #8 adds a selftest.
      
      iproute2 patches can be found here [3].
      
      [1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
      [2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
      [3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
      [4] https://lore.kernel.org/netdev/20230518113328.1952135-1-idosch@nvidia.com/
      [5] https://lore.kernel.org/netdev/20230509070446.246088-1-idosch@nvidia.com/
      ====================
      
      Link: https://lore.kernel.org/r/20230529114835.372140-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: forwarding: Add layer 2 miss test cases · 8c33266a
      Ido Schimmel authored
      Add test cases to verify that the bridge driver correctly marks layer 2
      misses only when it should and that the flower classifier can match on
      this metadata.
      
      Example output:
      
       # ./tc_flower_l2_miss.sh
       TEST: L2 miss - Unicast                                             [ OK ]
       TEST: L2 miss - Multicast (IPv4)                                    [ OK ]
       TEST: L2 miss - Multicast (IPv6)                                    [ OK ]
       TEST: L2 miss - Link-local multicast (IPv4)                         [ OK ]
       TEST: L2 miss - Link-local multicast (IPv6)                         [ OK ]
       TEST: L2 miss - Broadcast                                           [ OK ]

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Add ability to match on layer 2 miss · caa4c58a
      Ido Schimmel authored
      Add the 'fdb_miss' key element to supported key blocks and make use of
      it to match on layer 2 miss.
      
      The key is only supported on Spectrum-{2,3,4}. An error is returned for
      Spectrum-1 since the key element is not present in any of its key
      blocks.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Do not force matching on iif · 0b9cd74b
      Ido Schimmel authored
      Currently, mlxsw only supports the 'ingress_ifindex' field in the
      'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
      support for the 'l2_miss' field as well. It is valid to only match on
      'l2_miss' without 'ingress_ifindex', so do not force matching on it.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Split iif parsing to a separate function · d04e2650
      Ido Schimmel authored
      Currently, mlxsw only supports the 'ingress_ifindex' field in the
      'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
      support for the 'l2_miss' field as well. Split the parsing of the
      'ingress_ifindex' field to a separate function to avoid nesting. No
      functional changes intended.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • flow_offload: Reject matching on layer 2 miss · f4356947
      Ido Schimmel authored
      Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
      filters that try to match on the newly added layer 2 miss field. Add an
      extack message to clearly communicate the failure reason to user space.
      
      The following users were not patched:
      
      1. mtk_flow_offload_replace(): Only checks that the key is present, but
         does not do anything with it.
      2. mlx5_tc_ct_set_tuple_match(): Used as part of netfilter offload,
         which does not make use of the new field, unlike tc.
      3. get_netdev_from_rule() in nfp: Likewise.
      
      Example:
      
       # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
       Error: mlxsw_spectrum: Can't match on "l2_miss".
       We have an error talking to the kernel

      Acked-by: Elad Nachman <enachman@marvell.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: flower: Allow matching on layer 2 miss · 1a432018
      Ido Schimmel authored
      Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
      match on packets that encountered a layer 2 miss. The miss indication is
      set as metadata in the tc skb extension by the bridge driver upon FDB or
      MDB lookup miss and dissected by the flow dissector to the
      'FLOW_DISSECTOR_KEY_META' key.
      
      The use of this skb extension is guarded by the 'tc_skb_ext_tc' static
      key. As such, enable / disable this key when filters that match on layer
      2 miss are added / deleted.
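
The reference counting described above can be modelled with a short Python sketch. It is a toy model under stated assumptions, not the kernel code: `StaticKeyModel` and `FilterTable` are hypothetical names, and taking the new reference before dropping the old one on replace is an assumption of the sketch. It reproduces the counts reported in the test run that follows.

```python
class StaticKeyModel:
    """Toy reference-counted switch, loosely modelled on a static key."""

    def __init__(self):
        self.refcount = 0

    def inc(self):
        self.refcount += 1

    def dec(self):
        assert self.refcount > 0, "unbalanced dec"
        self.refcount -= 1

    @property
    def enabled(self):
        return self.refcount > 0


class FilterTable:
    """Tracks, per handle, whether a filter matches on l2_miss."""

    def __init__(self, key):
        self.key = key
        self.filters = {}  # handle -> True / False / None (None: no l2_miss match)

    def add(self, handle, l2_miss=None):
        if handle in self.filters:
            raise KeyError(f"handle {handle} already exists")
        if l2_miss is not None:  # both 'true' and 'false' matches hold a reference
            self.key.inc()
        self.filters[handle] = l2_miss

    def replace(self, handle, l2_miss=None):
        # Assumption: reference for the new filter is taken first, then the
        # old filter's reference is dropped.
        if l2_miss is not None:
            self.key.inc()
        if self.filters.get(handle) is not None:
            self.key.dec()
        self.filters[handle] = l2_miss

    def delete(self, handle):
        if self.filters.pop(handle) is not None:
            self.key.dec()


if __name__ == "__main__":
    key = StaticKeyModel()
    table = FilterTable(key)
    table.add(101)
    table.add(102, l2_miss=True)
    table.add(103, l2_miss=False)
    print("after add:", key.refcount)
    table.replace(102, l2_miss=False)
    print("after replace:", key.refcount)
    for handle in (103, 102, 101):
        table.delete(handle)
    print("after delete:", key.refcount)
```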
      
      Tested:
      
       # cat tc_skb_ext_tc.py
       #!/usr/bin/env -S drgn -s vmlinux
      
       refcount = prog["tc_skb_ext_tc"].key.enabled.counter.value_()
       print(f"tc_skb_ext_tc reference count is {refcount}")
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 0
      
       # tc filter add dev swp1 egress proto all handle 101 pref 1 flower src_mac 00:11:22:33:44:55 action drop
       # tc filter add dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:11:22:33:44:55 l2_miss true action drop
       # tc filter add dev swp1 egress proto all handle 103 pref 3 flower src_mac 00:11:22:33:44:55 l2_miss false action drop
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 2
      
       # tc filter replace dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:01:02:03:04:05 l2_miss false action drop
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 2
      
       # tc filter del dev swp1 egress proto all handle 103 pref 3 flower
       # tc filter del dev swp1 egress proto all handle 102 pref 2 flower
       # tc filter del dev swp1 egress proto all handle 101 pref 1 flower
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 0

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • flow_dissector: Dissect layer 2 miss from tc skb extension · d5ccfd90
      Ido Schimmel authored
      Extend the 'FLOW_DISSECTOR_KEY_META' key with a new 'l2_miss' field and
      populate it from a field with the same name in the tc skb extension.
      This field is set by the bridge driver for packets that incur an FDB or
      MDB miss.
      
      The next patch will extend the flower classifier to be able to match on
      layer 2 misses.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • skbuff: bridge: Add layer 2 miss indication · 7b4858df
      Ido Schimmel authored
      For EVPN non-DF (Designated Forwarder) filtering we need to be able to
      prevent decapsulated traffic from being flooded to a multi-homed host.
      Filtering of multicast and broadcast traffic can be achieved using the
      following flower filter:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
      
      Unlike broadcast and multicast traffic, it is not currently possible to
      filter unknown unicast traffic. The classification into unknown unicast
      is performed by the bridge driver, but is not visible to other layers
      such as tc.
      
      Solve this by adding a new 'l2_miss' bit to the tc skb extension. Clear
      the bit whenever a packet enters the bridge (received from a bridge port
      or transmitted via the bridge) and set it if the packet did not match an
      FDB or MDB entry. If there is no skb extension and the bit needs to be
      cleared, then do not allocate one as no extension is equivalent to the
      bit being cleared. The bit is not set for broadcast packets as they
      never perform a lookup and therefore never incur a miss.
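
The marking rules in the paragraph above can be summarised with a small Python model. This is a sketch of the described semantics only, not the bridge code: `Skb`, `miss_set` and `bridge_receive` are hypothetical names, and the tc skb extension is represented by a plain dictionary.

```python
from typing import Optional


class Skb:
    """Minimal packet model; tc_ext is None when no tc skb extension exists."""

    def __init__(self, tc_ext: Optional[dict] = None):
        self.tc_ext = tc_ext


def miss_set(skb: Skb, miss: bool, key_enabled: bool) -> None:
    """Set or clear the l2_miss bit following the rules described above."""
    if not key_enabled:
        return                      # static key off: skb is not touched
    if skb.tc_ext is not None:
        skb.tc_ext["l2_miss"] = miss
        return
    if not miss:
        return                      # no extension is equivalent to a cleared bit
    skb.tc_ext = {"l2_miss": True}  # allocate only when the bit must be set


def bridge_receive(skb: Skb, dst_kind: str, entry_found: bool,
                   key_enabled: bool = True) -> None:
    """dst_kind is 'unicast', 'multicast' or 'broadcast'."""
    miss_set(skb, False, key_enabled)   # clear when the packet enters the bridge
    if dst_kind == "broadcast":
        return                          # broadcast never performs a lookup
    if not entry_found:                 # FDB or MDB miss
        miss_set(skb, True, key_enabled)


def l2_miss(skb: Skb) -> bool:
    return bool(skb.tc_ext and skb.tc_ext["l2_miss"])


if __name__ == "__main__":
    for kind, found in (("unicast", False), ("unicast", True),
                        ("multicast", False), ("broadcast", False)):
        skb = Skb()
        bridge_receive(skb, kind, found)
        print(kind, "entry found" if found else "no entry", "->", l2_miss(skb))
```

Note that a known unicast packet leaves the model without any extension allocated, while a packet that already carries a stale bit has it cleared on entry.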
      
      A bit that is set for every flooded packet would also work for the
      current use case, but it does not allow us to differentiate between
      registered and unregistered multicast traffic, which might be useful in
      the future.
      
      To keep the performance impact to a minimum, the marking of packets is
      guarded by the 'tc_skb_ext_tc' static key. When the key is disabled,
      the skb is not touched and an skb extension is not allocated. Instead,
      only a 5-byte nop is executed, as demonstrated below for the call site
      in br_handle_frame().
      
      Before the patch:
      
      ```
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37b09:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
        c37b10:       00 00
      
              p = br_port_get_rcu(skb->dev);
        c37b12:       49 8b 44 24 10          mov    0x10(%r12),%rax
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37b17:       49 c7 44 24 30 00 00    movq   $0x0,0x30(%r12)
        c37b1e:       00 00
        c37b20:       49 c7 44 24 38 00 00    movq   $0x0,0x38(%r12)
        c37b27:       00 00
      ```
      
      After the patch (when static key is disabled):
      
      ```
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37c29:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
        c37c30:       00 00
        c37c32:       49 8d 44 24 28          lea    0x28(%r12),%rax
        c37c37:       48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
        c37c3e:       00
        c37c3f:       48 c7 40 10 00 00 00    movq   $0x0,0x10(%rax)
        c37c46:       00
      
      #ifdef CONFIG_HAVE_JUMP_LABEL_HACK
      
      static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
      {
              asm_volatile_goto("1:"
        c37c47:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
              br_tc_skb_miss_set(skb, false);
      
              p = br_port_get_rcu(skb->dev);
        c37c4c:       49 8b 44 24 10          mov    0x10(%r12),%rax
      ```
      
      Subsequent patches will extend the flower classifier to be able to match
      on the new 'l2_miss' bit and enable / disable the static key when
      filters that match on it are added / deleted.

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 30 May, 2023 31 commits