Merge branch 'backup-nexthop-ID'

Ido Schimmel says: ==================== Add backup nexthop ID support tl;dr ===== This patchset adds a new bridge port attribute specifying the nexthop object ID to attach to a redirected skb as tunnel metadata. The ID is used by the VXLAN driver to choose the target VTEP for the skb. This is useful for EVPN multi-homing, where we want to redirect local (intra-rack) traffic upon carrier loss through one of the other VTEPs (ES peers) connected to the target host. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches in a rack. These switches act as VTEPs and are not directly connected (as opposed to MLAG), but can communicate with each other (as well as with VTEPs in remote racks) via spine switches over L3. The control plane uses Type 1 routes [1] to create a mapping between an ES and VTEPs where the ES has active links. In addition, the control plane uses Type 2 routes [2] to create a mapping between {MAC, VLAN} and an ES. These tables are then used by the control plane to instruct VTEPs how to reach remote hosts. For example, assuming {MAC X, VLAN Y} is accessible via ES1 and this ES has active links to VTEP1 and VTEP2. The control plane will program the following entries to a remote VTEP: # ip nexthop add id 1 via $VTEP1_IP fdb # ip nexthop add id 2 via $VTEP2_IP fdb # ip nexthop add id 10 group 1/2 fdb # bridge fdb add $MAC_X dev vx0 master extern_learn vlan $VLAN_Y # bridge fdb add $MAC_Y dev vx0 self extern_learn nhid 10 src_vni $VNI_Y Remote traffic towards the host will be load balanced between VTEP1 and VTEP2. If the control plane notices a carrier loss on the ES1 link connected to VTEP1, it will issue a Type 1 route withdraw, prompting remote VTEPs to remove the effected nexthop from the group: # ip nexthop replace id 10 group 2 fdb Motivation ========== While remote traffic can be redirected to a VTEP with an active ES link by withdrawing a Type 1 route, the same is not true for local traffic. A host that is multi-homed to VTEP1 and VTEP2 via another ES (e.g., ES2) will send its traffic to {MAC X, VLAN Y} via one of these two switches, according to its LAG hash algorithm which is not under our control. If the traffic arrives at VTEP1 - which no longer has an active ES1 link - it will be dropped due to the carrier loss. In MLAG setups, the above problem is solved by redirecting the traffic through the peer link upon carrier loss. This is achieved by defining the peer link as the backup port of the host facing bond. For example: # bridge link set dev bond0 backup_port bond_peer Unlike MLAG, there is no peer link between the leaf switches in EVPN. Instead, upon carrier loss, local traffic should be redirected through one of the active ES peers. This can be achieved by defining the VXLAN port as the backup port of the host facing bonds. For example: # bridge link set dev es1_bond backup_port vx0 However, the VXLAN driver is not programmed with FDB entries for locally attached hosts and therefore does not know to which VTEP to redirect the traffic to. This will result in the traffic being replicated to all the VTEPs (potentially hundreds) in the network and each VTEP dropping the traffic, except for the active ES peer. Avoiding the flooding by programming local FDB entries in the VXLAN driver is not a viable solution as it requires to significantly increase the number of programmed FDB entries. Implementation ============== The proposed solution is to create an FDB nexthop group for each ES with the IP addresses of the active ES peers and set this ID as the backup nexthop ID (new bridge port attribute) of the ES link. For example, on VTEP1: # ip nexthop add id 1 via $VTEP2_IP fdb # ip nexthop add id 10 group 1 fdb # bridge link set dev es1_bond backup_nhid 10 # bridge link set dev es1_bond backup_port vx0 When the ES link loses its carrier, traffic will be redirected to the VXLAN port, but instead of only attaching the tunnel ID (i.e., VNI) as tunnel metadata to the skb, the backup nexthop ID will be attached as well. The VXLAN driver will then use this information to forward the skb via the nexthop object associated with the ID, as if the skb hit an FDB entry associated with this ID. Testing ======= A test for both the existing backup port attribute as well as the new backup nexthop ID attribute is added in patch #4. Patchset overview ================= Patch #1 extends the tunnel key structure with the new nexthop ID field. Patch #2 uses the new field in the VXLAN driver to forward packets via the specified nexthop ID. Patch #3 adds the new backup nexthop ID bridge port attribute and adjusts the bridge driver to attach the ID as tunnel metadata upon redirection. Patch #4 adds a selftest. iproute2 patches can be found here [3]. Changelog ========= Since RFC [4]: * Added Nik's tags. [1] https://datatracker.ietf.org/doc/html/rfc7432#section-7.1 [2] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 [3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1 [4] https://lore.kernel.org/netdev/20230713070925.3955850-1-idosch@nvidia.com/ ==================== Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'backup-nexthop-ID'
Ido Schimmel says: ==================== Add backup nexthop ID support tl;dr ===== This patchset adds a new bridge port attribute specifying the nexthop object ID to attach to a redirected skb as tunnel metadata. The ID is used by the VXLAN driver to choose the target VTEP for the skb. This is useful for EVPN multi-homing, where we want to redirect local (intra-rack) traffic upon carrier loss through one of the other VTEPs (ES peers) connected to the target host. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches in a rack. These switches act as VTEPs and are not directly connected (as opposed to MLAG), but can communicate with each other (as well as with VTEPs in remote racks) via spine switches over L3. The control plane uses Type 1 routes [1] to create a mapping between an ES and VTEPs where the ES has active links. In addition, the control plane uses Type 2 routes [2] to create a mapping between {MAC, VLAN} and an ES. These tables are then used by the control plane to instruct VTEPs how to reach remote hosts. For example, assuming {MAC X, VLAN Y} is accessible via ES1 and this ES has active links to VTEP1 and VTEP2. The control plane will program the following entries to a remote VTEP: # ip nexthop add id 1 via $VTEP1_IP fdb # ip nexthop add id 2 via $VTEP2_IP fdb # ip nexthop add id 10 group 1/2 fdb # bridge fdb add $MAC_X dev vx0 master extern_learn vlan $VLAN_Y # bridge fdb add $MAC_Y dev vx0 self extern_learn nhid 10 src_vni $VNI_Y Remote traffic towards the host will be load balanced between VTEP1 and VTEP2. If the control plane notices a carrier loss on the ES1 link connected to VTEP1, it will issue a Type 1 route withdraw, prompting remote VTEPs to remove the effected nexthop from the group: # ip nexthop replace id 10 group 2 fdb Motivation ========== While remote traffic can be redirected to a VTEP with an active ES link by withdrawing a Type 1 route, the same is not true for local traffic. A host that is multi-homed to VTEP1 and VTEP2 via another ES (e.g., ES2) will send its traffic to {MAC X, VLAN Y} via one of these two switches, according to its LAG hash algorithm which is not under our control. If the traffic arrives at VTEP1 - which no longer has an active ES1 link - it will be dropped due to the carrier loss. In MLAG setups, the above problem is solved by redirecting the traffic through the peer link upon carrier loss. This is achieved by defining the peer link as the backup port of the host facing bond. For example: # bridge link set dev bond0 backup_port bond_peer Unlike MLAG, there is no peer link between the leaf switches in EVPN. Instead, upon carrier loss, local traffic should be redirected through one of the active ES peers. This can be achieved by defining the VXLAN port as the backup port of the host facing bonds. For example: # bridge link set dev es1_bond backup_port vx0 However, the VXLAN driver is not programmed with FDB entries for locally attached hosts and therefore does not know to which VTEP to redirect the traffic to. This will result in the traffic being replicated to all the VTEPs (potentially hundreds) in the network and each VTEP dropping the traffic, except for the active ES peer. Avoiding the flooding by programming local FDB entries in the VXLAN driver is not a viable solution as it requires to significantly increase the number of programmed FDB entries. Implementation ============== The proposed solution is to create an FDB nexthop group for each ES with the IP addresses of the active ES peers and set this ID as the backup nexthop ID (new bridge port attribute) of the ES link. For example, on VTEP1: # ip nexthop add id 1 via $VTEP2_IP fdb # ip nexthop add id 10 group 1 fdb # bridge link set dev es1_bond backup_nhid 10 # bridge link set dev es1_bond backup_port vx0 When the ES link loses its carrier, traffic will be redirected to the VXLAN port, but instead of only attaching the tunnel ID (i.e., VNI) as tunnel metadata to the skb, the backup nexthop ID will be attached as well. The VXLAN driver will then use this information to forward the skb via the nexthop object associated with the ID, as if the skb hit an FDB entry associated with this ID. Testing ======= A test for both the existing backup port attribute as well as the new backup nexthop ID attribute is added in patch #4. Patchset overview ================= Patch #1 extends the tunnel key structure with the new nexthop ID field. Patch #2 uses the new field in the VXLAN driver to forward packets via the specified nexthop ID. Patch #3 adds the new backup nexthop ID bridge port attribute and adjusts the bridge driver to attach the ID as tunnel metadata upon redirection. Patch #4 adds a selftest. iproute2 patches can be found here [3]. Changelog ========= Since RFC [4]: * Added Nik's tags. [1] https://datatracker.ietf.org/doc/html/rfc7432#section-7.1 [2] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 [3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1 [4] https://lore.kernel.org/netdev/20230713070925.3955850-1-idosch@nvidia.com/ ==================== Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
b3f937f1 · David S. Miller · 3223eeaf · b4084530 · b3f937f1 · b3f937f1
Commit b3f937f1 authored Jul 19, 2023 by David S. Miller
10 changed files
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -2672,6 +2672,45 @@ static void vxlan_xmit_nh(struct sk_buff *skb, struct net_device *dev,
 	dev_kfree_skb(skb);
 }

+static netdev_tx_t vxlan_xmit_nhid(struct sk_buff *skb, struct net_device *dev,
+				   u32 nhid, __be32 vni)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_rdst nh_rdst;
+	struct nexthop *nh;
+	bool do_xmit;
+	u32 hash;
+
+	memset(&nh_rdst, 0, sizeof(struct vxlan_rdst));
+	hash = skb_get_hash(skb);
+
+	rcu_read_lock();
+	nh = nexthop_find_by_id(dev_net(dev), nhid);
+	if (unlikely(!nh || !nexthop_is_fdb(nh) || !nexthop_is_multipath(nh))) {
+		rcu_read_unlock();
+		goto drop;
+	}
+	do_xmit = vxlan_fdb_nh_path_select(nh, hash, &nh_rdst);
+	rcu_read_unlock();
+
+	if (vxlan->cfg.saddr.sa.sa_family != nh_rdst.remote_ip.sa.sa_family)
+		goto drop;
+
+	if (likely(do_xmit))
+		vxlan_xmit_one(skb, dev, vni, &nh_rdst, false);
+	else
+		goto drop;
+
+	return NETDEV_TX_OK;
+
+drop:
+	dev->stats.tx_dropped++;
+	vxlan_vnifilter_count(netdev_priv(dev), vni, NULL,
+			      VXLAN_VNI_STATS_TX_DROPS, 0);
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
 /* Transmit local packets over Vxlan
 *
 * Outer IP header inherits ECN and DF from inner header.
@@ -2687,6 +2726,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct vxlan_fdb *f;
 	struct ethhdr *eth;
 	__be32 vni = 0;
+	u32 nhid = 0;

 	info = skb_tunnel_info(skb);

@@ -2696,6 +2736,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
 		    info->mode & IP_TUNNEL_INFO_TX) {
 			vni = tunnel_id_to_key32(info->key.tun_id);
+			nhid = info->key.nhid;
 		} else {
 			if (info && info->mode & IP_TUNNEL_INFO_TX)
 				vxlan_xmit_one(skb, dev, vni, NULL, false);
@@ -2723,6 +2764,9 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 #endif
 	}

+	if (nhid)
+		return vxlan_xmit_nhid(skb, dev, nhid, vni);
+
 	if (vxlan->cfg.flags & VXLAN_F_MDB) {
 		struct vxlan_mdb_entry *mdb_entry;


--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -52,6 +52,7 @@ struct ip_tunnel_key {
 	u8			tos;		/* TOS for IPv4, TC for IPv6 */
 	u8			ttl;		/* TTL for IPv4, HL for IPv6 */
 	__be32			label;		/* Flow Label for IPv6 */
+	u32			nhid;
 	__be16			tp_src;
 	__be16			tp_dst;
 	__u8			flow_flags;

--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -570,6 +570,7 @@ enum {
 	IFLA_BRPORT_MCAST_N_GROUPS,
 	IFLA_BRPORT_MCAST_MAX_GROUPS,
 	IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+	IFLA_BRPORT_BACKUP_NHID,
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)

--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -154,6 +154,7 @@ void br_forward(const struct net_bridge_port *to,
 		backup_port = rcu_dereference(to->backup_port);
 		if (unlikely(!backup_port))
 			goto out;
+		BR_INPUT_SKB_CB(skb)->backup_nhid = READ_ONCE(to->backup_nhid);
 		to = backup_port;
 	}


--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -211,6 +211,7 @@ static inline size_t br_port_info_size(void)
 		+ nla_total_size(sizeof(u8))	/* IFLA_BRPORT_MRP_IN_OPEN */
 		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT */
 		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_MCAST_EHT_HOSTS_CNT */
+		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_BACKUP_NHID */
 		+ 0;
 }

@@ -319,6 +320,10 @@ static int br_port_fill_attrs(struct sk_buff *skb,
 			    backup_p->dev->ifindex);
 	rcu_read_unlock();

+	if (p->backup_nhid &&
+	    nla_put_u32(skb, IFLA_BRPORT_BACKUP_NHID, p->backup_nhid))
+		return -EMSGSIZE;
+
 	return 0;
 }

@@ -895,6 +900,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
 	[IFLA_BRPORT_MCAST_N_GROUPS] = { .type = NLA_REJECT },
 	[IFLA_BRPORT_MCAST_MAX_GROUPS] = { .type = NLA_U32 },
 	[IFLA_BRPORT_NEIGH_VLAN_SUPPRESS] = NLA_POLICY_MAX(NLA_U8, 1),
+	[IFLA_BRPORT_BACKUP_NHID] = { .type = NLA_U32 },
 };

 /* Change the state of the port and notify spanning tree */
@@ -1065,6 +1071,12 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[],
 			return err;
 	}

+	if (tb[IFLA_BRPORT_BACKUP_NHID]) {
+		u32 backup_nhid = nla_get_u32(tb[IFLA_BRPORT_BACKUP_NHID]);
+
+		WRITE_ONCE(p->backup_nhid, backup_nhid);
+	}
+
 	return 0;
 }


--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -387,6 +387,7 @@ struct net_bridge_port {
 	struct net_bridge_vlan_group	__rcu *vlgrp;
 #endif
 	struct net_bridge_port		__rcu *backup_port;
+	u32				backup_nhid;

 	/* STP */
 	u8				priority;
@@ -605,6 +606,8 @@ struct br_input_skb_cb {
 	 */
 	unsigned long fwd_hwdoms;
 #endif
+
+	u32 backup_nhid;
 };

 #define BR_INPUT_SKB_CB(__skb)	((struct br_input_skb_cb *)(__skb)->cb)

--- a/net/bridge/br_vlan_tunnel.c
+++ b/net/bridge/br_vlan_tunnel.c
@@ -201,6 +201,21 @@ int br_handle_egress_vlan_tunnel(struct sk_buff *skb,
 	if (err)
 		return err;

+	if (BR_INPUT_SKB_CB(skb)->backup_nhid) {
+		tunnel_dst = __ip_tun_set_dst(0, 0, 0, 0, 0, TUNNEL_KEY,
+					      tunnel_id, 0);
+		if (!tunnel_dst)
+			return -ENOMEM;
+
+		tunnel_dst->u.tun_info.mode |= IP_TUNNEL_INFO_TX |
+					       IP_TUNNEL_INFO_BRIDGE;
+		tunnel_dst->u.tun_info.key.nhid =
+			BR_INPUT_SKB_CB(skb)->backup_nhid;
+		skb_dst_set(skb, &tunnel_dst->dst);
+
+		return 0;
+	}
+
 	tunnel_dst = rcu_dereference(vlan->tinfo.tunnel_dst);
 	if (tunnel_dst && dst_hold_safe(&tunnel_dst->dst))
 		skb_dst_set(skb, &tunnel_dst->dst);

--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -61,7 +61,7 @@
 #include "dev.h"

 #define RTNL_MAX_TYPE		50
-#define RTNL_SLAVE_MAX_TYPE	43
+#define RTNL_SLAVE_MAX_TYPE	44

 struct rtnl_link {
 	rtnl_doit_func		doit;

--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_FILES += bind_wildcard
 TEST_PROGS += test_vxlan_mdb.sh
 TEST_PROGS += test_bridge_neigh_suppress.sh
 TEST_PROGS += test_vxlan_nolocalbypass.sh
+TEST_PROGS += test_bridge_backup_port.sh

 TEST_FILES := settings


--- a/tools/testing/selftests/net/test_bridge_backup_port.sh
+++ b/tools/testing/selftests/net/test_bridge_backup_port.sh