Commit 2a0186a3 authored by David S. Miller

Merge branch 'nexthop-Resilient-next-hop-groups'

Petr Machata says:

====================
nexthop: Resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
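
The remapping can be reproduced with a minimal C sketch. It assumes equally
weighted next hops and an unsigned 32-bit hash, and the name
hash_threshold_select() is hypothetical. The kernel's mpath code tracks a
per-entry upper bound instead, so this illustrates the technique rather
than the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hash-threshold selection over n equally weighted next hops (assumption:
 * equal weights, unsigned 32-bit hash). The hash space is split into n
 * contiguous regions and the hash selects the region it falls into.
 * Returns an index into the list of currently live next hops.
 */
static unsigned int hash_threshold_select(uint32_t hash, unsigned int n)
{
	uint64_t region;

	assert(n > 0);
	/* Region size, rounded up so that n regions cover the whole space. */
	region = (((uint64_t)1 << 32) + n - 1) / n;
	return (unsigned int)(hash / region);
}
```

With five next hops, a flow whose hash is 0x38000000 resolves to index 1
(next hop 2). After next hop 3 is removed and four remain, the same hash
resolves to index 0 (next hop 1), even though next hops 1 and 2 were not
touched. A flow with hash 0 resolves to next hop 1 in both cases.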

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly arrive at a server that does not expect them, which may lead to
TCP resets.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problem. A resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation on the SKB hash to choose a hash table
bucket, then reads the next hop that this bucket contains, and forwards
traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.
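
The bucket table in the illustration can be sketched in C as follows. The
names res_select() and res_remove() are hypothetical, next hops are
represented by plain integer IDs, and freed buckets are handed out
round-robin. That policy is an assumption made for this example and is not
necessarily how the kernel picks the new owner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick a bucket by modulo and return the next-hop ID stored in it. */
static int res_select(const int *buckets, size_t num_buckets, uint32_t hash)
{
	assert(num_buckets > 0);
	return buckets[hash % num_buckets];
}

/* Reassign every bucket that holds the removed next hop, cycling through
 * the survivors. Buckets owned by other next hops are left untouched, so
 * flows hashing into them keep their next hop.
 */
static void res_remove(int *buckets, size_t num_buckets, int removed,
		       const int *survivors, size_t num_survivors)
{
	size_t next = 0;

	assert(num_survivors > 0);
	for (size_t i = 0; i < num_buckets; i++) {
		if (buckets[i] != removed)
			continue;
		buckets[i] = survivors[next];
		next = (next + 1) % num_survivors;
	}
}
```

Applied to the 20-bucket table above with next hop 3 removed and survivors
1, 2, 4 and 5, only buckets 8 through 11 change, and they end up holding
1, 2, 4 and 5 respectively, as in the second row of the diagram.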

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
continue to be forwarded to the same endpoints through the same paths as
before the next-hop group change.

This patch set adds the implementation of resilient next-hop groups.

In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might cause bucket
allocation change, the numbers for individual next hops are updated,
similarly to how ranges are updated for mpath group next hops. Following
that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
next hop that is currently occupying more buckets than it wants (it is
"overweight"), it migrates the buckets to one of the next hops that has
fewer buckets than it wants (it is "underweight"). If, after this, there
are still underweight next hops, another upkeep run is scheduled for a
future time.

There may not be enough "idle" buckets to satisfy the new demands. The
algorithm has knobs that control both what it means for a bucket to be
idle, and whether and when to forcefully migrate buckets if the number of
idle ones remains insufficient.
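
The steps above can be sketched in C roughly as follows. This is a
simplified model under stated assumptions, not the code added by this
patch set: struct hop, struct bucket, compute_wants() and upkeep() are
hypothetical names, idleness is a precomputed flag, the rounding used to
derive the wanted bucket counts is one possible choice, and the "force"
argument stands in for the forced-migration knob.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hop {
	unsigned int weight;
	unsigned int wants;	/* buckets this next hop should own */
	unsigned int count;	/* buckets it owns right now */
};

struct bucket {
	size_t owner;		/* index into the hops array */
	bool idle;		/* no recent traffic, safe to migrate */
};

/* Split num_buckets among the hops in proportion to their weights.
 * Rounding the cumulative share keeps the sum of wants equal to
 * num_buckets.
 */
static void compute_wants(struct hop *hops, size_t n, unsigned int num_buckets)
{
	unsigned int total = 0, w = 0, prev = 0;

	for (size_t i = 0; i < n; i++)
		total += hops[i].weight;
	assert(total > 0);

	for (size_t i = 0; i < n; i++) {
		unsigned int upper;

		w += hops[i].weight;
		upper = (num_buckets * w + total / 2) / total;
		hops[i].wants = upper - prev;
		prev = upper;
	}
}

/* One upkeep pass: move idle (or, if force is set, any) buckets from
 * overweight next hops to underweight ones. Returns true if some next
 * hop is still underweight, in which case the caller would schedule
 * another pass.
 */
static bool upkeep(struct hop *hops, size_t n, struct bucket *b, size_t nb,
		   bool force)
{
	for (size_t i = 0; i < nb; i++) {
		struct hop *from = &hops[b[i].owner];

		if (from->count <= from->wants)
			continue;	/* owner is not overweight */
		if (!b[i].idle && !force)
			continue;	/* busy bucket, leave it alone */

		for (size_t j = 0; j < n; j++) {
			if (hops[j].count < hops[j].wants) {
				from->count--;
				hops[j].count++;
				b[i].owner = j;
				break;
			}
		}
	}

	for (size_t j = 0; j < n; j++)
		if (hops[j].count < hops[j].wants)
			return true;
	return false;
}
```

As an illustration with made-up numbers, take 8 buckets split 4/4 between
two next hops whose weights then change to 3 and 1. compute_wants() yields
6 and 2. If only one of the second next hop's buckets is idle, a normal
pass migrates that bucket and reports that the table is still unbalanced;
a later forced pass migrates one more bucket and the table is balanced.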

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
	buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

If not present in the netlink message, the idle timer defaults to 120
seconds, and there is no unbalanced timer, meaning the group may remain
unbalanced indefinitely. The value of 120 is the default in the Cumulus
implementation of resilient next-hop groups. To a degree the default is
arbitrary; the only value that certainly does not make sense is 0.
Therefore going with an existing deployed implementation is reasonable.
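
The meaning of the two timers can be written down as a pair of C
predicates. This is only a sketch: the function names are hypothetical,
time is counted in whole seconds rather than in the kernel's internal time
units, and an unbalanced timer of 0 is used here to represent "no
unbalanced timer", which is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

/* A bucket counts as idle once no traffic has hit it for at least
 * idle_timer seconds.
 */
static bool bucket_is_idle(unsigned long now, unsigned long last_used,
			   unsigned long idle_timer)
{
	assert(now >= last_used);
	return now - last_used >= idle_timer;
}

/* Forced rebalancing becomes due once the table has been out of balance
 * for unbalanced_timer seconds. A value of 0 (assumption) means the timer
 * is not set, so the group may stay unbalanced indefinitely.
 */
static bool force_rebalance_due(unsigned long now,
				unsigned long unbalanced_since,
				unsigned long unbalanced_timer)
{
	assert(now >= unbalanced_since);
	if (unbalanced_timer == 0)
		return false;
	return now - unbalanced_since >= unbalanced_timer;
}
```

With the settings from the example command (idle_timer 60,
unbalanced_timer 300), a bucket last used at second 40 becomes idle at
second 100, and a table unbalanced since second 100 is due for forced
rebalancing at second 400.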

Unbalanced time, i.e. how long it has been since all nexthops last had as
many buckets as they should according to their weights, is reported when
the group is dumped:

 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

When replacing next hops or changing weights, if one does not specify some
parameters, their values are left as they were:

 # ip nexthop replace id 10 group 1,2/2 type resilient
 # ip nexthop show id 10
 id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

It is also possible to do a dump of individual buckets (and now you know
why there were only 8 of them in the example above):

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the nexthop replace command to satisfy the new demand
that nexthop 1 be given 6 buckets instead of 4.

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
  meant to be set for all nexthop groups that in general have several
  nexthops from which they choose, and avoids a more expensive dispatch
  based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop
  group type. At this point, the nexthop code is made to bounce the new
  type. As the resilient hashing code is gradually added in the following
  patches, it will remain dead. The last patch will make it accessible.

  This patch also adds a suite of new messages related to next hop buckets.
  This approach was taken instead of overloading the information on the
  existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

  First, a next-hop group can contain a large number of next-hop buckets
  (4k is not unheard of). This imposes limits on the amount of information
  that can be encoded for each next-hop bucket, given that a netlink
  message is limited to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
  point, in the future it can be extended to provide user space with
  control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the
  drivers.

- Patch #8 adds an interface for the drivers to report resilient hash
  table bucket activity. Drivers will be able to report through this
  interface whether traffic is hitting a given bucket.

- Patch #9 adds an interface for the drivers to report whether a given
  hash table bucket is offloaded or trapping traffic.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
  the code necessary for creation of resilient groups, bucket dumping and
  getting, and bucket migration notifications.

- In patch #14 the resilient next-hop groups are finally made available.

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents 1d5d0a07 15e1dd57
@@ -40,6 +40,12 @@ struct nh_config {
 	struct nlattr *nh_grp;
 	u16 nh_grp_type;
+	u16 nh_grp_res_num_buckets;
+	unsigned long nh_grp_res_idle_timer;
+	unsigned long nh_grp_res_unbalanced_timer;
+	bool nh_grp_res_has_num_buckets;
+	bool nh_grp_res_has_idle_timer;
+	bool nh_grp_res_has_unbalanced_timer;
 	struct nlattr *nh_encap;
 	u16 nh_encap_type;
@@ -63,6 +69,32 @@ struct nh_info {
 	};
 };
+struct nh_res_bucket {
+	struct nh_grp_entry __rcu *nh_entry;
+	atomic_long_t used_time;
+	unsigned long migrated_time;
+	bool occupied;
+	u8 nh_flags;
+};
+struct nh_res_table {
+	struct net *net;
+	u32 nhg_id;
+	struct delayed_work upkeep_dw;
+	/* List of NHGEs that have too few buckets ("uw" for underweight).
+	 * Reclaimed buckets will be given to entries in this list.
+	 */
+	struct list_head uw_nh_entries;
+	unsigned long unbalanced_since;
+	u32 idle_timer;
+	u32 unbalanced_timer;
+	u16 num_nh_buckets;
+	struct nh_res_bucket nh_buckets[];
+};
 struct nh_grp_entry {
 	struct nexthop *nh;
 	u8 weight;
@@ -71,6 +103,13 @@ struct nh_grp_entry {
 		struct {
 			atomic_t upper_bound;
 		} mpath;
+		struct {
+			/* Member on uw_nh_entries. */
+			struct list_head uw_nh_entry;
+			u16 count_buckets;
+			u16 wants_buckets;
+		} res;
 	};
 	struct list_head nh_list;
@@ -80,9 +119,13 @@ struct nh_grp_entry {
 struct nh_group {
 	struct nh_group *spare; /* spare group for removals */
 	u16 num_nh;
+	bool is_multipath;
 	bool mpath;
+	bool resilient;
 	bool fdb_nh;
 	bool has_v4;
+	struct nh_res_table __rcu *res_table;
 	struct nh_grp_entry nh_entries[];
 };
@@ -112,11 +155,15 @@ struct nexthop {
 enum nexthop_event_type {
 	NEXTHOP_EVENT_DEL,
 	NEXTHOP_EVENT_REPLACE,
+	NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE,
+	NEXTHOP_EVENT_BUCKET_REPLACE,
 };
 enum nh_notifier_info_type {
 	NH_NOTIFIER_INFO_TYPE_SINGLE,
 	NH_NOTIFIER_INFO_TYPE_GRP,
+	NH_NOTIFIER_INFO_TYPE_RES_TABLE,
+	NH_NOTIFIER_INFO_TYPE_RES_BUCKET,
 };
 struct nh_notifier_single_info {
@@ -143,6 +190,19 @@ struct nh_notifier_grp_info {
 	struct nh_notifier_grp_entry_info nh_entries[];
 };
+struct nh_notifier_res_bucket_info {
+	u16 bucket_index;
+	unsigned int idle_timer_ms;
+	bool force;
+	struct nh_notifier_single_info old_nh;
+	struct nh_notifier_single_info new_nh;
+};
+struct nh_notifier_res_table_info {
+	u16 num_nh_buckets;
+	struct nh_notifier_single_info nhs[];
+};
 struct nh_notifier_info {
 	struct net *net;
 	struct netlink_ext_ack *extack;
@@ -151,6 +211,8 @@ struct nh_notifier_info {
 	union {
 		struct nh_notifier_single_info *nh;
 		struct nh_notifier_grp_info *nh_grp;
+		struct nh_notifier_res_table_info *nh_res_table;
+		struct nh_notifier_res_bucket_info *nh_res_bucket;
 	};
 };
@@ -158,6 +220,10 @@ int register_nexthop_notifier(struct net *net, struct notifier_block *nb,
 			      struct netlink_ext_ack *extack);
 int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb);
 void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap);
+void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
+				 bool offload, bool trap);
+void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets,
+				     unsigned long *activity);
 /* caller is holding rcu or rtnl; no reference taken to nexthop */
 struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
@@ -212,7 +278,7 @@ static inline bool nexthop_is_multipath(const struct nexthop *nh)
 		struct nh_group *nh_grp;
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		return nh_grp->mpath;
+		return nh_grp->is_multipath;
 	}
 	return false;
 }
@@ -227,7 +293,7 @@ static inline unsigned int nexthop_num_path(const struct nexthop *nh)
 		struct nh_group *nh_grp;
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		if (nh_grp->mpath)
+		if (nh_grp->is_multipath)
 			rc = nh_grp->num_nh;
 	}
@@ -308,7 +374,7 @@ struct fib_nh_common *nexthop_fib_nhc(struct nexthop *nh, int nhsel)
 		struct nh_group *nh_grp;
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		if (nh_grp->mpath) {
+		if (nh_grp->is_multipath) {
 			nh = nexthop_mpath_select(nh_grp, nhsel);
 			if (!nh)
 				return NULL;
......
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 enum {
-	NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */
+	NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group
+				 * default type if not specified
+				 */
+	NEXTHOP_GRP_TYPE_RES, /* resilient nexthop group */
 	__NEXTHOP_GRP_TYPE_MAX,
 };
@@ -52,8 +55,50 @@ enum {
 	NHA_FDB, /* flag; nexthop belongs to a bridge fdb */
 	/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
+	/* nested; resilient nexthop group attributes */
+	NHA_RES_GROUP,
+	/* nested; nexthop bucket attributes */
+	NHA_RES_BUCKET,
 	__NHA_MAX,
 };
 #define NHA_MAX (__NHA_MAX - 1)
+enum {
+	NHA_RES_GROUP_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+	/* u16; number of nexthop buckets in a resilient nexthop group */
+	NHA_RES_GROUP_BUCKETS,
+	/* clock_t as u32; nexthop bucket idle timer (per-group) */
+	NHA_RES_GROUP_IDLE_TIMER,
+	/* clock_t as u32; nexthop unbalanced timer */
+	NHA_RES_GROUP_UNBALANCED_TIMER,
+	/* clock_t as u64; nexthop unbalanced time */
+	NHA_RES_GROUP_UNBALANCED_TIME,
+	__NHA_RES_GROUP_MAX,
+};
+#define NHA_RES_GROUP_MAX (__NHA_RES_GROUP_MAX - 1)
+enum {
+	NHA_RES_BUCKET_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+	/* u16; nexthop bucket index */
+	NHA_RES_BUCKET_INDEX,
+	/* clock_t as u64; nexthop bucket idle time */
+	NHA_RES_BUCKET_IDLE_TIME,
+	/* u32; nexthop id assigned to the nexthop bucket */
+	NHA_RES_BUCKET_NH_ID,
+	__NHA_RES_BUCKET_MAX,
+};
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
 #endif
@@ -178,6 +178,13 @@ enum {
 	RTM_GETVLAN,
 #define RTM_GETVLAN RTM_GETVLAN
+	RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET RTM_NEWNEXTHOPBUCKET
+	RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET RTM_DELNEXTHOPBUCKET
+	RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET
 	__RTM_MAX,
 #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
......
This diff is collapsed.
@@ -88,6 +88,9 @@ static const struct nlmsg_perm nlmsg_route_perms[] =
 	{ RTM_NEWVLAN, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_DELVLAN, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_GETVLAN, NETLINK_ROUTE_SOCKET__NLMSG_READ },
+	{ RTM_NEWNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_DELNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_READ },
 };
 static const struct nlmsg_perm nlmsg_tcpdiag_perms[] =
@@ -171,7 +174,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
 		 * structures at the top of this file with the new mappings
 		 * before updating the BUILD_BUG_ON() macro!
 		 */
-		BUILD_BUG_ON(RTM_MAX != (RTM_NEWVLAN + 3));
+		BUILD_BUG_ON(RTM_MAX != (RTM_NEWNEXTHOPBUCKET + 3));
 		err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
 				 sizeof(nlmsg_route_perms));
 		break;
......