Commit d03b195b authored by Maxim Mikityanskiy, committed by Jakub Kicinski

sch_htb: Hierarchical QoS hardware offload

HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.
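
As an illustration of the control path described above, here is a minimal userspace sketch of how a driver might dispatch the commands it receives. It is not taken from any real driver. The enum and struct mirror the definitions this patch adds (with the extack pointer left out and kernel integer types replaced by stdint ones), while demo_nic, DEMO_MAX_LEAVES, demo_setup_tc_htb, the choice of error codes, and the use of a table slot index as the qid are all hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Local mirror of the commands added by this patch, so the sketch
 * compiles outside the kernel. */
enum tc_htb_command {
	TC_HTB_CREATE,
	TC_HTB_DESTROY,
	TC_HTB_LEAF_ALLOC_QUEUE,
	TC_HTB_LEAF_TO_INNER,
	TC_HTB_LEAF_DEL,
	TC_HTB_LEAF_DEL_LAST,
	TC_HTB_LEAF_DEL_LAST_FORCE,
	TC_HTB_NODE_MODIFY,
	TC_HTB_LEAF_QUERY_QUEUE,
};

/* Mirror of struct tc_htb_qopt_offload without the extack pointer. */
struct tc_htb_qopt_offload {
	enum tc_htb_command command;
	uint16_t classid;
	uint32_t parent_classid;
	uint16_t qid;
	uint16_t moved_qid;
	uint64_t rate;
	uint64_t ceil;
};

#define DEMO_MAX_LEAVES 4 /* hypothetical hardware limit */

/* Hypothetical per-device state: one slot per leaf class. */
struct demo_nic {
	int offload_active;
	uint16_t leaf_classid[DEMO_MAX_LEAVES]; /* assumption: 0 marks a free slot */
};

/* Hypothetical handler for a subset of the commands. */
static int demo_setup_tc_htb(struct demo_nic *nic,
			     struct tc_htb_qopt_offload *opt)
{
	int i;

	switch (opt->command) {
	case TC_HTB_CREATE:
		nic->offload_active = 1;
		return 0;
	case TC_HTB_DESTROY:
		nic->offload_active = 0;
		return 0;
	case TC_HTB_LEAF_ALLOC_QUEUE:
		if (!nic->offload_active)
			return -EINVAL;
		for (i = 0; i < DEMO_MAX_LEAVES; i++) {
			if (nic->leaf_classid[i] == 0) {
				nic->leaf_classid[i] = opt->classid;
				opt->qid = (uint16_t)i; /* report the queue back */
				return 0;
			}
		}
		return -ENOSPC; /* the hardware limit on classes is reached */
	case TC_HTB_LEAF_QUERY_QUEUE:
		for (i = 0; i < DEMO_MAX_LEAVES; i++) {
			if (nic->leaf_classid[i] == opt->classid) {
				opt->qid = (uint16_t)i;
				return 0;
			}
		}
		return -ENOENT;
	default:
		/* The remaining commands are left out of this sketch. */
		return -EOPNOTSUPP;
	}
}
```

A caller would fill in command and classid, invoke the handler, and read qid back on success. A real driver would additionally program rate and ceil into the NIC and handle the remaining commands.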

After this setup, the HTB algorithm runs entirely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, this can be changed so that multiple SQs
back a single leaf class.

ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.

The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.
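
The queue-selection step of this data path can be sketched as follows. This is an assumption-laden illustration rather than driver code: it supposes that the classifier stored the class handle in the packet priority, that the driver keeps a classid-to-queue table, and that unmatched packets fall back to a default queue. demo_class_map, demo_select_queue, DEMO_HTB_MAJOR, and DEMO_DEFAULT_QID are hypothetical names.

```c
#include <stdint.h>

#define DEMO_HTB_MAJOR   0x0001u /* major number of "handle 1:" */
#define DEMO_DEFAULT_QID 0u      /* hypothetical fallback queue */

/* Hypothetical table entry: which queue backs which leaf class. */
struct demo_class_map {
	uint16_t classid; /* minor number of the leaf class */
	uint16_t qid;     /* hardware queue serving that class */
};

/* Pick a queue for a packet whose priority carries a class handle in
 * the form major (upper 16 bits) and minor (lower 16 bits). */
static uint16_t demo_select_queue(const struct demo_class_map *map,
				  unsigned int n, uint32_t skb_priority)
{
	uint16_t major = (uint16_t)(skb_priority >> 16);
	uint16_t minor = (uint16_t)(skb_priority & 0xffffu);
	unsigned int i;

	/* A priority that does not name a class of this HTB instance. */
	if (major != DEMO_HTB_MAJOR)
		return DEMO_DEFAULT_QID;

	for (i = 0; i < n; i++) {
		if (map[i].classid == minor)
			return map[i].qid;
	}
	return DEMO_DEFAULT_QID; /* unknown class */
}
```

A linear scan keeps the sketch short. A driver with many classes would more likely use a hash table or a direct index.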

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

    # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
        classid 1:10

It's possible to move classification to the clsact egress hook, which is
thread-safe and lock-free:

    # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
        action skbedit priority 1:10

This way, classification still happens in software, but the lock
contention is eliminated. Classification also runs before the TX queue
is selected, which lets the driver translate the class to the
corresponding hardware queue in ndo_select_queue.

Note that this is already compatible with non-offloaded HTB and requires
no changes to the kernel or iproute2.
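
For readers unfamiliar with the "1:10" notation used in the commands above: tc handles are written as major:minor in hexadecimal and packed into a 32-bit value with the major number in the upper 16 bits. The helper below is a small sketch of that conversion. demo_parse_classid is a hypothetical name, not an iproute2 function.

```c
#include <stdint.h>
#include <stdlib.h>

/* Convert a "major:minor" string (both parts hexadecimal) into a 32-bit
 * handle. Returns 0 on success and -1 on malformed input. An empty
 * minor, as in "1:", is treated as 0. */
static int demo_parse_classid(const char *s, uint32_t *handle)
{
	char *end;
	unsigned long maj, min;

	maj = strtoul(s, &end, 16);
	if (end == s || *end != ':' || maj > 0xffff)
		return -1;

	s = end + 1;
	min = strtoul(s, &end, 16);
	if (*end != '\0' || min > 0xffff)
		return -1;

	*handle = ((uint32_t)maj << 16) | (uint32_t)min;
	return 0;
}
```

With this encoding, "1:10" becomes 0x00010010, so the minor number is 16 in decimal, not 10.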

2. Contention by handling packets. HTB is not multi-queue: it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.

Some features of HTB may not be supported by particular hardware: for
example, the maximum number of classes may be limited, or the
granularity of the rate and ceil parameters may differ. For this reason
the offload is not enabled by default; a new parameter enables it:

    # tc qdisc replace dev eth0 root handle 1: htb offload

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parent 4dd78a73
@@ -858,6 +858,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_ETS,
 	TC_SETUP_QDISC_TBF,
 	TC_SETUP_QDISC_FIFO,
+	TC_SETUP_QDISC_HTB,
 };

 /* These structures hold the attributes of bpf state that are being passed
@@ -783,6 +783,42 @@ struct tc_mq_qopt_offload {
 	};
 };

+enum tc_htb_command {
+	/* Root */
+	TC_HTB_CREATE, /* Initialize HTB offload. */
+	TC_HTB_DESTROY, /* Destroy HTB offload. */
+
+	/* Classes */
+	/* Allocate qid and create leaf. */
+	TC_HTB_LEAF_ALLOC_QUEUE,
+	/* Convert leaf to inner, preserve and return qid, create new leaf. */
+	TC_HTB_LEAF_TO_INNER,
+	/* Delete leaf, while siblings remain. */
+	TC_HTB_LEAF_DEL,
+	/* Delete leaf, convert parent to leaf, preserving qid. */
+	TC_HTB_LEAF_DEL_LAST,
+	/* TC_HTB_LEAF_DEL_LAST, but delete driver data on hardware errors. */
+	TC_HTB_LEAF_DEL_LAST_FORCE,
+	/* Modify parameters of a node. */
+	TC_HTB_NODE_MODIFY,
+
+	/* Class qdisc */
+	TC_HTB_LEAF_QUERY_QUEUE, /* Query qid by classid. */
+};
+
+struct tc_htb_qopt_offload {
+	struct netlink_ext_ack *extack;
+	enum tc_htb_command command;
+	u16 classid;
+	u32 parent_classid;
+	u16 qid;
+	u16 moved_qid;
+	u64 rate;
+	u64 ceil;
+};
+
+#define TC_HTB_CLASSID_ROOT U32_MAX
+
 enum tc_red_command {
 	TC_RED_REPLACE,
 	TC_RED_DESTROY,
@@ -434,6 +434,7 @@ enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };
@@ -414,6 +414,7 @@ enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };