Commit 35dfaad7 authored by Daniel Borkmann's avatar Daniel Borkmann Committed by Martin KaFai Lau

netkit, bpf: Add bpf programmable net device

This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.

One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.

Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.

An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
Acked-by: default avatarStanislav Fomichev <sdf@google.com>
Acked-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.netSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
parent 42d31dd6
......@@ -3795,6 +3795,15 @@ L: bpf@vger.kernel.org
S: Odd Fixes
K: (?:\b|_)bpf(?:\b|_)
BPF [NETKIT] (BPF-programmable network device)
M: Daniel Borkmann <daniel@iogearbox.net>
M: Nikolay Aleksandrov <razor@blackwall.org>
L: bpf@vger.kernel.org
L: netdev@vger.kernel.org
S: Supported
F: drivers/net/netkit.c
F: include/net/netkit.h
BPF [NETWORKING] (struct_ops, reuseport)
M: Martin KaFai Lau <martin.lau@linux.dev>
L: bpf@vger.kernel.org
......
......@@ -448,6 +448,15 @@ config NLMON
diagnostics, etc. This is mostly intended for developers or support
to debug netlink issues. If unsure, say N.
config NETKIT
bool "BPF-programmable network device"
depends on BPF_SYSCALL
help
The netkit device is a virtual networking device where BPF programs
can be attached to the device(s) transmission routine in order to
implement the driver's internal logic. The device can be configured
to operate in L3 or L2 mode. If unsure, say N.
config NET_VRF
tristate "Virtual Routing and Forwarding (Lite)"
depends on IP_MULTIPLE_TABLES
......
......@@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
obj-$(CONFIG_NET) += loopback.o
obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
obj-$(CONFIG_NETCONSOLE) += netconsole.o
obj-$(CONFIG_NETKIT) += netkit.o
obj-y += phy/
obj-y += pse-pd/
obj-y += mdio/
......
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2023 Isovalent */
#ifndef __NET_NETKIT_H
#define __NET_NETKIT_H
#include <linux/bpf.h>
#ifdef CONFIG_NETKIT
int netkit_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int netkit_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int netkit_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
int netkit_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
#else
static inline int netkit_prog_attach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}
static inline int netkit_link_attach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}
static inline int netkit_prog_detach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}
static inline int netkit_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
return -EINVAL;
}
#endif /* CONFIG_NETKIT */
#endif /* __NET_NETKIT_H */
......@@ -1052,6 +1052,8 @@ enum bpf_attach_type {
BPF_CGROUP_UNIX_RECVMSG,
BPF_CGROUP_UNIX_GETPEERNAME,
BPF_CGROUP_UNIX_GETSOCKNAME,
BPF_NETKIT_PRIMARY,
BPF_NETKIT_PEER,
__MAX_BPF_ATTACH_TYPE
};
......@@ -1071,6 +1073,7 @@ enum bpf_link_type {
BPF_LINK_TYPE_NETFILTER = 10,
BPF_LINK_TYPE_TCX = 11,
BPF_LINK_TYPE_UPROBE_MULTI = 12,
BPF_LINK_TYPE_NETKIT = 13,
MAX_BPF_LINK_TYPE,
};
......@@ -1656,6 +1659,13 @@ union bpf_attr {
__u32 flags;
__u32 pid;
} uprobe_multi;
struct {
union {
__u32 relative_fd;
__u32 relative_id;
};
__u64 expected_revision;
} netkit;
};
} link_create;
......@@ -6576,6 +6586,10 @@ struct bpf_link_info {
__u32 ifindex;
__u32 attach_type;
} tcx;
struct {
__u32 ifindex;
__u32 attach_type;
} netkit;
};
} __attribute__((aligned(8)));
......
......@@ -756,6 +756,30 @@ struct tunnel_msg {
__u32 ifindex;
};
/* netkit section */
enum netkit_action {
NETKIT_NEXT = -1,
NETKIT_PASS = 0,
NETKIT_DROP = 2,
NETKIT_REDIRECT = 7,
};
enum netkit_mode {
NETKIT_L2,
NETKIT_L3,
};
enum {
IFLA_NETKIT_UNSPEC,
IFLA_NETKIT_PEER_INFO,
IFLA_NETKIT_PRIMARY,
IFLA_NETKIT_POLICY,
IFLA_NETKIT_PEER_POLICY,
IFLA_NETKIT_MODE,
__IFLA_NETKIT_MAX,
};
#define IFLA_NETKIT_MAX (__IFLA_NETKIT_MAX - 1)
/* VXLAN section */
/* include statistics in the dump */
......
......@@ -35,8 +35,9 @@
#include <linux/rcupdate_trace.h>
#include <linux/memcontrol.h>
#include <linux/trace_events.h>
#include <net/netfilter/nf_bpf_link.h>
#include <net/netfilter/nf_bpf_link.h>
#include <net/netkit.h>
#include <net/tcx.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
......@@ -3730,6 +3731,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
return BPF_PROG_TYPE_LSM;
case BPF_TCX_INGRESS:
case BPF_TCX_EGRESS:
case BPF_NETKIT_PRIMARY:
case BPF_NETKIT_PEER:
return BPF_PROG_TYPE_SCHED_CLS;
default:
return BPF_PROG_TYPE_UNSPEC;
......@@ -3781,7 +3784,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
return 0;
case BPF_PROG_TYPE_SCHED_CLS:
if (attach_type != BPF_TCX_INGRESS &&
attach_type != BPF_TCX_EGRESS)
attach_type != BPF_TCX_EGRESS &&
attach_type != BPF_NETKIT_PRIMARY &&
attach_type != BPF_NETKIT_PEER)
return -EINVAL;
return 0;
default:
......@@ -3864,7 +3869,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
ret = cgroup_bpf_prog_attach(attr, ptype, prog);
break;
case BPF_PROG_TYPE_SCHED_CLS:
ret = tcx_prog_attach(attr, prog);
if (attr->attach_type == BPF_TCX_INGRESS ||
attr->attach_type == BPF_TCX_EGRESS)
ret = tcx_prog_attach(attr, prog);
else
ret = netkit_prog_attach(attr, prog);
break;
default:
ret = -EINVAL;
......@@ -3925,7 +3934,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
ret = cgroup_bpf_prog_detach(attr, ptype);
break;
case BPF_PROG_TYPE_SCHED_CLS:
ret = tcx_prog_detach(attr, prog);
if (attr->attach_type == BPF_TCX_INGRESS ||
attr->attach_type == BPF_TCX_EGRESS)
ret = tcx_prog_detach(attr, prog);
else
ret = netkit_prog_detach(attr, prog);
break;
default:
ret = -EINVAL;
......@@ -3992,6 +4005,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_TCX_INGRESS:
case BPF_TCX_EGRESS:
return tcx_prog_query(attr, uattr);
case BPF_NETKIT_PRIMARY:
case BPF_NETKIT_PEER:
return netkit_prog_query(attr, uattr);
default:
return -EINVAL;
}
......@@ -4973,7 +4989,11 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
ret = bpf_xdp_link_attach(attr, prog);
break;
case BPF_PROG_TYPE_SCHED_CLS:
ret = tcx_link_attach(attr, prog);
if (attr->link_create.attach_type == BPF_TCX_INGRESS ||
attr->link_create.attach_type == BPF_TCX_EGRESS)
ret = tcx_link_attach(attr, prog);
else
ret = netkit_link_attach(attr, prog);
break;
case BPF_PROG_TYPE_NETFILTER:
ret = bpf_nf_link_attach(attr, prog);
......
......@@ -1052,6 +1052,8 @@ enum bpf_attach_type {
BPF_CGROUP_UNIX_RECVMSG,
BPF_CGROUP_UNIX_GETPEERNAME,
BPF_CGROUP_UNIX_GETSOCKNAME,
BPF_NETKIT_PRIMARY,
BPF_NETKIT_PEER,
__MAX_BPF_ATTACH_TYPE
};
......@@ -1071,6 +1073,7 @@ enum bpf_link_type {
BPF_LINK_TYPE_NETFILTER = 10,
BPF_LINK_TYPE_TCX = 11,
BPF_LINK_TYPE_UPROBE_MULTI = 12,
BPF_LINK_TYPE_NETKIT = 13,
MAX_BPF_LINK_TYPE,
};
......@@ -1656,6 +1659,13 @@ union bpf_attr {
__u32 flags;
__u32 pid;
} uprobe_multi;
struct {
union {
__u32 relative_fd;
__u32 relative_id;
};
__u64 expected_revision;
} netkit;
};
} link_create;
......@@ -6576,6 +6586,10 @@ struct bpf_link_info {
__u32 ifindex;
__u32 attach_type;
} tcx;
struct {
__u32 ifindex;
__u32 attach_type;
} netkit;
};
} __attribute__((aligned(8)));
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment