Commit e57892f5 authored by Alexei Starovoitov's avatar Alexei Starovoitov

Merge branch 'bpf-socket-lookup'

Jakub Sitnicki says:

====================
Changelog
=========
v4 -> v5:
- Enforce BPF prog return value to be SK_DROP or SK_PASS. (Andrii)
- Simplify prog runners now that only SK_DROP/PASS can be returned.
- Enable bpf_perf_event_output from the start. (Andrii)
- Drop patch
  "selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c"
- Remove tests for narrow loads from context at an offset wider in size
  than target field, while we are discussing how to fix it:
  https://lore.kernel.org/bpf/20200710173123.427983-1-jakub@cloudflare.com/
- Rebase onto recent bpf-next (bfdfa517)
- Other minor changes called out in per-patch changelogs,
  see patches: 2, 4, 6, 13-15
- Carried over Andrii's Acks where nothing changed.

v3 -> v4:
- Reduce BPF prog return codes to SK_DROP/SK_PASS (Lorenz)
- Default to drop on illegal return value from BPF prog (Lorenz)
- Extend bpf_sk_assign to accept NULL socket pointer.
- Switch to saner return values and add docs for new prog_array API (Andrii)
- Add support for narrow loads from BPF context fields (Yonghong)
- Fix broken build when IPv6 is compiled as a module (kernel test robot)
- Fix null/wild-ptr-deref on BPF context access
- Rebase to recent bpf-next (eef8a42d)
- Other minor changes called out in per-patch changelogs,
  see patches 1-2, 4, 6, 8, 10-12, 14, 16

v2 -> v3:
- Switch to link-based program attachment
- Support for multi-prog attachment
- Ability to skip reuseport socket selection
- Code on RX path is guarded by a static key
- struct in6_addr's are no longer copied into BPF prog context
- BPF prog context is initialized as late as possible
- Changes called out in patches 1-2, 4, 6, 8, 10-14, 16
- Patches dropped:
  01/17 flow_dissector: Extract attach/detach/query helpers
  03/17 inet: Store layer 4 protocol in inet_hashinfo
  08/17 udp: Store layer 4 protocol in udp_table

v1 -> v2:
- Changes called out in patches 2, 13-15, 17
- Rebase to recent bpf-next (b4563fac)

RFCv2 -> v1:

- Switch to fetching a socket from a map and selecting a socket with
  bpf_sk_assign, instead of having a dedicated helper that does both.
- Run reuseport logic on sockets selected by BPF sk_lookup.
- Allow BPF sk_lookup to fail the lookup with no match.
- Go back to having just 2 hash table lookups in UDP.

RFCv1 -> RFCv2:

- Make socket lookup redirection map-based. BPF program now uses a
  dedicated helper and a SOCKARRAY map to select the socket to redirect to.
  A consequence of this change is that bpf_inet_lookup context is now
  read-only.
- Look for connected UDP sockets before allowing redirection from BPF.
  This makes connected UDP socket work as expected in the presence of
  inet_lookup prog.
- Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
  the only other per-netns BPF prog type.

Overview
========

This series proposes a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP,
or BPF sk_lookup for short.

BPF sk_lookup program runs when transport layer is looking up a listening
socket for a new connection request (TCP), or when looking up an
unconnected socket for a packet (UDP).

This serves as a mechanism to overcome the limits of what bind() API allows
to express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a single socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a single socket

     198.51.100.1, any port -> L7 proxy socket

In its context, program receives information about the packet that
triggered the socket lookup. Namely IP version, L4 protocol identifier, and
address 4-tuple.

To select a socket BPF program fetches it from a map holding socket
references, like SOCKMAP or SOCKHASH, calls bpf_sk_assign(ctx, sk, ...)
helper to record the selection, and returns SK_PASS code. Transport layer
then uses the selected socket as a result of socket lookup.

Alternatively, program can also fail the lookup (SK_DROP), or let the
lookup continue as usual (SK_PASS without selecting a socket).

This lets the user match packets with listening (TCP) or receiving (UDP)
sockets freely at the last possible point on the receive path, where we
know that packets are destined for local delivery after undergoing
policing, filtering, and routing.

Program is attached to a network namespace, similar to BPF flow_dissector.
We add a new attach type, BPF_SK_LOOKUP, for this. Multiple programs can be
attached at the same time, in which case their return values are aggregated
according the rules outlined in patch #4 description.

Series structure
================

Patches are organized as so:

 1: enables multiple link-based prog attachments for bpf-netns
 2: introduces sk_lookup program type
 3-4: hook up the program to run on ipv4/tcp socket lookup
 5-6: hook up the program to run on ipv6/tcp socket lookup
 7-8: hook up the program to run on ipv4/udp socket lookup
 9-10: hook up the program to run on ipv6/udp socket lookup
 11-13: libbpf & bpftool support for sk_lookup
 14-15: verifier and selftests for sk_lookup

Patches are also available on GH:

  https://github.com/jsitnicki/linux/commits/bpf-inet-lookup-v5

Follow-up work
==============

I'll follow up with below items, which IMHO don't block the review:

- benchmark results for udp6 small packet flood scenario,
- user docs for new BPF prog type, Documentation/bpf/prog_sk_lookup.rst,
- timeout for accept() in tests after extending network_helper.[ch].

Thanks to the reviewers for their feedback to this patch series:

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Marek Majkowski <marek@cloudflare.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>

-jkbs

[RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
[RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[v1] https://lore.kernel.org/bpf/20200511185218.1422406-18-jakub@cloudflare.com/
[v2] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
[v3] https://lore.kernel.org/bpf/20200702092416.11961-1-jakub@cloudflare.com/
[v4] https://lore.kernel.org/bpf/20200713174654.642628-1-jakub@cloudflare.com/
====================
Reviewed-by: default avatarLorenz Bauer <lmb@cloudflare.com>
Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
parents bfdfa517 0ab5539f
......@@ -8,6 +8,7 @@
enum netns_bpf_attach_type {
NETNS_BPF_INVALID = -1,
NETNS_BPF_FLOW_DISSECTOR = 0,
NETNS_BPF_SK_LOOKUP,
MAX_NETNS_BPF_ATTACH_TYPE
};
......@@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
switch (attach_type) {
case BPF_FLOW_DISSECTOR:
return NETNS_BPF_FLOW_DISSECTOR;
case BPF_SK_LOOKUP:
return NETNS_BPF_SK_LOOKUP;
default:
return NETNS_BPF_INVALID;
}
......
......@@ -249,6 +249,7 @@ enum bpf_arg_type {
ARG_PTR_TO_INT, /* pointer to int */
ARG_PTR_TO_LONG, /* pointer to long */
ARG_PTR_TO_SOCKET, /* pointer to bpf_sock (fullsock) */
ARG_PTR_TO_SOCKET_OR_NULL, /* pointer to bpf_sock (fullsock) or NULL */
ARG_PTR_TO_BTF_ID, /* pointer to in-kernel struct */
ARG_PTR_TO_ALLOC_MEM, /* pointer to dynamically allocated memory */
ARG_PTR_TO_ALLOC_MEM_OR_NULL, /* pointer to dynamically allocated memory or NULL */
......@@ -928,6 +929,9 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
struct bpf_prog *old_prog);
int bpf_prog_array_delete_safe_at(struct bpf_prog_array *array, int index);
int bpf_prog_array_update_at(struct bpf_prog_array *array, int index,
struct bpf_prog *prog);
int bpf_prog_array_copy_info(struct bpf_prog_array *array,
u32 *prog_ids, u32 request_cnt,
u32 *prog_cnt);
......
......@@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
#ifdef CONFIG_INET
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
struct sk_reuseport_md, struct sk_reuseport_kern)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
#endif
#if defined(CONFIG_BPF_JIT)
BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
......
......@@ -1278,4 +1278,151 @@ struct bpf_sockopt_kern {
s32 retval;
};
struct bpf_sk_lookup_kern {
u16 family;
u16 protocol;
struct {
__be32 saddr;
__be32 daddr;
} v4;
struct {
const struct in6_addr *saddr;
const struct in6_addr *daddr;
} v6;
__be16 sport;
u16 dport;
struct sock *selected_sk;
bool no_reuseport;
};
extern struct static_key_false bpf_sk_lookup_enabled;
/* Runners for BPF_SK_LOOKUP programs to invoke on socket lookup.
*
* Allowed return values for a BPF SK_LOOKUP program are SK_PASS and
* SK_DROP. Their meaning is as follows:
*
* SK_PASS && ctx.selected_sk != NULL: use selected_sk as lookup result
* SK_PASS && ctx.selected_sk == NULL: continue to htable-based socket lookup
* SK_DROP : terminate lookup with -ECONNREFUSED
*
* This macro aggregates return values and selected sockets from
* multiple BPF programs according to following rules in order:
*
* 1. If any program returned SK_PASS and a non-NULL ctx.selected_sk,
* macro result is SK_PASS and last ctx.selected_sk is used.
* 2. If any program returned SK_DROP return value,
* macro result is SK_DROP.
* 3. Otherwise result is SK_PASS and ctx.selected_sk is NULL.
*
* Caller must ensure that the prog array is non-NULL, and that the
* array as well as the programs it contains remain valid.
*/
#define BPF_PROG_SK_LOOKUP_RUN_ARRAY(array, ctx, func) \
({ \
struct bpf_sk_lookup_kern *_ctx = &(ctx); \
struct bpf_prog_array_item *_item; \
struct sock *_selected_sk = NULL; \
bool _no_reuseport = false; \
struct bpf_prog *_prog; \
bool _all_pass = true; \
u32 _ret; \
\
migrate_disable(); \
_item = &(array)->items[0]; \
while ((_prog = READ_ONCE(_item->prog))) { \
/* restore most recent selection */ \
_ctx->selected_sk = _selected_sk; \
_ctx->no_reuseport = _no_reuseport; \
\
_ret = func(_prog, _ctx); \
if (_ret == SK_PASS && _ctx->selected_sk) { \
/* remember last non-NULL socket */ \
_selected_sk = _ctx->selected_sk; \
_no_reuseport = _ctx->no_reuseport; \
} else if (_ret == SK_DROP && _all_pass) { \
_all_pass = false; \
} \
_item++; \
} \
_ctx->selected_sk = _selected_sk; \
_ctx->no_reuseport = _no_reuseport; \
migrate_enable(); \
_all_pass || _selected_sk ? SK_PASS : SK_DROP; \
})
static inline bool bpf_sk_lookup_run_v4(struct net *net, int protocol,
const __be32 saddr, const __be16 sport,
const __be32 daddr, const u16 dport,
struct sock **psk)
{
struct bpf_prog_array *run_array;
struct sock *selected_sk = NULL;
bool no_reuseport = false;
rcu_read_lock();
run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
if (run_array) {
struct bpf_sk_lookup_kern ctx = {
.family = AF_INET,
.protocol = protocol,
.v4.saddr = saddr,
.v4.daddr = daddr,
.sport = sport,
.dport = dport,
};
u32 act;
act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, BPF_PROG_RUN);
if (act == SK_PASS) {
selected_sk = ctx.selected_sk;
no_reuseport = ctx.no_reuseport;
} else {
selected_sk = ERR_PTR(-ECONNREFUSED);
}
}
rcu_read_unlock();
*psk = selected_sk;
return no_reuseport;
}
#if IS_ENABLED(CONFIG_IPV6)
static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
const struct in6_addr *saddr,
const __be16 sport,
const struct in6_addr *daddr,
const u16 dport,
struct sock **psk)
{
struct bpf_prog_array *run_array;
struct sock *selected_sk = NULL;
bool no_reuseport = false;
rcu_read_lock();
run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
if (run_array) {
struct bpf_sk_lookup_kern ctx = {
.family = AF_INET6,
.protocol = protocol,
.v6.saddr = saddr,
.v6.daddr = daddr,
.sport = sport,
.dport = dport,
};
u32 act;
act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, BPF_PROG_RUN);
if (act == SK_PASS) {
selected_sk = ctx.selected_sk;
no_reuseport = ctx.no_reuseport;
} else {
selected_sk = ERR_PTR(-ECONNREFUSED);
}
}
rcu_read_unlock();
*psk = selected_sk;
return no_reuseport;
}
#endif /* IS_ENABLED(CONFIG_IPV6) */
#endif /* __LINUX_FILTER_H__ */
......@@ -189,6 +189,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_STRUCT_OPS,
BPF_PROG_TYPE_EXT,
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
};
enum bpf_attach_type {
......@@ -228,6 +229,7 @@ enum bpf_attach_type {
BPF_XDP_DEVMAP,
BPF_CGROUP_INET_SOCK_RELEASE,
BPF_XDP_CPUMAP,
BPF_SK_LOOKUP,
__MAX_BPF_ATTACH_TYPE
};
......@@ -3069,6 +3071,10 @@ union bpf_attr {
*
* long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
* description applies to **BPF_PROG_TYPE_SCHED_CLS** and
* **BPF_PROG_TYPE_SCHED_ACT** programs.
*
* Assign the *sk* to the *skb*. When combined with appropriate
* routing configuration to receive the packet towards the socket,
* will cause *skb* to be delivered to the specified socket.
......@@ -3094,6 +3100,56 @@ union bpf_attr {
* **-ESOCKTNOSUPPORT** if the socket type is not supported
* (reuseport).
*
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
* description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
*
* Select the *sk* as a result of a socket lookup.
*
* For the operation to succeed passed socket must be compatible
* with the packet description provided by the *ctx* object.
*
* L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
* be an exact match. While IP family (**AF_INET** or
* **AF_INET6**) must be compatible, that is IPv6 sockets
* that are not v6-only can be selected for IPv4 packets.
*
* Only TCP listeners and UDP unconnected sockets can be
* selected. *sk* can also be NULL to reset any previous
* selection.
*
* *flags* argument can combination of following values:
*
* * **BPF_SK_LOOKUP_F_REPLACE** to override the previous
* socket selection, potentially done by a BPF program
* that ran before us.
*
* * **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
* load-balancing within reuseport group for the socket
* being selected.
*
* On success *ctx->sk* will point to the selected socket.
*
* Return
* 0 on success, or a negative errno in case of failure.
*
* * **-EAFNOSUPPORT** if socket family (*sk->family*) is
* not compatible with packet family (*ctx->family*).
*
* * **-EEXIST** if socket has been already selected,
* potentially by another program, and
* **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
*
* * **-EINVAL** if unsupported flags were specified.
*
* * **-EPROTOTYPE** if socket L4 protocol
* (*sk->protocol*) doesn't match packet protocol
* (*ctx->protocol*).
*
* * **-ESOCKTNOSUPPORT** if socket is not in allowed
* state (TCP listening or UDP unconnected).
*
* u64 bpf_ktime_get_boot_ns(void)
* Description
* Return the time elapsed since system boot, in nanoseconds.
......@@ -3607,6 +3663,12 @@ enum {
BPF_RINGBUF_HDR_SZ = 8,
};
/* BPF_FUNC_sk_assign flags in bpf_sk_lookup context. */
enum {
BPF_SK_LOOKUP_F_REPLACE = (1ULL << 0),
BPF_SK_LOOKUP_F_NO_REUSEPORT = (1ULL << 1),
};
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
......@@ -4349,4 +4411,19 @@ struct bpf_pidns_info {
__u32 pid;
__u32 tgid;
};
/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
struct bpf_sk_lookup {
__bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */
__u32 family; /* Protocol family (AF_INET, AF_INET6) */
__u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
__u32 remote_ip4; /* Network byte order */
__u32 remote_ip6[4]; /* Network byte order */
__u32 remote_port; /* Network byte order */
__u32 local_ip4; /* Network byte order */
__u32 local_ip6[4]; /* Network byte order */
__u32 local_port; /* Host byte order */
};
#endif /* _UAPI__LINUX_BPF_H__ */
......@@ -1958,6 +1958,61 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
}
}
/**
* bpf_prog_array_delete_safe_at() - Replaces the program at the given
* index into the program array with
* a dummy no-op program.
* @array: a bpf_prog_array
* @index: the index of the program to replace
*
* Skips over dummy programs, by not counting them, when calculating
* the the position of the program to replace.
*
* Return:
* * 0 - Success
* * -EINVAL - Invalid index value. Must be a non-negative integer.
* * -ENOENT - Index out of range
*/
int bpf_prog_array_delete_safe_at(struct bpf_prog_array *array, int index)
{
return bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
}
/**
* bpf_prog_array_update_at() - Updates the program at the given index
* into the program array.
* @array: a bpf_prog_array
* @index: the index of the program to update
* @prog: the program to insert into the array
*
* Skips over dummy programs, by not counting them, when calculating
* the position of the program to update.
*
* Return:
* * 0 - Success
* * -EINVAL - Invalid index value. Must be a non-negative integer.
* * -ENOENT - Index out of range
*/
int bpf_prog_array_update_at(struct bpf_prog_array *array, int index,
struct bpf_prog *prog)
{
struct bpf_prog_array_item *item;
if (unlikely(index < 0))
return -EINVAL;
for (item = array->items; item->prog; item++) {
if (item->prog == &dummy_bpf_prog.prog)
continue;
if (!index) {
WRITE_ONCE(item->prog, prog);
return 0;
}
index--;
}
return -ENOENT;
}
int bpf_prog_array_copy(struct bpf_prog_array *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
......
......@@ -25,6 +25,28 @@ struct bpf_netns_link {
/* Protects updates to netns_bpf */
DEFINE_MUTEX(netns_bpf_mutex);
static void netns_bpf_attach_type_unneed(enum netns_bpf_attach_type type)
{
switch (type) {
case NETNS_BPF_SK_LOOKUP:
static_branch_dec(&bpf_sk_lookup_enabled);
break;
default:
break;
}
}
static void netns_bpf_attach_type_need(enum netns_bpf_attach_type type)
{
switch (type) {
case NETNS_BPF_SK_LOOKUP:
static_branch_inc(&bpf_sk_lookup_enabled);
break;
default:
break;
}
}
/* Must be called with netns_bpf_mutex held. */
static void netns_bpf_run_array_detach(struct net *net,
enum netns_bpf_attach_type type)
......@@ -36,12 +58,50 @@ static void netns_bpf_run_array_detach(struct net *net,
bpf_prog_array_free(run_array);
}
static int link_index(struct net *net, enum netns_bpf_attach_type type,
struct bpf_netns_link *link)
{
struct bpf_netns_link *pos;
int i = 0;
list_for_each_entry(pos, &net->bpf.links[type], node) {
if (pos == link)
return i;
i++;
}
return -ENOENT;
}
static int link_count(struct net *net, enum netns_bpf_attach_type type)
{
struct list_head *pos;
int i = 0;
list_for_each(pos, &net->bpf.links[type])
i++;
return i;
}
static void fill_prog_array(struct net *net, enum netns_bpf_attach_type type,
struct bpf_prog_array *prog_array)
{
struct bpf_netns_link *pos;
unsigned int i = 0;
list_for_each_entry(pos, &net->bpf.links[type], node) {
prog_array->items[i].prog = pos->link.prog;
i++;
}
}
static void bpf_netns_link_release(struct bpf_link *link)
{
struct bpf_netns_link *net_link =
container_of(link, struct bpf_netns_link, link);
enum netns_bpf_attach_type type = net_link->netns_type;
struct bpf_prog_array *old_array, *new_array;
struct net *net;
int cnt, idx;
mutex_lock(&netns_bpf_mutex);
......@@ -53,9 +113,30 @@ static void bpf_netns_link_release(struct bpf_link *link)
if (!net)
goto out_unlock;
netns_bpf_run_array_detach(net, type);
/* Mark attach point as unused */
netns_bpf_attach_type_unneed(type);
/* Remember link position in case of safe delete */
idx = link_index(net, type, net_link);
list_del(&net_link->node);
cnt = link_count(net, type);
if (!cnt) {
netns_bpf_run_array_detach(net, type);
goto out_unlock;
}
old_array = rcu_dereference_protected(net->bpf.run_array[type],
lockdep_is_held(&netns_bpf_mutex));
new_array = bpf_prog_array_alloc(cnt, GFP_KERNEL);
if (!new_array) {
WARN_ON(bpf_prog_array_delete_safe_at(old_array, idx));
goto out_unlock;
}
fill_prog_array(net, type, new_array);
rcu_assign_pointer(net->bpf.run_array[type], new_array);
bpf_prog_array_free(old_array);
out_unlock:
mutex_unlock(&netns_bpf_mutex);
}
......@@ -77,7 +158,7 @@ static int bpf_netns_link_update_prog(struct bpf_link *link,
enum netns_bpf_attach_type type = net_link->netns_type;
struct bpf_prog_array *run_array;
struct net *net;
int ret = 0;
int idx, ret;
if (old_prog && old_prog != link->prog)
return -EPERM;
......@@ -95,7 +176,10 @@ static int bpf_netns_link_update_prog(struct bpf_link *link,
run_array = rcu_dereference_protected(net->bpf.run_array[type],
lockdep_is_held(&netns_bpf_mutex));
WRITE_ONCE(run_array->items[0].prog, new_prog);
idx = link_index(net, type, net_link);
ret = bpf_prog_array_update_at(run_array, idx, new_prog);
if (ret)
goto out_unlock;
old_prog = xchg(&link->prog, new_prog);
bpf_prog_put(old_prog);
......@@ -309,18 +393,30 @@ int netns_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
return ret;
}
static int netns_bpf_max_progs(enum netns_bpf_attach_type type)
{
switch (type) {
case NETNS_BPF_FLOW_DISSECTOR:
return 1;
case NETNS_BPF_SK_LOOKUP:
return 64;
default:
return 0;
}
}
static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
enum netns_bpf_attach_type type)
{
struct bpf_netns_link *net_link =
container_of(link, struct bpf_netns_link, link);
struct bpf_prog_array *run_array;
int err;
int cnt, err;
mutex_lock(&netns_bpf_mutex);
/* Allow attaching only one prog or link for now */
if (!list_empty(&net->bpf.links[type])) {
cnt = link_count(net, type);
if (cnt >= netns_bpf_max_progs(type)) {
err = -E2BIG;
goto out_unlock;
}
......@@ -334,6 +430,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
case NETNS_BPF_FLOW_DISSECTOR:
err = flow_dissector_bpf_prog_attach_check(net, link->prog);
break;
case NETNS_BPF_SK_LOOKUP:
err = 0; /* nothing to check */
break;
default:
err = -EINVAL;
break;
......@@ -341,16 +440,22 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
if (err)
goto out_unlock;
run_array = bpf_prog_array_alloc(1, GFP_KERNEL);
run_array = bpf_prog_array_alloc(cnt + 1, GFP_KERNEL);
if (!run_array) {
err = -ENOMEM;
goto out_unlock;
}
run_array->items[0].prog = link->prog;
rcu_assign_pointer(net->bpf.run_array[type], run_array);
list_add_tail(&net_link->node, &net->bpf.links[type]);
fill_prog_array(net, type, run_array);
run_array = rcu_replace_pointer(net->bpf.run_array[type], run_array,
lockdep_is_held(&netns_bpf_mutex));
bpf_prog_array_free(run_array);
/* Mark attach point as used */
netns_bpf_attach_type_need(type);
out_unlock:
mutex_unlock(&netns_bpf_mutex);
return err;
......@@ -426,8 +531,10 @@ static void __net_exit netns_bpf_pernet_pre_exit(struct net *net)
mutex_lock(&netns_bpf_mutex);
for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) {
netns_bpf_run_array_detach(net, type);
list_for_each_entry(net_link, &net->bpf.links[type], node)
list_for_each_entry(net_link, &net->bpf.links[type], node) {
net_link->net = NULL; /* auto-detach link */
netns_bpf_attach_type_unneed(type);
}
if (net->bpf.progs[type])
bpf_prog_put(net->bpf.progs[type]);
}
......
......@@ -2022,6 +2022,10 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
default:
return -EINVAL;
}
case BPF_PROG_TYPE_SK_LOOKUP:
if (expected_attach_type == BPF_SK_LOOKUP)
return 0;
return -EINVAL;
case BPF_PROG_TYPE_EXT:
if (expected_attach_type)
return -EINVAL;
......@@ -2756,6 +2760,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
case BPF_PROG_TYPE_CGROUP_SOCK:
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
case BPF_PROG_TYPE_SK_LOOKUP:
return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
case BPF_PROG_TYPE_CGROUP_SKB:
if (!capable(CAP_NET_ADMIN))
......@@ -2817,6 +2822,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
return BPF_PROG_TYPE_CGROUP_SOCKOPT;
case BPF_TRACE_ITER:
return BPF_PROG_TYPE_TRACING;
case BPF_SK_LOOKUP:
return BPF_PROG_TYPE_SK_LOOKUP;
default:
return BPF_PROG_TYPE_UNSPEC;
}
......@@ -2953,6 +2960,7 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_LIRC_MODE2:
return lirc_prog_query(attr, uattr);
case BPF_FLOW_DISSECTOR:
case BPF_SK_LOOKUP:
return netns_bpf_prog_query(attr, uattr);
default:
return -EINVAL;
......@@ -3891,6 +3899,7 @@ static int link_create(union bpf_attr *attr)
ret = tracing_bpf_link_attach(attr, prog);
break;
case BPF_PROG_TYPE_FLOW_DISSECTOR:
case BPF_PROG_TYPE_SK_LOOKUP:
ret = netns_bpf_link_create(attr, prog);
break;
default:
......
......@@ -3878,10 +3878,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
}
meta->ref_obj_id = reg->ref_obj_id;
}
} else if (arg_type == ARG_PTR_TO_SOCKET) {
} else if (arg_type == ARG_PTR_TO_SOCKET ||
arg_type == ARG_PTR_TO_SOCKET_OR_NULL) {
expected_type = PTR_TO_SOCKET;
if (type != expected_type)
goto err_type;
if (!(register_is_null(reg) &&
arg_type == ARG_PTR_TO_SOCKET_OR_NULL)) {
if (type != expected_type)
goto err_type;
}
} else if (arg_type == ARG_PTR_TO_BTF_ID) {
expected_type = PTR_TO_BTF_ID;
if (type != expected_type)
......@@ -7354,6 +7358,9 @@ static int check_return_code(struct bpf_verifier_env *env)
return -ENOTSUPP;
}
break;
case BPF_PROG_TYPE_SK_LOOKUP:
range = tnum_range(SK_DROP, SK_PASS);
break;
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
......
......@@ -9229,6 +9229,189 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
const struct bpf_prog_ops sk_reuseport_prog_ops = {
};
DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);
EXPORT_SYMBOL(bpf_sk_lookup_enabled);
BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
struct sock *, sk, u64, flags)
{
if (unlikely(flags & ~(BPF_SK_LOOKUP_F_REPLACE |
BPF_SK_LOOKUP_F_NO_REUSEPORT)))
return -EINVAL;
if (unlikely(sk && sk_is_refcounted(sk)))
return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
return -ESOCKTNOSUPPORT; /* reject connected sockets */
/* Check if socket is suitable for packet L3/L4 protocol */
if (sk && sk->sk_protocol != ctx->protocol)
return -EPROTOTYPE;
if (sk && sk->sk_family != ctx->family &&
(sk->sk_family == AF_INET || ipv6_only_sock(sk)))
return -EAFNOSUPPORT;
if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
return -EEXIST;
/* Select socket as lookup result */
ctx->selected_sk = sk;
ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
return 0;
}
static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
.func = bpf_sk_lookup_assign,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_SOCKET_OR_NULL,
.arg3_type = ARG_ANYTHING,
};
static const struct bpf_func_proto *
sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_perf_event_output:
return &bpf_event_output_data_proto;
case BPF_FUNC_sk_assign:
return &bpf_sk_lookup_assign_proto;
case BPF_FUNC_sk_release:
return &bpf_sk_release_proto;
default:
return bpf_base_func_proto(func_id);
}
}
static bool sk_lookup_is_valid_access(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
return false;
if (off % size != 0)
return false;
if (type != BPF_READ)
return false;
switch (off) {
case offsetof(struct bpf_sk_lookup, sk):
info->reg_type = PTR_TO_SOCKET_OR_NULL;
return size == sizeof(__u64);
case bpf_ctx_range(struct bpf_sk_lookup, family):
case bpf_ctx_range(struct bpf_sk_lookup, protocol):
case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
case bpf_ctx_range(struct bpf_sk_lookup, local_port):
bpf_ctx_record_field_size(info, sizeof(__u32));
return bpf_ctx_narrow_access_ok(off, size, sizeof(__u32));
default:
return false;
}
}
static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
struct bpf_prog *prog,
u32 *target_size)
{
struct bpf_insn *insn = insn_buf;
switch (si->off) {
case offsetof(struct bpf_sk_lookup, sk):
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
offsetof(struct bpf_sk_lookup_kern, selected_sk));
break;
case offsetof(struct bpf_sk_lookup, family):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
family, 2, target_size));
break;
case offsetof(struct bpf_sk_lookup, protocol):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
protocol, 2, target_size));
break;
case offsetof(struct bpf_sk_lookup, remote_ip4):
*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
v4.saddr, 4, target_size));
break;
case offsetof(struct bpf_sk_lookup, local_ip4):
*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
v4.daddr, 4, target_size));
break;
case bpf_ctx_range_till(struct bpf_sk_lookup,
remote_ip6[0], remote_ip6[3]): {
#if IS_ENABLED(CONFIG_IPV6)
int off = si->off;
off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
off += bpf_target_off(struct in6_addr, s6_addr32[0], 4, target_size);
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
offsetof(struct bpf_sk_lookup_kern, v6.saddr));
*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
#else
*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
#endif
break;
}
case bpf_ctx_range_till(struct bpf_sk_lookup,
local_ip6[0], local_ip6[3]): {
#if IS_ENABLED(CONFIG_IPV6)
int off = si->off;
off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
off += bpf_target_off(struct in6_addr, s6_addr32[0], 4, target_size);
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
offsetof(struct bpf_sk_lookup_kern, v6.daddr));
*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
#else
*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
#endif
break;
}
case offsetof(struct bpf_sk_lookup, remote_port):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
sport, 2, target_size));
break;
case offsetof(struct bpf_sk_lookup, local_port):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
dport, 2, target_size));
break;
}
return insn - insn_buf;
}
const struct bpf_prog_ops sk_lookup_prog_ops = {
};
const struct bpf_verifier_ops sk_lookup_verifier_ops = {
.get_func_proto = sk_lookup_func_proto,
.is_valid_access = sk_lookup_is_valid_access,
.convert_ctx_access = sk_lookup_convert_ctx_access,
};
#endif /* CONFIG_INET */
DEFINE_BPF_DISPATCHER(xdp)
......
......@@ -246,6 +246,21 @@ static inline int compute_score(struct sock *sk, struct net *net,
return score;
}
static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
__be32 saddr, __be16 sport,
__be32 daddr, unsigned short hnum)
{
struct sock *reuse_sk = NULL;
u32 phash;
if (sk->sk_reuseport) {
phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
}
return reuse_sk;
}
/*
* Here are some nice properties to exploit here. The BSD API
* does not allow a listening sock to specify the remote port nor the
......@@ -265,21 +280,17 @@ static struct sock *inet_lhash2_lookup(struct net *net,
struct inet_connection_sock *icsk;
struct sock *sk, *result = NULL;
int score, hiscore = 0;
u32 phash = 0;
inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
sk = (struct sock *)icsk;
score = compute_score(sk, net, hnum, daddr,
dif, sdif, exact_dif);
if (score > hiscore) {
if (sk->sk_reuseport) {
phash = inet_ehashfn(net, daddr, hnum,
saddr, sport);
result = reuseport_select_sock(sk, phash,
skb, doff);
if (result)
return result;
}
result = lookup_reuseport(net, sk, skb, doff,
saddr, sport, daddr, hnum);
if (result)
return result;
result = sk;
hiscore = score;
}
......@@ -288,6 +299,29 @@ static struct sock *inet_lhash2_lookup(struct net *net,
return result;
}
static inline struct sock *inet_lookup_run_bpf(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
__be32 saddr, __be16 sport,
__be32 daddr, u16 hnum)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;
if (hashinfo != &tcp_hashinfo)
return NULL; /* only TCP is supported */
no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP,
saddr, sport, daddr, hnum, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;
reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
......@@ -299,6 +333,14 @@ struct sock *__inet_lookup_listener(struct net *net,
struct sock *result = NULL;
unsigned int hash2;
/* Lookup redirect from BPF */
if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
result = inet_lookup_run_bpf(net, hashinfo, skb, doff,
saddr, sport, daddr, hnum);
if (result)
goto done;
}
hash2 = ipv4_portaddr_hash(net, daddr, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
......
......@@ -408,6 +408,25 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
udp_ehash_secret + net_hash_mix(net));
}
static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb,
__be32 saddr, __be16 sport,
__be32 daddr, unsigned short hnum)
{
struct sock *reuse_sk = NULL;
u32 hash;
if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
/* Fall back to scoring if group has connections */
if (reuseport_has_conns(sk, false))
return NULL;
}
return reuse_sk;
}
/* called with rcu_read_lock() */
static struct sock *udp4_lib_lookup2(struct net *net,
__be32 saddr, __be16 sport,
......@@ -418,7 +437,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
{
struct sock *sk, *result;
int score, badness;
u32 hash = 0;
result = NULL;
badness = 0;
......@@ -426,15 +444,11 @@ static struct sock *udp4_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
if (sk->sk_reuseport &&
sk->sk_state != TCP_ESTABLISHED) {
hash = udp_ehashfn(net, daddr, hnum,
saddr, sport);
result = reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
if (result && !reuseport_has_conns(sk, false))
return result;
}
result = lookup_reuseport(net, sk, skb,
saddr, sport, daddr, hnum);
if (result)
return result;
badness = score;
result = sk;
}
......@@ -442,6 +456,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
return result;
}
static inline struct sock *udp4_lookup_run_bpf(struct net *net,
struct udp_table *udptable,
struct sk_buff *skb,
__be32 saddr, __be16 sport,
__be32 daddr, u16 hnum)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;
if (udptable != &udp_table)
return NULL; /* only UDP is supported */
no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP,
saddr, sport, daddr, hnum, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;
reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
/* UDP is nearly always wildcards out the wazoo, it makes no sense to try
* harder than this. -DaveM
*/
......@@ -449,27 +486,45 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
__be16 sport, __be32 daddr, __be16 dport, int dif,
int sdif, struct udp_table *udptable, struct sk_buff *skb)
{
struct sock *result;
unsigned short hnum = ntohs(dport);
unsigned int hash2, slot2;
struct udp_hslot *hslot2;
struct sock *result, *sk;
hash2 = ipv4_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
/* Lookup connected or non-wildcard socket */
result = udp4_lib_lookup2(net, saddr, sport,
daddr, hnum, dif, sdif,
hslot2, skb);
if (!result) {
hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
result = udp4_lib_lookup2(net, saddr, sport,
htonl(INADDR_ANY), hnum, dif, sdif,
hslot2, skb);
if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
goto done;
/* Lookup redirect from BPF */
if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
sk = udp4_lookup_run_bpf(net, udptable, skb,
saddr, sport, daddr, hnum);
if (sk) {
result = sk;
goto done;
}
}
/* Got non-wildcard socket or error on first lookup */
if (result)
goto done;
/* Lookup wildcard sockets */
hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
result = udp4_lib_lookup2(net, saddr, sport,
htonl(INADDR_ANY), hnum, dif, sdif,
hslot2, skb);
done:
if (IS_ERR(result))
return NULL;
return result;
......
......@@ -21,6 +21,8 @@
#include <net/ip.h>
#include <net/sock_reuseport.h>
extern struct inet_hashinfo tcp_hashinfo;
u32 inet6_ehashfn(const struct net *net,
const struct in6_addr *laddr, const u16 lport,
const struct in6_addr *faddr, const __be16 fport)
......@@ -111,6 +113,23 @@ static inline int compute_score(struct sock *sk, struct net *net,
return score;
}
static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
const struct in6_addr *saddr,
__be16 sport,
const struct in6_addr *daddr,
unsigned short hnum)
{
struct sock *reuse_sk = NULL;
u32 phash;
if (sk->sk_reuseport) {
phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
}
return reuse_sk;
}
/* called with rcu_read_lock() */
static struct sock *inet6_lhash2_lookup(struct net *net,
struct inet_listen_hashbucket *ilb2,
......@@ -123,21 +142,17 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
struct inet_connection_sock *icsk;
struct sock *sk, *result = NULL;
int score, hiscore = 0;
u32 phash = 0;
inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
sk = (struct sock *)icsk;
score = compute_score(sk, net, hnum, daddr, dif, sdif,
exact_dif);
if (score > hiscore) {
if (sk->sk_reuseport) {
phash = inet6_ehashfn(net, daddr, hnum,
saddr, sport);
result = reuseport_select_sock(sk, phash,
skb, doff);
if (result)
return result;
}
result = lookup_reuseport(net, sk, skb, doff,
saddr, sport, daddr, hnum);
if (result)
return result;
result = sk;
hiscore = score;
}
......@@ -146,6 +161,31 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
return result;
}
static inline struct sock *inet6_lookup_run_bpf(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
const struct in6_addr *saddr,
const __be16 sport,
const struct in6_addr *daddr,
const u16 hnum)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;
if (hashinfo != &tcp_hashinfo)
return NULL; /* only TCP is supported */
no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP,
saddr, sport, daddr, hnum, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;
reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
......@@ -157,6 +197,14 @@ struct sock *inet6_lookup_listener(struct net *net,
struct sock *result = NULL;
unsigned int hash2;
/* Lookup redirect from BPF */
if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
result = inet6_lookup_run_bpf(net, hashinfo, skb, doff,
saddr, sport, daddr, hnum);
if (result)
goto done;
}
hash2 = ipv6_portaddr_hash(net, daddr, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
......
......@@ -141,6 +141,27 @@ static int compute_score(struct sock *sk, struct net *net,
return score;
}
static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb,
const struct in6_addr *saddr,
__be16 sport,
const struct in6_addr *daddr,
unsigned int hnum)
{
struct sock *reuse_sk = NULL;
u32 hash;
if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
/* Fall back to scoring if group has connections */
if (reuseport_has_conns(sk, false))
return NULL;
}
return reuse_sk;
}
/* called with rcu_read_lock() */
static struct sock *udp6_lib_lookup2(struct net *net,
const struct in6_addr *saddr, __be16 sport,
......@@ -150,7 +171,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
{
struct sock *sk, *result;
int score, badness;
u32 hash = 0;
result = NULL;
badness = -1;
......@@ -158,16 +178,11 @@ static struct sock *udp6_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
if (sk->sk_reuseport &&
sk->sk_state != TCP_ESTABLISHED) {
hash = udp6_ehashfn(net, daddr, hnum,
saddr, sport);
result = reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
if (result && !reuseport_has_conns(sk, false))
return result;
}
result = lookup_reuseport(net, sk, skb,
saddr, sport, daddr, hnum);
if (result)
return result;
result = sk;
badness = score;
}
......@@ -175,6 +190,31 @@ static struct sock *udp6_lib_lookup2(struct net *net,
return result;
}
static inline struct sock *udp6_lookup_run_bpf(struct net *net,
struct udp_table *udptable,
struct sk_buff *skb,
const struct in6_addr *saddr,
__be16 sport,
const struct in6_addr *daddr,
u16 hnum)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;
if (udptable != &udp_table)
return NULL; /* only UDP is supported */
no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP,
saddr, sport, daddr, hnum, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;
reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
/* rcu_read_lock() must be held */
struct sock *__udp6_lib_lookup(struct net *net,
const struct in6_addr *saddr, __be16 sport,
......@@ -185,25 +225,42 @@ struct sock *__udp6_lib_lookup(struct net *net,
unsigned short hnum = ntohs(dport);
unsigned int hash2, slot2;
struct udp_hslot *hslot2;
struct sock *result;
struct sock *result, *sk;
hash2 = ipv6_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
/* Lookup connected or non-wildcard sockets */
result = udp6_lib_lookup2(net, saddr, sport,
daddr, hnum, dif, sdif,
hslot2, skb);
if (!result) {
hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
slot2 = hash2 & udptable->mask;
if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
goto done;
/* Lookup redirect from BPF */
if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
sk = udp6_lookup_run_bpf(net, udptable, skb,
saddr, sport, daddr, hnum);
if (sk) {
result = sk;
goto done;
}
}
hslot2 = &udptable->hash2[slot2];
/* Got non-wildcard socket or error on first lookup */
if (result)
goto done;
result = udp6_lib_lookup2(net, saddr, sport,
&in6addr_any, hnum, dif, sdif,
hslot2, skb);
}
/* Lookup wildcard sockets */
hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
result = udp6_lib_lookup2(net, saddr, sport,
&in6addr_any, hnum, dif, sdif,
hslot2, skb);
done:
if (IS_ERR(result))
return NULL;
return result;
......
......@@ -404,6 +404,7 @@ class PrinterHelpers(Printer):
type_fwds = [
'struct bpf_fib_lookup',
'struct bpf_sk_lookup',
'struct bpf_perf_event_data',
'struct bpf_perf_event_value',
'struct bpf_pidns_info',
......@@ -450,6 +451,7 @@ class PrinterHelpers(Printer):
'struct bpf_perf_event_data',
'struct bpf_perf_event_value',
'struct bpf_pidns_info',
'struct bpf_sk_lookup',
'struct bpf_sock',
'struct bpf_sock_addr',
'struct bpf_sock_ops',
......@@ -487,6 +489,11 @@ class PrinterHelpers(Printer):
'struct sk_msg_buff': 'struct sk_msg_md',
'struct xdp_buff': 'struct xdp_md',
}
# Helpers overloaded for different context types.
overloaded_helpers = [
'bpf_get_socket_cookie',
'bpf_sk_assign',
]
def print_header(self):
header = '''\
......@@ -543,7 +550,7 @@ class PrinterHelpers(Printer):
for i, a in enumerate(proto['args']):
t = a['type']
n = a['name']
if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
if proto['name'] in self.overloaded_helpers and i == 0:
t = 'void'
n = 'ctx'
one_arg = '{}{}'.format(comma, self.map_type(t))
......
......@@ -45,7 +45,7 @@ PROG COMMANDS
| **cgroup/getsockname4** | **cgroup/getsockname6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** |
| **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl** |
| **cgroup/getsockopt** | **cgroup/setsockopt** |
| **struct_ops** | **fentry** | **fexit** | **freplace**
| **struct_ops** | **fentry** | **fexit** | **freplace** | **sk_lookup**
| }
| *ATTACH_TYPE* := {
| **msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector**
......
......@@ -479,7 +479,7 @@ _bpftool()
cgroup/post_bind4 cgroup/post_bind6 \
cgroup/sysctl cgroup/getsockopt \
cgroup/setsockopt struct_ops \
fentry fexit freplace" -- \
fentry fexit freplace sk_lookup" -- \
"$cur" ) )
return 0
;;
......
......@@ -64,6 +64,7 @@ const char * const attach_type_name[__MAX_BPF_ATTACH_TYPE] = {
[BPF_TRACE_FEXIT] = "fexit",
[BPF_MODIFY_RETURN] = "mod_ret",
[BPF_LSM_MAC] = "lsm_mac",
[BPF_SK_LOOKUP] = "sk_lookup",
};
void p_err(const char *fmt, ...)
......
......@@ -59,6 +59,7 @@ const char * const prog_type_name[] = {
[BPF_PROG_TYPE_TRACING] = "tracing",
[BPF_PROG_TYPE_STRUCT_OPS] = "struct_ops",
[BPF_PROG_TYPE_EXT] = "ext",
[BPF_PROG_TYPE_SK_LOOKUP] = "sk_lookup",
};
const size_t prog_type_name_size = ARRAY_SIZE(prog_type_name);
......@@ -1905,7 +1906,7 @@ static int do_help(int argc, char **argv)
" cgroup/getsockname4 | cgroup/getsockname6 | cgroup/sendmsg4 |\n"
" cgroup/sendmsg6 | cgroup/recvmsg4 | cgroup/recvmsg6 |\n"
" cgroup/getsockopt | cgroup/setsockopt |\n"
" struct_ops | fentry | fexit | freplace }\n"
" struct_ops | fentry | fexit | freplace | sk_lookup }\n"
" ATTACH_TYPE := { msg_verdict | stream_verdict | stream_parser |\n"
" flow_dissector }\n"
" METRIC := { cycles | instructions | l1d_loads | llc_misses }\n"
......
......@@ -189,6 +189,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_STRUCT_OPS,
BPF_PROG_TYPE_EXT,
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
};
enum bpf_attach_type {
......@@ -228,6 +229,7 @@ enum bpf_attach_type {
BPF_XDP_DEVMAP,
BPF_CGROUP_INET_SOCK_RELEASE,
BPF_XDP_CPUMAP,
BPF_SK_LOOKUP,
__MAX_BPF_ATTACH_TYPE
};
......@@ -3069,6 +3071,10 @@ union bpf_attr {
*
* long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
* description applies to **BPF_PROG_TYPE_SCHED_CLS** and
* **BPF_PROG_TYPE_SCHED_ACT** programs.
*
* Assign the *sk* to the *skb*. When combined with appropriate
* routing configuration to receive the packet towards the socket,
* will cause *skb* to be delivered to the specified socket.
......@@ -3094,6 +3100,56 @@ union bpf_attr {
* **-ESOCKTNOSUPPORT** if the socket type is not supported
* (reuseport).
*
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
* description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
*
* Select the *sk* as a result of a socket lookup.
*
* For the operation to succeed passed socket must be compatible
* with the packet description provided by the *ctx* object.
*
* L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
* be an exact match. While IP family (**AF_INET** or
* **AF_INET6**) must be compatible, that is IPv6 sockets
* that are not v6-only can be selected for IPv4 packets.
*
* Only TCP listeners and UDP unconnected sockets can be
* selected. *sk* can also be NULL to reset any previous
* selection.
*
* *flags* argument can combination of following values:
*
* * **BPF_SK_LOOKUP_F_REPLACE** to override the previous
* socket selection, potentially done by a BPF program
* that ran before us.
*
* * **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
* load-balancing within reuseport group for the socket
* being selected.
*
* On success *ctx->sk* will point to the selected socket.
*
* Return
* 0 on success, or a negative errno in case of failure.
*
* * **-EAFNOSUPPORT** if socket family (*sk->family*) is
* not compatible with packet family (*ctx->family*).
*
* * **-EEXIST** if socket has been already selected,
* potentially by another program, and
* **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
*
* * **-EINVAL** if unsupported flags were specified.
*
* * **-EPROTOTYPE** if socket L4 protocol
* (*sk->protocol*) doesn't match packet protocol
* (*ctx->protocol*).
*
* * **-ESOCKTNOSUPPORT** if socket is not in allowed
* state (TCP listening or UDP unconnected).
*
* u64 bpf_ktime_get_boot_ns(void)
* Description
* Return the time elapsed since system boot, in nanoseconds.
......@@ -3607,6 +3663,12 @@ enum {
BPF_RINGBUF_HDR_SZ = 8,
};
/* BPF_FUNC_sk_assign flags in bpf_sk_lookup context. */
enum {
BPF_SK_LOOKUP_F_REPLACE = (1ULL << 0),
BPF_SK_LOOKUP_F_NO_REUSEPORT = (1ULL << 1),
};
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
......@@ -4349,4 +4411,19 @@ struct bpf_pidns_info {
__u32 pid;
__u32 tgid;
};
/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
struct bpf_sk_lookup {
__bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */
__u32 family; /* Protocol family (AF_INET, AF_INET6) */
__u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
__u32 remote_ip4; /* Network byte order */
__u32 remote_ip6[4]; /* Network byte order */
__u32 remote_port; /* Network byte order */
__u32 local_ip4; /* Network byte order */
__u32 local_ip6[4]; /* Network byte order */
__u32 local_port; /* Host byte order */
};
#endif /* _UAPI__LINUX_BPF_H__ */
......@@ -6799,6 +6799,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
enum bpf_attach_type
bpf_program__get_expected_attach_type(struct bpf_program *prog)
......@@ -6981,6 +6982,8 @@ static const struct bpf_sec_def section_defs[] = {
BPF_EAPROG_SEC("cgroup/setsockopt", BPF_PROG_TYPE_CGROUP_SOCKOPT,
BPF_CGROUP_SETSOCKOPT),
BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),
BPF_EAPROG_SEC("sk_lookup/", BPF_PROG_TYPE_SK_LOOKUP,
BPF_SK_LOOKUP),
};
#undef BPF_PROG_SEC_IMPL
......
......@@ -350,6 +350,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
......@@ -377,6 +378,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
/*
* No need for __attribute__((packed)), all members of 'bpf_map_def'
......
......@@ -287,6 +287,8 @@ LIBBPF_0.1.0 {
bpf_map__type;
bpf_map__value_size;
bpf_program__autoload;
bpf_program__is_sk_lookup;
bpf_program__set_autoload;
bpf_program__set_sk_lookup;
btf__set_fd;
} LIBBPF_0.0.9;
......@@ -78,6 +78,9 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
xattr.expected_attach_type = BPF_CGROUP_INET4_CONNECT;
break;
case BPF_PROG_TYPE_SK_LOOKUP:
xattr.expected_attach_type = BPF_SK_LOOKUP;
break;
case BPF_PROG_TYPE_KPROBE:
xattr.kern_version = get_kernel_version();
break;
......
......@@ -73,29 +73,8 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
socklen_t len;
int fd;
if (family == AF_INET) {
struct sockaddr_in *sin = (void *)&addr;
sin->sin_family = AF_INET;
sin->sin_port = htons(port);
if (addr_str &&
inet_pton(AF_INET, addr_str, &sin->sin_addr) != 1) {
log_err("inet_pton(AF_INET, %s)", addr_str);
return -1;
}
len = sizeof(*sin);
} else {
struct sockaddr_in6 *sin6 = (void *)&addr;
sin6->sin6_family = AF_INET6;
sin6->sin6_port = htons(port);
if (addr_str &&
inet_pton(AF_INET6, addr_str, &sin6->sin6_addr) != 1) {
log_err("inet_pton(AF_INET6, %s)", addr_str);
return -1;
}
len = sizeof(*sin6);
}
if (make_sockaddr(family, addr_str, port, &addr, &len))
return -1;
fd = socket(family, type, 0);
if (fd < 0) {
......@@ -194,3 +173,36 @@ int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms)
return 0;
}
int make_sockaddr(int family, const char *addr_str, __u16 port,
struct sockaddr_storage *addr, socklen_t *len)
{
if (family == AF_INET) {
struct sockaddr_in *sin = (void *)addr;
sin->sin_family = AF_INET;
sin->sin_port = htons(port);
if (addr_str &&
inet_pton(AF_INET, addr_str, &sin->sin_addr) != 1) {
log_err("inet_pton(AF_INET, %s)", addr_str);
return -1;
}
if (len)
*len = sizeof(*sin);
return 0;
} else if (family == AF_INET6) {
struct sockaddr_in6 *sin6 = (void *)addr;
sin6->sin6_family = AF_INET6;
sin6->sin6_port = htons(port);
if (addr_str &&
inet_pton(AF_INET6, addr_str, &sin6->sin6_addr) != 1) {
log_err("inet_pton(AF_INET6, %s)", addr_str);
return -1;
}
if (len)
*len = sizeof(*sin6);
return 0;
}
return -1;
}
......@@ -37,5 +37,7 @@ int start_server(int family, int type, const char *addr, __u16 port,
int timeout_ms);
int connect_to_fd(int server_fd, int timeout_ms);
int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
int make_sockaddr(int family, const char *addr_str, __u16 port,
struct sockaddr_storage *addr, socklen_t *len);
#endif
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment