Commit 0813a841 authored by Martin KaFai Lau's avatar Martin KaFai Lau Committed by Alexei Starovoitov

bpf: tcp: Allow bpf prog to write and parse TCP header option

[ Note: The TCP changes here is mainly to implement the bpf
  pieces into the bpf_skops_*() functions introduced
  in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt().  The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog.  Different bpf-prog can write its
own option kind.  It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags.  When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option.  There are
details comment on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options.  They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h.  It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later.  Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessary better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN during syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
parent c9985d09
...@@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, ...@@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \ #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \
BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL) BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
* fullsock and its parent fullsock cannot be traced by
* sk_to_full_sk().
*
* e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
* Its listener-sk is not attached to the rsk_listener.
* In this case, the caller holds the listener-sk (unlocked),
* set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
* the listener-sk such that the cgroup-bpf-progs of the
* listener-sk will be run.
*
* Regardless of syncookie mode or not,
* calling bpf_setsockopt on listener-sk will not make sense anyway,
* so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here.
*/
#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk) \
({ \
int __ret = 0; \
if (cgroup_bpf_enabled) \
__ret = __cgroup_bpf_run_filter_sock_ops(sk, \
sock_ops, \
BPF_CGROUP_SOCK_OPS); \
__ret; \
})
#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \ #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \
({ \ ({ \
int __ret = 0; \ int __ret = 0; \
......
...@@ -1241,8 +1241,12 @@ struct bpf_sock_ops_kern { ...@@ -1241,8 +1241,12 @@ struct bpf_sock_ops_kern {
u32 reply; u32 reply;
u32 replylong[4]; u32 replylong[4];
}; };
struct sk_buff *syn_skb;
struct sk_buff *skb;
void *skb_data_end;
u8 op; u8 op;
u8 is_fullsock; u8 is_fullsock;
u8 remaining_opt_len;
u64 temp; /* temp and everything after is not u64 temp; /* temp and everything after is not
* initialized to 0 before calling * initialized to 0 before calling
* the BPF program. New fields that * the BPF program. New fields that
......
...@@ -2235,6 +2235,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, ...@@ -2235,6 +2235,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
struct msghdr *msg, int len, int flags); struct msghdr *msg, int len, int flags);
#endif /* CONFIG_NET_SOCK_MSG */ #endif /* CONFIG_NET_SOCK_MSG */
#ifdef CONFIG_CGROUP_BPF
/* Copy the listen sk's HDR_OPT_CB flags to its child.
*
* During 3-Way-HandShake, the synack is usually sent from
* the listen sk with the HDR_OPT_CB flags set so that
* bpf-prog will be called to write the BPF hdr option.
*
* In fastopen, the child sk is used to send synack instead
* of the listen sk. Thus, inheriting the HDR_OPT_CB flags
* from the listen sk gives the bpf-prog a chance to write
* BPF hdr option in the synack pkt during fastopen.
*
* Both fastopen and non-fastopen child will inherit the
* HDR_OPT_CB flags to keep the bpf-prog having a consistent
* behavior when deciding to clear this cb flags (or not)
* during the PASSIVE_ESTABLISHED_CB.
*
* In the future, other cb flags could be inherited here also.
*/
static inline void bpf_skops_init_child(const struct sock *sk,
struct sock *child)
{
tcp_sk(child)->bpf_sock_ops_cb_flags =
tcp_sk(sk)->bpf_sock_ops_cb_flags &
(BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
}
static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
struct sk_buff *skb,
unsigned int end_offset)
{
skops->skb = skb;
skops->skb_data_end = skb->data + end_offset;
}
#else
static inline void bpf_skops_init_child(const struct sock *sk,
struct sock *child)
{
}
static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
struct sk_buff *skb,
unsigned int end_offset)
{
}
#endif
/* Call BPF_SOCK_OPS program that returns an int. If the return value /* Call BPF_SOCK_OPS program that returns an int. If the return value
* is < 0, then the BPF op failed (for example if the loaded BPF * is < 0, then the BPF op failed (for example if the loaded BPF
* program does not support the chosen operation or there is no BPF * program does not support the chosen operation or there is no BPF
......
This diff is collapsed.
This diff is collapsed.
...@@ -146,6 +146,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) ...@@ -146,6 +146,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG); BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
struct bpf_sock_ops_kern sock_ops;
if (likely(!unknown_opt && !parse_all_opt)) if (likely(!unknown_opt && !parse_all_opt))
return; return;
...@@ -161,12 +162,15 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) ...@@ -161,12 +162,15 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
return; return;
} }
/* BPF prog will have access to the sk and skb. sock_owned_by_me(sk);
*
* The bpf running context preparation and the actual bpf prog memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
* calling will be implemented in a later PATCH together with sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
* other bpf pieces. sock_ops.is_fullsock = 1;
*/ sock_ops.sk = sk;
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
} }
static void bpf_skops_established(struct sock *sk, int bpf_op, static void bpf_skops_established(struct sock *sk, int bpf_op,
...@@ -180,7 +184,9 @@ static void bpf_skops_established(struct sock *sk, int bpf_op, ...@@ -180,7 +184,9 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
sock_ops.op = bpf_op; sock_ops.op = bpf_op;
sock_ops.is_fullsock = 1; sock_ops.is_fullsock = 1;
sock_ops.sk = sk; sock_ops.sk = sk;
/* skb will be passed to the bpf prog in a later patch. */ /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
if (skb)
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
} }
......
...@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, ...@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->fastopen_req = NULL; newtp->fastopen_req = NULL;
RCU_INIT_POINTER(newtp->fastopen_rsk, NULL); RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
bpf_skops_init_child(sk, newsk);
tcp_bpf_clone(sk, newsk); tcp_bpf_clone(sk, newsk);
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
......
...@@ -454,6 +454,18 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts) ...@@ -454,6 +454,18 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
} }
#ifdef CONFIG_CGROUP_BPF #ifdef CONFIG_CGROUP_BPF
static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb,
enum tcp_synack_type synack_type)
{
if (unlikely(!skb))
return BPF_WRITE_HDR_TCP_CURRENT_MSS;
if (unlikely(synack_type == TCP_SYNACK_COOKIE))
return BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
return 0;
}
/* req, syn_skb and synack_type are used when writing synack */ /* req, syn_skb and synack_type are used when writing synack */
static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
struct request_sock *req, struct request_sock *req,
...@@ -462,15 +474,60 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, ...@@ -462,15 +474,60 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
struct tcp_out_options *opts, struct tcp_out_options *opts,
unsigned int *remaining) unsigned int *remaining)
{ {
struct bpf_sock_ops_kern sock_ops;
int err;
if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) || BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
!*remaining) !*remaining)
return; return;
/* The bpf running context preparation and the actual bpf prog /* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
* calling will be implemented in a later PATCH together with
* other bpf pieces. /* init sock_ops */
*/ memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
if (req) {
/* The listen "sk" cannot be passed here because
* it is not locked. It would not make too much
* sense to do bpf_setsockopt(listen_sk) based
* on individual connection request also.
*
* Thus, "req" is passed here and the cgroup-bpf-progs
* of the listen "sk" will be run.
*
* "req" is also used here for fastopen even the "sk" here is
* a fullsock "child" sk. It is to keep the behavior
* consistent between fastopen and non-fastopen on
* the bpf programming side.
*/
sock_ops.sk = (struct sock *)req;
sock_ops.syn_skb = syn_skb;
} else {
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
}
sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
sock_ops.remaining_opt_len = *remaining;
/* tcp_current_mss() does not pass a skb */
if (skb)
bpf_skops_init_skb(&sock_ops, skb, 0);
err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
if (err || sock_ops.remaining_opt_len == *remaining)
return;
opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
/* round up to 4 bytes */
opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
*remaining -= opts->bpf_opt_len;
} }
static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
...@@ -479,13 +536,42 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, ...@@ -479,13 +536,42 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
enum tcp_synack_type synack_type, enum tcp_synack_type synack_type,
struct tcp_out_options *opts) struct tcp_out_options *opts)
{ {
if (likely(!opts->bpf_opt_len)) u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len;
struct bpf_sock_ops_kern sock_ops;
int err;
if (likely(!max_opt_len))
return; return;
/* The bpf running context preparation and the actual bpf prog memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
* calling will be implemented in a later PATCH together with
* other bpf pieces. sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
*/
if (req) {
sock_ops.sk = (struct sock *)req;
sock_ops.syn_skb = syn_skb;
} else {
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
}
sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
sock_ops.remaining_opt_len = max_opt_len;
first_opt_off = tcp_hdrlen(skb) - max_opt_len;
bpf_skops_init_skb(&sock_ops, skb, first_opt_off);
err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
if (err)
nr_written = 0;
else
nr_written = max_opt_len - sock_ops.remaining_opt_len;
if (nr_written < max_opt_len)
memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP,
max_opt_len - nr_written);
} }
#else #else
static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
......
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment