Commit 890f4365 authored by Alexei Starovoitov

Merge branch 'bpf-tcp-header-opts'

Martin KaFai Lau says:

====================
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities for a
faster turnaround time in testing and releasing new congestion
control ideas to production environments.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test new TCP header options
to improve TCP performance.  Another use case is data-centers that
have a more controlled environment and more flexibility in putting
header options on internal-only traffic.

This patch set introduces the necessary BPF logic and APIs to
allow a bpf program to write and parse TCP header options.

There are also some changes to TCP; they mostly provide the needed
sk and skb info to the bpf program so it can make decisions.

Patch 9 is the main patch and has more details on the API and design.

The set includes an example that sends the max delay ack in a
BPF TCP header option; the receiving side can then adjust its
RTO accordingly.  A sketch of the write-side flow follows.
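
A minimal sketch of the write path, not taken from this set: the
helpers and callback ops below are the ones introduced here, while
the option layout, names, and values are illustrative assumptions.

/* Reserve space for one option during BPF_SOCK_OPS_HDR_OPT_LEN_CB
 * and write it during BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct max_delack_opt {		/* hypothetical option layout */
	__u8 kind;		/* TCPOPT_EXP (254) */
	__u8 len;		/* 5 */
	__be16 magic;		/* 0xeB9F */
	__u8 max_delack_ms;	/* example payload */
} __attribute__((packed));

SEC("sockops")
int write_max_delack(struct bpf_sock_ops *skops)
{
	struct max_delack_opt opt = {
		.kind = 254,
		.len = sizeof(opt),
		.magic = bpf_htons(0xeB9F),
		.max_delack_ms = 10,
	};

	switch (skops->op) {
	case BPF_SOCK_OPS_TCP_CONNECT_CB:
	case BPF_SOCK_OPS_TCP_LISTEN_CB:
		/* Ask for the HDR_OPT_LEN/WRITE_HDR_OPT callbacks */
		bpf_sock_ops_cb_flags_set(skops,
					  skops->bpf_sock_ops_cb_flags |
					  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
		break;
	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
		/* Reserve space; the kernel rounds it up to 4 bytes */
		bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
		break;
	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
		/* The helper rejects a duplicated kind with -EEXIST */
		bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
		break;
	}

	return 1;
}

char _license[] SEC("license") = "GPL";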

v5:
- Move some of the comments from the git commit message to the UAPI
  bpf.h in patch 9

- Some variable cleanup in the tests (patch 11).

v4:
- Since bpf-next is currently closed, tag the set with RFC to keep the
  review cadence

- Separate the tcp changes into their own patches (5, 6, 7).  It is a bit
  tricky since most of the tcp changes are there to call out to the bpf prog
  to write and parse the header.  The write and parse callouts were
  modularized into a few bpf_skops_* functions in v3.

  This revision (v4) moves those bpf_skops_* functions into separate
  TCP patches.  However, they are only half implemented there, to
  highlight the changes to the TCP stack, mainly:
    - when the bpf prog will be called in the TCP stack and
    - what information needs to be pumped through the TCP stack to the
      actual bpf prog callsite.

  The bpf_skops_* functions will be fully implemented in patch 9 together
  with other bpf pieces.

- Use struct_size() in patch 1 (Eric)

- Add saw_unknown to struct tcp_options_received in patch 4 (Eric)

v3:
- Add kdoc for tcp_make_synack (Jakub Kicinski)
- Add BPF_WRITE_HDR_TCP_CURRENT_MSS and BPF_WRITE_HDR_TCP_SYNACK_COOKIE
  in bpf.h to give a clearer meaning to sock_ops->args[0] when
  writing header options.
- Rename BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG
  to     BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG

v2:
- Instead of limiting the bpf prog to writing the experimental
  option (kind:254, magic:0xeB9F), this revision allows the bpf prog to
  write any TCP header option through the bpf_store_hdr_opt() helper.
  That allows different bpf-progs to write their own
  options, and the helper guarantees there is no duplication.

- Add a bpf_load_hdr_opt() helper to search for a particular option by
  kind (a hedged sketch follows).  Some of the get_syn logic is
  refactored into bpf_sock_ops_get_syn().
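
  A sketch of the search side, e.g. at BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
  The kind value 0xB9 and the struct are simplified variants of the
  selftest below; len == 0 asks the helper to match any length.

struct my_tcp_opt {		/* illustrative layout */
	__u8 kind;
	__u8 len;
	__u8 data[4];
} __attribute__((packed));

static int find_opt(struct bpf_sock_ops *skops)
{
	struct my_tcp_opt opt = { .kind = 0xB9, .len = 0, };
	int ret;

	ret = bpf_load_hdr_opt(skops, &opt, sizeof(opt), 0);
	if (ret < 0)
		return ret;	/* e.g. -ENOMSG: option not found */

	/* ret is the total option length; opt holds kind/len/data */
	return 0;
}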

- Since the bpf prog is no longer limited to option (254, 0xeB9F),
  the TCP_SKB_CB(skb)->bpf_hdr_opt_off is no longer needed.
  Instead, when there is any option the kernel cannot recognize,
  the bpf prog will be called if the
  BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG is set.
  [ The "unknown_opt" is learned in tcp_parse_options() in patch 4. ]

- Add BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.
  If this flag is set, the bpf-prog will be called
  on every tcp packet received at an established sk.
  It is useful to ensure a previously written header option is
  received by the peer.
  e.g. The latter test uses this on the active side during syncookie
  mode (see the sketch below).
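
  A sketch of flipping this flag at runtime from the prog itself, as
  the selftests do; it is typically cleared again once the expected
  option has been seen, since parsing every packet has a cost.

static void stop_parsing_all(struct bpf_sock_ops *skops)
{
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags &
				  ~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
}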

- The test_tcp_hdr_options.c is adjusted accordingly
  to test writing both experimental and regular TCP header options.

- The test_misc_tcp_hdr_options.c is added mainly to
  test different cases of the new helpers.

- Break up the TCP_BPF_RTO_MIN and TCP_BPF_DELACK_MAX into
  two patches.

- Directly store the tcp_hdrlen in "struct saved_syn" instead of
  going back to the tcp header to obtain it via "th->doff * 4".

- Add a new optval (== 2) for setsockopt(TCP_SAVE_SYN) such
  that it will also store the mac header (patch 9); a hedged
  userspace sketch follows.
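
  A userspace sketch of the new optval, assuming a libc that exposes
  TCP_SAVE_SYN (otherwise the value from linux/tcp.h can be used):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int save_syn_with_mac_hdr(int listen_fd)
{
	int val = 2;	/* 0: disable, 1: enable, 2: start from ether_header */

	/* Read back later with getsockopt(TCP_SAVED_SYN) */
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN,
			  &val, sizeof(val));
}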
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
parents 9c0f8cbd 267cf9fa
@@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)		\
 	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
 
+/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
+ * fullsock and its parent fullsock cannot be traced by
+ * sk_to_full_sk().
+ *
+ * e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
+ * Its listener-sk is not attached to the rsk_listener.
+ * In this case, the caller holds the listener-sk (unlocked),
+ * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
+ * the listener-sk such that the cgroup-bpf-progs of the
+ * listener-sk will be run.
+ *
+ * Regardless of syncookie mode or not,
+ * calling bpf_setsockopt on listener-sk will not make sense anyway,
+ * so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here.
+ */
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk)			\
+({									\
+	int __ret = 0;							\
+	if (cgroup_bpf_enabled)						\
+		__ret = __cgroup_bpf_run_filter_sock_ops(sk,		\
+							 sock_ops,	\
+							 BPF_CGROUP_SOCK_OPS); \
+	__ret;								\
+})
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				\
 ({									\
	int __ret = 0;							\
...
@@ -1236,13 +1236,17 @@ struct bpf_sock_addr_kern {
 struct bpf_sock_ops_kern {
 	struct sock	*sk;
-	u32	op;
 	union {
 		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
-	u32	is_fullsock;
+	struct sk_buff	*syn_skb;
+	struct sk_buff	*skb;
+	void	*skb_data_end;
+	u8	op;
+	u8	is_fullsock;
+	u8	remaining_opt_len;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
...
@@ -92,6 +92,8 @@ struct tcp_options_received {
 		smc_ok : 1,	/* SMC seen on SYN packet */
 		snd_wscale : 4,	/* Window scaling received from sender */
 		rcv_wscale : 4;	/* Window scaling to send to receiver */
+	u8	saw_unknown:1,	/* Received unknown option */
+		unused:7;
 	u8	num_sacks;	/* Number of SACK blocks */
 	u16	user_mss;	/* mss requested by user in ioctl */
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
@@ -237,14 +239,13 @@ struct tcp_sock {
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
-	u8	syn_data:1,	/* SYN includes data */
+	u8	save_syn:2,	/* Save headers of SYN packet */
+		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
 		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
 		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
-		save_syn:1,	/* Save headers of SYN packet */
-		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
-		syn_smc:1;	/* SYN includes SMC */
+		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP */
 	u32	tcp_tx_delay;	/* delay (in usec) added to TX packets */
@@ -391,6 +392,9 @@ struct tcp_sock {
 #if IS_ENABLED(CONFIG_MPTCP)
 	bool	is_mptcp;
 #endif
+#if IS_ENABLED(CONFIG_SMC)
+	bool	syn_smc;	/* SYN includes SMC */
+#endif
 
 #ifdef CONFIG_TCP_MD5SIG
 /* TCP AF-Specific parts; only used by MD5 Signature support so far */
@@ -406,7 +410,7 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock __rcu *fastopen_rsk;
-	u32	*saved_syn;
+	struct saved_syn *saved_syn;
 };
 
 enum tsq_enum {
@@ -484,6 +488,12 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
 	tp->saved_syn = NULL;
 }
 
+static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
+{
+	return saved_syn->mac_hdrlen + saved_syn->network_hdrlen +
+	       saved_syn->tcp_hdrlen;
+}
+
 struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk,
 					       const struct sk_buff *orig_skb);
...
@@ -86,6 +86,8 @@ struct inet_connection_sock {
 	struct timer_list	  icsk_retransmit_timer;
 	struct timer_list	  icsk_delack_timer;
 	__u32			  icsk_rto;
+	__u32			  icsk_rto_min;
+	__u32			  icsk_delack_max;
 	__u32			  icsk_pmtu_cookie;
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
...
@@ -41,6 +41,13 @@ struct request_sock_ops {
 int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
+struct saved_syn {
+	u32 mac_hdrlen;
+	u32 network_hdrlen;
+	u32 tcp_hdrlen;
+	u8 data[];
+};
+
 /* struct request_sock - mini sock to represent a connection request
  */
 struct request_sock {
@@ -60,7 +67,7 @@ struct request_sock {
 	struct timer_list		rsk_timer;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
-	u32				*saved_syn;
+	struct saved_syn		*saved_syn;
 	u32				secid;
 	u32				peer_secid;
 };
...
@@ -394,7 +394,7 @@ void tcp_metrics_init(void);
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
-void tcp_init_transfer(struct sock *sk, int bpf_op);
+void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb);
 __poll_t tcp_poll(struct file *file, struct socket *sock,
 		  struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
@@ -455,7 +455,8 @@ enum tcp_synack_type {
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
 				struct tcp_fastopen_cookie *foc,
-				enum tcp_synack_type synack_type);
+				enum tcp_synack_type synack_type,
+				struct sk_buff *syn_skb);
 int tcp_disconnect(struct sock *sk, int flags);
 void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
@@ -699,7 +700,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 static inline u32 tcp_rto_min(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
-	u32 rto_min = TCP_RTO_MIN;
+	u32 rto_min = inet_csk(sk)->icsk_rto_min;
 
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
@@ -2035,7 +2036,8 @@ struct tcp_request_sock_ops {
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
 			   struct tcp_fastopen_cookie *foc,
-			   enum tcp_synack_type synack_type);
+			   enum tcp_synack_type synack_type,
+			   struct sk_buff *syn_skb);
 };
 extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
@@ -2233,6 +2235,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
+#ifdef CONFIG_CGROUP_BPF
+/* Copy the listen sk's HDR_OPT_CB flags to its child.
+ *
+ * During 3-Way-HandShake, the synack is usually sent from
+ * the listen sk with the HDR_OPT_CB flags set so that
+ * bpf-prog will be called to write the BPF hdr option.
+ *
+ * In fastopen, the child sk is used to send synack instead
+ * of the listen sk. Thus, inheriting the HDR_OPT_CB flags
+ * from the listen sk gives the bpf-prog a chance to write
+ * BPF hdr option in the synack pkt during fastopen.
+ *
+ * Both fastopen and non-fastopen child will inherit the
+ * HDR_OPT_CB flags to keep the bpf-prog having a consistent
+ * behavior when deciding to clear this cb flags (or not)
+ * during the PASSIVE_ESTABLISHED_CB.
+ *
+ * In the future, other cb flags could be inherited here also.
+ */
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+	tcp_sk(child)->bpf_sock_ops_cb_flags =
+		tcp_sk(sk)->bpf_sock_ops_cb_flags &
+		(BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      unsigned int end_offset)
+{
+	skops->skb = skb;
+	skops->skb_data_end = skb->data + end_offset;
+}
+#else
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      unsigned int end_offset)
+{
+}
+#endif
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
...
This diff is collapsed.
This diff is collapsed.
@@ -418,6 +418,8 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
@@ -2685,6 +2687,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	icsk->icsk_probes_out = 0;
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd = TCP_INIT_CWND;
 	tp->snd_cwnd_cnt = 0;
@@ -3207,7 +3211,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case TCP_SAVE_SYN:
-		if (val < 0 || val > 1)
+		/* 0: disable, 1: enable, 2: start from ether_header */
+		if (val < 0 || val > 2)
 			err = -EINVAL;
 		else
 			tp->save_syn = val;
@@ -3788,20 +3793,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		lock_sock(sk);
 		if (tp->saved_syn) {
-			if (len < tp->saved_syn[0]) {
-				if (put_user(tp->saved_syn[0], optlen)) {
+			if (len < tcp_saved_syn_len(tp->saved_syn)) {
+				if (put_user(tcp_saved_syn_len(tp->saved_syn),
+					     optlen)) {
 					release_sock(sk);
 					return -EFAULT;
 				}
 				release_sock(sk);
 				return -EINVAL;
 			}
-			len = tp->saved_syn[0];
+			len = tcp_saved_syn_len(tp->saved_syn);
 			if (put_user(len, optlen)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
-			if (copy_to_user(optval, tp->saved_syn + 1, len)) {
+			if (copy_to_user(optval, tp->saved_syn->data, len)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
...
@@ -295,7 +295,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	refcount_set(&req->rsk_refcnt, 2);
 
 	/* Now finish processing the fastopen child socket. */
-	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, skb);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
...
@@ -138,6 +138,69 @@ void clean_acked_data_flush(void)
 EXPORT_SYMBOL_GPL(clean_acked_data_flush);
 #endif
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
+{
+	bool unknown_opt = tcp_sk(sk)->rx_opt.saw_unknown &&
+			   BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+						  BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
+	bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+						    BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
+	struct bpf_sock_ops_kern sock_ops;
+
+	if (likely(!unknown_opt && !parse_all_opt))
+		return;
+
+	/* The skb will be handled in the
+	 * bpf_skops_established() or
+	 * bpf_skops_write_hdr_opt().
+	 */
+	switch (sk->sk_state) {
+	case TCP_SYN_RECV:
+	case TCP_SYN_SENT:
+	case TCP_LISTEN:
+		return;
+	}
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+
+static void bpf_skops_established(struct sock *sk, int bpf_op,
+				  struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = bpf_op;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+#else
+static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static void bpf_skops_established(struct sock *sk, int bpf_op,
+				  struct sk_buff *skb)
+{
+}
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -3801,7 +3864,7 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie,
 	foc->exp = exp_opt;
 }
 
-static void smc_parse_options(const struct tcphdr *th,
+static bool smc_parse_options(const struct tcphdr *th,
 			      struct tcp_options_received *opt_rx,
 			      const unsigned char *ptr,
 			      int opsize)
@@ -3810,10 +3873,13 @@ static void smc_parse_options(const struct tcphdr *th,
 	if (static_branch_unlikely(&tcp_have_smc)) {
 		if (th->syn && !(opsize & 1) &&
 		    opsize >= TCPOLEN_EXP_SMC_BASE &&
-		    get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC)
+		    get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) {
 			opt_rx->smc_ok = 1;
+			return true;
+		}
 	}
 #endif
+	return false;
 }
 
 /* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
@@ -3874,6 +3940,7 @@ void tcp_parse_options(const struct net *net,
 
 	ptr = (const unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
+	opt_rx->saw_unknown = 0;
 
 	while (length > 0) {
 		int opcode = *ptr++;
@@ -3964,15 +4031,21 @@ void tcp_parse_options(const struct net *net,
 				 */
 				if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE &&
 				    get_unaligned_be16(ptr) ==
-				    TCPOPT_FASTOPEN_MAGIC)
+				    TCPOPT_FASTOPEN_MAGIC) {
 					tcp_parse_fastopen_option(opsize -
 						TCPOLEN_EXP_FASTOPEN_BASE,
 						ptr + 2, th->syn, foc, true);
-				else
-					smc_parse_options(th, opt_rx, ptr,
-							  opsize);
+					break;
+				}
+
+				if (smc_parse_options(th, opt_rx, ptr, opsize))
+					break;
+
+				opt_rx->saw_unknown = 1;
 				break;
 
+			default:
+				opt_rx->saw_unknown = 1;
 			}
 			ptr += opsize-2;
 			length -= opsize;
@@ -5590,6 +5663,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 	}
 
+	bpf_skops_parse_hdr(sk, skb);
+
 	return true;
 
 discard:
@@ -5798,7 +5873,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
-void tcp_init_transfer(struct sock *sk, int bpf_op)
+void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -5819,7 +5894,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 		tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tcp_call_bpf(sk, bpf_op, 0, NULL);
+	bpf_skops_established(sk, bpf_op, skb);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
@@ -5838,7 +5913,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		sk_mark_napi_id(sk, skb);
 	}
 
-	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
+	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, skb);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
 	 * packet.
@@ -6310,7 +6385,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			tcp_try_undo_spurious_syn(sk);
 			tp->retrans_stamp = 0;
-			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB,
+					  skb);
 			WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 		}
 		smp_mb();
@@ -6599,13 +6675,27 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 {
 	if (tcp_sk(sk)->save_syn) {
 		u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
-		u32 *copy;
+		struct saved_syn *saved_syn;
+		u32 mac_hdrlen;
+		void *base;
+
+		if (tcp_sk(sk)->save_syn == 2) {  /* Save full header. */
+			base = skb_mac_header(skb);
+			mac_hdrlen = skb_mac_header_len(skb);
+			len += mac_hdrlen;
+		} else {
+			base = skb_network_header(skb);
+			mac_hdrlen = 0;
+		}
 
-		copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
-		if (copy) {
-			copy[0] = len;
-			memcpy(&copy[1], skb_network_header(skb), len);
-			req->saved_syn = copy;
+		saved_syn = kmalloc(struct_size(saved_syn, data, len),
+				    GFP_ATOMIC);
+		if (saved_syn) {
+			saved_syn->mac_hdrlen = mac_hdrlen;
+			saved_syn->network_hdrlen = skb_network_header_len(skb);
+			saved_syn->tcp_hdrlen = tcp_hdrlen(skb);
+			memcpy(saved_syn->data, base, len);
+			req->saved_syn = saved_syn;
 		}
 	}
 }
@@ -6752,7 +6842,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	}
 	if (fastopen_sk) {
 		af_ops->send_synack(fastopen_sk, dst, &fl, req,
-				    &foc, TCP_SYNACK_FASTOPEN);
+				    &foc, TCP_SYNACK_FASTOPEN, skb);
 		/* Add the child socket directly into the accept queue */
 		if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
 			reqsk_fastopen_remove(fastopen_sk, req, false);
@@ -6770,7 +6860,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 				    tcp_timeout_init((struct sock *)req));
 		af_ops->send_synack(sk, dst, &fl, req, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
-						   TCP_SYNACK_COOKIE);
+						   TCP_SYNACK_COOKIE,
+				    skb);
 		if (want_cookie) {
 			reqsk_free(req);
 			return 0;
...
@@ -965,7 +965,8 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
 			      struct tcp_fastopen_cookie *foc,
-			      enum tcp_synack_type synack_type)
+			      enum tcp_synack_type synack_type,
+			      struct sk_buff *syn_skb)
 {
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct flowi4 fl4;
@@ -976,7 +977,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
 
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
...
@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	bpf_skops_init_child(sk, newsk);
 	tcp_bpf_clone(sk, newsk);
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
...
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 bpf_opt_len;		/* length of BPF hdr option */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -452,6 +453,145 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb,
+					enum tcp_synack_type synack_type)
+{
+	if (unlikely(!skb))
+		return BPF_WRITE_HDR_TCP_CURRENT_MSS;
+
+	if (unlikely(synack_type == TCP_SYNACK_COOKIE))
+		return BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
+
+	return 0;
+}
+
+/* req, syn_skb and synack_type are used when writing synack */
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  enum tcp_synack_type synack_type,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
+	    !*remaining)
+		return;
+
+	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
+
+	/* init sock_ops */
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
+
+	if (req) {
+		/* The listen "sk" cannot be passed here because
+		 * it is not locked. It would not make too much
+		 * sense to do bpf_setsockopt(listen_sk) based
+		 * on individual connection request also.
+		 *
+		 * Thus, "req" is passed here and the cgroup-bpf-progs
+		 * of the listen "sk" will be run.
+		 *
+		 * "req" is also used here for fastopen even the "sk" here is
+		 * a fullsock "child" sk. It is to keep the behavior
+		 * consistent between fastopen and non-fastopen on
+		 * the bpf programming side.
+		 */
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
+	sock_ops.remaining_opt_len = *remaining;
+	/* tcp_current_mss() does not pass a skb */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, 0);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err || sock_ops.remaining_opt_len == *remaining)
+		return;
+
+	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
+	/* round up to 4 bytes */
+	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
+
+	*remaining -= opts->bpf_opt_len;
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    enum tcp_synack_type synack_type,
+				    struct tcp_out_options *opts)
+{
+	u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!max_opt_len))
+		return;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+
+	if (req) {
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
+	sock_ops.remaining_opt_len = max_opt_len;
+	first_opt_off = tcp_hdrlen(skb) - max_opt_len;
+	bpf_skops_init_skb(&sock_ops, skb, first_opt_off);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err)
+		nr_written = 0;
+	else
+		nr_written = max_opt_len - sock_ops.remaining_opt_len;
+
+	if (nr_written < max_opt_len)
+		memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP,
+		       max_opt_len - nr_written);
+}
+#else
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  enum tcp_synack_type synack_type,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    enum tcp_synack_type synack_type,
+				    struct tcp_out_options *opts)
+{
+}
+#endif
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -691,6 +831,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
@@ -701,7 +843,8 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
 				       struct tcp_fastopen_cookie *foc,
-				       enum tcp_synack_type synack_type)
+				       enum tcp_synack_type synack_type,
+				       struct sk_buff *syn_skb)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -758,6 +901,9 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
+	bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
+			      synack_type, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
@@ -826,6 +972,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
+					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+
+		size = MAX_TCP_OPTION_SPACE - remaining;
+	}
+
 	return size;
 }
@@ -1213,6 +1368,9 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
+	/* BPF prog is the last one writing header option */
+	bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, 0, &opts);
+
 	INDIRECT_CALL_INET(icsk->icsk_af_ops->send_check,
 			   tcp_v6_send_check, tcp_v4_send_check,
 			   sk, skb);
@@ -3336,20 +3494,20 @@ int tcp_send_synack(struct sock *sk)
 }
 
 /**
- * tcp_make_synack - Prepare a SYN-ACK.
- * sk: listener socket
- * dst: dst entry attached to the SYNACK
- * req: request_sock pointer
- * foc: cookie for tcp fast open
- * synack_type: Type of synback to prepare
- *
- * Allocate one skb and build a SYNACK packet.
- * @dst is consumed : Caller should not use it again.
+ * tcp_make_synack - Allocate one skb and build a SYNACK packet.
+ * @sk: listener socket
+ * @dst: dst entry attached to the SYNACK. It is consumed and caller
+ *       should not use it again.
+ * @req: request_sock pointer
+ * @foc: cookie for tcp fast open
+ * @synack_type: Type of synack to prepare
+ * @syn_skb: SYN packet just received. It could be NULL for rtx case.
  */
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
 				struct tcp_fastopen_cookie *foc,
-				enum tcp_synack_type synack_type)
+				enum tcp_synack_type synack_type,
+				struct sk_buff *syn_skb)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	const struct tcp_sock *tp = tcp_sk(sk);
@@ -3408,8 +3566,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
+	/* bpf program will be interested in the tcp_flags */
+	TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK;
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc, synack_type) + sizeof(*th);
+					     foc, synack_type,
+					     syn_skb) + sizeof(*th);
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
@@ -3441,6 +3602,9 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	rcu_read_unlock();
 #endif
 
+	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
+				synack_type, &opts);
+
 	skb->skb_mstamp_ns = now;
 	tcp_add_tx_delay(skb, tp);
@@ -3741,6 +3905,8 @@ void tcp_send_delayed_ack(struct sock *sk)
 		ato = min(ato, max_ato);
 	}
 
+	ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max);
+
 	/* Stay within the limit we were given */
 	timeout = jiffies + ato;
@@ -3934,7 +4100,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
 	int res;
 
 	tcp_rsk(req)->txhash = net_tx_rndhash();
-	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL);
+	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL,
+				  NULL);
 	if (!res) {
 		__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
...
@@ -501,7 +501,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
 			      struct tcp_fastopen_cookie *foc,
-			      enum tcp_synack_type synack_type)
+			      enum tcp_synack_type synack_type,
+			      struct sk_buff *syn_skb)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct ipv6_pinfo *np = tcp_inet6_sk(sk);
@@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 					       IPPROTO_TCP)) == NULL)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
 
 	if (skb) {
 		__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
...
This diff is collapsed.
@@ -104,6 +104,43 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	return -1;
 }
 
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen = sizeof(addr);
+	struct sockaddr_in *addr_in;
+	int fd, ret;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) {
+		log_err("Failed to get server addr");
+		return -1;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	fd = socket(addr_in->sin_family, SOCK_STREAM, 0);
+	if (fd < 0) {
+		log_err("Failed to create client socket");
+		return -1;
+	}
+
+	if (settimeo(fd, timeout_ms))
+		goto error_close;
+
+	ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr,
+		     addrlen);
+	if (ret != data_len) {
+		log_err("sendto(data, %u) != %d\n", data_len, ret);
+		goto error_close;
+	}
+
+	return fd;
+
+error_close:
+	save_errno_close(fd);
+	return -1;
+}
+
 static int connect_fd_to_addr(int fd,
 			      const struct sockaddr_storage *addr,
 			      socklen_t addrlen)
...
@@ -37,6 +37,8 @@ int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
 int connect_to_fd(int server_fd, int timeout_ms);
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms);
 int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
...
This diff is collapsed.
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */

#include <stddef.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <linux/socket.h>
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define BPF_PROG_TEST_TCP_HDR_OPTIONS
#include "test_tcp_hdr_options.h"

__u16 last_addr16_n = __bpf_htons(0xeB9F);
__u16 active_lport_n = 0;
__u16 active_lport_h = 0;
__u16 passive_lport_n = 0;
__u16 passive_lport_h = 0;

/* options received at passive side */
unsigned int nr_pure_ack = 0;
unsigned int nr_data = 0;
unsigned int nr_syn = 0;
unsigned int nr_fin = 0;

/* Check the header received from the active side */
static int __check_active_hdr_in(struct bpf_sock_ops *skops, bool check_syn)
{
	union {
		struct tcphdr th;
		struct ipv6hdr ip6;
		struct tcp_exprm_opt exprm_opt;
		struct tcp_opt reg_opt;
		__u8 data[100]; /* IPv6 (40) + Max TCP hdr (60) */
	} hdr = {};
	__u64 load_flags = check_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0;
	struct tcphdr *pth;
	int ret;

	hdr.reg_opt.kind = 0xB9;

	/* The option is 4 bytes long instead of 2 bytes */
	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, 2, load_flags);
	if (ret != -ENOSPC)
		RET_CG_ERR(ret);

	/* Test searching magic with regular kind */
	hdr.reg_opt.len = 4;
	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
			       load_flags);
	if (ret != -EINVAL)
		RET_CG_ERR(ret);

	hdr.reg_opt.len = 0;
	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
			       load_flags);
	if (ret != 4 || hdr.reg_opt.len != 4 || hdr.reg_opt.kind != 0xB9 ||
	    hdr.reg_opt.data[0] != 0xfa || hdr.reg_opt.data[1] != 0xce)
		RET_CG_ERR(ret);

	/* Test searching experimental option with invalid kind length */
	hdr.exprm_opt.kind = TCPOPT_EXP;
	hdr.exprm_opt.len = 5;
	hdr.exprm_opt.magic = 0;
	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
			       load_flags);
	if (ret != -EINVAL)
		RET_CG_ERR(ret);

	/* Test searching experimental option with 0 magic value */
	hdr.exprm_opt.len = 4;
	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
			       load_flags);
	if (ret != -ENOMSG)
		RET_CG_ERR(ret);

	hdr.exprm_opt.magic = __bpf_htons(0xeB9F);
	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
			       load_flags);
	if (ret != 4 || hdr.exprm_opt.len != 4 ||
	    hdr.exprm_opt.kind != TCPOPT_EXP ||
	    hdr.exprm_opt.magic != __bpf_htons(0xeB9F))
		RET_CG_ERR(ret);

	if (!check_syn)
		return CG_OK;

	/* Test loading from skops->syn_skb if sk_state == TCP_NEW_SYN_RECV
	 *
	 * Test loading from tp->saved_syn for other sk_state.
	 */
	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr.ip6,
			     sizeof(hdr.ip6));
	if (ret != -ENOSPC)
		RET_CG_ERR(ret);

	if (hdr.ip6.saddr.s6_addr16[7] != last_addr16_n ||
	    hdr.ip6.daddr.s6_addr16[7] != last_addr16_n)
		RET_CG_ERR(0);

	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr, sizeof(hdr));
	if (ret < 0)
		RET_CG_ERR(ret);

	pth = (struct tcphdr *)(&hdr.ip6 + 1);
	if (pth->dest != passive_lport_n || pth->source != active_lport_n)
		RET_CG_ERR(0);

	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN, &hdr, sizeof(hdr));
	if (ret < 0)
		RET_CG_ERR(ret);

	if (hdr.th.dest != passive_lport_n || hdr.th.source != active_lport_n)
		RET_CG_ERR(0);

	return CG_OK;
}

static int check_active_syn_in(struct bpf_sock_ops *skops)
{
	return __check_active_hdr_in(skops, true);
}

static int check_active_hdr_in(struct bpf_sock_ops *skops)
{
	struct tcphdr *th;

	if (__check_active_hdr_in(skops, false) == CG_ERR)
		return CG_ERR;

	th = skops->skb_data;
	if (th + 1 > skops->skb_data_end)
		RET_CG_ERR(0);

	if (tcp_hdrlen(th) < skops->skb_len)
		nr_data++;

	if (th->fin)
		nr_fin++;

	if (th->ack && !th->fin && tcp_hdrlen(th) == skops->skb_len)
		nr_pure_ack++;

	return CG_OK;
}

static int active_opt_len(struct bpf_sock_ops *skops)
{
	int err;

	/* Reserve more than enough to allow the -EEXIST test in
	 * the write_active_opt().
	 */
	err = bpf_reserve_hdr_opt(skops, 12, 0);
	if (err)
		RET_CG_ERR(err);

	return CG_OK;
}

static int write_active_opt(struct bpf_sock_ops *skops)
{
	struct tcp_exprm_opt exprm_opt = {};
	struct tcp_opt win_scale_opt = {};
	struct tcp_opt reg_opt = {};
	struct tcphdr *th;
	int err, ret;

	exprm_opt.kind = TCPOPT_EXP;
	exprm_opt.len = 4;
	exprm_opt.magic = __bpf_htons(0xeB9F);

	reg_opt.kind = 0xB9;
	reg_opt.len = 4;
	reg_opt.data[0] = 0xfa;
	reg_opt.data[1] = 0xce;

	win_scale_opt.kind = TCPOPT_WINDOW;

	err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
	if (err)
		RET_CG_ERR(err);

	/* Store the same exprm option */
	err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
	if (err != -EEXIST)
		RET_CG_ERR(err);

	err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
	if (err)
		RET_CG_ERR(err);
	err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
	if (err != -EEXIST)
		RET_CG_ERR(err);

	/* Check the option has been written and can be searched */
	ret = bpf_load_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
	if (ret != 4 || exprm_opt.len != 4 || exprm_opt.kind != TCPOPT_EXP ||
	    exprm_opt.magic != __bpf_htons(0xeB9F))
		RET_CG_ERR(ret);

	reg_opt.len = 0;
	ret = bpf_load_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
	if (ret != 4 || reg_opt.len != 4 || reg_opt.kind != 0xB9 ||
	    reg_opt.data[0] != 0xfa || reg_opt.data[1] != 0xce)
		RET_CG_ERR(ret);

	th = skops->skb_data;
	if (th + 1 > skops->skb_data_end)
		RET_CG_ERR(0);

	if (th->syn) {
		active_lport_h = skops->local_port;
		active_lport_n = th->source;

		/* Search the win scale option written by kernel
		 * in the SYN packet.
		 */
		ret = bpf_load_hdr_opt(skops, &win_scale_opt,
				       sizeof(win_scale_opt), 0);
		if (ret != 3 || win_scale_opt.len != 3 ||
		    win_scale_opt.kind != TCPOPT_WINDOW)
			RET_CG_ERR(ret);

		/* Write the win scale option that kernel
		 * has already written.
		 */
		err = bpf_store_hdr_opt(skops, &win_scale_opt,
					sizeof(win_scale_opt), 0);
		if (err != -EEXIST)
			RET_CG_ERR(err);
	}

	return CG_OK;
}

static int handle_hdr_opt_len(struct bpf_sock_ops *skops)
{
	__u8 tcp_flags = skops_tcp_flags(skops);

	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
		/* Check the SYN from bpf_sock_ops_kern->syn_skb */
		return check_active_syn_in(skops);

	/* Passive side should have cleared the write hdr cb by now */
	if (skops->local_port == passive_lport_h)
		RET_CG_ERR(0);

	return active_opt_len(skops);
}

static int handle_write_hdr_opt(struct bpf_sock_ops *skops)
{
	if (skops->local_port == passive_lport_h)
		RET_CG_ERR(0);

	return write_active_opt(skops);
}

static int handle_parse_hdr(struct bpf_sock_ops *skops)
{
	/* Passive side is not writing any non-standard/unknown
	 * option, so the active side should never be called.
	 */
	if (skops->local_port == active_lport_h)
		RET_CG_ERR(0);

	return check_active_hdr_in(skops);
}

static int handle_passive_estab(struct bpf_sock_ops *skops)
{
	int err;

	/* No more write hdr cb */
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags &
				  ~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);

	/* Recheck the SYN but check the tp->saved_syn this time */
	err = check_active_syn_in(skops);
	if (err == CG_ERR)
		return err;

	nr_syn++;

	/* The ack has header option written by the active side also */
	return check_active_hdr_in(skops);
}

SEC("sockops/misc_estab")
int misc_estab(struct bpf_sock_ops *skops)
{
	int true_val = 1;

	switch (skops->op) {
	case BPF_SOCK_OPS_TCP_LISTEN_CB:
		passive_lport_h = skops->local_port;
		passive_lport_n = __bpf_htons(passive_lport_h);
		bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
			       &true_val, sizeof(true_val));
		set_hdr_cb_flags(skops);
		break;
	case BPF_SOCK_OPS_TCP_CONNECT_CB:
		set_hdr_cb_flags(skops);
		break;
	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
		return handle_parse_hdr(skops);
	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
		return handle_hdr_opt_len(skops);
	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
		return handle_write_hdr_opt(skops);
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		return handle_passive_estab(skops);
	}

	return CG_OK;
}

char _license[] SEC("license") = "GPL";
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2020 Facebook */

#ifndef _TEST_TCP_HDR_OPTIONS_H
#define _TEST_TCP_HDR_OPTIONS_H

struct bpf_test_option {
	__u8 flags;
	__u8 max_delack_ms;
	__u8 rand;
} __attribute__((packed));

enum {
	OPTION_RESEND,
	OPTION_MAX_DELACK_MS,
	OPTION_RAND,
	__NR_OPTION_FLAGS,
};

#define OPTION_F_RESEND		(1 << OPTION_RESEND)
#define OPTION_F_MAX_DELACK_MS	(1 << OPTION_MAX_DELACK_MS)
#define OPTION_F_RAND		(1 << OPTION_RAND)
#define OPTION_MASK		((1 << __NR_OPTION_FLAGS) - 1)

#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option)))
#define SET_OPTION_FLAGS(flags, option)	((flags) |= (1 << (option)))

/* Store in bpf_sk_storage */
struct hdr_stg {
	bool active;
	bool resend_syn; /* active side only */
	bool syncookie;	 /* passive side only */
	bool fastopen;	 /* passive side only */
};

struct linum_err {
	unsigned int linum;
	int err;
};

#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK)

#define TCPOPT_EOL		0
#define TCPOPT_NOP		1
#define TCPOPT_WINDOW		3
#define TCPOPT_EXP		254

#define TCP_BPF_EXPOPT_BASE_LEN 4
#define MAX_TCP_HDR_LEN		60
#define MAX_TCP_OPTION_SPACE	40

#ifdef BPF_PROG_TEST_TCP_HDR_OPTIONS

#define CG_OK	1
#define CG_ERR	0

#ifndef SOL_TCP
#define SOL_TCP 6
#endif

struct tcp_exprm_opt {
	__u8 kind;
	__u8 len;
	__u16 magic;
	union {
		__u8 data[4];
		__u32 data32;
	};
} __attribute__((packed));

struct tcp_opt {
	__u8 kind;
	__u8 len;
	union {
		__u8 data[4];
		__u32 data32;
	};
} __attribute__((packed));

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 2);
	__type(key, int);
	__type(value, struct linum_err);
} lport_linum_map SEC(".maps");

static inline unsigned int tcp_hdrlen(const struct tcphdr *th)
{
	return th->doff << 2;
}

static inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops)
{
	return skops->skb_tcp_flags;
}

static inline void clear_hdr_cb_flags(struct bpf_sock_ops *skops)
{
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags &
				  ~(BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
				    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG));
}

static inline void set_hdr_cb_flags(struct bpf_sock_ops *skops)
{
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags |
				  BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
				  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
}

static inline void
clear_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
{
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags &
				  ~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
}

static inline void
set_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
{
	bpf_sock_ops_cb_flags_set(skops,
				  skops->bpf_sock_ops_cb_flags |
				  BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
}

#define RET_CG_ERR(__err) ({					\
	struct linum_err __linum_err;				\
	int __lport;						\
								\
	__linum_err.linum = __LINE__;				\
	__linum_err.err = __err;				\
	__lport = skops->local_port;				\
	bpf_map_update_elem(&lport_linum_map, &__lport, &__linum_err, BPF_NOEXIST); \
	clear_hdr_cb_flags(skops);				\
	clear_parse_all_hdr_cb_flags(skops);			\
	return CG_ERR;						\
})

#endif /* BPF_PROG_TEST_TCP_HDR_OPTIONS */
#endif /* _TEST_TCP_HDR_OPTIONS_H */