Commit 95d1815f authored by Jakub Kicinski

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
   from Li Qiong.

7) A large batch of IPVS updates to replace timer-based estimators by
   kthreads to scale up wrt. CPUs and workload (millions of estimators).

Julian Anastasov says:

	This patchset implements stats estimation in kthread context.
It replaces the code that runs on a single CPU in timer context every 2
seconds and causes latency splats, as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

	Spread the estimation on multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules.  When stats are not needed, it is recommended to use
run_estimation=0 as already implemented before this change.

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest which
hold estimator structures are now always freed from an RCU
callback. This ensures an RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to an array. For now we limit
kthreads depending on the number of CPUs.

- even while there can be a kthread structure, its task
may not be running, e.g. before the first service is added or
while the sysctl var is set to an empty cpulist or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks, which saves memory for
setups with a small number of estimators.

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds.

- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added in a
temp list (est_temp_list) and later kthread 0
distributes them to other kthreads. The optimization
is based on the fact that a newly added estimator
should be estimated after 2 seconds, so we have the
time to offload the adding to chain from the controlling
process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, a common case when all
services and dests are initially configured, we may
spread the estimators over more chains and, as a result,
reduce the initial delay below 2 seconds.
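
The timing budget behind the points above can be checked with a small
userspace calculation. This is only a sketch: the three IPVS_EST_* values
are copied from the ip_vs.h hunk in this commit, while ALIGN_DOWN_POW2(),
est_tick_jiffies() and est_tick_ms() are hypothetical helpers introduced
here for illustration, and the HZ value is passed in as a parameter rather
than taken from any kernel configuration.

```c
#include <assert.h>

/* Values copied from the ip_vs.h hunk in this commit */
#define IPVS_EST_NTICKS        50
#define IPVS_EST_LOAD_DIVISOR  8
#define IPVS_EST_CPU_KTHREADS  (IPVS_EST_LOAD_DIVISOR / 2)

/* Assumption: simplified stand-in for the kernel's ALIGN_DOWN(),
 * valid only when the alignment is a power of two.
 */
#define ALIGN_DOWN_POW2(x, a)  ((x) & ~((a) - 1))

/* 2 sec * 1000 ms * 10 (100us units per ms) / 8 (12.5%) / 50 ticks = 50,
 * aligned down to a multiple of 8
 */
#define IPVS_EST_CHAIN_FACTOR \
	ALIGN_DOWN_POW2(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)

/* Matches the "48=4.8ms of 40ms tick" comment in the header */
static_assert(IPVS_EST_CHAIN_FACTOR == 48, "chain factor should be 48");

/* Hypothetical helper: length of one tick in jiffies for a given HZ */
unsigned long est_tick_jiffies(unsigned long hz)
{
	return (2 * hz) / IPVS_EST_NTICKS;
}

/* Hypothetical helper: length of one tick in milliseconds */
unsigned long est_tick_ms(unsigned long hz)
{
	return est_tick_jiffies(hz) * 1000 / hz;
}
```

With these definitions the 2-second period splits into 50 ticks of 40 ms
each for the HZ values that divide evenly (for example 100, 250 and 1000),
and IPVS_EST_CPU_KTHREADS evaluates to 4 kthreads per allowed CPU.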

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use workqueue infrastructure
because:

- The estimation can take a long time when using multiple IPVS rules (e.g.
  millions of estimator structures) and especially when the box has multiple
  CPUs due to the for_each_possible_cpu usage that expects packets from
  any CPU. With est_nice sysctl we have more control how to prioritize the
  estimation kthreads compared to other processes/kthreads that have
  latency requirements (such as servers). As a benefit, we can see these
  kthreads in top and decide if we will need some further control to limit
  their CPU usage (max number of structures to estimate per kthread).

- with kthreads we run code that is read-mostly, no write/lock
  operations to process the estimators in 2-second intervals.

- work items are one-shot: as estimators are processed every
  2 seconds, they need to be re-added every time. This again
  loads the timers (add_timer) if we use delayed works, as there are
  no kthreads to do the timings.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 15eb1621 144361c1
......@@ -129,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
est_cpulist - CPULIST
Allowed CPUs for estimation kthreads
Syntax: standard cpulist format
empty list - stop kthread tasks and estimation
default - the system's housekeeping CPUs for kthreads
Example:
"all": all possible CPUs
"0-N": all possible CPUs, N denotes last CPU number
"0,1-N:1/2": first and all CPUs with odd number
"": empty list
est_nice - INTEGER
default 0
Valid range: -20 (more favorable) .. 19 (less favorable)
Niceness value to use for the estimation kthreads (scheduling
priority)
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
......@@ -304,8 +324,8 @@ run_estimation - BOOLEAN
0 - disabled
not 0 - enabled (default)
If disabled, the estimation will be stop, and you can't see
any update on speed estimation data.
If disabled, the estimation will be suspended and kthread tasks
stopped.
You can always re-enable estimation by setting this value to 1.
But be careful, the first estimation after re-enable is not
......
......@@ -515,6 +515,16 @@ ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
*skbinfo = ext->skbinfo;
}
static inline void
nf_inet_addr_mask_inplace(union nf_inet_addr *a1,
const union nf_inet_addr *mask)
{
a1->all[0] &= mask->all[0];
a1->all[1] &= mask->all[1];
a1->all[2] &= mask->all[2];
a1->all[3] &= mask->all[3];
}
#define IP_SET_INIT_KEXT(skb, opt, set) \
{ .bytes = (skb)->len, .packets = 1, .target = true,\
.timeout = ip_set_adt_opt_timeout(opt, set) }
......
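The nf_inet_addr_mask_inplace() helper added above applies a bitmask word
by word, so the mask does not have to be a contiguous prefix. The following
userspace sketch illustrates that behaviour. It assumes a simplified
stand-in type, union inet_addr_model, that models only the all[] view of
the kernel's union nf_inet_addr; addr_mask_inplace() and addr_is_any() are
hypothetical names used for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: userspace model of union nf_inet_addr, only the
 * four 32-bit words are represented.
 */
union inet_addr_model {
	uint32_t all[4];
};

/* AND every 32-bit word of the address with the matching mask word */
void addr_mask_inplace(union inet_addr_model *a,
		       const union inet_addr_model *mask)
{
	for (int i = 0; i < 4; i++)
		a->all[i] &= mask->all[i];
}

/* Returns 1 when all words are zero, the case the hash types reject */
int addr_is_any(const union inet_addr_model *a)
{
	return (a->all[0] | a->all[1] | a->all[2] | a->all[3]) == 0;
}
```

The hash:ip and hash:ip,port callers in this commit mask the address first
and then refuse an all-zero result, which is what addr_is_any() models.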
......@@ -29,6 +29,7 @@
#include <net/netfilter/nf_conntrack.h>
#endif
#include <net/net_namespace.h> /* Netw namespace */
#include <linux/sched/isolation.h>
#define IP_VS_HDR_INVERSE 1
#define IP_VS_HDR_ICMP 2
......@@ -42,6 +43,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
/* Connections' size value needed by ip_vs_ctl.c */
extern int ip_vs_conn_tab_size;
extern struct mutex __ip_vs_mutex;
struct ip_vs_iphdr {
int hdr_flags; /* ipvs flags */
__u32 off; /* Where IP or IPv4 header starts */
......@@ -351,11 +354,11 @@ struct ip_vs_seq {
/* counters per cpu */
struct ip_vs_counters {
__u64 conns; /* connections scheduled */
__u64 inpkts; /* incoming packets */
__u64 outpkts; /* outgoing packets */
__u64 inbytes; /* incoming bytes */
__u64 outbytes; /* outgoing bytes */
u64_stats_t conns; /* connections scheduled */
u64_stats_t inpkts; /* incoming packets */
u64_stats_t outpkts; /* outgoing packets */
u64_stats_t inbytes; /* incoming bytes */
u64_stats_t outbytes; /* outgoing bytes */
};
/* Stats per cpu */
struct ip_vs_cpu_stats {
......@@ -363,9 +366,12 @@ struct ip_vs_cpu_stats {
struct u64_stats_sync syncp;
};
/* Default nice for estimator kthreads */
#define IPVS_EST_NICE 0
/* IPVS statistics objects */
struct ip_vs_estimator {
struct list_head list;
struct hlist_node list;
u64 last_inbytes;
u64 last_outbytes;
......@@ -378,6 +384,10 @@ struct ip_vs_estimator {
u64 outpps;
u64 inbps;
u64 outbps;
s32 ktid:16, /* kthread ID, -1=temp list */
ktrow:8, /* row/tick ID for kthread */
ktcid:8; /* chain ID for kthread tick */
};
/*
......@@ -405,6 +415,76 @@ struct ip_vs_stats {
struct ip_vs_kstats kstats0; /* reset values */
};
struct ip_vs_stats_rcu {
struct ip_vs_stats s;
struct rcu_head rcu_head;
};
int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
struct ip_vs_stats *ip_vs_stats_alloc(void);
void ip_vs_stats_release(struct ip_vs_stats *stats);
void ip_vs_stats_free(struct ip_vs_stats *stats);
/* Process estimators in multiple timer ticks (20/50/100, see ktrow) */
#define IPVS_EST_NTICKS 50
/* Estimation uses a 2-second period containing ticks (in jiffies) */
#define IPVS_EST_TICK ((2 * HZ) / IPVS_EST_NTICKS)
/* Limit of CPU load per kthread (8 for 12.5%), ratio of CPU capacity (1/C).
* Value of 4 and above ensures kthreads will take work without exceeding
* the CPU capacity under different circumstances.
*/
#define IPVS_EST_LOAD_DIVISOR 8
/* Kthreads should not have work that exceeds the CPU load above 50% */
#define IPVS_EST_CPU_KTHREADS (IPVS_EST_LOAD_DIVISOR / 2)
/* Desired number of chains per timer tick (chain load factor in 100us units),
* 48=4.8ms of 40ms tick (12% CPU usage):
* 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / 50
*/
#define IPVS_EST_CHAIN_FACTOR \
ALIGN_DOWN(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)
/* Compiled number of chains per tick
* The defines should match cond_resched_rcu
*/
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
#define IPVS_EST_TICK_CHAINS IPVS_EST_CHAIN_FACTOR
#else
#define IPVS_EST_TICK_CHAINS 1
#endif
#if IPVS_EST_NTICKS > 127
#error Too many timer ticks for ktrow
#endif
/* Multiple chains processed in same tick */
struct ip_vs_est_tick_data {
struct hlist_head chains[IPVS_EST_TICK_CHAINS];
DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
int chain_len[IPVS_EST_TICK_CHAINS];
};
/* Context for estimation kthread */
struct ip_vs_est_kt_data {
struct netns_ipvs *ipvs;
struct task_struct *task; /* task if running */
struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
DECLARE_BITMAP(avail, IPVS_EST_NTICKS); /* tick has space for ests */
unsigned long est_timer; /* estimation timer (jiffies) */
struct ip_vs_stats *calc_stats; /* Used for calculation */
int tick_len[IPVS_EST_NTICKS]; /* est count */
int id; /* ktid per netns */
int chain_max; /* max ests per tick chain */
int tick_max; /* max ests per tick */
int est_count; /* attached ests to kthread */
int est_max_count; /* max ests per kthread */
int add_row; /* row for new ests */
int est_row; /* estimated row */
};
struct dst_entry;
struct iphdr;
struct ip_vs_conn;
......@@ -688,6 +768,7 @@ struct ip_vs_dest {
union nf_inet_addr vaddr; /* virtual IP address */
__u32 vfwmark; /* firewall mark of service */
struct rcu_head rcu_head;
struct list_head t_list; /* in dest_trash */
unsigned int in_rs_table:1; /* we are in rs_table */
};
......@@ -869,7 +950,7 @@ struct netns_ipvs {
atomic_t conn_count; /* connection counter */
/* ip_vs_ctl */
struct ip_vs_stats tot_stats; /* Statistics & est. */
struct ip_vs_stats_rcu *tot_stats; /* Statistics & est. */
int num_services; /* no of virtual services */
int num_services6; /* IPv6 virtual services */
......@@ -932,6 +1013,12 @@ struct netns_ipvs {
int sysctl_schedule_icmp;
int sysctl_ignore_tunneled;
int sysctl_run_estimation;
#ifdef CONFIG_SYSCTL
cpumask_var_t sysctl_est_cpulist; /* kthread cpumask */
int est_cpulist_valid; /* cpulist set */
int sysctl_est_nice; /* kthread nice */
int est_stopped; /* stop tasks */
#endif
/* ip_vs_lblc */
int sysctl_lblc_expiration;
......@@ -942,9 +1029,17 @@ struct netns_ipvs {
struct ctl_table_header *lblcr_ctl_header;
struct ctl_table *lblcr_ctl_table;
/* ip_vs_est */
struct list_head est_list; /* estimator list */
spinlock_t est_lock;
struct timer_list est_timer; /* Estimation timer */
struct delayed_work est_reload_work;/* Reload kthread tasks */
struct mutex est_mutex; /* protect kthread tasks */
struct hlist_head est_temp_list; /* Ests during calc phase */
struct ip_vs_est_kt_data **est_kt_arr; /* Array of kthread data ptrs */
unsigned long est_max_threads;/* Hard limit of kthreads */
int est_calc_phase; /* Calculation phase */
int est_chain_max; /* Calculated chain_max */
int est_kt_count; /* Allocated ptrs */
int est_add_ktid; /* ktid where to add ests */
atomic_t est_genid; /* kthreads reload genid */
atomic_t est_genid_done; /* applied genid */
/* ip_vs_sync */
spinlock_t sync_lock;
struct ipvs_master_sync_state *ms;
......@@ -1077,6 +1172,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
return ipvs->sysctl_run_estimation;
}
static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
{
if (ipvs->est_cpulist_valid)
return ipvs->sysctl_est_cpulist;
else
return housekeeping_cpumask(HK_TYPE_KTHREAD);
}
static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
{
return ipvs->sysctl_est_nice;
}
#else
static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
......@@ -1174,6 +1282,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
return 1;
}
static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
{
return housekeeping_cpumask(HK_TYPE_KTHREAD);
}
static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
{
return IPVS_EST_NICE;
}
#endif
/* IPVS core functions
......@@ -1475,10 +1593,41 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
/* IPVS rate estimator prototypes (from ip_vs_est.c) */
void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_zero_estimator(struct ip_vs_stats *stats);
void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
struct ip_vs_est_kt_data *kd);
void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
/* Stop tasks while cpulist is empty or if disabled with flag */
ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
(ipvs->est_cpulist_valid &&
cpumask_empty(sysctl_est_cpulist(ipvs)));
#endif
}
static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
return ipvs->est_stopped;
#else
return false;
#endif
}
static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
{
unsigned int limit = IPVS_EST_CPU_KTHREADS *
cpumask_weight(sysctl_est_cpulist(ipvs));
return max(1U, limit);
}
/* Various IPVS packet transmitters (from ip_vs_xmit.c) */
int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
......
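The interplay of run_estimation, est_cpulist and the kthread limit in the
helpers above can be modelled in userspace. This is a sketch under stated
assumptions: struct est_cfg_model and its fields are hypothetical, the
cpumask is reduced to a CPU count, and the housekeeping CPU count is
supplied by the caller instead of coming from housekeeping_cpumask(). It
models the CONFIG_SYSCTL case only; without CONFIG_SYSCTL the header's
ip_vs_est_stopped() always returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* IPVS_EST_LOAD_DIVISOR / 2 in the header */
#define IPVS_EST_CPU_KTHREADS 4

/* Hypothetical model of the relevant netns_ipvs state */
struct est_cfg_model {
	bool run_estimation;              /* sysctl run_estimation != 0 */
	bool cpulist_valid;               /* est_cpulist was set by the admin */
	unsigned int cpulist_weight;      /* CPUs in est_cpulist */
	unsigned int housekeeping_weight; /* CPUs in the default mask */
};

/* Tasks stop when estimation is disabled or an explicit cpulist is empty */
bool est_stopped(const struct est_cfg_model *c)
{
	return !c->run_estimation ||
	       (c->cpulist_valid && c->cpulist_weight == 0);
}

/* The explicit cpulist wins when valid, else the housekeeping default */
unsigned int est_effective_cpus(const struct est_cfg_model *c)
{
	return c->cpulist_valid ? c->cpulist_weight : c->housekeeping_weight;
}

/* At least one kthread slot, otherwise IPVS_EST_CPU_KTHREADS per CPU */
unsigned int est_max_threads(const struct est_cfg_model *c)
{
	unsigned int limit = IPVS_EST_CPU_KTHREADS * est_effective_cpus(c);

	return limit > 1 ? limit : 1;
}
```

Note that an empty but valid cpulist both stops the tasks and leaves the
limit at 1, so kthread 0's data structure can still exist while no task
is running.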
......@@ -71,8 +71,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
return ret;
}
unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
struct nf_conn *ct, enum ip_conntrack_info ctinfo);
unsigned int nf_confirm(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
void print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_l4proto *proto);
......
......@@ -85,6 +85,7 @@ enum {
IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO, /* 9 */
IPSET_ATTR_MARK, /* 10 */
IPSET_ATTR_MARKMASK, /* 11 */
IPSET_ATTR_BITMASK, /* 12 */
/* Reserve empty slots */
IPSET_ATTR_CADT_MAX = 16,
/* Create-only specific attributes */
......@@ -153,6 +154,7 @@ enum ipset_errno {
IPSET_ERR_COMMENT,
IPSET_ERR_INVALID_MARKMASK,
IPSET_ERR_SKBINFO,
IPSET_ERR_BITMASK_NETMASK_EXCL,
/* Type specific error codes */
IPSET_ERR_TYPE_SPECIFIC = 4352,
......
......@@ -16,6 +16,7 @@ enum sctp_conntrack {
SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
SCTP_CONNTRACK_HEARTBEAT_SENT,
SCTP_CONNTRACK_HEARTBEAT_ACKED,
SCTP_CONNTRACK_DATA_SENT,
SCTP_CONNTRACK_MAX
};
......
......@@ -95,6 +95,7 @@ enum ctattr_timeout_sctp {
CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
CTA_TIMEOUT_SCTP_HEARTBEAT_SENT,
CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED,
CTA_TIMEOUT_SCTP_DATA_SENT,
__CTA_TIMEOUT_SCTP_MAX
};
#define CTA_TIMEOUT_SCTP_MAX (__CTA_TIMEOUT_SCTP_MAX - 1)
......
......@@ -366,42 +366,12 @@ static int nf_ct_bridge_refrag_post(struct net *net, struct sock *sk,
return br_dev_queue_push_xmit(net, sk, skb);
}
static unsigned int nf_ct_bridge_confirm(struct sk_buff *skb)
{
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
int protoff;
ct = nf_ct_get(skb, &ctinfo);
if (!ct || ctinfo == IP_CT_RELATED_REPLY)
return nf_conntrack_confirm(skb);
switch (skb->protocol) {
case htons(ETH_P_IP):
protoff = skb_network_offset(skb) + ip_hdrlen(skb);
break;
case htons(ETH_P_IPV6): {
unsigned char pnum = ipv6_hdr(skb)->nexthdr;
__be16 frag_off;
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum,
&frag_off);
if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
return nf_conntrack_confirm(skb);
}
break;
default:
return NF_ACCEPT;
}
return nf_confirm(skb, protoff, ct, ctinfo);
}
static unsigned int nf_ct_bridge_post(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
{
int ret;
ret = nf_ct_bridge_confirm(skb);
ret = nf_confirm(priv, skb, state);
if (ret != NF_ACCEPT)
return ret;
......
......@@ -159,6 +159,17 @@ htable_size(u8 hbits)
(SET_WITH_TIMEOUT(set) && \
ip_set_timeout_expired(ext_timeout(d, set)))
#if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
static const union nf_inet_addr onesmask = {
.all[0] = 0xffffffff,
.all[1] = 0xffffffff,
.all[2] = 0xffffffff,
.all[3] = 0xffffffff
};
static const union nf_inet_addr zeromask = {};
#endif
#endif /* _IP_SET_HASH_GEN_H */
#ifndef MTYPE
......@@ -283,8 +294,9 @@ struct htype {
u32 markmask; /* markmask value for mark mask to store */
#endif
u8 bucketsize; /* max elements in an array block */
#ifdef IP_SET_HASH_WITH_NETMASK
#if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
u8 netmask; /* netmask value for subnets to store */
union nf_inet_addr bitmask; /* stores bitmask */
#endif
struct list_head ad; /* Resize add|del backlist */
struct mtype_elem next; /* temporary storage for uadd */
......@@ -459,8 +471,8 @@ mtype_same_set(const struct ip_set *a, const struct ip_set *b)
/* Resizing changes htable_bits, so we ignore it */
return x->maxelem == y->maxelem &&
a->timeout == b->timeout &&
#ifdef IP_SET_HASH_WITH_NETMASK
x->netmask == y->netmask &&
#if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
nf_inet_addr_cmp(&x->bitmask, &y->bitmask) &&
#endif
#ifdef IP_SET_HASH_WITH_MARKMASK
x->markmask == y->markmask &&
......@@ -1264,9 +1276,21 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
htonl(jhash_size(htable_bits))) ||
nla_put_net32(skb, IPSET_ATTR_MAXELEM, htonl(h->maxelem)))
goto nla_put_failure;
#ifdef IP_SET_HASH_WITH_BITMASK
/* if netmask is set to anything other than HOST_MASK we know that the user supplied netmask
* and not bitmask. These two are mutually exclusive. */
if (h->netmask == HOST_MASK && !nf_inet_addr_cmp(&onesmask, &h->bitmask)) {
if (set->family == NFPROTO_IPV4) {
if (nla_put_ipaddr4(skb, IPSET_ATTR_BITMASK, h->bitmask.ip))
goto nla_put_failure;
} else if (set->family == NFPROTO_IPV6) {
if (nla_put_ipaddr6(skb, IPSET_ATTR_BITMASK, &h->bitmask.in6))
goto nla_put_failure;
}
}
#endif
#ifdef IP_SET_HASH_WITH_NETMASK
if (h->netmask != HOST_MASK &&
nla_put_u8(skb, IPSET_ATTR_NETMASK, h->netmask))
if (h->netmask != HOST_MASK && nla_put_u8(skb, IPSET_ATTR_NETMASK, h->netmask))
goto nla_put_failure;
#endif
#ifdef IP_SET_HASH_WITH_MARKMASK
......@@ -1429,8 +1453,10 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
u32 markmask;
#endif
u8 hbits;
#ifdef IP_SET_HASH_WITH_NETMASK
u8 netmask;
#if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
int ret __attribute__((unused)) = 0;
u8 netmask = set->family == NFPROTO_IPV4 ? 32 : 128;
union nf_inet_addr bitmask = onesmask;
#endif
size_t hsize;
struct htype *h;
......@@ -1468,7 +1494,6 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
#endif
#ifdef IP_SET_HASH_WITH_NETMASK
netmask = set->family == NFPROTO_IPV4 ? 32 : 128;
if (tb[IPSET_ATTR_NETMASK]) {
netmask = nla_get_u8(tb[IPSET_ATTR_NETMASK]);
......@@ -1476,6 +1501,33 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
(set->family == NFPROTO_IPV6 && netmask > 128) ||
netmask == 0)
return -IPSET_ERR_INVALID_NETMASK;
/* we convert netmask to bitmask and store it */
if (set->family == NFPROTO_IPV4)
bitmask.ip = ip_set_netmask(netmask);
else
ip6_netmask(&bitmask, netmask);
}
#endif
#ifdef IP_SET_HASH_WITH_BITMASK
if (tb[IPSET_ATTR_BITMASK]) {
/* bitmask and netmask do the same thing, allow only one of these options */
if (tb[IPSET_ATTR_NETMASK])
return -IPSET_ERR_BITMASK_NETMASK_EXCL;
if (set->family == NFPROTO_IPV4) {
ret = ip_set_get_ipaddr4(tb[IPSET_ATTR_BITMASK], &bitmask.ip);
if (ret || !bitmask.ip)
return -IPSET_ERR_INVALID_NETMASK;
} else if (set->family == NFPROTO_IPV6) {
ret = ip_set_get_ipaddr6(tb[IPSET_ATTR_BITMASK], &bitmask);
if (ret || ipv6_addr_any(&bitmask.in6))
return -IPSET_ERR_INVALID_NETMASK;
}
if (nf_inet_addr_cmp(&bitmask, &zeromask))
return -IPSET_ERR_INVALID_NETMASK;
}
#endif
......@@ -1518,7 +1570,8 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
for (i = 0; i < ahash_numof_locks(hbits); i++)
spin_lock_init(&t->hregion[i].lock);
h->maxelem = maxelem;
#ifdef IP_SET_HASH_WITH_NETMASK
#if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
h->bitmask = bitmask;
h->netmask = netmask;
#endif
#ifdef IP_SET_HASH_WITH_MARKMASK
......
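The create-time handling above stores a single bitmask whether the user
supplied a netmask (prefix length) or a bitmask, and rejects sets that
supply both. The IPv4 path can be sketched in userspace as follows. The
function names, the use of NULL pointers for absent attributes and the
MASK_* return codes are assumptions made for this illustration; the kernel
code returns -IPSET_ERR_INVALID_NETMASK and -IPSET_ERR_BITMASK_NETMASK_EXCL
instead.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes standing in for the IPSET_ERR_* values */
enum { MASK_OK = 0, MASK_ERR_INVALID = -1, MASK_ERR_EXCL = -2 };

/* Prefix length (1..32) to a mask in network byte order */
uint32_t prefix_to_mask_be(uint8_t cidr)
{
	return htonl(~0u << (32 - cidr));
}

/* netmask or bitmask_be may be NULL when the attribute is absent.
 * On success *out_be holds the mask to store, in network byte order.
 */
int resolve_ipv4_mask(const uint8_t *netmask, const uint32_t *bitmask_be,
		      uint32_t *out_be)
{
	*out_be = htonl(0xffffffffu);	/* default: all ones, no masking */

	if (netmask) {
		if (*netmask == 0 || *netmask > 32)
			return MASK_ERR_INVALID;
		/* the netmask is converted to a bitmask and stored */
		*out_be = prefix_to_mask_be(*netmask);
	}
	if (bitmask_be) {
		/* both options do the same job, so only one is allowed */
		if (netmask)
			return MASK_ERR_EXCL;
		if (*bitmask_be == 0)
			return MASK_ERR_INVALID;
		*out_be = *bitmask_be;
	}
	return MASK_OK;
}
```

As in the hunk above, the netmask is validated before the exclusivity
check, so an out-of-range netmask combined with a bitmask reports the
invalid-netmask error rather than the exclusivity error.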
......@@ -24,7 +24,8 @@
/* 2 Comments support */
/* 3 Forceadd support */
/* 4 skbinfo support */
#define IPSET_TYPE_REV_MAX 5 /* bucketsize, initval support */
/* 5 bucketsize, initval support */
#define IPSET_TYPE_REV_MAX 6 /* bitmask support */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@netfilter.org>");
......@@ -34,6 +35,7 @@ MODULE_ALIAS("ip_set_hash:ip");
/* Type specific function prefix */
#define HTYPE hash_ip
#define IP_SET_HASH_WITH_NETMASK
#define IP_SET_HASH_WITH_BITMASK
/* IPv4 variant */
......@@ -86,7 +88,7 @@ hash_ip4_kadt(struct ip_set *set, const struct sk_buff *skb,
__be32 ip;
ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &ip);
ip &= ip_set_netmask(h->netmask);
ip &= h->bitmask.ip;
if (ip == 0)
return -EINVAL;
......@@ -119,7 +121,7 @@ hash_ip4_uadt(struct ip_set *set, struct nlattr *tb[],
if (ret)
return ret;
ip &= ip_set_hostmask(h->netmask);
ip &= ntohl(h->bitmask.ip);
e.ip = htonl(ip);
if (e.ip == 0)
return -IPSET_ERR_HASH_ELEM;
......@@ -185,12 +187,6 @@ hash_ip6_data_equal(const struct hash_ip6_elem *ip1,
return ipv6_addr_equal(&ip1->ip.in6, &ip2->ip.in6);
}
static void
hash_ip6_netmask(union nf_inet_addr *ip, u8 prefix)
{
ip6_netmask(ip, prefix);
}
static bool
hash_ip6_data_list(struct sk_buff *skb, const struct hash_ip6_elem *e)
{
......@@ -227,7 +223,7 @@ hash_ip6_kadt(struct ip_set *set, const struct sk_buff *skb,
struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
hash_ip6_netmask(&e.ip, h->netmask);
nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
if (ipv6_addr_any(&e.ip.in6))
return -EINVAL;
......@@ -266,7 +262,7 @@ hash_ip6_uadt(struct ip_set *set, struct nlattr *tb[],
if (ret)
return ret;
hash_ip6_netmask(&e.ip, h->netmask);
nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
if (ipv6_addr_any(&e.ip.in6))
return -IPSET_ERR_HASH_ELEM;
......@@ -293,6 +289,7 @@ static struct ip_set_type hash_ip_type __read_mostly = {
[IPSET_ATTR_RESIZE] = { .type = NLA_U8 },
[IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
[IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
[IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
[IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
},
.adt_policy = {
......
......@@ -26,7 +26,8 @@
/* 3 Comments support added */
/* 4 Forceadd support added */
/* 5 skbinfo support added */
#define IPSET_TYPE_REV_MAX 6 /* bucketsize, initval support added */
/* 6 bucketsize, initval support added */
#define IPSET_TYPE_REV_MAX 7 /* bitmask support added */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@netfilter.org>");
......@@ -35,6 +36,8 @@ MODULE_ALIAS("ip_set_hash:ip,port");
/* Type specific function prefix */
#define HTYPE hash_ipport
#define IP_SET_HASH_WITH_NETMASK
#define IP_SET_HASH_WITH_BITMASK
/* IPv4 variant */
......@@ -92,12 +95,16 @@ hash_ipport4_kadt(struct ip_set *set, const struct sk_buff *skb,
ipset_adtfn adtfn = set->variant->adt[adt];
struct hash_ipport4_elem e = { .ip = 0 };
struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
const struct MTYPE *h = set->data;
if (!ip_set_get_ip4_port(skb, opt->flags & IPSET_DIM_TWO_SRC,
&e.port, &e.proto))
return -EINVAL;
ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
e.ip &= h->bitmask.ip;
if (e.ip == 0)
return -EINVAL;
return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
}
......@@ -129,6 +136,10 @@ hash_ipport4_uadt(struct ip_set *set, struct nlattr *tb[],
if (ret)
return ret;
e.ip &= h->bitmask.ip;
if (e.ip == 0)
return -EINVAL;
e.port = nla_get_be16(tb[IPSET_ATTR_PORT]);
if (tb[IPSET_ATTR_PROTO]) {
......@@ -253,12 +264,17 @@ hash_ipport6_kadt(struct ip_set *set, const struct sk_buff *skb,
ipset_adtfn adtfn = set->variant->adt[adt];
struct hash_ipport6_elem e = { .ip = { .all = { 0 } } };
struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
const struct MTYPE *h = set->data;
if (!ip_set_get_ip6_port(skb, opt->flags & IPSET_DIM_TWO_SRC,
&e.port, &e.proto))
return -EINVAL;
ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
if (ipv6_addr_any(&e.ip.in6))
return -EINVAL;
return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
}
......@@ -298,6 +314,10 @@ hash_ipport6_uadt(struct ip_set *set, struct nlattr *tb[],
if (ret)
return ret;
nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
if (ipv6_addr_any(&e.ip.in6))
return -EINVAL;
e.port = nla_get_be16(tb[IPSET_ATTR_PORT]);
if (tb[IPSET_ATTR_PROTO]) {
......@@ -356,6 +376,8 @@ static struct ip_set_type hash_ipport_type __read_mostly = {
[IPSET_ATTR_PROTO] = { .type = NLA_U8 },
[IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
[IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
[IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
[IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
},
.adt_policy = {
[IPSET_ATTR_IP] = { .type = NLA_NESTED },
......
......@@ -23,7 +23,8 @@
#define IPSET_TYPE_REV_MIN 0
/* 1 Forceadd support added */
/* 2 skbinfo support added */
#define IPSET_TYPE_REV_MAX 3 /* bucketsize, initval support added */
/* 3 bucketsize, initval support added */
#define IPSET_TYPE_REV_MAX 4 /* bitmask support added */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>");
......@@ -33,6 +34,8 @@ MODULE_ALIAS("ip_set_hash:net,net");
/* Type specific function prefix */
#define HTYPE hash_netnet
#define IP_SET_HASH_WITH_NETS
#define IP_SET_HASH_WITH_NETMASK
#define IP_SET_HASH_WITH_BITMASK
#define IPSET_NET_COUNT 2
/* IPv4 variants */
......@@ -153,8 +156,8 @@ hash_netnet4_kadt(struct ip_set *set, const struct sk_buff *skb,
ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip[0]);
ip4addrptr(skb, opt->flags & IPSET_DIM_TWO_SRC, &e.ip[1]);
e.ip[0] &= ip_set_netmask(e.cidr[0]);
e.ip[1] &= ip_set_netmask(e.cidr[1]);
e.ip[0] &= (ip_set_netmask(e.cidr[0]) & h->bitmask.ip);
e.ip[1] &= (ip_set_netmask(e.cidr[1]) & h->bitmask.ip);
return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
}
......@@ -213,8 +216,8 @@ hash_netnet4_uadt(struct ip_set *set, struct nlattr *tb[],
if (adt == IPSET_TEST || !(tb[IPSET_ATTR_IP_TO] ||
tb[IPSET_ATTR_IP2_TO])) {
e.ip[0] = htonl(ip & ip_set_hostmask(e.cidr[0]));
e.ip[1] = htonl(ip2_from & ip_set_hostmask(e.cidr[1]));
e.ip[0] = htonl(ip & ntohl(h->bitmask.ip) & ip_set_hostmask(e.cidr[0]));
e.ip[1] = htonl(ip2_from & ntohl(h->bitmask.ip) & ip_set_hostmask(e.cidr[1]));
ret = adtfn(set, &e, &ext, &ext, flags);
return ip_set_enomatch(ret, flags, adt, set) ? -ret :
ip_set_eexist(ret, flags) ? 0 : ret;
......@@ -404,6 +407,11 @@ hash_netnet6_kadt(struct ip_set *set, const struct sk_buff *skb,
ip6_netmask(&e.ip[0], e.cidr[0]);
ip6_netmask(&e.ip[1], e.cidr[1]);
nf_inet_addr_mask_inplace(&e.ip[0], &h->bitmask);
nf_inet_addr_mask_inplace(&e.ip[1], &h->bitmask);
if (e.cidr[0] == HOST_MASK && ipv6_addr_any(&e.ip[0].in6))
return -EINVAL;
return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
}
......@@ -414,6 +422,7 @@ hash_netnet6_uadt(struct ip_set *set, struct nlattr *tb[],
ipset_adtfn adtfn = set->variant->adt[adt];
struct hash_netnet6_elem e = { };
struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
const struct hash_netnet6 *h = set->data;
int ret;
if (tb[IPSET_ATTR_LINENO])
......@@ -453,6 +462,11 @@ hash_netnet6_uadt(struct ip_set *set, struct nlattr *tb[],
ip6_netmask(&e.ip[0], e.cidr[0]);
ip6_netmask(&e.ip[1], e.cidr[1]);
nf_inet_addr_mask_inplace(&e.ip[0], &h->bitmask);
nf_inet_addr_mask_inplace(&e.ip[1], &h->bitmask);
if (e.cidr[0] == HOST_MASK && ipv6_addr_any(&e.ip[0].in6))
return -IPSET_ERR_HASH_ELEM;
if (tb[IPSET_ATTR_CADT_FLAGS]) {
u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
......@@ -484,6 +498,8 @@ static struct ip_set_type hash_netnet_type __read_mostly = {
[IPSET_ATTR_RESIZE] = { .type = NLA_U8 },
[IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
[IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
[IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
[IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
},
.adt_policy = {
[IPSET_ATTR_IP] = { .type = NLA_NESTED },
......
......@@ -132,21 +132,21 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
s = this_cpu_ptr(dest->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.inpkts++;
s->cnt.inbytes += skb->len;
u64_stats_inc(&s->cnt.inpkts);
u64_stats_add(&s->cnt.inbytes, skb->len);
u64_stats_update_end(&s->syncp);
svc = rcu_dereference(dest->svc);
s = this_cpu_ptr(svc->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.inpkts++;
s->cnt.inbytes += skb->len;
u64_stats_inc(&s->cnt.inpkts);
u64_stats_add(&s->cnt.inbytes, skb->len);
u64_stats_update_end(&s->syncp);
s = this_cpu_ptr(ipvs->tot_stats.cpustats);
s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.inpkts++;
s->cnt.inbytes += skb->len;
u64_stats_inc(&s->cnt.inpkts);
u64_stats_add(&s->cnt.inbytes, skb->len);
u64_stats_update_end(&s->syncp);
local_bh_enable();
......@@ -168,21 +168,21 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
s = this_cpu_ptr(dest->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.outpkts++;
s->cnt.outbytes += skb->len;
u64_stats_inc(&s->cnt.outpkts);
u64_stats_add(&s->cnt.outbytes, skb->len);
u64_stats_update_end(&s->syncp);
svc = rcu_dereference(dest->svc);
s = this_cpu_ptr(svc->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.outpkts++;
s->cnt.outbytes += skb->len;
u64_stats_inc(&s->cnt.outpkts);
u64_stats_add(&s->cnt.outbytes, skb->len);
u64_stats_update_end(&s->syncp);
s = this_cpu_ptr(ipvs->tot_stats.cpustats);
s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.outpkts++;
s->cnt.outbytes += skb->len;
u64_stats_inc(&s->cnt.outpkts);
u64_stats_add(&s->cnt.outbytes, skb->len);
u64_stats_update_end(&s->syncp);
local_bh_enable();
......@@ -200,17 +200,17 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
s = this_cpu_ptr(cp->dest->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.conns++;
u64_stats_inc(&s->cnt.conns);
u64_stats_update_end(&s->syncp);
s = this_cpu_ptr(svc->stats.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.conns++;
u64_stats_inc(&s->cnt.conns);
u64_stats_update_end(&s->syncp);
s = this_cpu_ptr(ipvs->tot_stats.cpustats);
s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.conns++;
u64_stats_inc(&s->cnt.conns);
u64_stats_update_end(&s->syncp);
local_bh_enable();
......@@ -2448,6 +2448,10 @@ static void __exit ip_vs_cleanup(void)
ip_vs_conn_cleanup();
ip_vs_protocol_cleanup();
ip_vs_control_cleanup();
/* common rcu_barrier() used by:
* - ip_vs_control_cleanup()
*/
rcu_barrier();
pr_info("ipvs unloaded.\n");
}
......
......@@ -121,17 +121,61 @@ const struct nf_conntrack_l4proto *nf_ct_l4proto_find(u8 l4proto)
};
EXPORT_SYMBOL_GPL(nf_ct_l4proto_find);
unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
struct nf_conn *ct, enum ip_conntrack_info ctinfo)
static bool in_vrf_postrouting(const struct nf_hook_state *state)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
if (state->hook == NF_INET_POST_ROUTING &&
netif_is_l3_master(state->out))
return true;
#endif
return false;
}
unsigned int nf_confirm(void *priv,
struct sk_buff *skb,
const struct nf_hook_state *state)
{
const struct nf_conn_help *help;
enum ip_conntrack_info ctinfo;
unsigned int protoff;
struct nf_conn *ct;
bool seqadj_needed;
__be16 frag_off;
u8 pnum;
ct = nf_ct_get(skb, &ctinfo);
if (!ct || in_vrf_postrouting(state))
return NF_ACCEPT;
help = nfct_help(ct);
seqadj_needed = test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) && !nf_is_loopback_packet(skb);
if (!help && !seqadj_needed)
return nf_conntrack_confirm(skb);
/* helper->help() do not expect ICMP packets */
if (ctinfo == IP_CT_RELATED_REPLY)
return nf_conntrack_confirm(skb);
switch (nf_ct_l3num(ct)) {
case NFPROTO_IPV4:
protoff = skb_network_offset(skb) + ip_hdrlen(skb);
break;
case NFPROTO_IPV6:
pnum = ipv6_hdr(skb)->nexthdr;
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum, &frag_off);
if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
return nf_conntrack_confirm(skb);
break;
default:
return nf_conntrack_confirm(skb);
}
if (help) {
const struct nf_conntrack_helper *helper;
int ret;
/* rcu_read_lock()ed by nf_hook_thresh */
/* rcu_read_lock()ed by nf_hook */
helper = rcu_dereference(help->helper);
if (helper) {
ret = helper->help(skb,
......@@ -142,12 +186,10 @@ unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
}
}
if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) &&
!nf_is_loopback_packet(skb)) {
if (!nf_ct_seq_adjust(skb, ct, ctinfo, protoff)) {
NF_CT_STAT_INC_ATOMIC(nf_ct_net(ct), drop);
return NF_DROP;
}
if (seqadj_needed &&
!nf_ct_seq_adjust(skb, ct, ctinfo, protoff)) {
NF_CT_STAT_INC_ATOMIC(nf_ct_net(ct), drop);
return NF_DROP;
}
/* We've seen it coming out the other side: confirm it */
......@@ -155,35 +197,6 @@ unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
}
EXPORT_SYMBOL_GPL(nf_confirm);
static bool in_vrf_postrouting(const struct nf_hook_state *state)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
if (state->hook == NF_INET_POST_ROUTING &&
netif_is_l3_master(state->out))
return true;
#endif
return false;
}
static unsigned int ipv4_confirm(void *priv,
struct sk_buff *skb,
const struct nf_hook_state *state)
{
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
ct = nf_ct_get(skb, &ctinfo);
if (!ct || ctinfo == IP_CT_RELATED_REPLY)
return nf_conntrack_confirm(skb);
if (in_vrf_postrouting(state))
return NF_ACCEPT;
return nf_confirm(skb,
skb_network_offset(skb) + ip_hdrlen(skb),
ct, ctinfo);
}
static unsigned int ipv4_conntrack_in(void *priv,
struct sk_buff *skb,
const struct nf_hook_state *state)
......@@ -230,13 +243,13 @@ static const struct nf_hook_ops ipv4_conntrack_ops[] = {
.priority = NF_IP_PRI_CONNTRACK,
},
{
.hook = ipv4_confirm,
.hook = nf_confirm,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_POST_ROUTING,
.priority = NF_IP_PRI_CONNTRACK_CONFIRM,
},
{
.hook = ipv4_confirm,
.hook = nf_confirm,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_CONNTRACK_CONFIRM,
......@@ -373,33 +386,6 @@ static struct nf_sockopt_ops so_getorigdst6 = {
.owner = THIS_MODULE,
};
static unsigned int ipv6_confirm(void *priv,
struct sk_buff *skb,
const struct nf_hook_state *state)
{
struct nf_conn *ct;
enum ip_conntrack_info ctinfo;
unsigned char pnum = ipv6_hdr(skb)->nexthdr;
__be16 frag_off;
int protoff;
ct = nf_ct_get(skb, &ctinfo);
if (!ct || ctinfo == IP_CT_RELATED_REPLY)
return nf_conntrack_confirm(skb);
if (in_vrf_postrouting(state))
return NF_ACCEPT;
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum,
&frag_off);
if (protoff < 0 || (frag_off & htons(~0x7)) != 0) {
pr_debug("proto header not found\n");
return nf_conntrack_confirm(skb);
}
return nf_confirm(skb, protoff, ct, ctinfo);
}
static unsigned int ipv6_conntrack_in(void *priv,
struct sk_buff *skb,
const struct nf_hook_state *state)
......@@ -428,13 +414,13 @@ static const struct nf_hook_ops ipv6_conntrack_ops[] = {
.priority = NF_IP6_PRI_CONNTRACK,
},
{
.hook = ipv6_confirm,
.hook = nf_confirm,
.pf = NFPROTO_IPV6,
.hooknum = NF_INET_POST_ROUTING,
.priority = NF_IP6_PRI_LAST,
},
{
.hook = ipv6_confirm,
.hook = nf_confirm,
.pf = NFPROTO_IPV6,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP6_PRI_LAST - 1,
......
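Note on the IPv6 branch of the consolidated nf_confirm(): it returns early via nf_conntrack_confirm() when `(frag_off & htons(~0x7)) != 0`, which appears to single out non-first fragments. The mask test can be sketched in userspace as below. `sketch_is_nonfirst_fragment` is a hypothetical name, not a kernel function, and the sketch assumes a POSIX `htons()`.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * frag_off_be is the 16-bit field of the IPv6 Fragment header in network
 * byte order: 13 bits of fragment offset, 2 reserved bits, then the M flag.
 * Masking with 0xfff8 (the 16-bit value of ~0x7) keeps only the offset bits,
 * so a non-zero result means this is not the first fragment.
 */
static bool sketch_is_nonfirst_fragment(uint16_t frag_off_be)
{
	return (frag_off_be & htons(0xfff8)) != 0;
}
```

A first fragment with only the M flag set (`htons(0x0001)`) is not rejected by this test; any fragment with a non-zero offset is.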
......@@ -129,6 +129,56 @@ static void icmpv6_error_log(const struct sk_buff *skb,
nf_l4proto_log_invalid(skb, state, IPPROTO_ICMPV6, "%s", msg);
}
static noinline_for_stack int
nf_conntrack_icmpv6_redirect(struct nf_conn *tmpl, struct sk_buff *skb,
unsigned int dataoff,
const struct nf_hook_state *state)
{
u8 hl = ipv6_hdr(skb)->hop_limit;
union nf_inet_addr outer_daddr;
union {
struct nd_opt_hdr nd_opt;
struct rd_msg rd_msg;
} tmp;
const struct nd_opt_hdr *nd_opt;
const struct rd_msg *rd_msg;
rd_msg = skb_header_pointer(skb, dataoff, sizeof(*rd_msg), &tmp.rd_msg);
if (!rd_msg) {
icmpv6_error_log(skb, state, "short redirect");
return -NF_ACCEPT;
}
if (rd_msg->icmph.icmp6_code != 0)
return NF_ACCEPT;
if (hl != 255 || !(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
icmpv6_error_log(skb, state, "invalid saddr or hoplimit for redirect");
return -NF_ACCEPT;
}
dataoff += sizeof(*rd_msg);
/* warning: rd_msg no longer usable after this call */
nd_opt = skb_header_pointer(skb, dataoff, sizeof(*nd_opt), &tmp.nd_opt);
if (!nd_opt || nd_opt->nd_opt_len == 0) {
icmpv6_error_log(skb, state, "redirect without options");
return -NF_ACCEPT;
}
/* We could call ndisc_parse_options(), but it would need
* skb_linearize() and a bit more work.
*/
if (nd_opt->nd_opt_type != ND_OPT_REDIRECT_HDR)
return NF_ACCEPT;
memcpy(&outer_daddr.ip6, &ipv6_hdr(skb)->daddr,
sizeof(outer_daddr.ip6));
dataoff += 8;
return nf_conntrack_inet_error(tmpl, skb, dataoff, state,
IPPROTO_ICMPV6, &outer_daddr);
}
int nf_conntrack_icmpv6_error(struct nf_conn *tmpl,
struct sk_buff *skb,
unsigned int dataoff,
......@@ -159,6 +209,9 @@ int nf_conntrack_icmpv6_error(struct nf_conn *tmpl,
return NF_ACCEPT;
}
if (icmp6h->icmp6_type == NDISC_REDIRECT)
return nf_conntrack_icmpv6_redirect(tmpl, skb, dataoff, state);
/* is not error message ? */
if (icmp6h->icmp6_type >= 128)
return NF_ACCEPT;
......
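Note on nf_conntrack_icmpv6_redirect(): before treating a redirect as RELATED, it requires a hop limit of 255 and a link-local source address. A rough userspace sketch of that sanity check follows. `sketch_redirect_plausible` is a hypothetical helper that takes the source address as text for convenience; the kernel code works on the skb directly and uses ipv6_addr_type() rather than the libc macro assumed here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Returns true when a redirect with this source address and hop limit
 * would pass the two checks shown in the hunk above:
 *   - hop limit is exactly 255 (the packet was not forwarded), and
 *   - the source address is link-local (fe80::/10).
 * An unparsable address is treated as a failed check.
 */
static bool sketch_redirect_plausible(const char *saddr_text,
				      unsigned int hop_limit)
{
	struct in6_addr saddr;

	if (inet_pton(AF_INET6, saddr_text, &saddr) != 1)
		return false;

	return hop_limit == 255 && IN6_IS_ADDR_LINKLOCAL(&saddr);
}
```

For example, `sketch_redirect_plausible("fe80::1", 255)` passes, while a global source address or a hop limit of 64 does not.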
......@@ -60,6 +60,7 @@ static const unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] = {
[SCTP_CONNTRACK_SHUTDOWN_ACK_SENT] = 3 SECS,
[SCTP_CONNTRACK_HEARTBEAT_SENT] = 30 SECS,
[SCTP_CONNTRACK_HEARTBEAT_ACKED] = 210 SECS,
[SCTP_CONNTRACK_DATA_SENT] = 30 SECS,
};
#define SCTP_FLAG_HEARTBEAT_VTAG_FAILED 1
......@@ -74,6 +75,7 @@ static const unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] = {
#define sSA SCTP_CONNTRACK_SHUTDOWN_ACK_SENT
#define sHS SCTP_CONNTRACK_HEARTBEAT_SENT
#define sHA SCTP_CONNTRACK_HEARTBEAT_ACKED
#define sDS SCTP_CONNTRACK_DATA_SENT
#define sIV SCTP_CONNTRACK_MAX
/*
......@@ -90,15 +92,16 @@ COOKIE WAIT - We have seen an INIT chunk in the original direction, or als
COOKIE ECHOED - We have seen a COOKIE_ECHO chunk in the original direction.
ESTABLISHED - We have seen a COOKIE_ACK in the reply direction.
SHUTDOWN_SENT - We have seen a SHUTDOWN chunk in the original direction.
SHUTDOWN_RECD - We have seen a SHUTDOWN chunk in the reply directoin.
SHUTDOWN_RECD - We have seen a SHUTDOWN chunk in the reply direction.
SHUTDOWN_ACK_SENT - We have seen a SHUTDOWN_ACK chunk in the direction opposite
to that of the SHUTDOWN chunk.
CLOSED - We have seen a SHUTDOWN_COMPLETE chunk in the direction of
the SHUTDOWN chunk. Connection is closed.
HEARTBEAT_SENT - We have seen a HEARTBEAT in a new flow.
HEARTBEAT_ACKED - We have seen a HEARTBEAT-ACK in the direction opposite to
that of the HEARTBEAT chunk. Secondary connection is
established.
HEARTBEAT_ACKED - We have seen a HEARTBEAT-ACK/DATA/SACK in the direction
opposite to that of the HEARTBEAT/DATA chunk. Secondary connection
is established.
DATA_SENT - We have seen a DATA/SACK in a new flow.
*/
/* TODO
......@@ -112,36 +115,38 @@ cookie echoed to closed.
*/
/* SCTP conntrack state transitions */
static const u8 sctp_conntracks[2][11][SCTP_CONNTRACK_MAX] = {
static const u8 sctp_conntracks[2][12][SCTP_CONNTRACK_MAX] = {
{
/* ORIGINAL */
/* sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA */
/* init */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCW, sHA},
/* init_ack */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},
/* abort */ {sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL},
/* shutdown */ {sCL, sCL, sCW, sCE, sSS, sSS, sSR, sSA, sCL, sSS},
/* shutdown_ack */ {sSA, sCL, sCW, sCE, sES, sSA, sSA, sSA, sSA, sHA},
/* error */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* Can't have Stale cookie*/
/* cookie_echo */ {sCL, sCL, sCE, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* 5.2.4 - Big TODO */
/* cookie_ack */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* Can't come in orig dir */
/* shutdown_comp*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sCL, sCL, sHA},
/* heartbeat */ {sHS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA},
/* heartbeat_ack*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA}
/* sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS */
/* init */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCW, sHA, sCW},
/* init_ack */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},
/* abort */ {sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL},
/* shutdown */ {sCL, sCL, sCW, sCE, sSS, sSS, sSR, sSA, sCL, sSS, sCL},
/* shutdown_ack */ {sSA, sCL, sCW, sCE, sES, sSA, sSA, sSA, sSA, sHA, sSA},
/* error */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* Can't have Stale cookie*/
/* cookie_echo */ {sCL, sCL, sCE, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* 5.2.4 - Big TODO */
/* cookie_ack */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* Can't come in orig dir */
/* shutdown_comp*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sCL, sCL, sHA, sCL},
/* heartbeat */ {sHS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS},
/* heartbeat_ack*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS},
/* data/sack */ {sDS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS}
},
{
/* REPLY */
/* sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA */
/* init */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},/* INIT in sCL Big TODO */
/* init_ack */ {sIV, sCW, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},
/* abort */ {sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sIV, sCL},
/* shutdown */ {sIV, sCL, sCW, sCE, sSR, sSS, sSR, sSA, sIV, sSR},
/* shutdown_ack */ {sIV, sCL, sCW, sCE, sES, sSA, sSA, sSA, sIV, sHA},
/* error */ {sIV, sCL, sCW, sCL, sES, sSS, sSR, sSA, sIV, sHA},
/* cookie_echo */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},/* Can't come in reply dir */
/* cookie_ack */ {sIV, sCL, sCW, sES, sES, sSS, sSR, sSA, sIV, sHA},
/* shutdown_comp*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sCL, sIV, sHA},
/* heartbeat */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA},
/* heartbeat_ack*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA}
/* sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS */
/* init */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},/* INIT in sCL Big TODO */
/* init_ack */ {sIV, sCW, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},
/* abort */ {sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sIV, sCL, sIV},
/* shutdown */ {sIV, sCL, sCW, sCE, sSR, sSS, sSR, sSA, sIV, sSR, sIV},
/* shutdown_ack */ {sIV, sCL, sCW, sCE, sES, sSA, sSA, sSA, sIV, sHA, sIV},
/* error */ {sIV, sCL, sCW, sCL, sES, sSS, sSR, sSA, sIV, sHA, sIV},
/* cookie_echo */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},/* Can't come in reply dir */
/* cookie_ack */ {sIV, sCL, sCW, sES, sES, sSS, sSR, sSA, sIV, sHA, sIV},
/* shutdown_comp*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sCL, sIV, sHA, sIV},
/* heartbeat */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sHA},
/* heartbeat_ack*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA, sHA},
/* data/sack */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA, sHA},
}
};
......@@ -253,6 +258,11 @@ static int sctp_new_state(enum ip_conntrack_dir dir,
pr_debug("SCTP_CID_HEARTBEAT_ACK");
i = 10;
break;
case SCTP_CID_DATA:
case SCTP_CID_SACK:
pr_debug("SCTP_CID_DATA/SACK");
i = 11;
break;
default:
/* Other chunks like DATA or SACK do not change the state */
pr_debug("Unknown chunk type, Will stay in %s\n",
......@@ -306,7 +316,9 @@ sctp_new(struct nf_conn *ct, const struct sk_buff *skb,
ih->init_tag);
ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag;
} else if (sch->type == SCTP_CID_HEARTBEAT) {
} else if (sch->type == SCTP_CID_HEARTBEAT ||
sch->type == SCTP_CID_DATA ||
sch->type == SCTP_CID_SACK) {
pr_debug("Setting vtag %x for secondary conntrack\n",
sh->vtag);
ct->proto.sctp.vtag[IP_CT_DIR_ORIGINAL] = sh->vtag;
......@@ -392,19 +404,19 @@ int nf_conntrack_sctp_packet(struct nf_conn *ct,
if (!sctp_new(ct, skb, sh, dataoff))
return -NF_ACCEPT;
}
/* Check the verification tag (Sec 8.5) */
if (!test_bit(SCTP_CID_INIT, map) &&
!test_bit(SCTP_CID_SHUTDOWN_COMPLETE, map) &&
!test_bit(SCTP_CID_COOKIE_ECHO, map) &&
!test_bit(SCTP_CID_ABORT, map) &&
!test_bit(SCTP_CID_SHUTDOWN_ACK, map) &&
!test_bit(SCTP_CID_HEARTBEAT, map) &&
!test_bit(SCTP_CID_HEARTBEAT_ACK, map) &&
sh->vtag != ct->proto.sctp.vtag[dir]) {
pr_debug("Verification tag check failed\n");
goto out;
} else {
/* Check the verification tag (Sec 8.5) */
if (!test_bit(SCTP_CID_INIT, map) &&
!test_bit(SCTP_CID_SHUTDOWN_COMPLETE, map) &&
!test_bit(SCTP_CID_COOKIE_ECHO, map) &&
!test_bit(SCTP_CID_ABORT, map) &&
!test_bit(SCTP_CID_SHUTDOWN_ACK, map) &&
!test_bit(SCTP_CID_HEARTBEAT, map) &&
!test_bit(SCTP_CID_HEARTBEAT_ACK, map) &&
sh->vtag != ct->proto.sctp.vtag[dir]) {
pr_debug("Verification tag check failed\n");
goto out;
}
}
old_state = new_state = SCTP_CONNTRACK_NONE;
......@@ -464,6 +476,11 @@ int nf_conntrack_sctp_packet(struct nf_conn *ct,
} else if (ct->proto.sctp.flags & SCTP_FLAG_HEARTBEAT_VTAG_FAILED) {
ct->proto.sctp.flags &= ~SCTP_FLAG_HEARTBEAT_VTAG_FAILED;
}
} else if (sch->type == SCTP_CID_DATA || sch->type == SCTP_CID_SACK) {
if (ct->proto.sctp.vtag[dir] == 0) {
pr_debug("Setting vtag %x for dir %d\n", sh->vtag, dir);
ct->proto.sctp.vtag[dir] = sh->vtag;
}
}
old_state = ct->proto.sctp.state;
......@@ -684,6 +701,7 @@ sctp_timeout_nla_policy[CTA_TIMEOUT_SCTP_MAX+1] = {
[CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT] = { .type = NLA_U32 },
[CTA_TIMEOUT_SCTP_HEARTBEAT_SENT] = { .type = NLA_U32 },
[CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED] = { .type = NLA_U32 },
[CTA_TIMEOUT_SCTP_DATA_SENT] = { .type = NLA_U32 },
};
#endif /* CONFIG_NF_CONNTRACK_TIMEOUT */
......
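Note on the SCTP change: the new DATA_SENT column and data/sack row extend a table indexed as `[direction][chunk][current state]`. The reduced sketch below copies only the NONE, CLOSED, HEARTBEAT_SENT, HEARTBEAT_ACKED and DATA_SENT columns and the heartbeat, heartbeat_ack and data/sack rows from the new table. All `sk_` / `SK_` names are hypothetical, and the full kernel table has more states and chunk types.

```c
enum sk_state {
	SK_NONE,
	SK_CLOSED,
	SK_HB_SENT,
	SK_HB_ACKED,
	SK_DATA_SENT,
	SK_NSTATES,
	SK_INVALID = SK_NSTATES,	/* plays the role of sIV */
};

enum sk_dir { SK_ORIGINAL, SK_REPLY, SK_NDIRS };
enum sk_chunk { SK_HEARTBEAT, SK_HEARTBEAT_ACK, SK_DATA_SACK, SK_NCHUNKS };

/* Columns are the current state: NONE, CLOSED, HB_SENT, HB_ACKED, DATA_SENT */
static const unsigned char sk_table[SK_NDIRS][SK_NCHUNKS][SK_NSTATES] = {
	[SK_ORIGINAL] = {
		[SK_HEARTBEAT]     = { SK_HB_SENT,   SK_CLOSED, SK_HB_SENT, SK_HB_ACKED, SK_DATA_SENT },
		[SK_HEARTBEAT_ACK] = { SK_CLOSED,    SK_CLOSED, SK_HB_SENT, SK_HB_ACKED, SK_DATA_SENT },
		[SK_DATA_SACK]     = { SK_DATA_SENT, SK_CLOSED, SK_HB_SENT, SK_HB_ACKED, SK_DATA_SENT },
	},
	[SK_REPLY] = {
		[SK_HEARTBEAT]     = { SK_INVALID, SK_CLOSED, SK_HB_SENT,  SK_HB_ACKED, SK_HB_ACKED },
		[SK_HEARTBEAT_ACK] = { SK_INVALID, SK_CLOSED, SK_HB_ACKED, SK_HB_ACKED, SK_HB_ACKED },
		[SK_DATA_SACK]     = { SK_INVALID, SK_CLOSED, SK_HB_ACKED, SK_HB_ACKED, SK_HB_ACKED },
	},
};

/* Look up the next state; out-of-range inputs are reported as invalid. */
static enum sk_state sk_next(enum sk_dir dir, enum sk_chunk chunk,
			     enum sk_state cur)
{
	if ((unsigned int)dir >= SK_NDIRS ||
	    (unsigned int)chunk >= SK_NCHUNKS ||
	    (unsigned int)cur >= SK_NSTATES)
		return SK_INVALID;

	return (enum sk_state)sk_table[dir][chunk][cur];
}
```

Read this way, a DATA or SACK chunk on a new flow in the original direction yields DATA_SENT, and a DATA or SACK chunk seen in the reply direction afterwards moves the entry to HEARTBEAT_ACKED, matching the updated state comment in the diff.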
......@@ -602,6 +602,7 @@ enum nf_ct_sysctl_index {
NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_SENT,
NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_ACKED,
NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_DATA_SENT,
#endif
#ifdef CONFIG_NF_CT_PROTO_DCCP
NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST,
......@@ -892,6 +893,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_jiffies,
},
[NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_DATA_SENT] = {
.procname = "nf_conntrack_sctp_timeout_data_sent",
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec_jiffies,
},
#endif
#ifdef CONFIG_NF_CT_PROTO_DCCP
[NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST] = {
......@@ -1036,6 +1043,7 @@ static void nf_conntrack_standalone_init_sctp_sysctl(struct net *net,
XASSIGN(SHUTDOWN_ACK_SENT, sn);
XASSIGN(HEARTBEAT_SENT, sn);
XASSIGN(HEARTBEAT_ACKED, sn);
XASSIGN(DATA_SENT, sn);
#undef XASSIGN
#endif
}
......
......@@ -421,6 +421,10 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
if (ret == NF_DROP)
flow_offload_teardown(flow);
break;
default:
WARN_ON_ONCE(1);
ret = NF_DROP;
break;
}
return ret;
......@@ -682,6 +686,10 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
if (ret == NF_DROP)
flow_offload_teardown(flow);
break;
default:
WARN_ON_ONCE(1);
ret = NF_DROP;
break;
}
return ret;
......
......@@ -2873,8 +2873,8 @@ int nft_expr_inner_parse(const struct nft_ctx *ctx, const struct nlattr *nla,
return -EINVAL;
type = __nft_expr_type_get(ctx->family, tb[NFTA_EXPR_NAME]);
if (IS_ERR(type))
return PTR_ERR(type);
if (!type)
return -ENOENT;
if (!type->inner_ops)
return -EOPNOTSUPP;
......
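Note on the nft_expr_inner_parse() fix: the hunk replaces an IS_ERR() test with a NULL test, which suggests the lookup reports failure with NULL rather than an encoded errno. The userspace sketch below shows why an IS_ERR()-style test never fires for NULL. The `sketch_` helpers are stand-ins written for this note (the real macros live in include/linux/err.h), and `sketch_lookup` is a hypothetical lookup function.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer, as ERR_PTR() does in the kernel. */
static void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True only for pointers in the top SKETCH_MAX_ERRNO addresses. */
static bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical lookup: returns NULL, not an error pointer, when not found. */
static const char *sketch_lookup(const char *name)
{
	static const char known[] = "payload";

	return (name && strcmp(name, known) == 0) ? known : NULL;
}

/* Mirrors the corrected check: test for NULL and map it to -ENOENT. */
static int sketch_parse(const char *name)
{
	const char *type = sketch_lookup(name);

	if (!type)
		return -ENOENT;
	return 0;
}
```

With these definitions `sketch_is_err(NULL)` is false, so an IS_ERR()-only check would let a NULL `type` through to the following `type->inner_ops` dereference.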
......@@ -35,6 +35,8 @@ cleanup() {
for i in 1 2;do ip netns del nsrouter$i;done
}
trap cleanup EXIT
ipv4() {
echo -n 192.168.$1.2
}
......@@ -146,11 +148,17 @@ ip netns exec nsclient1 nft -f - <<EOF
table inet filter {
counter unknown { }
counter related { }
counter redir4 { }
counter redir6 { }
chain input {
type filter hook input priority 0; policy accept;
meta l4proto { icmp, icmpv6 } ct state established,untracked accept
icmp type "redirect" ct state "related" counter name "redir4" accept
icmpv6 type "nd-redirect" ct state "related" counter name "redir6" accept
meta l4proto { icmp, icmpv6 } ct state established,untracked accept
meta l4proto { icmp, icmpv6 } ct state "related" counter name "related" accept
counter name "unknown" drop
}
}
......@@ -279,5 +287,29 @@ else
echo "ERROR: icmp error RELATED state test has failed"
fi
cleanup
# add 'bad' route, expect icmp REDIRECT to be generated
ip netns exec nsclient1 ip route add 192.168.1.42 via 192.168.1.1
ip netns exec nsclient1 ip route add dead:1::42 via dead:1::1
ip netns exec "nsclient1" ping -q -c 2 192.168.1.42 > /dev/null
expect="packets 1 bytes 112"
check_counter nsclient1 "redir4" "$expect"
if [ $? -ne 0 ];then
ret=1
fi
ip netns exec "nsclient1" ping -c 1 dead:1::42 > /dev/null
expect="packets 1 bytes 192"
check_counter nsclient1 "redir6" "$expect"
if [ $? -ne 0 ];then
ret=1
fi
if [ $ret -eq 0 ];then
echo "PASS: icmp redirects had RELATED state"
else
echo "ERROR: icmp redirect RELATED state test has failed"
fi
exit $ret