Commit 5a1dc317 authored by David S. Miller's avatar David S. Miller

Merge branch 'ipfrags'

Jesper Dangaard Brouer says:

====================
This patchset is V2, with some trivial code fixes, which were noticed
by DaveM. It is still a partly respin of my fragmentation optimization
patches: http://thread.gmane.org/gmane.linux.network/250914

This is not the complete patchset, from the gmane link above. In this
patchset, I primarily focus on adjusting cacheline for better SMP/NUMA
performance.

Once this patchset have been agreed upon, I will continue and respin
the rest of my patches.

This time around, I have created a frag DoS generator, via the tool
trafgen (http://netsniff-ng.org/).  To create a stable DoS scenario
(no longer relying on frame dropping due to disabled flow-control).

Two 10G interfaces are under-test, and uses Ethernet flow-control.  A
third interface is used for generating the DoS attack (this interface
is also 10G, but it does not need to be, as 500Kpps DoS is enough).

Test types summary (netperf):
 Test-20G64K     == 2x10G with 65K fragments
 Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
 Test-20G64K+DoS == Same as 20G64K with frag DoS
 Test-20G3F+DoS  == Same as 20G3F  with frag DoS

Patch list:
 Patch-01 - net: cacheline adjust struct netns_frags for better frag performance
 Patch-02 - net: cacheline adjust struct inet_frags for better frag performance
 Patch-03 - net: cacheline adjust struct inet_frag_queue
 Patch-04 - net: frag helper functions for mem limit tracking
 Patch-05 - net: use lib/percpu_counter API for fragmentation mem accounting
 Patch-06 - net: frag, move LRU list maintenance outside of rwlock

Performance table summary:

 Test-type:  Test-20G64K    Test-20G3F  20G64K+DoS   20G3F+DoS
 ----------  -----------    ----------  ----------   ---------
  net-next:  15114.5 Mbit/s   8954.21     2444.28     3918.01 Mbit/s
  Patch-01:  16075.8 Mbit/s   8976.18     2621.49     4072.79 Mbit/s
  Patch-02:  17806.9 Mbit/s   9280.32     2478.62     4274.59 Mbit/s
  Patch-03:  17317.4 Mbit/s   9308.62     2546.05     4336.59 Mbit/s
  Patch-04:  17635.9 Mbit/s   9256.16     2535.25     4327.63 Mbit/s
  Patch-05:  18027.0 Mbit/s   9918.99     2492.62     3621.68 Mbit/s
  Patch-06:  18486.7 Mbit/s  10723.20     3657.85     4560.64 Mbit/s

 I cannot explain the under-DoS regression that patch-05/percpu_counter
 introduces.  But patch-06/LRU-lock corrects the situation again.

Below is a testlab setup description, with links to the trafgen DoS
packet config used.

Testlab
=======

Server setup
------------
The machine acting as a server:
 - 2x CPU (E5-2630)
 - Thus a NUMA arch/machine
 - 4x 10Gbit/s ports
 - NICs 2x Intel Dual port 82599 based (driver ixgbe)

Setup:
 - Interfaces uses Ethernet flow control
 - Flush all iptables
 - Remove all iptables related module.
 - Kill irqbalance
 - Pin each 10G NIC port to a *single* CPU each

Pinning can easily be done by command hacks::

 for x in /proc/irq/*/eth8*/../smp_affinity_list ; do echo 1 > $x; done
 for x in /proc/irq/*/eth9*/../smp_affinity_list ; do echo 3 > $x; done
 for x in /proc/irq/*/eth31*/../smp_affinity_list; do echo 6 > $x; done
 for x in /proc/irq/*/eth32*/../smp_affinity_list; do echo 8 > $x; done

Notice NUMA setting: The CPU to NIC tying is carefully choosen
according to the NUMA node setup.  Thus, NICs connected to a PCI-e
slot that is connected to a physical CPU socket are tied together.

Choosing only a single CPU per NIC (port) is just to ease provoking
and debugging this performance issue. (In real setups, you can choose
more CPU, just remember the NUMA node in the equation).

Tools
-----

Netperf is used, with option -T to ensure CPU binding.
The netserver processes, are NAPI pinned::

 numactl -m0 -c0 netserver
 numactl -m1 -c 1 netserver -p 1337

I now have a frag DoS generator, created via the tool:
  trafgen (see: http://netsniff-ng.org/)

Trafgen packet config file:
 http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet03_small_frag.txf

Notice, I'm using features of trafgen, recently developed by Daniel
Borkmann, thus you need the latest git tree to use my trafgen packet
config.

 git://github.com/borkmann/netsniff-ng.git

Command line:
 trafgen --dev eth51 --conf frag_packet03_small_frag.txf -V -k 100 --cpus 2

Tests types
-----------

Test(20G64K) UDP-64K 2x 10Gbit/s with no DoS traffic:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 export SIZE=$((65507)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
 netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
 netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
 wait $! && tail -n3 ${LOG}.* && \
 tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'

Test(20G3F) UDP-3xfrags 2x 10Gbit/s with no DoS traffic:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 export SIZE=$((3*1472)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
 netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
 netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
 wait $! && tail -n3 ${LOG}.* && \
tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'

Awk script for summming results:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
====================
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 656a05c8 3ef0eb0d
#ifndef __NET_FRAG_H__
#define __NET_FRAG_H__
#include <linux/percpu_counter.h>
struct netns_frags {
int nqueues;
atomic_t mem;
struct list_head lru_list;
spinlock_t lru_lock;
/* The percpu_counter "mem" need to be cacheline aligned.
* mem.count must not share cacheline with other writers
*/
struct percpu_counter mem ____cacheline_aligned_in_smp;
/* sysctls */
int timeout;
......@@ -13,12 +20,11 @@ struct netns_frags {
};
struct inet_frag_queue {
struct hlist_node list;
struct netns_frags *net;
struct list_head lru_list; /* lru list member */
spinlock_t lock;
atomic_t refcnt;
struct timer_list timer; /* when will this queue expire? */
struct list_head lru_list; /* lru list member */
struct hlist_node list;
atomic_t refcnt;
struct sk_buff *fragments; /* list of received fragments */
struct sk_buff *fragments_tail;
ktime_t stamp;
......@@ -31,24 +37,29 @@ struct inet_frag_queue {
#define INET_FRAG_LAST_IN 1
u16 max_size;
struct netns_frags *net;
};
#define INETFRAGS_HASHSZ 64
struct inet_frags {
struct hlist_head hash[INETFRAGS_HASHSZ];
rwlock_t lock;
u32 rnd;
int qsize;
/* This rwlock is a global lock (seperate per IPv4, IPv6 and
* netfilter). Important to keep this on a seperate cacheline.
*/
rwlock_t lock ____cacheline_aligned_in_smp;
int secret_interval;
struct timer_list secret_timer;
u32 rnd;
int qsize;
unsigned int (*hashfn)(struct inet_frag_queue *);
bool (*match)(struct inet_frag_queue *q, void *arg);
void (*constructor)(struct inet_frag_queue *q,
void *arg);
void (*destructor)(struct inet_frag_queue *);
void (*skb_free)(struct sk_buff *);
bool (*match)(struct inet_frag_queue *q, void *arg);
void (*frag_expire)(unsigned long data);
};
......@@ -72,4 +83,59 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
inet_frag_destroy(q, f, NULL);
}
/* Memory Tracking Functions. */
/* The default percpu_counter batch size is not big enough to scale to
* fragmentation mem acct sizes.
* The mem size of a 64K fragment is approx:
* (44 fragments * 2944 truesize) + frag_queue struct(200) = 129736 bytes
*/
static unsigned int frag_percpu_counter_batch = 130000;
static inline int frag_mem_limit(struct netns_frags *nf)
{
return percpu_counter_read(&nf->mem);
}
static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
{
__percpu_counter_add(&q->net->mem, -i, frag_percpu_counter_batch);
}
static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
{
__percpu_counter_add(&q->net->mem, i, frag_percpu_counter_batch);
}
static inline void init_frag_mem_limit(struct netns_frags *nf)
{
percpu_counter_init(&nf->mem, 0);
}
static inline int sum_frag_mem_limit(struct netns_frags *nf)
{
return percpu_counter_sum_positive(&nf->mem);
}
static inline void inet_frag_lru_move(struct inet_frag_queue *q)
{
spin_lock(&q->net->lru_lock);
list_move_tail(&q->lru_list, &q->net->lru_list);
spin_unlock(&q->net->lru_lock);
}
static inline void inet_frag_lru_del(struct inet_frag_queue *q)
{
spin_lock(&q->net->lru_lock);
list_del(&q->lru_list);
spin_unlock(&q->net->lru_lock);
}
static inline void inet_frag_lru_add(struct netns_frags *nf,
struct inet_frag_queue *q)
{
spin_lock(&nf->lru_lock);
list_add_tail(&q->lru_list, &nf->lru_list);
spin_unlock(&nf->lru_lock);
}
#endif
......@@ -288,7 +288,7 @@ static inline int ip6_frag_nqueues(struct net *net)
static inline int ip6_frag_mem(struct net *net)
{
return atomic_read(&net->ipv6.frags.mem);
return sum_frag_mem_limit(&net->ipv6.frags);
}
#endif
......
......@@ -73,8 +73,9 @@ EXPORT_SYMBOL(inet_frags_init);
void inet_frags_init_net(struct netns_frags *nf)
{
nf->nqueues = 0;
atomic_set(&nf->mem, 0);
init_frag_mem_limit(nf);
INIT_LIST_HEAD(&nf->lru_list);
spin_lock_init(&nf->lru_lock);
}
EXPORT_SYMBOL(inet_frags_init_net);
......@@ -91,6 +92,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
local_bh_disable();
inet_frag_evictor(nf, f, true);
local_bh_enable();
percpu_counter_destroy(&nf->mem);
}
EXPORT_SYMBOL(inet_frags_exit_net);
......@@ -98,9 +101,9 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
{
write_lock(&f->lock);
hlist_del(&fq->list);
list_del(&fq->lru_list);
fq->net->nqueues--;
write_unlock(&f->lock);
inet_frag_lru_del(fq);
}
void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
......@@ -117,12 +120,8 @@ void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
EXPORT_SYMBOL(inet_frag_kill);
static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
struct sk_buff *skb, int *work)
struct sk_buff *skb)
{
if (work)
*work -= skb->truesize;
atomic_sub(skb->truesize, &nf->mem);
if (f->skb_free)
f->skb_free(skb);
kfree_skb(skb);
......@@ -133,6 +132,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
{
struct sk_buff *fp;
struct netns_frags *nf;
unsigned int sum, sum_truesize = 0;
WARN_ON(!(q->last_in & INET_FRAG_COMPLETE));
WARN_ON(del_timer(&q->timer) != 0);
......@@ -143,13 +143,14 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
while (fp) {
struct sk_buff *xp = fp->next;
frag_kfree_skb(nf, f, fp, work);
sum_truesize += fp->truesize;
frag_kfree_skb(nf, f, fp);
fp = xp;
}
sum = sum_truesize + f->qsize;
if (work)
*work -= f->qsize;
atomic_sub(f->qsize, &nf->mem);
*work -= sum;
sub_frag_mem_limit(q, sum);
if (f->destructor)
f->destructor(q);
......@@ -164,22 +165,23 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
int work, evicted = 0;
if (!force) {
if (atomic_read(&nf->mem) <= nf->high_thresh)
if (frag_mem_limit(nf) <= nf->high_thresh)
return 0;
}
work = atomic_read(&nf->mem) - nf->low_thresh;
work = frag_mem_limit(nf) - nf->low_thresh;
while (work > 0) {
read_lock(&f->lock);
spin_lock(&nf->lru_lock);
if (list_empty(&nf->lru_list)) {
read_unlock(&f->lock);
spin_unlock(&nf->lru_lock);
break;
}
q = list_first_entry(&nf->lru_list,
struct inet_frag_queue, lru_list);
atomic_inc(&q->refcnt);
read_unlock(&f->lock);
spin_unlock(&nf->lru_lock);
spin_lock(&q->lock);
if (!(q->last_in & INET_FRAG_COMPLETE))
......@@ -233,9 +235,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
hlist_add_head(&qp->list, &f->hash[hash]);
list_add_tail(&qp->lru_list, &nf->lru_list);
nf->nqueues++;
write_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
......@@ -250,7 +252,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
q->net = nf;
f->constructor(q, arg);
atomic_add(f->qsize, &nf->mem);
add_frag_mem_limit(q, f->qsize);
setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
spin_lock_init(&q->lock);
atomic_set(&q->refcnt, 1);
......
......@@ -122,7 +122,7 @@ int ip_frag_nqueues(struct net *net)
int ip_frag_mem(struct net *net)
{
return atomic_read(&net->ipv4.frags.mem);
return sum_frag_mem_limit(&net->ipv4.frags);
}
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
......@@ -161,13 +161,6 @@ static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
qp->user == arg->user;
}
/* Memory Tracking Functions. */
static void frag_kfree_skb(struct netns_frags *nf, struct sk_buff *skb)
{
atomic_sub(skb->truesize, &nf->mem);
kfree_skb(skb);
}
static void ip4_frag_init(struct inet_frag_queue *q, void *a)
{
struct ipq *qp = container_of(q, struct ipq, q);
......@@ -340,6 +333,7 @@ static inline int ip_frag_too_far(struct ipq *qp)
static int ip_frag_reinit(struct ipq *qp)
{
struct sk_buff *fp;
unsigned int sum_truesize = 0;
if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
atomic_inc(&qp->q.refcnt);
......@@ -349,9 +343,12 @@ static int ip_frag_reinit(struct ipq *qp)
fp = qp->q.fragments;
do {
struct sk_buff *xp = fp->next;
frag_kfree_skb(qp->q.net, fp);
sum_truesize += fp->truesize;
kfree_skb(fp);
fp = xp;
} while (fp);
sub_frag_mem_limit(&qp->q, sum_truesize);
qp->q.last_in = 0;
qp->q.len = 0;
......@@ -496,7 +493,8 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
qp->q.fragments = next;
qp->q.meat -= free_it->len;
frag_kfree_skb(qp->q.net, free_it);
sub_frag_mem_limit(&qp->q, free_it->truesize);
kfree_skb(free_it);
}
}
......@@ -519,7 +517,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
qp->q.stamp = skb->tstamp;
qp->q.meat += skb->len;
qp->ecn |= ecn;
atomic_add(skb->truesize, &qp->q.net->mem);
add_frag_mem_limit(&qp->q, skb->truesize);
if (offset == 0)
qp->q.last_in |= INET_FRAG_FIRST_IN;
......@@ -531,9 +529,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
qp->q.meat == qp->q.len)
return ip_frag_reasm(qp, prev, dev);
write_lock(&ip4_frags.lock);
list_move_tail(&qp->q.lru_list, &qp->q.net->lru_list);
write_unlock(&ip4_frags.lock);
inet_frag_lru_move(&qp->q);
return -EINPROGRESS;
err:
......@@ -617,7 +613,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
atomic_add(clone->truesize, &qp->q.net->mem);
add_frag_mem_limit(&qp->q, clone->truesize);
}
skb_push(head, head->data - skb_network_header(head));
......@@ -645,7 +641,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
}
fp = next;
}
atomic_sub(sum_truesize, &qp->q.net->mem);
sub_frag_mem_limit(&qp->q, sum_truesize);
head->next = NULL;
head->dev = dev;
......
......@@ -319,7 +319,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->q.meat += skb->len;
if (payload_len > fq->q.max_size)
fq->q.max_size = payload_len;
atomic_add(skb->truesize, &fq->q.net->mem);
add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
......@@ -328,9 +328,8 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->nhoffset = nhoff;
fq->q.last_in |= INET_FRAG_FIRST_IN;
}
write_lock(&nf_frags.lock);
list_move_tail(&fq->q.lru_list, &fq->q.net->lru_list);
write_unlock(&nf_frags.lock);
inet_frag_lru_move(&fq->q);
return 0;
discard_fq:
......@@ -398,7 +397,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
clone->ip_summed = head->ip_summed;
NFCT_FRAG6_CB(clone)->orig = NULL;
atomic_add(clone->truesize, &fq->q.net->mem);
add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
......@@ -422,7 +421,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
}
atomic_sub(head->truesize, &fq->q.net->mem);
sub_frag_mem_limit(&fq->q, head->truesize);
head->local_df = 1;
head->next = NULL;
......
......@@ -327,7 +327,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
}
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
atomic_add(skb->truesize, &fq->q.net->mem);
add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
......@@ -341,9 +341,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->q.meat == fq->q.len)
return ip6_frag_reasm(fq, prev, dev);
write_lock(&ip6_frags.lock);
list_move_tail(&fq->q.lru_list, &fq->q.net->lru_list);
write_unlock(&ip6_frags.lock);
inet_frag_lru_move(&fq->q);
return -1;
discard_fq:
......@@ -429,7 +427,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
atomic_add(clone->truesize, &fq->q.net->mem);
add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
......@@ -467,7 +465,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
}
fp = next;
}
atomic_sub(sum_truesize, &fq->q.net->mem);
sub_frag_mem_limit(&fq->q, sum_truesize);
head->next = NULL;
head->dev = dev;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment