Commits · a4e2405cc5d20ed6d58c4874325856e80e76a7f8 · nexedi / linux

10 Jul, 2015 4 commits

tcp: do not export tcp_init_xmit_timers() · a4e2405c

Eric Dumazet authored Jul 09, 2015

After commit 900f65d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4e2405c

bridge: mdb: fill state in br_mdb_notify · 09cf0211

Nikolay Aleksandrov authored Jul 09, 2015

Fill also the port group state when sending notifications.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09cf0211

route: remove unused variable in __mkroute_input · cb1c6168

Masatake YAMATO authored Jul 09, 2015

The flags local variable in __mkroute_input is never used.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb1c6168

ipv6: Nonlocal bind · 35a256fe

Tom Herbert authored Jul 08, 2015

Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.

This adds the ip_nonlocal_bind sysctl under ipv6.

Testing:

Set up nonlocal binding and receive routing on a host, e.g.:

ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1

Set up routing to 2001:0:0:1::/64 on peer to go to first host

ping6 -I 2001:0:0:1::1 peer-address -- to verify
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35a256fe

09 Jul, 2015 15 commits

Merge branch 'tw_cleanups' · 5a10ecec

David S. Miller authored Jul 09, 2015

Eric Dumazet says:

====================
inet: timewait cleanups

Another round of patches to make tw handling simpler.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5a10ecec

inet: inet_twsk_deschedule factorization · dbe7faa4

Eric Dumazet authored Jul 08, 2015

inet_twsk_deschedule() calls are followed by inet_twsk_put().

The only special case is in inet_twsk_purge(), but there is no point
in deferring the inet_twsk_put() until after BH is re-enabled.

Let's rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbe7faa4

inet: simplify timewait refcounting · fc01538f

Eric Dumazet authored Jul 08, 2015

Timewait sockets have complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash().

In particular, the deferred inet_twsk_put() added in commit
13475a30 ("tcp: connect() race with timewait reuse")
looks unnecessary: when removing a timewait socket from
ehash or bhash, the caller must own a reference on the socket
anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc01538f

inet: remove BUG_ON() in twsk_destructor() · 3fd2f1b9

Eric Dumazet authored Jul 08, 2015

The kernel will crash just the same if one of the pointers is NULL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3fd2f1b9

ipv6: use flag instead of u16 for hop in inet6_skb_parm · 8b58a398

Florian Westphal authored Jul 08, 2015

Hop was always either 0 or sizeof(struct ipv6hdr).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b58a398
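
The idea in the commit above (a field that only ever holds 0 or one fixed value can be replaced by a single flag bit) can be illustrated with a small user-space sketch. This is not the kernel code: `struct parm_sim`, `SIM_FLAG_HOPBYHOP`, and both helper names are hypothetical, and only the 40-byte size of the fixed IPv6 header is a known constant.

```c
#include <stdint.h>

#define IPV6_FIXED_HDR_LEN 40u  /* size of the fixed IPv6 header in bytes */
#define SIM_FLAG_HOPBYHOP 0x01u /* hypothetical flag bit for this sketch */

/* Hypothetical stand-in for the per-packet control block. */
struct parm_sim {
	uint16_t flags;
};

/* Record that a hop-by-hop header is present; no offset needs storing. */
static void mark_hop_by_hop(struct parm_sim *p)
{
	p->flags |= SIM_FLAG_HOPBYHOP;
}

/* Recover the offset on demand: it can only be 0 or the fixed header length. */
static unsigned int hop_offset(const struct parm_sim *p)
{
	return (p->flags & SIM_FLAG_HOPBYHOP) ? IPV6_FIXED_HDR_LEN : 0u;
}
```

In this model the flag costs one bit in an existing flags word instead of a dedicated 16-bit field, and the offset is derived only where it is needed.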

dsa: mv88e6352/mv88e6xxx: Add support for Marvell 88E6320 and 88E6321 · 7c3d0d67

Aleksey S. Kazantsev authored Jul 07, 2015

MV88E6320 and MV88E6321 are largely compatible with MV88E6352,
but are members of a different chip family.
Signed-off-by: Aleksey S. Kazantsev <ioctl@yandex.ru>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c3d0d67

Merge branch 'tcp-in-slow-start' · 986ca37e

David S. Miller authored Jul 09, 2015

Yuchung Cheng says:

====================
tcp: fixes some congestion control corner cases

This patch series fixes corner cases of TCP congestion control.
The first issue is to avoid continuing slow start when cwnd reaches ssthresh.
The second issue is the incorrect processing order of congestion state and
cwnd updates when entering fast recovery or undoing cwnd.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

986ca37e

tcp: update congestion state first before raising cwnd · b20a3fa3

Yuchung Cheng authored Jul 09, 2015

The congestion state and cwnd can be updated in the wrong order.
For example, upon receiving a dubious ACK, we incorrectly raise
the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
the state is still Open, then enter recovery state to reduce cwnd.

For another example, if the ACK indicates spurious timeout or
retransmits, we first revert the cwnd reduction and congestion
state back to Open state.  But we don't raise the cwnd even though
the ACK does not indicate any congestion.

To fix this problem we should first call tcp_fastretrans_alert() to
process the dubious ACK and update the congestion state, then call
tcp_may_raise_cwnd() that raises cwnd based on the current state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b20a3fa3

tcp: do not slow start when cwnd equals ssthresh · 76174004

Yuchung Cheng authored Jul 09, 2015

In the original design slow start is only used to raise cwnd
when cwnd is strictly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

76174004
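
The boundary case discussed in the commit above can be sketched in user space as follows. This is an illustrative model rather than the kernel helper: `struct cc_state` and both function names are hypothetical, and the "inclusive" variant is only an assumption about the behavior the commit message argues against.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two congestion fields discussed above. */
struct cc_state {
	uint32_t cwnd;     /* congestion window, in packets */
	uint32_t ssthresh; /* slow start threshold, in packets */
};

/* Assumed earlier policy: slow start continues while cwnd <= ssthresh. */
static bool in_slow_start_inclusive(const struct cc_state *s)
{
	return s->cwnd <= s->ssthresh;
}

/* Policy described by the commit: slow start only while cwnd < ssthresh. */
static bool in_slow_start_strict(const struct cc_state *s)
{
	return s->cwnd < s->ssthresh;
}
```

With cwnd equal to ssthresh (for example, right after recovery resets cwnd to ssthresh), the strict test returns false, so the sender would move straight to congestion avoidance instead of taking one more slow-start step.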

tcp: add tcp_in_slow_start helper · 071d5080

Yuchung Cheng authored Jul 09, 2015

Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

071d5080

net: skb_defer_rx_timestamp should check for phydev before setting up classify · 1007f59d

Alexander Duyck authored Jul 09, 2015

This change makes skb_defer_rx_timestamp() check for a phydev before
it manipulates the skb->data and skb->len values. This avoids
unnecessary work on network devices that don't support phydev and
reduces the total instruction count needed to process this on most
devices.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1007f59d

tcp: v1 always send a quick ack when quickacks are enabled · 2251ae46

Jon Maxwell authored Jul 08, 2015

V1 of this patch contains Eric Dumazet's suggestion to move the per
dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.

I ran some tests and after setting the "ip route change quickack 1"
knob there were still many delayed ACKs sent. This occurred
because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
subsequently ignored as tcp_in_quickack_mode() checks both these
values. The condition for a quick ack to trigger requires
that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
only icsk_ack.pingpong is controlled by the knob. But the
icsk_ack.quick value changes dynamically depending on heuristics.
The crux of the matter is that delayed acks still cannot be entirely
disabled even with the RTAX_QUICKACK per dst knob enabled. This
patch ensures that a quick ack is always sent when the RTAX_QUICKACK
per dst knob is turned on.

The "ip route change quickack 1" knob was recently added to enable
quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
The issue is that even with "ip route change quickack 1" enabled
we still see delayed ACKs under some conditions. It would be nice
to be able to completely disable delayed ACKs.

Here is an example:

# netstat -s|grep dela
    3 delayed acks sent

For all routes enable the knob

# ip route change quickack 1

Generate some traffic across a slow link and we still see the delayed
acks.

# netstat -s|grep dela
    106 delayed acks sent
    1 delayed acks further delayed because of locked socket

The issue is that both the "ip route change quickack 1" knob and
the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
However at the business end in the __tcp_ack_snd_check() routine,
tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
and icsk_ack.pingpong=0 in order to trigger a quickack. As
icsk_ack.quick is determined by heuristics it can be 0. When
that occurs the icsk_ack.pingpong value is ignored and a delayed
ACK is sent regardless.

This patch moves the RTAX_QUICKACK per dst check into the
tcp_in_quickack_mode() routine which ensures that a quickack is
always sent when the quickack knob is enabled for that dst.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2251ae46
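
The condition described in the commit above can be modeled in user space as a before/after pair. This is a sketch under stated assumptions, not the kernel routine: `struct ack_state` and both function names are hypothetical, and the three fields merely stand in for icsk_ack.quick, icsk_ack.pingpong, and the per-dst RTAX_QUICKACK metric named in the message.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the fields named in the commit. */
struct ack_state {
	unsigned int quick;  /* models icsk_ack.quick (set by heuristics) */
	bool pingpong;       /* models icsk_ack.pingpong */
	bool route_quickack; /* models the per-dst RTAX_QUICKACK knob */
};

/* Before: both conditions must hold, so quick == 0 still delays the ACK. */
static bool quickack_before(const struct ack_state *s)
{
	return s->quick != 0 && !s->pingpong;
}

/* After: the route knob is checked as well and overrides the heuristics. */
static bool quickack_after(const struct ack_state *s)
{
	return s->route_quickack || (s->quick != 0 && !s->pingpong);
}
```

The case that matters is quick == 0 with the knob enabled: the "before" test reports a delayed ACK, while the "after" test reports a quick ACK.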

rocker: add change MTU support · 77a58c74

Scott Feldman authored Jul 08, 2015

Implement ndo_change_mtu: on MTU change, reallocate Rx ring bufs and signal
HW of new port MTU value.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Tested-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

77a58c74

neterion: s2io: Use module_pci_driver · 910be1ab

Vaishali Thakkar authored Jul 09, 2015

Use module_pci_driver for drivers whose init and exit functions
only register and unregister, respectively.

A simplified version of the Coccinelle semantic patch that performs
this transformation is as follows:

@a@
identifier f, x;
@@
-static f(...) { return pci_register_driver(&x); }

@b depends on a@
identifier e, a.x;
statement S;
@@
-static e(...) {
-pci_unregister_driver(&x);
-DBG_PRINT(INIT_DBG,"S");
- }

@c depends on a && b@
identifier a.f;
declarer name module_init;
@@
-module_init(f);

@d depends on a && b && c@
identifier b.e, a.x;
declarer name module_exit;
declarer name module_pci_driver;
@@
-module_exit(e);
+module_pci_driver(x);
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

910be1ab

cxgb4vf: Fix check to use new User Doorbell mechanism · 71d3c0b4

Hariprasad Shenai authored Jul 09, 2015

If we don't have access to the new User GTS (T5+), use the old doorbell
mechanism; otherwise use the new BAR2 mechanism.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

71d3c0b4

08 Jul, 2015 21 commits

test_bpf: extend tests for 32-bit endianness conversion · ba29becd

Xi Wang authored Jul 08, 2015

Currently "ALU_END_FROM_BE 32" and "ALU_END_FROM_LE 32" do not test if
the upper bits of the result are zeros (the arm64 JIT had such bugs).
Extend the two tests to catch this.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ba29becd
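
The property the extended tests look for can be sketched in plain C. This is an illustrative model, not the test_bpf code: both function names are hypothetical, and the conversion assumes a little-endian host, so that "from big-endian" means a byte swap of the low 32 bits.

```c
#include <stdint.h>

/* Model of a 32-bit "from big-endian" conversion on a 64-bit register,
 * assuming a little-endian host: swap the low 4 bytes and zero the
 * upper 32 bits of the result. */
static uint64_t end_from_be32(uint64_t reg)
{
	uint32_t v = (uint32_t)reg;
	uint32_t swapped = (v >> 24) | ((v >> 8) & 0x0000ff00u) |
			   ((v << 8) & 0x00ff0000u) | (v << 24);

	return (uint64_t)swapped;
}

/* The extra check: a correct result has no stale bits above bit 31. */
static int upper_bits_clear(uint64_t result)
{
	return (result >> 32) == 0;
}
```

A test that compares only the low 32 bits would accept a result with garbage in the upper half; comparing the full 64-bit value, or checking `upper_bits_clear()`, catches that class of bug.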

Merge branch 'cxgb4-t6' · ca661a28

David S. Miller authored Jul 08, 2015

Hariprasad Shenai says:

====================
Cleanup, T6 changes and register range update

This patch series makes the following changes: use only the driver's
slice of the L2T table rather than the entire table, update register
ranges for the T6 adapter, read stats only for the available channels
on T6, and enable the cim_la dump for the T6 adapter as well.

This patch series has been created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca661a28

cxgb4: Enable cim_la dump to support T6 · b7660642

Hariprasad Shenai authored Jul 07, 2015

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b7660642

cxgb4: Read stats for only available channels · df459ebc

Hariprasad Shenai authored Jul 07, 2015

Update the driver to read the stats of only the available channels. T6 and
later have only 2 channels.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df459ebc

cxgb4: Update register ranges for T6 adapter · 5b4e83e1

Hariprasad Shenai authored Jul 07, 2015

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b4e83e1

cxgb4: Don't use entire L2T table, use only its slice · 5be9ed8d

Hariprasad Shenai authored Jul 07, 2015

The driver was retrieving the parameters for the bounds of its
slice of the L2T from the firmware and then throwing those away and
using the entire table. This corrects that problem.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5be9ed8d

net: ec_bhf: Use module_pci_driver · b11b6ed0

Vaishali Thakkar authored Jul 07, 2015

Use module_pci_driver for drivers whose init and exit functions
only register and unregister, respectively.

A simplified version of the Coccinelle semantic patch that performs
this transformation is as follows:

@a@
identifier f, x;
@@
-static f(...) { return pci_register_driver(&x); }

@b depends on a@
identifier e, a.x;
@@
-static e(...) { pci_unregister_driver(&x); }

@c depends on a && b@
identifier a.f;
declarer name module_init;
@@
-module_init(f);

@d depends on a && b && c@
identifier b.e, a.x;
declarer name module_exit;
declarer name module_pci_driver;
@@
-module_exit(e);
+module_pci_driver(x);
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b11b6ed0

hv_netvsc: Add support to set MTU reservation from guest side · f9cbce34

Haiyang Zhang authored Jul 06, 2015

When packet encapsulation is in use, the MTU needs to be reduced for
headroom reservation.
The existing code takes the updated MTU value only from the host side.
But vSwitch extensions, such as Open vSwitch, require the flexibility
to change the MTU to different values from within a guest during the
lifecycle of a vNIC, when the encapsulation protocol is changed. The
patch supports this kind of MTU change.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9cbce34

ifb: add multiqueue operation · 9e29e21a

Eric Dumazet authored Jul 06, 2015

Add multiqueue capabilities to ifb netdevice.

This removes the last bottleneck for ingress when the mq qdisc can be used
to shard load from multiple RX queues on a physical device.

Tested:

# netem based setup, installed at receiver side
ETH=eth0
IFB=ifb10
EST="est 1sec 4sec" # Optional rate estimator
RTT_HALF=2ms
#REORDER=20us
#LOSS="loss 1"
TXQ=8

ip link add ifb10 numtxqueues $TXQ type ifb
ip link set dev $IFB up

tc qdisc add dev $ETH ingress 2>/dev/null

tc filter add dev $ETH parent ffff: \
   protocol ip u32 match u32 0 0 flowid 1:1 \
	action mirred egress redirect dev $IFB

tc qdisc del dev $IFB root 2>/dev/null

tc qdisc add dev $IFB root handle 1: mq
for i in `seq 1 $TXQ`
do
 slot=$( printf %x $(( i )) )
 tc qd add dev $IFB parent 1:$slot $EST netem \
	limit 100000 delay $RTT_HALF $REORDER $LOSS
done

lpaa24:~# tc -s -d qd sh dev ifb10
qdisc mq 1: root
 Sent 316544766 bytes 5265927 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 98880b 1648p requeues 0
qdisc netem 8002: parent 1:1 limit 100000 delay 2.0ms
 Sent 39601416 bytes 658721 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38235Kbit 79657pps backlog 12240b 204p requeues 0
qdisc netem 8003: parent 1:2 limit 100000 delay 2.0ms
 Sent 39472866 bytes 657227 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38234Kbit 79655pps backlog 10620b 176p requeues 0
qdisc netem 8004: parent 1:3 limit 100000 delay 2.0ms
 Sent 39703417 bytes 659699 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38320Kbit 79831pps backlog 12780b 213p requeues 0
qdisc netem 8005: parent 1:4 limit 100000 delay 2.0ms
 Sent 39565149 bytes 658011 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38174Kbit 79530pps backlog 11880b 198p requeues 0
qdisc netem 8006: parent 1:5 limit 100000 delay 2.0ms
 Sent 39506078 bytes 657354 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38195Kbit 79571pps backlog 12480b 208p requeues 0
qdisc netem 8007: parent 1:6 limit 100000 delay 2.0ms
 Sent 39675994 bytes 658849 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38323Kbit 79838pps backlog 12600b 210p requeues 0
qdisc netem 8008: parent 1:7 limit 100000 delay 2.0ms
 Sent 39532042 bytes 658367 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38177Kbit 79536pps backlog 13140b 219p requeues 0
qdisc netem 8009: parent 1:8 limit 100000 delay 2.0ms
 Sent 39488164 bytes 657705 pkt (dropped 0, overlimits 0 requeues 0)
 rate 38192Kbit 79568pps backlog 13Kb 222p requeues 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e29e21a

cxgb4: Add PCI device ids for few more T5 and T6 adapters · 81aa5079

Hariprasad Shenai authored Jul 06, 2015

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

81aa5079

net/mlx4_core: Add extra check for total vfs for SRIOV · 0beb44b0

Carol Soto authored Jul 06, 2015

Add an extra check on total VFs for SRIOV to detect when that value is
bigger than the total VFs in the PCI SRIOV capabilities. Fix a check and
print of the maximum number of VFs that the HW can handle. Fix a check
and print of the maximum number of VFs per port that the driver can handle.
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0beb44b0

samples: bpf: enable trace samples for s390x · d912557b

Michael Holzheu authored Jul 06, 2015

The trace bpf samples do not compile on s390x because they use x86
specific fields from the "pt_regs" structure.

Fix this and access the fields via new PT_REGS macros.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d912557b

net: macb: Add SG support for Zynq SOC family · 7baaa909

Punnaiah Choudary Kalluri authored Jul 06, 2015

Enable SG support for Zynq SOC family devices.
Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7baaa909

xen-netback: remove duplicated function definition · 6ab13b27

Li, Liang Z authored Jul 06, 2015

There are two duplicated xenvif_zerocopy_callback() definitions.
Remove one of them.
Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6ab13b27

Merge branch 'sch_act_lockless' · 8685255e

David S. Miller authored Jul 08, 2015

Eric Dumazet says:

====================
net_sched: act: lockless operation

As mentioned by Alexei last week in Budapest, it is a bit weird
to take a spinlock in order to drop a packet in a tc filter...

Let's add percpu infra for tc actions and use it for gact & mirred.

Before changes, my host with 8 RX queues was handling 5 Mpps with gact,
and more than 11 Mpps after.

Mirred change is not yet visible if ifb+qdisc is used, as ifb is
not yet multi queue enabled, but is a step forward.
====================
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8685255e

net_sched: act_mirred: remove spinlock in fast path · 2ee22a90

Eric Dumazet authored Jul 06, 2015

Like act_gact, act_mirred can be lockless in packet processing

1) Use percpu stats
2) update lastuse only every clock tick to avoid false sharing
3) use rcu to protect tcfm_dev
4) Remove spinlock usage, as it is no longer needed.

Next step: add multiqueue capability to the ifb device
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ee22a90

net_sched: act_gact: remove spinlock in fast path · 56e5d1ca

Eric Dumazet authored Jul 06, 2015

Final step for gact RCU operation :

1) Use percpu stats
2) update lastuse only every clock tick to avoid false sharing
3) Remove spinlock acquisition, as it is no longer needed.

Since this is the last contended lock in packet RX when tc gact is used,
this gives an impressive gain.

My host with 8 RX queues was handling 5 Mpps before the patch,
and more than 11 Mpps after patch.

Tested:

On receiver :

dev=eth0
tc qdisc del dev $dev ingress 2>/dev/null
tc qdisc add dev $dev ingress
tc filter del dev $dev root pref 10 2>/dev/null
tc filter del dev $dev pref 10 2>/dev/null
tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop

The sender sends a packet flood from the 7/8 network
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

56e5d1ca
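
Steps 1 and 2 of the commit above (per-CPU stats, and writing lastuse only when the clock tick changes) can be modeled in user space. This is a minimal sketch under stated assumptions: `NR_CPUS_SIM`, `struct pcpu_stats`, `struct action_sim`, and both function names are hypothetical, and a flat array stands in for real per-CPU memory.

```c
#include <stdint.h>

#define NR_CPUS_SIM 4 /* hypothetical CPU count for the model */

/* One counter slot per CPU, so updating it needs no lock. Real per-CPU
 * data lives in separate per-CPU memory, which this array does not model. */
struct pcpu_stats {
	uint64_t packets[NR_CPUS_SIM];
	uint64_t bytes[NR_CPUS_SIM];
};

struct action_sim {
	struct pcpu_stats stats;
	uint64_t lastuse; /* tick of the most recent packet */
};

/* Fast path: bump this CPU's slot; write lastuse only if the tick changed,
 * so CPUs handling packets in the same tick do not keep dirtying it. */
static void account_packet(struct action_sim *a, int cpu, uint32_t len,
			   uint64_t now)
{
	a->stats.packets[cpu]++;
	a->stats.bytes[cpu] += len;
	if (a->lastuse != now)
		a->lastuse = now;
}

/* Slow path (stats dump): fold all per-CPU slots into one total. */
static uint64_t total_packets(const struct action_sim *a)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_SIM; cpu++)
		sum += a->stats.packets[cpu];
	return sum;
}
```

The trade-off in this model is that reading a total becomes more expensive (one pass over all slots), which is acceptable because stats dumps are rare compared with per-packet updates.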

net_sched: act_gact: read tcfg_ptype once · 8f2ae965

Eric Dumazet authored Jul 06, 2015

Third step for gact RCU operation :

The following patch will get rid of spinlock protection,
so we need to read tcfg_ptype once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f2ae965

net_sched: act_gact: use a separate packet counters for gact_determ() · cc6510a9

Eric Dumazet authored Jul 06, 2015

Second step for gact RCU operation :

We want to get rid of the spinlock protecting gact operations.
Stats (packets/bytes) will soon be per cpu.

gact_determ() would not work without a central packet counter,
so let's add it for this mode.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc6510a9

net_sched: act_gact: make tcfg_pval non zero · cef5ecf9

Eric Dumazet authored Jul 06, 2015

First step for gact RCU operation :

Instead of testing if tcfg_pval is zero or not, just make it 1.

No change in behavior, but slightly faster code.

The smp_rmb()/smp_wmb() barriers, while not strictly needed at this
stage, are added for the upcoming spinlock removal.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cef5ecf9

net: sched: add percpu stats to actions · 519c818e

Eric Dumazet authored Jul 06, 2015

Reuse existing percpu infrastructure John Fastabend added for qdisc.

This patch adds a new cpustats parameter to tcf_hash_create() and all
actions pass false, meaning this patch should have no effect yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

519c818e