Commits · 8df2914598c5300a937760daa271fdbddce1108c · Kirill Smelkov / linux

12 May, 2015 11 commits

Merge branch 'netdev_page_frags' · 8df29145

David S. Miller authored May 12, 2015

Alexander Duyck says:

====================
Refactor netdev page frags and move them into mm/

This patch series addresses several things.

First I found an issue in the performance of the pfmemalloc check from
build_skb.  To work around it I have provided a cached copy of pfmemalloc
to be used in __netdev_alloc_skb and __napi_alloc_skb.

Second I moved the page fragment allocation logic into the mm tree and
added functionality for freeing page fragments.  I had to fix igb before I
could do this as it was using a reference to NETDEV_FRAG_PAGE_MAX_SIZE
incorrectly.

Finally I went through and replaced all of the duplicate code that was
calling put_page and replaced it with calls to skb_free_frag.

With these changes in place a simple receive and drop test increased from a
packet rate of 8.9Mpps to 9.8Mpps.  The gains break down as follows:

8.9Mpps	Before			9.8Mpps	After
------------------------	------------------------
7.8%	put_compound_page	9.1%	__free_page_frag
3.9%	skb_free_head
1.1%	put_page

4.9%	build_skb		3.8%	__napi_alloc_skb
2.5%	__alloc_rx_skb
1.9%	__napi_alloc_skb
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x, tg3: Replace put_page(virt_to_head_page()) with skb_free_frag() · e51423d9

Alexander Duyck authored May 06, 2015

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hisilicon: Replace put_page(virt_to_head_page()) with skb_free_frag() · edea5845

Alexander Duyck authored May 06, 2015

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1000: Replace e1000_free_frag with skb_free_frag · 6bf93ba8

Alexander Duyck authored May 06, 2015

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mvneta: Replace put_page(virt_to_head_page(ptr)) w/ skb_free_frag · 13dc0d2b

Alexander Duyck authored May 06, 2015

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netcp: Replace put_page(virt_to_head_page(ptr)) w/ skb_free_frag · 7d525c4e

Alexander Duyck authored May 06, 2015

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add skb_free_frag to replace use of put_page in freeing skb->head · 181edb2b

Alexander Duyck authored May 06, 2015

This change adds a function called skb_free_frag which is meant to
complement the function netdev_alloc_frag. The general idea is to enable a
more lightweight version of page freeing since we don't actually need all
the overhead of a put_page, and we don't quite fit the model of __free_pages.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mm/net: Rename and move page fragment handling from net/ to mm/ · b63ae8ca

Alexander Duyck authored May 06, 2015

This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea is to help make
this maintainable by placing it with other page allocation functions.

Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.

I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a significant
size reduction by reducing code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
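
The fragment scheme described in this message can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: BLOCK_SIZE, struct block_hdr, struct frag_cache, frag_to_block, frag_alloc and frag_free are all hypothetical names, an aligned heap block stands in for a compound page, and masking the address stands in for virt_to_head_page.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 32768u   /* hypothetical stand-in for a 32KB compound page */

/* Header at the start of each block; plays the role of the head page. */
struct block_hdr {
    unsigned int refcnt;    /* outstanding fragments plus the cache's own ref */
};

/* Allocator state; caches the virtual address rather than a page pointer. */
struct frag_cache {
    char *va;               /* start of the current block, NULL if none */
    unsigned int offset;    /* fragments are carved downward from here */
};

/* Map any fragment address back to its block header. */
static struct block_hdr *frag_to_block(void *addr)
{
    return (struct block_hdr *)((uintptr_t)addr & ~(uintptr_t)(BLOCK_SIZE - 1));
}

/* Drop one reference. Returns 1 if the block was released, 0 otherwise. */
static int frag_free(void *addr)
{
    struct block_hdr *blk = frag_to_block(addr);

    if (--blk->refcnt)
        return 0;
    free(blk);
    return 1;
}

/* Hand out fragsz bytes, refilling the cache when the block is used up. */
static void *frag_alloc(struct frag_cache *fc, unsigned int fragsz)
{
    if (fragsz == 0 || fragsz > BLOCK_SIZE - sizeof(struct block_hdr))
        return NULL;

    if (!fc->va || fc->offset < sizeof(struct block_hdr) + fragsz) {
        struct block_hdr *blk;

        if (fc->va)
            frag_free(fc->va);          /* drop the cache's own reference */
        blk = aligned_alloc(BLOCK_SIZE, BLOCK_SIZE);
        if (!blk) {
            fc->va = NULL;
            return NULL;
        }
        blk->refcnt = 1;                /* the cache's own reference */
        fc->va = (char *)blk;
        fc->offset = BLOCK_SIZE;
    }

    fc->offset -= fragsz;
    frag_to_block(fc->va)->refcnt++;    /* one reference per fragment */
    return fc->va + fc->offset;
}
```

In this model a caller frees a fragment with nothing but its address, which is the property the message attributes to __free_page_frag working from the skb->head pointer.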

net: Store virtual address instead of page in netdev_alloc_cache · 0e392508

Alexander Duyck authored May 06, 2015

This change makes it so that we store the virtual address of the page
in the netdev_alloc_cache instead of the page pointer. The idea behind
this is to avoid multiple calls to page_address since the virtual address
is required for every access, but the page pointer is only needed at
allocation or reset of the page.

While I was at it I also reordered the netdev_alloc_cache structure a bit
so that the size is always 16 bytes by dropping size in the case where
PAGE_SIZE is greater than or equal to 32KB.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Don't use NETDEV_FRAG_PAGE_MAX_SIZE in descriptor calculation · 2ee52ad4

Alexander Duyck authored May 06, 2015

This change updates igb so that it will correctly perform the descriptor
count calculation.  Previously it was taking NETDEV_FRAG_PAGE_MAX_SIZE
into account, which isn't really correct since a different value is used to
determine the size of the pages used for TCP.  That is actually determined
by SKB_FRAG_PAGE_ORDER.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Use cached copy of pfmemalloc to avoid accessing page · 9451980a

Alexander Duyck authored May 06, 2015

While testing I found that the testing for pfmemalloc in build_skb was
rather expensive.  I found the issue to be two-fold.  First we have to get
from the virtual address to the head page and that comes at the cost of
something like 11 cycles.  Then there is the cost for reading pfmemalloc out
of the head page which can be cache cold due to the fact that
put_page_testzero is likely invalidating the cache-line on one or more
CPUs as the fragments can be shared.

To avoid this extra expense I have added a pfmemalloc member to the
netdev_alloc_cache.  I then pushed pieces of __alloc_rx_skb into
__napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
make use of the cached pfmemalloc value.  The result is that my perf traces
show a reduction from 9.28% overhead to 3.7% for the code covered by
build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
the packet being dropped instead of being handed to napi_gro_receive.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11 May, 2015 20 commits

net: sched: deprecate enqueue_root() · b396cca6

Eric Dumazet authored May 11, 2015

The only remaining enqueue_root() user is netem, and it does not look necessary:

qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Use one return statement instead of two · 3824246d

Michal Simek authored May 11, 2015

Use one return statement instead of two to simplify the code.
Both are returning the same value.
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: add support of ethtool get_regs · db65f35f

Philippe Reynes authored May 11, 2015

This enables the ethtool's "-d" and "--register-dump"
options for fec devices.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: fix typo in net_device ifdef · 4cda01e8

Daniel Borkmann authored May 11, 2015

This should have been #ifdef not #if.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: d2788d34 ("net: sched: further simplify handle_ing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'handle_ing_lightweight' · 3bb45001

David S. Miller authored May 11, 2015

Daniel Borkmann says:

====================
handle_ing update

These are a couple of cleanups to make ingress a bit more lightweight.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: further simplify handle_ing · d2788d34

Daniel Borkmann authored May 09, 2015

Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).

It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a60 ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.

The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.

We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.

One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.

Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().

Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: consolidate handle_ing and ing_filter · c9e99fd0

Daniel Borkmann authored May 09, 2015

Given quite some code has been removed from ing_filter(), we can just
consolidate that function into handle_ing() and get rid of a few
instructions at the same time.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

test: bpf: extend "load 64-bit immediate" testcase · 986ccfdb

Xi Wang authored May 09, 2015

Extend the testcase to catch a signedness bug in the arm64 JIT:

test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times)

This is useful to ensure other JITs won't have a similar bug.

Link: https://lkml.org/lkml/2015/5/8/458
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bonding_netlink_lacp' · 32f89e5c

David S. Miller authored May 11, 2015

Jonathan Toppins says:

====================
add netlink support for new lacp bonding parameters

This is a resubmit of Mahesh's last 3 bonding patches from this series
(http://marc.info/?l=linux-netdev&m=142432864626179&w=2) with one
additional kernel patch which adds the netlink bits. I have noted any
modifications I did to the original patches just above my signoff line.
Patch 5 is the iproute2 support for these bonding options. All patches
were coded against the net-next branch of their respective projects.

v2:
  * rebased
  * only send these new parameters via netlink when bond is in mode 4
  * fixed ad_actor_sys_prio to be 0xFFFF by default even when the bond
    is initially created in mode 0 and switched to mode 4

v3:
  * reverted changes to bond_option_ad_actor_system_set() from v1 in Mahesh's
    patch "bonding: Allow userspace to set actors' macaddr in an AD-system."
    Instead implementing all setting in the option specific set function as
    Nik suggested.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: add netlink support for sys prio, actor sys mac, and port key · 171a42c3

Andy Gospodarek authored May 09, 2015

Adds netlink support for the following bonding options:
* BOND_OPT_AD_ACTOR_SYS_PRIO
* BOND_OPT_AD_ACTOR_SYSTEM
* BOND_OPT_AD_USER_PORT_KEY

When setting the actor system mac address we assume the netlink message
contains a binary mac and not a string representation of a mac.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
[jt: completed the setting side of the netlink attributes]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: Implement user key part of port_key in an AD system. · d22a5fc0

Mahesh Bandewar authored May 09, 2015

The port key has three components - user-key, speed-part, and duplex-part.
The LSBit is for the duplex-part, next 5 bits are for the speed while the
remaining 10 bits are the user defined key bits. Get these 10 bits
from the user-space (through the SysFs interface) and use it to form the
admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
it is not provided then use zero for the user-key-bits (default).

It can be set using the following example code -

   # modprobe bonding mode=4
   # usr_port_key=$(( RANDOM & 0x3FF ))
   # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
   # echo +eth1 > /sys/class/net/bond0/bonding/slaves
   ...
   # ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
     * fixed up context from change in ad_actor_sys_prio patch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
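
The bit layout described above (duplex in the least significant bit, then 5 speed bits, then 10 user-key bits) can be sketched as follows. This is an illustration only: make_port_key and port_key_user_part are hypothetical helpers, and the speed field is treated as an opaque 5-bit value because the message does not say how speeds are encoded.

```c
#include <stdint.h>

#define KEY_DUPLEX_BITS   1      /* bit 0 */
#define KEY_SPEED_BITS    5      /* bits 1-5 */
#define KEY_USER_MAX      1023u  /* 10 bits, range 0 - 1023 */

/*
 * Compose a 16-bit port key from its three parts.
 * Returns 0 on success, -1 if any part does not fit its field.
 */
static int make_port_key(unsigned int user_key, unsigned int speed,
                         unsigned int duplex, uint16_t *key)
{
    if (user_key > KEY_USER_MAX || speed > 31u || duplex > 1u)
        return -1;

    *key = (uint16_t)((user_key << (KEY_DUPLEX_BITS + KEY_SPEED_BITS)) |
                      (speed << KEY_DUPLEX_BITS) |
                      duplex);
    return 0;
}

/* Recover the user-defined part (upper 10 bits) from a port key. */
static unsigned int port_key_user_part(uint16_t key)
{
    return key >> (KEY_DUPLEX_BITS + KEY_SPEED_BITS);
}
```

With the default user key of zero the upper 10 bits stay clear, so the key is determined by the speed and duplex parts alone.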

bonding: Allow userspace to set actors' macaddr in an AD-system. · 74514957

Mahesh Bandewar authored May 09, 2015

In an AD system, the communication between actor and partner is the
business between these two entities. In the current setup anyone on the
same L2 can "guess" the LACPDU contents and then possibly send the
spoofed LACPDUs and trick the partner causing connectivity issues for
the AD system. This patch allows the use of a random mac-address, obscuring
its identity and making it harder for someone on the same L2 to do the same thing.

This patch allows user-space to choose the mac-address for the AD-system.
This mac-address can not be NULL or a Multicast. If the mac-address is set
from user-space, the kernel will honor it and will not overwrite it. In the
absence of a value from user space, the logic will default to using the
master's mac as the mac-address for the AD-system.

It can be set using example code below -

   # modprobe bonding mode=4
   # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                    $(( (RANDOM & 0xFE) | 0x02 )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )))
   # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
   # echo +eth1 > /sys/class/net/bond0/bonding/slaves
   ...
   # ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: fixed up style issues reported by checkpatch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
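
The constraint stated above (the address must be neither NULL nor multicast) can be sketched as a small check. This is a userspace illustration with hypothetical helper names, not the code added by the patch. It also shows why the shell example masks the first octet with 0xFE: that clears the multicast bit.

```c
#include <stdint.h>

/* Returns 1 if all six octets are zero (the "NULL" address). */
static int mac_is_zero(const uint8_t mac[6])
{
    int i;

    for (i = 0; i < 6; i++)
        if (mac[i] != 0)
            return 0;
    return 1;
}

/* Returns 1 if the group (multicast) bit of the first octet is set. */
static int mac_is_multicast(const uint8_t mac[6])
{
    return (mac[0] & 0x01) != 0;
}

/* Returns 1 if the address would be acceptable under the stated rule. */
static int actor_system_mac_ok(const uint8_t mac[6])
{
    return !mac_is_zero(mac) && !mac_is_multicast(mac);
}
```

The broadcast address ff:ff:ff:ff:ff:ff is rejected by the same check, since its group bit is set.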

bonding: Allow userspace to set actors' system_priority in AD system · 6791e466

Mahesh Bandewar authored May 09, 2015

This patch allows the user to randomize the system-priority in an ad-system.
The allowed range is 1 - 0xFFFF while the default value is 0xFFFF. If the user
does not specify this value, the system defaults to 0xFFFF, which is
what it was before this patch.

The following example code could set the value -
    # modprobe bonding mode=4
    # sys_prio=$(( 1 + RANDOM + RANDOM ))
    # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
    # echo +eth1 > /sys/class/net/bond0/bonding/slaves
    ...
    # ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
     * changed how the default value is set in bond_check_params(), this
       makes the default consistent between what gets set for a new bond
       and what the default is claimed to be in the bonding options.]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'kernel_socket_netns' · 0198e09c

David S. Miller authored May 11, 2015

Eric W. Biederman says:

====================
Cleanup the kernel sockets.

Right now the situation for allocating kernel sockets is a mess.
- sock_create_kern does not take a namespace parameter.
- kernel sockets must not reference count a network namespace and keep
  it alive or else we will have a reference counting loop.
- The way we avoid the reference counting loop with sk_change_net
  and sk_release_kernel are major hacks.

This patchset addresses this mess by fixing sock_create_kern to do
everything necessary to create a kernel socket.  None of the current
users of kernel sockets need the network namespace reference counted.
Either kernel sockets are network namespace aware (and using the current
hacks) or kernel sockets are limited to the initial network namespace
in which case it does not matter.

This patchset starts by addressing tun which should be using normal
userspace sockets like macvtap.

Then sock_create_kern is fixed to take a network namespace.
Then the in-kernel status of sockets is passed through to sk_alloc.
Then sk_alloc is fixed to not reference count the network namespace
     of kernel sockets.
Then the callers of sock_create_kern are fixed up to stop using hacks.
Then netlink which uses its own flavor of sock_create_kern is fixed.

Finally the hacks that are sk_change_net and sk_release_kernel are removed.

When it is all done the code is easier to follow, easier to use, easier
to maintain and shorter by about 70 lines.
====================
Reported-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: kill sk_change_net and sk_release_kernel · affb9792

Eric W. Biederman authored May 08, 2015

These functions are no longer needed and no longer used, so kill them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: Create kernel netlink sockets in the proper network namespace · 13d3078e

Eric W. Biederman authored May 08, 2015

Utilize the new functionality of sk_alloc so that nothing needs to be
done to suppress the reference counting on kernel sockets.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143

Eric W. Biederman authored May 08, 2015

Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.

Keep track of whether a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.

Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel, as those hacks are no longer
needed to avoid reference counting a kernel socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
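
The refcounting rule described above can be pictured with a small userspace model. It is a sketch under stated assumptions, not the kernel code: struct model_net, struct model_sock, model_sk_alloc and model_sk_free are hypothetical, and a plain integer stands in for the namespace reference count. Only the field name sk_net_refcnt is taken from the message.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a network namespace and its reference count. */
struct model_net {
    int count;
};

/* Hypothetical stand-in for struct sock. */
struct model_sock {
    struct model_net *net;
    int sk_net_refcnt;      /* 1 if this socket holds a netns reference */
};

/* Allocate a socket; only non-kernel sockets pin the namespace. */
static struct model_sock *model_sk_alloc(struct model_net *net, int kern)
{
    struct model_sock *sk = calloc(1, sizeof(*sk));

    if (!sk)
        return NULL;
    sk->net = net;
    sk->sk_net_refcnt = kern ? 0 : 1;
    if (sk->sk_net_refcnt)
        net->count++;
    return sk;
}

/* Free a socket, dropping the namespace reference only if one was taken. */
static void model_sk_free(struct model_sock *sk)
{
    if (sk->sk_net_refcnt)
        sk->net->count--;
    free(sk);
}
```

Because the flag is recorded at allocation time, the free path needs no special handling per caller, which is what makes the sk_change_net and sk_release_kernel workarounds unnecessary in this model.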

net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28

Eric W. Biederman authored May 08, 2015

In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add a struct net parameter to sock_create_kern · eeb1bd5c

Eric W. Biederman authored May 08, 2015

This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tun: Utilize the normal socket network namespace refcounting. · 140e807d

Eric W. Biederman authored May 08, 2015

There is no need for tun to do the weird network namespace refcounting.
The existing network namespace refcounting in tfile has almost exactly
the same lifetime.  So rewrite the code to use the struct sock network
namespace refcounting and remove the unnecessary hand rolled network
namespace refcounting and the unnecessary tfile->net.

This change allows the tun code to directly call sock_put bypassing
sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

Remove the now unnecessary tun_release so that if anything tries to use
the sock_release code path the kernel will oops, and let us know about
the bug.

The macvtap code already uses its internal socket this way.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10 May, 2015 9 commits

codel: add ce_threshold attribute · 80ba92fa

Eric Dumazet authored May 08, 2015

For DCTCP or similar ECN based deployments on fabrics with shallow
buffers, hosts are responsible for a good part of the buffering.

This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
so that DCTCP can have feedback from queuing in the host.

A DCTCP enabled egress port simply has a queue occupancy threshold
above which ECT packets get a CE mark.

In codel language this translates to a sojourn time, so that one doesn't
have to worry about bytes or bandwidth but delays.

This makes the host an active participant in the health of the whole
network.

This also helps with experimenting with DCTCP in a setup without a DCTCP
compliant fabric.

In the following example, ce_threshold is set to 1ms, and we can see from
'ldelay xxx us' that TCP is not trying to go around the 5ms codel
target.

Queue has more capacity to absorb inelastic bursts (say from UDP
traffic), as queues are maintained to an optimal level.

lpaa23:~# ./tc -s -d qd sh dev eth1
qdisc mq 1: dev eth1 root
 Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
 backlog 3108242b 364p requeues 42961
qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
 Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
 rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
  count 0 lastcount 0 ldelay 1.0ms drop_next 0us
  maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
 Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
 rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
  count 0 lastcount 0 ldelay 694us drop_next 0us
  maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
 Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
 rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
  count 0 lastcount 0 ldelay 889us drop_next 0us
  maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
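
The marking rule described in this message (ECT packets whose sojourn time exceeds ce_threshold get a CE mark) can be sketched in userspace as follows. This is a simplified model, not the qdisc code: apply_ce_threshold and struct ce_stats are hypothetical, times are plain microsecond counts, and the two-bit ECN codepoints are the standard ones (00 Not-ECT, 01 ECT(1), 10 ECT(0), 11 CE).

```c
#include <stdint.h>

/* Two-bit ECN codepoints from the IP header. */
enum {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3
};

/* Hypothetical counter mirroring the ce_mark statistic shown by tc. */
struct ce_stats {
    uint32_t ce_mark;
};

/*
 * Return the ECN codepoint the packet leaves with. Only ECN-capable
 * packets that waited longer than the threshold are changed to CE.
 */
static uint8_t apply_ce_threshold(uint8_t ecn, uint64_t sojourn_us,
                                  uint64_t ce_threshold_us,
                                  struct ce_stats *stats)
{
    int ect = (ecn == ECN_ECT_0 || ecn == ECN_ECT_1);

    if (ect && sojourn_us > ce_threshold_us) {
        stats->ce_mark++;
        return ECN_CE;
    }
    return ecn;
}
```

The decision depends only on delay, which matches the point made above that one does not have to reason about bytes or bandwidth.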

ethernet: qualcomm: use spi instead of spi_device · cf9d0dcc

Varka Bhadram authored May 07, 2015

All spi based drivers have an instance of struct spi_device
as spi. This patch renames spi_device to spi to synchronize
with all the drivers.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'pktgen-next' · 3e3b3468

David S. Miller authored May 09, 2015

Jesper Dangaard Brouer says:

====================
The following series introduces some pktgen changes.

Patch01:
 Cleanup my own work when I introduced NO_TIMESTAMP.

Patch02:
 Took over patch from Alexei, and addressed my own concerns, as Alexei
 is too busy with other work, and this will provide an easy tool for
 measuring ingress path performance, which is a hot topic ATM.

 Changes were primarily user interface related.  Introduced a separate
 "xmit_mode" setting, instead of stealing one of the dev flags like
 Alexei did.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: introduce xmit_mode '<start_xmit|netif_receive>' · 62f64aed

Alexei Starovoitov authored May 07, 2015

Introduce xmit_mode 'netif_receive' for pktgen which generates the
packets using familiar pktgen commands, but feeds them into
netif_receive_skb() instead of ndo_start_xmit().

Default mode is called 'start_xmit'.

It is designed to test netif_receive_skb and ingress qdisc
performance only. Make sure to understand how it works before
using it for other rx benchmarking.

Sample script 'pktgen.sh':
#!/bin/bash
function pgset() {
  local result

  echo $1 > $PGDEV

  result=`cat $PGDEV | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
    cat $PGDEV | fgrep Result:
  fi
}

[ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
ETH=$1

PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device $ETH"

PGDEV=/proc/net/pktgen/$ETH
pgset "xmit_mode netif_receive"
pgset "pkt_size 60"
pgset "dst 198.18.0.1"
pgset "dst_mac 90:e2:ba:ff:ff:ff"
pgset "count 10000000"
pgset "burst 32"

PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
cat /proc/net/pktgen/$ETH

Usage:
$ sudo ./pktgen.sh eth2
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
  43033682pps 20656Mb/sec (20656167360bps) errors: 10000000

Raw netif_receive_skb speed should be ~43 million packets
per second on a 3.7GHz x86 and 'perf report' should look like:
  37.69%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
  25.81%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
   7.22%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   5.68%  kpktgend_0   [pktgen]          [k] pktgen_thread_worker

If fib_table_lookup is seen on top, it means the skb was processed
by the stack. To benchmark netif_receive_skb only, make sure
that the 'dst_mac' of your pktgen script is different from the
receiving device mac, so that the packet is dropped by ip_rcv.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
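
The figures in the sample Result line can be cross-checked with a little arithmetic. The helpers below are hypothetical and only restate the unit conversions: packets per second from a packet count and elapsed microseconds, and bits per second from a packet rate and packet size.

```c
#include <stdint.h>

/* Packets per second from a packet count and elapsed time in microseconds. */
static uint64_t pps_from_run(uint64_t packets, uint64_t elapsed_us)
{
    if (elapsed_us == 0)
        return 0;
    return packets * 1000000ULL / elapsed_us;
}

/* Bits per second for a given packet rate and packet size in bytes. */
static uint64_t bps_from_pps(uint64_t pps, uint32_t pkt_size)
{
    return pps * 8ULL * pkt_size;
}
```

Using the numbers above, 43033682 pps at 60 bytes gives 20656167360 bps, which is the 20656Mb/sec shown. Dividing 10000000 packets by the rounded 232376 usec gives roughly 43.03 Mpps; the last digits differ slightly from the reported pps, presumably because the tool measures elapsed time at a finer granularity than the microseconds it prints.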

pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant · f1f00d8f

Jesper Dangaard Brouer authored May 07, 2015

Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
with a negation of the flag like !NO_TIMESTAMP.

Also document the option flag NO_TIMESTAMP.

Fixes: afb84b62 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
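
The flag convention described above, where a leading '!' negates a flag, can be sketched as follows. This is an illustration only: apply_flag is a hypothetical helper, the bit value chosen for DEMO_F_NO_TIMESTAMP is an assumption, and only the flag name NO_TIMESTAMP comes from the message.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_F_NO_TIMESTAMP (1u << 0)   /* hypothetical bit assignment */

/*
 * Apply one flag token to a flags word.
 * "NO_TIMESTAMP" sets the bit, "!NO_TIMESTAMP" clears it again.
 * Returns 0 on success, -1 for an unknown flag name.
 */
static int apply_flag(uint32_t *flags, const char *token)
{
    int negate = (token[0] == '!');

    if (negate)
        token++;                        /* skip the negation marker */

    if (strcmp(token, "NO_TIMESTAMP") != 0)
        return -1;

    if (negate)
        *flags &= ~DEMO_F_NO_TIMESTAMP; /* timestamping on again */
    else
        *flags |= DEMO_F_NO_TIMESTAMP;  /* timestamping off */
    return 0;
}
```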

Merge branch 'netns-scalability' · 4d95b72f

David S. Miller authored May 09, 2015

Nicolas Dichtel says:

====================
netns: ease netlink use with a lot of netns

This idea was informally discussed in Ottawa / netdev0.1. The goal is to
ease the use/scalability of netns, from a userland point of view.
Today, users need to open one netlink socket per family and per netns.
Thus, when the number of netns increases (for example 5K or more), the
number of sockets needed to manage them grows a lot.

The goal of this series is to be able to monitor netlink events, for a
specified family, for a set of netns, with only one netlink socket. For
this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID.
When this option is set on a netlink socket, this socket will receive
netlink notifications from all netns that have a nsid assigned into the
netns where the socket has been opened.
The nsid is sent to userland via ancillary data.

Here is an example with a patched iproute2. vxlan10 is created in the
current netns (netns0, nsid 0) and then moved to another netns (netns1,
nsid 1):

$ ip netns exec netns0 ip monitor all-nsid label
[nsid 0][NSID]nsid 1 (iproute2 netns name: netns1)
[nsid 0][NEIGH]??? lladdr 00:00:00:00:00:00 REACHABLE,PERMANENT
[nsid 0][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
    link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
[nsid 0][LINK]Deleted 5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
    link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
[nsid 1][NSID]nsid 0 (iproute2 netns name: netns0)
[nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
    link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
[nsid 1][ADDR]5: vxlan10    inet 192.168.0.249/24 brd 192.168.0.255 scope global vxlan10
       valid_lft forever preferred_lft forever
[nsid 1][ROUTE]local 192.168.0.249 dev vxlan10  table local  proto kernel  scope host  src 192.168.0.249
[nsid 1][ROUTE]ff00::/8 dev vxlan10  table local  metric 256  pref medium
[nsid 1][ROUTE]2001:123::/64 dev vxlan10  proto kernel  metric 256  pref medium
[nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
[nsid 1][ROUTE]broadcast 192.168.0.255 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
[nsid 1][ROUTE]192.168.0.0/24 dev vxlan10  proto kernel  scope link  src 192.168.0.249
[nsid 1][ROUTE]broadcast 192.168.0.0 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
[nsid 1][ROUTE]fe80::/64 dev vxlan10  proto kernel  metric 256  pref medium
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: allow to listen "all" netns · 59324cf3

Nicolas Dichtel authored May 07, 2015

More accurately, listen to all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via ancillary data.

With this patch, a daemon needs only one socket to listen to many netns. This
is useful when the number of netns is high.

Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
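
From userspace, the option described above might be used roughly as follows. This is a sketch, not code from the patch: enable_all_nsid and nsid_from_msg are hypothetical helpers, the fallback values for SOL_NETLINK (270) and NETLINK_LISTEN_ALL_NSID (8) are supplied in case the system headers do not define them, and the ancillary message is assumed to carry the nsid as a single int at level SOL_NETLINK with type NETLINK_LISTEN_ALL_NSID.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LISTEN_ALL_NSID
#define NETLINK_LISTEN_ALL_NSID 8
#endif

/* Turn the option on for an already opened netlink socket. */
static int enable_all_nsid(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
                      &one, sizeof(one));
}

/*
 * Look for the nsid in the ancillary data of a received message.
 * Returns 1 and stores the nsid if present, 0 if no nsid was attached.
 */
static int nsid_from_msg(struct msghdr *msg, int *nsid)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_NETLINK &&
            cmsg->cmsg_type == NETLINK_LISTEN_ALL_NSID &&
            cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
            memcpy(nsid, CMSG_DATA(cmsg), sizeof(int));
            return 1;
        }
    }
    return 0;
}
```

A receiver would pass a control buffer to recvmsg() and call nsid_from_msg() on the filled msghdr. The separate "found" return value matters because 0 is itself a valid nsid, as noted above.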

netlink: rename private flags and states · cc3a572f

Nicolas Dichtel authored May 07, 2015

These flags and states have the same prefix (NETLINK_) as netlink socket
options. To avoid confusion and to be able to name a flag like a socket
option, let's use another prefix: NETLINK_[S|F]_.

Note: a comment has been fixed, it was talking about
NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: use a spin_lock to protect nsid management · 95f38411

Nicolas Dichtel authored May 07, 2015

Before this patch, nsids were protected by the rtnl lock. The goal of this
patch is to be able to find a nsid without needing to hold the rtnl lock.

The next patch will introduce a netlink socket option to listen to all
netns that have a nsid assigned into the netns where the socket is opened.
Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
avoid a recursive lock (nsids are notified via rtnl). This was the main
reason for the previous patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
