Commits · 87ffabb1f055e14e7d171c6599539a154d647904 · nexedi / linux

14 Apr, 2015 18 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 87ffabb1

David S. Miller authored Apr 14, 2015

The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.
Signed-off-by: David S. Miller <davem@davemloft.net>

87ffabb1

Merge branch 'cxgb4-next' · 5e0e0dc1

David S. Miller authored Apr 14, 2015

Hariprasad Shenai says:

====================
cxgb4: Misc. fixes for sge

Increases value of MAX_IMM_TX_PKT_LEN to improve latency, fill freelist
starving threshold based on adapter type, add comments for tx flits and sge
length code and don't call t4_slow_intr_handler when we are not master PF.

This patch series has been created against net-next tree and includes patches on
cxgb4 driver

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5e0e0dc1

cxgb4: Don't call t4_slow_intr_handler when we're not the Master PF · c3c7b121
Hariprasad Shenai authored Apr 15, 2015
```
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
c3c7b121

cxgb4: Add comment for calculate tx flits and sge length code · 0aac3f56

Hariprasad Shenai authored Apr 15, 2015

Add comment for tx filt and sge length calucaltion code, also remove
a hardcoded value
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0aac3f56

cxgb4: Use device node in page allocation · d52ce920

Hariprasad Shenai authored Apr 15, 2015

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d52ce920

cxgb4: Freelist starving threshold varies from adapter to adapter · c098b026

Hariprasad Shenai authored Apr 15, 2015

fl_starv_thres could be different from adapter to adapter, don't use
hardcoded values
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c098b026

cxgb4: Increased the value of MAX_IMM_TX_PKT_LEN from 128 to 256 bytes · 21dcfad6

Hariprasad Shenai authored Apr 15, 2015

This allows a significant latency drop for packets of sizes between 128 and 192
bytes
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21dcfad6

bgmac: drop ring->num_slots · 29ba877e

Felix Fietkau authored Apr 14, 2015

The ring size is always known at compile time, so make the code a bit
more efficient
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

29ba877e

bgmac: fix DMA rx corruption · 4668ae1f

Felix Fietkau authored Apr 14, 2015

The driver needs to inform the hardware about the first invalid (not yet
filled) rx slot, by writing its DMA descriptor pointer offset to the
BGMAC_DMA_RX_INDEX register.

This register was set to a value exceeding the rx ring size, effectively
allowing the hardware constant access to the full ring, regardless of
which slots are initialized.

To fix this issue, always mark the last filled rx slot as invalid.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4668ae1f

bgmac: simplify dma init/cleanup · 74b6f291

Felix Fietkau authored Apr 14, 2015

Instead of allocating buffers at device init time and initializing
descriptors at device open, do both at the same time (during open).
Free all buffers when closing the device.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Acked-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

74b6f291

bgmac: increase rx ring size from 511 to 512 · b9650557

Felix Fietkau authored Apr 14, 2015

Limiting it to 511 looks like a failed attempt at leaving one descriptor
empty to allow the hardware to stop processing a buffer that has not
been prepared yet. However, this doesn't work because this affects the
total ring size as well
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9650557

bgmac: add check for oversized packets · 6a6c7084

Felix Fietkau authored Apr 14, 2015

In very rare cases, the MAC can catch an internal buffer that is bigger
than it's supposed to be. Instead of crashing the kernel, simply pass
the buffer back to the hardware
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a6c7084

bgmac: simplify/optimize rx DMA error handling · 56faacd0

Felix Fietkau authored Apr 14, 2015

Allocate a new buffer before processing the completed one. If allocation
fails, reuse the old buffer.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Acked-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

56faacd0

bgmac: set received skb headroom to NET_SKB_PAD · 4b62dce4

Felix Fietkau authored Apr 14, 2015

A packet buffer offset of 30 bytes is inefficient, because the first 2
bytes end up in a different cacheline.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b62dce4

bgmac: leave interrupts disabled as long as there is work to do · eb64e292

Felix Fietkau authored Apr 14, 2015

Always poll rx and tx during NAPI poll instead of relying on the status
of the first interrupt. This prevents bgmac_poll from leaving unfinished
work around until the next IRQ.
In my tests this makes bridging/routing throughput under heavy load more
stable and ensures that no new IRQs arrive as long as bgmac_poll uses up
the entire budget.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb64e292

bgmac: simplify tx ring index handling · b38c83dd

Felix Fietkau authored Apr 14, 2015

Keep incrementing ring->start and ring->end instead of pointing it to
the actual ring slot entry. This simplifies the calculation of the
number of free slots.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Acked-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b38c83dd

toshiba: Remove celleb from Kconfig options · e0767834

Daniel Axtens authored Apr 14, 2015

The toshiba drivers had celleb as an optional dependency.
celleb has been dropped [1], so clean that out of Kconfig.

[1] http://patchwork.ozlabs.org/patch/451730/

CC: netdev@vger.kernel.org
CC: Valentin Rothberg <valentinrothberg@gmail.com>
CC: mpe@ellerman.id.au
CC: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

e0767834

hv_netvsc: Implement partial copy into send buffer · aa0a34be

Haiyang Zhang authored Apr 13, 2015

If remaining space in a send buffer slot is too small for the whole message,
we only copy the RNDIS header and PPI data into send buffer, so we can batch
one more packet each time. It reduces the vmbus per-message overhead.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa0a34be

13 Apr, 2015 22 commits

Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 6e8a9d91

David S. Miller authored Apr 13, 2015

Al Viro says:

====================
netdev-related stuff in vfs.git

There are several commits sitting in vfs.git that probably ought to go in
via net-next.git.  First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago.  It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter.  The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO.  Next there's
afs and rxrpc stuff from dhowells.  And then there's sanitizing kernel_sendmsg
et.al.  + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g.  cifs stuff, but it belongs in net-next.

That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6e8a9d91

tcp/dccp: get rid of central timewait timer · 789f558c

Eric Dumazet authored Apr 12, 2015

Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

789f558c

netfilter: Fix format string of nfnetlink_log proc file · 20a1d165

Richard Weinberger authored Apr 13, 2015

The printed values are all of type unsigned integer, therefore use
%u instead of %d. Otherwise an user can face negative values.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

20a1d165

netfilter: Fix format string of nfnetlink_queue proc file · 6b46f7b7

Richard Weinberger authored Apr 13, 2015

The printed values are all of type unsigned integer, therefore use
%u instead of %d. Otherwise an user can face negative values.

Fixes:
$ cat /proc/net/netfilter/nfnetlink_queue
    0  29508   278 2 65531     0 2004213241 -2129885586  1
    1 -27747     0 2 65531     0     0        0  1
    2 -27748     0 2 65531     0     0        0  1
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b46f7b7

netfilter: Fix portid types · cc6bc448

Richard Weinberger authored Apr 13, 2015

The netlink portid is an unsigned integer, use this type
also in netfilter.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc6bc448

nfc: Fix portid type in urelease_work · 65bc4f93

Richard Weinberger authored Apr 13, 2015

portid is an unsigned integer. Fix urelease_work to
match all other portid user in the kernel.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David S. Miller <davem@davemloft.net>

65bc4f93

netlink: Fix portid type in netlink_notify · 0392d099

Richard Weinberger authored Apr 13, 2015

portid is an unsigned integer. Fix netlink_notify to
match all other portid user in the kernel.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David S. Miller <davem@davemloft.net>

0392d099

tcp: fix bogus RTT for CC when retransmissions are acked · 3d0d26c7

Kenneth Klette Jonassen authored Apr 11, 2015

Since retransmitted segments are not used for RTT estimation, previously
SACKed segments present in the rtx queue are used. This estimation can be
several times larger than the actual RTT. When a cumulative ack covers both
previously SACKed and retransmitted segments, CC may thus get a bogus RTT.

Such segments previously had an RTT estimation in tcp_sacktag_one(), so it
seems reasonable to not reuse them in tcp_clean_rtx_queue() at all.

Afaik, this has had no effect on SRTT/RTO because of Karn's check.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d0d26c7

net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b

Daniel Borkmann authored Apr 10, 2015

Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.

Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.

Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4577139b

Merge branch 'netdev_diet' · dfc96c19

David S. Miller authored Apr 13, 2015

Thomas Graf says:

====================
Bring sizeof(net_device) down to < 2K bytes

The size of struct net_device crossed the 2K boundary a while ago which
is a waste in combination with many net namespaces. This series brings
the size of struct net_device down to well below 2K in total size with
a typical configuration. Some reserves a several holes leave room for
further expansion.

Before:
/* size: 2176, cachelines: 34, members: 121 */

After:
/* size: 1984, cachelines: 31, members: 120 */
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

dfc96c19

net_device: Reorder members to fill holes · 14ffbbb8

Thomas Graf authored Apr 10, 2015

Some trivial reorders while preserving the RX/TX cache lines
split to fill a couple of holes.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14ffbbb8

e1000e: Move pm_qos_req to e1000e adapter · e2c65448

Thomas Graf authored Apr 10, 2015

e1000e is the only driver requiring pm_qos_req, instead of causing
every device to waste up to 240 bytes. Allocate it for the specific
driver.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2c65448

selinux/nlmsg: add a build time check for rtnl/xfrm cmds · cf890138

Nicolas Dichtel authored Apr 13, 2015

When a new rtnl or xfrm command is added, this part of the code is frequently
missing. Let's help the developer with a build time test.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf890138

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · e60a9de4

David S. Miller authored Apr 12, 2015

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-04-11

This series contains updates to iflink, ixgbe and ixgbevf.

The entire set of changes come from Vlad Zolotarov to ultimately add
the ethtool ops to VF driver to allow querying the RSS indirection table
and RSS random key.

Currently we support only 82599 and x540 devices.  On those devices, VFs
share the RSS redirection table and hash key with a PF.  Letting the VF
query this information may introduce some security risks, therefore this
feature will be disabled by default.

The new netdev op allows a system administrator to change the default
behaviour with "ip link set" command.  The relevant iproute2 patch has
already been sent and awaits for this series upstream.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e60a9de4

Merge branch 'fou-next' · 1ceb0b8c

David S. Miller authored Apr 12, 2015

Cong Wang says:

====================
fou: some fixes and updates

Patch 1~3 fix some minor bugs in net/ipv4/fou.c, the only
thing I am not sure is if it's too late to change the
byte order of FOU_ATTR_PORT, if so we have to fix iproute2
instead of kernel.

Patch 4~5 add some new features to make it complete.

v2: make fou->port be16 too
====================
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>