Commits · c8fb217af57c6c232af3517d3115d2af4ce9900e · nexedi / linux

02 May, 2012 9 commits

vhost_net: zerocopy: adding and signalling immediately when fully copied · c8fb217a

Jason Wang authored May 02, 2012

When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

c8fb217a

vhost_net: re-poll only on EAGAIN or ENOBUFS · dbf34207

Jason Wang authored May 02, 2012

Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.

The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

dbf34207

vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs · c460f057

Jason Wang authored May 02, 2012

When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.

As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

c460f057

macvtap: zerocopy: validate vectors before building skb · b92946e2

Jason Wang authored May 02, 2012

There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

b92946e2

macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully · 01d6657b

Jason Wang authored May 02, 2012

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

01d6657b

macvtap: zerocopy: put page when fail to get all requested user pages · 02ce04bb

Jason Wang authored May 02, 2012

When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

02ce04bb

macvtap: zerocopy: fix truesize underestimation · 4ef67ebe

Jason Wang authored May 02, 2012

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

4ef67ebe

macvtap: zerocopy: fix offset calculation when building skb · 3afc9621

Jason Wang authored May 02, 2012

This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

3afc9621

virtio/tools: add delayed interupt mode · 64d09888
Michael S. Tsirkin authored Apr 16, 2012
```
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
```
64d09888

01 May, 2012 31 commits

netem: add ECN capability · e4ae004b

Eric Dumazet authored Apr 30, 2012

Add ECN (Explicit Congestion Notification) marking capability to netem

tc qdisc add dev eth0 root netem drop 0.5 ecn

Instead of dropping packets, try to ECN mark them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4ae004b

net: skb_peek()/skb_peek_tail() cleanups · 18d07000

Eric Dumazet authored Apr 30, 2012

remove useless casts and rename variables for less confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18d07000

net: add a prefetch in socket backlog processing · e4cbb02a

Eric Dumazet authored Apr 30, 2012

TCP or UDP stacks have big enough latencies that prefetching next
pointer is worth it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4cbb02a

l2tp: let iproute2 create L2TPv3 IP tunnels using IPv6 · 5dac94e1

James Chapman authored Apr 29, 2012

The netlink API lets users create unmanaged L2TPv3 tunnels using
iproute2. Until now, a request to create an unmanaged L2TPv3 IP
encapsulation tunnel over IPv6 would be rejected with
EPROTONOSUPPORT. Now that l2tp_ip6 implements sockets for L2TP IP
encapsulation over IPv6, we can add support for that tunnel type.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5dac94e1

l2tp: introduce L2TPv3 IP encapsulation support for IPv6 · a32e0eec

Chris Elston authored Apr 29, 2012

L2TPv3 defines an IP encapsulation packet format where data is carried
directly over IP (no UDP). The kernel already has support for L2TP IP
encapsulation over IPv4 (l2tp_ip). This patch introduces support for
L2TP IP encapsulation over IPv6.

The implementation is derived from ipv6/raw and ipv4/l2tp_ip.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a32e0eec

ipv6: Export ipv6 functions for use by other protocols · a495f836

Chris Elston authored Apr 29, 2012

For implementing other protocols on top of IPv6, such as L2TPv3's IP
encapsulation over ipv6, we'd like to call some IPv6 functions which
are not currently exported. This patch exports them.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a495f836

l2tp: netlink api for l2tpv3 ipv6 unmanaged tunnels · f9bac8df

Chris Elston authored Apr 29, 2012

This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using
the netlink API. We already support unmanaged L2TPv3 tunnels over
IPv4. A patch to iproute2 to make use of this feature will be
submitted separately.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9bac8df

l2tp: show IPv6 addresses in l2tp debugfs file · 2121c3f5

Chris Elston authored Apr 29, 2012

If an L2TP tunnel uses IPv6, make sure the l2tp debugfs file shows the
IPv6 address correctly.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2121c3f5

l2tp: pppol2tp_connect() handles ipv6 sockaddr variants · b79585f5

James Chapman authored Apr 29, 2012

Userspace uses connect() to associate a pppol2tp socket with a tunnel
socket. This needs to allow the caller to supply the new IPv6
sockaddr_pppol2tp structures if IPv6 is used.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b79585f5

pppox: Replace __attribute__((packed)) in if_pppox.h · 9d4ec1ae

James Chapman authored Apr 29, 2012

Checkpatch warns about the use of __attribute__((packed)). So use the
recommended __packed syntax instead.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d4ec1ae

l2tp: remove unused stats from l2tp_ip socket · c8657fd5

James Chapman authored Apr 29, 2012

The l2tp_ip socket currently maintains packet/byte stats in its
private socket structure. But these counters aren't exposed to
userspace and so serve no purpose. The counters were also
smp-unsafe. So this patch just gets rid of the stats.

While here, change a couple of internal __u32 variables to u32.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8657fd5

l2tp: Use ip4_datagram_connect() in l2tp_ip_connect() · de3c7a18

James Chapman authored Apr 29, 2012

Cleanup the l2tp_ip code to make use of an existing ipv4 support function.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

de3c7a18

l2tp: fix locking of 64-bit counters for smp · 5de7aee5

James Chapman authored Apr 29, 2012

L2TP uses 64-bit counters but since these are not updated atomically,
we need to make them safe for smp. This patch addresses that.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5de7aee5

atl1c: remove PHY polling from atl1c_change_mtu · 80bcb423

Huang, Xiong authored Apr 30, 2012

PHY polling code for FPGA is considered in every MDIO R/W API.
no need to add additional code to atl1c_change_mtu.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: David Liu <dwliu@qca.qaulcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80bcb423

atl1c: Disable L0S when no cable link · 4fc36352

Huang, Xiong authored Apr 30, 2012

L0S might be unstable if no cable link, only enable it when link up.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4fc36352

atl1c: do MAC-reset when PHY link down · 5e5c0964

Huang, Xiong authored Apr 30, 2012

There may be tx-skbs still pending in HW when PHY link down.
Reset MAC will make the DMA engine go to the start point.
and release all pending skbs.
Note: Reset MAC will clear any interrupt status and mask.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e5c0964

atl1c: cancel task when interface closed · 0aa76ce3

Huang, Xiong authored Apr 30, 2012

common_task might be running while close routine is called,
wait/cancel it.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0aa76ce3

atl1c: enlarge L1 response waiting timer · f56fa567

Huang, Xiong authored Apr 30, 2012

The hardware incorrectly process L0S/L1 entrance if the chipset/root
response after specific/shorter timer and cause system hang.
Enlarge the timeout value to avoid this issue.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f56fa567

atl1c: refine mac address related code · 229e6b6e

Huang, Xiong authored Apr 30, 2012

On some platform with EEPROM/OTP existing, the BIOS could overwrite
a new MAC address for the NIC. so, the permanent mac address should
be from BIOS. the address is restored when driver removing.
Voltage raising isn't applicable for l1d.
Replace swab32 with htonl for big/little endian platform.
related Registers are refined as well.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

229e6b6e

atl1c: remove code of closing register writable attribution · e1192580

Huang, Xiong authored Apr 30, 2012

The Close-action is done by atl1c_reset_pcie, remove it from
atl1c_get_permanent_address.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1192580

atl1c: clear WoL status when reset pcie · 87eabe6b

Huang, Xiong authored Apr 30, 2012

WoL status is read-clear and should be cleared when in S0
status.
putting it in atl1c_reset_pcie is more suitable than
in atl1c_get_permanent_address.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87eabe6b

atl1c: add PHY link event(up/down) patch · 903d7ce0

Huang, Xiong authored Apr 30, 2012

On some platforms the PHY settings need to change depending on the
cable link status to get better stability.
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

903d7ce0

atl1c: add workaround for issue of bit INTX-disable for MSI interrupt · 7cb6a291

Huang, Xiong authored Apr 30, 2012

All supported devices have one issue that msi interrupt doesn't assert
if pci command register bit (PCI_COMMAND_INTX_DISABLE) is set.
Add workaround in drivers/pci/quirks.c
Signed-off-by: xiong <xiong@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7cb6a291

Merge branch 'tipc_net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux · b6d151bb
David S. Miller authored Apr 30, 2012

b6d151bb

bnx2x: remove some bloat · 1191cb83

Eric Dumazet authored Apr 27, 2012

Before doing skb->head_frag work on bnx2x driver, I found too much stuff
was inlined in bnx2x/bnx2x_cmn.h for no good reason and made my work not
very easy.

Move some big functions out of this include file to the respective .c
file.

A lot of inline keywords are not needed at all in this huge driver.

   text	   data	    bss	    dec	    hex	filename
 490083	   1270	     56	 491409	  77f91	bnx2x/bnx2x.ko.before
 484206	   1270	     56	 485532	  7689c	bnx2x/bnx2x.ko
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1191cb83

pch_gbe: reprogram multicast address register on reset · d344c4f3

RongQing.Li authored Apr 27, 2012

The reset logic after a Rx FIFO overrun will clear the programmed
multicast addresses. This patch fixes the issue by reprogramming the
registers after the reset.

The commit eefc48b0 ("pch_gbe: reprogram multicast address register on
reset") tried to fix this problem, but it introduces unnecessary
codes. In fact, all multicast addresses have been saved in netdev->mc,
So we can call pch_gbe_set_multi() directly after reset_hw and
reset_rx.

This commit kills 50+ line codes

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Takahiro Shimizu <tshimizu818@gmail.com>
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d344c4f3

net: makes skb_splice_bits() aware of skb->head_frag · 1d0c0b32

Eric Dumazet authored Apr 27, 2012

__skb_splice_bits() can check if skb to be spliced has its skb->head
mapped to a page fragment, instead of a kmalloc() area.

If so we can avoid a copy of the skb head and get a reference on
underlying page.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d0c0b32

tcp: makes tcp_try_coalesce aware of skb->head_frag · 329033f6

Eric Dumazet authored Apr 27, 2012

TCP coalesce can check if skb to be merged has its skb->head mapped to a
page fragment, instead of a kmalloc() area.

We had to disable coalescing in this case, for performance reasons.

We 'upgrade' skb->head as a fragment in itself.

This reduces number of cache misses when user makes its copies, since a
less sk_buff are fetched.

This makes receive and ofo queues shorter and thus reduce cache line
misses in TCP stack.

This is a followup of patch "net: allow skb->head to be a page fragment"

Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce"
counter being incremented.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

329033f6

net: make GRO aware of skb->head_frag · d7e8883c

Eric Dumazet authored Apr 30, 2012

GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself

This avoids the frag_list fallback, and permits to build true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces number of cache misses when user makes its copy, since a
single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7e8883c

tg3: provide frags as skb head · 8d4057a9

Eric Dumazet authored Apr 27, 2012

This patch converts tg3 driver, one of our reference drivers, to use new
build_skb() api in frag mode.

Instead of using kmalloc() to allocate the memory block that will be
used by build_skb() as skb->head, we use a page fragment.

This is a followup of patch "net: allow skb->head to be a page fragment"

This allows GRO, TCP coalescing, and splice() to be more efficient.

Incidentally, this also removes SLUB slow path contention in kfree()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d4057a9

net: allow skb->head to be a page fragment · d3836f21

Eric Dumazet authored Apr 27, 2012

skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback the data cannot be converted to a page fragment if
needed.

We have three spots were it hurts :

1) GRO aggregation

 When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, very inefficient since we keep all struct sk_buff
around. So drivers enabling GRO but delivering linear skbs to network
stack aren't enabling full GRO power.

2) splice(socket -> pipe).

 We must copy the linear part to a page fragment.
 This kind of defeats splice() purpose (zero copy claim)

3) TCP coalescing.

 Recently introduced, this permits to group several contiguous segments
into a single skb. This shortens queue lengths and save kernel memory,
and greatly reduce probabilities of TCP collapses. This coalescing
doesnt work on linear skbs (or we would need to copy data, this would be
too slow)

Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.

build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non zero
value, set to the fragment size.

Then, on situations we need to convert the skb head to a frag in itself,
we can check if skb->head_frag is set and avoid the copies or various
fallbacks we have.

This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize). (thats 512 or 1024 bytes saved per skb). This also makes
bpf/netfilter faster since the 'first frag' will be part of skb linear
part, no need to copy data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d3836f21