Commits · 78296c97ca1fd3b104f12e1f1fbc06c46635990b · Kirill Smelkov / linux

16 Feb, 2015 3 commits

netfilter: xt_socket: fix a stack corruption bug · 78296c97

Eric Dumazet authored Feb 15, 2015

As soon as extract_icmp6_fields() returns, its local storage (automatic
variables) is deallocated and can be overwritten.

Lets add an additional parameter to make sure storage is valid long
enough.

While we are at it, adds some const qualifiers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: b64c9256 ("tproxy: added IPv6 support to the socket match")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

78296c97

netfilter: xt_recent: don't reject rule if new hitcount exceeds table max · cef9ed86

Florian Westphal authored Feb 13, 2015

given:
-A INPUT -m recent --update --seconds 30 --hitcount 4
and
iptables-save > foo

then
iptables-restore < foo

will fail with:
kernel: xt_recent: hitcount (4) is larger than packets to be remembered (4) for table DEFAULT

Even when the check is fixed, the restore won't work if the hitcount is
increased to e.g. 6, since by the time checkentry runs it will find the
'old' incarnation of the table.

We can avoid this by increasing the maximum threshold silently; we only
have to rm all the current entries of the table (these entries would
not have enough room to handle the increased hitcount).

This even makes (not-very-useful)
-A INPUT -m recent --update --seconds 30 --hitcount 4
-A INPUT -m recent --update --seconds 30 --hitcount 42
work.

Fixes: abc86d0f (netfilter: xt_recent: relax ip_pkt_list_tot restrictions)
Tracked-down-by: Chris Vine <chris@cvine.freeserve.co.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

cef9ed86

netfilter: nft_compat: fix module refcount underflow · 520aa741

Pablo Neira Ayuso authored Feb 12, 2015

Feb 12 18:20:42 nfdev kernel: ------------[ cut here ]------------
Feb 12 18:20:42 nfdev kernel: WARNING: CPU: 4 PID: 4359 at kernel/module.c:963 module_put+0x9b/0xba()
Feb 12 18:20:42 nfdev kernel: CPU: 4 PID: 4359 Comm: ebtables-compat Tainted: G        W      3.19.0-rc6+ #43
[...]
Feb 12 18:20:42 nfdev kernel: Call Trace:
Feb 12 18:20:42 nfdev kernel: [<ffffffff815fd911>] dump_stack+0x4c/0x65
Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e6f7>] warn_slowpath_common+0x9c/0xb6
Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] ? module_put+0x9b/0xba
Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e726>] warn_slowpath_null+0x15/0x17
Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] module_put+0x9b/0xba
Feb 12 18:20:42 nfdev kernel: [<ffffffff813ecf7c>] nft_match_destroy+0x45/0x4c
Feb 12 18:20:42 nfdev kernel: [<ffffffff813e683f>] nf_tables_rule_destroy+0x28/0x70
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>

520aa741

09 Feb, 2015 1 commit

ipvs: fix inability to remove a mixed-family RS · dd3733b3

Alexey Andriyanov authored Feb 06, 2015

The current code prevents any operation with a mixed-family dest
unless IP_VS_CONN_F_TUNNEL flag is set. The problem is that it's impossible
for the client to follow this rule, because ip_vs_genl_parse_dest does
not even read the destination conn_flags when cmd = IPVS_CMD_DEL_DEST
(need_full_dest = 0).

Also, not every client can pass this flag when removing a dest. ipvsadm,
for example, does not support the "-i" command line option together with
the "-d" option.

This change disables any checks for mixed-family on IPVS_CMD_DEL_DEST command.
Signed-off-by: Alexey Andriyanov <alan@al-an.info>
Fixes: bc18d37f ("ipvs: Allow heterogeneous pools now that we support them")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>

dd3733b3

03 Feb, 2015 7 commits

xen-netback: stop the guest rx thread after a fatal error · 42b5212f

David Vrabel authored Feb 02, 2015

After commit e9d8b2c2 (xen-netback:
disable rogue vif in kthread context), a fatal (protocol) error would
leave the guest Rx thread spinning, wasting CPU time.  Commit
ecf08d2d (xen-netback: reintroduce
guest Rx stall detection) made this even worse by removing a
cond_resched() from this path.

Since a fatal error is non-recoverable, just allow the guest Rx thread
to exit.  This requires taking additional refs to the task so the
thread exiting early is handled safely.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Julien Grall <julien.grall@linaro.org>
Tested-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

42b5212f

net/mlx4_core: Fix kernel Oops (mem corruption) when working with more than 80 VFs · 5a2e87b1

Jack Morgenstein authored Feb 02, 2015

Commit de966c59 (net/mlx4_core: Support more than 64 VFs) was meant to
allow up to 126 VFs.  However, due to leaving MLX4_MFUNC_MAX too low, using
more than 80 VFs resulted in memory corruptions (and Oopses) when more than
80 VFs were requested. In addition, the number of slaves was left too high.

This commit fixes these issues.

Fixes: de966c59 ("net/mlx4_core: Support more than 64 VFs")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a2e87b1

isdn: off by one in connect_res() · c101cff9

Dan Carpenter authored Feb 01, 2015

The bug here is that we use "Reject" as the index into the cau_t[] array
in the else path.  Since the cau_t[] has 9 elements if Reject == 9 then
we are reading beyond the end of the array.

My understanding of the code is that it's saying that if Reject is 1 or
too high then that's invalid and we should hang up.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c101cff9

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 3ae55826

David S. Miller authored Feb 02, 2015

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Validate hooks for nf_tables NAT expressions, otherwise users can
   crash the kernel when using them from the wrong hook. We already
   got one user trapped on this when configuring masquerading.

2) Fix a BUG splat in nf_tables with CONFIG_DEBUG_PREEMPT=y. Reported
   by Andreas Schultz.

3) Avoid unnecessary reroute of traffic in the local input path
   in IPVS that triggers a crash in in xfrm. Reported by Florian
   Wiessner and fixes by Julian Anastasov.

4) Fix memory and module refcount leak from the error path of
   nf_tables_newchain().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3ae55826

Documentation: Update netlink_mmap.txt · e6b02be8

Richard Weinberger authored Jan 30, 2015

Update netlink_mmap.txt wrt. commit 4682a035
("netlink: Always copy on mmap TX.").
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6b02be8

sunvnet: set queue mapping when doing packet copies · 44ba582b

David L Stevens authored Jan 30, 2015

This patch fixes a bug where vnet_skb_shape() didn't set the already-selected
queue mapping when a packet copy was required. This results in using the
wrong queue index for stops/starts, hung tx queues and watchdog timeouts
under heavy load.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

44ba582b

qlge: Fix qlge_update_hw_vlan_features to handle if interface is down · 61132bf7

Marcelo Leitner authored Jan 30, 2015

Currently qlge_update_hw_vlan_features() will always first put the
interface down, then update features and then bring it up again. But it
is possible to hit this code while the adapter is down and this causes a
non-paired call to napi_disable(), which will get stuck.

This patch fixes it by skipping these down/up actions if the interface
is already down.

Fixes: a45adbe8 ("qlge: Enhance nested VLAN (Q-in-Q) handling.")
Cc: Harish Patil <harish.patil@qlogic.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61132bf7

02 Feb, 2015 1 commit

ipv4: tcp: get rid of ugly unicast_sock · bdbbb852

Eric Dumazet authored Jan 29, 2015

In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
I tried to address contention on a socket lock, but the solution
I chose was horrible :

commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
of TCP stack") addressed a selinux regression.

commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
took care of another regression.

commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.

commit 811230cd ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
was another shot in the dark.

Really, just use a proper socket per cpu, and remove the skb_orphan()
call, to re-enable flow control.

This solves a serious problem with FQ packet scheduler when used in
hostile environments, as we do not want to allocate a flow structure
for every RST packet sent in response to a spoofed packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bdbbb852

01 Feb, 2015 2 commits

net: sched: fix panic in rate estimators · 0d32ef8c

Eric Dumazet authored Jan 29, 2015

Doing the following commands on a non idle network device
panics the box instantly, because cpu_bstats gets overwritten
by stats.

tc qdisc add dev eth0 root <your_favorite_qdisc>
... some traffic (one packet is enough) ...
tc qdisc replace dev eth0 root est 1sec 4sec <your_favorite_qdisc>

[  325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
[  325.362609] IP: [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
[  325.369158] PGD 1fa7067 PUD 0
[  325.372254] Oops: 0000 [#1] SMP
[  325.375514] Modules linked in: ...
[  325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
[  325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
[  325.419518] RIP: 0010:[<ffffffff81541c9e>]  [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
[  325.428506] RSP: 0018:ffff881ff2fa7928  EFLAGS: 00010286
[  325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
[  325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
[  325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
[  325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
[  325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
[  325.469536] FS:  00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
[  325.477630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
[  325.490510] Stack:
[  325.492524]  ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
[  325.499990]  ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
[  325.507460]  00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
[  325.514956] Call Trace:
[  325.517412]  [<ffffffff815424ee>] gen_new_estimator+0x8e/0x230
[  325.523250]  [<ffffffff815427aa>] gen_replace_estimator+0x4a/0x60
[  325.529349]  [<ffffffff815718ab>] tc_modify_qdisc+0x52b/0x590
[  325.535117]  [<ffffffff8155edd0>] rtnetlink_rcv_msg+0xa0/0x240
[  325.540963]  [<ffffffff8155ed30>] ? __rtnl_unlock+0x20/0x20
[  325.546532]  [<ffffffff8157f811>] netlink_rcv_skb+0xb1/0xc0
[  325.552145]  [<ffffffff8155b355>] rtnetlink_rcv+0x25/0x40
[  325.557558]  [<ffffffff8157f0d8>] netlink_unicast+0x168/0x220
[  325.563317]  [<ffffffff8157f47c>] netlink_sendmsg+0x2ec/0x3e0

Lets play safe and not use an union : percpu 'pointers' are mostly read
anyway, and we have typically few qdiscs per host.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 22e0f8b9 ("net: sched: make bstats per cpu and estimator RCU safe")
Signed-off-by: David S. Miller <davem@davemloft.net>

0d32ef8c

hyperv: Fix the error processing in netvsc_send() · d953ca4d

Haiyang Zhang authored Jan 29, 2015

The existing code frees the skb in EAGAIN case, in which the skb will be
retried from upper layer and used again.
Also, the existing code doesn't free send buffer slot in error case, because
there is no completion message for unsent packets.
This patch fixes these problems.

(Please also include this patch for stable trees. Thanks!)
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d953ca4d

31 Jan, 2015 8 commits

drivers: net: xgene: fix: Out of order descriptor bytes read · ecf6ba83

Iyappan Subramanian authored Jan 29, 2015

This patch fixes the following kernel crash,

	WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_input.c:3079 tcp_clean_rtx_queue+0x658/0x80c()
	Call trace:
	[<fffffe0000096b7c>] dump_backtrace+0x0/0x184
	[<fffffe0000096d10>] show_stack+0x10/0x1c
	[<fffffe0000685ea0>] dump_stack+0x74/0x98
	[<fffffe00000b44e0>] warn_slowpath_common+0x88/0xb0
	[<fffffe00000b461c>] warn_slowpath_null+0x14/0x20
	[<fffffe00005b5c1c>] tcp_clean_rtx_queue+0x654/0x80c
	[<fffffe00005b6228>] tcp_ack+0x454/0x688
	[<fffffe00005b6ca8>] tcp_rcv_established+0x4a4/0x62c
	[<fffffe00005bf4b4>] tcp_v4_do_rcv+0x16c/0x350
	[<fffffe00005c225c>] tcp_v4_rcv+0x8e8/0x904
	[<fffffe000059d470>] ip_local_deliver_finish+0x100/0x26c
	[<fffffe000059dad8>] ip_local_deliver+0xac/0xc4
	[<fffffe000059d6c4>] ip_rcv_finish+0xe8/0x328
	[<fffffe000059dd3c>] ip_rcv+0x24c/0x38c
	[<fffffe0000563950>] __netif_receive_skb_core+0x29c/0x7c8
	[<fffffe0000563ea4>] __netif_receive_skb+0x28/0x7c
	[<fffffe0000563f54>] netif_receive_skb_internal+0x5c/0xe0
	[<fffffe0000564810>] napi_gro_receive+0xb4/0x110
	[<fffffe0000482a2c>] xgene_enet_process_ring+0x144/0x338
	[<fffffe0000482d18>] xgene_enet_napi+0x1c/0x50
	[<fffffe0000565454>] net_rx_action+0x154/0x228
	[<fffffe00000b804c>] __do_softirq+0x110/0x28c
	[<fffffe00000b8424>] irq_exit+0x8c/0xc0
	[<fffffe0000093898>] handle_IRQ+0x44/0xa8
	[<fffffe000009032c>] gic_handle_irq+0x38/0x7c
	[...]

Software writes poison data into the descriptor bytes[15:8] and upon
receiving the interrupt, if those bytes are overwritten by the hardware with
the valid data, software also reads bytes[7:0] and executes receive/tx
completion logic.

If the CPU executes the above two reads in out of order fashion, then the
bytes[7:0] will have older data and causing the kernel panic.  We have to
force the order of the reads and thus this patch introduces read memory
barrier between these reads.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ecf6ba83

Merge branch 'vlan_get_protocol' · 08178e5a

David S. Miller authored Jan 30, 2015

Toshiaki Makita says:

====================
Fix checksum error when using stacked vlan

When I was testing 802.1ad, I found several drivers don't take into
account 802.1ad or multiple vlans when retrieving L3 (IP/IPv6) or
L4 (TCP/UDP) protocol for checksum offload.

It is mainly due to vlan_get_protocol(), which extracts ether type only
when it is tagged with single 802.1Q. When 802.1ad is used or there are
multiple vlans, it extracts vlan protocol and drivers cannot determine
which L3/L4 protocol is used.

Those drivers, most of which have IP_CSUM/IPV6_CSUM features, get L3/L4
header-offset by software, so it seems that their checksum offload works
with multiple vlans if we can parse protocols correctly.
(They know mac header length, and probably don't care about what is in it.)

And another thing, some of Intel's drivers seem to use skb->protocol where
vlan_get_protocol() is more suitable.

I tested that at least igb/igbvf on I350 works with this patch set.

Note:
We can hand a double tagged packet with CHECKSUM_PARTIAL to a HW driver
by creating a vlan device on a bridge device and enabling vlan_filtering
of the bridge with 802.1ad protocol.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

08178e5a

ixgbevf: Fix checksum error when using stacked vlan · 10e4fb33

Toshiaki Makita authored Jan 29, 2015

When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
ixgbevf_tx_csum() fails to get the network protocol and checksum related
descriptor fields are not configured correctly because skb->protocol
doesn't show the L3 protocol in this case.

Use first->protocol instead of skb->protocol to get the proper network
protocol.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

10e4fb33

ixgbe: Fix checksum error when using stacked vlan · 0213668f

Toshiaki Makita authored Jan 29, 2015

When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
ixgbe_tx_csum() fails to get the network protocol and checksum related
descriptor fields are not configured correctly because skb->protocol
doesn't show the L3 protocol in this case.

Use vlan_get_protocol() to get the proper network protocol.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

0213668f

igbvf: Fix checksum error when using stacked vlan · 72b14059

Toshiaki Makita authored Jan 29, 2015

When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
igbvf_tx_csum() fails to get the network protocol and checksum related
descriptor fields are not configured correctly because skb->protocol
doesn't show the L3 protocol in this case.

Use vlan_get_protocol() to get the proper network protocol.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

72b14059

net: Fix vlan_get_protocol for stacked vlan · d4bcef3f

Toshiaki Makita authored Jan 29, 2015

vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
vlan tag or multiple vlans, which caused incorrect checksum calculation
in several drivers.

Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
vlan protocol.

As the logic is the same as skb_network_protocol(), create a common helper
function __vlan_get_protocol() and call it from existing functions.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4bcef3f

net: sctp: fix passing wrong parameter header to param_type2af in sctp_process_param · cfbf654e

Saran Maruti Ramanara authored Jan 29, 2015

When making use of RFC5061, section 4.2.4. for setting the primary IP
address, we're passing a wrong parameter header to param_type2af(),
resulting always in NULL being returned.

At this point, param.p points to a sctp_addip_param struct, containing
a sctp_paramhdr (type = 0xc004, length = var), and crr_id as a correlation
id. Followed by that, as also presented in RFC5061 section 4.2.4., comes
the actual sctp_addr_param, which also contains a sctp_paramhdr, but
this time with the correct type SCTP_PARAM_IPV{4,6}_ADDRESS that
param_type2af() can make use of. Since we already hold a pointer to
addr_param from previous line, just reuse it for param_type2af().

Fixes: d6de3097 ("[SCTP]: Add the handling of "Set Primary IP Address" parameter to INIT")
Signed-off-by: Saran Maruti Ramanara <saran.neti@telus.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cfbf654e

netlink: fix wrong subscription bitmask to group mapping in · 8b7c36d8

Pablo Neira authored Jan 29, 2015

The subscription bitmask passed via struct sockaddr_nl is converted to
the group number when calling the netlink_bind() and netlink_unbind()
callbacks.

The conversion is however incorrect since bitmask (1 << 0) needs to be
mapped to group number 1. Note that you cannot specify the group number 0
(usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP
since this is rejected through -EINVAL.

This problem became noticeable since 97840cb6 ("netfilter: nfnetlink:
fix insufficient validation in nfnetlink_bind") when binding to bitmask
(1 << 0) in ctnetlink.
Reported-by: Andre Tomt <andre@tomt.net>
Reported-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b7c36d8

30 Jan, 2015 2 commits

netfilter: nf_tables: fix leaks in error path of nf_tables_newchain() · f5553c19

Pablo Neira Ayuso authored Jan 29, 2015

Release statistics and module refcount on memory allocation problems.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

f5553c19

ipvs: rerouting to local clients is not needed anymore · 579eb62a

Julian Anastasov authored Dec 18, 2014

commit f5a41847 ("ipvs: move ip_route_me_harder for ICMP")
from 2.6.37 introduced ip_route_me_harder() call for responses to
local clients, so that we can provide valid rt_src after SNAT.
It was used by TCP to provide valid daddr for ip_send_reply().
After commit 0a5ebb80 ("ipv4: Pass explicit daddr arg to
ip_send_reply()." from 3.0 this rerouting is not needed anymore
and should be avoided, especially in LOCAL_IN.

Fixes 3.12.33 crash in xfrm reported by Florian Wiessner:
"3.12.33 - BUG xfrm_selector_match+0x25/0x2f6"
Reported-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de>
Tested-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>

579eb62a

29 Jan, 2015 16 commits

ipv4: Don't increase PMTU with Datagram Too Big message. · 3cdaa5be

Li Wei authored Jan 29, 2015

RFC 1191 said, "a host MUST not increase its estimate of the Path
MTU in response to the contents of a Datagram Too Big message."
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3cdaa5be

Merge branch 'arm-build-fixes' · a1a0b558

David S. Miller authored Jan 29, 2015

Arnd Bergmann says:

====================
net: driver fixes from arm randconfig builds

These four patches are fallout from test builds on ARM. I have a
few more of them in my backlog but have not yet confirmed them
to still be valid.

The first three patches are about incomplete dependencies on
old drivers. One could backport them to the beginning of time
in theory, but there is little value since nobody would run into
these problems.

The final patch is one I had submitted before together with the
respective pcmcia patch but forgot to follow up on that. It's
still a valid but relatively theoretical bug, because the previous
behavior of the driver was just as broken as what we have in
mainline.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a1a0b558

net: am2150: fix nmclan_cs.c shared interrupt handling · 96a30175

Arnd Bergmann authored Jan 28, 2015

A recent patch tried to work around a valid warning for the use of a
deprecated interface by blindly changing from the old
pcmcia_request_exclusive_irq() interface to pcmcia_request_irq().

This driver has an interrupt handler that is not currently aware
of shared interrupts, but can be easily converted to be.
At the moment, the driver reads the interrupt status register
repeatedly until it contains only zeroes in the interesting bits,
and handles each bit individually.

This patch adds the missing part of returning IRQ_NONE in case none
of the bits are set to start with, so we can move on to the next
interrupt source.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5f5316fc ("am2150: Update nmclan_cs.c to use update PCMCIA API")
Signed-off-by: David S. Miller <davem@davemloft.net>

96a30175

net: lance,ni64: don't build for ARM · e9b106b8

Arnd Bergmann authored Jan 28, 2015

The ni65 and lance ethernet drivers manually program the ISA DMA
controller that is only available on x86 PCs and a few compatible
systems. Trying to build it on ARM results in this error:

ni65.c: In function 'ni65_probe1':
ni65.c:496:62: error: 'DMA1_STAT_REG' undeclared (first use in this function)
     ((inb(DMA1_STAT_REG) >> 4) & 0x0f)
                                                              ^
ni65.c:496:62: note: each undeclared identifier is reported only once for each function it appears in
ni65.c:497:63: error: 'DMA2_STAT_REG' undeclared (first use in this function)
     | (inb(DMA2_STAT_REG) & 0xf0);

The DMA1_STAT_REG and DMA2_STAT_REG registers are only defined for
alpha, mips, parisc, powerpc and x86, although it is not clear
which subarchitectures actually have them at the correct location.

This patch for now just disables it for ARM, to avoid randconfig
build errors. We could also decide to limit it to the set of
architectures on which it does compile, but that might look more
deliberate than guessing based on where the drivers build.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

e9b106b8

net: wan: add missing virt_to_bus dependencies · 303c28d8

Arnd Bergmann authored Jan 28, 2015

The cosa driver is rather outdated and does not get built on most
platforms because it requires the ISA_DMA_API symbol. However
there are some ARM platforms that have ISA_DMA_API but no virt_to_bus,
and they get this build error when enabling the ltpc driver.

drivers/net/wan/cosa.c: In function 'tx_interrupt':
drivers/net/wan/cosa.c:1768:3: error: implicit declaration of function 'virt_to_bus'
   unsigned long addr = virt_to_bus(cosa->txbuf);
   ^

The same problem exists for the Hostess SV-11 and Sealevel Systems 4021
drivers.

This adds another dependency in Kconfig to avoid that configuration.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

303c28d8

net: cs89x0: always build platform code if !HAS_IOPORT_MAP · fc9a5707

Arnd Bergmann authored Jan 28, 2015

The cs89x0 driver can either be built as an ISA driver or a platform
driver, the choice is controlled by the CS89x0_PLATFORM Kconfig
symbol. Building the ISA driver on a system that does not have
a way to map I/O ports fails with this error:

drivers/built-in.o: In function `cs89x0_ioport_probe.constprop.1':
:(.init.text+0x4794): undefined reference to `ioport_map'
:(.init.text+0x4830): undefined reference to `ioport_unmap'

This changes the Kconfig logic to take that option away and
always force building the platform variant of this driver if
CONFIG_HAS_IOPORT_MAP is not set. This is the only correct
choice in this case, and it avoids the build error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc9a5707

ppp: deflate: never return len larger than output buffer · e2a4800e

Florian Westphal authored Jan 28, 2015

When we've run out of space in the output buffer to store more data, we
will call zlib_deflate with a NULL output buffer until we've consumed
remaining input.

When this happens, olen contains the size the output buffer would have
consumed iff we'd have had enough room.

This can later cause skb_over_panic when ppp_generic skb_put()s
the returned length.
Reported-by: Iain Douglas <centos@1n6.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2a4800e

Merge branch 'netns' · d445d63b

David S. Miller authored Jan 29, 2015

Nicolas Dichtel says:

====================
netns: audit netdevice creation with IFLA_NET_NS_[PID|FD]

When one of these attributes is set, the netdevice is created into the netns
pointed by IFLA_NET_NS_[PID|FD] (see the call to rtnl_create_link() in
rtnl_newlink()). Let's call this netns the dest_net. After this creation, if the
newlink handler exists, it is called with a netns argument that points to the
netns where the netlink message has been received (called src_net in the code)
which is the link netns.
Hence, with one of these attributes, it's possible to create a x-netns
netdevice.

Here is the result of my code review:
- all ip tunnels (sit, ipip, ip6_tunnels, gre[tap][v6], ip_vti[6]) does not
  really allows to use this feature: the netdevice is created in the dest_net
  and the src_net is completely ignored in the newlink handler.
- VLAN properly handles this x-netns creation.
- bridge ignores src_net, which seems fine (NETIF_F_NETNS_LOCAL is set).
- CAIF subsystem is not clear for me (I don't know how it works), but it seems
  to wrongly use src_net. Patch #1 tries to fix this, but it was done only by
  code review (and only compile-tested), so please carefully review it. I may
  miss something.
- HSR subsystem uses src_net to parse IFLA_HSR_SLAVE[1|2], but the netdevice has
  the flag NETIF_F_NETNS_LOCAL, so the question is: does this netdevice really
  supports x-netns? If not, the newlink handler should use the dest_net instead
  of src_net, I can provide the patch.
- ieee802154 uses also src_net and does not have NETIF_F_NETNS_LOCAL. Same
  question: does this netdevice really supports x-netns?
- bonding ignores src_net and flag NETIF_F_NETNS_LOCAL is set, ie x-netns is not
  supported. Fine.
- CAN does not support rtnl/newlink, ok.
- ipvlan uses src_net and does not have NETIF_F_NETNS_LOCAL. After looking at
  the code, it seems that this drivers support x-netns. Am I right?
- macvlan/macvtap uses src_net and seems to have x-netns support.
- team ignores src_net and has the flag NETIF_F_NETNS_LOCAL, ie x-netns is not
  supported. Ok.
- veth uses src_net and have x-netns support ;-) Ok.
- VXLAN didn't properly handle this. The link netns (vxlan->net) is the src_net
  and not dest_net (see patch #2). Note that it was already possible to create a
  x-netns vxlan before the commit f01ec1c0 ("vxlan: add x-netns support")
  but the nedevice remains broken.

To summarize:
 - CAIF patch must be carefully reviewed
 - for HSR, ieee802154, ipvlan: is x-netns supported?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d445d63b

vxlan: setup the right link netns in newlink hdlr · 33564bbb

Nicolas Dichtel authored Jan 26, 2015

Rename the netns to src_net to avoid confusion with the netns where the
interface stands. The user may specify IFLA_NET_NS_[PID|FD] to create
a x-netns netndevice: IFLA_NET_NS_[PID|FD] points to the netns where the
netdevice stands and src_net to the link netns.

Note that before commit f01ec1c0 ("vxlan: add x-netns support"), it was
possible to create a x-netns vxlan netdevice, but the netdevice was not
operational.

Fixes: f01ec1c0 ("vxlan: add x-netns support")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33564bbb

caif: remove wrong dev_net_set() call · 8997c27e

Nicolas Dichtel authored Jan 26, 2015

src_net points to the netns where the netlink message has been received. This
netns may be different from the netns where the interface is created (because
the user may add IFLA_NET_NS_[PID|FD]). In this case, src_net is the link netns.

It seems wrong to override the netns in the newlink() handler because if it
was not already src_net, it means that the user explicitly asks to create the
netdevice in another netns.

CC: Sjur Brændeland <sjur.brandeland@stericsson.com>
CC: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Fixes: 8391c4aa ("caif: Bugfixes in CAIF netdevice for close and flow control")
Fixes: c4125400 ("caif-hsi: Add rtnl support")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8997c27e

lib/checksum.c: fix build for generic csum_tcpudp_nofold · 9ce35779

karl beldan authored Jan 29, 2015

Fixed commit added from64to32 under _#ifndef do_csum_ but used it
under _#ifndef csum_tcpudp_nofold_, breaking some builds (Fengguang's
robot reported TILEGX's). Move from64to32 under the latter.

Fixes: 150ae0e9 ("lib/checksum.c: fix carry in csum_tcpudp_nofold")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ce35779

tcp: ipv4: initialize unicast_sock sk_pacing_rate · 811230cd

Eric Dumazet authored Jan 28, 2015

When I added sk_pacing_rate field, I forgot to initialize its value
in the per cpu unicast_sock used in ip_send_unicast_reply()

This means that for sch_fq users, RST packets, or ACK packets sent
on behalf of TIME_WAIT sockets might be sent to slowly or even dropped
once we reach the per flow limit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 95bd09eb ("tcp: TSO packets automatic sizing")
Signed-off-by: David S. Miller <davem@davemloft.net>

811230cd

lib/checksum.c: fix carry in csum_tcpudp_nofold · 150ae0e9

karl beldan authored Jan 28, 2015

The carry from the 64->32bits folding was dropped, e.g with:
saddr=0xFFFFFFFF daddr=0xFF0000FF len=0xFFFF proto=0 sum=1,
csum_tcpudp_nofold returned 0 instead of 1.
Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

150ae0e9

bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify · 59ccaaaa

Roopa Prabhu authored Jan 28, 2015

Reported in: https://bugzilla.kernel.org/show_bug.cgi?id=92081

This patch avoids calling rtnl_notify if the device ndo_bridge_getlink
handler does not return any bytes in the skb.

Alternately, the skb->len check can be moved inside rtnl_notify.

For the bridge vlan case described in 92081, there is also a fix needed
in bridge driver to generate a proper notification. Will fix that in
subsequent patch.

v2: rebase patch on net tree
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

59ccaaaa

Merge branch 'tcp_stretch_acks' · 95224ac1

David S. Miller authored Jan 28, 2015

Neal Cardwell says:

====================
fix stretch ACK bugs in TCP CUBIC and Reno

This patch series fixes the TCP CUBIC and Reno congestion control
modules to properly handle stretch ACKs in their respective additive
increase modes, and in the transitions from slow start to additive
increase.

This finishes the project started by commit 9f9843a7 ("tcp:
properly handle stretch acks in slow start"), which fixed behavior for
TCP congestion control when handling stretch ACKs in slow start mode.

Motivation: In the Jan 2015 netdev thread 'BW regression after "tcp:
refine TSO autosizing"', Eyal Perry documented a regression that Eric
Dumazet determined was caused by improper handling of TCP stretch
ACKs.

Background: LRO, GRO, delayed ACKs, and middleboxes can cause "stretch
ACKs" that cover more than the RFC-specified maximum of 2
packets. These stretch ACKs can cause serious performance shortfalls
in common congestion control algorithms, like Reno and CUBIC, which
were designed and tuned years ago with receiver hosts that were not
using LRO or GRO, and were instead ACKing every other packet.

Testing: at Google we have been using this approach for handling
stretch ACKs for CUBIC datacenter and Internet traffic for several
years, with good results.

v2:
 * fixed return type of tcp_slow_start() to be u32 instead of int
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

95224ac1

tcp: fix timing issue in CUBIC slope calculation · d6b1a8a9

Neal Cardwell authored Jan 28, 2015

This patch fixes a bug in CUBIC that causes cwnd to increase slightly
too slowly when multiple ACKs arrive in the same jiffy.

If cwnd is supposed to increase at a rate of more than once per jiffy,
then CUBIC was sometimes too slow. Because the bic_target is
calculated for a future point in time, calculated with time in
jiffies, the cwnd can increase over the course of the jiffy while the
bic_target calculated as the proper CUBIC cwnd at time
t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
increases on jiffy tick boundaries.

So since the cnt is set to:
	ca->cnt = cwnd / (bic_target - cwnd);
as cwnd increases but bic_target does not increase due to jiffy
granularity, the cnt becomes too large, causing cwnd to increase
too slowly.

For example:
- suppose at the beginning of a jiffy, cwnd=40, bic_target=44
- so CUBIC sets:
   ca->cnt =  cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
- suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
   increases cwnd to 41
- so CUBIC sets:
   ca->cnt =  cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13

So now CUBIC will wait for 13 packets to be ACKed before increasing
cwnd to 42, insted of 10 as it should.

The fix is to avoid adjusting the slope (determined by ca->cnt)
multiple times within a jiffy, and instead skip to compute the Reno
cwnd, the "TCP friendliness" code path.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6b1a8a9