- 09 Mar, 2018 40 commits
-
-
Ido Schimmel authored
[ Upstream commit d1c95af3 ] When mlxsw replaces (or deletes) a route it removes the offload indication from the replaced route. This is problematic for IPv4 routes, as the offload indication is stored in the fib_info which is usually shared between multiple routes. Instead of unconditionally clearing the offload indication, only clear it if no other route is using the fib_info. Fixes: 3984d1a8 ("mlxsw: spectrum_router: Provide offload indication using nexthop flags") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Alexander Petrovskiy <alexpe@mellanox.com> Tested-by: Alexander Petrovskiy <alexpe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Paolo Abeni authored
[ Upstream commit d7cdee5e ] Li Shuang reported an Oops with cls_u32 due to an use-after-free in u32_destroy_key(). The use-after-free can be triggered with: dev=lo tc qdisc add dev $dev root handle 1: htb default 10 tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256 tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\ 10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1: tc qdisc del dev $dev root Which causes the following kasan splat: ================================================================== BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571 CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016 Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32] Call Trace: dump_stack+0xd6/0x182 ? dma_virt_map_sg+0x22e/0x22e print_address_description+0x73/0x290 kasan_report+0x277/0x360 ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] u32_delete_key_freepf_work+0x1c/0x30 [cls_u32] process_one_work+0xae0/0x1c80 ? sched_clock+0x5/0x10 ? pwq_dec_nr_in_flight+0x3c0/0x3c0 ? _raw_spin_unlock_irq+0x29/0x40 ? trace_hardirqs_on_caller+0x381/0x570 ? _raw_spin_unlock_irq+0x29/0x40 ? finish_task_switch+0x1e5/0x760 ? finish_task_switch+0x208/0x760 ? preempt_notifier_dec+0x20/0x20 ? __schedule+0x839/0x1ee0 ? check_noncircular+0x20/0x20 ? firmware_map_remove+0x73/0x73 ? find_held_lock+0x39/0x1c0 ? worker_thread+0x434/0x1820 ? lock_contended+0xee0/0xee0 ? lock_release+0x1100/0x1100 ? init_rescuer.part.16+0x150/0x150 ? retint_kernel+0x10/0x10 worker_thread+0x216/0x1820 ? process_one_work+0x1c80/0x1c80 ? lock_acquire+0x1a5/0x540 ? lock_downgrade+0x6b0/0x6b0 ? sched_clock+0x5/0x10 ? lock_release+0x1100/0x1100 ? compat_start_thread+0x80/0x80 ? do_raw_spin_trylock+0x190/0x190 ? _raw_spin_unlock_irq+0x29/0x40 ? trace_hardirqs_on_caller+0x381/0x570 ? _raw_spin_unlock_irq+0x29/0x40 ? finish_task_switch+0x1e5/0x760 ? finish_task_switch+0x208/0x760 ? preempt_notifier_dec+0x20/0x20 ? __schedule+0x839/0x1ee0 ? kmem_cache_alloc_trace+0x143/0x320 ? firmware_map_remove+0x73/0x73 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0x18/0x170 ? find_held_lock+0x39/0x1c0 ? schedule+0xf3/0x3b0 ? lock_downgrade+0x6b0/0x6b0 ? __schedule+0x1ee0/0x1ee0 ? do_wait_intr_irq+0x340/0x340 ? do_raw_spin_trylock+0x190/0x190 ? _raw_spin_unlock_irqrestore+0x32/0x60 ? process_one_work+0x1c80/0x1c80 ? process_one_work+0x1c80/0x1c80 kthread+0x312/0x3d0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 Allocated by task 1688: kasan_kmalloc+0xa0/0xd0 __kmalloc+0x162/0x380 u32_change+0x1220/0x3c9e [cls_u32] tc_ctl_tfilter+0x1ba6/0x2f80 rtnetlink_rcv_msg+0x4f0/0x9d0 netlink_rcv_skb+0x124/0x320 netlink_unicast+0x430/0x600 netlink_sendmsg+0x8fa/0xd60 sock_sendmsg+0xb1/0xe0 ___sys_sendmsg+0x678/0x980 __sys_sendmsg+0xc4/0x210 do_syscall_64+0x232/0x7f0 return_from_SYSCALL_64+0x0/0x75 Freed by task 112: kasan_slab_free+0x71/0xc0 kfree+0x114/0x320 rcu_process_callbacks+0xc3f/0x1600 __do_softirq+0x2bf/0xc06 The buggy address belongs to the object at ffff881b83dae600 which belongs to the cache kmalloc-4096 of size 4096 The buggy address is located 24 bytes inside of 4096-byte region [ffff881b83dae600, ffff881b83daf600) The buggy address belongs to the page: page:ffffea006e0f6a00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 flags: 0x17ffffc0008100(slab|head) raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007 raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The problem is that the htnode is freed before the linked knodes and the latter will try to access the first at u32_destroy_key() time. This change addresses the issue using the htnode refcnt to guarantee the correct free order. While at it also add a RCU annotation, to keep sparse happy. v1 -> v2: use rtnl_derefence() instead of RCU read locks v2 -> v3: - don't check refcnt in u32_destroy_hnode() - cleaned-up u32_destroy() implementation - cleaned-up code comment v3 -> v4: - dropped unneeded comment Reported-by: Li Shuang <shuali@redhat.com> Fixes: c0d378ef ("net_sched: use tcf_queue_work() in u32 filter") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Tom Lendacky authored
[ Upstream commit cfd092f2 ] After resuming from suspend, the PCI device support must re-enable the interrupt setting so that interrupts are actually delivered. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eran Ben Elisha authored
[ Upstream commit f600c608 ] Driver tries to copy at least MLX5E_MIN_INLINE bytes into the control segment of the WQE. It assumes that the linear part contains at least MLX5E_MIN_INLINE bytes, which can be wrong. Cited commit verified that driver will not copy more bytes into the inline header part that the actual size of the packet. Re-factor this check to make sure we do not exceed the linear part as well. This fix is aligned with the current driver's assumption that the entire L2 will be present in the linear part of the SKB. Fixes: 6aace17e ("net/mlx5e: Fix inline header size for small packets") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ido Schimmel authored
[ Upstream commit 0e5a82ef ] When a VLAN is added on a port, a reference is taken on the corresponding master VLAN entry. If it does not already exist, then it is created and a reference taken. However, in the second case a reference is not really taken when CONFIG_REFCOUNT_FULL is enabled as refcount_inc() is replaced by refcount_inc_not_zero(). Fix this by using refcount_set() on a newly created master VLAN entry. Fixes: 25127759 ("net, bridge: convert net_bridge_vlan.refcnt from atomic_t to refcount_t") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexey Kodanev authored
[ Upstream commit 957d761c ] When going through the bind address list in sctp_v6_get_dst() and the previously found address is better ('matchlen > bmatchlen'), the code continues to the next iteration without releasing currently held destination. Fix it by releasing 'bdst' before continue to the next iteration, and instead of introducing one more '!IS_ERR(bdst)' check for dst_release(), move the already existed one right after ip6_dst_lookup_flow(), i.e. we shouldn't proceed further if we get an error for the route lookup. Fixes: dbc2b5e9 ("sctp: fix src address selection if using secondary addresses for ipv6") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
David Ahern authored
[ Upstream commit 1fe4b118 ] The result of the skb flow dissect is copied from keys to hash_keys to ensure only the intended data is hashed. The original L4 hash patch overlooked setting the addr_type for this case; add it. Fixes: bf4e0a3d ("net: ipv4: add support for ECMP hash policy choice") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jiri Pirko authored
[ Upstream commit 0f2d2b27 ] Since mlxsw_sp_fib_create() and mlxsw_sp_mr_table_create() use ERR_PTR macro to propagate int err through return of a pointer, the return value is not NULL in case of failure. So if one of the calls fails, one of vr->fib4, vr->fib6 or vr->mr4_table is not NULL and mlxsw_sp_vr_is_used wrongly assumes that vr is in use which leads to crash like following one: [ 1293.949291] BUG: unable to handle kernel NULL pointer dereference at 00000000000006c9 [ 1293.952729] IP: mlxsw_sp_mr_table_flush+0x15/0x70 [mlxsw_spectrum] Fix this by using local variables to hold the pointers and set vr->* only in case everything went fine. Fixes: 76610ebb ("mlxsw: spectrum_router: Refactor virtual router handling") Fixes: a3d9bc50 ("mlxsw: spectrum_router: Extend virtual routers with IPv6 support") Fixes: d42b0965 ("mlxsw: spectrum_router: Add multicast routes notification handling functionality") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Yuchung Cheng authored
[ Upstream commit fc68e171 ] This reverts commit 89fe18e4. While the patch could detect more spurious timeouts, it could cause poor TCP performance on broken middle-boxes that modifies TCP packets (e.g. receive window, SACK options). Since the performance gain is much smaller compared to the potential loss. The best solution is to fully revert the change. Fixes: 89fe18e4 ("tcp: extend F-RTO to catch more spurious timeouts") Reported-by: Teodor Milkov <tm@del.bg> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Yuchung Cheng authored
[ Upstream commit d4131f09 ] This reverts commit cc663f4d. While fixing some broken middle-boxes that modifies receive window fields, it does not address middle-boxes that strip off SACK options. The best solution is to fully revert this patch and the root F-RTO enhancement. Fixes: cc663f4d ("tcp: restrict F-RTO to work-around broken middle-boxes") Reported-by: Teodor Milkov <tm@del.bg> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Xin Long authored
[ Upstream commit 27af86bb ] The pr_err in sctp_hash_transport was supposed to report a sctp bug for using rhashtable/rhlist. The err '-EEXIST' introduced in Commit cd2b7087 ("sctp: check duplicate node before inserting a new transport") doesn't belong to that case. So just return -EEXIST back without pr_err any kmsg. Fixes: cd2b7087 ("sctp: check duplicate node before inserting a new transport") Reported-by: Wei Chen <weichen@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ivan Vecera authored
[ Upstream commit eb53f7af ] The following sequence is currently broken: # tc qdisc add dev foo ingress # tc filter replace dev foo protocol all ingress \ u32 match u8 0 0 action mirred egress mirror dev bar1 # tc filter replace dev foo protocol all ingress \ handle 800::800 pref 49152 \ u32 match u8 0 0 action mirred egress mirror dev bar2 Error: cls_u32: Key node flags do not match passed flags. We have an error talking to the kernel, -1 The error comes from u32_change() when comparing new and existing flags. The existing ones always contains one of TCA_CLS_FLAGS_{,NOT}_IN_HW flag depending on offloading state. These flags cannot be passed from userspace so the condition (n->flags != flags) in u32_change() always fails. Fix the condition so the flags TCA_CLS_FLAGS_NOT_IN_HW and TCA_CLS_FLAGS_IN_HW are not taken into account. Fixes: 24d3dc6d ("net/sched: cls_u32: Reflect HW offload status") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit a5f7add3 ] pfifo_fast got percpu stats lately, uncovering a bug I introduced last year in linux-4.10. I missed the fact that we have to clear our temporary storage before calling __gnet_stats_copy_basic() in the case of percpu stats. Without this fix, rate estimators (tc qd replace dev xxx root est 1sec 4sec pfifo_fast) are utterly broken. Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Inbar Karmy authored
[ Upstream commit ef7a3518 ] When GRO is off, the transport header pointer in sk_buff is initialized to network's header. To find the udp header, instead of using udp_hdr() which assumes skb_network_header was set, manually calculate the udp header offset. Fixes: 0952da79 ("net/mlx5e: Add support for loopback selftest") Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Tonghao Zhang authored
[ Upstream commit a61a86f8 ] The SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096. And the tcp_wmem/tcp_rmem min default values are 4096. Fixes: bd68a2a8 ("net: set SK_MEM_QUANTUM to 4096") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 350c9f48 ] BBR uses tcp_tso_autosize() in an attempt to probe what would be the burst sizes and to adjust cwnd in bbr_target_cwnd() with following gold formula : /* Allow enough full-sized skbs in flight to utilize end systems. */ cwnd += 3 * bbr->tso_segs_goal; But GSO can be lacking or be constrained to very small units (ip link set dev ... gso_max_segs 2) What we really want is to have enough packets in flight so that both GSO and GRO are efficient. So in the case GSO is off or downgraded, we still want to have the same number of packets in flight as if GSO/TSO was fully operational, so that GRO can hopefully be working efficiently. To fix this issue, we make tcp_tso_autosize() unaware of sk->sk_gso_max_segs Only tcp_tso_segs() has to enforce the gso_max_segs limit. Tested: ethtool -K eth0 tso off gso off tc qd replace dev eth0 root pfifo_fast Before patch: for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done 691 (ss -temoi shows cwnd is stuck around 6 ) 667 651 631 517 After patch : # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done 1733 (ss -temoi shows cwnd is around 386 ) 1778 1746 1781 1718 Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
David Howells authored
[ Upstream commit 93c62c45 ] All the kernel_sendmsg() calls in rxrpc_send_data_packet() need to send both parts of the iov[] buffer, but one of them does not. Fix it so that it does. Without this, short IPv6 rxrpc DATA packets may be seen that have the rxrpc header included, but no payload. Fixes: 5a924b89 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ilya Lesokhin authored
[ Upstream commit 808cf9e3 ] Avoid SKB coalescing if eor bit is set in one of the relevant SKBs. Fixes: c134ecb8 ("tcp: Make use of MSG_EOR in tcp_sendmsg") Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heiner Kallweit authored
[ Upstream commit 08f51385 ] This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates also PHY state changes and we should do what the symbol says. Fixes: 84a527a4 ("net: phylib: fix interrupts re-enablement in phy_start") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Gal Pressman authored
[ Upstream commit 2f0db879 ] When allocating a drop rq, no numa node is explicitly set which means allocations are done on node zero. This is not necessarily the nearest numa node to the HCA, and even worse, might even be a memoryless numa node. Choose the numa_node given to us by the pci device in order to properly allocate the coherent dma memory instead of assuming zero is valid. Fixes: 556dd1b9 ("net/mlx5e: Set drop RQ's necessary parameters only") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Shalom Toledo authored
[ Upstream commit 0a8a1bf1 ] Until now, we assumed that in case of error when adding FDB entries, the write operation will fail, but this is not the case. Instead, we need to check that the number of entries reported in the response is equal to the number of entries specified in the request. Fixes: 56ade8fe ("mlxsw: spectrum: Add initial support for Spectrum ASIC") Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Tommi Rantala authored
[ Upstream commit 4a31a6b1 ] Fix dst reference count leak in sctp_v4_get_dst() introduced in commit 410f0383 ("sctp: add routing output fallback"): When walking the address_list, successive ip_route_output_key() calls may return the same rt->dst with the reference incremented on each call. The code would not decrement the dst refcount when the dst pointer was identical from the previous iteration, causing the dst refcnt leak. Testcase: ip netns add TEST ip netns exec TEST ip link set lo up ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add dummy2 type dummy ip link set dev dummy0 netns TEST ip link set dev dummy1 netns TEST ip link set dev dummy2 netns TEST ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0 ip netns exec TEST ip link set dummy0 up ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1 ip netns exec TEST ip link set dummy1 up ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2 ip netns exec TEST ip link set dummy2 up ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 20000 -s -B 192.168.1.3 ip netns del TEST In 4.4 and 4.9 kernels this results to: [ 354.179591] unregister_netdevice: waiting for lo to become free. Usage count = 1 [ 364.419674] unregister_netdevice: waiting for lo to become free. Usage count = 1 [ 374.663664] unregister_netdevice: waiting for lo to become free. Usage count = 1 [ 384.903717] unregister_netdevice: waiting for lo to become free. Usage count = 1 [ 395.143724] unregister_netdevice: waiting for lo to become free. Usage count = 1 [ 405.383645] unregister_netdevice: waiting for lo to become free. Usage count = 1 ... Fixes: 410f0383 ("sctp: add routing output fallback") Fixes: 0ca50d12 ("sctp: fix src address selection if using secondary addresses") Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Gal Pressman authored
[ Upstream commit 8babd44d ] When receiving an LRO packet, the checksum field is set by the hardware to the checksum of the first coalesced packet. Obviously, this checksum is not valid for the merged LRO packet and should be fixed. We can use the CQE checksum which covers the checksum of the entire merged packet TCP payload to help us calculate the checksum incrementally. Tested by sending IPv4/6 traffic with LRO enabled, RX checksum disabled and watching nstat checksum error counters (in addition to the obvious bandwidth drop caused by checksum errors). This bug is usually "hidden" since LRO packets would go through the CHECKSUM_UNNECESSARY flow which does not validate the packet checksum. It's important to note that previous to this patch, LRO packets provided with CHECKSUM_UNNECESSARY are indeed packets with a correct validated checksum (even though the checksum inside the TCP header is incorrect), since the hardware LRO aggregation is terminated upon receiving a packet with bad checksum. Fixes: e586b3b0 ("net/mlx5: Ethernet Datapath files") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexey Kodanev authored
[ Upstream commit 15f35d49 ] Since UDP-Lite is always using checksum, the following path is triggered when calculating pseudo header for it: udp4_csum_init() or udp6_csum_init() skb_checksum_init_zero_check() __skb_checksum_validate_complete() The problem can appear if skb->len is less than CHECKSUM_BREAK. In this particular case __skb_checksum_validate_complete() also invokes __skb_checksum_complete(skb). If UDP-Lite is using partial checksum that covers only part of a packet, the function will return bad checksum and the packet will be dropped. It can be fixed if we skip skb_checksum_init_zero_check() and only set the required pseudo header checksum for UDP-Lite with partial checksum before udp4_csum_init()/udp6_csum_init() functions return. Fixes: ed70fcfc ("net: Call skb_checksum_init in IPv4") Fixes: e4f45b7f ("net: Call skb_checksum_init in IPv6") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexey Kodanev authored
[ Upstream commit 07f2c7ab ] When SCTP makes INIT or INIT_ACK packet the total chunk length can exceed SCTP_MAX_CHUNK_LEN which leads to kernel panic when transmitting these packets, e.g. the crash on sending INIT_ACK: [ 597.804948] skbuff: skb_over_panic: text:00000000ffae06e4 len:120168 put:120156 head:000000007aa47635 data:00000000d991c2de tail:0x1d640 end:0xfec0 dev:<NULL> ... [ 597.976970] ------------[ cut here ]------------ [ 598.033408] kernel BUG at net/core/skbuff.c:104! [ 600.314841] Call Trace: [ 600.345829] <IRQ> [ 600.371639] ? sctp_packet_transmit+0x2095/0x26d0 [sctp] [ 600.436934] skb_put+0x16c/0x200 [ 600.477295] sctp_packet_transmit+0x2095/0x26d0 [sctp] [ 600.540630] ? sctp_packet_config+0x890/0x890 [sctp] [ 600.601781] ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp] [ 600.671356] ? sctp_cmp_addr_exact+0x3f/0x90 [sctp] [ 600.731482] sctp_outq_flush+0x663/0x30d0 [sctp] [ 600.788565] ? sctp_make_init+0xbf0/0xbf0 [sctp] [ 600.845555] ? sctp_check_transmitted+0x18f0/0x18f0 [sctp] [ 600.912945] ? sctp_outq_tail+0x631/0x9d0 [sctp] [ 600.969936] sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp] [ 601.041593] ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp] [ 601.104837] ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp] [ 601.175436] ? sctp_eat_data+0x1710/0x1710 [sctp] [ 601.233575] sctp_do_sm+0x182/0x560 [sctp] [ 601.284328] ? sctp_has_association+0x70/0x70 [sctp] [ 601.345586] ? sctp_rcv+0xef4/0x32f0 [sctp] [ 601.397478] ? sctp6_rcv+0xa/0x20 [sctp] ... Here the chunk size for INIT_ACK packet becomes too big, mostly because of the state cookie (INIT packet has large size with many address parameters), plus additional server parameters. Later this chunk causes the panic in skb_put_data(): skb_packet_transmit() sctp_packet_pack() skb_put_data(nskb, chunk->skb->data, chunk->skb->len); 'nskb' (head skb) was previously allocated with packet->size from u16 'chunk->chunk_hdr->length'. As suggested by Marcelo we should check the chunk's length in _sctp_make_chunk() before trying to allocate skb for it and discard a chunk if its size bigger than SCTP_MAX_CHUNK_LEN. Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leinter@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Guillaume Nault authored
[ Upstream commit 77f840e3 ] PPP units don't hold any reference on the channels connected to it. It is the channel's responsibility to ensure that it disconnects from its unit before being destroyed. In practice, this is ensured by ppp_unregister_channel() disconnecting the channel from the unit before dropping a reference on the channel. However, it is possible for an unregistered channel to connect to a PPP unit: register a channel with ppp_register_net_channel(), attach a /dev/ppp file to it with ioctl(PPPIOCATTCHAN), unregister the channel with ppp_unregister_channel() and finally connect the /dev/ppp file to a PPP unit with ioctl(PPPIOCCONNECT). Once in this situation, the channel is only held by the /dev/ppp file, which can be released at anytime and free the channel without letting the parent PPP unit know. Then the ppp structure ends up with dangling pointers in its ->channels list. Prevent this scenario by forbidding unregistered channels from connecting to PPP units. This maintains the code logic by keeping ppp_unregister_channel() responsible from disconnecting the channel if necessary and avoids modification on the reference counting mechanism. This issue seems to predate git history (successfully reproduced on Linux 2.6.26 and earlier PPP commits are unrelated). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Roman Kapl authored
[ Upstream commit 5ae437ad ] So far, if the filter was too large to fit in the allocated skb, the kernel did not return any error and stopped dumping. Modify the dumper so that it returns -EMSGSIZE when a filter fails to dump and it is the first filter in the skb. If we are not first, we will get a next chance with more room. I understand this is pretty near to being an API change, but the original design (silent truncation) can be considered a bug. Note: The error case can happen pretty easily if you create a filter with 32 actions and have 4kb pages. Also recent versions of iproute try to be clever with their buffer allocation size, which in turn leads to Signed-off-by: Roman Kapl <code@rkapl.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nicolas Dichtel authored
[ Upstream commit cb9f7a9a ] Nowadays, nlmsg_multicast() returns only 0 or -ESRCH but this was not the case when commit 134e6375 was pushed. However, there was no reason to stop the loop if a netns does not have listeners. Returns -ESRCH only if there was no listeners in all netns. To avoid having the same problem in the future, I didn't take the assumption that nlmsg_multicast() returns only 0 or -ESRCH. Fixes: 134e6375 ("genetlink: make netns aware") CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sabrina Dubroca authored
[ Upstream commit c7272c2f ] According to RFC 1191 sections 3 and 4, ICMP frag-needed messages indicating an MTU below 68 should be rejected: A host MUST never reduce its estimate of the Path MTU below 68 octets. and (talking about ICMP frag-needed's Next-Hop MTU field): This field will never contain a value less than 68, since every router "must be able to forward a datagram of 68 octets without fragmentation". Furthermore, by letting net.ipv4.route.min_pmtu be set to negative values, we can end up with a very large PMTU when (-1) is cast into u32. Let's also make ip_rt_min_pmtu a u32, since it's only ever compared to unsigned ints. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jakub Kicinski authored
[ Upstream commit ac5b7019 ] netif_set_real_num_tx_queues() can be called when netdev is up. That usually happens when user requests change of number of channels/rings with ethtool -L. The procedure for changing the number of queues involves resetting the qdiscs and setting dev->num_tx_queues to the new value. When the new value is lower than the old one, extra care has to be taken to ensure ordering of accesses to the number of queues vs qdisc reset. Currently the queues are reset before new dev->num_tx_queues is assigned, leaving a window of time where packets can be enqueued onto the queues going down, leading to a likely crash in the drivers, since most drivers don't check if TX skbs are assigned to an active queue. Fixes: e6484930 ("net: allocate tx queues in register_netdevice") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Grygorii Strashko authored
[ Upstream commit 62f94c21 ] It was discovered that simple program which indefinitely sends 200b UDP packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog timeout is triggered due to race between cpsw_ndo_start_xmit() and cpsw_tx_handler() [NAPI] cpsw_ndo_start_xmit() if (unlikely(!cpdma_check_free_tx_desc(txch))) { txq = netdev_get_tx_queue(ndev, q_idx); netif_tx_stop_queue(txq); ^^ as per [1] barier has to be used after set_bit() otherwise new value might not be visible to other cpus } cpsw_tx_handler() if (unlikely(netif_tx_queue_stopped(txq))) netif_tx_wake_queue(txq); and when it happens ndev TX queue became disabled forever while driver's HW TX queue is empty. Fix this, by adding smp_mb__after_atomic() after netif_tx_stop_queue() calls and double check for free TX descriptors after stopping ndev TX queue - if there are free TX descriptors wake up ndev TX queue. [1] https://www.kernel.org/doc/html/latest/core-api/atomic_ops.htmlSigned-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Wolfram Sang authored
[ Upstream commit a3276892 ] Due to a typo, the mask was destroyed by a comparison instead of a bit shift. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Arnd Bergmann authored
[ Upstream commit ca79bec2 ] gcc-8 has a new warning that detects overlapping input and output arguments in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(), which is actually correct: net/ipv6/sit.c: In function 'sit_init_net': net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict] The problem here is that the logic detecting the memcpy() arguments finds them to be the same, but the conditional that tests for the input and output of ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant. We know that netdev_priv(t->dev) is the same as t for a tunnel device, and comparing "dev" directly here lets the compiler figure out as well that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so it no longer warns. This code is old, so Cc stable to make sure that we don't get the warning for older kernels built with new gcc. Cc: Martin Sebor <msebor@gmail.com> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Denis Du authored
[ Upstream commit b6c3bad1 ] Sometimes when physical lines have a just good noise to make the protocol handshaking fail, but the carrier detect still good. Then after remove of the noise, nobody will trigger this protocol to be start again to cause the link to never come back. The fix is when the carrier is still on, not terminate the protocol handshaking. Signed-off-by: Denis Du <dudenis2000@yahoo.ca> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Stefano Brivio authored
[ Upstream commit a8c6db1d ] In fib_nh_match(), if output interface or gateway are passed in the FIB configuration, we don't have to check next hops of multipath routes to conclude whether we have a match or not. However, we might still have routes with different realms matching the same output interface and gateway configuration, and this needs to cause the match to fail. Otherwise the first route inserted in the FIB will match, regardless of the realms: # ip route add 1.1.1.1 dev eth0 table 1234 realms 1/2 # ip route append 1.1.1.1 dev eth0 table 1234 realms 3/4 # ip route list table 1234 1.1.1.1 dev eth0 scope link realms 1/2 1.1.1.1 dev eth0 scope link realms 3/4 # ip route del 1.1.1.1 dev ens3 table 1234 realms 3/4 # ip route list table 1234 1.1.1.1 dev ens3 scope link realms 3/4 whereas route with realms 3/4 should have been deleted instead. Explicitly check for fc_flow passed in the FIB configuration (this comes from RTA_FLOW extracted by rtm_to_fib_config()) and fail matching if it differs from nh_tclassid. The handling of RTA_FLOW for multipath routes later in fib_nh_match() is still needed, as we can have multiple RTA_FLOW attributes that need to be matched against the tclassid of each next hop. v2: Check that fc_flow is set before discarding the match, so that the user can still select the first matching rule by not specifying any realm, as suggested by David Ahern. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Xin Long authored
[ Upstream commit 1b12580a ] Now br_sysfs_if file flush doesn't have attr show. To read it will cause kernel panic after users chmod u+r this file. Xiong found this issue when running the commands: ip link add br0 type bridge ip link add type veth ip link set veth0 master br0 chmod u+r /sys/devices/virtual/net/veth0/brport/flush timeout 3 cat /sys/devices/virtual/net/veth0/brport/flush kernel crashed with NULL a pointer dereference call trace. This patch is to fix it by return -EINVAL when brport_attr->show is null, just the same as the check for brport_attr->store in brport_store(). Fixes: 9cf63747 ("bridge: add sysfs hook to flush forwarding table") Reported-by: Xiong Zhou <xzhou@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Thomas Gleixner authored
commit 945fd17a upstream. The separation of the cpu_entry_area from the fixmap missed the fact that on 32bit non-PAE kernels the cpu_entry_area mapping might not be covered in initial_page_table by the previous synchronizations. This results in suspend/resume failures because 32bit utilizes initial page table for resume. The absence of the cpu_entry_area mapping results in a triple fault, aka. insta reboot. With PAE enabled this works by chance because the PGD entry which covers the fixmap and other parts incindentally provides the cpu_entry_area mapping as well. Synchronize the initial page table after setting up the cpu entry area. Instead of adding yet another copy of the same code, move it to a function and invoke it from the various places. It needs to be investigated if the existing calls in setup_arch() and setup_per_cpu_areas() can be replaced by the later invocation from setup_cpu_entry_areas(), but that's beyond the scope of this fix. Fixes: 92a0f81d ("x86/cpu_entry_area: Move it out of the fixmap") Reported-by: Woody Suwalski <terraluna977@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Woody Suwalski <terraluna977@gmail.com> Cc: William Grant <william.grant@canonical.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1802282137290.1392@nanos.tec.linutronix.deSigned-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sebastian Panceac authored
commit 028091f8 upstream. When the Intel Edison module is powered with 3.3V, the reboot command makes the module stuck. If the module is powered at a greater voltage, like 4.4V (as the Edison Mini Breakout board does), reboot works OK. The official Intel Edison BSP sends the IPCMSG_COLD_RESET message to the SCU by default. The IPCMSG_COLD_BOOT which is used by the upstream kernel is only sent when explicitely selected on the kernel command line. Use IPCMSG_COLD_RESET unconditionally which makes reboot work independent of the power supply voltage. [ tglx: Massaged changelog ] Fixes: bda7b072 ("x86/platform/intel-mid: Implement power off sequence") Signed-off-by: Sebastian Panceac <sebastian@resin.io> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1519810849-15131-1-git-send-email-sebastian@resin.ioSigned-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Juergen Gross authored
commit 71c208dd upstream. Older Xen versions (4.5 and before) might have problems migrating pv guests with MSR_IA32_SPEC_CTRL having a non-zero value. So before suspending zero that MSR and restore it after being resumed. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org Cc: xen-devel@lists.xenproject.org Cc: boris.ostrovsky@oracle.com Link: https://lkml.kernel.org/r/20180226140818.4849-1-jgross@suse.comSigned-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jan Kara authored
commit d9c10e5b upstream. Commit e864f395 "fs: add RWF_DSYNC aand RWF_SYNC" added additional way for direct IO to become synchronous and thus trigger fsync from the IO completion handler. Then commit 9830f4be "fs: Use RWF_* flags for AIO operations" allowed these flags to be set for AIO as well. However that commit forgot to update the condition checking whether the IO completion handling should be defered to a workqueue and thus AIO DIO with RWF_[D]SYNC set will call fsync() from IRQ context resulting in sleep in atomic. Fix the problem by checking directly iocb flags (the same way as it is done in dio_complete()) instead of checking all conditions that could lead to IO being synchronous. CC: Christoph Hellwig <hch@lst.de> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> CC: stable@vger.kernel.org Reported-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Fixes: 9830f4beSigned-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-