1. 15 Nov, 2016 29 commits
    • David Ahern's avatar
      net: ipv6: Do not consider link state for nexthop validation · d46b1968
      David Ahern authored
      [ Upstream commit d5d32e4b ]
      
      Similar to IPv4, do not consider link state when validating next hops.
      
      Currently, if the link is down default routes can fail to insert:
       $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
       RTNETLINK answers: No route to host
      
      With this patch the command succeeds.
      
      Fixes: 8c14586f ("net: ipv6: Use passed in table for nexthop lookups")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d46b1968
    • Tobias Brunner's avatar
      macsec: Fix header length if SCI is added if explicitly disabled · eb77db88
      Tobias Brunner authored
      [ Upstream commit e0f841f5 ]
      
      Even if sending SCIs is explicitly disabled, the code that creates the
      Security Tag might still decide to add it (e.g. if multiple RX SCs are
      defined on the MACsec interface).
      But because the header length so far only depended on the configuration
      option the SCI overwrote the original frame's contents (EtherType and
      e.g. the beginning of the IP header) and if encrypted did not visibly
      end up in the packet, while the SC flag in the TCI field of the Security
      Tag was still set, resulting in invalid MACsec frames.
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Signed-off-by: default avatarTobias Brunner <tobias@strongswan.org>
      Acked-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb77db88
    • Stephen Hemminger's avatar
      netvsc: fix incorrect receive checksum offloading · 027ab3b8
      Stephen Hemminger authored
      [ Upstream commit e52fed71 ]
      
      The Hyper-V netvsc driver was looking at the incorrect status bits
      in the checksum info. It was setting the receive checksum unnecessary
      flag based on the IP header checksum being correct. The checksum
      flag is skb is about TCP and UDP checksum status. Because of this
      bug, any packet received with bad TCP checksum would be passed
      up the stack and to the application causing data corruption.
      The problem is reproducible via netcat and netem.
      
      This had a side effect of not doing receive checksum offload
      on IPv6. The driver was also also always doing checksum offload
      independent of the checksum setting done via ethtool.
      Signed-off-by: default avatarStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      027ab3b8
    • Eric Dumazet's avatar
      udp: fix IP_CHECKSUM handling · b75edf27
      Eric Dumazet authored
      [ Upstream commit 10df8e61 ]
      
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b75edf27
    • Xin Long's avatar
      sctp: fix the panic caused by route update · 5ee35602
      Xin Long authored
      [ Upstream commit ecc515d7 ]
      
      Commit 7303a147 ("sctp: identify chunks that need to be fragmented
      at IP level") made the chunk be fragmented at IP level in the next round
      if it's size exceed PMTU.
      
      But there still is another case, PMTU can be updated if transport's dst
      expires and transport's pmtu_pending is set in sctp_packet_transmit. If
      the new PMTU is less than the chunk, the same issue with that commit can
      be triggered.
      
      So we should drop this packet and let it retransmit in another round
      where it would be fragmented at IP level.
      
      This patch is to fix it by checking the chunk size after PMTU may be
      updated and dropping this packet if it's size exceed PMTU.
      
      Fixes: 90017acc ("sctp: Add GSO support")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@txudriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ee35602
    • Jiri Slaby's avatar
      net: sctp, forbid negative length · d90cbfaf
      Jiri Slaby authored
      [ Upstream commit a4b8e71b ]
      
      Most of getsockopt handlers in net/sctp/socket.c check len against
      sizeof some structure like:
              if (len < sizeof(int))
                      return -EINVAL;
      
      On the first look, the check seems to be correct. But since len is int
      and sizeof returns size_t, int gets promoted to unsigned size_t too. So
      the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
      false.
      
      Fix this in sctp by explicitly checking len < 0 before any getsockopt
      handler is called.
      
      Note that sctp_getsockopt_events already handled the negative case.
      Since we added the < 0 check elsewhere, this one can be removed.
      
      If not checked, this is the result:
      UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
      shift exponent 52 is too large for 32-bit type 'int'
      CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
       0000000000000000 ffff88006d99f2a8 ffffffffb2f7bdea 0000000041b58ab3
       ffffffffb4363c14 ffffffffb2f7bcde ffff88006d99f2d0 ffff88006d99f270
       0000000000000000 0000000000000000 0000000000000034 ffffffffb5096422
      Call Trace:
       [<ffffffffb3051498>] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
      ...
       [<ffffffffb273f0e4>] ? kmalloc_order+0x24/0x90
       [<ffffffffb27416a4>] ? kmalloc_order_trace+0x24/0x220
       [<ffffffffb2819a30>] ? __kmalloc+0x330/0x540
       [<ffffffffc18c25f4>] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
       [<ffffffffc18d2bcd>] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
       [<ffffffffb37c1219>] ? sock_common_getsockopt+0xb9/0x150
       [<ffffffffb37be2f5>] ? SyS_getsockopt+0x1a5/0x270
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-sctp@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d90cbfaf
    • Fabio Estevam's avatar
      net: fec: Call swap_buffer() prior to IP header alignment · 64774617
      Fabio Estevam authored
      [ Upstream commit 235bde1e ]
      
      Commit 3ac72b7b ("net: fec: align IP header in hardware") breaks
      networking on mx28.
      
      There is an erratum on mx28 (ENGR121613 - ENET big endian mode
      not compatible with ARM little endian) that requires an additional
      byte-swap operation to workaround this problem.
      
      So call swap_buffer() prior to performing the IP header alignment
      to restore network functionality on mx28.
      
      Fixes: 3ac72b7b ("net: fec: align IP header in hardware")
      Reported-and-tested-by: default avatarHenri Roosen <henri.roosen@ginzinger.com>
      Signed-off-by: default avatarFabio Estevam <fabio.estevam@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      64774617
    • WANG Cong's avatar
      ipv4: use the right lock for ping_group_range · c6c82c2b
      WANG Cong authored
      [ Upstream commit 396a30cc ]
      
      This reverts commit a681574c
      ("ipv4: disable BH in set_ping_group_range()") because we never
      read ping_group_range in BH context (unlike local_port_range).
      
      Then, since we already have a lock for ping_group_range, those
      using ip_local_ports.lock for ping_group_range are clearly typos.
      
      We might consider to share a same lock for both ping_group_range
      and local_port_range w.r.t. space saving, but that should be for
      net-next.
      
      Fixes: a681574c ("ipv4: disable BH in set_ping_group_range()")
      Fixes: ba6b918a ("ping: move ping_group_range out of CONFIG_SYSCTL")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6c82c2b
    • Eric Dumazet's avatar
      ipv4: disable BH in set_ping_group_range() · 8418193f
      Eric Dumazet authored
      [ Upstream commit a681574c ]
      
      In commit 4ee3bd4a ("ipv4: disable BH when changing ip local port
      range") Cong added BH protection in set_local_port_range() but missed
      that same fix was needed in set_ping_group_range()
      
      Fixes: b8f1a556 ("udp: Add function to make source port for UDP tunnels")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarEric Salo <salo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8418193f
    • Sabrina Dubroca's avatar
      net: add recursion limit to GRO · 23c110c4
      Sabrina Dubroca authored
      [ Upstream commit fcd91dd4 ]
      
      Currently, GRO can do unlimited recursion through the gro_receive
      handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
      to one level with encap_mark, but both VLAN and TEB still have this
      problem.  Thus, the kernel is vulnerable to a stack overflow, if we
      receive a packet composed entirely of VLAN headers.
      
      This patch adds a recursion counter to the GRO layer to prevent stack
      overflow.  When a gro_receive function hits the recursion limit, GRO is
      aborted for this skb and it is processed normally.  This recursion
      counter is put in the GRO CB, but could be turned into a percpu counter
      if we run out of space in the CB.
      
      Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.
      
      Fixes: CVE-2016-7039
      Fixes: 9b174d88 ("net: Add Transparent Ethernet Bridging GRO support.")
      Fixes: 66e5133f ("vlan: Add GRO support for non hardware accelerated vlan")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: default avatarJiri Benc <jbenc@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23c110c4
    • Ido Schimmel's avatar
      net: core: Correctly iterate over lower adjacency list · d3bbd04b
      Ido Schimmel authored
      [ Upstream commit e4961b07 ]
      
      Tamir reported the following trace when processing ARP requests received
      via a vlan device on top of a VLAN-aware bridge:
      
       NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
      [...]
       CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.8.0-rc7 #1
       Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
       task: ffff88017edfea40 task.stack: ffff88017ee10000
       RIP: 0010:[<ffffffff815dcc73>]  [<ffffffff815dcc73>] netdev_all_lower_get_next_rcu+0x33/0x60
      [...]
       Call Trace:
        <IRQ>
        [<ffffffffa015de0a>] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum]
        [<ffffffffa016f1b0>] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum]
        [<ffffffff810ad07a>] notifier_call_chain+0x4a/0x70
        [<ffffffff810ad13a>] atomic_notifier_call_chain+0x1a/0x20
        [<ffffffff815ee77b>] call_netevent_notifiers+0x1b/0x20
        [<ffffffff815f2eb6>] neigh_update+0x306/0x740
        [<ffffffff815f38ce>] neigh_event_ns+0x4e/0xb0
        [<ffffffff8165ea3f>] arp_process+0x66f/0x700
        [<ffffffff8170214c>] ? common_interrupt+0x8c/0x8c
        [<ffffffff8165ec29>] arp_rcv+0x139/0x1d0
        [<ffffffff816e505a>] ? vlan_do_receive+0xda/0x320
        [<ffffffff815e3794>] __netif_receive_skb_core+0x524/0xab0
        [<ffffffff815e6830>] ? dev_queue_xmit+0x10/0x20
        [<ffffffffa06d612d>] ? br_forward_finish+0x3d/0xc0 [bridge]
        [<ffffffffa06e5796>] ? br_handle_vlan+0xf6/0x1b0 [bridge]
        [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
        [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
        [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
        [<ffffffffa06d7856>] br_pass_frame_up+0xc6/0x160 [bridge]
        [<ffffffffa06d63d7>] ? deliver_clone+0x37/0x50 [bridge]
        [<ffffffffa06d656c>] ? br_flood+0xcc/0x160 [bridge]
        [<ffffffffa06d7b14>] br_handle_frame_finish+0x224/0x4f0 [bridge]
        [<ffffffffa06d7f94>] br_handle_frame+0x174/0x300 [bridge]
        [<ffffffff815e3599>] __netif_receive_skb_core+0x329/0xab0
        [<ffffffff81374815>] ? find_next_bit+0x15/0x20
        [<ffffffff8135e802>] ? cpumask_next_and+0x32/0x50
        [<ffffffff810c9968>] ? load_balance+0x178/0x9b0
        [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
        [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
        [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
        [<ffffffffa01544a1>] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum]
        [<ffffffffa005c9f7>] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core]
        [<ffffffffa007332a>] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci]
        [<ffffffff81091986>] tasklet_action+0xf6/0x110
        [<ffffffff81704556>] __do_softirq+0xf6/0x280
        [<ffffffff8109213f>] irq_exit+0xdf/0xf0
        [<ffffffff817042b4>] do_IRQ+0x54/0xd0
        [<ffffffff8170214c>] common_interrupt+0x8c/0x8c
      
      The problem is that netdev_all_lower_get_next_rcu() never advances the
      iterator, thereby causing the loop over the lower adjacency list to run
      forever.
      
      Fix this by advancing the iterator and avoid the infinite loop.
      
      Fixes: 7ce856aa ("mlxsw: spectrum: Add couple of lower device helper functions")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarTamir Winetroub <tamirw@mellanox.com>
      Reviewed-by: default avatarJiri Pirko <jiri@mellanox.com>
      Acked-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d3bbd04b
    • Jiri Pirko's avatar
      rtnetlink: Add rtnexthop offload flag to compare mask · fc5722f8
      Jiri Pirko authored
      [ Upstream commit 85dda4e5 ]
      
      The offload flag is a status flag and should not be used by
      FIB semantics for comparison.
      
      Fixes: 37ed9493 ("rtnetlink: add RTNH_F_EXTERNAL flag for fib offload")
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Reviewed-by: default avatarAndy Gospodarek <andy@greyhouse.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc5722f8
    • Ido Schimmel's avatar
      switchdev: Execute bridge ndos only for bridge ports · 4ac3ca8c
      Ido Schimmel authored
      [ Upstream commit 97c24290 ]
      
      We recently got the following warning after setting up a vlan device on
      top of an offloaded bridge and executing 'bridge link':
      
      WARNING: CPU: 0 PID: 18566 at drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:81 mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
      [...]
       CPU: 0 PID: 18566 Comm: bridge Not tainted 4.8.0-rc7 #1
       Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox switch, BIOS 4.6.5 05/21/2015
        0000000000000286 00000000e64ab94f ffff880406e6f8f0 ffffffff8135eaa3
        0000000000000000 0000000000000000 ffff880406e6f930 ffffffff8108c43b
        0000005106e6f988 ffff8803df398840 ffff880403c60108 ffff880406e6f990
       Call Trace:
        [<ffffffff8135eaa3>] dump_stack+0x63/0x90
        [<ffffffff8108c43b>] __warn+0xcb/0xf0
        [<ffffffff8108c56d>] warn_slowpath_null+0x1d/0x20
        [<ffffffffa01420d5>] mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
        [<ffffffffa0142195>] mlxsw_sp_port_attr_get+0xa5/0xb0 [mlxsw_spectrum]
        [<ffffffff816f151f>] switchdev_port_attr_get+0x4f/0x140
        [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
        [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
        [<ffffffff816f1d6b>] switchdev_port_bridge_getlink+0x5b/0xc0
        [<ffffffff816f2680>] ? switchdev_port_fdb_dump+0x90/0x90
        [<ffffffff815f5427>] rtnl_bridge_getlink+0xe7/0x190
        [<ffffffff8161a1b2>] netlink_dump+0x122/0x290
        [<ffffffff8161b0df>] __netlink_dump_start+0x15f/0x190
        [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
        [<ffffffff815fab46>] rtnetlink_rcv_msg+0x1a6/0x220
        [<ffffffff81208118>] ? __kmalloc_node_track_caller+0x208/0x2c0
        [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
        [<ffffffff815fa9a0>] ? rtnl_newlink+0x890/0x890
        [<ffffffff8161cf54>] netlink_rcv_skb+0xa4/0xc0
        [<ffffffff815f56f8>] rtnetlink_rcv+0x28/0x30
        [<ffffffff8161c92c>] netlink_unicast+0x18c/0x240
        [<ffffffff8161ccdb>] netlink_sendmsg+0x2fb/0x3a0
        [<ffffffff815c5a48>] sock_sendmsg+0x38/0x50
        [<ffffffff815c6031>] SYSC_sendto+0x101/0x190
        [<ffffffff815c7111>] ? __sys_recvmsg+0x51/0x90
        [<ffffffff815c6b6e>] SyS_sendto+0xe/0x10
        [<ffffffff817017f2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      The problem is that the 8021q module propagates the call to
      ndo_bridge_getlink() via switchdev ops, but the switch driver doesn't
      recognize the netdev, as it's not offloaded.
      
      While we can ignore calls being made to non-bridge ports inside the
      driver, a better fix would be to push this check up to the switchdev
      layer.
      
      Note that these ndos can be called for non-bridged netdev, but this only
      happens in certain PF drivers which don't call the corresponding
      switchdev functions anyway.
      
      Fixes: 99f44bb3 ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarTamir Winetroub <tamirw@mellanox.com>
      Tested-by: default avatarTamir Winetroub <tamirw@mellanox.com>
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ac3ca8c
    • Nikolay Aleksandrov's avatar
      bridge: multicast: restore perm router ports on multicast enable · 63d82a2c
      Nikolay Aleksandrov authored
      [ Upstream commit 7cb3f921 ]
      
      Satish reported a problem with the perm multicast router ports not getting
      reenabled after some series of events, in particular if it happens that the
      multicast snooping has been disabled and the port goes to disabled state
      then it will be deleted from the router port list, but if it moves into
      non-disabled state it will not be re-added because the mcast snooping is
      still disabled, and enabling snooping later does nothing.
      
      Here are the steps to reproduce, setup br0 with snooping enabled and eth1
      added as a perm router (multicast_router = 2):
      1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
      2. $ ip l set eth1 down
      ^ This step deletes the interface from the router list
      3. $ ip l set eth1 up
      ^ This step does not add it again because mcast snooping is disabled
      4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
      5. $ bridge -d -s mdb show
      <empty>
      
      At this point we have mcast enabled and eth1 as a perm router (value = 2)
      but it is not in the router list which is incorrect.
      
      After this change:
      1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
      2. $ ip l set eth1 down
      ^ This step deletes the interface from the router list
      3. $ ip l set eth1 up
      ^ This step does not add it again because mcast snooping is disabled
      4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
      5. $ bridge -d -s mdb show
      router ports on br0: eth1
      
      Note: we can directly do br_multicast_enable_port for all because the
      querier timer already has checks for the port state and will simply
      expire if it's in blocking/disabled. See the comment added by
      commit 9aa66382 ("bridge: multicast: add a comment to
      br_port_state_selection about blocking state")
      
      Fixes: 561f1103 ("bridge: Add multicast_snooping sysfs toggle")
      Reported-by: default avatarSatish Ashok <sashok@cumulusnetworks.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      63d82a2c
    • Eric Dumazet's avatar
      net: pktgen: remove rcu locking in pktgen_change_name() · e9a5921c
      Eric Dumazet authored
      [ Upstream commit 9a0b1e8b ]
      
      After Jesper commit back in linux-3.18, we trigger a lockdep
      splat in proc_create_data() while allocating memory from
      pktgen_change_name().
      
      This patch converts t->if_lock to a mutex, since it is now only
      used from control path, and adds proper locking to pktgen_change_name()
      
      1) pktgen_thread_lock to protect the outer loop (iterating threads)
      2) t->if_lock to protect the inner loop (iterating devices)
      
      Note that before Jesper patch, pktgen_change_name() was lacking proper
      protection, but lockdep was not able to detect the problem.
      
      Fixes: 8788370a ("pktgen: RCU-ify "if_list" to remove lock in next_to_run()")
      Reported-by: default avatarJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9a5921c
    • Brenden Blanco's avatar
      net/mlx4_en: fixup xdp tx irq to match rx · 2eeb5735
      Brenden Blanco authored
      [ Upstream commit 958b3d39 ]
      
      In cases where the number of tx rings is not a multiple of the number of
      rx rings, the tx completion event will be handled on a different core
      from the transmit and population of the ring. Races on the ring will
      lead to a double-free of the page, and possibly other corruption.
      
      The rings are initialized by default with a valid multiple of rings,
      based on the number of cpus, therefore an invalid configuration requires
      ethtool to change the ring layout. For instance 'ethtool -L eth0 rx 9 tx
      8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
      completed on cpu3 (48 % 9 == 3).
      
      Resolve this discrepancy by shifting the irq for the xdp tx queues to
      start again from 0, modulo rx_ring_num.
      
      Fixes: 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support")
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2eeb5735
    • Paolo Abeni's avatar
      IB/ipoib: move back IB LL address into the hard header · 27bb6e31
      Paolo Abeni authored
      [ Upstream commit fc791b63 ]
      
      After the commit 9207f9d4 ("net: preserve IP control block
      during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
      That destroy the IPoIB address information cached there,
      causing a severe performance regression, as better described here:
      
      http://marc.info/?l=linux-kernel&m=146787279825501&w=2
      
      This change moves the data cached by the IPoIB driver from the
      skb control lock into the IPoIB hard header, as done before
      the commit 936d7de3 ("IPoIB: Stop lying about hard_header_len
      and use skb->cb to stash LL addresses").
      In order to avoid GRO issue, on packet reception, the IPoIB driver
      stash into the skb a dummy pseudo header, so that the received
      packets have actually a hard header matching the declared length.
      To avoid changing the connected mode maximum mtu, the allocated
      head buffer size is increased by the pseudo header length.
      
      After this commit, IPoIB performances are back to pre-regression
      value.
      
      v2 -> v3: rebased
      v1 -> v2: avoid changing the max mtu, increasing the head buf size
      
      Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      27bb6e31
    • Nicolas Dichtel's avatar
      ipv6: correctly add local routes when lo goes up · f280126e
      Nicolas Dichtel authored
      [ Upstream commit a220445f ]
      
      The goal of the patch is to fix this scenario:
       ip link add dummy1 type dummy
       ip link set dummy1 up
       ip link set lo down ; ip link set lo up
      
      After that sequence, the local route to the link layer address of dummy1 is
      not there anymore.
      
      When the loopback is set down, all local routes are deleted by
      addrconf_ifdown()/rt6_ifdown(). At this time, the rt6_info entry still
      exists, because the corresponding idev has a reference on it. After the rcu
      grace period, dst_rcu_free() is called, and thus ___dst_free(), which will
      set obsolete to DST_OBSOLETE_DEAD.
      
      In this case, init_loopback() is called before dst_rcu_free(), thus
      obsolete is still sets to something <= 0. So, the function doesn't add the
      route again. To avoid that race, let's check the rt6 refcnt instead.
      
      Fixes: 25fb6ca4 ("net IPv6 : Fix broken IPv6 routing table after loopback down-up")
      Fixes: a881ae1f ("ipv6: don't call addrconf_dst_alloc again when enable lo")
      Fixes: 33d99113 ("ipv6: reallocate addrconf router for ipv6 address when lo device up")
      Reported-by: default avatarFrancesco Santoro <francesco.santoro@6wind.com>
      Reported-by: default avatarSamuel Gauthier <samuel.gauthier@6wind.com>
      CC: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
      CC: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Weilong Chen <chenweilong@huawei.com>
      CC: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f280126e
    • Vadim Fedorenko's avatar
      ip6_tunnel: fix ip6_tnl_lookup · 0f3e7762
      Vadim Fedorenko authored
      [ Upstream commit 68d00f33 ]
      
      The commit ea3dc960 ("ip6_tunnel: Add support for wildcard tunnel
      endpoints.") introduces support for wildcards in tunnels endpoints,
      but in some rare circumstances ip6_tnl_lookup selects wrong tunnel
      interface relying only on source or destination address of the packet
      and not checking presence of wildcard in tunnels endpoints. Later in
      ip6_tnl_rcv this packets can be dicarded because of difference in
      ipproto even if fallback device have proper ipproto configuration.
      
      This patch adds checks of wildcard endpoint in tunnel avoiding such
      behavior
      
      Fixes: ea3dc960 ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
      Signed-off-by: default avatarVadim Fedorenko <junk@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f3e7762
    • Andrew Lunn's avatar
      net: phy: Trigger state machine on state change and not polling. · a148a818
      Andrew Lunn authored
      [ Upstream commit 3c293f4e ]
      
      The phy_start() is used to indicate the PHY is now ready to do its
      work. The state is changed, normally to PHY_UP which means that both
      the MAC and the PHY are ready.
      
      If the phy driver is using polling, when the next poll happens, the
      state machine notices the PHY is now in PHY_UP, and kicks off
      auto-negotiation, if needed.
      
      If however, the PHY is using interrupts, there is no polling. The phy
      is stuck in PHY_UP until the next interrupt comes along. And there is
      no reason for the PHY to interrupt.
      
      Have phy_start() schedule the state machine to run, which both speeds
      up the polling use case, and makes the interrupt use case actually
      work.
      
      This problems exists whenever there is a state change which will not
      cause an interrupt. Trigger the state machine in these cases,
      e.g. phy_error().
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Cc: Kyle Roeschley <kyle.roeschley@ni.com>
      Tested-by: default avatarKyle Roeschley <kyle.roeschley@ni.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a148a818
    • Eric Dumazet's avatar
      ipv6: tcp: restore IP6CB for pktoptions skbs · 2a909989
      Eric Dumazet authored
      [ Upstream commit 8ce48623 ]
      
      Baozeng Ding reported following KASAN splat :
      
      BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
      Read of size 1 by task poc/25548
      Call Trace:
       [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
       [<     inline     >] print_address_description /mm/kasan/report.c:204
       [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
       [<     inline     >] kasan_report /mm/kasan/report.c:303
       [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
       [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
       [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
       [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
       [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
       [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
       [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
       [<     inline     >] SYSC_getsockopt /net/socket.c:1776
       [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
       [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
      Memory state around the buggy address:
       ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      > ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
       ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      He also provided a syzkaller reproducer.
      
      Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
      data that was moved at a different place in tcp_v6_rcv()
      
      This patch moves tcp_v6_restore_cb() up and calls it from
      tcp_v6_do_rcv() when np->pktoptions is set.
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a909989
    • WANG Cong's avatar
      net_sched: reorder pernet ops and act ops registrations · 50b43ad1
      WANG Cong authored
      [ Upstream commit ab102b80 ]
      
      Krister reported a kernel NULL pointer dereference after
      tcf_action_init_1() invokes a_o->init(), it is a race condition
      where one thread calling tcf_register_action() to initialize
      the netns data after putting act ops in the global list and
      the other thread searching the list and then calling
      a_o->init(net, ...).
      
      Fix this by moving the pernet ops registration before making
      the action ops visible. This is fine because: a) we don't
      rely on act_base in pernet ops->init(), b) in the worst case we
      have a fully initialized netns but ops is still not ready so
      new actions still can't be created.
      Reported-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Tested-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50b43ad1
    • Vlad Tsyrklevich's avatar
      drivers/ptp: Fix kernel memory disclosure · dac04913
      Vlad Tsyrklevich authored
      [ Upstream commit 02a9079c ]
      
      The reserved field precise_offset->rsv is not cleared before being
      copied to user space, leaking kernel stack memory. Clear the struct
      before it's copied.
      Signed-off-by: default avatarVlad Tsyrklevich <vlad@tsyrklevich.net>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dac04913
    • Eric Dumazet's avatar
      netlink: do not enter direct reclaim from netlink_dump() · 3f841d15
      Eric Dumazet authored
      [ Upstream commit d35c99ff ]
      
      Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
      allocations.
      
      Due to struct skb_shared_info ~320 bytes overhead, we end up using
      order-3 (on x86) page allocations, that might trigger direct reclaim and
      add stress.
      
      The intent was really to attempt a large allocation but immediately
      fallback to a smaller one (order-1 on x86) in case of memory stress.
      
      On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
      meet the goal. Old kernels would need to remove __GFP_WAIT
      
      While we are at it, since we do an order-3 allocation, allow to use
      all the allocated bytes instead of 16384 to reduce syscalls during
      large dumps.
      
      iproute2 already uses 32KB recvmsg() buffer sizes.
      
      Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)
      
      Fixes: 9063e21f ("netlink: autosize skb lengthes")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Reviewed-by: default avatarGreg Rose <grose@lightfleet.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f841d15
    • Anoob Soman's avatar
      packet: call fanout_release, while UNREGISTERING a netdev · 5086cadf
      Anoob Soman authored
      [ Upstream commit 66644982 ]
      
      If a socket has FANOUT sockopt set, a new proto_hook is registered
      as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
      af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
      registered as part of fanout_add is not removed. Call fanout_release, on a
      NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
      fanout_list.
      
      This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()
      Signed-off-by: default avatarAnoob Soman <anoob.soman@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5086cadf
    • Andrew Collins's avatar
      net: Add netdev all_adj_list refcnt propagation to fix panic · 6fff1319
      Andrew Collins authored
      [ Upstream commit 93409033 ]
      
      This is a respin of a patch to fix a relatively easily reproducible kernel
      panic related to the all_adj_list handling for netdevs in recent kernels.
      
      The following sequence of commands will reproduce the issue:
      
      ip link add link eth0 name eth0.100 type vlan id 100
      ip link add link eth0 name eth0.200 type vlan id 200
      ip link add name testbr type bridge
      ip link set eth0.100 master testbr
      ip link set eth0.200 master testbr
      ip link add link testbr mac0 type macvlan
      ip link delete dev testbr
      
      This creates an upper/lower tree of (excuse the poor ASCII art):
      
                  /---eth0.100-eth0
      mac0-testbr-
                  \---eth0.200-eth0
      
      When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
      the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
      reference to eth0 is added, so this results in a panic.
      
      This change adds reference count propagation so things are handled properly.
      
      Matthias Schiffer reported a similar crash in batman-adv:
      
      https://github.com/freifunk-gluon/gluon/issues/680
      https://www.open-mesh.org/issues/247
      
      which this patch also seems to resolve.
      Signed-off-by: default avatarAndrew Collins <acollins@cradlepoint.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fff1319
    • Shmulik Ladkani's avatar
      net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions · 9caee42c
      Shmulik Ladkani authored
      [ Upstream commit f39acc84 ]
      
      Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
      case where the input skb data pointer does not point at the mac header:
      
      - They're doing push/pop, but fail to properly unwind data back to its
        original location.
        For example, in the skb_vlan_push case, any subsequent
        'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
        BEFORE start of frame, leading to bogus frames that may be transmitted.
      
      - They update rcsum per the added/removed 4 bytes tag.
        Alas if data is originally after the vlan/eth headers, then these
        bytes were already pulled out of the csum.
      
      OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
      present no issues.
      
      act_vlan is the only caller to skb_vlan_*() that has skb->data pointing
      at network header (upon ingress).
      Other calles (ovs, bpf) already adjust skb->data at mac_header.
      
      This patch fixes act_vlan to point to the mac_header prior calling
      skb_vlan_*() functions, as other callers do.
      Signed-off-by: default avatarShmulik Ladkani <shmulik.ladkani@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Pravin Shelar <pshelar@ovn.org>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9caee42c
    • Paolo Abeni's avatar
      net: pktgen: fix pkt_size · c002dfd8
      Paolo Abeni authored
      [ Upstream commit 63d75463 ]
      
      The commit 879c7220 ("net: pktgen: Observe needed_headroom
      of the device") increased the 'pkt_overhead' field value by
      LL_RESERVED_SPACE.
      As a side effect the generated packet size, computed as:
      
      	/* Eth + IPh + UDPh + mpls */
      	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
      		  pkt_dev->pkt_overhead;
      
      is decreased by the same value.
      The above changed slightly the behavior of existing pktgen users,
      and made the procfs interface somewhat inconsistent.
      Fix it by restoring the previous pkt_overhead value and using
      LL_RESERVED_SPACE as extralen in skb allocation.
      Also, change pktgen_alloc_skb() to only partially reserve
      the headroom to allow the caller to prefetch from ll header
      start.
      
      v1 -> v2:
       - fixed some typos in the comments
      
      Fixes: 879c7220 ("net: pktgen: Observe needed_headroom of the device")
      Suggested-by: default avatarBen Greear <greearb@candelatech.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c002dfd8
    • Gavin Schenk's avatar
      net: fec: set mac address unconditionally · ff1b27c3
      Gavin Schenk authored
      [ Upstream commit b82d44d7 ]
      
      If the mac address origin is not dt, you can only safely assign a mac
      address after "link up" of the device. If the link is off the clocks are
      disabled and because of issues assigning registers when clocks are off the
      new mac address cannot be written in .ndo_set_mac_address() on some soc's.
      This fix sets the mac address unconditionally in fec_restart(...) and
      ensures consistency between fec registers and the network layer.
      Signed-off-by: default avatarGavin Schenk <g.schenk@eckelmann.de>
      Acked-by: default avatarFugang Duan <fugang.duan@nxp.com>
      Acked-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Fixes: 9638d19e ("net: fec: add netif status check before set mac address")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff1b27c3
  2. 10 Nov, 2016 11 commits