1. 15 Nov, 2016 15 commits
    • Eric Dumazet's avatar
      net: pktgen: remove rcu locking in pktgen_change_name() · e9a5921c
      Eric Dumazet authored
      [ Upstream commit 9a0b1e8b ]
      
      After Jesper commit back in linux-3.18, we trigger a lockdep
      splat in proc_create_data() while allocating memory from
      pktgen_change_name().
      
      This patch converts t->if_lock to a mutex, since it is now only
      used from control path, and adds proper locking to pktgen_change_name()
      
      1) pktgen_thread_lock to protect the outer loop (iterating threads)
      2) t->if_lock to protect the inner loop (iterating devices)
      
      Note that before Jesper patch, pktgen_change_name() was lacking proper
      protection, but lockdep was not able to detect the problem.
      
      Fixes: 8788370a ("pktgen: RCU-ify "if_list" to remove lock in next_to_run()")
      Reported-by: default avatarJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9a5921c
    • Brenden Blanco's avatar
      net/mlx4_en: fixup xdp tx irq to match rx · 2eeb5735
      Brenden Blanco authored
      [ Upstream commit 958b3d39 ]
      
      In cases where the number of tx rings is not a multiple of the number of
      rx rings, the tx completion event will be handled on a different core
      from the transmit and population of the ring. Races on the ring will
      lead to a double-free of the page, and possibly other corruption.
      
      The rings are initialized by default with a valid multiple of rings,
      based on the number of cpus, therefore an invalid configuration requires
      ethtool to change the ring layout. For instance 'ethtool -L eth0 rx 9 tx
      8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
      completed on cpu3 (48 % 9 == 3).
      
      Resolve this discrepancy by shifting the irq for the xdp tx queues to
      start again from 0, modulo rx_ring_num.
      
      Fixes: 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support")
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2eeb5735
    • Paolo Abeni's avatar
      IB/ipoib: move back IB LL address into the hard header · 27bb6e31
      Paolo Abeni authored
      [ Upstream commit fc791b63 ]
      
      After the commit 9207f9d4 ("net: preserve IP control block
      during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
      That destroy the IPoIB address information cached there,
      causing a severe performance regression, as better described here:
      
      http://marc.info/?l=linux-kernel&m=146787279825501&w=2
      
      This change moves the data cached by the IPoIB driver from the
      skb control lock into the IPoIB hard header, as done before
      the commit 936d7de3 ("IPoIB: Stop lying about hard_header_len
      and use skb->cb to stash LL addresses").
      In order to avoid GRO issue, on packet reception, the IPoIB driver
      stash into the skb a dummy pseudo header, so that the received
      packets have actually a hard header matching the declared length.
      To avoid changing the connected mode maximum mtu, the allocated
      head buffer size is increased by the pseudo header length.
      
      After this commit, IPoIB performances are back to pre-regression
      value.
      
      v2 -> v3: rebased
      v1 -> v2: avoid changing the max mtu, increasing the head buf size
      
      Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      27bb6e31
    • Nicolas Dichtel's avatar
      ipv6: correctly add local routes when lo goes up · f280126e
      Nicolas Dichtel authored
      [ Upstream commit a220445f ]
      
      The goal of the patch is to fix this scenario:
       ip link add dummy1 type dummy
       ip link set dummy1 up
       ip link set lo down ; ip link set lo up
      
      After that sequence, the local route to the link layer address of dummy1 is
      not there anymore.
      
      When the loopback is set down, all local routes are deleted by
      addrconf_ifdown()/rt6_ifdown(). At this time, the rt6_info entry still
      exists, because the corresponding idev has a reference on it. After the rcu
      grace period, dst_rcu_free() is called, and thus ___dst_free(), which will
      set obsolete to DST_OBSOLETE_DEAD.
      
      In this case, init_loopback() is called before dst_rcu_free(), thus
      obsolete is still sets to something <= 0. So, the function doesn't add the
      route again. To avoid that race, let's check the rt6 refcnt instead.
      
      Fixes: 25fb6ca4 ("net IPv6 : Fix broken IPv6 routing table after loopback down-up")
      Fixes: a881ae1f ("ipv6: don't call addrconf_dst_alloc again when enable lo")
      Fixes: 33d99113 ("ipv6: reallocate addrconf router for ipv6 address when lo device up")
      Reported-by: default avatarFrancesco Santoro <francesco.santoro@6wind.com>
      Reported-by: default avatarSamuel Gauthier <samuel.gauthier@6wind.com>
      CC: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
      CC: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Weilong Chen <chenweilong@huawei.com>
      CC: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f280126e
    • Vadim Fedorenko's avatar
      ip6_tunnel: fix ip6_tnl_lookup · 0f3e7762
      Vadim Fedorenko authored
      [ Upstream commit 68d00f33 ]
      
      The commit ea3dc960 ("ip6_tunnel: Add support for wildcard tunnel
      endpoints.") introduces support for wildcards in tunnels endpoints,
      but in some rare circumstances ip6_tnl_lookup selects wrong tunnel
      interface relying only on source or destination address of the packet
      and not checking presence of wildcard in tunnels endpoints. Later in
      ip6_tnl_rcv this packets can be dicarded because of difference in
      ipproto even if fallback device have proper ipproto configuration.
      
      This patch adds checks of wildcard endpoint in tunnel avoiding such
      behavior
      
      Fixes: ea3dc960 ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
      Signed-off-by: default avatarVadim Fedorenko <junk@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f3e7762
    • Andrew Lunn's avatar
      net: phy: Trigger state machine on state change and not polling. · a148a818
      Andrew Lunn authored
      [ Upstream commit 3c293f4e ]
      
      The phy_start() is used to indicate the PHY is now ready to do its
      work. The state is changed, normally to PHY_UP which means that both
      the MAC and the PHY are ready.
      
      If the phy driver is using polling, when the next poll happens, the
      state machine notices the PHY is now in PHY_UP, and kicks off
      auto-negotiation, if needed.
      
      If however, the PHY is using interrupts, there is no polling. The phy
      is stuck in PHY_UP until the next interrupt comes along. And there is
      no reason for the PHY to interrupt.
      
      Have phy_start() schedule the state machine to run, which both speeds
      up the polling use case, and makes the interrupt use case actually
      work.
      
      This problems exists whenever there is a state change which will not
      cause an interrupt. Trigger the state machine in these cases,
      e.g. phy_error().
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Cc: Kyle Roeschley <kyle.roeschley@ni.com>
      Tested-by: default avatarKyle Roeschley <kyle.roeschley@ni.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a148a818
    • Eric Dumazet's avatar
      ipv6: tcp: restore IP6CB for pktoptions skbs · 2a909989
      Eric Dumazet authored
      [ Upstream commit 8ce48623 ]
      
      Baozeng Ding reported following KASAN splat :
      
      BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
      Read of size 1 by task poc/25548
      Call Trace:
       [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
       [<     inline     >] print_address_description /mm/kasan/report.c:204
       [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
       [<     inline     >] kasan_report /mm/kasan/report.c:303
       [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
       [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
       [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
       [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
       [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
       [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
       [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
       [<     inline     >] SYSC_getsockopt /net/socket.c:1776
       [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
       [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
      Memory state around the buggy address:
       ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      > ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
       ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      He also provided a syzkaller reproducer.
      
      Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
      data that was moved at a different place in tcp_v6_rcv()
      
      This patch moves tcp_v6_restore_cb() up and calls it from
      tcp_v6_do_rcv() when np->pktoptions is set.
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a909989
    • WANG Cong's avatar
      net_sched: reorder pernet ops and act ops registrations · 50b43ad1
      WANG Cong authored
      [ Upstream commit ab102b80 ]
      
      Krister reported a kernel NULL pointer dereference after
      tcf_action_init_1() invokes a_o->init(), it is a race condition
      where one thread calling tcf_register_action() to initialize
      the netns data after putting act ops in the global list and
      the other thread searching the list and then calling
      a_o->init(net, ...).
      
      Fix this by moving the pernet ops registration before making
      the action ops visible. This is fine because: a) we don't
      rely on act_base in pernet ops->init(), b) in the worst case we
      have a fully initialized netns but ops is still not ready so
      new actions still can't be created.
      Reported-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Tested-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50b43ad1
    • Vlad Tsyrklevich's avatar
      drivers/ptp: Fix kernel memory disclosure · dac04913
      Vlad Tsyrklevich authored
      [ Upstream commit 02a9079c ]
      
      The reserved field precise_offset->rsv is not cleared before being
      copied to user space, leaking kernel stack memory. Clear the struct
      before it's copied.
      Signed-off-by: default avatarVlad Tsyrklevich <vlad@tsyrklevich.net>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dac04913
    • Eric Dumazet's avatar
      netlink: do not enter direct reclaim from netlink_dump() · 3f841d15
      Eric Dumazet authored
      [ Upstream commit d35c99ff ]
      
      Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
      allocations.
      
      Due to struct skb_shared_info ~320 bytes overhead, we end up using
      order-3 (on x86) page allocations, that might trigger direct reclaim and
      add stress.
      
      The intent was really to attempt a large allocation but immediately
      fallback to a smaller one (order-1 on x86) in case of memory stress.
      
      On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
      meet the goal. Old kernels would need to remove __GFP_WAIT
      
      While we are at it, since we do an order-3 allocation, allow to use
      all the allocated bytes instead of 16384 to reduce syscalls during
      large dumps.
      
      iproute2 already uses 32KB recvmsg() buffer sizes.
      
      Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)
      
      Fixes: 9063e21f ("netlink: autosize skb lengthes")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Reviewed-by: default avatarGreg Rose <grose@lightfleet.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f841d15
    • Anoob Soman's avatar
      packet: call fanout_release, while UNREGISTERING a netdev · 5086cadf
      Anoob Soman authored
      [ Upstream commit 66644982 ]
      
      If a socket has FANOUT sockopt set, a new proto_hook is registered
      as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
      af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
      registered as part of fanout_add is not removed. Call fanout_release, on a
      NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
      fanout_list.
      
      This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()
      Signed-off-by: default avatarAnoob Soman <anoob.soman@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5086cadf
    • Andrew Collins's avatar
      net: Add netdev all_adj_list refcnt propagation to fix panic · 6fff1319
      Andrew Collins authored
      [ Upstream commit 93409033 ]
      
      This is a respin of a patch to fix a relatively easily reproducible kernel
      panic related to the all_adj_list handling for netdevs in recent kernels.
      
      The following sequence of commands will reproduce the issue:
      
      ip link add link eth0 name eth0.100 type vlan id 100
      ip link add link eth0 name eth0.200 type vlan id 200
      ip link add name testbr type bridge
      ip link set eth0.100 master testbr
      ip link set eth0.200 master testbr
      ip link add link testbr mac0 type macvlan
      ip link delete dev testbr
      
      This creates an upper/lower tree of (excuse the poor ASCII art):
      
                  /---eth0.100-eth0
      mac0-testbr-
                  \---eth0.200-eth0
      
      When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
      the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
      reference to eth0 is added, so this results in a panic.
      
      This change adds reference count propagation so things are handled properly.
      
      Matthias Schiffer reported a similar crash in batman-adv:
      
      https://github.com/freifunk-gluon/gluon/issues/680
      https://www.open-mesh.org/issues/247
      
      which this patch also seems to resolve.
      Signed-off-by: default avatarAndrew Collins <acollins@cradlepoint.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fff1319
    • Shmulik Ladkani's avatar
      net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions · 9caee42c
      Shmulik Ladkani authored
      [ Upstream commit f39acc84 ]
      
      Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
      case where the input skb data pointer does not point at the mac header:
      
      - They're doing push/pop, but fail to properly unwind data back to its
        original location.
        For example, in the skb_vlan_push case, any subsequent
        'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
        BEFORE start of frame, leading to bogus frames that may be transmitted.
      
      - They update rcsum per the added/removed 4 bytes tag.
        Alas if data is originally after the vlan/eth headers, then these
        bytes were already pulled out of the csum.
      
      OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
      present no issues.
      
      act_vlan is the only caller to skb_vlan_*() that has skb->data pointing
      at network header (upon ingress).
      Other calles (ovs, bpf) already adjust skb->data at mac_header.
      
      This patch fixes act_vlan to point to the mac_header prior calling
      skb_vlan_*() functions, as other callers do.
      Signed-off-by: default avatarShmulik Ladkani <shmulik.ladkani@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Pravin Shelar <pshelar@ovn.org>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9caee42c
    • Paolo Abeni's avatar
      net: pktgen: fix pkt_size · c002dfd8
      Paolo Abeni authored
      [ Upstream commit 63d75463 ]
      
      The commit 879c7220 ("net: pktgen: Observe needed_headroom
      of the device") increased the 'pkt_overhead' field value by
      LL_RESERVED_SPACE.
      As a side effect the generated packet size, computed as:
      
      	/* Eth + IPh + UDPh + mpls */
      	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
      		  pkt_dev->pkt_overhead;
      
      is decreased by the same value.
      The above changed slightly the behavior of existing pktgen users,
      and made the procfs interface somewhat inconsistent.
      Fix it by restoring the previous pkt_overhead value and using
      LL_RESERVED_SPACE as extralen in skb allocation.
      Also, change pktgen_alloc_skb() to only partially reserve
      the headroom to allow the caller to prefetch from ll header
      start.
      
      v1 -> v2:
       - fixed some typos in the comments
      
      Fixes: 879c7220 ("net: pktgen: Observe needed_headroom of the device")
      Suggested-by: default avatarBen Greear <greearb@candelatech.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c002dfd8
    • Gavin Schenk's avatar
      net: fec: set mac address unconditionally · ff1b27c3
      Gavin Schenk authored
      [ Upstream commit b82d44d7 ]
      
      If the mac address origin is not dt, you can only safely assign a mac
      address after "link up" of the device. If the link is off the clocks are
      disabled and because of issues assigning registers when clocks are off the
      new mac address cannot be written in .ndo_set_mac_address() on some soc's.
      This fix sets the mac address unconditionally in fec_restart(...) and
      ensures consistency between fec registers and the network layer.
      Signed-off-by: default avatarGavin Schenk <g.schenk@eckelmann.de>
      Acked-by: default avatarFugang Duan <fugang.duan@nxp.com>
      Acked-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Fixes: 9638d19e ("net: fec: add netif status check before set mac address")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff1b27c3
  2. 10 Nov, 2016 25 commits