1. 31 Jan, 2015 8 commits
    • Iyappan Subramanian's avatar
      drivers: net: xgene: fix: Out of order descriptor bytes read · ecf6ba83
      Iyappan Subramanian authored
      This patch fixes the following kernel crash,
      
      	WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_input.c:3079 tcp_clean_rtx_queue+0x658/0x80c()
      	Call trace:
      	[<fffffe0000096b7c>] dump_backtrace+0x0/0x184
      	[<fffffe0000096d10>] show_stack+0x10/0x1c
      	[<fffffe0000685ea0>] dump_stack+0x74/0x98
      	[<fffffe00000b44e0>] warn_slowpath_common+0x88/0xb0
      	[<fffffe00000b461c>] warn_slowpath_null+0x14/0x20
      	[<fffffe00005b5c1c>] tcp_clean_rtx_queue+0x654/0x80c
      	[<fffffe00005b6228>] tcp_ack+0x454/0x688
      	[<fffffe00005b6ca8>] tcp_rcv_established+0x4a4/0x62c
      	[<fffffe00005bf4b4>] tcp_v4_do_rcv+0x16c/0x350
      	[<fffffe00005c225c>] tcp_v4_rcv+0x8e8/0x904
      	[<fffffe000059d470>] ip_local_deliver_finish+0x100/0x26c
      	[<fffffe000059dad8>] ip_local_deliver+0xac/0xc4
      	[<fffffe000059d6c4>] ip_rcv_finish+0xe8/0x328
      	[<fffffe000059dd3c>] ip_rcv+0x24c/0x38c
      	[<fffffe0000563950>] __netif_receive_skb_core+0x29c/0x7c8
      	[<fffffe0000563ea4>] __netif_receive_skb+0x28/0x7c
      	[<fffffe0000563f54>] netif_receive_skb_internal+0x5c/0xe0
      	[<fffffe0000564810>] napi_gro_receive+0xb4/0x110
      	[<fffffe0000482a2c>] xgene_enet_process_ring+0x144/0x338
      	[<fffffe0000482d18>] xgene_enet_napi+0x1c/0x50
      	[<fffffe0000565454>] net_rx_action+0x154/0x228
      	[<fffffe00000b804c>] __do_softirq+0x110/0x28c
      	[<fffffe00000b8424>] irq_exit+0x8c/0xc0
      	[<fffffe0000093898>] handle_IRQ+0x44/0xa8
      	[<fffffe000009032c>] gic_handle_irq+0x38/0x7c
      	[...]
      
      Software writes poison data into the descriptor bytes[15:8] and upon
      receiving the interrupt, if those bytes are overwritten by the hardware with
      the valid data, software also reads bytes[7:0] and executes receive/tx
      completion logic.
      
      If the CPU executes the above two reads out of order, bytes[7:0] will
      hold stale data, which causes the kernel panic.  We have to force the
      order of the reads, so this patch introduces a read memory barrier
      between them.
      Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
      Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ecf6ba83
    • David S. Miller's avatar
      Merge branch 'vlan_get_protocol' · 08178e5a
      David S. Miller authored
      Toshiaki Makita says:
      
      ====================
      Fix checksum error when using stacked vlan
      
      When I was testing 802.1ad, I found several drivers don't take into
      account 802.1ad or multiple vlans when retrieving L3 (IP/IPv6) or
      L4 (TCP/UDP) protocol for checksum offload.
      
      It is mainly due to vlan_get_protocol(), which extracts ether type only
      when it is tagged with single 802.1Q. When 802.1ad is used or there are
      multiple vlans, it extracts vlan protocol and drivers cannot determine
      which L3/L4 protocol is used.
      
      Those drivers, most of which have IP_CSUM/IPV6_CSUM features, get L3/L4
      header-offset by software, so it seems that their checksum offload works
      with multiple vlans if we can parse protocols correctly.
      (They know mac header length, and probably don't care about what is in it.)
      
      And another thing, some of Intel's drivers seem to use skb->protocol where
      vlan_get_protocol() is more suitable.
      
      I tested that at least igb/igbvf on I350 works with this patch set.
      
      Note:
      We can hand a double tagged packet with CHECKSUM_PARTIAL to a HW driver
      by creating a vlan device on a bridge device and enabling vlan_filtering
      of the bridge with 802.1ad protocol.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08178e5a
    • Toshiaki Makita's avatar
      ixgbevf: Fix checksum error when using stacked vlan · 10e4fb33
      Toshiaki Makita authored
      When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
      ixgbevf_tx_csum() fails to get the network protocol and checksum related
      descriptor fields are not configured correctly because skb->protocol
      doesn't show the L3 protocol in this case.
      
      Use first->protocol instead of skb->protocol to get the proper network
      protocol.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10e4fb33
    • Toshiaki Makita's avatar
      ixgbe: Fix checksum error when using stacked vlan · 0213668f
      Toshiaki Makita authored
      When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
      ixgbe_tx_csum() fails to get the network protocol and checksum related
      descriptor fields are not configured correctly because skb->protocol
      doesn't show the L3 protocol in this case.
      
      Use vlan_get_protocol() to get the proper network protocol.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0213668f
    • Toshiaki Makita's avatar
      igbvf: Fix checksum error when using stacked vlan · 72b14059
      Toshiaki Makita authored
      When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
      igbvf_tx_csum() fails to get the network protocol and checksum related
      descriptor fields are not configured correctly because skb->protocol
      doesn't show the L3 protocol in this case.
      
      Use vlan_get_protocol() to get the proper network protocol.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72b14059
    • Toshiaki Makita's avatar
      net: Fix vlan_get_protocol for stacked vlan · d4bcef3f
      Toshiaki Makita authored
      vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
      vlan tag or multiple vlans, which caused incorrect checksum calculation
      in several drivers.
      
      Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
      vlan protocol.
      
      As the logic is the same as skb_network_protocol(), create a common helper
      function __vlan_get_protocol() and call it from existing functions.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4bcef3f
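The idea of skipping stacked VLAN tags to reach the network protocol can be sketched on a raw Ethernet frame. This is a stand-alone illustration, not the kernel helper: `frame_l3_protocol()` and `read_be16()` are hypothetical names, and the function works on a byte buffer rather than an skb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN     14     /* dst MAC + src MAC + ethertype */
#define VLAN_HLEN    4      /* TCI + encapsulated ethertype */
#define ETH_P_8021Q  0x8100 /* 802.1Q tag */
#define ETH_P_8021AD 0x88A8 /* 802.1ad service tag */

static uint16_t read_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the first non-VLAN ethertype in host byte order, or 0 if the
 * frame is truncated. *depth receives the total MAC header length,
 * including every VLAN tag that was skipped.
 */
static uint16_t frame_l3_protocol(const uint8_t *frame, size_t len,
				  size_t *depth)
{
	size_t off = ETH_HLEN - 2; /* offset of the outer ethertype */
	uint16_t type;

	assert(frame != NULL);
	if (len < ETH_HLEN)
		return 0;

	type = read_be16(frame + off);
	while (type == ETH_P_8021Q || type == ETH_P_8021AD) {
		if (len < off + 2 + VLAN_HLEN)
			return 0; /* tag announced but not present */
		off += VLAN_HLEN;
		type = read_be16(frame + off);
	}

	if (depth)
		*depth = off + 2;
	return type;
}
```

For a frame carrying an 802.1ad tag followed by an 802.1Q tag and an IPv4 payload, this returns 0x0800 with a depth of 22 bytes, which is the kind of information a checksum-offload path needs.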
    • Saran Maruti Ramanara's avatar
      net: sctp: fix passing wrong parameter header to param_type2af in sctp_process_param · cfbf654e
      Saran Maruti Ramanara authored
      When making use of RFC5061, section 4.2.4. for setting the primary IP
      address, we're passing a wrong parameter header to param_type2af(),
      resulting always in NULL being returned.
      
      At this point, param.p points to a sctp_addip_param struct, containing
      a sctp_paramhdr (type = 0xc004, length = var), and crr_id as a correlation
      id. Followed by that, as also presented in RFC5061 section 4.2.4., comes
      the actual sctp_addr_param, which also contains a sctp_paramhdr, but
      this time with the correct type SCTP_PARAM_IPV{4,6}_ADDRESS that
      param_type2af() can make use of. Since we already hold a pointer to
      addr_param from previous line, just reuse it for param_type2af().
      
      Fixes: d6de3097 ("[SCTP]: Add the handling of "Set Primary IP Address" parameter to INIT")
      Signed-off-by: Saran Maruti Ramanara <saran.neti@telus.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfbf654e
    • Pablo Neira's avatar
      netlink: fix wrong subscription bitmask to group mapping in · 8b7c36d8
      Pablo Neira authored
      The subscription bitmask passed via struct sockaddr_nl is converted to
      the group number when calling the netlink_bind() and netlink_unbind()
      callbacks.
      
      The conversion is however incorrect since bitmask (1 << 0) needs to be
      mapped to group number 1. Note that you cannot specify the group number 0
      (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP
      since this is rejected through -EINVAL.
      
      This problem became noticeable since 97840cb6 ("netfilter: nfnetlink:
      fix insufficient validation in nfnetlink_bind") when binding to bitmask
      (1 << 0) in ctnetlink.
      Reported-by: Andre Tomt <andre@tomt.net>
      Reported-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b7c36d8
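The mapping rule stated in this commit, where bit position n of the subscription bitmask corresponds to group number n + 1, can be sketched as follows. `mask_to_groups()` is a hypothetical helper written for illustration; it is not the code in netlink_bind().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand a 32-bit subscription bitmask into group numbers.
 * Bit 0 maps to group 1, bit 1 to group 2, and so on; group 0 is never
 * produced, matching the note that group 0 cannot be subscribed to.
 * Returns the number of groups written to out[].
 */
static size_t mask_to_groups(uint32_t mask, unsigned int out[32])
{
	size_t n = 0;
	unsigned int bit;

	assert(out != NULL);
	for (bit = 0; bit < 32; bit++) {
		if (mask & (UINT32_C(1) << bit))
			out[n++] = bit + 1; /* the off-by-one the fix restores */
	}
	return n;
}
```

With this rule, a bitmask of (1 << 0) yields group 1, which is the case the commit message calls out.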
  2. 29 Jan, 2015 20 commits
    • Li Wei's avatar
      ipv4: Don't increase PMTU with Datagram Too Big message. · 3cdaa5be
      Li Wei authored
      RFC 1191 said, "a host MUST not increase its estimate of the Path
      MTU in response to the contents of a Datagram Too Big message."
      Signed-off-by: Li Wei <lw@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3cdaa5be
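The RFC 1191 rule quoted above reduces to a one-line update. The sketch below assumes a hypothetical `pmtu_update()` helper and leaves out details a real stack handles, such as a minimum-PMTU floor and expiry timers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Apply the next-hop MTU from a Datagram Too Big message to the current
 * path MTU estimate. The estimate may shrink but must never grow.
 */
static uint32_t pmtu_update(uint32_t current_pmtu, uint32_t reported_mtu)
{
	assert(current_pmtu != 0);
	return reported_mtu < current_pmtu ? reported_mtu : current_pmtu;
}
```

A message reporting 1500 against a current estimate of 1400 therefore leaves the estimate at 1400.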
    • David S. Miller's avatar
      Merge branch 'arm-build-fixes' · a1a0b558
      David S. Miller authored
      Arnd Bergmann says:
      
      ====================
      net: driver fixes from arm randconfig builds
      
      These four patches are fallout from test builds on ARM. I have a
      few more of them in my backlog but have not yet confirmed them
      to still be valid.
      
      The first three patches are about incomplete dependencies on
      old drivers. One could backport them to the beginning of time
      in theory, but there is little value since nobody would run into
      these problems.
      
      The final patch is one I had submitted before together with the
      respective pcmcia patch but forgot to follow up on that. It's
      still a valid but relatively theoretical bug, because the previous
      behavior of the driver was just as broken as what we have in
      mainline.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1a0b558
    • Arnd Bergmann's avatar
      net: am2150: fix nmclan_cs.c shared interrupt handling · 96a30175
      Arnd Bergmann authored
      A recent patch tried to work around a valid warning for the use of a
      deprecated interface by blindly changing from the old
      pcmcia_request_exclusive_irq() interface to pcmcia_request_irq().
      
      This driver has an interrupt handler that is not currently aware
      of shared interrupts, but can be easily converted to be.
      At the moment, the driver reads the interrupt status register
      repeatedly until it contains only zeroes in the interesting bits,
      and handles each bit individually.
      
      This patch adds the missing part of returning IRQ_NONE in case none
      of the bits are set to start with, so we can move on to the next
      interrupt source.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 5f5316fc ("am2150: Update nmclan_cs.c to use update PCMCIA API")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96a30175
    • Arnd Bergmann's avatar
      net: lance,ni64: don't build for ARM · e9b106b8
      Arnd Bergmann authored
      The ni65 and lance ethernet drivers manually program the ISA DMA
      controller that is only available on x86 PCs and a few compatible
      systems. Trying to build it on ARM results in this error:
      
      ni65.c: In function 'ni65_probe1':
      ni65.c:496:62: error: 'DMA1_STAT_REG' undeclared (first use in this function)
           ((inb(DMA1_STAT_REG) >> 4) & 0x0f)
                                                                    ^
      ni65.c:496:62: note: each undeclared identifier is reported only once for each function it appears in
      ni65.c:497:63: error: 'DMA2_STAT_REG' undeclared (first use in this function)
           | (inb(DMA2_STAT_REG) & 0xf0);
      
      The DMA1_STAT_REG and DMA2_STAT_REG registers are only defined for
      alpha, mips, parisc, powerpc and x86, although it is not clear
      which subarchitectures actually have them at the correct location.
      
      This patch for now just disables it for ARM, to avoid randconfig
      build errors. We could also decide to limit it to the set of
      architectures on which it does compile, but that might look more
      deliberate than guessing based on where the drivers build.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9b106b8
    • Arnd Bergmann's avatar
      net: wan: add missing virt_to_bus dependencies · 303c28d8
      Arnd Bergmann authored
      The cosa driver is rather outdated and does not get built on most
      platforms because it requires the ISA_DMA_API symbol. However
      there are some ARM platforms that have ISA_DMA_API but no virt_to_bus,
      and they get this build error when enabling the cosa driver.
      
      drivers/net/wan/cosa.c: In function 'tx_interrupt':
      drivers/net/wan/cosa.c:1768:3: error: implicit declaration of function 'virt_to_bus'
         unsigned long addr = virt_to_bus(cosa->txbuf);
         ^
      
      The same problem exists for the Hostess SV-11 and Sealevel Systems 4021
      drivers.
      
      This adds another dependency in Kconfig to avoid that configuration.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      303c28d8
    • Arnd Bergmann's avatar
      net: cs89x0: always build platform code if !HAS_IOPORT_MAP · fc9a5707
      Arnd Bergmann authored
      The cs89x0 driver can either be built as an ISA driver or a platform
      driver, the choice is controlled by the CS89x0_PLATFORM Kconfig
      symbol. Building the ISA driver on a system that does not have
      a way to map I/O ports fails with this error:
      
      drivers/built-in.o: In function `cs89x0_ioport_probe.constprop.1':
      :(.init.text+0x4794): undefined reference to `ioport_map'
      :(.init.text+0x4830): undefined reference to `ioport_unmap'
      
      This changes the Kconfig logic to take that option away and
      always force building the platform variant of this driver if
      CONFIG_HAS_IOPORT_MAP is not set. This is the only correct
      choice in this case, and it avoids the build error.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc9a5707
    • Florian Westphal's avatar
      ppp: deflate: never return len larger than output buffer · e2a4800e
      Florian Westphal authored
      When we've run out of space in the output buffer to store more data, we
      will call zlib_deflate with a NULL output buffer until we've consumed
      remaining input.
      
      When this happens, olen contains the size the output buffer would have
      consumed iff we'd have had enough room.
      
      This can later cause skb_over_panic when ppp_generic skb_put()s
      the returned length.
      Reported-by: Iain Douglas <centos@1n6.org.uk>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2a4800e
    • David S. Miller's avatar
      Merge branch 'netns' · d445d63b
      David S. Miller authored
      Nicolas Dichtel says:
      
      ====================
      netns: audit netdevice creation with IFLA_NET_NS_[PID|FD]
      
      When one of these attributes is set, the netdevice is created into the netns
      pointed by IFLA_NET_NS_[PID|FD] (see the call to rtnl_create_link() in
      rtnl_newlink()). Let's call this netns the dest_net. After this creation, if the
      newlink handler exists, it is called with a netns argument that points to the
      netns where the netlink message has been received (called src_net in the code)
      which is the link netns.
      Hence, with one of these attributes, it's possible to create a x-netns
      netdevice.
      
      Here is the result of my code review:
      - all ip tunnels (sit, ipip, ip6_tunnels, gre[tap][v6], ip_vti[6]) do not
        really allow the use of this feature: the netdevice is created in the dest_net
        and the src_net is completely ignored in the newlink handler.
      - VLAN properly handles this x-netns creation.
      - bridge ignores src_net, which seems fine (NETIF_F_NETNS_LOCAL is set).
      - CAIF subsystem is not clear to me (I don't know how it works), but it seems
        to wrongly use src_net. Patch #1 tries to fix this, but it was done only by
        code review (and only compile-tested), so please carefully review it. I may
        have missed something.
      - HSR subsystem uses src_net to parse IFLA_HSR_SLAVE[1|2], but the netdevice has
        the flag NETIF_F_NETNS_LOCAL, so the question is: does this netdevice really
        support x-netns? If not, the newlink handler should use the dest_net instead
        of src_net, I can provide the patch.
      - ieee802154 also uses src_net and does not have NETIF_F_NETNS_LOCAL. Same
        question: does this netdevice really support x-netns?
      - bonding ignores src_net and flag NETIF_F_NETNS_LOCAL is set, ie x-netns is not
        supported. Fine.
      - CAN does not support rtnl/newlink, ok.
      - ipvlan uses src_net and does not have NETIF_F_NETNS_LOCAL. After looking at
        the code, it seems that this driver supports x-netns. Am I right?
      - macvlan/macvtap uses src_net and seems to have x-netns support.
      - team ignores src_net and has the flag NETIF_F_NETNS_LOCAL, ie x-netns is not
        supported. Ok.
      - veth uses src_net and has x-netns support ;-) Ok.
      - VXLAN didn't properly handle this. The link netns (vxlan->net) is the src_net
        and not dest_net (see patch #2). Note that it was already possible to create an
        x-netns vxlan before the commit f01ec1c0 ("vxlan: add x-netns support")
        but the netdevice was broken.
      
      To summarize:
       - CAIF patch must be carefully reviewed
       - for HSR, ieee802154, ipvlan: is x-netns supported?
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d445d63b
    • Nicolas Dichtel's avatar
      vxlan: setup the right link netns in newlink hdlr · 33564bbb
      Nicolas Dichtel authored
      Rename the netns to src_net to avoid confusion with the netns where the
      interface stands. The user may specify IFLA_NET_NS_[PID|FD] to create
      an x-netns netdevice: IFLA_NET_NS_[PID|FD] points to the netns where the
      netdevice stands and src_net to the link netns.
      
      Note that before commit f01ec1c0 ("vxlan: add x-netns support"), it was
      possible to create a x-netns vxlan netdevice, but the netdevice was not
      operational.
      
      Fixes: f01ec1c0 ("vxlan: add x-netns support")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33564bbb
    • Nicolas Dichtel's avatar
      caif: remove wrong dev_net_set() call · 8997c27e
      Nicolas Dichtel authored
      src_net points to the netns where the netlink message has been received. This
      netns may be different from the netns where the interface is created (because
      the user may add IFLA_NET_NS_[PID|FD]). In this case, src_net is the link netns.
      
      It seems wrong to override the netns in the newlink() handler because if it
      was not already src_net, it means that the user explicitly asks to create the
      netdevice in another netns.
      
      CC: Sjur Brændeland <sjur.brandeland@stericsson.com>
      CC: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
      Fixes: 8391c4aa ("caif: Bugfixes in CAIF netdevice for close and flow control")
      Fixes: c4125400 ("caif-hsi: Add rtnl support")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8997c27e
    • karl beldan's avatar
      lib/checksum.c: fix build for generic csum_tcpudp_nofold · 9ce35779
      karl beldan authored
      Fixed commit added from64to32 under #ifndef do_csum but used it
      under #ifndef csum_tcpudp_nofold, breaking some builds (Fengguang's
      robot reported TILEGX's). Move from64to32 under the latter.
      
      Fixes: 150ae0e9 ("lib/checksum.c: fix carry in csum_tcpudp_nofold")
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ce35779
    • Eric Dumazet's avatar
      tcp: ipv4: initialize unicast_sock sk_pacing_rate · 811230cd
      Eric Dumazet authored
      When I added sk_pacing_rate field, I forgot to initialize its value
      in the per cpu unicast_sock used in ip_send_unicast_reply()
      
      This means that for sch_fq users, RST packets, or ACK packets sent
      on behalf of TIME_WAIT sockets might be sent too slowly or even dropped
      once we reach the per flow limit.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 95bd09eb ("tcp: TSO packets automatic sizing")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      811230cd
    • karl beldan's avatar
      lib/checksum.c: fix carry in csum_tcpudp_nofold · 150ae0e9
      karl beldan authored
      The carry from the 64->32bits folding was dropped, e.g with:
      saddr=0xFFFFFFFF daddr=0xFF0000FF len=0xFFFF proto=0 sum=1,
      csum_tcpudp_nofold returned 0 instead of 1.
      Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      150ae0e9
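The dropped carry can be reproduced with plain 64-bit arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and `pseudo_sum_le()` accumulates the words as a little-endian build would (length and protocol shifted left by 8).

```c
#include <assert.h>
#include <stdint.h>

/* Fold that loses the carry: the addition can overflow bit 32 again. */
static uint32_t fold_dropping_carry(uint64_t s)
{
	s += s >> 32;
	return (uint32_t)s; /* truncation discards the new carry */
}

/* Fold with end-around carry, in the spirit of from64to32(). */
static uint32_t fold_with_carry(uint64_t x)
{
	x = (x & 0xffffffffu) + (x >> 32); /* 32 bits plus a possible carry */
	x = (x & 0xffffffffu) + (x >> 32); /* add that carry back in */
	return (uint32_t)x;
}

/* Accumulate the pseudo-header words (little-endian layout assumed). */
static uint64_t pseudo_sum_le(uint32_t saddr, uint32_t daddr,
			      uint16_t len, uint16_t proto, uint32_t sum)
{
	uint64_t s = sum;

	s += saddr;
	s += daddr;
	s += ((uint64_t)proto + len) << 8;
	return s;
}

/* The commit's example: the accumulator is 0x1FFFFFFFF before folding. */
static void checksum_example(void)
{
	uint64_t s = pseudo_sum_le(0xFFFFFFFFu, 0xFF0000FFu, 0xFFFF, 0, 1);

	assert(s == UINT64_C(0x1FFFFFFFF));
	assert(fold_dropping_carry(s) == 0); /* the buggy result */
	assert(fold_with_carry(s) == 1);     /* the expected result */
}
```

Two folding steps are enough because the first step leaves at most a 33-bit value, and adding its single carry bit back cannot overflow 32 bits again.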
    • Roopa Prabhu's avatar
      bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify · 59ccaaaa
      Roopa Prabhu authored
      Reported in: https://bugzilla.kernel.org/show_bug.cgi?id=92081
      
      This patch avoids calling rtnl_notify if the device ndo_bridge_getlink
      handler does not return any bytes in the skb.
      
      Alternately, the skb->len check can be moved inside rtnl_notify.
      
      For the bridge vlan case described in 92081, there is also a fix needed
      in bridge driver to generate a proper notification. Will fix that in
      subsequent patch.
      
      v2: rebase patch on net tree
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59ccaaaa
    • David S. Miller's avatar
      Merge branch 'tcp_stretch_acks' · 95224ac1
      David S. Miller authored
      Neal Cardwell says:
      
      ====================
      fix stretch ACK bugs in TCP CUBIC and Reno
      
      This patch series fixes the TCP CUBIC and Reno congestion control
      modules to properly handle stretch ACKs in their respective additive
      increase modes, and in the transitions from slow start to additive
      increase.
      
      This finishes the project started by commit 9f9843a7 ("tcp:
      properly handle stretch acks in slow start"), which fixed behavior for
      TCP congestion control when handling stretch ACKs in slow start mode.
      
      Motivation: In the Jan 2015 netdev thread 'BW regression after "tcp:
      refine TSO autosizing"', Eyal Perry documented a regression that Eric
      Dumazet determined was caused by improper handling of TCP stretch
      ACKs.
      
      Background: LRO, GRO, delayed ACKs, and middleboxes can cause "stretch
      ACKs" that cover more than the RFC-specified maximum of 2
      packets. These stretch ACKs can cause serious performance shortfalls
      in common congestion control algorithms, like Reno and CUBIC, which
      were designed and tuned years ago with receiver hosts that were not
      using LRO or GRO, and were instead ACKing every other packet.
      
      Testing: at Google we have been using this approach for handling
      stretch ACKs for CUBIC datacenter and Internet traffic for several
      years, with good results.
      
      v2:
       * fixed return type of tcp_slow_start() to be u32 instead of int
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95224ac1
    • Neal Cardwell's avatar
      tcp: fix timing issue in CUBIC slope calculation · d6b1a8a9
      Neal Cardwell authored
      This patch fixes a bug in CUBIC that causes cwnd to increase slightly
      too slowly when multiple ACKs arrive in the same jiffy.
      
      If cwnd is supposed to increase at a rate of more than once per jiffy,
      then CUBIC was sometimes too slow. Because the bic_target is
      calculated for a future point in time, calculated with time in
      jiffies, the cwnd can increase over the course of the jiffy while the
      bic_target calculated as the proper CUBIC cwnd at time
      t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
      increases on jiffy tick boundaries.
      
      So since the cnt is set to:
      	ca->cnt = cwnd / (bic_target - cwnd);
      as cwnd increases but bic_target does not increase due to jiffy
      granularity, the cnt becomes too large, causing cwnd to increase
      too slowly.
      
      For example:
      - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
      - so CUBIC sets:
         ca->cnt =  cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
      - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
         increases cwnd to 41
      - so CUBIC sets:
         ca->cnt =  cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13
      
      So now CUBIC will wait for 13 packets to be ACKed before increasing
      cwnd to 42, instead of 10 as it should.
      
      The fix is to avoid adjusting the slope (determined by ca->cnt)
      multiple times within a jiffy, and instead skip to compute the Reno
      cwnd, the "TCP friendliness" code path.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6b1a8a9
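The arithmetic in the example above can be checked with a small helper. `cubic_cnt()` is a hypothetical stand-alone function; the fallback of 100 * cwnd for a target at or below cwnd is included only so the division is always defined and is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of ACKed packets required before cwnd grows by one, given the
 * current cwnd and the CUBIC target computed for this jiffy.
 */
static uint32_t cubic_cnt(uint32_t cwnd, uint32_t bic_target)
{
	assert(cwnd > 0);
	if (bic_target > cwnd)
		return cwnd / (bic_target - cwnd);
	return 100 * cwnd; /* target reached: grow only very slowly */
}
```

With bic_target stuck at 44 for the whole jiffy, cwnd=40 gives 10 while cwnd=41 gives 13, so recomputing the slope on every ACK makes the remaining growth progressively slower.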
    • Neal Cardwell's avatar
      tcp: fix stretch ACK bugs in CUBIC · 9cd981dc
      Neal Cardwell authored
      Change CUBIC to properly handle stretch ACKs in additive increase mode
      by passing in the count of ACKed packets to tcp_cong_avoid_ai().
      
      In addition, because we are now precisely accounting for stretch ACKs,
      including delayed ACKs, we can now remove the delayed ACK tracking and
      estimation code that tracked recent delayed ACK behavior in
      ca->delayed_ack.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cd981dc
    • Neal Cardwell's avatar
      tcp: fix stretch ACK bugs in Reno · c22bdca9
      Neal Cardwell authored
      Change Reno to properly handle stretch ACKs in additive increase mode
      by passing in the count of ACKed packets to tcp_cong_avoid_ai().
      
      In addition, if snd_cwnd crosses snd_ssthresh during slow start
      processing, and we then exit slow start mode, we need to carry over
      any remaining "credit" for packets ACKed and apply that to additive
      increase by passing this remaining "acked" count to
      tcp_cong_avoid_ai().
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c22bdca9
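The carry-over of unused ACK "credit" from slow start into additive increase might look roughly like the sketch below. It is a simplified user-space model, not the kernel functions: `struct cc_state` and the three helpers are hypothetical, and the cwnd clamp is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal congestion-control state (hypothetical, clamp omitted). */
struct cc_state {
	uint32_t snd_cwnd;
	uint32_t snd_ssthresh;
	uint32_t snd_cwnd_cnt;
};

/* Grow cwnd by one per ACKed packet up to ssthresh + 1 and return the
 * ACKed packets that were not needed to get there. */
static uint32_t slow_start(struct cc_state *s, uint32_t acked)
{
	uint32_t cwnd = s->snd_cwnd + acked;

	if (cwnd > s->snd_ssthresh)
		cwnd = s->snd_ssthresh + 1;
	acked -= cwnd - s->snd_cwnd;
	s->snd_cwnd = cwnd;
	return acked;
}

/* Additive increase: one cwnd step for every w packets ACKed. */
static void additive_increase(struct cc_state *s, uint32_t w, uint32_t acked)
{
	assert(w > 0);
	s->snd_cwnd_cnt += acked;
	if (s->snd_cwnd_cnt >= w) {
		uint32_t delta = s->snd_cwnd_cnt / w;

		s->snd_cwnd_cnt -= delta * w;
		s->snd_cwnd += delta;
	}
}

/* Reno-style handler: leftover credit from slow start feeds into AI. */
static void reno_cong_avoid(struct cc_state *s, uint32_t acked)
{
	if (s->snd_cwnd <= s->snd_ssthresh) {
		acked = slow_start(s, acked);
		if (!acked)
			return;
	}
	additive_increase(s, s->snd_cwnd, acked);
}
```

For example, starting at cwnd=10 with ssthresh=10, a stretch ACK for 25 packets spends one credit to leave slow start (cwnd=11) and applies the remaining 24 to additive increase, ending at cwnd=13 with 2 credits left over.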
    • Neal Cardwell's avatar
      tcp: fix the timid additive increase on stretch ACKs · 814d488c
      Neal Cardwell authored
      tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on
      "stretch ACKs" -- cases where the receiver ACKed more than 1 packet in
      a single ACK. For example, suppose w is 10 and we get a stretch ACK
      for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2
      (since acked/w = 20/10 = 2), but instead we were only increasing cwnd
      by 1. This patch fixes that behavior.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      814d488c
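The w=10, acked=20 example from this commit can be worked through with a small model of the credit accounting. The names `struct ai_state` and `cong_avoid_ai()` are hypothetical, and the sketch ignores the cwnd clamp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the sender state used by additive increase. */
struct ai_state {
	uint32_t snd_cwnd;     /* congestion window, in packets */
	uint32_t snd_cwnd_cnt; /* ACKed packets not yet turned into cwnd */
};

/*
 * Credit all ACKed packets, then raise cwnd by one for every w packets
 * of credit, keeping the remainder for later ACKs.
 */
static void cong_avoid_ai(struct ai_state *s, uint32_t w, uint32_t acked)
{
	assert(w > 0);
	s->snd_cwnd_cnt += acked;
	if (s->snd_cwnd_cnt >= w) {
		uint32_t delta = s->snd_cwnd_cnt / w;

		s->snd_cwnd_cnt -= delta * w;
		s->snd_cwnd += delta;
	}
}
```

Starting from cwnd=10, a single stretch ACK for 20 packets with w=10 raises cwnd to 12, which is the increase of 2 the commit message says is appropriate.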
    • Neal Cardwell's avatar
      tcp: stretch ACK fixes prep · e73ebb08
      Neal Cardwell authored
      LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
      cover more than the RFC-specified maximum of 2 packets. These stretch
      ACKs can cause serious performance shortfalls in common congestion
      control algorithms that were designed and tuned years ago with
      receiver hosts that were not using LRO or GRO, and were instead
      politely ACKing every other packet.
      
      This patch series fixes Reno and CUBIC to handle stretch ACKs.
      
      This patch prepares for the upcoming stretch ACK bug fix patches. It
      adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
      fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
      changes all congestion control algorithms to pass in 1 for the ACKed
      count. It also changes tcp_slow_start() to return the number of packet
      ACK "credits" that were not processed in slow start mode, and can be
      processed by the congestion control module in additive increase mode.
      
      In future patches we will fix tcp_cong_avoid_ai() to handle stretch
      ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
      and additive increase mode.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e73ebb08
  3. 27 Jan, 2015 12 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 59343cd7
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Don't OOPS on socket AIO, from Christoph Hellwig.
      
       2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
          Grumbach.
      
       3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.
      
       4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
          Starovoitov.
      
       5) Lots of crash, memory leak, short TX packet et al bug fixes in
          sh_eth from Ben Hutchings.
      
       6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
          Borkmann.
      
       7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
          From Eric Dumazet and Govindarajulu Varadarajan.
      
       8) Header length calculation fix in mac80211 from Fred Chou.
      
       9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
          From Ezequiel Garcia.
      
      10) udp_diag has bogus logic in its hash chain skipping, copy the same
          fix tcp diag used.  From Herbert Xu.
      
      11) amd-xgbe programs wrong rx flow control register, from Thomas
          Lendacky.
      
      12) Fix race leading to use after free in ping receive path, from Subash
          Abhinov Kasiviswanathan.
      
      13) Cache redirect routes otherwise we can get a heavy backlog of rcu
          jobs liberating DST_NOCACHE entries.  From Hannes Frederic Sowa.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
        net: don't OOPS on socket aio
        stmmac: prevent probe drivers to crash kernel
        bnx2x: fix napi poll return value for repoll
        ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
        sh_eth: Fix DMA-API usage for RX buffers
        sh_eth: Check for DMA mapping errors on transmit
        sh_eth: Ensure DMA engines are stopped before freeing buffers
        sh_eth: Remove RX overflow log messages
        ping: Fix race in free in receive path
        udp_diag: Fix socket skipping within chain
        can: kvaser_usb: Fix state handling upon BUS_ERROR events
        can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
        can: kvaser_usb: Send correct context to URB completion
        can: kvaser_usb: Do not sleep in atomic context
        ipv4: try to cache dst_entries which would cause a redirect
        samples: bpf: relax test_maps check
        bpf: rcu lock must not be held when calling copy_to_user()
        net: sctp: fix slab corruption from use after free on INIT collisions
        net: mv643xx_eth: Fix highmem support in non-TSO egress path
        sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
        ...
      59343cd7
    • Christoph Hellwig's avatar
      net: don't OOPS on socket aio · 06539d30
      Christoph Hellwig authored
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06539d30
    • Andy Shevchenko's avatar
      stmmac: prevent probe drivers to crash kernel · 9afec6ef
      Andy Shevchenko authored
      In the case when alloc_netdev fails we return NULL to a caller. But there is no
      check for NULL in the probe drivers. This patch changes NULL to an error
      pointer. The function description is amended to reflect what we may get
      returned.
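
      The NULL-versus-error-pointer distinction can be sketched in userspace
      as below. The sketch_* helpers imitate the kernel's ERR_PTR(), IS_ERR()
      and PTR_ERR(); sketch_dvr_probe() and struct sketch_priv are
      hypothetical and do not reproduce the stmmac code. The point is that a
      caller testing only for an error pointer is safe once the function
      never returns NULL.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct sketch_priv {
	int id;
};

/*
 * Hypothetical probe helper: returns a valid pointer or an error pointer,
 * never NULL. `simulate_alloc_failure` stands in for alloc_netdev failing.
 */
static struct sketch_priv *sketch_dvr_probe(int simulate_alloc_failure)
{
	struct sketch_priv *priv;

	priv = simulate_alloc_failure ? NULL : calloc(1, sizeof(*priv));
	if (!priv)
		return sketch_err_ptr(-ENOMEM);
	return priv;
}
```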
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9afec6ef
    • Linus Torvalds's avatar
      Merge tag 'powerpc-3.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux · 7da323bb
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Two powerpc fixes"
      
      * tag 'powerpc-3.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
        powerpc/powernv: Restore LPCR with LPCR_PECE1 cleared
        powerpc/xmon: Fix another endiannes issue in RTAS call from xmon
      7da323bb
    • Linus Torvalds's avatar
      Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux · 41592e2f
      Linus Torvalds authored
      Pull one more module fix from Rusty Russell:
       "SCSI was using module_refcount() to figure out when the module was
        unloading: this broke with new atomic refcounting.  The code is still
        suspicious, but this solves the WARN_ON()"
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
        scsi: always increment reference count
      41592e2f
    • Govindarajulu Varadarajan's avatar
      bnx2x: fix napi poll return value for repoll · 24e579c8
      Govindarajulu Varadarajan authored
      With the commit d75b1ade ("net: less interrupt masking in NAPI") napi
      repoll is done only when work_done == budget. When in busy_poll, we return 0
      in napi_poll. We should return budget.
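
      A simplified model of the return-value rule, assuming a hypothetical
      queue structure (struct sketch_queue) rather than the bnx2x data
      structures: when the poll handler cannot touch the ring because a
      busy-poll user owns it, it reports the full budget so that the core
      polls again.

```c
#include <stdbool.h>

/* Hypothetical per-queue state. */
struct sketch_queue {
	bool busy_poll_owner;  /* true while a busy-poll user owns the queue */
	int pending;           /* packets waiting in the ring */
};

/*
 * Returns the value a NAPI-style poll handler would return. Returning
 * `budget` requests another poll; returning less signals completion.
 */
static int sketch_napi_poll(struct sketch_queue *q, int budget)
{
	int work_done;

	if (q->busy_poll_owner)
		return budget;  /* ring not processed: request a repoll */

	work_done = q->pending < budget ? q->pending : budget;
	q->pending -= work_done;
	return work_done;
}
```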
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      24e579c8
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · bf693f7b
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      ipsec 2015-01-26
      
      Just two small fixes for _decode_session6() where we
      might decode to wrong header information in some rare
      situations.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf693f7b
    • Hannes Frederic Sowa's avatar
      ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too · 6e9e16e6
      Hannes Frederic Sowa authored
      Lubomir Rintel reported that during replacing a route the interface
      reference counter isn't correctly decremented.
      
      To quote bug <https://bugzilla.kernel.org/show_bug.cgi?id=91941>:
      | [root@rhel7-5 lkundrak]# sh -x lal
      | + ip link add dev0 type dummy
      | + ip link set dev0 up
      | + ip link add dev1 type dummy
      | + ip link set dev1 up
      | + ip addr add 2001:db8:8086::2/64 dev dev0
      | + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20
      | + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10
      | + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20
      | + ip link del dev0 type dummy
      | Message from syslogd@rhel7-5 at Jan 23 10:54:41 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      |
      | Message from syslogd@rhel7-5 at Jan 23 10:54:51 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      
      During replacement of a rt6_info we must walk all parent nodes and check
      whether the rt6_info being replaced was propagated to them. If so,
      replace it with a live one.
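
      The parent walk can be illustrated with a much-simplified tree, sketched
      below. struct sketch_node, struct sketch_route and sketch_purge_route()
      are hypothetical; the real fib6 code also has to choose the replacement
      route and adjust reference counts, which this sketch leaves out.

```c
#include <stddef.h>

struct sketch_route {
	int id;
};

/* Hypothetical tree node: `leaf` may hold a route propagated from a child. */
struct sketch_node {
	struct sketch_node *parent;
	struct sketch_route *leaf;
};

/*
 * Walk from `start` up to the root. Wherever the route being replaced
 * (`old`) was propagated, point the node at `live` instead.
 * Returns the number of nodes updated.
 */
static int sketch_purge_route(struct sketch_node *start,
			      const struct sketch_route *old,
			      struct sketch_route *live)
{
	struct sketch_node *n;
	int replaced = 0;

	for (n = start; n; n = n->parent) {
		if (n->leaf == old) {
			n->leaf = live;
			replaced++;
		}
	}
	return replaced;
}
```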
      
      Fixes: 4a287eba ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
      Reported-by: Lubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Tested-by: Lubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e9e16e6
    • David S. Miller's avatar
      Merge branch 'sh_eth' · 22577609
      David S. Miller authored
      Ben Hutchings says:
      
      ====================
      Fixes for sh_eth #3
      
      I'm continuing review and testing of Ethernet support on the R-Car H2
      chip.  This series fixes the last of the more serious issues I've found.
      
      These are not tested on any of the other supported chips.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22577609
    • Ben Hutchings's avatar
      sh_eth: Fix DMA-API usage for RX buffers · 52b9fa36
      Ben Hutchings authored
      - Use the return value of dma_map_single(), rather than calling
        virt_to_page() separately
      - Check for mapping failure
      - Call dma_unmap_single() rather than dma_sync_single_for_cpu()
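
      The three points above follow a common map/check/unmap pattern, sketched
      here in userspace with mock functions. Everything prefixed sketch_ is
      hypothetical: the mocks only stand in for dma_map_single(),
      dma_mapping_error() and dma_unmap_single() and do not reproduce their
      signatures or the sh_eth descriptor layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_dma_addr_t;
#define SKETCH_DMA_ERROR ((sketch_dma_addr_t)~0ULL)

/* Counts live mappings so leaks are visible in the mock. */
static int sketch_active_mappings;

/* Mock mapping call: `fail` simulates an IOMMU or swiotlb failure. */
static sketch_dma_addr_t sketch_map_single(void *buf, size_t len, int fail)
{
	(void)len;
	if (fail)
		return SKETCH_DMA_ERROR;
	sketch_active_mappings++;
	return (sketch_dma_addr_t)(uintptr_t)buf;
}

static int sketch_mapping_error(sketch_dma_addr_t addr)
{
	return addr == SKETCH_DMA_ERROR;
}

static void sketch_unmap_single(sketch_dma_addr_t addr, size_t len)
{
	(void)addr;
	(void)len;
	sketch_active_mappings--;
}

struct sketch_rx_desc {
	sketch_dma_addr_t addr;
};

/* Store the handle returned by the mapping call; bail out on failure. */
static int sketch_rx_fill(struct sketch_rx_desc *d, void *buf, size_t len,
			  int fail)
{
	sketch_dma_addr_t addr = sketch_map_single(buf, len, fail);

	if (sketch_mapping_error(addr))
		return -1;
	d->addr = addr;
	return 0;
}

/* On completion, release the mapping instead of only syncing it. */
static void sketch_rx_complete(struct sketch_rx_desc *d, size_t len)
{
	sketch_unmap_single(d->addr, len);
	d->addr = 0;
}
```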
      Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52b9fa36
    • Ben Hutchings's avatar
      sh_eth: Check for DMA mapping errors on transmit · aa3933b8
      Ben Hutchings authored
      dma_map_single() may fail if an IOMMU or swiotlb is in use, so
      we need to check for this.
      Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa3933b8
    • Ben Hutchings's avatar
      sh_eth: Ensure DMA engines are stopped before freeing buffers · 740c7f31
      Ben Hutchings authored
      Currently we try to clear EDRRR and EDTRR and immediately continue to
      free buffers.  This is unsafe because:
      
      - In general, register writes are not serialised with DMA, so we still
        have to wait for DMA to complete somehow
      - The R8A7790 (R-Car H2) manual states that the TX running flag cannot
        be cleared by writing to EDTRR
      - The same manual states that clearing the RX running flag only stops
        RX DMA at the next packet boundary
      
      I applied this patch to the driver to detect DMA writes to freed
      buffers:
      
      > --- a/drivers/net/ethernet/renesas/sh_eth.c
      > +++ b/drivers/net/ethernet/renesas/sh_eth.c
      > @@ -1098,7 +1098,14 @@ static void sh_eth_ring_free(struct net_device *ndev)
      >  	/* Free Rx skb ringbuffer */
      >  	if (mdp->rx_skbuff) {
      >  		for (i = 0; i < mdp->num_rx_ring; i++)
      > +			memcpy(mdp->rx_skbuff[i]->data,
      > +			       "Hello, world", 12);
      > +		msleep(100);
      > +		for (i = 0; i < mdp->num_rx_ring; i++) {
      > +			WARN_ON(memcmp(mdp->rx_skbuff[i]->data,
      > +				       "Hello, world", 12));
      >  			dev_kfree_skb(mdp->rx_skbuff[i]);
      > +		}
      >  	}
      >  	kfree(mdp->rx_skbuff);
      >  	mdp->rx_skbuff = NULL;
      
      then ran the loop:
      
          while ethtool -G eth0 rx 128 ; ethtool -G eth0 rx 64; do echo -n .; done
      
      and 'ping -f' toward the sh_eth port from another machine.  The
      warning fired several times a minute.
      
      To fix these issues:
      
      - Deactivate all TX descriptors rather than writing to EDTRR
      - As there seems to be no way of telling when RX DMA is stopped,
        perform a soft reset to ensure that both DMA engines are stopped
      - To reduce the possibility of the reset truncating a transmitted
        frame, disable egress and wait a reasonable time to reach a
        packet boundary before resetting
      - Update statistics before resetting
      
      (The 'reasonable time' does not allow for CS/CD in half-duplex
      mode, but half-duplex no longer seems reasonable!)
      Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740c7f31