1. 13 Jan, 2014 9 commits
    • net: gro: change GRO overflow strategy · 600adc18
      Eric Dumazet authored
      The GRO layer has a limit of 8 flows being held in the GRO list,
      for performance reasons.
      
      When a packet arrives for a flow not yet in the list,
      and the list is full, we immediately pass it to the upper
      stack, which lowers aggregation performance.
      
      With TSO auto sizing and FQ packet scheduler, this situation
      happens more often.
      
      This patch changes strategy to simply evict the oldest flow of
      the list. This works better because of the nature of packet
      trains for which GRO is efficient. This also has the effect
      of lowering the GRO latency if many flows are competing.
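
The eviction strategy described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct flow_list`, `flow_list_hold()` and the use of plain integer flow ids are hypothetical and are not the kernel's data structures. Only the limit of 8 flows and the "evict the oldest flow" rule come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GRO_FLOWS 8  /* limit stated in the commit message */

/* Hypothetical flow table: flow_id[0] is always the oldest held flow. */
struct flow_list {
    uint32_t flow_id[MAX_GRO_FLOWS];
    size_t count;
};

/*
 * Hold flow `id` in the list. Returns the id of the flow that had to be
 * evicted (and would be flushed to the upper stack), or 0 if none was.
 * Flow id 0 is reserved to mean "none" in this sketch.
 */
static uint32_t flow_list_hold(struct flow_list *fl, uint32_t id)
{
    uint32_t evicted = 0;
    size_t i;

    assert(fl != NULL && id != 0);

    /* Flow already held: the packet would be merged, nothing to evict. */
    for (i = 0; i < fl->count; i++)
        if (fl->flow_id[i] == id)
            return 0;

    if (fl->count == MAX_GRO_FLOWS) {
        /* List full: evict the oldest flow instead of bypassing GRO. */
        evicted = fl->flow_id[0];
        for (i = 1; i < fl->count; i++)
            fl->flow_id[i - 1] = fl->flow_id[i];
        fl->count--;
    }

    /* The new flow becomes the youngest entry. */
    fl->flow_id[fl->count++] = id;
    return evicted;
}
```

With eight flows already held, a ninth flow pushes out the first one, so a new packet train can still be aggregated.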
      
      Tested:
      
      Used a 40Gbps NIC, with 4 RX queues, and 200 concurrent TCP_STREAM
      netperf tests.
      
      Before patch, aggregate rate is 11Gbps (while a single flow can reach
      30Gbps)
      
      After patch, line rate is reached.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: call gro handler for encapsulated frames · e6a76758
      Eric Dumazet authored
      In order to use the native GRO handling of encapsulated protocols on
      mlx4, we need to call napi_gro_receive() instead of netif_receive_skb()
      unless busy polling is in action.
      
      While we are at it, rename mlx4_en_cq_ll_polling() to
      mlx4_en_cq_busy_polling()
      
      Tested with GRE tunnel : GRO aggregation is now performed on the
      ethernet device instead of being done later on gre device.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Amir Vadai <amirv@mellanox.com>
      Cc: Jerry Chu <hkchu@google.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Acked-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre_offload: fix sparse non static symbol warning · d10dbad2
      Wei Yongjun authored
      Fixes the following sparse warning:
      
      net/ipv4/gre_offload.c:253:5: warning:
       symbol 'gre_gro_complete' was not declared. Should it be static?
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ip_forward_pmtu' · c139cd3b
      David S. Miller authored
      Hannes Frederic Sowa says:
      
      ====================
      path mtu hardening patches
      
      After a lot of back and forth I want to propose these changes regarding
      path mtu hardening and outline why I think this is the best way to
      proceed:
      
      This set contains the following patches:
      * ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
      * ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it
      * ipv4: introduce hardened ip_no_pmtu_disc mode
      
      The first one switches the forwarding path of IPv4 to use the interface
      mtu by default and ignore a possible discovered path mtu. It provides
      a sysctl to switch back to the original behavior (see discussion below).
      
      The second patch does the same thing unconditionally for IPv6. I don't
      provide a knob for IPv6 to switch to original behavior (please see
      below).
      
      The third patch introduces a hardened pmtu mode, in which pmtu
      information is accepted only for protocols that can do more stringent
      checks on the icmp piggyback payload (please see the patch commit msg
      for further details).
      
      Why is this change necessary?
      
      First of all, RFC 1191 4. Router specification says:
      "When a router is unable to forward a datagram because it exceeds the
       MTU of the next-hop network and its Don't Fragment bit is set, the
       router is required to return an ICMP Destination Unreachable message
       to the source of the datagram, with the Code indicating
       "fragmentation needed and DF set". ..."
      
      For some time now fragmentation has been considered problematic, e.g.:
      * http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-3.pdf
      * http://tools.ietf.org/search/rfc4963
      
      Most of them seem to agree that fragmentation should be avoided because
      of efficiency, data corruption or security concerns.
      
      Recently it was shown possible that correctly guessing IP ids could lead
      to data injection on DNS packets:
      <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
      
      While we can try to completely stop fragmentation on the end host
      (this is e.g. implemented via IP_PMTUDISC_INTERFACE), we cannot stop
      fragmentation completely on the forwarding path. On the end host the
      application has to deal with MTUs and has to choose fallback methods
      if fragmentation could be an attack vector. This is already the case for
      most DNS software, where a maximum UDP packet size can be configured. But
      until recently they had no control over local fragmentation and could
      thus emit fragmented packets.
      
      On the forwarding path we can only try to delay fragmentation to
      the last hop where it is really necessary. The current kernel already does
      that, but only because routers don't receive feedback about path mtus; these are
      only sent back to the end host system. But it is possible to maliciously
      insert path mtu information via ICMP packets which have an icmp echo_reply
      payload, because we cannot validate those notifications against local
      sockets. DHCP clients which establish an any-bound RAW-socket could also
      start processing unwanted fragmentation-needed packets.
      
      Why does IPv4 have a knob to revert to the old behavior while IPv6 doesn't?
      
      IPv4 does fragmentation on the path while IPv6 always responds with
      packet-too-big errors. The interface MTU will always be greater than
      the path MTU information. So we would discard packets we could actually
      forward because of malicious information. After this change we would
      let the hop, which really could not forward the packet, notify the host
      of this problem.
      
      IPv4 allows fragmentation mid-path. Someone might use software
      which tries to discover such paths and assumes that the kernel handles
      the discovered pmtu information automatically. This should be an extremely
      rare case, but because I could not exclude the possibility, this knob is
      provided. Such software could also insert non-locked mtu information
      into the kernel. We currently cannot distinguish that from path mtu
      information. Premature fragmentation could solve some problems in wrongly
      configured networks, thus this switch is provided.
      
      One frag-needed packet could reduce the path mtu down to 552 bytes
      (route/min_pmtu).
      
      Misc:
      
      IPv6 neighbor discovery could advertise mtu information for an
      interface. This information updates the ipv6-specific interface mtu and
      thus gets used by the forwarding path.
      
      Tunnel and xfrm output path will still honour path mtu and also respond
      with Packet-too-Big or fragmentation-needed errors if needed.
      
      Changelog for all patches:
      v2)
      * enabled ip_forward_use_pmtu by default
      * reworded
      v3)
      * disabled ip_forward_use_pmtu by default
      * reworded
      v4)
      * renamed ip_dst_mtu_secure to ip_dst_mtu_maybe_forward
      * updated changelog accordingly
      * removed unneeded !!(... & ...) double negations
      
      v2)
      * by default we honour pmtu information
      v3)
      * only honor interface mtu
      * rewritten and simplified
      * no knob to fall back to old mode any more
      
      v2)
      * reworded Documentation
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Hannes Frederic Sowa authored
      This new ip_no_pmtu_disc mode only allows fragmentation-needed errors
      to be honored by protocols which do more stringent validation on the
      ICMP packet's payload. This knob is useful for people who e.g. want to
      run an unmodified DNS server in a namespace where they need to use pmtu
      for TCP connections (as they are used for zone transfers or fallback
      for requests) but don't want to use possibly spoofed UDP pmtu information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it · 0954cf9c
      Hannes Frederic Sowa authored
      In the IPv6 forwarding path we are only concerned about the outgoing
      interface MTU, but also respect locked MTUs on routes. Tunnel provider
      or IPSEC already have to recheck and if needed send PtB notifications
      to the sending host in case the data does not fit into the packet with
      added headers (we only know the final header sizes there, while also
      using path MTU information).
      
      The reason for this change is that path MTU information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv6 forwarding path to
      wrongfully emit Packet-too-Big errors and drop IPv6 packets.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa authored
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written software which probes
      destinations manually and expects the kernel to honour those path mtus,
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is that path mtu information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • HHF qdisc: fix jiffies-time conversion. · 6c76a07a
      Terry Lam authored
      This is to be compatible with the use of "get_time" (i.e. default
      time unit in us) in iproute2 patch for HHF as requested by Stephen.
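
As a rough illustration of what a conversion between microseconds and timer ticks involves, here is a userspace sketch. It does not reproduce the patch: `TICKS_PER_SEC`, `usecs_to_ticks()` and `ticks_to_usecs()` are hypothetical names, and the tick rate of 1000 per second is an assumption made only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_PER_SEC 1000u     /* hypothetical tick rate, assumed for this sketch */
#define USEC_PER_SEC  1000000u

/* Convert microseconds to ticks, rounding up so short timeouts never become 0. */
static uint64_t usecs_to_ticks(uint64_t usecs)
{
    /* Guard against overflow in the multiplication below. */
    assert(usecs <= (UINT64_MAX - USEC_PER_SEC) / TICKS_PER_SEC);

    return (usecs * TICKS_PER_SEC + USEC_PER_SEC - 1) / USEC_PER_SEC;
}

/* Convert ticks back to microseconds (exact when the tick rate divides 10^6). */
static uint64_t ticks_to_usecs(uint64_t ticks)
{
    return ticks * (USEC_PER_SEC / TICKS_PER_SEC);
}
```

Under these assumptions a 40000 us interval maps to 40 ticks, and 1500 us rounds up to 2 ticks.
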
      Signed-off-by: Terry Lam <vtlam@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qlcnic: Convert vmalloc/memset to kcalloc · f3c0773f
      Joe Perches authored
      vmalloc is a limited resource.  Don't use it unnecessarily.
      
      It seems this allocation should work with kcalloc.
      
      Remove the unnecessary memset(,0,) of buf: it is completely
      overwritten, because the only previously unset field in
      struct qlcnic_pci_func_cfg is now set to 0.
      
      Use kfree instead of vfree.
      Use ETH_ALEN instead of 6.
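
A userspace analogue of the pattern, for illustration only: `calloc()` plays the role of kcalloc here, returning zeroed memory for an array so that no separate memset is needed. `struct func_cfg` and `alloc_func_cfgs()` are hypothetical and do not mirror the driver's `struct qlcnic_pci_func_cfg`.

```c
#include <assert.h>
#include <stdlib.h>

#define ETH_ALEN 6  /* length of an Ethernet MAC address in bytes */

/* Hypothetical per-function record; not the driver's real structure. */
struct func_cfg {
    unsigned char mac[ETH_ALEN];  /* named constant instead of a bare 6 */
    unsigned int flags;
};

/*
 * Allocate a zero-initialised array of n records in one call.
 * calloc() zeroes the memory and checks n * size for overflow,
 * which replaces a separate allocate-then-memset sequence.
 */
static struct func_cfg *alloc_func_cfgs(size_t n)
{
    assert(n > 0);

    return calloc(n, sizeof(struct func_cfg));
}
```

The caller releases the array with `free()`, matching the commit's switch from vfree to kfree on the kernel side.
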
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 12 Jan, 2014 10 commits
  3. 11 Jan, 2014 17 commits
  4. 10 Jan, 2014 4 commits
    • Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge · 45593c2b
      David S. Miller authored
      Included changes:
      - substitute FSF address with URL
      - deselect current bat-GW when GW-client mode gets deactivated
      - send every DHCP packet using bat-unicast messages when GW-client mode is
        enabled
      - implement the Extended Isolation mechanism (it is an enhancement of the
        already existing batman-AP-isolation). This mechanism allows the user to drop
        packets exchanged by selected clients by using netfilter marks.
      - fix typ0 in header guard
      - minor code cleanups
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp_metrics_saddr' · 795709af
      David S. Miller authored
      Christoph Paasch says:
      
      ====================
      Make tcp-metrics source-address aware
      
      Currently tcp-metrics are stored per destination address only. This causes
      problems when a host has multiple interfaces (e.g., a smartphone having
      WiFi/3G):
      
      For example, a host contacting a server over WiFi will store the tcp-metrics
      per destination IP. If the host then contacts the same server over 3G, the
      same tcp-metrics will be used, although the path characteristics are completely
      different (e.g., the ssthresh is probably not the same).
      
      In case of TFO this is not a problem, as the server will provide us a new cookie
      once it has seen our SYN+DATA with an incorrect cookie.
      It may be (in case of carrier-grade NAT), that we keep the same public IP but
      have a different private IP. Thus, we better reuse the old cookie even if our
      source-IP has changed. However, this scenario is probably very uncommon, as
      carriers try to provide the same src-IP to the clients behind their CGN.
      
      Patches 1 + 2 add the source-IP to the tcp metrics.
      
      Patches 3 to 5 modify the netlink-api to support the source-IP. From now on,
      when using the command "ip tcp_metrics delete address ADDRESS" all entries
      which match this destination IP will be deleted.
      
      Today's iproute2 will complain when doing "ip tcp_metrics flush PREFIX" if
      several entries are present for the same destination-IP but with different
      source-IPs:
      
      root@client:~/test# ip tcp_metrics
      10.2.1.2 age 3.640sec rtt 16250us rttvar 15000us cwnd 10
      10.2.1.2 age 4.030sec rtt 18750us rttvar 15000us cwnd 10
      root@client:~/test# ip tcp_metrics flush 10.2.1.2/16
      Failed to send flush request
      : No such process
      
      Follow-up patches will modify iproute2 to handle this correctly and allow
      specifying the source-IP in the get/del commands.
      
      v2: Added the patch that allows selective get/del of tcp-metrics based
          on src-IP and moved the patch that adds the new netlink attribute before
          the other patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: metrics: Allow selective get/del of tcp-metrics based on src IP · 3e7013dd
      Christoph Paasch authored
      We want to be able to get/del tcp-metrics based on the src IP. This
      patch adds the necessary parsing of the netlink attribute and if the
      source address is set, it will match on this one too.
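
The matching rule can be sketched in userspace C: the destination must always match, and the source is compared only when the caller supplies one. The table layout, `struct metrics_entry`, `entry_matches()` and `metrics_delete()` are hypothetical and IPv4-only; they do not reflect the kernel's tcp-metrics hash table or its netlink parsing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metrics entry keyed by (source, destination) IPv4 address. */
struct metrics_entry {
    uint32_t saddr;
    uint32_t daddr;
    bool in_use;
};

/* Match on destination; match on source too only if one was supplied. */
static bool entry_matches(const struct metrics_entry *e,
                          uint32_t daddr, const uint32_t *saddr)
{
    if (!e->in_use || e->daddr != daddr)
        return false;

    return saddr == NULL || e->saddr == *saddr;
}

/* Delete every matching entry and return how many were removed. */
static size_t metrics_delete(struct metrics_entry *tbl, size_t n,
                             uint32_t daddr, const uint32_t *saddr)
{
    size_t i, deleted = 0;

    assert(tbl != NULL);

    for (i = 0; i < n; i++) {
        if (entry_matches(&tbl[i], daddr, saddr)) {
            tbl[i].in_use = false;
            deleted++;
        }
    }
    return deleted;
}
```

Passing `NULL` as the source removes all entries for a destination, while passing a source address removes only the entry for that one path.
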
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: metrics: Delete all entries matching a certain destination · bbf852b9
      Christoph Paasch authored
      As we now can have multiple entries per destination-IP, the "ip
      tcp_metrics delete address ADDRESS" command deletes all of them.
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>