1. 07 Jan, 2014 10 commits
    • Jerry Chu's avatar
      net-gre-gro: Add GRE support to the GRO stack · bf5a755f
      Jerry Chu authored
      This patch built on top of Commit 299603e8
      ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
      the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
      stack. It also serves as an example for supporting other encapsulation
      protocols in the GRO stack in the future.
      
      The patch supports version 0 and all the flags (key, csum, seq#) but
      will flush any pkt with the S (seq#) flag. This is because the S flag
      is not support by GSO, and a GRO pkt may end up in the forwarding path,
      thus requiring GSO support to break it up correctly.
      
      Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
      ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
      IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
      be easily added, so is the support for GRE variations like NVGRE.
      
      The patch also support csum offload. Specifically if the csum flag is on
      and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
      the code will take advantage of the csum computed by the h/w when
      validating the GRE csum.
      
      Note that commit 60769a5d "ipv4: gre:
      add GRO capability" already introduces GRO capability to IPv4 GRE
      tunnels, using the gro_cells infrastructure. But GRO is done after
      GRE hdr has been removed (i.e., decapped). The following patch applies
      GRO when pkts first come in (before hitting the GRE tunnel code). There
      is some performance advantage for applying GRO as early as possible.
      Also this approach is transparent to other subsystem like Open vSwitch
      where GRE decap is handled outside of the IP stack hence making it
      harder for the gro_cells stuff to apply. On the other hand, some NICs
      are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
      In that case the GRO processing of pkts from the same remote host will
      all happen on the same CPU and the performance may be suboptimal.
      
      I'm including some rough preliminary performance numbers below. Note
      that the performance will be highly dependent on traffic load, mix as
      usual. Moreover it also depends on NIC offload features hence the
      following is by no means a comprehesive study. Local testing and tuning
      will be needed to decide the best setting.
      
      All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
      (super_netperf 50 -H 192.168.1.18 -l 30)
      
      An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
      mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
      is configured.
      
      The GRO support for pkts AFTER decap are controlled through the device
      feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
      
      1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
      thruput: 9.16Gbps
      CPU utilization: 19%
      
      1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
      thruput: 5.9Gbps
      CPU utilization: 15%
      
      1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 12-13%
      
      1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 10%
      
      The following tests were performed on a different NIC that is capable of
      csum offload. I.e., the h/w is capable of computing IP payload csum
      (CHECKSUM_COMPLETE).
      
      2.1 ethtool -K gre1 gro on (hence will use gro_cells)
      
      2.1.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 8.53Gbps
      CPU utilization: 9%
      
      2.1.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 8.97Gbps
      CPU utilization: 7-8%
      
      2.1.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 8.83Gbps
      CPU utilization: 5-6%
      
      2.1.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.98Gbps
      CPU utilization: 5%
      
      2.2 ethtool -K gre1 gro off
      
      2.2.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 5.93Gbps
      CPU utilization: 9%
      
      2.2.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 5.62Gbps
      CPU utilization: 8%
      
      2.2.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 7.69Gbps
      CPU utilization: 8%
      
      2.2.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.96Gbps
      CPU utilization: 5-6%
      Signed-off-by: default avatarH.K. Jerry Chu <hkchu@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf5a755f
    • Benjamin Poirier's avatar
      net: Do not enable tx-nocache-copy by default · cdb3f4a3
      Benjamin Poirier authored
      There are many cases where this feature does not improve performance or even
      reduces it.
      
      For example, here are the results from tests that I've run using 3.12.6 on one
      Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
      from the Xeon, but they're similar on the i7. All numbers report the
      mean±stddev over 10 runs of 10s.
      
      1) latency tests similar to what is described in "c6e1a0d1 net: Allow no-cache
      copy from user on transmit"
      There is no statistically significant difference between tx-nocache-copy
      on/off.
      nic irqs spread out (one queue per cpu)
      
      200x netperf -r 1400,1
      tx-nocache-copy off
              692000±1000 tps
              50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
      tx-nocache-copy on
              693000±1000 tps
              50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
      
      200x netperf -r 14000,14000
      tx-nocache-copy off
              86450±80 tps
              50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
      tx-nocache-copy on
              86110±60 tps
              50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
      
      2) single stream throughput tests
      tx-nocache-copy leads to higher service demand
      
                              throughput  cpu0        cpu1        demand
                              (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)
      
      nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
      
      tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
      tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004
      
      nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
      
      tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
      tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002
      
      As a second example, here are some results from Eric Dumazet with latest
      net-next.
      tx-nocache-copy also leads to higher service demand
      
      (cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             4282.648396 task-clock                #    0.423 CPUs utilized
                   9,348 context-switches          #    0.002 M/sec
                      88 CPU-migrations            #    0.021 K/sec
                     355 page-faults               #    0.083 K/sec
          11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
           9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
           4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
           6,053,172,792 instructions              #    0.51  insns per cycle
                                                   #    1.49  stalled cycles per insn [83.64%]
             597,275,583 branches                  #  139.464 M/sec                   [83.70%]
               8,960,541 branch-misses             #    1.50% of all branches         [83.65%]
      
            10.128990264 seconds time elapsed
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             2847.375441 task-clock                #    0.281 CPUs utilized
                  11,632 context-switches          #    0.004 M/sec
                      49 CPU-migrations            #    0.017 K/sec
                     354 page-faults               #    0.124 K/sec
           7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
           6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
           1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
           2,079,702,453 instructions              #    0.27  insns per cycle
                                                   #    2.94  stalled cycles per insn [83.22%]
             363,773,213 branches                  #  127.757 M/sec                   [83.29%]
               4,242,732 branch-misses             #    1.17% of all branches         [83.51%]
      
            10.128449949 seconds time elapsed
      
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cdb3f4a3
    • Jiri Pirko's avatar
      ipv4: loopback device: ignore value changes after device is upped · dfd1582d
      Jiri Pirko authored
      When lo is brought up, new ifa is created. Then, devconf and neigh values
      bitfield should be set so later changes of default values would not
      affect lo values.
      
      Note that the same behaviour is in ipv6. Also note that this is likely
      not an issue in many distros (for example Fedora 19) because userspace
      sets address to lo manually before bringing it up.
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dfd1582d
    • FX Le Bail's avatar
      IPv6: add the option to use anycast addresses as source addresses in echo reply · 509aba3b
      FX Le Bail authored
      This change allows to follow a recommandation of RFC4942.
      
      - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
        as source addresses for ICMPv6 echo reply. This sysctl is false by default
        to preserve existing behavior.
      - Add inline check ipv6_anycast_destination().
      - Use them in icmpv6_echo_reply().
      
      Reference:
      RFC4942 - IPv6 Transition/Coexistence Security Considerations
         (http://tools.ietf.org/html/rfc4942#section-2.1.6)
      
      2.1.6. Anycast Traffic Identification and Security
      
         [...]
         To avoid exposing knowledge about the internal structure of the
         network, it is recommended that anycast servers now take advantage of
         the ability to return responses with the anycast address as the
         source address if possible.
      Signed-off-by: default avatarFrancois-Xavier Le Bail <fx.lebail@yahoo.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      509aba3b
    • Wei Yongjun's avatar
      net/mlx4_en: fix error return code in mlx4_en_get_qp() · 9ba75fb0
      Wei Yongjun authored
      Fix to return a negative error code from the error handling
      case instead of 0.
      
      Fixes: 837052d0 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling')
      Signed-off-by: default avatarWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9ba75fb0
    • Hayes Wang's avatar
      r8152: correct some messages · 4a8deae2
      Hayes Wang authored
       - Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
       - Adjust the algnment of some messages.
       - Remove the peroid.
       - Fix some messages don't have terminating newline.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a8deae2
    • David S. Miller's avatar
      bna: Fix build due to missing use of dma_unmap_len_set() · ecca6a96
      David S. Miller authored
      > as reported for linux-next of Dec.20, 2013
      > when CONFIG_NEED_DMA_MAP_STATE is not enabled:
      >
      > drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
      > drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'
      Reported-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ecca6a96
    • Eric Dumazet's avatar
      gre_offload: statically build GRE offloading support · 438e38fa
      Eric Dumazet authored
      GRO/GSO layers can be enabled on a node, even if said
      node is only forwarding packets.
      
      This patch permits GSO (and upcoming GRO) support for GRE
      encapsulated packets, even if the host has no GRE tunnel setup.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      438e38fa
    • Hauke Mehrtens's avatar
      bgmac: fix typos · 0e595934
      Hauke Mehrtens authored
      This fixes some typos found by Sergei.
      Reported-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: default avatarHauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0e595934
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch · 39b6b299
      David S. Miller authored
      Jesse Gross says:
      
      ====================
      [GIT net-next] Open vSwitch
      
      Open vSwitch changes for net-next/3.14. Highlights are:
       * Performance improvements in the mechanism to get packets to userspace
         using memory mapped netlink and skb zero copy where appropriate.
       * Per-cpu flow stats in situations where flows are likely to be shared
         across CPUs. Standard flow stats are used in other situations to save
         memory and allocation time.
       * A handful of code cleanups and rationalization.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      39b6b299
  2. 06 Jan, 2014 30 commits