1. 25 Jul, 2013 10 commits
    • tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet authored
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or to be awakened in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if the number of unsent bytes is below tp->notsent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
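From user space, the per-socket limit described in this commit could be applied roughly as follows. This is a minimal sketch and not part of the patch: the helper name `set_notsent_lowat` is hypothetical, the fallback value 25 is assumed to be the Linux definition of `TCP_NOTSENT_LOWAT` for Python builds that do not export the constant, and the 128 KB figure in the usage note simply mirrors the value used in the tests above.

```python
import socket

# Linux defines TCP_NOTSENT_LOWAT as 25; fall back to that number when the
# running Python build does not export the constant (assumption: Linux host).
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def set_notsent_lowat(sock, limit_bytes):
    """Try to cap the unsent bytes queued on `sock`.

    Returns the value reported back by the kernel, or None when the
    platform or kernel rejects the option.
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, limit_bytes)
        return sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)
    except OSError:
        return None
```

A caller would typically create a TCP socket, call `set_notsent_lowat(sock, 128 * 1024)`, and then rely on poll()/select()/epoll() reporting POLLOUT only once the unsent backlog drops below that limit. Sockets that leave the option at zero fall back to the sysctl value.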
    • net: add sk_stream_is_writeable() helper · 64dc6130
      Eric Dumazet authored
      Several call sites use the following hardcoded condition:
      
      sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
      
      Let's use a helper because TCP_NOTSENT_LOWAT support will change this
      condition for TCP sockets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
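The condition wrapped by the helper can be modelled outside the kernel as below. This is a toy sketch, not the kernel code: the `StreamSock` class and its field names are hypothetical, and the two helper definitions (free send-buffer space, and half of the bytes already queued) are assumptions about how `sk_stream_wspace()` and `sk_stream_min_wspace()` were defined at the time.

```python
from dataclasses import dataclass


@dataclass
class StreamSock:
    """Toy stand-in for the socket fields the condition reads."""
    sndbuf: int        # send buffer limit in bytes
    wmem_queued: int   # bytes currently queued for sending


def stream_wspace(sk):
    # Free space left in the send buffer.
    return sk.sndbuf - sk.wmem_queued


def stream_min_wspace(sk):
    # Minimum free space required: half of what is already queued.
    return sk.wmem_queued >> 1


def stream_is_writeable(sk):
    # The condition that call sites previously open-coded.
    return stream_wspace(sk) >= stream_min_wspace(sk)
```

Keeping the test in one function means a later change, such as an extra check on unsent bytes for TCP, only has to be made in one place.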
    • net: sctp: trivial: add uapi/linux/sctp.h into maintainers · 4d58c025
      Daniel Borkmann authored
      After this file has moved to the uapi section, we also need to update
      this in the maintainers file.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sctp: trivial: update mailing list address · 91705c61
      Daniel Borkmann authored
      The SCTP mailing list address to send patches or questions
      to is linux-sctp@vger.kernel.org and not
      lksctp-developers@lists.sourceforge.net anymore. Therefore,
      update all occurrences.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers: net: cpsw: add support to show hw stats via ethtool · d9718546
      Mugunthan V N authored
      Add support for showing CPSW hardware statistics to the user via ethtool,
      so the user can find out whether the hardware reported any errors or
      the system is overloaded during high data rate transfers.
      Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: Fixed up a error "do not initialise statics to 0 or NULL" in bond_main.c · b07ea07b
      dingtianhong authored
      The error is found by the checkpatch.pl tools.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: add rtnl protection for bonding_store_fail_over_mac · 9402b746
      dingtianhong authored
      We need rtnl protection while reading slave_cnt and updating
      the .fail_over_mac, and it also follows the logic "don't change
      anything slave-related without rtnl". :)
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: bond_sysfs.c checkpatch cleanup · 38c4916a
      dingtianhong authored
      net/bonding/bond_sysfs.c:1302: ERROR: else should follow close brace '}'
      net/bonding/bond_sysfs.c:1314: ERROR: else should follow close brace '}'
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: don't call slave_xxx_netpoll under spinlocks · c4cdef9b
      dingtianhong authored
      slave_xxx_netpoll will call synchronize_rcu_bh(), so the function
      may schedule and sleep; it shouldn't be called under spinlocks.
      
      bond_netpoll_setup() and bond_netpoll_cleanup() are always
      protected by the rtnl lock, so there is no need to take the read lock,
      as the slave list can't be changed outside the rtnl lock.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net: enic: Move ethtool code to a separate file · f13bbc2f
      Neel Patel authored
      This patch moves all enic ethtool hooks from enic_main.c to a new file
      enic_ethtool.c
      Signed-off-by: Neel Patel <neepatel@cisco.com>
      Signed-off-by: Christian Benvenuti <benve@cisco.com>
      Signed-off-by: Nishank Trivedi <nistrive@cisco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 24 Jul, 2013 15 commits
  3. 23 Jul, 2013 11 commits
    • Merge branch 'team' ("add support for peer notifications and igmp rejoins for team") · 45c91490
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      The middle patch adjusts core infrastructure so the bonding code can be
      generalized and reused by team.
      
      v1->v2: using msecs_to_jiffies() as suggested by Eric
      
      Jiri Pirko (3):
        team: add peer notification
        net: convert resend IGMP to notifier event
        team: add support for sending multicast rejoins
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: add support for sending multicast rejoins · 492b200e
      Jiri Pirko authored
      Similar to what is implemented in bonding. The user is able to ask the team
      driver to send IGMP rejoins when a port is enabled or disabled. This uses
      the previously introduced netdev notifier.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: convert resend IGMP to notifier event · 4aa5dee4
      Jiri Pirko authored
      Until now, bond_resend_igmp_join_requests() manually looks for vlans
      attached to the bonding device and for a bridge where the bonding device
      acts as a port. It does not handle other scenarios, like stacked bonds or
      a team device above. Make this more generic and use a netdev notifier to
      propagate the event to upper devices and to actually call
      ip_mc_rejoin_groups().
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: add peer notification · fc423ff0
      Jiri Pirko authored
      When a port is enabled or disabled, allow notifying peers with unsolicited
      NAs or gratuitous ARPs. Disabled by default.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macvlan fdb replace support · ab2cfbb2
      Thomas Richter authored
      Add support for iproute2 command 'bridge fdb replace ...'.
      The rtnetlink callback function ndo_fdb_add will be called
      with the NLM_F_REPLACE flag set.
      Simply return -EOPNOTSUPP.
      
      Resubmitted because net-next was closed last week.
      Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan fdb replace an existing entry · 906dc186
      Thomas Richter authored
      Add support to replace an existing entry found in the
      vxlan fdb database. The entry in question is identified
      by its unicast mac address and the destination information
      is changed. If the entry is not found, it is added in the
      forwarding database. This is similar to changing an entry
      in the neighbour table.
      
      Multicast mac addresses can not be changed with the replace
      option.
      
      This is useful for virtual machine migration when the
      destination of a target virtual machine changes. The replace
      feature can be used instead of delete followed by add.
      
      Resubmitted because net-next was closed last week.
      Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
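The replace semantics described in this commit can be illustrated with a small model. It is only a sketch under stated assumptions: the forwarding database is represented as a plain dictionary keyed by MAC address, and the names `fdb_replace` and `is_multicast` are hypothetical rather than taken from the vxlan driver.

```python
def is_multicast(mac):
    # A MAC address is multicast when the low bit of its first octet is set.
    return int(mac.split(":")[0], 16) & 1 == 1


def fdb_replace(fdb, mac, destination):
    """Replace the destination for `mac`, adding the entry if it is missing.

    Returns "replaced" or "added"; raises ValueError for multicast
    addresses, which cannot be changed with the replace option.
    """
    if is_multicast(mac):
        raise ValueError("multicast entries cannot be replaced")
    outcome = "replaced" if mac in fdb else "added"
    fdb[mac] = destination
    return outcome
```

For a migrating virtual machine, a single call such as `fdb_replace(fdb, "52:54:00:12:34:56", "10.0.0.2")` stands in for the delete-then-add sequence mentioned above.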
    • Merge branch 'tcp' · 20ff44aa
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      This patch series improves RTT sampling in three ways:
      1. Sample RTT during fast recovery and reordering events.
      2. Favor ack-based RTT over timestamps because of broken TS ECR fields.
      3. Consolidate the RTT measurement logic.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use RTT from SACK for RTO · ed08495c
      Yuchung Cheng authored
      If RTT is not available because Karn's check has failed or no
      new packet is acked, use the RTT measured from SACK to estimate
      the RTO. The sender can continue to estimate the RTO during loss
      recovery or reordering events upon receiving non-partial ACKs.
      
      This also changes when the RTO is re-armed. Previously it was
      only re-armed when some data was cumulatively acknowledged (i.e.,
      SND.UNA advances), but now it is re-armed whenever the RTT estimator
      is updated. This feature is particularly useful for reducing spurious
      timeouts caused by buffer bloat, including on cellular carriers [1],
      and for RTT estimation on reordering events.
      
      [1] "An In-depth Study of LTE: Effect of Network Protocol and
       Application Behavior on Performance", In Proc. of SIGCOMM 2013
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
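To show how each new RTT sample feeds the retransmission timeout, here is a floating-point sketch of the standard RFC 6298 estimator. It is an illustration only, not the kernel's fixed-point implementation: the class name `RtoEstimator`, the default minimum RTO of one second (the RFC's recommendation), and the clock granularity parameter are assumptions made for the example.

```python
class RtoEstimator:
    """RFC 6298 style smoothed RTT and RTO computation (times in seconds)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, min_rto=1.0, granularity=0.001):
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto
        self.granularity = granularity

    def update(self, rtt):
        """Fold one RTT sample into the estimator and return the new RTO."""
        if self.srtt is None:
            # First measurement seeds both state variables.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # The variance is updated before the smoothed RTT, as in the RFC.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto()

    def rto(self):
        return max(self.min_rto,
                   self.srtt + max(self.granularity, self.K * self.rttvar))
```

With this commit, a SACK-derived sample would be passed to something like `update()` even when SND.UNA does not advance, and the retransmission timer would be re-armed with the returned value.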
    • tcp: measure RTT from new SACK · 59c9af42
      Yuchung Cheng authored
      Take an RTT sample if an ACK selectively acks some sequences that
      have never been retransmitted. Karn's algorithm does not apply
      even if that ACK (s)acks other retransmitted sequences, because it
      must have been generated by an original but perhaps out-of-order packet.
      There is no ambiguity. When multiple blocks are newly
      sacked because of ACK losses, the earliest block is used to
      measure RTT, similar to cumulative ACKs.
      
      Such RTT samples allow the sender to estimate the RTO during loss
      recovery and packet reordering events. It is still useful even with
      TCP timestamps. That's because during these events the SND.UNA may
      not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED
      check before calling tcp_ack_update_rtt()).  Therefore this new
      RTT source is complementary to existing ACK and TS RTT mechanisms.
      
      This patch does not update the RTO. It is done in the next patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
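The sampling rule in this commit can be sketched as a small function. This is a simplified model rather than the kernel code: the function name `sack_rtt_sample` and the representation of newly SACKed segments as `(sent_time, was_retransmitted)` pairs are assumptions for illustration, and "earliest block" is interpreted here as the earliest transmission time.

```python
def sack_rtt_sample(now, newly_sacked):
    """Return an RTT sample from newly SACKed segments, or None.

    `newly_sacked` is an iterable of (sent_time, was_retransmitted) pairs.
    Segments that were retransmitted are skipped, since it is ambiguous
    which transmission the ACK answers.
    """
    candidates = [sent for sent, retransmitted in newly_sacked
                  if not retransmitted]
    if not candidates:
        return None
    # Use the earliest never-retransmitted segment.
    return now - min(candidates)
```

For example, `sack_rtt_sample(10.0, [(9.5, False), (9.0, True)])` ignores the retransmitted segment and measures from the one sent at 9.5.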
    • tcp: prefer packet timing to TS-ECR for RTT · 5b08e47c
      Yuchung Cheng authored
      Prefer packet timings to TS ECR for RTT measurements when both
      sources are available. That's because broken middle-boxes and remote
      peers can return packets with corrupted TS ECR fields. Similarly, most
      congestion controls that require RTT signals favor timing-based
      sources as well. Also check for bad TS ECR values to avoid RTT
      blow-ups. This has happened on production Web servers.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
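The source-selection order described in this commit might look like the following. It is a hedged sketch, not the kernel logic: the function name `pick_rtt` is hypothetical, times are plain integers in arbitrary clock ticks, and a "bad" TS ECR value is modelled simply as a missing (zero) echo or one that would yield a negative RTT, which may differ from the actual check.

```python
def pick_rtt(seq_rtt, now, tsecr):
    """Choose an RTT sample, preferring packet timing over the TS echo.

    seq_rtt: RTT from packet send/ACK timing, or None when unavailable.
    tsecr:   echoed timestamp from the peer, or 0/None when absent.
    Returns the chosen RTT, or None if no trustworthy source exists.
    """
    if seq_rtt is not None and seq_rtt >= 0:
        return seq_rtt
    if tsecr:
        ts_rtt = now - tsecr
        # Reject echoes that imply a negative RTT (assumed sanity check).
        if ts_rtt >= 0:
            return ts_rtt
    return None
```

The timestamp echo is consulted only as a fallback, so a corrupted TS ECR field cannot override a valid timing-based sample.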
    • tcp: consolidate SYNACK RTT sampling · 375fe02c
      Yuchung Cheng authored
      The first patch consolidates SYNACK and other RTT measurements to use a
      central function tcp_ack_update_rtt(). A (small) bonus is that SYNACK
      RTT measurement now happens after the PAWS check, potentially reducing the
      impact of RTO seeding on bad TCP timestamp values.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 Jul, 2013 4 commits