1. 25 Aug, 2015 4 commits
    • santosh.shilimkar@oracle.com's avatar
      RDS: restore return value in rds_cmsg_rdma_args() · 1d2e3f39
      santosh.shilimkar@oracle.com authored
      In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages() which returns
      number of pinned pages on success. And the same value is returned to the
      caller of rds_cmsg_rdma_args() on success which is not intended.
      
      Commit f4a3fc03 ("RDS: Clean up error handling in rds_cmsg_rdma_args")
      removed the 'ret = 0' line which broke RDS RDMA mode.
      
      Fix it by restoring the return value on rds_pin_pages() success
      keeping the clean-up in place.
      Signed-off-by: default avatarSantosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1d2e3f39
    • Eric Dumazet's avatar
      tcp: refine pacing rate determination · 43e122b0
      Eric Dumazet authored
      When TCP pacing was added back in linux-3.12, we chose
      to apply a fixed ratio of 200 % against current rate,
      to allow probing for optimal throughput even during
      slow start phase, where cwnd can be doubled every other gRTT.
      
      At Google, we found it was better applying a different ratio
      while in Congestion Avoidance phase.
      This ratio was set to 120 %.
      
      We've used the normal tcp_in_slow_start() helper for a while,
      then tuned the condition to select the conservative ratio
      as soon as cwnd >= ssthresh/2 :
      
      - After cwnd reduction, it is safer to ramp up more slowly,
        as we approach optimal cwnd.
      - Initial ramp up (ssthresh == INFINITY) still allows doubling
        cwnd every other RTT.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      43e122b0
    • David Ahern's avatar
      xfrm: Use VRF master index if output device is enslaved · 4ec3b28c
      David Ahern authored
      Directs route lookups to VRF table. Compiles out if NET_VRF is not
      enabled. With this patch able to successfully bring up ipsec tunnels
      in VRFs, even with duplicate network configuration.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ec3b28c
    • Eric Dumazet's avatar
      tcp: fix slow start after idle vs TSO/GSO · 6f021c62
      Eric Dumazet authored
      slow start after idle might reduce cwnd, but we perform this
      after first packet was cooked and sent.
      
      With TSO/GSO, it means that we might send a full TSO packet
      even if cwnd should have been reduced to IW10.
      
      Moving the SSAI check in skb_entail() makes sense, because
      we slightly reduce number of times this check is done,
      especially for large send() and TCP Small queue callbacks from
      softirq context.
      
      As Neal pointed out, we also need to perform the check
      if/when receive window opens.
      
      Tested:
      
      Following packetdrill test demonstrates the problem
      // Test of slow start after idle
      
      `sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.100 < . 1:1(0) ack 1 win 511
      +0    accept(3, ..., ...) = 4
      +0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
      
      +0    write(4, ..., 26000) = 26000
      +0    > . 1:5001(5000) ack 1
      +0    > . 5001:10001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      +.100 < . 1:1(0) ack 10001 win 511
      +0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
      +0    > . 10001:20001(10000) ack 1
      +0    > P. 20001:26001(6000) ack 1
      
      +.100 < . 1:1(0) ack 26001 win 511
      +0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
      
      +4 write(4, ..., 20000) = 20000
      // If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
      +0    > . 26001:31001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      +0    > . 31001:36001(5000) ack 1
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6f021c62
  2. 24 Aug, 2015 29 commits
  3. 23 Aug, 2015 7 commits
    • Jiri Benc's avatar
      route: fix breakage after moving lwtunnel state · 751a587a
      Jiri Benc authored
      __recnt and related fields need to be in its own cacheline for performance
      reasons. Commit 61adedf3 ("route: move lwtunnel state to dst_entry")
      broke that on 32bit archs, causing BUILD_BUG_ON in dst_hold to be triggered.
      
      This patch fixes the breakage by moving the lwtunnel state to the end of
      dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline
      with __refcnt and may affect performance, thus further patches may be
      needed.
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      751a587a
    • David S. Miller's avatar
      Merge tag 'linux-can-next-for-4.3-20150820' of... · 31fbde99
      David S. Miller authored
      Merge tag 'linux-can-next-for-4.3-20150820' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      this is a pull request of a two patches for net-next.
      
      The first patch is by Nik Nyby and fixes a typo in a function name. The
      second patch by Lucas Stach demotes register output to debug level.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      31fbde99
    • David S. Miller's avatar
      Merge branch 'tipc-failover-fixes' · c5f98b56
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: fix link failover/synch problems
      
      We fix three problems with the new link failover/synch implementation,
      which was introduced earlier in this release cycle. They are all related
      to situations where there is a very short interval between the disabling
      and enabling of interfaces.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5f98b56
    • Jon Paul Maloy's avatar
      tipc: fix stale link problem during synchronization · 2be80c2d
      Jon Paul Maloy authored
      Recent changes to the link synchronization means that we can now just
      drop packets arriving on the synchronizing link before the synch point
      is reached. This has lead to significant simplifications to the
      implementation, but also turns out to have a flip side that we need
      to consider.
      
      Under unlucky circumstances, the two endpoints may end up
      repeatedly dropping each other's packets, while immediately
      asking for retransmission of the same packets, just to drop
      them once more. This pattern will eventually be broken when
      the synch point is reached on the other link, but before that,
      the endpoints may have arrived at the retransmission limit
      (stale counter) that indicates that the link should be broken.
      We see this happen at rare occasions.
      
      The fix for this is to not ask for retransmissions when a link is in
      state LINK_SYNCHING. The fact that the link has reached this state
      means that it has already received the first SYNCH packet, and that it
      knows the synch point. Hence, it doesn't need any more packets until the
      other link has reached the synch point, whereafter it can go ahead and
      ask for the missing packets.
      
      However, because of the reduced traffic on the synching link that
      follows this change, it may now take longer to discover that the
      synch point has been reached. We compensate for this by letting all
      packets, on any of the links, trig a check for synchronization
      termination. This is possible because the packets themselves don't
      contain any information that is needed for discovering this condition.
      Reviewed-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2be80c2d
    • Jon Paul Maloy's avatar
      tipc: interrupt link synchronization when a link goes down · 5ae2f8e6
      Jon Paul Maloy authored
      When we introduced the new link failover/synch mechanism
      in commit 6e498158
      ("tipc: move link synch and failover to link aggregation level"),
      we missed the case when the non-tunnel link goes down during the link
      synchronization period. In this case the tunnel link will remain in
      state LINK_SYNCHING, something leading to unpredictable behavior when
      the failover procedure is initiated.
      
      In this commit, we ensure that the node and remaining link goes
      back to regular communication state (SELF_UP_PEER_UP/LINK_ESTABLISHED)
      when one of the parallel links goes down. We also ensure that we don't
      re-enter synch mode if subsequent SYNCH packets arrive on the remaining
      link.
      Reviewed-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5ae2f8e6
    • Jon Paul Maloy's avatar
      tipc: eliminate risk of premature link setup during failover · 17b20630
      Jon Paul Maloy authored
      When a link goes down, and there is still a working link towards its
      destination node, a failover is initiated, and the failed link is not
      allowed to re-establish until that procedure is finished. To ensure
      this, the concerned link endpoints are set to state LINK_FAILINGOVER,
      and the node endpoints to NODE_FAILINGOVER during the failover period.
      
      However, if the link reset is due to a disabled bearer, the corres-
      ponding link endpoint is deleted, and only the node endpoint knows
      about the ongoing failover. Now, if the disabled bearer is re-enabled
      during the failover period, the discovery mechanism may create a new
      link endpoint that is ready to be established, despite that this is not
      permitted. This situation may cause both the ongoing failover and any
      subsequent link synchronization to fail.
      
      In this commit, we ensure that a newly created link goes directly to
      state LINK_FAILINGOVER if the corresponding node state is
      NODE_FAILINGOVER. This eliminates the problem described above.
      
      Furthermore, we tighten the criteria for which packets are allowed
      to end a failover state in the function tipc_node_check_state().
      By checking that the receiving link is up and running, instead of just
      checking that it is not in failover mode, we eliminate the risk that
      protocol packets from the re-created link may cause the failover to
      be prematurely terminated.
      Reviewed-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      17b20630
    • David S. Miller's avatar
      Merge branch 'nps_enet_fixes' · 7f629be1
      David S. Miller authored
      Noam Camus says:
      
      ====================
      *** nps_enet fixups ***
      
      Change v2
      TX done is handled back with NAPI poll.
      
      Change v1
      This patch set is a bunch of fixes to make nps_enet work correctly with
      all platforms, i.e. real device, emulation system, and simulation system.
      The main trigger for this patch set was that in our emulation system
      the TX end interrupt is "edge-sensitive" and therefore we cannot use the
      cause register since it is not sticky.
      Also:
      TX is handled during HW interrupt context and not NAPI job.
      race with TX done was fixed.
      added acknowledge for TX when device is "level sensitive".
      enable drop of control frames which is not needed for regular usage.
      
      So most of this patch set is about TX handling, which is now more complete.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f629be1