Commits · 7158ce80dd2dfd52e80dbf514c8dbcb75d321377 · nexedi / linux

07 Feb, 2016 40 commits

Merge branch 'ns-tcp-sysctls' · 7158ce80

David S. Miller authored Feb 07, 2016

Nikolay Borisov says:

====================
Namespaceify more of the tcp sysctl knobs

This patch series continues making more of the tcp-related
sysctl knobs be per net-namespace. Most of these apply per
socket and have global defaults so should be safe and I
don't expect any breakages.

Having those per net-namespace is useful when multiple
containers are hosted and it is required to tune the
tcp settings for each independently of the host node.

I've split the patches to be per-sysctl but after
the review if the outcome is positive I'm happy
to either send it in one big blob or just.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7158ce80

ipv4: Namespaceify tcp_notsent_lowat sysctl knob · 4979f2d9

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4979f2d9

ipv4: Namespaceify tcp_fin_timeout sysctl knob · 1e579caa

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e579caa

ipv4: Namespaceify tcp_orphan_retries sysctl knob · c402d9be

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c402d9be

ipv4: Namespaceify tcp_retries2 sysctl knob · c6214a97

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6214a97

ipv4: Namespaceify tcp_retries1 sysctl knob · ae5c3f40

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae5c3f40

ipv4: Namespaceify tcp reordering sysctl knob · 1043e25f

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1043e25f

ipv4: Namespaceify tcp syncookies sysctl knob · 12ed8244

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

12ed8244

ipv4: Namespaceify tcp synack retries sysctl knob · 7c083ecb

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c083ecb

ipv4: Namespaceify tcp syn retries sysctl knob · 6fa25166

Nikolay Borisov authored Feb 03, 2016

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6fa25166

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge · 9d1eb21b

David S. Miller authored Feb 07, 2016

Antonio Quartulli says:

====================
This batch of patches includes a number of corrections and
improvements for our kernel-doc. These changes also make sure
that our doc is now properly processed by the kernel-doc
parsing tool.

Other than that you have a patch updating all the copyright
lines to 2016 and another patch switching the URLs in our
readme, Kconfig and MAINTAINERS file from "http" to "https".
Both by Sven Eckelmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9d1eb21b

Merge branch 'virtio_net_ethtool_settings' · e6359193

David S. Miller authored Feb 07, 2016

Nikolay Aleksandrov says:

====================
virtio_net: add ethtool get/set settings support

Patch 1 adds ethtool speed/duplex validation functions which check if the
value is defined. Patch 2 adds support for ethtool (get|set)_settings and
uses the validation functions to check the user-supplied values.

v2: split in 2 patches to allow everyone to make use of the validation
functions and allow virtio_net devices to be half duplex
v3: added a check to return error if the user tries to change anything else
besides duplex/speed as per Michael's comment
v4: Set port type to PORT_OTHER
v5: clear diff1.port (ignore port) when checking for changes since we set
it now and ethtool uses it in the set request

Sorry about the pointless iterations, should've all covered now.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e6359193

virtio_net: add ethtool support for set and get of settings · 16032be5

Nikolay Aleksandrov authored Feb 03, 2016

This patch allows the user to set and retrieve speed and duplex of the
virtio_net device via ethtool. Having this functionality is very helpful
for simulating different environments and also enables the virtio_net
device to participate in operations where proper speed and duplex are
required (e.g. currently bonding lacp mode requires full duplex). Custom
speed and duplex are not allowed, the user-supplied settings are validated
before applying.

Example:
$ ethtool eth1
Settings for eth1:
...
	Speed: Unknown!
	Duplex: Unknown! (255)
$ ethtool -s eth1 speed 1000 duplex full
$ ethtool eth1
Settings for eth1:
...
	Speed: 1000Mb/s
	Duplex: Full

Based on a patch by Roopa Prabhu.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16032be5

ethtool: add speed/duplex validation functions · 103a8ad1

Nikolay Aleksandrov authored Feb 03, 2016

Add functions which check if the speed/duplex are defined.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

103a8ad1

Merge branch 'sunvnet-tracepoints' · 13340a0a

David S. Miller authored Feb 07, 2016

Sowmini Varadhan says:

====================
sunvnet: perf tracepoint hooks

Added some perf tracepoints to help track and debug sunvnet
descriptor state at run-time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

13340a0a

sunvnet: perf tracepoint invocations to trace LDC state machine · 365a1028

Sowmini Varadhan authored Feb 02, 2016

Use sunvnet perf trace macros to monitor LDC message exchange state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

365a1028

sunvnet: Add support for perf LDC event tracing · 46fcc6ef

Sowmini Varadhan authored Feb 02, 2016

Add perf event macros for support of tracing and instrumentation
of LDC state machine
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46fcc6ef

Merge branch 'tcp_cong_ctrl_refactoring' · ffeb6437

David S. Miller authored Feb 07, 2016

Yuchung Cheng says:

====================
tcp: congestion control refactoring

This patch set refactors the sequence of congestion control,
loss recovery, and transmission logic in TCP ack processing.

The design goal is to decouple and sequence them in the following order:

  0. ACK accounting: free or tag sent packets [unchanged]

  1. loss recovery: identify lost/ecn packets and update congestion state

  2. congestion control: up/down cwnd and pacing rate based on (1)

  3. transmission: send new or retransmit old based on (1) and (2)

This refactoring makes the cwnd changes more clear because it's done
in one place. The packet accounting is also more robust especially
for connections that do not support SACK. Patch 1-4 and 6 are
refactoring and patch 5 improves TCP performance under reordering.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ffeb6437

tcp: tcp_cong_control helper · d452e6ca

Yuchung Cheng authored Feb 02, 2016

Refactor and consolidate cwnd and rate updates into a new function
tcp_cong_control().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d452e6ca

tcp: make congestion control more robust against reordering · 2d14a4de

Yuchung Cheng authored Feb 02, 2016

This change enables congestion control to update cwnd based on
not only packet cumulatively acked but also packets delivered
out-of-order. This makes congestion control robust against packet
reordering because it may raise cwnd as long as packets are being
delivered once reordering has been detected (i.e., it only cares
the amount of packets delivered, not the ordering among them).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d14a4de

tcp: refactor pkts acked accounting · 3ebd8871

Yuchung Cheng authored Feb 02, 2016

A small refactoring that gets number of packets cumulatively acked
from tcp_clean_rtx_queue() directly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ebd8871

tcp: new delivery accounting · ddf1af6f

Yuchung Cheng authored Feb 02, 2016

This patch changes the accounting of how many packets are
newly acked or sacked when the sender receives an ACK.

The current approach basically computes

   newly_acked_sacked = (prior_packets - prior_sacked) -
                        (tp->packets_out - tp->sacked_out)

   where prior_packets and prior_sacked out are snapshot
   at the beginning of the ACK processing.

The new approach tracks the delivery information via a new
TCP state variable "delivered" which monotically increases
as new packets are delivered in order or out-of-order.

The reason for this change is that the current approach is
brittle that produces negative or inaccurate estimate.

   1) For non-SACK connections, an ACK that advances the SND.UNA
   could reset the DUPACK counters (tp->sacked_out) in
   tcp_process_loss() or tcp_fastretrans_alert(). This inflates
   the inflight suddenly and causes under-estimate or even
   negative estimate. Here is a real example:

                   before   after (processing ACK)
   packets_out     75       73
   sacked_out      23        0
   ca state        Loss     Open

   The old approach computes (75-23) - (73 - 0) = -21 delivered
   while the new approach computes 1 delivered since it
   considers the 2nd-24th packets are delivered OOO.

   2) MSS change would re-count packets_out and sacked_out so
   the estimate is in-accurate and can even become negative.
   E.g., the inflight is doubled when MSS is halved.

   3) Spurious retransmission signaled by DSACK is not accounted

The new approach is simpler and more robust. For SACK connections,
tp->delivered increments as packets are being acked or sacked in
SACK and ACK processing.

For non-sack connections, it's done in tcp_remove_reno_sacks() and
tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
is incremented by the number of packets ACKed (less the current
number of DUPACKs received plus one packet hole).  Upon receiving
a DUPACK, tp->delivered is incremented assuming one out-of-order
packet is delivered.

Upon receiving a DSACK, tp->delivered is incremtened assuming one
retransmission is delivered in tcp_sacktag_write_queue().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ddf1af6f

tcp: move cwnd reduction after recovery state procesing · 31ba0c10

Yuchung Cheng authored Feb 02, 2016

Currently the cwnd is reduced and increased in various different
places. The reduction happens in various places in the recovery
state processing (tcp_fastretrans_alert) while the increase
happens afterward.

A better sequence is to identify lost packets and update
the congestion control state (icsk_ca_state) first. Then base
on the new state, up/down the cwnd in one central place. It's
more clear to reason cwnd changes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31ba0c10

tcp: retransmit after recovery processing and congestion control · e662ca40

Yuchung Cheng authored Feb 02, 2016

The retransmission and F-RTO transmission currently happen inside
recovery state processing (tcp_fastretrans_alert) but before
congestion control.  This refactoring moves the logic after both
s.t. we can determine how much to send (cwnd) before deciding what to
send.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e662ca40

net: drop write-only stack variable · 3575dbf2

David Herrmann authored Feb 02, 2016

Remove a write-only stack variable from unix_attach_fds(). This is a
left-over from the security fix in:

    commit 712f4aad
    Author: willy tarreau <w@1wt.eu>
    Date:   Sun Jan 10 07:54:56 2016 +0100

        unix: properly account for FDs passed over unix sockets
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3575dbf2

net: Add support for fill_slave_info to VRF device · 67eb0331

David Ahern authored Feb 02, 2016

Allows userspace to have direct access to VRF table association
versus looking up master device and its table.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67eb0331

xen-netback: implement dynamic multicast control · 22fae97d

Paul Durrant authored Feb 02, 2016

My recent patch to the Xen Project documents a protocol for 'dynamic
multicast control' in netif.h. This extends the previous multicast control
protocol to not require a shared ring reconnection to turn the feature off.
Instead the backend watches the "request-multicast-control" key in xenstore
and turns the feature off if the key value is written to zero.

This patch adds support for dynamic multicast control in xen-netback.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22fae97d

Merge branch 'be2net-non-critical-fixes' · ce30905a

David S. Miller authored Feb 07, 2016

Sriharsha Basavapatna says:

====================
be2net patch-set

v2 changes:
	Patch-4:	Changed a tab to space in be.h
	Patches-6,7,8:	Updated commit log summary line: benet --> be2net

Hi David,

The following patch set contains a few non-critical bug fixes. Please
consider applying this to the net-next tree. Thanks.

Patch-1 fixes be_set_phys_id() ethtool function to return an error code.
Patch-2 fixes a warning when some commands fail for VFs.
Patch-3 fixes be_vlan_rem_vid() to verify vlan being removed is in the list.
Patch-4 improves SRIOV queue distribution logic.
Patch-5 avoids running self test on VFs.
Patch-6 fixes error recovery in Lancer to clean up after moving to ready state.
Patch-7 adds retry logic to error recovery in case of recovery failures
Patch-8 fixes time interval used in eq delay computation routine
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ce30905a

be2net: Fix interval calculation in interrupt moderation · 3c0d49aa

Padmanabh Ratnakar authored Feb 03, 2016

Interrupt moderation parameters need to be recalculated only
after a time interval of 1 ms. Interval calculation is wrong
when there is a rollover of jiffies. Using recommended way of interval
calculation using jiffies to fix this.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c0d49aa

be2net: Add retry in case of error recovery failure · 972f37b4

Padmanabh Ratnakar authored Feb 03, 2016

Retry error recovery MAX_ERR_RECOVERY_RETRY_COUNT times in case of
failure during error recovery.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

972f37b4

be2net: Fix Lancer error recovery · 1babbad4

Padmanabh Ratnakar authored Feb 03, 2016

After error is detected, wait for adapter to move to ready state
before destroying queues and cleanup of other resources. Also
skip performing any cleanup for non-Lancer chips and move debug
messages to correct routine.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1babbad4

be2net: Don't run ethtool self-tests for VFs · 2e365b1b

Somnath Kotur authored Feb 03, 2016

The CMD_SUBSYSTEM_LOWLEVEL cmds need DEV_CFG Privilege to run
which VFs don't have by default.
Self-tests need to be issued only for PFs.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e365b1b

be2net: SRIOV Queue distribution should factor in EQ-count of VFs · ee9ad280

Sriharsha Basavapatna authored Feb 03, 2016

The SRIOV resource distribution logic for RX/TX queue counts is not optimal
when a small number of VFs are enabled. It does not take into account the
VF's EQ count while computing the queue counts. Because of this, the VF
gets a large number of queues, though it doesn't have sufficient EQs,
resulting in wasted queue resources. And the PF gets a smaller share of
queues though it has more EQs. Fix this by capping the VF queue count at
its EQ count.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee9ad280

be2net: Fix be_vlan_rem_vid() to check vlan id being removed · 41dcdfbd

Sriharsha Basavapatna authored Feb 03, 2016

The driver decrements its vlan count without checking if it is really
present in its list. This results in an invalid vlan count and impacts
subsequent vlan add/rem ops. The function be_vlan_rem_vid() should be
updated to fix this.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41dcdfbd

be2net: check for INSUFFICIENT_PRIVILEGES error · fa5c867d

Suresh Reddy authored Feb 03, 2016

The driver currently logs the message "VF is not privileged to issue
opcode" by checking only the base_status field for UNAUTHORIZED_REQUEST.
Add check to look for INSUFFICIENT_PRIVILEGES in the additional status
field also as not all cmds fail with that base status.
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa5c867d

be2net: return error status from be_set_phys_id() · a5a773a5

Suresh Reddy authored Feb 03, 2016

be_set_phys_id() returns 0 to ethtool when the command fails in the FW.

This patch fixes the set_phys_id() to return -EIO in case the FW cmd fails.
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5a773a5

Merge branch 'vxlan-headers-and-tx-cleanups' · 19f76f63

David S. Miller authored Feb 07, 2016

Jiri Benc says:

====================
vxlan: cleanup headers and tx path

This is first part of vxlan cleanup. It makes vxlan.h better organized and
removes code duplication in the tx path. There's no change in functionality.

This is in preparation for VXLAN-GPE support. Other sets will follow with
cleanup of rx path, modifications of vxlan interface setup and
cleanup/modification of fdb handling. The goal of these patchsets is to
consolidate VXLAN extension handling (right now it's scattered all around
the code) and allow vxlan to operate in L3 mode (i.e. without inner Ethernet
headers).

The full picture (all 30 patches) can be seen at:
https://github.com/jbenc/linux-vxlan/commits/master

Changelog:
v2: added patch 5/6, fixing the udp sum problem spotted by Paolo Abeni
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

19f76f63

vxlan: consolidate vxlan_xmit_skb and vxlan6_xmit_skb · f491e56d

Jiri Benc authored Feb 02, 2016

There's a lot of code duplication. Factor out the duplicate code to a new
function shared between IPv4 and IPv6 xmit path.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f491e56d

vxlan: consolidate csum flag handling · b4ed5cad

Jiri Benc authored Feb 02, 2016

The flag for tx checksumming for tunneling over IPv4 and IPv6 is different.
Decide whether to do tx checksumming in vxlan_xmit_one and pass it on as
a separate flag. This will allow for tx path consolidation in the next
patch.

Unfortunately, gcc is not clever enough to see that udp_sum is always
initialized and gives an uninitialized variable warning. Set it to false to
silence the warning.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4ed5cad

vxlan: consolidate output route calculation · 1a8496ba

Jiri Benc authored Feb 02, 2016

The code for output route lookup is duplicated for ndo_start_xmit and
ndo_fill_metadata_dst. Move it to a common function.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1a8496ba