Commits · 9d4f97f97bb8adc47f569d995402c33de9a4fa19 · Kirill Smelkov / linux

17 May, 2017 33 commits

sch_dsmark: Fix uninitialized variable warning. · 9d4f97f9

David S. Miller authored May 17, 2017

We still need to initialize err to -EINVAL for
the case where 'opt' is NULL in dsmark_init().

Fixes: 6529eaba ("net: sched: introduce tcf block infractructure")
Signed-off-by: David S. Miller <davem@davemloft.net>
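
The pattern behind this fix can be sketched as follows. This is a minimal illustration, not the code from dsmark_init(): demo_init() and its argument are hypothetical stand-ins, and only the idea of defaulting err to -EINVAL before the early exit comes from the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an init routine with a shared error exit. */
static int demo_init(const void *opt)
{
	int err = -EINVAL;	/* default covers the early exit when opt is NULL */

	if (!opt)
		goto errout;

	/* ... option parsing would go here and may set err ... */
	err = 0;

errout:
	return err;
}
```

Without the initializer, the NULL path would return an indeterminate value, which is what the compiler warning points at.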

Merge branch 'net-sched-multichain-filters' · 656aae43

David S. Miller authored May 17, 2017

Jiri Pirko says:

====================
net: sched: introduce multichain support for filters

Currently, each classful qdisc holds one chain of filters.
This chain is traversed and each filter may be matched, which
may lead to the execution of a list of actions. One such action
is "reclassify", which "resets" the processing of the
filter chain.

So this filter chain can be looked at as a flat table.
Sometimes it is convenient for the user to configure a hierarchy
of tables. An example use case is encapsulation.

A hierarchy of tables is a common approach in HW pipelines,
which makes this much more convenient to offload.

This patchset contains two major patches:
8/10 - This patch introduces the support for having multiple
       chains of filters.
10/10 - This patch adds new control action to allow going to specified chain

The rest of the patches are smaller or bigger dependencies of those 2.
Please see individual patch descriptions for details.

Corresponding iproute2 patches are appended as a reply to this cover letter.

Simple example:
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip pref 33 flower dst_mac 52:54:00:3d:c7:6d action goto chain 11
$ tc filter add dev eth0 parent ffff: protocol ip pref 22 chain 11 flower dst_ip 192.168.40.1 action drop
$ tc filter show dev eth0 root
filter parent ffff: protocol ip pref 33 flower chain 0
filter parent ffff: protocol ip pref 33 flower chain 0 handle 0x1
  dst_mac 52:54:00:3d:c7:6d
  eth_type ipv4
        action order 1: gact action goto chain 11
         random type none pass val 0
         index 2 ref 1 bind 1

filter parent ffff: protocol ip pref 22 flower chain 11
filter parent ffff: protocol ip pref 22 flower chain 11 handle 0x1
  eth_type ipv4
  dst_ip 192.168.40.1
        action order 1: gact action drop
         random type none pass val 0
         index 3 ref 1 bind 1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: add termination action to allow goto chain · db50514f

Jiri Pirko authored May 17, 2017

Introduce a new type of termination action called "goto_chain". This allows
the user to specify a chain to be processed. This action type is
then processed as a return value in the tcf_classify loop in a similar
way to "reclassify", except that it does not reset to the first filter
in the chain but rather to the first filter of the desired chain.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
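
The control flow described above can be sketched in plain C. Everything here is hypothetical (the demo_* types, the restart limit, and dropping on an unknown chain are assumptions of this sketch); it is not the kernel's tcf_classify, only an illustration of "reclassify" restarting at the starting chain while "goto_chain" restarts at the first filter of the requested chain.

```c
#include <stddef.h>

/* All names are hypothetical; they only mirror the idea in the text. */
enum demo_verdict {
	DEMO_UNMATCHED, DEMO_OK, DEMO_DROP, DEMO_RECLASSIFY, DEMO_GOTO_CHAIN
};

struct demo_filter {
	int key;			/* packet value this filter matches */
	enum demo_verdict verdict;	/* action result when it matches */
	unsigned int goto_index;	/* target chain for DEMO_GOTO_CHAIN */
	struct demo_filter *next;
};

struct demo_chain {
	unsigned int index;
	struct demo_filter *first;
	struct demo_chain *next;
};

static struct demo_chain *demo_find_chain(struct demo_chain *chains,
					  unsigned int index)
{
	for (; chains; chains = chains->next)
		if (chains->index == index)
			return chains;
	return NULL;
}

/* Walk filters starting at chain 0; restart on reclassify or goto_chain. */
static enum demo_verdict demo_classify(struct demo_chain *chains, int pkt,
				       int max_restarts)
{
	struct demo_chain *start = demo_find_chain(chains, 0);
	struct demo_chain *target;
	struct demo_filter *f = start ? start->first : NULL;
	int restarts = 0;

	while (f) {
		if (f->key != pkt) {
			f = f->next;
			continue;
		}
		if (f->verdict != DEMO_RECLASSIFY &&
		    f->verdict != DEMO_GOTO_CHAIN)
			return f->verdict;
		if (++restarts > max_restarts)
			return DEMO_DROP;	/* assumed loop guard */
		target = f->verdict == DEMO_RECLASSIFY ?
			 start : demo_find_chain(chains, f->goto_index);
		if (!target)
			return DEMO_DROP;	/* assumed: unknown chain drops */
		f = target->first;
	}
	return DEMO_UNMATCHED;
}
```

With a chain 0 filter that jumps to chain 11 and a chain 11 filter that drops, a matching packet ends with a drop verdict, mirroring the tc example in the cover letter.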

net: sched: push tp down to action init · 9fb9f251

Jiri Pirko authored May 17, 2017

The tp pointer will be needed by the next patch in order to get the chain.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: introduce multichain support for filters · 5bc17018

Jiri Pirko authored May 17, 2017

Instead of having only one filter chain per block, introduce a list of
chains for every block. Create chain 0 by default. UAPI is extended so the
user can specify which chain to change. If the new attribute is not
specified, chain 0 is used, which maintains backward compatibility. If the
chain does not exist and the user wants to manipulate it, a new chain is
created with the specified index. Also, when the last filter is removed
from a chain, the chain is destroyed.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
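
The chain lifecycle described above (look up by index, create on demand, destroy when the last filter goes away) can be sketched as below. The demo_* names are hypothetical and the filter is reduced to a counter. Keeping chain 0 alive even when it is empty is an assumption of this sketch, chosen because the text says chain 0 is created by default.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for a block and its chains. */
struct demo_chain {
	unsigned int index;
	unsigned int nfilters;
	struct demo_chain *next;
};

struct demo_block {
	struct demo_chain *chains;
};

/* Find the chain with the given index, optionally creating it. */
static struct demo_chain *demo_chain_get(struct demo_block *block,
					 unsigned int index, bool create)
{
	struct demo_chain *chain;

	for (chain = block->chains; chain; chain = chain->next)
		if (chain->index == index)
			return chain;
	if (!create)
		return NULL;

	chain = calloc(1, sizeof(*chain));
	if (!chain)
		return NULL;
	chain->index = index;
	chain->next = block->chains;
	block->chains = chain;
	return chain;
}

static void demo_filter_add(struct demo_chain *chain)
{
	chain->nfilters++;
}

/* Drop one filter; free the chain once it is empty (chain 0 is kept). */
static void demo_filter_del(struct demo_block *block, struct demo_chain *chain)
{
	struct demo_chain **pp;

	if (chain->nfilters == 0 || --chain->nfilters > 0)
		return;
	if (chain->index == 0)
		return;		/* assumption: default chain stays present */

	for (pp = &block->chains; *pp; pp = &(*pp)->next) {
		if (*pp == chain) {
			*pp = chain->next;
			free(chain);
			return;
		}
	}
}
```

A request that omits the chain index would simply pass 0 to demo_chain_get(), which is how the sketch models the backward-compatible default.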

net: sched: push chain dump to a separate function · acb31fae

Jiri Pirko authored May 17, 2017

Since there will be multiple chains to dump, push chain dumping code to
a separate function.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: introduce helpers to work with filter chains · 2190d1d0

Jiri Pirko authored May 17, 2017

Introduce a struct tcf_chain object and a set of helpers around it. These
wrap up insertion, deletion and search in the filter chain.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: move TC_H_MAJ macro call into tcf_auto_prio · 7961973a

Jiri Pirko authored May 17, 2017

Call the helper from the function rather than always adjusting the
return value of the function.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: replace nprio by a bool to make the function more readable · 9d36d9e5

Jiri Pirko authored May 17, 2017

The use of the "nprio" variable in tc_ctl_tfilter is a bit cryptic and makes
a reader wonder what is going on for a while. So help the reader understand
this priority allocation dance a little bit better.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: rename tcf_destroy_chain helper · fbe9c5b0

Jiri Pirko authored May 17, 2017

Make the name consistent with the rest of the helpers around.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: introduce tcf block infractructure · 6529eaba

Jiri Pirko authored May 17, 2017

Currently, the filter chains are directly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chain sharing among qdiscs, there is a need for a common
object that would hold the chains. This introduces such an object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: move tc_classify function to cls_api.c · 87d83093

Jiri Pirko authored May 17, 2017

Move the tc_classify function to cls_api.c where it belongs, and rename it
to fit the namespace.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-sort' · c63fbb0b

David S. Miller authored May 17, 2017

Andrew Lunn says:

====================
net: dsa: Sort various lists

As we gain more DSA drivers and tagging protocols, the lists are
getting a bit unruly. Do some sorting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: DSA: Sort drivers · ec34e93f

Andrew Lunn authored May 16, 2017

With more drivers being added, it is time to sort the drivers to
impose some order.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Sort DSA tagging protocol drivers · eb7b7211

Andrew Lunn authored May 16, 2017

With more tag protocols being added, regain some order by sorting the
entries in various places.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: fix PF falsely indicating success at setting MAC address of a nonexistent VF · 0d9a5997

Felix Manlunas authored May 16, 2017

In the function assigned to .ndo_set_vf_mac, check the validity of the
vfidx argument before proceeding to tell the firmware to set the VF MAC
address.
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
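
The check described above amounts to validating the index before doing any work. A minimal sketch, assuming a hypothetical demo_set_vf_mac() and a hypothetical count of allocated VFs (neither name nor the -EINVAL return is taken from the liquidio driver):

```c
#include <errno.h>

/* Hypothetical handler: reject a VF index that does not name a real VF. */
static int demo_set_vf_mac(int num_vfs, int vfidx)
{
	if (vfidx < 0 || vfidx >= num_vfs)
		return -EINVAL;	/* do not report success for a missing VF */

	/* ... the request to the firmware would be issued here ... */
	return 0;
}
```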

liquidio: fix insmod failure when multiple NICs are plugged in · e1e3ce62

Rick Farrington authored May 16, 2017

When multiple liquidio NICs are plugged in, the first insmod of the PF
driver succeeds. But after an rmmod, a subsequent insmod fails. The reason
is that during rmmod, the PF driver resets the Octeon of only one of the
NICs; it neglects to reset the Octeons of the other NICs.

Fix the insmod failure by adding the missing Octeon resets at rmmod. Keep
a per-NIC refcount that indicates the number of active PFs in a given NIC.
When the refcount goes to zero, then reset the Octeon of that NIC.
Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
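
The per-NIC refcount scheme can be sketched as follows. The struct and function names are hypothetical, and the Octeon reset is reduced to a counter so the behavior is observable; only the rule "reset when the count of active PFs reaches zero" comes from the commit message.

```c
/* Hypothetical per-NIC state; "resets" stands in for an Octeon reset. */
struct demo_nic {
	int active_pfs;
	int resets;
};

static void demo_pf_probe(struct demo_nic *nic)
{
	nic->active_pfs++;
}

static void demo_pf_remove(struct demo_nic *nic)
{
	if (nic->active_pfs > 0 && --nic->active_pfs == 0)
		nic->resets++;	/* last PF on this NIC: reset its Octeon */
}
```

Because each NIC carries its own count, removing the PFs of one NIC never skips or duplicates the reset of another.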

net: dsa: store CPU port pointer in the tree · 8b0d3ea5

Vivien Didelot authored May 16, 2017

A dsa_switch_tree instance holds a dsa_switch pointer and a port index
to identify the switch port to which the CPU is attached.

Now that the DSA layer has a dsa_port structure to hold this data, use
it to point to the switch CPU port.

This patch simply substitutes s/dst->cpu_switch/dst->cpu_dp->ds/ and
s/dst->cpu_port/dst->cpu_dp->index/.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Preparations-for-restructuring' · 631581bf

David S. Miller authored May 17, 2017

Jiri Pirko says:

====================
mlxsw: Preparations for restructuring

This patchset doesn't introduce any functional changes and is merely meant
to make the code base more receptive to the upcoming restructuring.

The first six patches mainly shuffle code in order to reduce the scope of
structs that shouldn't be defined in the main driver header. Most of them
will be later expanded, so it makes sense to correctly place them now.

The last patches mostly simplify bridge-related functions, so that they
could be more easily modified later on.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Default ports to non-virtual mode · 45a4a16c

Ido Schimmel authored May 16, 2017

In virtual mode, packets are classified to FIDs based on their ingress
port and VLAN whereas in non-virtual mode only the VLAN is taken into
account.

Currently ports are initialized to use virtual mode due to the presence
of the PVID vPort. However, we're going to transition ports between both
modes based on the FIDs they use and not merely based on the presence of
a VLAN upper. Therefore, during initialization, no mode will be
explicitly set.

Since the Programmer's Reference Manual (PRM) doesn't specify a default,
explicitly set the port to non-virtual mode and later transition the
port between both modes based on the FIDs it uses.

In a follow-up patchset, this step will be moved to the common FID core
where it logically belongs.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Move PVID code to appropriate place · b02eae9b

Ido Schimmel authored May 16, 2017

PVID is a port attribute and should therefore reside in the main driver
file and not the switchdev specific one.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Don't batch learning operations · 7cbc4277

Ido Schimmel authored May 16, 2017

We no longer batch VLAN operations, so there's no need to set the
learning state for a range of VLANs.

Use a common function to set the learning state for a Port-VLAN, thereby
making the code saner and more receptive to upcoming changes.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Don't batch STP operations · 45bfe6b4

Ido Schimmel authored May 16, 2017

Simplify the code by using the common function that sets an STP state
for a Port-VLAN and remove the existing one that tries to batch it for
several VLANs.

This will help us in a follow-up patchset to introduce a unified
infrastructure for bridge ports, regardless of whether the bridge is
VLAN-aware or not.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Don't batch VLAN operations · fe9ccc78

Ido Schimmel authored May 16, 2017

switchdev's VLAN object has the ability to describe a range of VLAN IDs,
but this is only used when VLAN operations are done using the SELF flag,
which is something we would like to remove as it allows one to bypass
the bridge driver.

Do VLAN operations on a per-VLAN basis, thereby simplifying the code and
preparing it for refactoring in a follow-up patchset.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Remove redundant check · d341e2ce

Ido Schimmel authored May 16, 2017

Since commit 97c24290 ("switchdev: Execute bridge ndos only for
bridge ports") the switchdev code checks that the port is bridged, so
there is no need to perform the same check in the driver.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Initialize RIFs in a separate function · 348b8fc3

Ido Schimmel authored May 16, 2017

The router interfaces (RIFs) array is currently initialized together
with the general router configuration. However, in a follow-up patchset
we're going to introduce a common RIF core that will require us to
initialize more RIF constructs, so move the RIF initialization to its
own function.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Move FIB notification block to router struct · 7e39d115

Ido Schimmel authored May 16, 2017

The FIB notification block logically belongs inside the router specific
struct, so move it there.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Move RIFs array to its rightful place · 5f9efffb

Ido Schimmel authored May 16, 2017

The router interfaces (RIFs) array is of no interest to code outside the
routing realm, so declare it inside the router specific struct instead
of the chip-wide one.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Reduce scope of bridge struct · 5f6935c6

Ido Schimmel authored May 16, 2017

Some attributes in the global chip struct are only relevant for bridge
operation, so encapsulate them in their own struct that isn't exposed to
non-bridge code.

This will also help us later, when we add more bridge-specific
attributes.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Reduce scope of router struct · 9011b677

Ido Schimmel authored May 16, 2017

In a similar fashion to the previous patch, the router structure
('mlxsw_sp_router') doesn't need to be accessible to anyone but the
router code located in spectrum_router.c.

Make this apparent and reduce its scope by defining it there.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_buffer: Reduce scope of shared buffer struct · 33cbd87c

Ido Schimmel authored May 16, 2017

The shared buffer structure ('mlxsw_sp_sb') doesn't need to be
accessible to anyone but the shared buffer code located in
spectrum_buffers.c.

Make this apparent and reduce its scope by defining it there.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: add new T5 pci device id · 29db3984

Ganesh Goudar authored May 16, 2017

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: reduce resource allocation in kdump kernel · 85eacf3f

Ganesh Goudar authored May 16, 2017

When is_kdump_kernel() is true, reduce the memory footprint of
cxgb4 by using a single "Queue Set".
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 May, 2017 7 commits

liquidio: use pcie_flr instead of duplicating it · 9ad09803

Christoph Hellwig authored May 16, 2017

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Remove residual magic from PHY drivers · 1b86f702

Andrew Lunn authored May 16, 2017

Commit fa8cddaf ("net phylib: Remove unnecessary condition check in phy")
removed the only place where the PHY flag PHY_HAS_MAGICANEG was
checked, but it left the flag being set in the drivers. Remove the flag.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Remove open coded carrier check · 3fdd34c1

Leon Romanovsky authored May 16, 2017

There is an inline function to test whether the carrier is present,
which makes the open-coded check redundant.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: internal implementation for pacing · 218af599

Eric Dumazet authored May 16, 2017

BBR congestion control depends on pacing, and pacing is
currently handled by the sch_fq packet scheduler for performance reasons,
and also because implementing pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from the TCP stack is more precise rtt
estimations, and less work done from TX completion, since TCP Small
Queue limits are not generally hit. Setups with a single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
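
The fallback decision and the resulting inter-packet gap can be sketched as below. This is an illustration only: the demo_* names and the three-state status are hypothetical, and the delay formula (payload bytes divided by the pacing rate, ignoring headers as the message notes) is an assumption about how such a gap would be computed, not a copy of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pacing states for a socket. */
enum demo_pacing_status {
	DEMO_PACING_NONE,	/* nobody asked for pacing */
	DEMO_PACING_NEEDED,	/* requested, no fq qdisc on the egress path */
	DEMO_PACING_FQ		/* requested, delegated to the fq qdisc */
};

/* TCP paces by itself only when pacing is wanted and fq is not handling it. */
static bool demo_tcp_paces_itself(enum demo_pacing_status status)
{
	return status == DEMO_PACING_NEEDED;
}

/* Gap to wait after sending payload_bytes at rate_bytes_per_sec. */
static uint64_t demo_pacing_delay_ns(uint32_t payload_bytes,
				     uint64_t rate_bytes_per_sec)
{
	if (rate_bytes_per_sec == 0)
		return 0;	/* no rate configured: no pacing delay */
	return (uint64_t)payload_bytes * 1000000000ULL / rate_bytes_per_sec;
}
```

For instance, at the ~48 Mbit per second used in the benchmark (6,000,000 bytes per second), a 1500-byte payload yields a gap of 250,000 ns in this sketch.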

Merge branch 'udp-scalability-improvements' · 8dfedc53

David S. Miller authored May 16, 2017

Paolo Abeni says:

====================
udp: scalability improvements

This patch series implements an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under udp flood is up to 30%, depending on
the numa layout and the number of ingress queues used by the relevant nic.

The performance numbers have been gathered using pktgen as sender, with
64-byte packets, random src port on a host b2b connected via a 10Gbps link
with the dut.

The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
the ingress nic, so that the number of ingress nic rx queues hit by the udp
traffic could be controlled via ethtool -L.

The udp_sink program was bound to the first idle cpu, to get more
stable numbers.

On a single numa node receiver:

nic rx queues           vanilla                 patched kernel
1                       1820 kpps               1900 kpps
2                       1950 kpps               2500 kpps
16                      1670 kpps               2120 kpps

When using a single nic rx queue, busy polling was also enabled;
otherwise, in the above scenario, the bh processing becomes the bottleneck
and this produces large artifacts in the measured performance (e.g.
improving the udp sink run time decreases the overall tput, since more
action from the scheduler comes into play).

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

v1 -> v2:
  Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
  of udp_rmem_release() is now a bool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: keep the sk_receive_queue held when splicing · 6dfb4367

Paolo Abeni authored May 16, 2017

On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.

v1 -> v2:
  the rx_queue_lock_held param in udp_rmem_release() is
  now a bool
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: use a separate rx queue for packet reception · 2276f58a

Paolo Abeni authored May 16, 2017

Under udp flood the sk_receive_queue spinlock is heavily contended.
This patch tries to reduce the contention on that lock by adding a
second receive queue to the udp sockets; recvmsg() looks first
in that queue and, only if it is empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

In specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
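
The two-queue receive strategy can be sketched with plain linked lists. All demo_* names are hypothetical, locking is modeled by a counter rather than a real spinlock, and memory accounting is left out entirely; the sketch only shows the reader draining its private queue and splicing the shared queue in one step when the private one runs dry.

```c
#include <stddef.h>

struct demo_pkt {
	int id;
	struct demo_pkt *next;
};

struct demo_queue {
	struct demo_pkt *head, *tail;
};

struct demo_sock {
	struct demo_queue receive_queue;	/* filled by producers, lock-protected */
	struct demo_queue reader_queue;		/* private to the reader */
	unsigned int rx_lock_taken;		/* reader-side lock acquisitions */
};

static void demo_enqueue(struct demo_queue *q, struct demo_pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

static struct demo_pkt *demo_dequeue(struct demo_queue *q)
{
	struct demo_pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	if (!q->head)
		q->tail = NULL;
	p->next = NULL;
	return p;
}

/* Move every packet from "from" to the tail of "to" in one step. */
static void demo_splice(struct demo_queue *from, struct demo_queue *to)
{
	if (!from->head)
		return;
	if (to->tail)
		to->tail->next = from->head;
	else
		to->head = from->head;
	to->tail = from->tail;
	from->head = from->tail = NULL;
}

static struct demo_pkt *demo_recv(struct demo_sock *sk)
{
	struct demo_pkt *p = demo_dequeue(&sk->reader_queue);

	if (p)
		return p;	/* fast path: no shared lock needed */

	sk->rx_lock_taken++;	/* stand-in for taking the receive_queue lock */
	demo_splice(&sk->receive_queue, &sk->reader_queue);
	return demo_dequeue(&sk->reader_queue);
}
```

In this model a burst of queued packets costs the reader one lock acquisition instead of one per packet, which is the contention reduction the message describes.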
