Commits · ac0488ef59a54e42ad744ae1a91fafbcb2566a06 · nexedi / linux

22 Mar, 2017 40 commits

nfp: disable FW on reconfiguration errors · ac0488ef

Jakub Kicinski authored Mar 21, 2017

Since we no longer need to keep the FW enabled for .ndo_close()
to work we can always stop FW after reconfiguration failure.
This seems to make most FWs more resilient to faults (at least
in error injection scenarios).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove defensive checks around ndo_open()/ndo_close() · 219ad6c1

Jakub Kicinski authored Mar 21, 2017

Device open and close handlers check if the device is already
in the desired state.  Thanks to our reconfig infrastructure
this should not be necessary; there doesn't seem to be any
code in the driver which depends on it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flush xmit_more on error paths · 28b0cfee

Jakub Kicinski authored Mar 21, 2017

In case of ring full or DMA mapping error remember to flush xmit_more
delayed kicks.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove RX queue pointers · 83d08a1d

Jakub Kicinski authored Mar 21, 2017

NFP6000 doesn't use queue pointers/doorbells for RX, it uses the
'done' bit in descriptors.  Remove the pointers from data structures.
Since we are saving space in the rx_ring structure, widen the fields we
previously compressed to 16 bits back to word size.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: don't use netdev_warn() before netdev is registered · 87232d96

Jakub Kicinski authored Mar 21, 2017

Fix a warning which called netdev_warn() too early, before the netdev
is registered; use dev_warn() instead.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: fix nfp_cpp_read()/nfp_cpp_write() error paths · 7d2da603

Jakub Kicinski authored Mar 21, 2017

When acquiring an area fails we can't call a function which does both
release and free.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: fix invalid area detection · 1bb665e3

Jakub Kicinski authored Mar 21, 2017

Core should detect when someone is trying to request an access
window which is too large for a given type of access.  Otherwise
the requester will be put on a wait queue forever without any
error message.

Add const qualifiers to clarify that we are only looking at
read-only members in relevant functions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: don't ignore return value of wait_event_interruptible · 76e8f93e

Jakub Kicinski authored Mar 21, 2017

When a signal interrupts waiting for an area to become available
we assume success.  Pay attention to the return code.  Unpack
the code a little bit to make it more readable.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: correct return codes when msleep gets interrupted · 69a4aa89

Jakub Kicinski authored Mar 21, 2017

msleep_interruptible() returns time left to wait, not error
code.  Return ERESTARTSYS when interrupted.

While at it correct a comment and make the polling a bit
more aggressive.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
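
A minimal user-space sketch of the return-code point made above, not the driver's code. fake_msleep_interruptible() and poll_delay() are hypothetical stand-ins, and ERESTARTSYS is defined locally because it is a kernel-internal code missing from user-space headers.

```c
/* ERESTARTSYS is kernel-internal; defined here only so that the sketch
 * compiles on its own (assumption: same value as in the kernel). */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical stand-in for msleep_interruptible(): returns the number of
 * milliseconds left to sleep, which is non-zero only when the sleep was
 * interrupted.  'interrupted_after' simulates when a signal arrives. */
static unsigned long fake_msleep_interruptible(unsigned int msecs,
                                               unsigned int interrupted_after)
{
    if (interrupted_after < msecs)
        return msecs - interrupted_after;
    return 0;
}

/* Translate "time left" into an error code instead of returning it raw. */
static int poll_delay(unsigned int msecs, unsigned int interrupted_after)
{
    if (fake_msleep_interruptible(msecs, interrupted_after))
        return -ERESTARTSYS;
    return 0;
}
```

Returning the raw value would hand callers a positive number on interruption, which is easy to mistake for success or for a count.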

nfp: lock area cache earlier · a831ffb5

Jakub Kicinski authored Mar 21, 2017

We shouldn't access area_cache_list without its lock even
to check if it's empty.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: document expected locking in the core · 61e81abd

Jakub Kicinski authored Mar 21, 2017

Document which fields of nfp_cpp are protected by which locks.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: move mutex code out of nfp_cppcore.c · 8672103f

Jakub Kicinski authored Mar 21, 2017

After mutex cache removal we can put the mutex code in a separate
source file.  This makes it clear it doesn't play with internals
of struct nfp_cpp any more.

No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove cpp mutex cache · 832ff948

Jakub Kicinski authored Mar 21, 2017

CPP mutex cache was introduced to work around the fact that the
same host could successfully acquire a lock multiple times. It
used to collapse multiple users to the same struct nfp_cpp_mutex
and track use count. Unfortunately it's racy. Since we now force
all nfp_mutex_lock() callers within the host to actually succeed
at acquiring the lock we no longer need the cache, let's remove it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: fail graciously when someone tries to grab global lock · f1ba63ec

Jakub Kicinski authored Mar 21, 2017

The global device lock is acquired to search the resource table.
The lock is actually itself part of the table (entry 0).
Therefore if someone asks for resource 0 we would deadlock since
double locking is no longer allowed.

Currently the driver doesn't try to lock that resource so let's
simply make sure we fail graciously and not add special handling
of this case until really needed.  Hide the relevant defines in
the source file.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: disallow sharing mutexes on the same machine · 3d4fc6eb

Jakub Kicinski authored Mar 21, 2017

NFP can be connected to multiple machines via PCI or other buses.
Access to hardware resources is arbitrated using locks residing
in device memory.  Currently nfpcore only respects the mutexes
when it comes to inter-host locking, but if we try to acquire
the same lock again, on one host - it will simply return success
because owner of the lock is already set to that host.

This makes the locks useless for arbitration within one host
and unfair because whichever host grabbed the lock will have
a chance to reacquire it without others getting a shot.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
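
The behaviour change described above can be sketched as follows. This is an illustration under stated assumptions, not the nfpcore implementation: struct dev_mutex, the host IDs and both helpers are hypothetical, and the lock word is updated non-atomically for brevity.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lock word living in device memory: 0 means unlocked,
 * any other value is the ID of the host holding the lock. */
struct dev_mutex {
    uint32_t owner;
};

/* Try to take the lock.  A second attempt from the host that already
 * owns the lock fails just like an attempt from any other host, so the
 * lock also arbitrates between users on the same machine. */
static int dev_mutex_trylock(struct dev_mutex *m, uint32_t host_id)
{
    if (m->owner != 0)
        return -EBUSY;      /* held, even if m->owner == host_id */
    m->owner = host_id;
    return 0;
}

/* Only the current owner may release the lock. */
static int dev_mutex_unlock(struct dev_mutex *m, uint32_t host_id)
{
    if (m->owner != host_id)
        return -EPERM;
    m->owner = 0;
    return 0;
}
```

With the old behaviour the second dev_mutex_trylock() from the owning host would have returned success, which is what made the lock useless within one host.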

net: dwc-xlgmac: fix an error code in xlgmac_alloc_pages() · 31c7ba9e

Dan Carpenter authored Mar 21, 2017

The dma_mapping_error() returns true if there is an error but we want
to return -ENOMEM and not 1.

Fixes: 65e0ace2 ("net: dwc-xlgmac: Initial driver for DesignWare Enterprise Ethernet")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Deng <jiedeng@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
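
A small stand-alone sketch of the error-code problem, assuming a mocked mapping check. fake_dma_mapping_error(), FAKE_DMA_ERROR and both alloc helpers are hypothetical names, not the driver's functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel marking a failed mapping in this sketch. */
#define FAKE_DMA_ERROR UINT64_MAX

/* Stand-in for dma_mapping_error(): a boolean answer, not an errno. */
static bool fake_dma_mapping_error(uint64_t addr)
{
    return addr == FAKE_DMA_ERROR;
}

/* Buggy pattern: the boolean leaks out as the return value, so a
 * failure is reported as 1 instead of a negative error code. */
static int alloc_pages_buggy(uint64_t addr)
{
    int ret = fake_dma_mapping_error(addr);

    return ret;
}

/* Fixed pattern: map the boolean onto a proper error code. */
static int alloc_pages_fixed(uint64_t addr)
{
    if (fake_dma_mapping_error(addr))
        return -ENOMEM;
    return 0;
}
```

Callers that test for `ret < 0` would treat the buggy helper's failure as success.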

rtnetlink: Add dump all for netconf · a7678c70

David Ahern authored Mar 21, 2017

Use the rtnl_dump_all to dump all netconf handlers that have been
registered. Allows userspace to send a dump request for PF_UNSPEC
and get all families.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phy-mmd-cleanup' · 276a74d8

David S. Miller authored Mar 22, 2017

Russell King says:

====================
Clean up PHY MMD accessors

This series cleans up phylib's MMD accessors, so that we have a common
way of accessing the Clause 45 register set.

The current situation is far from ideal - we have phy_(read|write)_mmd()
which accesses Clause 45 registers over Clause 45 accesses, and we have
phy_(read|write)_mmd_indirect(), which accesses Clause 45 registers via
Clause 22 register 13/14.

Generic code uses the indirect methods to access standard Clause 45
features, and when we come to add Clause 45 PHY support to phylib, we
would need to make these conditional upon the PHY type, or duplicate
these functions.

An alternative solution is to merge these accessors together, and select
the appropriate access method depending upon the 802.3 clause that the
PHY conforms with.  The result is that we have a single set of
phy_(read|write)_mmd() accessors.

For cases which require special handling, we still allow PHY drivers to
override MMD accesses, except that the override now covers all MMD
accesses rather than just the indirect ones.  This keeps existing
overrides working.

Combining the two also has another beneficial side effect - we get rid
of similar functions that take arguments in different orders.  The
old direct accessors took the phy structure, devad and register number,
whereas the indirect accessors took the phy structure, register number
and devad in that order.  Care must be taken when updating future
drivers that the argument order is correct, and the function name is
not merely replaced.

This patch set is against net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: clean up mmd_phy_indirect() · 060fbc89

Russell King authored Mar 21, 2017

Make mmd_phy_indirect() use the same terminology as the rest of the
code, making clear what each address is - phy address, devad, and
register number.

While here, remove the "inline" from this static function, leaving
it to the compiler to decide whether to inline this function, and
get rid of unnecessary parens.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: remove the indirect MMD read/write methods · 3b85d8df

Russell King authored Mar 21, 2017

Remove the indirect MMD read/write methods which are now no longer
necessary.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: convert micrel to new read_mmd/write_mmd driver methods · d11437e0

Russell King authored Mar 21, 2017

Convert micrel to the new read_mmd/write_mmd driver methods.  This
Clause 22 PHY does not support any MMD access method.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: switch remaining users to phy_(read|write)_mmd() · a6d99fcd

Russell King authored Mar 21, 2017

Switch everyone over to using phy_read_mmd() and phy_write_mmd() now
that they are able to handle both Clause 22 indirect addressing and
Clause 45 direct addressing methods to the MMD registers.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan78xx: update for phy_(read|write)_mmd_indirect() removal · 5f613677

Russell King authored Mar 21, 2017

lan78xx appears to use phylib in a rather weird way, accessing the PHY
partly through phylib, and partly by making direct accesses to it,
including to the Clause 45 registers.  As the indirect MMD accessors are
going away, update this driver to use the plain phy_(read|write)_mmd()
accessors instead.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: make phy_(read|write)_mmd() generic MMD accessors · 1ee6b9bc

Russell King authored Mar 21, 2017

Make phy_(read|write)_mmd() generic 802.3 clause 45 register accessors
for both Clause 22 and Clause 45 PHYs, using either the direct register
reading for Clause 45, or the indirect method for Clause 22 PHYs.
Allow this behaviour to be overridden by PHY drivers where necessary.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
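
The dispatch described above can be sketched against a mocked PHY. This is an illustration, not phylib's code: struct fake_phy, the c22_*/c45_* helpers and read_mmd()/write_mmd() are hypothetical. The register numbers 13 and 14 and the "data, no post increment" function code follow the Clause 22 indirect access scheme mentioned in the cover letter; the mock ignores the post-increment modes.

```c
#include <stdint.h>

#define MII_MMD_CTRL        0x0d    /* Clause 22 register 13 */
#define MII_MMD_DATA        0x0e    /* Clause 22 register 14 */
#define MII_MMD_CTRL_NOINCR 0x4000  /* function: data, no post increment */
#define MMD_REGS            64      /* mock size; real MMDs have 65536 */

/* Mocked PHY holding a small MMD register space. */
struct fake_phy {
    int is_c45;
    uint16_t mmd_ctrl;              /* last value written to reg 13 */
    uint16_t mmd_addr;              /* address latched through reg 14 */
    uint16_t mmd[32][MMD_REGS];
};

/* Mock of Clause 22 writes, modelling only registers 13 and 14. */
static void c22_write(struct fake_phy *p, int reg, uint16_t val)
{
    if (reg == MII_MMD_CTRL)
        p->mmd_ctrl = val;
    else if (reg == MII_MMD_DATA && (p->mmd_ctrl & 0xc000))
        p->mmd[p->mmd_ctrl & 0x1f][p->mmd_addr % MMD_REGS] = val;
    else if (reg == MII_MMD_DATA)
        p->mmd_addr = val;          /* function 00: latch the address */
}

static uint16_t c22_read(struct fake_phy *p, int reg)
{
    if (reg == MII_MMD_DATA && (p->mmd_ctrl & 0xc000))
        return p->mmd[p->mmd_ctrl & 0x1f][p->mmd_addr % MMD_REGS];
    return 0;
}

/* Mock of Clause 45 direct access. */
static uint16_t c45_read(struct fake_phy *p, int devad, int regnum)
{
    return p->mmd[devad & 0x1f][regnum % MMD_REGS];
}

static void c45_write(struct fake_phy *p, int devad, int regnum, uint16_t val)
{
    p->mmd[devad & 0x1f][regnum % MMD_REGS] = val;
}

/* Indirect sequence: select devad, latch the address, switch to data. */
static void select_mmd_reg(struct fake_phy *p, int devad, int regnum)
{
    c22_write(p, MII_MMD_CTRL, (uint16_t)devad);
    c22_write(p, MII_MMD_DATA, (uint16_t)regnum);
    c22_write(p, MII_MMD_CTRL, (uint16_t)(devad | MII_MMD_CTRL_NOINCR));
}

/* Single accessor pair: argument order is phy, devad, register number. */
static int read_mmd(struct fake_phy *p, int devad, int regnum)
{
    if (p->is_c45)
        return c45_read(p, devad, regnum);
    select_mmd_reg(p, devad, regnum);
    return c22_read(p, MII_MMD_DATA);
}

static void write_mmd(struct fake_phy *p, int devad, int regnum, uint16_t val)
{
    if (p->is_c45) {
        c45_write(p, devad, regnum, val);
        return;
    }
    select_mmd_reg(p, devad, regnum);
    c22_write(p, MII_MMD_DATA, val);
}
```

Callers use the same read_mmd()/write_mmd() pair for both PHY types; only the `is_c45` flag decides which access method runs.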

net: phy: move phy MMD accessors to phy-core.c · 9860118b

Russell King authored Mar 21, 2017

Move the phy_(read|write)_mmd() helpers out of line, they will become
our main MMD accessor functions, and so will be a little more complex.
This complexity doesn't belong in an inline function.  Also move the
_indirect variants as well to keep like functionality together.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Use AVB mode by default · 2d72d501

Thierry Reding authored Mar 21, 2017

Prior to the recent multi-queue changes the driver would configure the
queues to use the AVB mode, but the mode then got switched to DCB. The
hardware still works fine in DCB mode, but my testing capabilities are
limited, so it's safer to revert to the prior setting anyway.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Acked-By: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Restore DT backwards-compatibility · 33e85b8d

Thierry Reding authored Mar 21, 2017

Recent changes to support multiple queues in the device tree bindings
resulted in the number of RX and TX queues being initialized to zero for
device trees not adhering to the new bindings.

Restore backwards-compatibility with those device trees by falling back
to a single RX queue and a single TX queue.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Acked-By: Joao Pinto <jpinto@synopsys.com>
Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Always enable MAC RX queues · f3976874

Thierry Reding authored Mar 21, 2017

The MAC RX queues always need to be enabled in order to receive network
packets. Remove the condition that this only needs to be done for multi-
queue configurations.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: convert sk_filter.refcnt from atomic_t to refcount_t · 4c355cdf

Reshetova, Elena authored Mar 21, 2017

The refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
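
To illustrate why a dedicated reference-count type helps, here is a single-threaded sketch of a saturating counter. It is not the kernel's refcount_t implementation; struct sketch_refcount and its helpers are hypothetical and omit all atomicity.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical saturating reference counter (no atomics, sketch only). */
struct sketch_refcount {
    unsigned int refs;
};

/* Increment, but pin the counter at UINT_MAX instead of wrapping to 0.
 * A wrapped counter could later reach zero while references still
 * exist, which is the use-after-free scenario. */
static void sketch_refcount_inc(struct sketch_refcount *r)
{
    if (r->refs == UINT_MAX)
        return;
    r->refs++;
}

/* Decrement and report whether the last reference was dropped.
 * A saturated or already-zero counter never reports "free now". */
static bool sketch_refcount_dec_and_test(struct sketch_refcount *r)
{
    if (r->refs == UINT_MAX || r->refs == 0)
        return false;
    return --r->refs == 0;
}
```

A saturated object is leaked rather than freed early, trading a memory leak for memory safety.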

net: greth: Utilize of_get_mac_address() · 726bceca

Tobias Klauser authored Mar 21, 2017

Do not open code getting the MAC address exclusively from the
"local-mac-address" property, but instead use of_get_mac_address() which
looks up the MAC address using the 3 typical property names.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: fix Coverity scan errors · 58ad3198

Felix Manlunas authored Mar 20, 2017

Fix Coverity scan errors by not dereferencing lio->glists_dma_base pointer
if it's NULL.

See http://marc.info/?l=linux-netdev&m=149002294305614&w=2
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: VSR Burru <veerasenareddy.burru@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: Permit user set TCP_MAXSEG to default value · cfc62d87

Gao Feng authored Mar 21, 2017

When user_mss is zero, it means use the default value. But the current
code doesn't permit the user to set TCP_MAXSEG to the default value:
it returns -EINVAL when val is zero.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
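
The validation change can be sketched as below. The function names and the bounds SKETCH_TCP_MIN_MSS and SKETCH_MAX_TCP_WINDOW are assumptions made for illustration; only the rule "zero means default and must be accepted" comes from the commit message.

```c
#include <errno.h>

/* Assumed bounds for the sketch; not taken from the commit message. */
#define SKETCH_TCP_MIN_MSS    88
#define SKETCH_MAX_TCP_WINDOW 32767

/* Old behaviour: zero falls below the minimum and is rejected. */
static int set_user_mss_old(int val, int *user_mss)
{
    if (val < SKETCH_TCP_MIN_MSS || val > SKETCH_MAX_TCP_WINDOW)
        return -EINVAL;
    *user_mss = val;
    return 0;
}

/* New behaviour: zero is accepted and resets user_mss to "default";
 * every non-zero value is still range-checked. */
static int set_user_mss_new(int val, int *user_mss)
{
    if (val && (val < SKETCH_TCP_MIN_MSS || val > SKETCH_MAX_TCP_WINDOW))
        return -EINVAL;
    *user_mss = val;
    return 0;
}
```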

Merge branch 'ovs-sample-action-optimization' · b2a1674a

David S. Miller authored Mar 22, 2017

Andy Zhou says:

====================
net-next sample action optimization v4

The sample action can be used for translating the OpenFlow 'clone' action.
However its implementation has not been sufficiently optimized for this
use case. This series attempts to close the gap.

Patch 3 commit message has more details on the specific optimizations
implemented.

---
v3->v4: Enhance patch 4.
        Fix two bugs pointed out by Pravin,
        Remove 'is_sample' variable.

v2->v3: Enhance patch 4, Refactor to move more common logic to clone_execute().

v1->v2: Address Pravin's comment, Refactor recirc and sample
        to share more common code
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

Openvswitch: Refactor sample and recirc actions implementation · bef7f756

andy zhou authored Mar 20, 2017

Added clone_execute() that both the sample and the recirc
action implementation can use.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Optimize sample action for the clone use cases · 798c1661

andy zhou authored Mar 20, 2017

With the introduction of the OpenFlow 'clone' action, the OVS user space
can now translate the 'clone' action into kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datapath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitations: First, there is a three-level
nesting restriction, enforced at flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoids recursive calls only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check and instead implements a run-time recursion limit
check and recursion avoidance similar to that of the 'recirc' action.
This optimization solves both limitations above.

One related optimization attempts to avoid copying the flow key as
long as the enclosed actions do not change the flow key. The
detection is performed only once, at flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
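
A rough sketch of a run-time recursion limit, in contrast to a static nesting check done when the flow is installed. Everything here is hypothetical: struct sketch_action, SKETCH_MAX_LEVEL, the -ELOOP error code and sketch_execute() are illustration only and do not mirror the datapath code.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed limit for the sketch. */
#define SKETCH_MAX_LEVEL 4

/* Hypothetical action that may enclose one nested action list. */
struct sketch_action {
    const struct sketch_action *nested;
};

/* Execute an action and whatever it encloses.  The nesting depth is
 * checked while executing, so arbitrarily nested flows can be
 * installed and only excessive depth at run time is refused. */
static int sketch_execute(const struct sketch_action *act, int level,
                          int *executed)
{
    if (level > SKETCH_MAX_LEVEL)
        return -ELOOP;
    (*executed)++;
    if (act->nested)
        return sketch_execute(act->nested, level + 1, executed);
    return 0;
}
```

Callers start at level 1; a chain deeper than the limit runs its first SKETCH_MAX_LEVEL actions and then reports the error.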

openvswitch: Refactor recirc key allocation. · 4572ef52

andy zhou authored Mar 20, 2017

The logic for allocating and copying the key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse it
as well.  Refactor the logic into its own function clone_key().
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Deferred fifo API change. · 47c697aa

andy zhou authored Mar 20, 2017

The add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far both 'sample' and 'recirc'
actions happen to carry actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary; a future
patch will need to pass in action lists that are not fully encoded
by themselves.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vrf-perf' · 29dd5ec0

David S. Miller authored Mar 22, 2017

David Ahern says:

====================
net: vrf: performance improvements

Device based features for VRF such as qdisc, netfilter and packet
captures are implemented by switching the dst on skbuffs to its per-VRF
dst. This has the effect of controlling the output function which points
to a function in the VRF driver. [1] The skb proceeds down the stack with
dst->dev pointing to the VRF device. Netfilter, qdisc and tc rules and
network taps are evaluated based on this device. Finally, the skb makes
it to the vrf_xmit function which resets the dst based on a FIB lookup.

The feature comes at a cost - between 5 and 10% depending on test (TCP vs
UDP, stream vs RR and IPv4 vs IPv6). The main cost is requiring a FIB
lookup in the VRF driver for each packet sent through it. The FIB lookup
is required because the real dst gets dropped so that the skb can
traverse the stack with dst->dev set to the VRF device.

All of that is really driven by the qdisc and not replicating the
processing of __dev_queue_xmit if a qdisc is set up on the device. But,
VRF devices by default do not have a qdisc and really have no need for
multiple Tx queues. This means the performance overhead is inflicted upon
all users for the potential use case of a qdisc being configured.

The overhead can be avoided by checking if the default configuration
applies to a specific VRF device before switching the dst. If a device
does not have a qdisc, the pass through netfilter hooks and packet taps
can be done inline without dropping the dst and thus avoiding the
performance penalty. With this change the performance overhead of VRF drops
to between negligible (within run-over-run variance) and 3%, depending on
test type.

netperf performance comparison for 3 cases:
1. L3_MASTER_DEVICE compiled out
2. VRF with this patch set
3. current VRF code

IPv4
----
           no-l3mdev     new-vrf     old-vrf
TCP_RR       28778        28938*       27169
TCP_CRR      10706        10490         9770
UDP_RR       30750        29813        29256

* Although higher in the final run used for submitting this patch set, I
  think what this really represents is a negligible performance overhead for
  VRF with this change (i.e., within the +-1% variance of runs). Most
  notably the FIB lookups in the Tx path are avoided for TCP_RR.

IPv6
----
           no-l3mdev     new-vrf     old-vrf
TCP_RR       29495        29432       27794
TCP_CRR      10520        10338        9870
UDP_RR       26137        27019*      26511

* UDP is consistently better with VRF for two reasons:
  1. Source address selection with L3 domains is considering fewer
     addresses since only addresses on interfaces in the domain are
     considered for the selection. Specifically, perf-top shows
     ipv6_get_saddr_eval, ipv6_dev_get_saddr and __ipv6_dev_get_saddr
     running much lower with vrf than without.

  2. The VRF table contains all routes (i.e., there are no separate local
     and main tables per VRF). That means ip6_pol_route_output only has 1
     lookup for VRF where it does 2 without it (1 in the local table and 1
     in the main table).

[1] http://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
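
The fast-path decision described in the cover letter can be sketched roughly as follows. The types and names (struct sketch_vrf_dev, sketch_l3_out(), the path enum) are hypothetical; the real driver works on skbs, dsts and netfilter hooks, none of which are modelled here.

```c
#include <stdbool.h>

/* Which way a packet leaves the hypothetical l3_out handler. */
enum sketch_tx_path {
    SKETCH_TX_DIRECT,   /* hooks and taps run inline, real dst is kept */
    SKETCH_TX_REDIRECT  /* dst switched to the VRF dst, FIB lookup later */
};

/* Minimal stand-in for the state of a VRF device. */
struct sketch_vrf_dev {
    bool has_qdisc;     /* false in the default configuration */
    int nr_taps;        /* packet sockets bound to the VRF device */
};

/* Pick the Tx path.  Only a configured qdisc forces the expensive
 * redirect; taps are delivered only when at least one is registered. */
static enum sketch_tx_path sketch_l3_out(const struct sketch_vrf_dev *vrf,
                                         int *taps_delivered)
{
    if (vrf->has_qdisc)
        return SKETCH_TX_REDIRECT;
    if (vrf->nr_taps > 0)
        (*taps_delivered)++;
    return SKETCH_TX_DIRECT;
}
```

In the default configuration (no qdisc, no taps) the function returns the direct path without touching the tap counter, which corresponds to the case where the overhead is avoided.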

net: vrf: performance improvements for IPv6 · a9ec54d1

David Ahern authored Mar 20, 2017

The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.

The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.

This patch avoids the redirect for IPv6 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.

Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.

The end result of this change is that the overhead of VRF for the
default, baseline case (i.e., no netfilter rules, no packet sockets,
no qdisc) ranges from a +3% improvement for UDP, which has a lookup
per packet (VRF being better than no l3mdev), to a ~2% loss for TCP_CRR,
which connects a socket for each request-response.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vrf: performance improvements for IPv4 · dcdd43c4

David Ahern authored Mar 20, 2017

The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.

The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.

This patch avoids the redirect for IPv4 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.

Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.

The end result of this change is a decrease in the overhead of VRF
for the default, baseline case (i.e., no netfilter rules, no packet
sockets, no qdisc) to ~3% for UDP, which has a lookup per packet, and
< 1% overhead for connected sockets that leverage early demux and
avoid FIB lookups.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
