Commits · b8a98a4cf3221d8140969e3f5bde09206a6cb623 · Kirill Smelkov / linux

30 Mar, 2018 40 commits

net/mlx5e: Keep single pre-initialized UMR WQE per RQ · b8a98a4c

Tariq Toukan authored Dec 20, 2017

All UMR WQEs of an RQ share many common fields. We use
pre-initialized structures to save calculations in datapath.
One field (xlt_offset) was the only reason we saved a pre-initialized
copy per WQE index.
Here we remove its initialization (move its calculation to datapath),
and reduce the number of copies to one-per-RQ.

A very small datapath calculation is added, it occurs once per a MPWQE
(i.e. once every 256KB), but reduces memory consumption and gives
better cache utilization.

Performance testing:
Tested packet rate, no degradation sensed.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

b8a98a4c

net/mlx5e: Remove page_ref bulking in Striding RQ · 9f9e9cd5

Tariq Toukan authored Nov 28, 2017

When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.

Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.

Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,

Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
  not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
  the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
  executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
  executes no ops at all!

Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.

Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.

Single core packet rate (64 bytes).

Early drop in TC: no degradation.

XDP_DROP:
before: 14,270,188 pps
after:  20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

9f9e9cd5

net/mlx5e: Support XDP over Striding RQ · 22f45398

Tariq Toukan authored Feb 07, 2018

Add XDP support over Striding RQ.
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.

Upon a MPWQE free, do not release pages that are being
XDP xmit, they will be released upon completions.

Striding RQ is capable of a higher packet-rate than
conventional RQ.
A performance gain is expected for all cases that had
a HW packet-rate bottleneck. This is the case whenever
using many flows that distribute to many cores.

Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).

XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate   | 100GbE linerate | pct% |
--------------------------------------------------
|   64byte | 126.2 Mpps |      148.0 Mpps |  85% |
|  128byte |  80.0 Mpps |       84.8 Mpps |  94% |
|  256byte |  42.7 Mpps |       42.7 Mpps | 100% |
|  512byte |  23.4 Mpps |       23.4 Mpps | 100% |
--------------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

22f45398

net/mlx5e: Refactor RQ XDP_TX indication · 121e8927

Tariq Toukan authored Dec 12, 2017

Make the xdp_xmit indication available for Striding RQ
by taking it out of the type-specific union.
This refactor is a preparation for a downstream patch that
adds XDP support over Striding RQ.
In addition, use a bitmap instead of a boolean for possible
future flags.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

121e8927

net/mlx5e: Use linear SKB in Striding RQ · 619a8f2a

Tariq Toukan authored Feb 07, 2018

Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.

In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.

To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.

We can satisfy 1 and 2 by configuring:
	stride size = MTU + headroom + tailoom.

This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.

Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
  small amount (linear part) and large amount (fragment).
  This saves datapath cycles in driver and improves utilization
  of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
  assumption of equal-size fragments.

Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.

This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.

Performance testing:
ConnectX-5, single core, single RX ring, default MTU.

UDP packet rate, early drop in TC layer:

--------------------------------------------
| pkt size | before    | after     | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
|  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
|   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------

TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

619a8f2a

net/mlx5e: Use inline MTTs in UMR WQEs · ea3886ca

Tariq Toukan authored Jul 10, 2017

When modifying the page mapping of a HW memory region
(via a UMR post), post the new values inlined in WQE,
instead of using a data pointer.

This is a micro-optimization, inline UMR WQEs of different
rings scale better in HW.

In addition, this obsoletes a few control flows and helps
delete ~50 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

ea3886ca

net/mlx5e: Do not busy-wait for UMR completion in Striding RQ · e4d86a4a

Tariq Toukan authored Jan 07, 2018

Do not busy-wait a pending UMR completion. Under high HW load,
busy-waiting a delayed completion would fully utilize the CPU core
and mistakenly indicate a SW bottleneck.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

e4d86a4a

net/mlx5e: Code movements in RX UMR WQE post · 18187fb2

Tariq Toukan authored Jul 18, 2017

Gets the process of a UMR WQE post in one function,
in preparation for a downstream patch that inlines
the WQE data.
No functional change here.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

18187fb2

net/mlx5e: Derive Striding RQ size from MTU · 73281b78

Tariq Toukan authored Feb 11, 2018

In Striding RQ, each WQE serves multiple packets
(hence called Multi-Packet WQE, MPWQE).
The size of a MPWQE is constant (currently 256KB).

Upon a ringparam set operation, we calculate the number of
MPWQEs per RQ. For this, first it is needed to determine the
number of packets that can reside within a single MPWQE.
In this patch we use the actual MTU size instead of ETH_DATA_LEN
for this calculation.

This implies that a change in MTU might require a change
in Striding RQ ring size.

In addition, this obsoletes some WQEs-to-packets translation
functions and helps delete ~60 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

73281b78

net/mlx5e: Save MTU in channels params · 472a1e44

Tariq Toukan authored Mar 12, 2018

Knowing the MTU is required for RQ creation flow.
By our design, channels creation flow is totally isolated
from priv/netdev, and can be completed with access to
channels params and mdev.
Adding the MTU to the channels params helps preserving that.
In addition, we save it in RQ to make its access faster in
datapath checks.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

472a1e44

net/mlx5e: IPoIB, Fix spelling mistake · 53378898

Talat Batheesh authored Mar 06, 2018

Fix spelling mistake in debug message text.
"dettaching" -> "detaching"
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

53378898

net/mlx5: Change teardown with force mode failure message to warning · 6c750628

Alaa Hleihel authored Feb 01, 2018

With ConnectX-4, we expect the force teardown to fail in case that
DC was enabled, therefore change the message from error to warning.
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

6c750628

net/mlx5: Eliminate query xsrq dead code · b2d3907c

Saeed Mahameed authored Feb 13, 2018

1. This function is not used anywhere in mlx5 driver
2. It has a memcpy statement that makes no sense and produces build
warning with gcc8

drivers/net/ethernet/mellanox/mlx5/core/transobj.c: In function 'mlx5_core_query_xsrq':
drivers/net/ethernet/mellanox/mlx5/core/transobj.c:347:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]

Fixes: 01949d01 ("net/mlx5_core: Enable XRCs and SRQs when using ISSI > 0")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

b2d3907c

net/mlx5e: Use eq ptr from cq · 7b2117bb

Saeed Mahameed authored Feb 01, 2018

Instead of looking for the EQ of the CQ, remove that redundant code and
use the eq pointer stored in the cq struct.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

7b2117bb

tc-testing: Add newline when writing test case files · c0b6edef

Lucas Bates authored Mar 29, 2018

When using the -i feature to generate random ID numbers for test
cases in tdc, the function that writes the JSON to file doesn't
add a newline character to the end of the file, so we have to
add our own.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0b6edef

liquidio: prevent rx queues from getting stalled · ccdd0b4c

Raghu Vatsavayi authored Mar 29, 2018

This commit has fix for RX traffic issues when we stress test the driver
with continuous ifconfig up/down under very high traffic conditions.

Reason for the issue is that, in existing liquidio_stop function NAPI is
disabled even before actual FW/HW interface is brought down via
send_rx_ctrl_cmd(lio, 0). Between time frame of NAPI disable and actual
interface down in firmware, firmware continuously enqueues rx traffic to
host. When interrupt happens for new packets, host irq handler fails in
scheduling NAPI as the NAPI is already disabled.

After "ifconfig <iface> up", Host re-enables NAPI but cannot schedule it
until it receives another Rx interrupt. Host never receives Rx interrupt as
it never cleared the Rx interrupt it received during interface down
operation. NIC Rx interrupt gets cleared only when Host processes queue and
clears the queue counts. Above anomaly leads to other issues like packet
overflow in FW/HW queues, backpressure.

Fix:
This commit fixes this issue by disabling NAPI only after informing
firmware to stop queueing packets to host via send_rx_ctrl_cmd(lio, 0).
send_rx_ctrl_cmd is not visible in the patch as it is already there in the
code. The DOWN command also waits for any pending packets to be processed
by NAPI so that the deadlock will not occur.
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ccdd0b4c

Merge branch 'ieee802154-for-davem-2018-03-29' of... · 6f14f49c

David S. Miller authored Mar 30, 2018

Merge branch 'ieee802154-for-davem-2018-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================
pull-request: ieee802154-next 2018-03-29

An update from ieee802154 for *net-next*

Colin fixed a unused variable in the new mcr20a driver.
Harry fixed an unitialised data read in the debugfs interface of the
ca8210 driver.

If there are any issues or you think these are to late for -rc1 (both can also
go into -rc2 as they are simple fixes) let me know.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6f14f49c

tc-testing: add connmark action tests · 1dad0f9f

Roman Mashak authored Mar 29, 2018

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1dad0f9f

MAINTAINERS: Update my email address from freescale to nxp · fe3f4e80

Claudiu Manoil authored Mar 29, 2018

The freescale.com address will no longer be available.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe3f4e80

dt-bindings: net: renesas-ravb: Add support for r8a77470 SoC · 9b857563

Biju Das authored Mar 29, 2018

Add a new compatible string for the RZ/G1C (R8A77470) SoC.
Signed-off-by: Biju Das <biju.das@bp.renesas.com>
Reviewed-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b857563

Merge branch 'stmmac-DWMAC5' · 8bafb83e

David S. Miller authored Mar 30, 2018

Jose Abreu says:

====================
Fix TX Timeout and implement Safety Features

Fix the TX Timeout handler to correctly reconfigure the whole system and
start implementing features for DWMAC5 cores, specifically the Safety
Features.

Changes since v1:
	- Display error stats in ethtool
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8bafb83e

net: stmmac: Add support for DWMAC5 and implement Safety Features · 8bf993a5

Jose Abreu authored Mar 29, 2018

This adds initial suport for DWMAC5 and implements the Automotive Safety
Package which is available from core version 5.10.

The Automotive Safety Pacakge (also called Safety Features) offers us
with error protection in the core by implementing ECC Protection in
memories, on-chip data path parity protection, FSM parity and timeout
protection and Application/CSR interface timeout protection.

In case of an uncorrectable error we call stmmac_global_err() and
reconfigure the whole core.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

8bf993a5

net: stmmac: Rework and fix TX Timeout code · 34877a15

Jose Abreu authored Mar 29, 2018

Currently TX Timeout handler does not behaves as expected and leads to
an unrecoverable state. Rework current implementation of TX Timeout
handling to actually perform a complete reset of the driver state and IP.

We use deferred work to init a task which will be responsible for
resetting the system.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

34877a15

net: mvneta: remove duplicate *_coal assignment · 02281a35

Jisheng Zhang authored Mar 29, 2018

The style of the rx/tx queue's *_coal member assignment is:

static void foo_coal_set(...)
{
	set the coal in hw;
	update queue's foo_coal member; [1]
}

In other place, we call foo_coal_set(pp, queue->foo_coal), so the above [1]
is duplicated and could be removed.
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02281a35

Merge branch 'do-not-allow-adding-routes-if-disable_ipv6-is-enabled' · e7696042

David S. Miller authored Mar 30, 2018

Lorenzo Bianconi says:

====================
do not allow adding routes if disable_ipv6 is enabled

Do not allow userspace to add static ipv6 routes if disable_ipv6 is enabled.
Update disable_ipv6 documentation according to that change

Changes since v1:
- added an extack message telling the user that IPv6 is disabled on the nexthop
  device
- rebased on-top of net-next
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e7696042

Documentation: ip-sysctl.txt: clarify disable_ipv6 · 2f0aaf7f

Lorenzo Bianconi authored Mar 29, 2018

Clarify that when disable_ipv6 is enabled even the ipv6 routes
are deleted for the selected interface and from now it will not
be possible to add addresses/routes to that interface
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f0aaf7f

ipv6: do not set routes if disable_ipv6 has been enabled · 428604fb

Lorenzo Bianconi authored Mar 29, 2018

Do not allow setting ipv6 routes from userspace if disable_ipv6 has been
enabled. The issue can be triggered using the following reproducer:

- sysctl net.ipv6.conf.all.disable_ipv6=1
- ip -6 route add a:b:c:d::/64 dev em1
- ip -6 route show
  a:b:c:d::/64 dev em1 metric 1024 pref medium

Fix it checking disable_ipv6 value in ip6_route_info_create routine
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

428604fb

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · d162190b

David S. Miller authored Mar 30, 2018

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree. This batch comes with more input sanitization for xtables to
address bug reports from fuzzers, preparation works to the flowtable
infrastructure and assorted updates. In no particular order, they are:

1) Make sure userspace provides a valid standard target verdict, from
   Florian Westphal.

2) Sanitize error target size, also from Florian.

3) Validate that last rule in basechain matches underflow/policy since
   userspace assumes this when decoding the ruleset blob that comes
   from the kernel, from Florian.

4) Consolidate hook entry checks through xt_check_table_hooks(),
   patch from Florian.

5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject
   very large compat offset arrays, so we have a reasonable upper limit
   and fuzzers don't exercise the oom-killer. Patches from Florian.

6) Several WARN_ON checks on xtables mutex helper, from Florian.

7) xt_rateest now has a hashtable per net, from Cong Wang.

8) Consolidate counter allocation in xt_counters_alloc(), from Florian.

9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch
   from Xin Long.

10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from
    Felix Fietkau.

11) Consolidate code through flow_offload_fill_dir(), also from Felix.

12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward()
    to remove a dependency with flowtable and ipv6.ko, from Felix.

13) Cache mtu size in flow_offload_tuple object, this is safe for
    forwarding as f87c10a8 describes, from Felix.

14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too
    modular infrastructure, from Felix.

15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from
    Ahmed Abdelsalam.

16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei.

17) Support for counting only to nf_conncount infrastructure, patch
    from Yi-Hung Wei.

18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes
    to nft_ct.

19) Use boolean as return value from ipt_ah and from IPVS too, patch
    from Gustavo A. R. Silva.

20) Remove useless parameters in nfnl_acct_overquota() and
    nf_conntrack_broadcast_help(), from Taehee Yoo.

21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo.

22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu.

23) Fix typo in xt_limit, from Geert Uytterhoeven.

24) Do no use VLAs in Netfilter code, again from Gustavo.

25) Use ADD_COUNTER from ebtables, from Taehee Yoo.

26) Bitshift support for CONNMARK and MARK targets, from Jack Ma.

27) Use pr_*() and add pr_fmt(), from Arushi Singhal.

28) Add synproxy support to ctnetlink.

29) ICMP type and IGMP matching support for ebtables, patches from
    Matthias Schiffer.

30) Support for the revision infrastructure to ebtables, from
    Bernie Harris.

31) String match support for ebtables, also from Bernie.

32) Documentation for the new flowtable infrastructure.

33) Use generic comparison functions in ebt_stp, from Joe Perches.

34) Demodularize filter chains in nftables.

35) Register conntrack hooks in case nftables NAT chain is added.

36) Merge assignments with return in a couple of spots in the
    Netfilter codebase, also from Arushi.

37) Document that xtables percpu counters are stored in the same
    memory area, from Ben Hutchings.

38) Revert mark_source_chains() sanity checks that break existing
    rulesets, from Florian Westphal.

39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d162190b

Merge branch 'Close-race-between-un-register_netdevice_notifier-and-pernet_operations' · b9a12601

David S. Miller authored Mar 30, 2018

Kirill Tkhai says:

====================
Close race between {un, }register_netdevice_notifier and pernet_operations

the problem is {,un}register_netdevice_notifier() do not take
pernet_ops_rwsem, and they don't see network namespaces, being
initialized in setup_net() and cleanup_net(), since at this
time net is not hashed to net_namespace_list.

This may lead to imbalance, when a notifier is called at time of
setup_net()/net is alive, but it's not called at time of cleanup_net(),
for the devices, hashed to the net, and vise versa. See (3/3) for
the scheme of imbalance.

This patchset fixes the problem by acquiring pernet_ops_rwsem
at the time of {,un}register_netdevice_notifier() (3/3).
(1-2/3) are preparations in xfrm and netfilter subsystems.

The problem was introduced a long ago, but backporting won't be easy,
since every previous kernel version may have changes in netdevice
notifiers, and they all need review and testing. Otherwise, there
may be more pernet_operations, which register or unregister
netdevice notifiers, and that leads to deadlock (which is was fixed
in 1-2/3). This patchset is for net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b9a12601

net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net() · 328fbe74

Kirill Tkhai authored Mar 29, 2018

{un,}register_netdevice_notifier() iterate over all net namespaces
hashed to net_namespace_list. But pernet_operations register and
unregister netdevices in unhashed net namespace, and they are not
seen for netdevice notifiers. This results in asymmetry:

1)Race with register_netdevice_notifier()
  pernet_operations::init(net)	...
   register_netdevice()		...
    call_netdevice_notifiers()  ...
      ... nb is not called ...
  ...				register_netdevice_notifier(nb) -> net skipped
  ...				...
  list_add_tail(&net->list, ..) ...

  Then, userspace stops using net, and it's destructed:

  pernet_operations::exit(net)
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

This always happens with net::loopback_dev, but it may be not the only device.

2)Race with unregister_netdevice_notifier()
  pernet_operations::init(net)
   register_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

  Then, userspace stops using net, and it's destructed:

  list_del_rcu(&net->list)	...
  pernet_operations::exit(net)  unregister_netdevice_notifier(nb) -> net skipped
   dev_change_net_namespace()	...
    call_netdevice_notifiers()
      ... nb is not called ...
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is not called ...

This race is more danger, since dev_change_net_namespace() moves real
network devices, which use not trivial netdevice notifiers, and if this
will happen, the system will be left in unpredictable state.

The patch closes the race. During the testing I found two places,
where register_netdevice_notifier() is called from pernet init/exit
methods (which led to deadlock) and fixed them (see previous patches).

The review moved me to one more unusual registration place:
raw_init() (can driver). It may be a reason of problems,
if someone creates in-kernel CAN_RAW sockets, since they
will be destroyed in exit method and raw_release()
will call unregister_netdevice_notifier(). But grep over
kernel tree does not show, someone creates such sockets
from kernel space.

Theoretically, there can be more places like this, and which are
hidden from review, but we found them on the first bumping there
(since there is no a race, it will be 100% reproducible).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

328fbe74

netfilter: Rework xt_TEE netdevice notifier · 9e2f6c5d

Kirill Tkhai authored Mar 29, 2018

Register netdevice notifier for every iptable entry
is not good, since this breaks modularity, and
the hidden synchronization is based on rtnl_lock().

This patch reworks the synchronization via new lock,
while the rest of logic remains as it was before.
This is required for the next patch.

Tested via:

while :; do
	unshare -n iptables -t mangle -A OUTPUT -j TEE --gateway 1.1.1.2 --oif lo;
done
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e2f6c5d

xfrm: Register xfrm_dev_notifier in appropriate place · e9a441b6

Kirill Tkhai authored Mar 29, 2018

Currently, driver registers it from pernet_operations::init method,
and this breaks modularity, because initialization of net namespace
and netdevice notifiers are orthogonal actions. We don't have
per-namespace netdevice notifiers; all of them are global for all
devices in all namespaces.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e9a441b6

Merge branch 'Implement-of_get_nvmem_mac_address-helper' · caeeeda3

David S. Miller authored Mar 30, 2018

Mike Looijmans says:

====================
of_net: Implement of_get_nvmem_mac_address helper

Posted this as a small set now, with an (optional) second patch that shows
how the changes work and what I've used to test the code on a Topic Miami board.
I've taken the liberty to add appropriate "Acked" and "Review" tags.

v4: Replaced "6" with ETH_ALEN

v3: Add patch that implements mac in nvmem for the Cadence MACB controller
    Remove the integrated of_get_mac_address call

v2: Use of_nvmem_cell_get to avoid needing the assiciated device
    Use void* instead of char*
    Add devicetree binding doc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

caeeeda3

net: macb: Try to retrieve MAC addess from nvmem provider · aa076e3d

Mike Looijmans authored Mar 29, 2018

Call of_get_nvmem_mac_address() to fetch the MAC address from an nvmem
cell, if one is provided in the device tree. This allows the address to
be stored in an I2C EEPROM device for example.
Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa076e3d

of_net: Implement of_get_nvmem_mac_address helper · 9217e566

Mike Looijmans authored Mar 29, 2018

It's common practice to store MAC addresses for network interfaces into
nvmem devices. However the code to actually do this in the kernel lacks,
so this patch adds of_get_nvmem_mac_address() for drivers to obtain the
address from an nvmem cell provider.

This is particulary useful on devices where the ethernet interface cannot
be configured by the bootloader, for example because it's in an FPGA.
Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

9217e566

Merge branch 'nfp-flower-handle-MTU-changes' · 64e828df

David S. Miller authored Mar 30, 2018

Jakub Kicinski says:

====================
nfp: flower: handle MTU changes

This set improves MTU handling for flower offload.  The max MTU is
correctly capped and physical port MTU is communicated to the FW
(and indirectly HW).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

64e828df

nfp: flower: offload phys port MTU change · 29a5dcae

John Hurley authored Mar 28, 2018

Trigger a port mod message to request an MTU change on the NIC when any
physical port representor is assigned a new MTU value. The driver waits
10 msec for an ack that the FW has set the MTU. If no ack is received the
request is rejected and an appropriate warning flagged.

Rather than maintain an MTU queue per repr, one is maintained per app.
Because the MTU ndo is protected by the rtnl lock, there can never be
contention here. Portmod messages from the NIC are also protected by
rtnl so we first check if the portmod is an ack and, if so, handle outside
rtnl and the cmsg work queue.

Acks are detected by the marking of a bit in a portmod response. They are
then verfied by checking the port number and MTU value expected by the
app. If the expected MTU is 0 then no acks are currently expected.

Also, ensure that the packet headroom reserved by the flower firmware is
considered when accepting an MTU change on any repr.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29a5dcae

nfp: modify app MTU setting callbacks · 167cebef

John Hurley authored Mar 28, 2018

Rename the 'change_mtu' app callback to 'check_mtu'. This is called
whenever an MTU change is requested on a netdev. It can reject the
change but is not responsible for implementing it.

Introduce a new 'repr_change_mtu' app callback that is hit when the MTU
of a repr is to be changed. This is responsible for performing the MTU
change and verifying it.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

167cebef

Merge branch 'phylink-API-changes' · 44465c47

David S. Miller authored Mar 30, 2018

Florian Fainelli says:

====================
phylink: API changes

This patch series contains two API changes to PHYLINK which will later be used
by DSA to migrate to PHYLINK. Because these are API changes that impact other
outstanding work (e.g: MVPP2) I would rather get them included sooner to minimize
conflicts.

Thank you!

Changes in v2:

- added missing documentation to mac_link_{up,down} that the interface
  must be configured in mac_config()

- added Russell's, Andrew's and my tags
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

44465c47

sfp/phylink: move module EEPROM ethtool access into netdev core ethtool · e679c9c1

Russell King authored Mar 28, 2018

Provide a pointer to the SFP bus in struct net_device, so that the
ethtool module EEPROM methods can access the SFP directly, rather
than needing every user to provide a hook for it.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e679c9c1