Commits · 1b0c3247a092db672bf4599f234f0e90d6e30e8b · Kirill Smelkov / linux

23 Aug, 2019 3 commits

ice: Account for all states of FW DCBx and LLDP · 1b0c3247

Dave Ertman authored Jul 29, 2019

Currently, only the DCBx status is taken into account to
determine if FW LLDP is possible.  But there are NVM version
coming out with DCBx enabled, and FW LLDP disabled.  This
is causing errors where the driver sees that DCBx is not
disabled, and then tries to register for LLDP MIB change
events, and fails.

Change the logic to detect both DCBx and LLDP states in the
FW engine.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

1b0c3247

ice: Allow egress control packets from PF_VSI · 0c3a6101

Dave Ertman authored Jul 29, 2019

For control packets (i.e. LLDP packets) to be able to egress
from the main VSI, a bit has to be set in the TX_descriptor.
This should only be done for the main VSI and only if the
FW LLDP agent is disabled.  A bit to allow this also has to
be set in the VSI context.

Add the logic to add the necessary bits in the VSI context
for the PF_VSI and the TX_descriptors for control packets
egressing the PF_VSI.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

0c3a6101

net/ncsi: update response packet length for GCPS/GNS/GNPTS commands · 6d24e141

Ben Wei authored Aug 21, 2019

Update response packet length for the following commands per NC-SI spec
- Get Controller Packet Statistics
- Get NC-SI Statistics
- Get NC-SI Pass-through Statistics command
Signed-off-by: Ben Wei <benwei@fb.com>
Reviewed-by: Justin Lee <justin.lee1@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d24e141

22 Aug, 2019 32 commits

ethernet: Delete unnecessary checks before the macro call “dev_kfree_skb” · 399e06a5

Markus Elfring authored Aug 22, 2019

The dev_kfree_skb() function performs also input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

399e06a5

net: usb: Delete unnecessary checks before the macro call “dev_kfree_skb” · 5477fccf

Markus Elfring authored Aug 21, 2019

The dev_kfree_skb() function performs also input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

5477fccf

hamradio: Delete unnecessary checks before the macro call “dev_kfree_skb” · b7deac31

Markus Elfring authored Aug 21, 2019

The dev_kfree_skb() function performs also input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

b7deac31

can: Delete unnecessary checks before the macro call “dev_kfree_skb” · 038dab7e

Markus Elfring authored Aug 21, 2019

The dev_kfree_skb() function performs also input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Sean Nyekjaer <sean@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

038dab7e

mISDN: Delete unnecessary checks before the macro call “dev_kfree_skb” · 0f817a5e

Markus Elfring authored Aug 21, 2019

The dev_kfree_skb() function performs also input parameter validation.
Thus the test around the shown calls is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

0f817a5e

net: fec: add C45 MDIO read/write support · d3ee8ec7

Marco Hartmann authored Aug 21, 2019

IEEE 802.3ae clause 45 defines a modified MDIO protocol that uses a two
staged access model in order to increase the address space.

This patch adds support for C45 MDIO read and write accesses, which are
used whenever the MII_ADDR_C45 flag in the regnum argument is set.
In case it is not set, C22 accesses are used as before.
Signed-off-by: Marco Hartmann <marco.hartmann@nxp.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d3ee8ec7

Merge branch 'dt-bindings-net-meson-dwmac-convert-to-yaml' · c2a5b91e

David S. Miller authored Aug 22, 2019

Neil Armstrong says:

====================
dt-bindings: net: meson-dwmac: convert to yaml

This patchsets converts the Amlogic Meson DWMAC glue bindings over to
YAML schemas using the already converted dwmac bindings.

The first patch is needed because the Amlogic glue needs a supplementary
reg cell to access the DWMAC glue registers.

Changes since v3:
- Specified net-next target tree

Changes since v2:
- Added review tags
- Updated allwinner,sun7i-a20-gmac.yaml reg maxItems
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c2a5b91e

dt-bindings: net: meson-dwmac: convert to yaml · d5a57e4e

Neil Armstrong authored Aug 20, 2019

Now that we have the DT validation in place, let's convert the device tree
bindings for the Synopsys DWMAC Glue for Amlogic SoCs over to a YAML schemas.
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d5a57e4e

dt-bindings: net: snps, dwmac: update reg minItems maxItems · 57b77df7

Neil Armstrong authored Aug 20, 2019

The Amlogic Meson DWMAC glue bindings needs a second reg cells for the
glue registers, thus update the reg minItems/maxItems to allow more
than a single reg cell.

Also update the allwinner,sun7i-a20-gmac.yaml derivative schema to specify
maxItems to 1.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

57b77df7

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 7ee7f3e8

David S. Miller authored Aug 22, 2019

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2019-08-22

This series contains updates to i40e driver only.

Arnd Bergmann reduces the stack usage which was causing warnings on
32-bit architectures due to large structure sizes for 2 functions
getting inlined, so use noinline_for_stack to prevent the compilers from
combining the 2 functions.

Mauro S. M. Rodrigues fixes an issue when reading an EEPROM from SFP
modules that comply with SFF-8472 but do not implement the Digital
Diagnostic Monitoring (DDM) interface for i40e.

Huhai found we were not checking the return value for configuring the
transmit ring and continuing with XDP configuration of the transmit
ring.

Beilei fixes an issue of shifting signed 32-bit integers.

Sylwia adds support for "packet drop mode" to the MAC configuration for
admin queue command.  This bit controls the behavior when a no-drop
packet is blocking a TC queue.  Adds support for persistent LLDP by
checking the LLDP flag and reading the LLDP from the NVM when enabled.

Adrian fixes the "recovery mode" check to take into account which device
we are on, since x710 devices have 4 register values to check for status
and x722 devices only have 2 register values to check.

Piotr Azarewicz bumps the supported firmware API version to 1.9 which
extends the PHY access admin queue command support.

Jake makes sure the traffic class stats for a VEB are reset when the VEB
stats are reset.

Slawomir fixes a NULL pointer dereference where the VSI pointer was not
updated before passing it to the i40e_set_vf_mac() when the VF is in a
reset state, so wait for the reset to complete.

Grzegorz removes the i40e_update_dcb_config() which was not using the
correct NVM reads, so call i40e_init_dcb() in its place to correctly
update the DCB configuration.

Piotr Kwapulinski expands the scope of i40e_set_mac_type() since this is
needed during probe to determine if we are in recovery mode.  Fixed the
driver reset path when in recovery mode.

Marcin fixed an issue where we were breaking out of a loop too early
when trying to get the PHY capabilities.

v2: Combined patch 7 & 9 in the original series, since both patches
    bumped firmware API version.  Also combined patches 12 & 13 in the
    original series, since one increased the scope of checking for MAC
    and the follow-on patch made use of function within the new scope.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7ee7f3e8

i40e: fix retrying in i40e_aq_get_phy_capabilities · 1b5f5d38

Marcin Formela authored Jul 23, 2019

Fixed a bug where driver was breaking out of the loop and
reporting an error without retrying first.
Signed-off-by: Marcin Formela <marcin.formela@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

1b5f5d38

i40e: Persistent LLDP support · 65c275e4

Sylwia Wnuczko authored Jul 23, 2019

This patch adds a function to read NVM module data and uses it to
read current LLDP agent configuration from NVM API version 1.8.
Signed-off-by: Sylwia Wnuczko <sylwia.wnuczko@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

65c275e4

i40e: allow reset in recovery mode · a39f165d

Piotr Kwapulinski authored Jul 23, 2019

Driver waits after issuing a reset. When a reset takes too long a driver
gives up. Implemented by invoking PF reset in a loop. After defined
number of unsuccessful PF reset trials it returns error.
Without this patch PF reset fails when NIC is in recovery mode.

So make i40e_set_mac_type() public. i40e driver requires i40e_set_mac_type()
to be public. It is required for recovery mode handling. Without this patch
recovery mode could not be detected in i40e_probe().
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a39f165d

i40e: Remove function i40e_update_dcb_config() · 541d9731

Grzegorz Siwik authored Jul 23, 2019

This patch removes function i40e_update_dcb_config(). Instead of
i40e_update_dcb_config() we use i40e_init_dcb(), which implements the
correct NVM read.
Signed-off-by: Grzegorz Siwik <grzegorz.siwik@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

541d9731

i40e: Fix crash caused by stress setting of VF MAC addresses · 9889707b

Slawomir Laba authored Jul 23, 2019

Add update to the VSI pointer passed to the i40e_set_vf_mac function.
If VF is in reset state the driver waits in i40e_set_vf_mac function
for the reset to be complete, yet after reset the vsi pointer
that was passed into this function is no longer valid.

The patch updates local VSI pointer directly from pf->vsi array,
by using the id stored in VF pointer (lan_vsi_idx).

Without this commit the driver might occasionally invoke general
protection fault in kernel and disable the OS entirely.
Signed-off-by: Slawomir Laba <slawomirx.laba@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

9889707b

i40e: reset veb.tc_stats when resetting veb.stats · 1e0303fd

Jacob Keller authored Jul 23, 2019

The stats structure for the VEB switch statistics is reset periodically,
but the tc_stats are not reset at the same time.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

1e0303fd

i40e: Update FW API version to 1.9 · f93b3fd9

Piotr Azarewicz authored Jul 23, 2019

Upcoming FW increment API version to 1.9 due to Extend PHY access AQ
command support. SW is ready for that support as well.
Signed-off-by: Piotr Azarewicz <piotr.azarewicz@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

f93b3fd9

i40e: check_recovery_mode had wrong if statement · d4256c8e

Adrian Podlawski authored Jul 23, 2019

Function check_recovery_mode had wrong if statement.
Now we check proper FWS1B register values, which are responsible for
the recovery mode. Recovery mode has 4 values for x710 and 2 for x722.
That's why we need 6 different flags which are defined in the code.
Now in the if statement, we recognize type of mac address
and register value.
Without those changes driver could show wrong state.
Signed-off-by: Adrian Podlawski <adrian.podlawski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d4256c8e

i40e: Add drop mode parameter to set mac config · d802c760

Sylwia Wnuczko authored Jul 23, 2019

This patch adds "drop mode" parameter to set mac config AQ command.
This bit controls the behavior when a no-drop packet is blocking a TC
queue.
0 – The PF driver is notified.
1 – The blocking packet is dropped and then the PF driver is notified.
Signed-off-by: Sylwia Wnuczko <sylwia.wnuczko@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d802c760

i40e: fix shifts of signed values · fb598262

Beilei Xing authored Jul 23, 2019

This patch fixes following error reported by cppcheck:
(error) Shifting signed 32-bit value by 31 bits is undefined behaviour
Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fb598262

i40e: add check on i40e_configure_tx_ring() return value · 408bfc38

huhai authored Jul 22, 2019

When i40e_configure_tx_ring(vsi->tx_rings[i]) returns an error, we should
exit from i40e_vsi_configure_tx and return the error, instead of continuing
to check whether xdp is enable, and configure the xdp transmit ring.
Signed-off-by: huhai <huhai@kylinos.cn>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

408bfc38

i40e: Check if transceiver implements DDM before access · bc6c1eaa

Mauro S. M. Rodrigues authored Jul 18, 2019

Similar to the ixgbe issue fixed in:
655c9141 ("ixgbe: Check DDM existence in transceiver before access)

i40e has the same issue when reading eeprom from SFP's module that comply
with SFF-8472 but not implement the Digital Diagnostic Monitoring (DDM)
interface described in it. The existence of such area is specified by bit
6 of byte 92, set to 1 if implemented.

Without this patch, due to not checking this bit i40e fails to read SFP
module's eeprom with the follow message:

ethtool -m enP51p1s0f0
Cannot get Module EEPROM data: Input/output error

Because it fails to read the additional 256 bytes in which it was assumed
to exist the DDM data.
Signed-off-by: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

bc6c1eaa

i40e: reduce stack usage in i40e_set_fc · 33b16568

Arnd Bergmann authored Jul 15, 2019

The functions i40e_aq_get_phy_abilities_resp() and i40e_set_fc() both
have giant structure on the stack, which makes each one use stack frames
larger than 500 bytes.

As clang decides one function into the other, we get a warning for
exceeding the frame size limit on 32-bit architectures:

drivers/net/ethernet/intel/i40e/i40e_common.c:1654:23: error: stack frame size of 1116 bytes in function 'i40e_set_fc' [-Werror,-Wframe-larger-than=]

When building with gcc, the inlining does not happen, but i40e_set_fc()
calls i40e_aq_get_phy_abilities_resp() anyway, so they add up on the
kernel stack just as much.

The parts that actually use large stacks don't overlap, so make sure
each one is a separate function, and mark them as noinline_for_stack to
prevent the compilers from combining them again.

Fixes: 0a862b43 ("i40e/i40evf: Add module_types and update_link_info")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

33b16568

nexthops: remove redundant assignment to variable err · c76c9925

Colin Ian King authored Aug 22, 2019

Variable err is initialized to a value that is never read and it is
re-assigned later. The initialization is redundant and can be removed.

Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c76c9925

Merge branch 'mlx5-hyperv' · 8da3803d

David S. Miller authored Aug 22, 2019

Haiyang Zhang says:

====================
Add software backchannel and mlx5e HV VHCA stats

This patch set adds paravirtual backchannel in software in pci_hyperv,
which is required by the mlx5e driver HV VHCA stats agent.

The stats agent is responsible on running a periodic rx/tx packets/bytes
stats update.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8da3803d

net/mlx5e: Add mlx5e HV VHCA stats agent · cef35af3

Eran Ben Elisha authored Aug 22, 2019

HV VHCA stats agent is responsible on running a preiodic rx/tx
packets/bytes stats update. Currently the supported format is version
MLX5_HV_VHCA_STATS_VERSION. Block ID 1 is dedicated for statistics data
transfer from the VF to the PF.

The reporter fetch the statistics data from all opened channels, fill it
in a buffer and send it to mlx5_hv_vhca_write_agent.

As the stats layer should include some metadata per block (sequence and
offset), the HV VHCA layer shall modify the buffer before actually send it
over block 1.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cef35af3

net/mlx5: Add HV VHCA control agent · 29ddad43

Eran Ben Elisha authored Aug 22, 2019

Control agent is responsible over of the control block (ID 0). It should
update the PF via this block about every capability change. In addition,
upon block 0 invalidate, it should activate all other supported agents
with data requests from the PF.

Upon agent create/destroy, the invalidate callback of the control agent
is being called in order to update the PF driver about this change.

The control agent is an integral part of HV VHCA and will be created
and destroy as part of the HV VHCA init/cleanup flow.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29ddad43

net/mlx5: Add HV VHCA infrastructure · 87175120

Eran Ben Elisha authored Aug 22, 2019

HV VHCA is a layer which provides PF to VF communication channel based on
HyperV PCI config channel. It implements Mellanox's Inter VHCA control
communication protocol. The protocol contains control block in order to
pass messages between the PF and VF drivers, and data blocks in order to
pass actual data.

The infrastructure is agent based. Each agent will be responsible of
contiguous buffer blocks in the VHCA config space. This infrastructure will
bind agents to their blocks, and those agents can only access read/write
the buffer blocks assigned to them. Each agent will provide three
callbacks (control, invalidate, cleanup). Control will be invoked when
block-0 is invalidated with a command that concerns this agent. Invalidate
callback will be invoked if one of the blocks assigned to this agent was
invalidated. Cleanup will be invoked before the agent is being freed in
order to clean all of its open resources or deferred works.

Block-0 serves as the control block. All execution commands from the PF
will be written by the PF over this block. VF will ack on those by
writing on block-0 as well. Its format is described by struct
mlx5_hv_vhca_control_block layout.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87175120

net/mlx5: Add wrappers for HyperV PCIe operations · 913d14e8

Eran Ben Elisha authored Aug 22, 2019

Add wrapper functions for HyperV PCIe read / write /
block_invalidate_register operations.  This will be used as an
infrastructure in the downstream patch for software communication.

This will be enabled by default if CONFIG_PCI_HYPERV_INTERFACE is set.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

913d14e8

PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface · 348dd93e

Haiyang Zhang authored Aug 22, 2019

This interface driver is a helper driver allows other drivers to
have a common interface with the Hyper-V PCI frontend driver.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

348dd93e

PCI: hv: Add a paravirtual backchannel in software · e5d2f910

Dexuan Cui authored Aug 22, 2019

Windows SR-IOV provides a backchannel mechanism in software for communication
between a VF driver and a PF driver. These "configuration blocks" are
similar in concept to PCI configuration space, but instead of doing reads and
writes in 32-bit chunks through a very slow path, packets of up to 128 bytes
can be sent or received asynchronously.

Nearly every SR-IOV device contains just such a communications channel in
hardware, so using this one in software is usually optional. Using the
software channel, however, allows driver implementers to leverage software
tools that fuzz the communications channel looking for vulnerabilities.

The usage model for these packets puts the responsibility for reading or
writing on the VF driver. The VF driver sends a read or a write packet,
indicating which "block" is being referred to by number.

If the PF driver wishes to initiate communication, it can "invalidate" one or
more of the first 64 blocks. This invalidation is delivered via a callback
supplied by the VF driver by this driver.

No protocol is implied, except that supplied by the PF and VF drivers.
Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e5d2f910

Merge tag 'mlx5-updates-2019-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · fed07ef3

David S. Miller authored Aug 21, 2019

Saeed Mahameed says:

====================
mlx5 tc flow handling for concurrent execution (Part 3)

This series includes updates to mlx5 ethernet and core driver:

Vlad submits part 3 of 3 part series to allow TC flow handling
for concurrent execution.

Vlad says:
==========

Structure mlx5e_neigh_hash_entry code that uses it are refactored in
following ways:

- Extend neigh_hash_entry with rcu and modify its users to always take
  reference to the structure when using it (neigh_hash_entry has already
  had atomic reference counter which was only used when scheduling neigh
  update on workqueue from atomic context of neigh update netevent).

- Always use mlx5e_neigh_update_table->encap_lock when modifying neigh
  update hash table and list. Originally, this lock was only used to
  synchronize with netevent handler function, which is called from bh
  context and cannot use rtnl lock for synchronization. Use rcu read lock
  instead of encap_lock to lookup nhe in atomic context of netevent even
  handler function. Convert encap_lock to mutex to allow creating new
  neigh hash entries while holding it, which is safe to do because the
  lock is no longer used in atomic context.

- Rcu-ify mlx5e_neigh_hash_entry->encap_list by changing operations on
  encap list to their rcu counterparts and extending encap structure
  with rcu_head to free the encap instances after rcu grace period. This
  allows fast traversal of list of encaps attached to nhe under rcu read
  lock protection.

- Take encap_table_lock when accessing encap entries in neigh update and
  neigh stats update code to protect from concurrent encap entry
  insertion or removal.

This approach leads to potential race condition when neigh update and
neigh stats update code can access encap and flow entries that are not
fully initialized or are being destroyed, or neigh can change state
without updating encaps that are created concurrently. Prevent these
issues by following changes in flow and encap initialization:

- Extend mlx5e_tc_flow with 'init_done' completion. Modify neigh update
  to wait for both encap and flow completions to prevent concurrent
  access to a structure that is being initialized by tc.

- Skip structures that failed during initialization: encaps with
  encap_id<0 and flows that don't have OFFLOADED flag set.

- To ensure that no new flows are added to encap when it is being
  accessed by neigh update or neigh stats update, take encap_table_lock
  mutex.

- To prevent concurrent deletion by tc, ensure that neigh update and
  neigh stats update hold references to encap and flow instances while
  using them.

With changes presented in this patch set it is now safe to execute tc
concurrently with neigh update and neigh stats update. However, these
two workqueue tasks modify same flow "tmp_list" field to store flows
with reference taken in temporary list to release the references after
update operation finishes and should not be executed concurrently with
each other.

Last 3 patches of this series provide 3 new mlx5 trace points to track
mlx5 tc requests and mlx5 neigh updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fed07ef3

21 Aug, 2019 5 commits

net/mlx5e: Add trace point for neigh update · 5970882a

Vlad Buslov authored Jun 25, 2019

Allow tracing neigh state during neigh update task that is executed on
workqueue and is scheduled by neigh state change event.

Usage example:
 ># cd /sys/kernel/debug/tracing
 ># echo mlx5:mlx5e_rep_neigh_update >> set_event
 ># cat trace
    ...
    kworker/u48:7-2221  [009] ...1  1475.387435: mlx5e_rep_neigh_update:
netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1

Added corresponding documentation in
    Documentation/networking/device-driver/mellanox/mlx5.rst
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Dmytro Linkin <dmitrolin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

5970882a

net/mlx5e: Add trace point for neigh used value update · c786fe59

Vlad Buslov authored Jun 25, 2019

Allow tracing result of neigh used value update task that is executed
periodically on workqueue.

Usage example:
 ># cd /sys/kernel/debug/tracing
 ># echo mlx5:mlx5e_tc_update_neigh_used_value >> set_event
 ># cat trace
    ...
    kworker/u48:4-8806  [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value:
netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1

Added corresponding documentation in
    Documentation/networking/device-driver/mellanox/mlx5.rst
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Dmytro Linkin <dmitrolin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

c786fe59

net/mlx5e: Add tc flower tracepoints · 7a978759

Dmytro Linkin authored Jun 27, 2019

Implemented following tracepoints:
1. Configure flower (mlx5e_configure_flower)
2. Delete flower (mlx5e_delete_flower)
3. Stats flower (mlx5e_stats_flower)

Usage example:
 ># cd /sys/kernel/debug/tracing
 ># echo mlx5:mlx5e_configure_flower >> set_event
 ># cat trace
    ...
    tc-6535  [019] ...1  2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT

Added corresponding documentation in
    Documentation/networking/device-driver/mellanox/mlx5.rst
Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

7a978759

net/mlx5e: Only access fully initialized flows in neigh update · 95435ad7

Vlad Buslov authored Aug 04, 2019

To remove dependency on rtnl lock and prevent neigh update code from
accessing uninitialized flows when executing concurrently with tc, extend
mlx5e_tc_flow with 'init_done' completion. Modify helper
mlx5e_take_all_encap_flows() to wait for flow completion after obtaining
reference to it. Modify mlx5e_tc_encap_flows_del() and
mlx5e_tc_encap_flows_add() to skip flows that don't have OFFLOADED flag
set, which can happen if concurrent flow initialization failed.

This commit finishes neigh update refactoring for concurrent execution
started in previous change in this series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

95435ad7

net/mlx5e: Refactor neigh update for concurrent execution · 2a1f1768

Vlad Buslov authored Aug 04, 2019

In order to remove dependency on rtnl lock and allow neigh update workqueue
task to execute concurrently with tc, refactor mlx5e_rep_neigh_update() for
concurrent execution:

- Lock encap table when accessing encap entry to prevent concurrent
  changes. To do this properly, the initial encap state check is moved from
  mlx5e_rep_neigh_update() into mlx5e_rep_update_flows() to be performed
  under encap_tbl_lock protection.

- Wait for encap to be fully initialized before accessing it by means of
  'res_ready' completion.

- Add mlx5e_take_all_encap_flows() helper which is used to construct a
  temporary list of flows and efi indexes that is used to access current
  encap data in flow which can be attached to multiple encaps
  simultaneously. Release the flows from temporary list after
  encap_tbl_lock critical section. This is necessary because
  mlx5e_flow_put() can't be called while holding encap_tbl_lock.

- Modify mlx5e_tc_encap_flows_add() and mlx5e_tc_encap_flows_del() to work
  with user-provided list of flows built by mlx5e_take_all_encap_flows(),
  instead of traversing encap flow list directly.

This is first step in complex neigh update refactoring, which is finished
by following commit in this series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

2a1f1768