Commits · 40867d74c374b235e14d839f3a77f26684feefe5 · Kirill Smelkov / linux

16 Mar, 2022 1 commit

net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74

David Ahern authored Mar 14, 2022

The fundamental premise of VRF and l3mdev core code is binding a socket
to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
Legacy code resets flowi_oif to the l3mdev losing any original port
device binding. Ben (among others) has demonstrated use cases where the
original port device binding is important and needs to be retained.
This patch handles that by adding a new entry to the common flow struct
that can indicate the l3mdev index for later rule and table matching
avoiding the need to reset flowi_oif.

In addition to allowing more use cases that require port device binds,
this patch brings a few datapath simplifications:

1. l3mdev_fib_rule_match is only called when walking fib rules and
always after l3mdev_update_flow. That allows an optimization to bail
early for non-VRF type use cases when flowi_l3mdev is not set. Also,
only that index needs to be checked for the FIB table id.

2. l3mdev_update_flow can be called with flowi_oif set to an l3mdev
(e.g., VRF) device. By resetting flowi_oif only for this case the
FLOWI_FLAG_SKIP_NH_OIF flag is no longer needed and can be removed,
removing several checks in the datapath. The flowi_iif path can be
simplified to only be called if it is not loopback (loopback cannot
be assigned to an L3 domain) and the l3mdev index is not already
set.

3. Avoid another device lookup in the output path when the fib lookup
returns a reject failure.
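
The flow handling described in points 1 and 2 can be sketched as a small
userspace model. This is only an illustration under stated assumptions, not
the kernel code: every toy_* name, the device table, and the table_id field
are hypothetical stand-ins for the real flow struct and l3mdev helpers.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_LOOPBACK_IFINDEX 1

/* Hypothetical device entry: whether it is an L3 master (e.g. a VRF),
 * the ifindex of its L3 master if enslaved (0 = none), and the FIB
 * table id owned by an L3 master. */
struct toy_dev {
    int ifindex;
    bool is_l3_master;
    int master_ifindex;
    int table_id;
};

/* Hypothetical flow key: the L3 domain index sits next to oif/iif. */
struct toy_flow {
    int oif;
    int iif;
    int l3mdev;
};

static const struct toy_dev *toy_dev_get(const struct toy_dev *devs,
                                         size_t n, int ifindex)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].ifindex == ifindex)
            return &devs[i];
    return NULL;
}

/* L3 domain of a device: itself if it is a master, else its master. */
static int toy_l3mdev_of(const struct toy_dev *dev)
{
    return dev->is_l3_master ? dev->ifindex : dev->master_ifindex;
}

void toy_update_flow(const struct toy_dev *devs, size_t n,
                     struct toy_flow *fl)
{
    const struct toy_dev *dev;

    if (fl->oif) {
        dev = toy_dev_get(devs, n, fl->oif);
        if (dev) {
            if (!fl->l3mdev)
                fl->l3mdev = toy_l3mdev_of(dev);
            /* Only an oif naming the L3 master itself is reset;
             * a port device binding is retained. */
            if (dev->is_l3_master)
                fl->oif = 0;
            return;
        }
    }

    /* Loopback cannot be in an L3 domain, so it is skipped. */
    if (fl->iif > TOY_LOOPBACK_IFINDEX && !fl->l3mdev) {
        dev = toy_dev_get(devs, n, fl->iif);
        if (dev)
            fl->l3mdev = toy_l3mdev_of(dev);
    }
}

/* Returns the FIB table id for the flow's L3 domain, or 0 if none. */
int toy_rule_table(const struct toy_dev *devs, size_t n,
                   const struct toy_flow *fl)
{
    const struct toy_dev *dev;

    if (!fl->l3mdev)            /* early bail for non-VRF flows */
        return 0;
    dev = toy_dev_get(devs, n, fl->l3mdev);
    return dev ? dev->table_id : 0;
}
```

In this model a flow bound to an enslaved port keeps its oif and still
resolves to the VRF table through the separate l3mdev index, while a flow
bound to the VRF device itself has oif cleared.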

Note: 2 functional tests for local traffic with reject fib rules are
updated to reflect the new direct failure at FIB lookup time for ping
rather than the failure on packet path. The current code fails like this:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: Warning: source address might be selected on device other than: eth1
PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.

--- 172.16.3.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

where the test now directly fails:

HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: connect: No route to host

Signed-off-by: David Ahern <dsahern@kernel.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

15 Mar, 2022 21 commits

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · c84d86a0

Jakub Kicinski authored Mar 15, 2022

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2022-03-14

Jacob Keller says:

The ice_virtchnl_pf.c file has become a single place for a lot of
virtualization functionality. This includes most of the virtchnl message
handling, integration with kernel hooks like the .ndo operations, reset
logic, and more.

We are planning in the future to implement and support Scalable IOV in the
ice driver. To do this, much (but not all) of the code in ice_virtchnl_pf.c
will want to be reused.

Rather than dump all of the Scalable IOV implementation into
ice_virtchnl_pf.c it makes sense to house it in a separate file. But that
still leaves all of the Single Root IOV code littered among more generic
logic.

The long term goal is to re-organize the code such that generic re-usable
code is split into separate files. The ice_sriov.c file would end up
containing all of the Single Root IOV implementation specific details, while
ice_vf_lib.[ch] and ice_virtchnl.[ch] contain the generic pieces.

As a first step, notice that ice_sriov.c currently does not contain much of
the SR-IOV implementation. This is housed primarily in ice_virtchnl_pf.c

The code in ice_sriov.c is really generic and relates to the VF mailbox,
including mailbox overflow detection.

Rename ice_sriov.c to ice_vf_mbx.c, and then rename ice_virtchnl_pf.c to
ice_sriov.c

A later series will finish the refactor by splitting ice_sriov.c into
multiple files, moving the generic code into ice_vf_lib.c and ice_virtchnl.c

To prepare for that series, perform some basic cleanup and other refactors
that we've accumulated during this development cycle.

This series builds on top of the recent hash table refactor work.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: use ice_is_vf_trusted helper function
  ice: log an error message when eswitch fails to configure
  ice: cleanup error logging for ice_ena_vfs
  ice: move ice_set_vf_port_vlan near other .ndo ops
  ice: refactor spoofchk control code in ice_sriov.c
  ice: rename ICE_MAX_VF_COUNT to avoid confusion
  ice: remove unused definitions from ice_sriov.h
  ice: convert vf->vc_ops to a const pointer
  ice: remove circular header dependencies on ice.h
  ice: rename ice_virtchnl_pf.c to ice_sriov.c
  ice: rename ice_sriov.c to ice_vf_mbx.c
====================

Link: https://lore.kernel.org/r/20220315011155.2166817-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · abe2fec8

Jakub Kicinski authored Mar 15, 2022

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) Revert CHECKSUM_UNNECESSARY for UDP packet from conntrack.

2) Reject unsupported families when creating tables, from Phil Sutter.

3) GRE support for the flowtable, from Toshiaki Makita.

4) Add GRE offload support for act_ct, also from Toshiaki.

5) Update mlx5 driver to support GRE flowtable offload,
   from Toshiaki Makita.

6) Oneliner to clean up incorrect indentation in nf_conntrack_bridge,
   from Jiapeng Chong.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: bridge: clean up some inconsistent indenting
  net/mlx5: Support GRE conntrack offload
  act_ct: Support GRE offload
  netfilter: flowtable: Support GRE
  netfilter: nf_tables: Reject tables of unsupported family
  Revert "netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY"
====================

Link: https://lore.kernel.org/r/20220315091513.66544-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: fix build error due to missing IEEE_8021QAZ_MAX_TCS · 72f56fdb

Vladimir Oltean authored Mar 15, 2022

IEEE_8021QAZ_MAX_TCS is defined in include/uapi/linux/dcbnl.h, which is
included by net/dcbnl.h. Then, linux/netdevice.h conditionally includes
net/dcbnl.h if CONFIG_DCB is enabled.

Therefore, when CONFIG_DCB is disabled, this indirect dependency is
broken.

There isn't a good reason to include net/dcbnl.h headers into the ocelot
switch library which exports low-level hardware API, so replace
IEEE_8021QAZ_MAX_TCS with OCELOT_NUM_TC which has the same value.

Fixes: 978777d0 ("net: dsa: felix: configure default-prio and dscp priorities")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220315131215.273450-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: fix a couple warning messages · c24f6577

Dan Carpenter authored Mar 14, 2022

The WARN_ON() macro takes a condition, not a warning message.
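
To illustrate why passing a message is a bug, here is a minimal userspace
sketch. TOY_WARN_ON and toy_warn are hypothetical stand-ins that only mimic
the "evaluate the argument as a condition" behavior of WARN_ON(); they are
not the kernel macro, and the functions below are not the sparx5 code.

```c
#include <stdio.h>

/* Hypothetical stand-in: reports when the condition is true and
 * returns its truth value. */
static int toy_warn(int cond, const char *expr, const char *file, int line)
{
    if (cond)
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    return cond;
}

#define TOY_WARN_ON(cond) toy_warn(!!(cond), #cond, __FILE__, __LINE__)

/* Buggy pattern: a string literal is a non-NULL pointer, so the
 * "condition" is always true and the text is never used as a message. */
int toy_check_buggy(int ret)
{
    (void)ret;
    return TOY_WARN_ON("something failed");
}

/* Intended pattern: pass the actual condition. */
int toy_check_fixed(int ret)
{
    return TOY_WARN_ON(ret < 0);
}
```

With this model, toy_check_buggy() warns on every call regardless of ret,
whereas toy_check_fixed() warns only for negative values.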

Fixes: 0933bd04 ("net: sparx5: Add support for ptp clocks")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220314140327.GB30883@kili
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'netdevsim-support-for-l3-hw-stats' · 583024cf

Paolo Abeni authored Mar 15, 2022

Petr Machata says:

====================
netdevsim: Support for L3 HW stats

"L3 stats" is a suite of interface statistics aimed at reflecting traffic
taking place in a HW device, on an object corresponding to some software
netdevice. Support for this stats suite has been added recently, in commit
ca0a53dc ("Merge branch 'net-hw-counters-for-soft-devices'").

In this patch set:

- Patch #1 adds support for L3 stats to netdevsim.

  Real devices can have various conditions for when an L3 counter is
  available. To simulate this, netdevsim maintains a list of devices
  suitable for HW stats collection. Only when l3_stats is enabled on both a
  netdevice itself, and in netdevsim, will netdevsim contribute values to
  L3 stats.

  This enablement and disablement is done via debugfs:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/enable_ifindex
    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/disable_ifindex

  Besides this, there is a third toggle to mark a device for future failure:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/fail_next_enable

- This allows HW-independent testing of stats reporting and in-kernel APIs,
  as well as a test for enablement rollback, which is difficult to do
  otherwise. This netdevsim-specific selftest is added in patch #2.

- Patch #3 adds another driver-specific selftest, namely a test aimed at
  checking mlxsw-induced stats monitoring events.

====================

Link: https://lore.kernel.org/r/cover.1647265833.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mlxsw: hw_stats_l3: Add a new test · ed2ae69c

Petr Machata authored Mar 14, 2022

Add a test that verifies that UAPI notifications are emitted, as mlxsw
installs and deinstalls HW counters for the L3 offload xstats.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: netdevsim: hw_stats_l3: Add a new test · 9b18942e

Petr Machata authored Mar 14, 2022

Add a test that verifies basic UAPI contracts, netdevsim operation,
rollbacks after partial enablement in core, and UAPI notifications.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netdevsim: Introduce support for L3 offload xstats · 1a6d7ae7

Petr Machata authored Mar 14, 2022

Add support for testing of HW stats support that was added recently, namely
the L3 stats support. L3 stats are provided for devices for which the L3
stats have been turned on, and that were enabled for netdevsim through a
debugfs toggle:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/enable_ifindex

For fully enabled netdevices, netdevsim counts 10pps of ingress traffic and
20pps of egress traffic. Similarly, L3 stats can be disabled for a given
device, and netdevsim ceases pretending there is any HW traffic going on:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/disable_ifindex

Besides this, there is a third toggle to mark a device for future failure:

    # echo $ifindex > /sys/kernel/debug/netdevsim/$DEV/hwstats/l3/fail_next_enable

A future request to enable L3 stats on such netdevice will be bounced by
netdevsim:

    # ip -j l sh dev d | jq '.[].ifindex'
    66
    # echo 66 > /sys/kernel/debug/netdevsim/netdevsim10/hwstats/l3/enable_ifindex
    # echo 66 > /sys/kernel/debug/netdevsim/netdevsim10/hwstats/l3/fail_next_enable
    # ip stats set dev d l3_stats on
    Error: netdevsim: Stats enablement set to fail.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: Kconfig: micrel_phy: fix dependency issue · 231fdac3

Anders Roxell authored Mar 14, 2022

When building the driver with CONFIG_MICREL_PHY, the following error shows up:

aarch64-linux-gnu-ld: drivers/net/phy/micrel.o: in function `lan8814_ts_info':
micrel.c:(.text+0x1764): undefined reference to `ptp_clock_index'
micrel.c:(.text+0x1764): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `ptp_clock_index'
aarch64-linux-gnu-ld: drivers/net/phy/micrel.o: in function `lan8814_probe':
micrel.c:(.text+0x4720): undefined reference to `ptp_clock_register'
micrel.c:(.text+0x4720): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `ptp_clock_register'

Rework Kconfig for MICREL_PHY to depend on 'PTP_1588_CLOCK_OPTIONAL'.
Arnd explains why this dependency is needed in commit
e5f31552 ("ethernet: fix PTP_1588_CLOCK dependencies").

Reported-by: kernel test robot <lkp@intel.com>
Fixes: ece19502 ("net: phy: micrel: 1588 support for LAN8814 phy")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220314110254.12498-1-anders.roxell@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sfp: add 2500base-X quirk for Lantech SFP module · 00eec9fe

Michael Walle authored Mar 12, 2022

The Lantech 8330-262D-E module is 2500base-X capable, but it reports the
nominal bitrate as 2500MBd instead of 3125MBd. Add a quirk for the
module.
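
The mismatch can be worked through numerically. 2500base-X carries 2500 Mb/s
of data with 8b/10b coding, so the line rate is 2500 * 10 / 8 = 3125 MBd. In
SFF-8472, byte 12 of the base ID area holds the nominal signalling rate in
units of 100 MBd. The sketch below is only the arithmetic; the function names
are hypothetical and this is not the sfp driver's quirk code.

```c
#include <stdint.h>

/* SFF-8472 base ID byte 12: nominal signalling rate, units of 100 MBd. */
#define TOY_SFP_BR_NOMINAL_OFFSET 12

/* Nominal signalling rate in MBd as reported by the module EEPROM. */
unsigned int toy_sfp_nominal_mbd(const uint8_t *eeprom)
{
    return eeprom[TOY_SFP_BR_NOMINAL_OFFSET] * 100u;
}

/* Line rate in MBd of an 8b/10b-coded link carrying data_mbps of data. */
unsigned int toy_line_rate_8b10b_mbd(unsigned int data_mbps)
{
    return data_mbps * 10u / 8u;
}
```

For this module, byte 12 in the EEPROM dump is 0x19 (25), which gives
2500 MBd, while a 2500base-X link needs 3125 MBd; hence the quirk.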

The following is an EEPROM dump of such an SFP with the serial number
redacted:

00: 03 04 07 00 00 00 01 20 40 0c 05 01 19 00 00 00    ???...? @????...
10: 1e 0f 00 00 4c 61 6e 74 65 63 68 20 20 20 20 20    ??..Lantech
20: 20 20 20 20 00 00 00 00 38 33 33 30 2d 32 36 32        ....8330-262
30: 44 2d 45 20 20 20 20 20 56 31 2e 30 03 52 00 cb    D-E     V1.0?R.?
40: 00 1a 00 00 46 43 XX XX XX XX XX XX XX XX XX XX    .?..FCXXXXXXXXXX
50: 20 20 20 20 32 32 30 32 31 34 20 20 68 b0 01 98        220214  h???
60: 45 58 54 52 45 4d 45 4c 59 20 43 4f 4d 50 41 54    EXTREMELY COMPAT
70: 49 42 4c 45 20 20 20 20 20 20 20 20 20 20 20 20    IBLE

Signed-off-by: Michael Walle <michael@walle.cc>
Link: https://lore.kernel.org/r/20220312205014.4154907-1-michael@walle.cc
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ice: use ice_is_vf_trusted helper function · 1261691d

Jacob Keller authored Feb 22, 2022

The ice_vc_cfg_promiscuous_mode_msg function directly checks
ICE_VIRTCHNL_VF_CAP_PRIVILEGE, instead of using the existing helper
function ice_is_vf_trusted. Switch this to use the helper function so
that all trusted checks are consistent. This aids in any potential
future refactor by ensuring consistent code.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: log an error message when eswitch fails to configure · 2b369448

Jacob Keller authored Feb 22, 2022

When ice_eswitch_configure fails, print an error message to make it more
obvious why VF initialization did not succeed.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: cleanup error logging for ice_ena_vfs · 94ab2488

Jacob Keller authored Feb 22, 2022

The ice_ena_vfs function and some of its sub-functions like
ice_set_per_vf_res use a "if (<function>) { <print error> ; <exit> }"
flow. This flow discards specialized errors reported by the called
function.

This style is generally not preferred as it makes tracing error sources
more difficult. It also means we cannot log the actual error received
properly.

Refactor several calls in the ice_ena_vfs function that do this to catch
the error in the 'ret' variable. Report this in the messages, and then
return the more precise error value.
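
The two styles can be contrasted in a short sketch. The toy_* functions and
the choice of -EIO in the old style are hypothetical and only illustrate the
pattern; they are not the ice driver functions.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical setup helper: 0 on success, negative errno on failure. */
static int toy_set_per_vf_res(int avail, int needed)
{
    if (needed <= 0)
        return -EINVAL;
    if (needed > avail)
        return -ENOSPC;
    return 0;
}

/* Old style: the specific error from the callee is discarded. */
int toy_ena_vfs_old(int avail, int needed)
{
    if (toy_set_per_vf_res(avail, needed)) {
        fprintf(stderr, "Failed to set resources for %d VFs\n", needed);
        return -EIO;
    }
    return 0;
}

/* New style: capture the error in 'ret', log it, and return it. */
int toy_ena_vfs_new(int avail, int needed)
{
    int ret;

    ret = toy_set_per_vf_res(avail, needed);
    if (ret) {
        fprintf(stderr, "Failed to set resources for %d VFs, error %d\n",
                needed, ret);
        return ret;
    }
    return 0;
}
```

With the second form, a caller that asks for more VFs than are available
sees -ENOSPC instead of a generic failure code.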

Doing this reveals that ice_set_per_vf_res returns -EINVAL or -EIO in
places where -ENOSPC makes more sense. Fix these calls up to return the
more appropriate value.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: move ice_set_vf_port_vlan near other .ndo ops · 346f7aa3

Jacob Keller authored Feb 22, 2022

The ice_set_vf_port_vlan function is located in ice_sriov.c very far
away from the other .ndo operations that it is similar to. Move this so
that it's located near the other .ndo operation definitions.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: refactor spoofchk control code in ice_sriov.c · a8ea6d86

Jacob Keller authored Feb 22, 2022

The API to control the VSI spoof checking for a VF VSI has three
functions: enable, disable, and set. The set function takes the VSI and
the VF and decides whether to call enable or disable based on the
vf->spoofchk field.

In some flows, vf->spoofchk is not yet set, such as the function used to
control the setting for a VF. (vf->spoofchk is only updated after a
success).

Simplify this API by refactoring ice_vf_set_spoofchk_cfg to be
"ice_vsi_apply_spoofchk" which takes the boolean and allows all callers
to avoid having to determine whether to call enable or disable
themselves.
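
A minimal sketch of the API shape described above, assuming simplified
hypothetical types (toy_vsi, toy_vf) rather than the driver's real
structures and signatures:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the VSI and VF structures. */
struct toy_vsi { bool spoofchk_on; };
struct toy_vf  { bool spoofchk; };

static int toy_vsi_ena_spoofchk(struct toy_vsi *vsi)
{
    vsi->spoofchk_on = true;
    return 0;
}

static int toy_vsi_dis_spoofchk(struct toy_vsi *vsi)
{
    vsi->spoofchk_on = false;
    return 0;
}

/* Single entry point: the caller passes the desired state directly,
 * so it does not depend on vf->spoofchk already being up to date. */
int toy_vsi_apply_spoofchk(struct toy_vsi *vsi, bool enable)
{
    return enable ? toy_vsi_ena_spoofchk(vsi) : toy_vsi_dis_spoofchk(vsi);
}

/* A setter applies the requested value first and records it in the VF
 * only after success. */
int toy_set_vf_spoofchk(struct toy_vf *vf, struct toy_vsi *vsi, bool ena)
{
    int ret = toy_vsi_apply_spoofchk(vsi, ena);

    if (ret)
        return ret;
    vf->spoofchk = ena;
    return 0;
}
```

Only toy_vsi_apply_spoofchk would need to be visible to other files in
this arrangement; the enable and disable helpers stay private.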

This matches the expected callers better, and will prevent the need to
export more than one function when this code must be called from another
file.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: rename ICE_MAX_VF_COUNT to avoid confusion · dc36796e

Jacob Keller authored Feb 22, 2022

The ICE_MAX_VF_COUNT field is defined in ice_sriov.h. This count is true
for SR-IOV but will not be true for all VF implementations, such as when
the ice driver supports Scalable IOV.

Rename this definition to clearly indicate ICE_MAX_SRIOV_VFS.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: remove unused definitions from ice_sriov.h · 00a57e29

Jacob Keller authored Feb 22, 2022

A few more macros exist in ice_sriov.h which are not used anywhere.
These can be safely removed. Note that ICE_VIRTCHNL_VF_CAP_L2 capability
is set but never checked anywhere in the driver. Thus it is also safe to
remove.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: convert vf->vc_ops to a const pointer · a7e11710

Jacob Keller authored Feb 22, 2022

The vc_ops structure is used to allow different handlers for virtchnl
commands when the driver is in representor mode. The current
implementation uses a copy of the ops table in each VF, and modifies
this copy dynamically.

The usual practice in kernel code is to store the ops table in a
constant structure and point to different versions. This has a number of
advantages:

  1. Reduced memory usage. Each VF merely points to the correct table,
     so they're able to re-use the same constant lookup table in memory.
  2. Consistency. It becomes more difficult to accidentally update or
     edit only one op call. Instead, the code switches to the correct
     table by a single pointer write. In general this is atomic, either
     the pointer is updated or it's not.
  3. Code Layout. The VF structure can store a pointer to the table
     without needing to have the full structure definition defined prior
     to the VF structure definition. This will aid in future refactoring
     of code by allowing the VF pointer to be kept in ice_vf_lib.h while
     the virtchnl ops table can be maintained in ice_virtchnl.h

There is one major downside in the case of the vc_ops structure. Most of
the operations in the table are the same between the two current
implementations. This can appear to lead to duplication since each
implementation must now fill in the complete table. It could make
spotting the differences in the representor mode more challenging.
Unfortunately, methods to make this less error prone either add
complexity overhead (macros using CPP token concatenation) or don't work
on all compilers we support (constant initializer from another constant
structure).

The cost of maintaining two structures does not outweigh the benefits
of the constant table model.
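
The constant table model can be sketched as follows. All toy_* names, the
two handlers, and the representor behavior shown are hypothetical; the real
table has many more entries and different signatures.

```c
#include <stddef.h>

struct toy_vf;

/* Table of handlers; each mode fills in a complete table. */
struct toy_virtchnl_ops {
    int (*get_ver_msg)(struct toy_vf *vf);
    int (*add_mac_addr_msg)(struct toy_vf *vf);
};

/* The VF stores only a pointer to a shared, read-only table. */
struct toy_vf {
    const struct toy_virtchnl_ops *virtchnl_ops;
    int macs;
};

static int toy_get_ver(struct toy_vf *vf)
{
    (void)vf;
    return 1;
}

static int toy_add_mac(struct toy_vf *vf)
{
    vf->macs++;
    return 0;
}

/* Hypothetical representor-mode handler that rejects the request. */
static int toy_repr_add_mac(struct toy_vf *vf)
{
    (void)vf;
    return -1;
}

static const struct toy_virtchnl_ops toy_dflt_ops = {
    .get_ver_msg = toy_get_ver,
    .add_mac_addr_msg = toy_add_mac,
};

/* Mostly identical to the default table; this is the duplication cost. */
static const struct toy_virtchnl_ops toy_repr_ops = {
    .get_ver_msg = toy_get_ver,
    .add_mac_addr_msg = toy_repr_add_mac,
};

/* Switching modes is a single pointer write. */
void toy_set_dflt_ops(struct toy_vf *vf) { vf->virtchnl_ops = &toy_dflt_ops; }
void toy_set_repr_ops(struct toy_vf *vf) { vf->virtchnl_ops = &toy_repr_ops; }
```

Every VF in the same mode points at the same table, and the accessor
functions keep the assignment in one place.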

While we're making these changes, go ahead and rename the structure and
implementations with "virtchnl" instead of "vc_vf_". This will more
closely align with the planned file renaming, and avoid similar names when
we later introduce a "vf ops" table for separating Scalable IOV and
Single Root IOV implementations.

Leave the accessor/assignment functions in order to avoid issues with
compiling with options disabled. The interface makes it easier to handle
when CONFIG_PCI_IOV is disabled in the kernel.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: remove circular header dependencies on ice.h · 649c87c6

Jacob Keller authored Feb 22, 2022

Several headers in the ice driver include ice.h even though they are
themselves included by that header. The most notable of these is
ice_common.h, but several other headers also do this.

Such a recursive inclusion is problematic as it forces headers to be
included in a strict order, otherwise compilation errors can result. The
circular inclusions do not trigger an endless loop due to standard
header inclusion guards, however other errors can occur.

For example, ice_flow.h defines ice_rss_hash_cfg, which is used by
ice_sriov.h as part of the definition of ice_vf_hash_ip_ctx.

ice_flow.h includes ice_acl.h, which includes ice_common.h, and which
finally includes ice.h. Since ice.h itself includes ice_sriov.h, this
creates a circular dependency.

The definition in ice_sriov.h requires things from ice_flow.h, but
ice_flow.h itself will lead to trying to load ice_sriov.h as part of its
process for expanding ice.h. The current code avoids this issue by
having an implicit dependency without the include of ice_flow.h.

If we were to fix that so that ice_sriov.h explicitly depends on
ice_flow.h the following pattern would occur:

  ice_flow.h -> ice_acl.h -> ice_common.h -> ice.h -> ice_sriov.h

At this point in the expansion, the header guard for ice_flow.h
is already set, so when ice_sriov.h attempts to load the ice_flow.h
header it is skipped. Then, we go on to begin including the rest of
ice_sriov.h, including structure definitions which depend on
ice_rss_hash_cfg. This produces a compiler warning because
ice_rss_hash_cfg hasn't yet been included. Remember, we're just at the
start of ice_flow.h!

If the order of headers is incorrect (ice_flow.h is not implicitly
loaded first in all files which include ice_sriov.h) then we get the
same failure.

Removing this recursive inclusion requires fixing a few cases where some
headers depended on the header inclusions from ice.h. In addition, a few
other changes are also required.

Most notably, ice_hw_to_dev is implemented as a macro in ice_osdep.h,
which is the likely reason that ice_common.h includes ice.h at all. This
macro implementation requires the full definition of ice_pf in order to
properly compile.

Fix this by moving it to a function declared in ice_main.c, so that we
do not require all files to depend on the layout of the ice_pf
structure.
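
The effect of turning the macro into a function can be sketched in a single
file. The toy_* types and names are hypothetical; the point is only that the
declaration needs forward declarations, while the one definition is the only
code that must know the full structure layout.

```c
#include <stddef.h>

/* "Header" part: forward declarations are enough for callers. */
struct toy_device;
struct toy_hw;
struct toy_device *toy_hw_to_dev(struct toy_hw *hw);

/* "Main file" part: the only place that needs the full layout. */
struct toy_device { int id; };
struct toy_hw { int regs; };
struct toy_pf {
    struct toy_device *dev;
    struct toy_hw hw;       /* embedded, so the owner can be recovered */
};

/* Userspace equivalent of the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_device *toy_hw_to_dev(struct toy_hw *hw)
{
    struct toy_pf *pf = toy_container_of(hw, struct toy_pf, hw);

    return pf->dev;
}
```

A macro doing the same pointer arithmetic would force every file that uses
it to see the complete definition of the owning structure.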

Note that this change only fixes circular dependencies, but it does not
fully resolve all implicit dependencies where one header may depend on
the inclusion of another. I tried to fix as many of the implicit
dependencies as I noticed, but fixing them all requires a somewhat
tedious analysis of each header and attempting to compile it separately.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: rename ice_virtchnl_pf.c to ice_sriov.c · 0deb0bf7

Jacob Keller authored Feb 22, 2022

The ice_virtchnl_pf.c and ice_virtchnl_pf.h files are where most of the
code for implementing Single Root IOV virtualization resides. This code
includes support for bringing up and tearing down VFs, hooks into the
kernel SR-IOV netdev operations, and for handling virtchnl messages from
VFs.

In the future, we plan to support Scalable IOV in addition to Single
Root IOV as an alternative virtualization scheme. This implementation
will re-use some but not all of the code in ice_virtchnl_pf.c

To prepare for this future, we want to refactor and split up the code in
ice_virtchnl_pf.c into the following scheme:

 * ice_vf_lib.[ch]

   Basic VF structures and accessors. This is where scheme-independent
   code will reside.

 * ice_virtchnl.[ch]

   Virtchnl message handling. This is where the bulk of the logic for
   processing messages from VFs using the virtchnl messaging scheme will
   reside. This is separated from ice_vf_lib.c because it is distinct
   and has a bulk of the processing code.

 * ice_sriov.[ch]

   Single Root IOV implementation, including initialization and the
   routines for interacting with SR-IOV based netdev operations.

 * (future) ice_siov.[ch]

   Scalable IOV implementation.

As a first step, let's assume that all of the code in
ice_virtchnl_pf.[ch] is for Single Root IOV. Rename this file to
ice_sriov.c and its header to ice_sriov.h.

Future changes will further split out the code in these files following
the plan outlined here.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: rename ice_sriov.c to ice_vf_mbx.c · d775155a

Jacob Keller authored Feb 22, 2022

The ice_sriov.c file primarily contains code which handles the logic for
mailbox overflow detection and some other utility functions related to
the virtualization mailbox.

The bulk of the SR-IOV implementation is actually found in
ice_virtchnl_pf.c, and this file isn't strictly SR-IOV specific.

In the future, the ice driver will support an additional virtualization
scheme known as Scalable IOV, and the code in this file will be used
for this alternative implementation.

Rename this file (and its associated header) to ice_vf_mbx.c, so that we
can later re-use the ice_sriov.c file as the SR-IOV specific file.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

14 Mar, 2022 18 commits

nfp: flower: avoid newline at the end of message in NL_SET_ERR_MSG_MOD · bdd6a89d

Niklas Söderlund authored Mar 12, 2022

Fix the following coccicheck warning:

drivers/net/ethernet/netronome/nfp/flower/action.c:959:7-69: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220312095823.2425775-1-niklas.soderlund@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fix use-after-free in mlx5e_stats_grp_sw_update_stats · 8772cc49

Saeed Mahameed authored Mar 11, 2022

We need to sync page pool stats only for active channels. Reading ethtool
stats on a down netdev or a netdev with a modified number of channels will
result in a use-after-free, trying to access page pools that are freed
already.

BUG: KASAN: use-after-free in mlx5e_stats_grp_sw_update_stats+0x465/0xf80
Read of size 8 at addr ffff888004835e40 by task ethtool/720

Fixes: cc10e84b ("mlx5: add support for page_pool_get_stats")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/20220312005353.786255-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4_en: use kzalloc · 3c2dfb73

Julia Lawall authored Mar 12, 2022

Use kzalloc instead of kmalloc + memset.
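
As a userspace analogy only (malloc/calloc stand in for kmalloc/kzalloc,
and the toy_* function names are hypothetical), the before and after
patterns look like this:

```c
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear in a separate step. */
void *toy_alloc_then_clear(size_t size)
{
    void *res = malloc(size);

    if (res)
        memset(res, 0, size);
    return res;
}

/* After: a single zeroing allocation. */
void *toy_zalloc(size_t size)
{
    return calloc(1, size);
}
```

Both return zero-filled memory; the second form has one call to get right
instead of two that must agree on the size.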

The semantic patch that makes this change is:
(https://coccinelle.gitlabpages.inria.fr/website/)

//<smpl>
@@
expression res, size, flag;
@@
- res = kmalloc(size, flag);
+ res = kzalloc(size, flag);
  ...
- memset(res, 0, size);
//</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220312102705.71413-3-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: disable preemption in dev_core_stats_XXX_inc() helpers · fc93db15

Eric Dumazet authored Mar 12, 2022

syzbot was kind enough to remind us that dev->{tx_dropped|rx_dropped}
could be increased in process context.

BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor413/3593
caller is netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298
CPU: 1 PID: 3593 Comm: syz-executor413 Not tainted 5.17.0-rc7-syzkaller-02426-g97aeb877 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 check_preemption_disabled+0x16b/0x170 lib/smp_processor_id.c:49
 netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298
 dev_core_stats include/linux/netdevice.h:3855 [inline]
 dev_core_stats_rx_dropped_inc include/linux/netdevice.h:3866 [inline]
 tun_get_user+0x3455/0x3ab0 drivers/net/tun.c:1800
 tun_chr_write_iter+0xe1/0x200 drivers/net/tun.c:2015
 call_write_iter include/linux/fs.h:2074 [inline]
 new_sync_write+0x431/0x660 fs/read_write.c:503
 vfs_write+0x7cd/0xae0 fs/read_write.c:590
 ksys_write+0x12d/0x250 fs/read_write.c:643
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2cf4f887e3
Code: 5d 41 5c 41 5d 41 5e e9 9b fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
RSP: 002b:00007ffd50dd5fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffd50dd6000 RCX: 00007f2cf4f887e3
RDX: 000000000000002a RSI: 0000000000000000 RDI: 00000000000000c8
RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd50dd5ff0 R14: 00007ffd50dd5fe8 R15: 00007ffd50dd5fe4
 </TASK>

Fixes: 625788b5 ("net: add per-cpu storage and net->core_stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: jeffreyji <jeffreyji@google.com>
Cc: Brian Vazquez <brianvv@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220312214505.3294762-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: packetengines: fix typos in comments · ebc0b8b5

Julia Lawall authored Mar 14, 2022

Various spelling mistakes in comments.
Detected with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220314115354.144023-13-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dpaa2-mac-protocol-change' · 5e7350e8

David S. Miller authored Mar 14, 2022

Ioana Ciornei says:

====================
dpaa2-mac: add support for changing the protocol at runtime

This patch set adds support for changing the Ethernet protocol at
runtime on Layerscape SoCs which have the Lynx 28G SerDes block.

The first two patches add a new generic PHY driver for the Lynx 28G and
the associated bindings file. The driver reads the PLL configuration at
probe time (the frequency provided to the lanes) and determines what
protocols can be supported.
Based on this the driver can deny or approve a request from the
dpaa2-mac to set up a new protocol.

The next 2 patches add some MC APIs for querying the running
firmware version and for setting up a new protocol on the MAC.

Moving along, we extract the code for setting up the supported
interfaces on a MAC into a separate function, since the next patches
update that logic.

In the next patch, the dpaa2-mac is updated so that it retrieves the
SerDes PHY based on the OF node and, in case of a major reconfig, calls
the PHY driver to set up the new protocol on the associated lane and the
MC firmware to reconfigure the MAC side of things.

Finally, the LX2160A dtsi is annotated with the SerDes PHY nodes for the
1st SerDes block. Besides this, the LX2160A Clearfog dtsi is annotated
with the 'phys' property for the exposed SFP cages.

Changes in v2:
	- 1/8: add MODULE_LICENSE
Changes in v3:
	- 2/8: fix 'make dt_binding_check' errors
	- 7/8: reverse order of dpaa2_mac_start() and phylink_start()
	- 7/8: treat all RGMII variants in dpmac_eth_if_mode
	- 7/8: remove the .mac_prepare callback
	- 7/8: ignore PHY_INTERFACE_MODE_NA in validate
Changes in v4:
	- 1/8: remove the DT nodes parsing
	- 1/8: add an xlate function
	- 2/8: remove the children phy nodes for each lane
	- 7/8: rework the of_phy_get if statement
	- 8/8: remove the DT nodes for each lane and the lane id in the
	  phys phandle
Changes in v5:
	- 2/8: use phy as the name of the DT node in the example
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

arch: arm64: dts: lx2160a: describe the SerDes block #1 · 3cbe93a1

Ioana Ciornei authored Mar 11, 2022

Describe the SerDes block #1 using the generic phys infrastructure. This
way, the ethernet nodes can each reference their serdes lanes
individually using the 'phys' dts property.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-mac: configure the SerDes phy on a protocol change · f978fe85

Ioana Ciornei authored Mar 11, 2022

This patch integrates the dpaa2-eth driver with the generic PHY
infrastructure in order to search, find and reconfigure the SerDes lanes
in case of a protocol change.

On the .mac_config() callback, the phy_set_mode_ext() API is called so
that the Lynx 28G SerDes PHY driver can change the lane's configuration.
In the same phylink callback the MC firmware is called so that it
reconfigures the MAC side to run using the new protocol.

The consumer drivers - dpaa2-eth and dpaa2-switch - are updated to call
the newly added dpaa2_mac_start/stop functions, which will
power_on/power_off the associated SerDes lane.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-mac: move setting up supported_interfaces into a function · aa95c371

Ioana Ciornei authored Mar 11, 2022

The logic to set up the supported interfaces will get annotated based on
what the configuration of the SerDes PLLs supports. Move the current
setup into a separate function just to try to keep it clean.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-mac: retrieve API version and detect features · dff95381

Ioana Ciornei authored Mar 11, 2022

Retrieve the API version running on the firmware and based on it detect
which features are available for usage.
The first one to be listed is the capability to change the MAC protocol
at runtime.
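
A minimal userspace sketch of version-gated feature detection as described
above. All names here (api_version, detect_features,
FEATURE_PROTOCOL_CHANGE) and the 4.8 threshold are hypothetical placeholders
chosen for illustration; they are not taken from the driver or the firmware
documentation.

```c
#include <stdint.h>

/* Hypothetical minimum API version that offers runtime protocol change. */
#define PROTO_CHANGE_VER_MAJOR 4
#define PROTO_CHANGE_VER_MINOR 8

/* Hypothetical feature bit. */
#define FEATURE_PROTOCOL_CHANGE (1u << 0)

struct api_version {
	uint16_t major;
	uint16_t minor;
};

/* Return <0, 0 or >0 when a is older than, equal to or newer than b. */
int api_version_cmp(struct api_version a, struct api_version b)
{
	if (a.major != b.major)
		return a.major < b.major ? -1 : 1;
	if (a.minor != b.minor)
		return a.minor < b.minor ? -1 : 1;
	return 0;
}

/* Build the feature mask from the version reported by the firmware. */
uint32_t detect_features(struct api_version fw)
{
	const struct api_version need = {
		PROTO_CHANGE_VER_MAJOR, PROTO_CHANGE_VER_MINOR
	};
	uint32_t features = 0;

	if (api_version_cmp(fw, need) >= 0)
		features |= FEATURE_PROTOCOL_CHANGE;

	return features;
}
```

Comparing major before minor means a firmware reporting 5.0 is treated as
newer than 4.8, while 4.7 and 3.9 are both treated as too old.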
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-mac: add the MC API for reconfiguring the protocol · 332b9ea5

Ioana Ciornei authored Mar 11, 2022

The MC firmware recently gained a new command which can reconfigure the
running protocol on the underlying MAC. Add this new command, which will
be used in the next patches to do a major reconfig on the interface.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-mac: add the MC API for retrieving the version · 38d28b02

Ioana Ciornei authored Mar 11, 2022

The dpmac_get_api_version command will be used in the next patches to
determine whether the current firmware is capable of changing the
Ethernet protocol running on the MAC.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: phy: add bindings for Lynx 28G PHY · c553f22e

Ioana Ciornei authored Mar 11, 2022

Add device tree binding for the Lynx 28G SerDes PHY driver used on
Layerscape based SoCs.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy: add support for the Layerscape SerDes 28G · 8f73b37c

Ioana Ciornei authored Mar 11, 2022

This patch adds a new generic PHY driver to support the Lynx 28G SerDes
block found on some of the Layerscape SoCs such as LX2160A.
At the moment, only the following Ethernet protocols are supported:
SGMII/1000Base-X and 10GBaseR.

SerDes lanes which are not running an Ethernet protocol, or which run an
Ethernet protocol that is not currently supported, will be left as they
were configured through the RCW (Reset Configuration Word) at boot time.

At probe time, the platform driver will read the current
configuration of both PLLs found on a SerDes block and will determine
what protocols are supported using that PLL.

For example, if a PLL is configured to generate a clock net (frate) of
5GHz, the only protocols sustained by that PLL are SGMII/1000Base-X
(using a quarter of the full clock rate) and QSGMII (using the full clock
net frequency on the lane).

On the .set_mode() callback, the PHY driver will first check if the
requested operating mode (protocol) is even supported by the current PLL
configuration and will error out if not.
Then, the lane is reconfigured to run on the requested protocol.
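
The probe-time PLL inspection and the .set_mode() check described above can
be sketched in plain C roughly as follows. This is an illustration under
stated assumptions, not the driver's code: the type and function names
(pll_cfg, pll_read_configuration, lane_set_mode) are hypothetical, the
-EOPNOTSUPP return value is an assumption, and the 10.3125GHz case relies on
the 10GBASE-R line rate rather than on anything stated in this commit
message. QSGMII is left out because the message lists only SGMII/1000Base-X
and 10GBaseR as supported.

```c
#include <errno.h>
#include <stddef.h>

enum lane_proto { LANE_SGMII, LANE_1000BASEX, LANE_10GBASER };

#define PROTO_BIT(p) (1u << (p))

struct pll_cfg {
	unsigned long frate_khz;  /* clock net frequency found at probe time */
	unsigned int supported;   /* bitmask of lane_proto values */
};

/* Derive the set of protocols a PLL can sustain from its frequency. */
void pll_read_configuration(struct pll_cfg *pll)
{
	pll->supported = 0;

	switch (pll->frate_khz) {
	case 5000000:   /* 5GHz: SGMII/1000Base-X at a quarter of the rate */
		pll->supported |= PROTO_BIT(LANE_SGMII) |
				  PROTO_BIT(LANE_1000BASEX);
		break;
	case 10312500:  /* 10.3125GHz: 10GBASE-R line rate (assumption) */
		pll->supported |= PROTO_BIT(LANE_10GBASER);
		break;
	default:
		break;
	}
}

/* Find a PLL able to sustain the protocol, or NULL if there is none. */
const struct pll_cfg *pll_for_proto(const struct pll_cfg *plls, size_t n,
				    enum lane_proto proto)
{
	for (size_t i = 0; i < n; i++)
		if (plls[i].supported & PROTO_BIT(proto))
			return &plls[i];

	return NULL;
}

/* Reject the request before touching the lane if no PLL supports it. */
int lane_set_mode(const struct pll_cfg *plls, size_t n,
		  enum lane_proto proto, enum lane_proto *lane)
{
	if (!pll_for_proto(plls, n, proto))
		return -EOPNOTSUPP;

	*lane = proto;  /* stand-in for reprogramming the lane registers */
	return 0;
}
```

With both PLLs at 5GHz, a request for 10GBaseR is refused and the lane keeps
its previous protocol; it succeeds once one PLL runs at 10.3125GHz.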
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-felix-qos' · 92ebb236

David S. Miller authored Mar 14, 2022

Vladimir Oltean says:

====================
Basic QoS classification on Felix DSA switch using dcbnl

Basic QoS classification for Ocelot switches means port-based default
priority, DSCP-based and VLAN PCP based. This is opposed to advanced QoS
classification which is done through the VCAP IS1 TCAM based engine.

The patch set is a logical continuation of this RFC which attempted to
describe the default-prio as a matchall entry placed at the end of a
series of offloaded tc filters:
https://patchwork.kernel.org/project/netdevbpf/cover/20210113154139.1803705-1-olteanv@gmail.com/

I have tried my best to satisfy the feedback that we should cater for
pre-configured QoS profiles. Ironically, the only pre-configured QoS
profile that the Felix switch driver has is for VLAN PCP (1:1 mapping
with QoS class), yet IEEE 802.1Q or dcbnl offer no mechanism for
reporting or changing that.

Testing was done with the iproute2 dcb app. The qos_class of packets was
dumped from net/dsa/tag_ocelot.c.

(1) $ dcb app show dev swp3
default-prio 0
(2) $ dcb app replace dev swp3 default-prio 3
(3) $ dcb app replace dev swp3 dscp-prio CS3:5
(4) $ dcb app replace dev swp3 dscp-prio CS2:2
(5) $ dcb app show dev swp3
default-prio 3
dscp-prio CS2:2 CS3:5

Traffic was sent with "ping -Q 64 <ipaddr>", which means CS2.
These packets match qos_class 0 after command (1),
qos_class 3 after command (2),
qos_class 3 after command (3), and
qos_class 2 after command (4).
====================
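
The test sequence in the cover letter above can be replayed with a small
userspace model. It is only a sketch: port_qos, classify and the other names
are hypothetical, and VLAN PCP classification is ignored on the assumption
that the ping traffic is untagged. The model uses two facts: the DSCP is the
upper six bits of the IPv4 TOS byte, so "ping -Q 64" carries DSCP 16, and
class selector CSn is DSCP n * 8, so DSCP 16 is CS2.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_DSCP 64

struct port_qos {
	uint8_t default_prio;         /* port-based default QoS class */
	bool dscp_trusted[NUM_DSCP];  /* app table entry present for DSCP? */
	uint8_t dscp_prio[NUM_DSCP];  /* QoS class of a trusted DSCP */
};

/* The DSCP occupies the upper six bits of the TOS byte. */
uint8_t tos_to_dscp(uint8_t tos)
{
	return tos >> 2;
}

/* Class selector CSn corresponds to DSCP value n * 8. */
uint8_t dscp_cs(unsigned int n)
{
	return (uint8_t)(n << 3);
}

/* Model of "dcb app replace dev ... dscp-prio <dscp>:<prio>". */
void set_dscp_prio(struct port_qos *p, uint8_t dscp, uint8_t prio)
{
	p->dscp_trusted[dscp] = true;
	p->dscp_prio[dscp] = prio;
}

/* A trusted DSCP wins; otherwise fall back to the port default. */
uint8_t classify(const struct port_qos *p, uint8_t tos)
{
	uint8_t dscp = tos_to_dscp(tos);

	if (p->dscp_trusted[dscp])
		return p->dscp_prio[dscp];

	return p->default_prio;
}
```

Starting from an all-zero port_qos, setting default_prio to 3, then adding
CS3:5 and finally CS2:2 yields classes 0, 3, 3 and 2 for TOS 64. The CS3
entry has no effect on this traffic because the packets carry CS2.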
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: configure default-prio and dscp priorities · 978777d0

Vladimir Oltean authored Mar 11, 2022

Follow the established programming model for this driver and provide
shims in the felix DSA driver which call the implementations from the
ocelot switch lib. The ocelot switchdev driver wasn't integrated with
dcbnl due to lack of hardware availability.

The switch doesn't have any fancy QoS classification enabled by default.
The provided getters will create a default-prio app table entry of 0,
and no dscp entry. However, the getters have been made to actually
retrieve the hardware configuration rather than static values, to be
future proof in case DSA will need this information from more call paths.

For default-prio, there is a single field per port, in ANA_PORT_QOS_CFG,
called QOS_DEFAULT_VAL.

DSCP classification is enabled per-port, again via ANA_PORT_QOS_CFG
(field QOS_DSCP_ENA), and individual DSCP values are configured as
trusted or not through register ANA_DSCP_CFG (replicated 64 times).
An untrusted DSCP value falls back to other QoS classification methods.
If trusted, the selected ANA_DSCP_CFG register also holds the QoS class
in the QOS_DSCP_VAL field.

The hardware also supports DSCP remapping (DSCP value X is translated to
DSCP value Y before the QoS class is determined based on the app table
entry for Y) and DSCP packet rewriting. The dcbnl framework, for being
so flexible in other useless areas, doesn't appear to support this.
So this functionality has been left out.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: report and change port dscp priority using dcbnl · 47d75f78

Vladimir Oltean authored Mar 11, 2022

Similar to the port-based default priority, IEEE 802.1Q-2018 allows the
Application Priority Table to define QoS classes (0 to 7) per IP DSCP
value (0 to 63).

In the absence of an app table entry for a packet with DSCP value X,
QoS classification for that packet falls back to other methods (VLAN PCP
or port-based default). The presence of an app table entry for DSCP value X
with priority Y makes the hardware classify the packet to QoS class Y.

As opposed to the default-prio where DSA exposes only a "set" in
dsa_switch_ops (because the port-based default is the fallback, it
always exists, either implicitly or explicitly), for DSCP priorities we
expose an "add" and a "del". The addition of a DSCP entry means trusting
that DSCP priority, the deletion means ignoring it.

Drivers that already trust (at least some) DSCP values can describe
their configuration in dsa_switch_ops :: port_get_dscp_prio(), which is
called for each DSCP value from 0 to 63.

Again, there can be more than one dcbnl app table entry for the same
DSCP value; DSA chooses the one with the largest configured priority.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: report and change port default priority using dcbnl · d538eca8

Vladimir Oltean authored Mar 11, 2022

The port-based default QoS class is assigned to packets that lack a
VLAN PCP (or the port is configured to not trust the VLAN PCP),
an IP DSCP (or the port is configured to not trust IP DSCP), and packets
on which no tc-skbedit action has matched.

Similar to other drivers, this can be exposed to user space using the
DCB Application Priority Table. IEEE 802.1Q-2018 specifies in Table
D-8 - Sel field values that when the Selector is 1, the Protocol ID
value of 0 denotes the "Default application priority. For use when
application priority is not otherwise specified."

The way in which the dcbnl integration in DSA has been designed has to
do with its requirements. Andrew Lunn explains that SOHO switches are
expected to come with some sort of pre-configured QoS profile, and that
it is desirable for this to come pre-loaded into the DSA slave interfaces'
DCB application priority table.

In the dcbnl design, this is possible because calls to dcb_ieee_setapp()
can be initiated by anyone, including by the device driver itself.

However, what makes this challenging to implement in DSA is that the DSA
core manages the net_devices (effectively hiding them from drivers),
while drivers manage the hardware. The DSA core has no knowledge of what
individual drivers' QoS policies are. DSA could export to drivers a
wrapper over dcb_ieee_setapp() and these could call that function to
pre-populate the app priority table, however drivers don't have a good
moment in time to do this. The dsa_switch_ops :: setup() method gets
called before the net_devices are created (dsa_slave_create), and so is
dsa_switch_ops :: port_setup(). What remains is dsa_switch_ops ::
port_enable(), but this gets called upon each ndo_open. If we add app
table entries on every open, we'd need to remove them on close, to avoid
duplicate entry errors. But if we delete app priority entries on close,
what we delete may not be the initial, driver pre-populated entries, but
rather user-added entries.

So it is clear that letting drivers choose the timing of the
dcb_ieee_setapp() call is inappropriate. The alternative which was
chosen is to introduce hardware-specific ops in dsa_switch_ops, and
effectively hide dcbnl details from drivers as well. For pre-populating
the application table, dsa_slave_dcbnl_init() will call
ds->ops->port_get_default_prio() which is supposed to read from
hardware. If the operation succeeds, DSA creates a default-prio app
table entry. The method is called as soon as the slave_dev is
registered, but before we release the rtnl_mutex. This is done such that
user space sees the app table entries as soon as it sees the interface
being registered.

The fact that we populate slave_dev->dcbnl_ops with a non-NULL pointer
changes behavior in dcb_doit() from net/dcb/dcbnl.c, which used to
return -EOPNOTSUPP for any dcbnl operation where netdev->dcbnl_ops is
NULL. Because there are still dcbnl-unaware DSA drivers even if they
have dcbnl_ops populated, the way to restore the behavior is to make all
dcbnl_ops return -EOPNOTSUPP on absence of the hardware-specific
dsa_switch_ops method.
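
The pattern described above, where a shim that is always installed forwards
to an optional driver method and reports -EOPNOTSUPP when that method is
absent, can be sketched in userspace C as follows. The structures and
function names are simplified, hypothetical stand-ins and do not reproduce
the real dsa_switch_ops or dcbnl_rtnl_ops signatures.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the hardware-specific driver ops. */
struct switch_ops {
	int (*port_set_default_prio)(int port, unsigned char prio);
};

struct demo_switch {
	const struct switch_ops *ops;
};

/* Hypothetical dcbnl-aware driver: records what it was asked to program. */
int demo_last_prio = -1;

int demo_port_set_default_prio(int port, unsigned char prio)
{
	(void)port;
	demo_last_prio = prio;
	return 0;
}

/*
 * Shim that exists for every switch. A driver without the method gets
 * the same -EOPNOTSUPP that user space saw when no dcbnl ops were
 * installed at all.
 */
int slave_set_default_prio(const struct demo_switch *ds, int port,
			   unsigned char prio)
{
	if (!ds->ops->port_set_default_prio)
		return -EOPNOTSUPP;

	return ds->ops->port_set_default_prio(port, prio);
}
```

A switch whose ops leave port_set_default_prio as NULL returns -EOPNOTSUPP
from the shim, while one that fills it in has the call forwarded.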

The dcbnl framework absurdly allows there to be more than one app table
entry for the same selector and protocol (in other words, more than one
port-based default priority). In the iproute2 dcb program, there is a
"replace" syntactical sugar command which performs an "add" and a "del"
to hide this away. But we choose the largest configured priority when we
call ds->ops->port_set_default_prio(), using __fls(). When there is no
default-prio app table entry left, the port-default priority is restored
to 0.
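
A sketch of the selection rule above, assuming the default-prio app table
entries are summarised as a bitmask in which bit N is set when an entry with
priority N exists. highest_set_bit() is a userspace stand-in for the kernel's
__fls(), and effective_default_prio() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Index of the most significant set bit. Like __fls(), the result is
 * only meaningful for a non-zero mask.
 */
unsigned int highest_set_bit(uint32_t mask)
{
	unsigned int pos = 0;

	while ((mask >>= 1) != 0)
		pos++;

	return pos;
}

/*
 * Pick the largest configured priority, or fall back to 0 when no
 * default-prio entry is left in the app table.
 */
unsigned int effective_default_prio(uint8_t prio_mask)
{
	if (!prio_mask)
		return 0;

	return highest_set_bit(prio_mask);
}
```

With entries for priorities 3 and 5 present (mask 0x28) the port default is
5; once both are deleted the mask is 0 and the default returns to 0.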

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20210113154139.1803705-2-olteanv@gmail.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
