Commits · ec5ae61076d07be986df19773662506220757c9f · Kirill Smelkov / linux

12 May, 2020 21 commits

net: dsa: sja1105: save/restore VLANs using a delta commit method · ec5ae610

Vladimir Oltean authored May 12, 2020

Managing the VLAN table that is present in hardware will become very
difficult once we add a third operating state
(best_effort_vlan_filtering). That is because correct cleanup (not too
little, not too much) becomes virtually impossible, when VLANs can be
added from the bridge layer, from dsa_8021q for basic tagging, for
cross-chip bridging, as well as retagging rules for sub-VLANs and
cross-chip sub-VLANs. So we need to rethink VLAN interaction with the
switch in a more scalable way.

In preparation for that, use the priv->expect_dsa_8021q boolean to
classify any VLAN request received through .port_vlan_add or
.port_vlan_del towards either one of 2 internal lists: bridge VLANs and
dsa_8021q VLANs.

Then, implement a central sja1105_build_vlan_table method that creates a
VLAN configuration from scratch based on the 2 lists of VLANs kept by
the driver, and based on the VLAN awareness state. Currently, if we are
VLAN-unaware, install the dsa_8021q VLANs, otherwise the bridge VLANs.

Then, implement a delta commit procedure that identifies which VLANs
from this new configuration are actually different from the config
previously committed to hardware. We apply the delta through the dynamic
configuration interface (we don't reset the switch). The result is that
the hardware should see the exact sequence of operations as before this
patch.

This also helps remove the "br" argument passed to
dsa_8021q_crosschip_bridge_join, which it was only using to figure out
whether it should commit the configuration back to us or not, based on
the VLAN awareness state of the bridge. We can simplify that, by always
allowing those VLANs inside of our dsa_8021q_vlans list, and committing
those to hardware when necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec5ae610

net: dsa: sja1105: deny alterations of dsa_8021q VLANs from the bridge · 60b33aeb

Vladimir Oltean authored May 12, 2020

At the moment, this can never happen. The 2 modes that we operate in do
not permit that:

 - SJA1105_VLAN_UNAWARE: we are guarded from bridge VLANs added by the
   user by the DSA core. We will later lift this restriction by setting
   ds->vlan_bridge_vtu = true, and that is where we'll need it.

 - SJA1105_VLAN_FILTERING_FULL: in this mode, dsa_8021q configuration is
   disabled. So the user is free to add these VLANs in the 1024-3071
   range.

The reason for the patch is that we'll introduce a third VLAN awareness
state, where both dsa_8021q as well as the bridge are going to call our
.port_vlan_add and .port_vlan_del methods.

For that, we need a good way to discriminate between the 2. The easiest
(and less intrusive way for upper layers) is to recognize the fact that
dsa_8021q configurations are always driven by our driver - we _know_
when a .port_vlan_add method will be called from dsa_8021q because _we_
initiated it.

So introduce an expect_dsa_8021q boolean which is only used, at the
moment, for blacklisting VLANs in range 1024-3071 in the modes when
dsa_8021q is active.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60b33aeb

net: dsa: sja1105: keep the VLAN awareness state in a driver variable · 7f14937f

Vladimir Oltean authored May 12, 2020

Soon we'll add a third operating mode to the driver. Introduce a
vlan_state to make things more easy to manage, and use it where
applicable.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f14937f

net: dsa: tag_8021q: introduce a vid_is_dsa_8021q helper · 1f66b0f0

Vladimir Oltean authored May 12, 2020

This function returns a boolean denoting whether the VLAN passed as
argument is part of the 1024-3071 range that the dsa_8021q tagging
scheme uses.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f66b0f0

net: dsa: provide an option for drivers to always receive bridge VLANs · 54a0ed0d

Russell King authored May 12, 2020

DSA assumes that a bridge which has vlan filtering disabled is not
vlan aware, and ignores all vlan configuration. However, the kernel
software bridge code allows configuration in this state.

This causes the kernel's idea of the bridge vlan state and the
hardware state to disagree, so "bridge vlan show" indicates a correct
configuration but the hardware lacks all configuration. Even worse,
enabling vlan filtering on a DSA bridge immediately blocks all traffic
which, given the output of "bridge vlan show", is very confusing.

Provide an option that drivers can set to indicate they want to receive
vlan configuration even when vlan filtering is disabled. At the very
least, this is safe for Marvell DSA bridges, which do not look up
ingress traffic in the VTU if the port is in 8021Q disabled state. It is
also safe for the Ocelot switch family. Whether this change is suitable
for all DSA bridges is not known.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54a0ed0d

Merge branch 'sfc-siena_check_caps-fixups' · 26831d78

David S. Miller authored May 12, 2020

Edward Cree says:

====================
sfc: siena_check_caps fixups

Fix a bug and a build warning introduced in a recent refactor.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

26831d78

sfc: siena_check_caps() can be static · 1b0cde40

Edward Cree authored May 12, 2020

Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1b0cde40

sfc: actually wire up siena_check_caps() · 527c1e61

Edward Cree authored May 12, 2020

Assign it to siena_a0_nic_type.check_caps function pointer.

Fixes: be904b85 ("sfc: make capability checking a nic_type function")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

527c1e61

dt-bindings: net: Convert UniPhier AVE4 controller to json-schema · 966a5c08

Kunihiko Hayashi authored May 12, 2020

Convert the UniPhier AVE4 controller binding to DT schema format.
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

966a5c08

Merge branch 'ionic-updates' · 92a84c78

David S. Miller authored May 12, 2020

Shannon Nelson says:

====================
ionic updates

This set of patches is a bunch of code cleanup, a little
documentation, longer tx sg lists, more ethtool stats,
and a couple more transceiver types.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

92a84c78

ionic: update doc files · 7c7b58ec

Shannon Nelson authored May 11, 2020

Update the basic doc file with some configuration hints and a
little bit of stats information.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c7b58ec

ionic: add more ethtool stats · f64e0c56

Shannon Nelson authored May 11, 2020

Add hardware port stats and a few more driver collected
statistics to the ethtool stats output.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

f64e0c56

ionic: more ionic name tweaks · c06107ca

Shannon Nelson authored May 11, 2020

Fix up a few more local names that need an "ionic" prefix.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

c06107ca

ionic: ionic_intr_free parameter change · 36ac2c50

Shannon Nelson authored May 11, 2020

Change the ionic_intr_free parameter from struct ionic_lif to
struct ionic since that's what it actually cares about.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

36ac2c50

ionic: reset device at probe · 5c784311

Shannon Nelson authored May 11, 2020

Once we're talking to the device, tell it to reset to
be sure we've got a fresh, clean environment.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c784311

ionic: shorter dev cmd wait time · 62ba8766

Shannon Nelson authored May 11, 2020

Shorten our msleep time while polling for the dev command
request to finish.  Yes, checkpatch.pl complains that the
msleep might actually go longer - that won't hurt, but we'll
take the shorter time if we can get it.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

62ba8766

ionic: add support for more xcvr types · cba155d5

Shannon Nelson authored May 11, 2020

Add a couple more SFP and QSFP transceiver types to our
ethtool get link ksettings.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

cba155d5

ionic: protect vf calls from fw reset · a836c352

Shannon Nelson authored May 11, 2020

When going into a firmware upgrade cycle, we set the device as
not present to keep some user commands from trying to change
the driver while we're only half there.  Unfortunately, the
ndo_vf_* calls don't check netif_device_present() so we need
to add a check in the callbacks.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

a836c352

ionic: updates to ionic FW api description · c4e7a75a

Shannon Nelson authored May 11, 2020

Lots of comment cleanup for better documentation, a few new
fields added, and a few minor mistakes fixed up.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4e7a75a

ionic: support longer tx sg lists · 5b3f3f2a

Shannon Nelson authored May 11, 2020

The version 1 Tx queues can use longer SG lists than the
original version 0 queues, but we need to check to see if the
firmware supports the v1 Tx queues.  This implements the queue
type query for all queue types, and uses the information to
set up for using the longer Tx SG lists.

Because the Tx SG list can be longer, we need to limit the
max ring length to be sure we stay inside the boundaries of a
DMA allocation max size, so we lower the max Tx ring size.

The driver sets its highest known version in the Q_IDENTITY
command, and the FW returns the highest version that it knows,
bounded by the driver's version.  The negotiated version number
is later used in the Q_INIT commands.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b3f3f2a

checkpatch: warn about uses of ENOTSUPP · 6b9ea5ff

Jakub Kicinski authored May 11, 2020

ENOTSUPP often feels like the right error code to use, but it's
in fact not a standard Unix error. E.g.:

$ python
>>> import errno
>>> errno.errorcode[errno.ENOTSUPP]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'errno' has no attribute 'ENOTSUPP'

There were numerous commits converting the uses back to EOPNOTSUPP
but in some cases we are stuck with the high error code for backward
compatibility reasons.

Let's try prevent more ENOTSUPPs from getting into the kernel.

Recent example:
https://lore.kernel.org/netdev/20200510182252.GA411829@lunn.ch/

v3 (Joe):
 - fix the "not file" condition.

v2 (Joe):
 - add a link to recent discussion,
 - don't match when scanning files, not patches to avoid sudden
   influx of conversion patches.
https://lore.kernel.org/netdev/20200511165319.2251678-1-kuba@kernel.org/

v1:
https://lore.kernel.org/netdev/20200510185148.2230767-1-kuba@kernel.org/Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b9ea5ff

11 May, 2020 19 commits

Merge branch 'improve-msg_control-kernel-vs-user-pointer-handling' · 97cf0ef9

David S. Miller authored May 11, 2020

Christoph Hellwig says:

====================
improve msg_control kernel vs user pointer handling

this series replace the msg_control in the kernel msghdr structure
with an anonymous union and separate fields for kernel vs user
pointers.  In addition to helping a bit with type safety and reducing
sparse warnings, this also allows to remove the set_fs() in
kernel_recvmsg, helping with an eventual entire removal of set_fs().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

97cf0ef9

net: cleanly handle kernel vs user buffers for ->msg_control · 1f466e1f

Christoph Hellwig authored May 11, 2020

The msg_control field in struct msghdr can either contain a user
pointer when used with the recvmsg system call, or a kernel pointer
when used with sendmsg.  To complicate things further kernel_recvmsg
can stuff a kernel pointer in and then use set_fs to make the uaccess
helpers accept it.

Replace it with a union of a kernel pointer msg_control field, and
a user pointer msg_control_user one, and allow kernel_recvmsg operate
on a proper kernel pointer using a bitfield to override the normal
choice of a user pointer for recvmsg.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f466e1f

net/scm: cleanup scm_detach_fds · 2618d530

Christoph Hellwig authored May 11, 2020

Factor out two helpes to keep the code tidy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

2618d530

net: add a CMSG_USER_DATA macro · 0462b6bd

Christoph Hellwig authored May 11, 2020

Add a variant of CMSG_DATA that operates on user pointer to avoid
sparse warnings about casting to/from user pointers.  Also fix up
CMSG_DATA to rely on the gcc extension that allows void pointer
arithmetics to cut down on the amount of casts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0462b6bd

Merge branch 'net-dsa-Constify-two-tagger-ops' · 3242956b

David S. Miller authored May 11, 2020

Florian Fainelli says:

====================
net: dsa: Constify two tagger ops

This patch series constifies the dsa_device_ops for ocelot and sja1105
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3242956b

net: dsa: tag_sja1105: Constify dsa_device_ops · 097f0244

Florian Fainelli authored May 11, 2020

sja1105_netdev_ops should be const since that is what the DSA layer
expects.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

097f0244

net: dsa: ocelot: Constify dsa_device_ops · 2fa3888b

Florian Fainelli authored May 11, 2020

ocelot_netdev_ops should be const since that is what the DSA layer
expects.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2fa3888b

Merge branch 'sfc-remove-nic_data-usage-in-common-code' · 9b1b31d5

David S. Miller authored May 11, 2020

Edward Cree says:

====================
sfc: remove nic_data usage in common code

efx->nic_data should only be used from NIC-specific code (i.e. nic_type
 functions and things they call), in files like ef10[_sriov].c and
 siena.c.  This series refactors several nic_data usages from common
 code (mainly in mcdi_filters.c) into nic_type functions, in preparation
 for the upcoming ef100 driver which will use those functions but have
 its own struct layout for efx->nic_data distinct from ef10's.
After this series, one nic_data usage (in ptp.c) remains; it wasn't
 clear to me how to fix it, and ef100 devices don't yet have PTP support
 (so the initial ef100 driver will not call that code).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9b1b31d5

sfc: make firmware-variant printing a nic_type function · 9b46132c

Edward Cree authored May 11, 2020

Instead of having efx_mcdi_print_fwver() look at efx_nic_rev and
 conditionally poke around inside ef10-specific nic_data, add a new
 efx->type->print_additional_fwver() method to do this work.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b46132c

sfc: make filter table probe caller responsible for adding VLANs · ed02112c

Edward Cree authored May 11, 2020

By making the caller of efx_mcdi_filter_table_probe() loop over the
 vlan_list calling efx_mcdi_filter_add_vlan(), instead of doing it in
 efx_mcdi_filter_table_probe(), the latter avoids looking in ef10-
 specific nic_data.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed02112c

sfc: move rx_rss_context_exclusive into struct efx_mcdi_filter_table · dbf2c669

Edward Cree authored May 11, 2020

It's both set and used solely by mcdi_filters.c, so there's no reason
 for it to be in ef10-specific nic_data.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbf2c669

sfc: rework handling of (firmware) multicast chaining state · fd14e5fd

Edward Cree authored May 11, 2020

Store the mc_chaining bit in struct efx_mcdi_filter_table, so that common
code in mcdi_filters.c doesn't need to get it from ef10-specific nic_data.
Also, probe the firmware workaround just before the call to
efx_mcdi_filter_table_probe(), rather than in a random other part of the
driver bringup, to ensure that (a) it gets probed in time and (b) it gets
reprobed as necessary on resets, no matter how the surrounding code gets
reorganised and reordered.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fd14e5fd

sfc: move 'must restore' flags out of ef10-specific nic_data · e4fe938c

Edward Cree authored May 11, 2020

Common code in mcdi_filters.c uses these flags, so by moving them to
 either struct efx_nic (in the case of must_realloc_vis) or struct
 efx_mcdi_filter_table (for must_restore_rss_contexts and
 must_restore_filters), decouple this code from ef10's nic_data.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4fe938c

sfc: use efx_has_cap for capability checks outside of NIC-specific code · 484a75b1

Edward Cree authored May 11, 2020

Removes some efx_ef10_nic_data references from common code.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

484a75b1

sfc: make capability checking a nic_type function · be904b85

Tom Zhao authored May 11, 2020

Various MCDI functions (especially in filter handling) need to check the
datapath caps, but those live in nic_data (since they don't exist on
Siena). Decouple from ef10-specific data structures by adding check_caps
to the nic_type, to allow using these functions from non-ef10 drivers.

Also add a convenience macro efx_has_cap() to reduce the amount of
boilerplate involved in calling it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be904b85

sfc: move vport_id to struct efx_nic · dfcabb07

Edward Cree authored May 11, 2020

Remove some usage of ef10-specific nic_data structs from common MCDI
 functions, in preparation for using them from a non-EF10 driver.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dfcabb07

Merge branch 'net-Optimize-the-qed-allocations-inside-kdump-kernel' · a90f704a

David S. Miller authored May 11, 2020

Bhupesh Sharma says:

====================
net: Optimize the qed* allocations inside kdump kernel

Changes since v1:
----------------
- v1 can be seen here: http://lists.infradead.org/pipermail/kexec/2020-May/024935.html
- Addressed review comments received on v1:
  * Removed unnecessary paranthesis.
  * Used a different macro for minimum RX/TX ring count value in kdump
    kernel.

Since kdump kernel(s) run under severe memory constraint with the
basic idea being to save the crashdump vmcore reliably when the primary
kernel panics/hangs, large memory allocations done by a network driver
can cause the crashkernel to panic with OOM.

The qed* drivers take up approximately 214MB memory when run in the
kdump kernel with the default configuration settings presently used in
the driver. With an usual crashkernel size of 512M, this allocation
is equal to almost half of the total crashkernel size allocated.

See some logs obtained via memstrack tool (see [1]) below:
 dracut-pre-pivot[676]: ======== Report format module_summary: ========
 dracut-pre-pivot[676]: Module qed using 149.6MB (2394 pages), peak allocation 149.6MB (2394 pages)
 dracut-pre-pivot[676]: Module qede using 65.3MB (1045 pages), peak allocation 65.3MB (1045 pages)

This patchset tries to reduce the overall memory allocation profile of
the qed* driver when they run in the kdump kernel. With these
optimization we can see a saving of approx 85M in the kdump kernel:
 dracut-pre-pivot[671]: ======== Report format module_summary: ========
 dracut-pre-pivot[671]: Module qed using 124.6MB (1993 pages), peak allocation 124.7MB (1995 pages)
 <..snip..>
 dracut-pre-pivot[671]: Module qede using 4.6MB (73 pages), peak allocation 4.6MB (74 pages)

And the kdump kernel can save vmcore successfully via both ssh and nfs
interfaces.

This patchset contains two patches:
[PATCH 1/2] - Reduces the default TX and RX ring count in kdump kernel.
[PATCH 2/2] - Disables qed SRIOV feature in kdump kernel (as it is
              normally not a supported kdump target for saving
	      vmcore).

[1]. Memstrack tool: https://github.com/ryncsn/memstrack
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a90f704a

net: qed: Disable SRIOV functionality inside kdump kernel · 37d4f8a6

Bhupesh Sharma authored May 11, 2020

Since we have kdump kernel(s) running under severe memory constraint
it makes sense to disable the qed SRIOV functionality when running the
kdump kernel as kdump configurations on several distributions don't
support SRIOV targets for saving the vmcore (see [1] for example).

Currently the qed SRIOV functionality ends up consuming memory in
the kdump kernel, when we don't really use the same.

An example log seen in the kdump kernel with the SRIOV functionality
enabled can be seen below (obtained via memstrack tool, see [2]):
 dracut-pre-pivot[676]: ======== Report format module_summary: ========
 dracut-pre-pivot[676]: Module qed using 149.6MB (2394 pages), peak allocation 149.6MB (2394 pages)

This patch disables the SRIOV functionality inside kdump kernel and with
the same applied the memory consumption goes down:
 dracut-pre-pivot[671]: ======== Report format module_summary: ========
 dracut-pre-pivot[671]: Module qed using 124.6MB (1993 pages), peak allocation 124.7MB (1995 pages)

[1]. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/installing-and-configuring-kdump_managing-monitoring-and-updating-the-kernel#supported-kdump-targets_supported-kdump-configurations-and-targets
[2]. Memstrack tool: https://github.com/ryncsn/memstrack

Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Ariel Elior <aelior@marvell.com>
Cc: GR-everest-linux-l2@marvell.com
Cc: Manish Chopra <manishc@marvell.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

37d4f8a6

net: qed*: Reduce RX and TX default ring count when running inside kdump kernel · 73e03097

Bhupesh Sharma authored May 11, 2020

Normally kdump kernel(s) run under severe memory constraint with the
basic idea being to save the crashdump vmcore reliably when the primary
kernel panics/hangs.

Currently the qed* ethernet driver ends up consuming a lot of memory in
the kdump kernel, leading to kdump kernel panic when one tries to save
the vmcore via ssh/nfs (thus utilizing the services of the underlying
qed* network interfaces).

An example OOM message log seen in the kdump kernel can be seen here
[1], with crashkernel size reservation of 512M.

Using tools like memstrack (see [2]), we can track the modules taking up
the bulk of memory in the kdump kernel and organize the memory usage
output as per 'highest allocator first'. An example log for the OOM case
indicates that the qed* modules end up allocating approximately 216M
memory, which is a large part of the total crashkernel size:

 dracut-pre-pivot[676]: ======== Report format module_summary: ========
 dracut-pre-pivot[676]: Module qed using 149.6MB (2394 pages), peak allocation 149.6MB (2394 pages)
 dracut-pre-pivot[676]: Module qede using 65.3MB (1045 pages), peak allocation 65.3MB (1045 pages)

This patch reduces the default RX and TX ring count from 1024 to 64
when running inside kdump kernel, which leads to a significant memory
saving.

An example log with the patch applied shows the reduced memory
allocation in the kdump kernel:
 dracut-pre-pivot[674]: ======== Report format module_summary: ========
 dracut-pre-pivot[674]: Module qed using 141.8MB (2268 pages), peak allocation 141.8MB (2268 pages)
 <..snip..>
[dracut-pre-pivot[674]: Module qede using 4.8MB (76 pages), peak allocation 4.9MB (78 pages)

Tested crashdump vmcore save via ssh/nfs protocol using underlying qed*
network interface after applying this patch.

[1] OOM log:
------------

 kworker/0:6: page allocation failure: order:6,
 mode:0x60c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
 kworker/0:6 cpuset=/ mems_allowed=0
 CPU: 0 PID: 145 Comm: kworker/0:6 Not tainted 4.18.0-109.el8.aarch64 #1
 Hardware name: To be filled by O.E.M. Saber/Saber, BIOS 0ACKL025
 01/18/2019
 Workqueue: events work_for_cpu_fn
 Call trace:
  dump_backtrace+0x0/0x188
  show_stack+0x24/0x30
  dump_stack+0x90/0xb4
  warn_alloc+0xf4/0x178
  __alloc_pages_nodemask+0xcac/0xd58
  alloc_pages_current+0x8c/0xf8
  kmalloc_order_trace+0x38/0x108
  qed_iov_alloc+0x40/0x248 [qed]
  qed_resc_alloc+0x224/0x518 [qed]
  qed_slowpath_start+0x254/0x928 [qed]
   __qede_probe+0xf8/0x5e0 [qede]
  qede_probe+0x68/0xd8 [qede]
  local_pci_probe+0x44/0xa8
  work_for_cpu_fn+0x20/0x30
  process_one_work+0x1ac/0x3e8
  worker_thread+0x44/0x448
  kthread+0x130/0x138
  ret_from_fork+0x10/0x18
  Cannot start slowpath
  qede: probe of 0000:05:00.1 failed with error -12

[2]. Memstrack tool: https://github.com/ryncsn/memstrack

Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Ariel Elior <aelior@marvell.com>
Cc: GR-everest-linux-l2@marvell.com
Cc: Manish Chopra <manishc@marvell.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

73e03097