Commits · a6b118febbab3f6454057612b355d0b667c1fafa · Kirill Smelkov / linux

02 Jul, 2020 5 commits

mptcp: add receive buffer auto-tuning · a6b118fe

Florian Westphal authored Jun 30, 2020

When mptcp is used, userspace doesn't read from the tcp (subflow)
socket but from the parent (mptcp) socket receive queue.

skbs are moved from the subflow socket to the mptcp rx queue either from
'data_ready' callback (if mptcp socket can be locked), a work queue, or
the socket receive function.

This means tcp_rcv_space_adjust() is never called and thus no receive
buffer size auto-tuning is done.

An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
function that moves skbs from subflow to mptcp socket.
While this enabled autotuning, it also meant tuning was done even if
userspace was reading the mptcp socket very slowly.

This adds mptcp_rcv_space_adjust() and calls it after userspace has
read data from the mptcp socket rx queue.

Its very similar to tcp_rcv_space_adjust, with two differences:

1. The rtt estimate is the largest one observed on a subflow
2. The rcvbuf size and window clamp of all subflows is adjusted
   to the mptcp-level rcvbuf.

Otherwise, we get spurious drops at tcp (subflow) socket level if
the skbs are not moved to the mptcp socket fast enough.

Before:
time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
[..]
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration 40823ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration 23119ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5421ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration 41446ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration 23427ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5426ms) [ OK ]
Time: 1396 seconds

After:
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration  5417ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration  5427ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5422ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration  5415ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration  5422ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5423ms) [ OK ]
Time: 296 seconds
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6b118fe

selftests: mptcp: add option to specify size of file to transfer · 767659f6

Florian Westphal authored Jun 30, 2020

The script generates two random files that are then sent via tcp and
mptcp connections.

In order to compare throughput over consecutive runs add an option
to provide the file size on the command line: "-f 128000".

Also add an option, -t, to enable tcp tests. This is useful to
compare throughput of mptcp connections and tcp connections.

Example: run tests with a 4mb file size, 300ms delay 0.01% loss,
default gso/tso/gro settings and with large write/blocking io:

mptcp_connect.sh -t -f $((4 * 1024 * 1024)) -d 300 -l 0.01%  -r 0 -e "" -m mmap
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

767659f6

net: sched: Allow changing default qdisc to FQ-PIE · b97e9d9d

Danny Lin authored Jul 01, 2020

Similar to fq_codel and the other qdiscs that can set as default,
fq_pie is also suitable for general use without explicit configuration,
which makes it a valid choice for this.

This is useful in situations where a painless out-of-the-box solution
for reducing bufferbloat is desired but fq_codel is not necessarily the
best choice. For example, fq_pie can be better for DASH streaming, but
there could be more cases where it's the better choice of the two simple
AQMs available in the kernel.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

b97e9d9d

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · d8c8a96c

David S. Miller authored Jul 01, 2020

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2020-07-01

This series contains updates to all Intel drivers, but a majority of the
changes are to the i40e driver.

Jeff converts 'fall through' comments to the 'fallthrough;' keyword for
all Intel drivers. Removed unnecessary delay in the ixgbe ethtool
diagnostics test.

Arkadiusz implements Total Port Shutdown for i40e. This is the revised
patch based on Jakub's feedback from an earlier submission of this
patch, where additional code comments and description was needed to
describe the functionality.

Wei Yongjun fixes return error code for iavf_init_get_resources().

Magnus optimizes XDP code in i40e; starting with AF_XDP zero-copy
transmit completion path. Then by only executing a division when
necessary in the napi_poll data path. Move the check for transmit ring
full outside the send loop to increase performance.

Ciara add XDP ring statistics to i40e and the ability to dump these
statistics and descriptors.

Tony fixes reporting iavf statistics.

Radoslaw adds support for 2.5 and 5 Gbps by implementing the newer ethtool
ksettings API in ixgbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d8c8a96c

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 11a20c71

David S. Miller authored Jul 01, 2020

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2020-07-01

This series contains updates to the ice driver only.

Jacob implements a devlink region for device capabilities.

Bruce removes structs containing only one-element arrays that are either
unused or only used for indexing. Instead, use pointer arithmetic or
other indexing to access the elements. Converts "C struct hack"
variable-length types to the preferred C99 flexible array member.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

11a20c71

01 Jul, 2020 35 commits

ice: replace single-element array used for C struct hack · 66486d89

Bruce Allan authored Jun 29, 2020

Convert the pre-C90-extension "C struct hack" method (using a single-
element array at the end of a structure for implementing variable-length
types) to the preferred use of C99 flexible array member.

Additional code cleanups were done near areas affected by this change.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

66486d89

ice: avoid unnecessary single-member variable-length structs · b3c38904

Bruce Allan authored Jun 29, 2020

There are a number of structures that consist of a one-element array as the
only struct member. Some of those are unused so remove them. Others are
used to index into a buffer/array consisting of a variable number of a
different data or structure type. Those are unnecessary since we can use
simple pointer arithmetic or index directly into the buffer to access
individual elements of the buffer/array.

Additional code cleanups were done near areas affected by this change.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

b3c38904

bonding: allow xfrm offload setup post-module-load · a3b658cf

Jarod Wilson authored Jun 30, 2020

At the moment, bonding xfrm crypto offload can only be set up if the bonding
module is loaded with active-backup mode already set. We need to be able to
make this work with bonds set to AB after the bonding driver has already
been loaded.

So what's done here is:

1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used
by both bond_main.c and bond_options.c
2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than
only when loading in AB mode
3) wire up xfrmdev_ops universally too
4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB
5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a
performance hit from traversing into the underlying drivers
5) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call
netdev_change_features() from bond_option_mode_set()

In my local testing, I can change bonding modes back and forth on the fly,
have hardware offload work when I'm in AB, and see no performance penalty
to non-AB software encryption, despite having xfrm bits all wired up for
all modes now.

Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
Reported-by: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3b658cf

ice: implement snapshot for device capabilities · 8d7aab35

Jacob Keller authored Jun 18, 2020

Add a new devlink region used for capturing a snapshot of the device
capabilities buffer which is reported by the firmware over the AdminQ.
This information can useful in debugging driver and firmware
interactions.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

8d7aab35

Merge branch 'net-ipa-endpoint-configuration-updates' · 651f8bd4

David S. Miller authored Jul 01, 2020

Alex Elder says:

====================
net: ipa: endpoint configuration updates

This series updates code that configures IPA endpoints.  The changes
made mainly affect access to registers that are valid only for RX, or
only for TX endpoints.

The first three patches avoid writing endpoint registers if they are
not defined to be valid.  The fourth patch slightly modifies the
parameters for the offset macros used for these endpoint registers,
to make it explicit when only some endpoints are valid.

The last patch just tweaks one line of code so it uses a convention
used everywhere else in the driver.

Version 2 of this series eliminates some of the "assert()" comments
that Jakub inquired about.  The ones removed will actually go away
in an upcoming (not-yet-posted) patch series anyway.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

651f8bd4

net: ipa: HOL_BLOCK_EN_FMASK is a 1-bit mask · 547c8788

Alex Elder authored Jun 30, 2020

The convention throughout the IPA driver is to directly use
single-bit field mask values, rather than using (for example)
u32_encode_bits() to set or clear them.

Fix the one place that doesn't follow that convention, which sets
HOL_BLOCK_EN_FMASK in ipa_endpoint_init_hol_block_enable().
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

547c8788

net: ipa: clarify endpoint register macro constraints · 8b97bcb7

Alex Elder authored Jun 30, 2020

A handful of registers are valid only for RX endpoints, and some
others are valid only for TX endpoints.  For these endpoints, add
a comment above their defined offset macro that indicates the
endpoints to which they apply.

Extend the endpoint parameter naming convention as well, to make
these constraints more explicit.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b97bcb7

net: ipa: mode register is TX only · 00b9102a

Alex Elder authored Jun 30, 2020

The INIT_MODE endpoint configuration register is only valid for TX
endpoints.  Rather than writing a zero to that register for RX
endpoints, avoid writing the register at all.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

00b9102a

net: ipa: metadata_mask register is RX only · 9b63f093

Alex Elder authored Jun 30, 2020

The INIT_HDR_METADATA_MASK endpoint configuration register is only
valid for RX endpoints.  Rather than writing a zero to that register
for TX endpoints, avoid writing the register at all.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b63f093

net: ipa: head-of-line block registers are RX only · f8d34dfd

Alex Elder authored Jun 30, 2020

The INIT_HOL_BLOCK_EN and INIT_HOL_BLOCK_TIMER endpoint registers
are only valid for RX endpoints.

Have ipa_endpoint_modem_hol_block_clear_all() skip writing these
registers for TX endpoints.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8d34dfd

Merge branch 'net-ipa-small-improvements' · 21ddff5c

David S. Miller authored Jul 01, 2020

Alex Elder says:

====================
net: ipa: small improvements

This series contains two patches that improve the error output
that's reported when an error occurs while changing the state of a
GSI channel or event ring.  The first ensures all such error
conditions report an error, and the second simplifies the messages a
little and ensures they are all consistent.

A third (independent) patch gets rid of an unused symbol in the
microcontroller code.

Version 2 fixes two alignment problems pointed out by checkpatch.pl,
as requested by Jakub Kicinski.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

21ddff5c

net: ipa: kill IPA_MEM_UC_OFFSET · 722208ea

Alex Elder authored Jun 30, 2020

The microcontroller shared memory area is at the beginning of the
IPA resident memory.  IPA_MEM_UC_OFFSET was defined as the offset
within that region where it's found, but it's 0, and it's never
actually used.  Just get rid of the definition, and move some of the
description it had to be above the definition of the ipa_uc_mem_area
structure.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

722208ea

net: ipa: standarize more GSI error messages · 8463488a

Alex Elder authored Jun 30, 2020

Make minor updates to error messages reported in "gsi.c":
  - Use local variables to reduce multi-line function calls
  - Don't use parentheses in messages
  - Do some slight rewording in a few cases
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8463488a

net: ipa: always report GSI state errors · a442b3c7

Alex Elder authored Jun 30, 2020

We check the state of an event ring or channel both before and after
any GSI command issued that will change that state.  In most--but
not all--cases, if the state is something different than expected we
report an error message.

Add error messages where missing, so that all unexpected states
provide information about what went wrong.  Drop the parentheses
around the state value shown in all cases.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a442b3c7

Merge branch 'net-ipa-simple-refactorizations' · 6f6746d7

David S. Miller authored Jul 01, 2020

Alex Elder says:

====================
net: ipa: simple refactorizations

This series makes three small changes to some endpoint configuration
code.  The first uses a constant to represent the frequency of an
internal clock used for timers in the IPA.  The second modifies a
limit used so it matches Qualcomm's internal code.  And the third
reworks a few lines of code, eliminating a multi-line function call.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6f6746d7

net: ipa: reuse a local variable in ipa_endpoint_init_aggr() · 9e88cb5f

Alex Elder authored Jun 29, 2020

Reuse the "limit" local variable in ipa_endpoint_init_aggr() when
setting the aggregation size limit.  Simple cleanup.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e88cb5f

net: ipa: reduce aggregation time limit · 1d86652b

Alex Elder authored Jun 29, 2020

Halve the time limit used when aggregation is enabled on an RX
endpoint, to half a millisecond.

Use DIV_ROUND_CLOSEST() to compute the value that represents the
time period, to get better accuracy in the event the time limit is
not an even multiple of the granularity.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d86652b

net: ipa: rework ipa_aggr_granularity_val() · 317a5740

Alex Elder authored Jun 29, 2020

The timer used for aggregation makes use of an internal 32 KHz clock.
The granularity of the timer is programmed by a field whose value is
computed by ipa_aggr_granularity_val().  Redefine the way that value
is computed by using a new TIMER_FREQUENCY constant representing the
underlying clock frequency.

Add two BUILD_BUG_ON() calls to ensure the value used is valid.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

317a5740

Merge branch 'add-XDP-support-to-xen-netfront' · 8c964397

David S. Miller authored Jul 01, 2020

Denis Kirjanov says:

====================
xen networking: add XDP support to xen-netfront

The first patch adds a new extra type to enable proper synchronization
between an RX request/response pair.
The second patch implements BFP interface for xen-netfront.
The third patch enables extra space for XDP processing.

v14:
- fixed compilation warnings

v13:
- fixed compilation due to previous rename

v12:
- xen-netback: rename netfront_xdp_headroom to xdp_headroom

v11:
- add the new headroom constant to netif.h
- xenbus_scanf check
- lock a bulk of puckets in xennet_xdp_xmit()

v10:
- add a new xen_netif_extra_info type to enable proper synchronization
 between an RX request/response pair.
- order local variable declarations

v9:
- assign an xdp program before switching to Reconfiguring
- minor cleanups
- address checkpatch issues

v8:
- add PAGE_POOL config dependency
- keep the state of XDP processing in netfront_xdp_enabled
- fixed allocator type in xdp_rxq_info_reg_mem_model()
- minor cleanups in xen-netback

v7:
- use page_pool_dev_alloc_pages() on page allocation
- remove the leftover break statement from netback_changed

v6:
- added the missing SOB line
- fixed subject

v5:
- split netfront/netback changes
- added a sync point between backend/frontend on switching to XDP
- added pagepool API

v4:
- added verbose patch descriprion
- don't expose the XDP headroom offset to the domU guest
- add a modparam to netback to toggle XDP offset
- don't process jumbo frames for now

v3:
- added XDP_TX support (tested with xdping echoserver)
- added XDP_REDIRECT support (tested with modified xdp_redirect_kern)
- moved xdp negotiation to xen-netback

v2:
- avoid data copying while passing to XDP
- tell xen-netback that we need the headroom space
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8c964397

xen networking: add XDP offset adjustment to xen-netback · 1c9535c7

Denis Kirjanov authored Jun 29, 2020

the patch basically adds the offset adjustment and netfront
state reading to make XDP work on netfront side.
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c9535c7

xen networking: add basic XDP support for xen-netfront · 6c5aa6fc

Denis Kirjanov authored Jun 29, 2020

The patch adds a basic XDP processing to xen-netfront driver.

We ran an XDP program for an RX response received from netback
driver. Also we request xen-netback to adjust data offset for
bpf_xdp_adjust_head() header space for custom headers.

synchronization between frontend and backend parts is done
by using xenbus state switching:
Reconfiguring -> Reconfigured- > Connected

UDP packets drop rate using xdp program is around 310 kpps
using ./pktgen_sample04_many_flows.sh and 160 kpps without the patch.
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c5aa6fc

xen: netif.h: add a new extra type for XDP · 2cef30d7

Denis Kirjanov authored Jun 29, 2020

The patch adds a new extra type to be able to diffirentiate
between RX responses on xen-netfront side with the adjusted offset
required for XDP processing.

The offset value from a guest is passed via xenstore.
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2cef30d7

ixgbe: Add ethtool support to enable 2.5 and 5.0 Gbps support · a296d665

Radoslaw Tyl authored Jun 26, 2020

Added full support for new version Ethtool API. New API allow use
2500Gbase-T and 5000base-T supported and advertised link speed modes.
Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

a296d665

ixgbe: Cleanup unneeded delay in ethtool test · bb0967c0

Jeff Kirsher authored Jun 25, 2020

There is a 4 seconds delay in ixgbe_diag_test() that is holding up other
ioctls such as SIOCGIFCONF that Oracle database applications use.
One of Oracle's product runs "ethtool -t ethX online" periodically for
system monitoring and that is impacting database applications that use
SIOCGIFCONF at that same time.

This 4 second delay was needed in out early 1GbE parts to give the PHY
time to recover from a reset.  This code was carried forward to the 10 GbE
driver even it was not needed for the supported PHYs in the ixgbe driver.

CC: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
CC: Jack Vogel <jack.vogel@oracle.com>
Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

bb0967c0

iavf: Fix updating statistics · 93580766

Tony Nguyen authored Jun 24, 2020

Commit bac84861 ("iavf: Refactor the watchdog state machine") inverted
the logic for when to update statistics. Statistics should be updated when
no other commands are pending, instead they were only requested when a
command was processed. iavf_request_stats() would see a pending request
and not request statistics to be updated. This caused statistics to never
be updated; fix the logic.

Fixes: bac84861 ("iavf: Refactor the watchdog state machine")
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

93580766

i40e: introduce new dump desc XDP command · 44ea803e

Ciara Loftus authored Jun 23, 2020

Interfaces already exist for dumping Rx and Tx descriptor information.
Introduce another for doing the same for XDP descriptors.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

44ea803e

i40e: add XDP ring statistics to dump VSI debug output · 890c402c

Ciara Loftus authored Jun 23, 2020

Prior to this, only the Rx and Tx ring statistics were dumped. The XDP
ring statistics are now dumped as well.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

890c402c

i40e: add XDP ring statistics to VSI stats · e2968260

Ciara Loftus authored Jun 23, 2020

Prior to this, only Rx and Tx ring statistics were accounted for.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

e2968260

i40e: move check of full Tx ring to outside of send loop · 1fd972eb

Magnus Karlsson authored Jun 23, 2020

Move the check if the HW Tx ring is full to outside the send
loop. Currently it is checked for every single descriptor that we
send. Instead, tell the send loop to only process a maximum number of
packets equal to the number of available slots in the Tx ring. This
way, we can remove the check inside the send loop to and gain some
performance.
Suggested-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

1fd972eb

i40e: eliminate division in napi_poll data path · 4b5539c0

Magnus Karlsson authored Jun 23, 2020

Eliminate a division in the napi_poll data path. This division is
executed even though it is only needed in the rare case when there are
not enough interrupt lines so they have to be shared between queue
pairs. Instead, just test for this case and only execute the division
if needed. The code has been lifted from the ice driver.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

4b5539c0

i40e: optimize AF_XDP Tx completion path · 5574ff7b

Magnus Karlsson authored Jun 23, 2020

Improve the performance of the AF_XDP zero-copy Tx completion
path. When there are no XDP buffers being sent using XDP_TX or
XDP_REDIRECT, we do not have go through the SW ring to clean up any
entries since the AF_XDP path does not use these. In these cases, just
fast forward the next-to-use counter and skip going through the SW
ring. The limit on the maximum number of entries to complete is also
removed since the algorithm is now O(1). To simplify the code path, the
maximum number of entries to complete for the XDP path is therefore
also increased from 256 to 512 (the default number of Tx HW
descriptors). This should be fine since the completion in the XDP path
is faster than in the SKB path that has 256 as the maximum number.

This patch provides around 4% throughput improvement for the l2fwd
application in xdpsock on my machine.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

5574ff7b

iavf: fix error return code in iavf_init_get_resources() · 753f3884

Wei Yongjun authored Jun 18, 2020

Fix to return negative error code -ENOMEM from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: b66c7bc1 ("iavf: Refactor init state machine")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

753f3884

i40e: Add support for a new feature Total Port Shutdown · d5ec9e2c

Arkadiusz Kubalewski authored Jun 17, 2020

After OS requests to down a link on a physical network port, the
traffic is no longer being processed but the physical link with
a link partner is still established.

Currently there is a feature (Link down on close) which allows
to physically bring the link down (after OS request).

With this patch new feature with similar capability is introduced:
TOTAL_PORT_SHUTDOWN
Allows to physically disable the link on the NIC's port.
If enabled, (after link down request from the OS)
no link, traffic or led activity is possible on that port.

If I40E_FLAG_TOTAL_PORT_SHUTDOWN is enabled, the
I40E_FLAG_LINK_DOWN_ON_CLOSE_ENABLED must be explicitly forced to
true and cannot be disabled at that time.
The functionalities are exclusive in terms of configuration, but
they also have similar behavior (allowing to disable physical link
of the port), with following differences:
- LINK_DOWN_ON_CLOSE_ENABLED is configurable at host OS run-time
  and is supported by whole family of 7xx Intel Ethernet Controllers
- TOTAL_PORT_SHUTDOWN may be enabled only before OS loads (in BIOS)
  only if motherboard's BIOS and NIC's FW has support of it
- when LINK_DOWN_ON_CLOSE_ENABLED is used, the link is being brought
  down by sending phy_type=0 to NIC's FW
- when TOTAL_PORT_SHUTDOWN is used, phy_type is not altered, instead
  the link is being brought down by clearing bit
  (I40E_AQ_PHY_ENABLE_LINK) in abilities field of
  i40e_aq_set_phy_config structure

Introduced changes:
- new private flag I40E_FLAG_TOTAL_PORT_SHUTDOWN for handling the
  feature
- probe of NVM if the feature was enabled at driver's port
  initialization
- special handling on link-down procedure to let FW physically
  shutdown the port if the feature was enabled
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

d5ec9e2c

ethernet/intel: Convert fallthrough code comments · 5463fce6

Jeff Kirsher authored Jun 03, 2020

Convert all the remaining 'fall through" code comments to the newer
'fallthrough;' keyword.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

5463fce6

Merge branch 'net-ethernet-use-generic-power-management' · 6d79dc67

David S. Miller authored Jul 01, 2020

Vaibhav Gupta says:

====================
net: ethernet: use generic power management

Linux Kernel Mentee: Remove Legacy Power Management.

The purpose of this patch series is to remove legacy power management callbacks
from net ethernet drivers.

The callbacks performing suspend() and resume() operations are still calling
pci_save_state(), pci_set_power_state(), etc. and handling the power management
themselves, which is not recommended.

The conversion requires the removal of the those function calls and change the
callback definition accordingly and make use of dev_pm_ops structure.

All patches are compile-tested only.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6d79dc67